Issue
I'm using a generic spider with a list of multiple URLs in the start_urls field.
Is it possible to export one JSON file for each URL?
As far as I know, it is only possible to set one path to one specific output file.
Any ideas how to solve this are rewarded!
EDIT: This is my spider class:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class MySpider(CrawlSpider):
    name = 'my_spider'
    # Scrapy requires a scheme on each start URL
    start_urls = ['https://www.domain1.com', 'https://www.domain2.com',
                  'https://www.domain3.com']

    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'DEPTH_LIMIT': '1',
        'FEED_URI': 'file:///C:/path/to/result.json',
    }

    rules = (
        Rule(LinkExtractor(allow=r"abc"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        all_text = response.xpath("//p/text()").getall()
        yield {
            "text": " ".join(all_text),
            "url": response.url,
        }
Solution
First option
You can save the items in the spider itself, as in the Scrapy tutorial. For example:
import scrapy
import json

DICT = {
    'https://quotes.toscrape.com/page/1/': 'domain1.json',
    'https://quotes.toscrape.com/page/2/': 'domain2.json',
}

class MydomainSpider(scrapy.Spider):
    name = "mydomain"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        filename = DICT[response.url]
        # write explicitly as UTF-8 so the platform default encoding doesn't matter
        with open(filename, 'w', encoding='utf-8') as fp:
            json.dump({"content": response.body.decode("utf-8")}, fp)
The DICT variable is just for specifying the JSON filename, but you could use the domain as the filename instead.
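If you would rather not maintain a DICT, a small helper can derive the filename from the URL's host instead. This is a minimal sketch; the helper name filename_for is my own, not part of the answer above:

```python
from urllib.parse import urlparse

def filename_for(url):
    """Derive a JSON filename from the URL's host,
    e.g. 'https://quotes.toscrape.com/page/1/' -> 'quotes.toscrape.com.json'."""
    return urlparse(url).netloc + ".json"

print(filename_for("https://quotes.toscrape.com/page/1/"))
# -> quotes.toscrape.com.json
```

Note that if several start URLs share a host (as the two quotes.toscrape.com pages do), you would need to include the path in the name to avoid the files overwriting each other.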
Second option
You can try using process_item in pipelines.py as follows:
from scrapy.exporters import JsonItemExporter

class SaveJsonPipeline:
    def process_item(self, item, spider):
        filename = item['filename']
        del item['filename']
        # open/close the file explicitly so the handle is not leaked,
        # and call start/finish_exporting so the output is valid JSON
        with open(filename, 'wb') as fp:
            exporter = JsonItemExporter(fp)
            exporter.start_exporting()
            exporter.export_item(item)
            exporter.finish_exporting()
        return item
item['filename'] holds the filename for each start_url. You also need to define the item in items.py, for example:
import scrapy

class MydomainItem(scrapy.Item):
    filename = scrapy.Field()
    content = scrapy.Field()
And the spider:
import scrapy
from ..items import MydomainItem

DICT = {
    'https://quotes.toscrape.com/page/1/': 'domain1.json',
    'https://quotes.toscrape.com/page/2/': 'domain2.json',
}

class MydomainSpider(scrapy.Spider):
    name = 'mydomain'
    # must match the domain of the start URLs
    allowed_domains = ['quotes.toscrape.com']
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        item = MydomainItem()
        item["filename"] = DICT[response.url]
        item["content"] = response.body.decode("utf-8")
        yield item
Before running, you need to enable the pipeline in your settings:
ITEM_PIPELINES = {
    'myproject.pipelines.SaveJsonPipeline': 300,
}
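To see the per-item flow without running Scrapy at all, here is a stand-alone sketch of what the pipeline does, using the stdlib json module in place of JsonItemExporter (the function save_item and the temp-file path are assumptions for demonstration, not part of the answer):

```python
import json
import os
import tempfile

def save_item(item):
    """Mimics SaveJsonPipeline.process_item: pop the filename field
    and dump the remaining fields to that JSON file."""
    filename = item.pop("filename")
    with open(filename, "w", encoding="utf-8") as fp:
        json.dump(item, fp)
    return item

path = os.path.join(tempfile.gettempdir(), "domain1.json")
save_item({"filename": path, "content": "<html>...</html>"})

with open(path, encoding="utf-8") as fp:
    print(json.load(fp))  # -> {'content': '<html>...</html>'}
```

Each item carries its own target filename, so each start URL ends up in its own JSON file, which is exactly what the question asks for.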
Answered By - Brenda S.