Issue
I'm using a generic spider with a list of multiple URLs in the start_urls field.
Is it possible to export one JSON file for each URL?
As far as I know, it is only possible to set one path to one specific output file.
Any ideas how to solve this are rewarded!
EDIT: This is my spider class:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class MySpider(CrawlSpider):
    name = 'my_spider'
    # Scrapy requires a scheme on each start URL
    start_urls = ['https://www.domain1.com', 'https://www.domain2.com',
                  'https://www.domain3.com']

    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'DEPTH_LIMIT': '1',
        'FEED_URI': 'file:///C:/path/to/result.json',
    }

    rules = (
        Rule(LinkExtractor(allow=r"abc"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        all_text = response.xpath("//p/text()").getall()
        yield {
            "text": " ".join(all_text),
            "url": response.url,
        }
Solution
First option
You can save the items in the spider itself, as in the Scrapy tutorial. For example:
import scrapy
import json

DICT = {
    'https://quotes.toscrape.com/page/1/': 'domain1.json',
    'https://quotes.toscrape.com/page/2/': 'domain2.json',
}

class MydomainSpider(scrapy.Spider):
    name = "mydomain"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        filename = DICT[response.url]
        # write explicitly as UTF-8 so the platform default encoding doesn't matter
        with open(filename, 'w', encoding='utf-8') as fp:
            json.dump({"content": response.body.decode("utf-8")}, fp)
The DICT variable is just for specifying the JSON filename, but you could use the domain as the filename instead.
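If you would rather not maintain a DICT, a small helper can derive the filename from the URL's host instead. This is a minimal sketch; the helper name filename_for is my own, not part of the answer above:

```python
from urllib.parse import urlparse

def filename_for(url):
    """Derive a JSON filename from the URL's host,
    e.g. 'https://quotes.toscrape.com/page/1/' -> 'quotes.toscrape.com.json'."""
    return urlparse(url).netloc + ".json"

print(filename_for("https://quotes.toscrape.com/page/1/"))
# -> quotes.toscrape.com.json
```

Note that if several start URLs share a host (as the two quotes.toscrape.com pages do), you would need to include the path in the name to avoid the files overwriting each other.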
Second option
You can try using process_item in pipelines.py as follows:
from scrapy.exporters import JsonItemExporter

class SaveJsonPipeline:
    def process_item(self, item, spider):
        filename = item['filename']
        del item['filename']
        # open/close the file explicitly so the handle is not leaked,
        # and call start/finish_exporting so the output is valid JSON
        with open(filename, 'wb') as fp:
            exporter = JsonItemExporter(fp)
            exporter.start_exporting()
            exporter.export_item(item)
            exporter.finish_exporting()
        return item
item['filename'] holds the filename for each start_url. You also need to define the item in items.py, for example:
import scrapy

class MydomainItem(scrapy.Item):
    filename = scrapy.Field()
    content = scrapy.Field()
And the spider:
import scrapy
from ..items import MydomainItem

DICT = {
    'https://quotes.toscrape.com/page/1/': 'domain1.json',
    'https://quotes.toscrape.com/page/2/': 'domain2.json',
}

class MydomainSpider(scrapy.Spider):
    name = 'mydomain'
    # must match the domain of the start URLs
    allowed_domains = ['quotes.toscrape.com']
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        item = MydomainItem()
        item["filename"] = DICT[response.url]
        item["content"] = response.body.decode("utf-8")
        yield item
Before running, you need to enable the pipeline in your settings:
ITEM_PIPELINES = {
    'myproject.pipelines.SaveJsonPipeline': 300,
}
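To see the per-item flow without running Scrapy at all, here is a stand-alone sketch of what the pipeline does, using the stdlib json module in place of JsonItemExporter (the function save_item and the temp-file path are assumptions for demonstration, not part of the answer):

```python
import json
import os
import tempfile

def save_item(item):
    """Mimics SaveJsonPipeline.process_item: pop the filename field
    and dump the remaining fields to that JSON file."""
    filename = item.pop("filename")
    with open(filename, "w", encoding="utf-8") as fp:
        json.dump(item, fp)
    return item

path = os.path.join(tempfile.gettempdir(), "domain1.json")
save_item({"filename": path, "content": "<html>...</html>"})

with open(path, encoding="utf-8") as fp:
    print(json.load(fp))  # -> {'content': '<html>...</html>'}
```

Each item carries its own target filename, so each start URL ends up in its own JSON file, which is exactly what the question asks for.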
Answered By - Brenda S.