Issue
I am collecting a list of webpage links starting from a start URL and then finding all links by following those pages. I am currently saving them as a list in self.links, and I want to know how to save them to a CSV or JSON file after scraping is done. My goal is to call a new function to process the data on each followed page.
import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "myspider"
    links = []
    start_urls = ["https://books.toscrape.com/"]

    # Define the `parse` method. This method will be called for each page that the spider crawls.
    def parse(self, response):
        to_avoid = ['tel', 'facebook', 'twitter', 'instagram', 'privacy', 'terms',
                    'contact', 'java', 'cookies', 'policies', 'google', 'mail']
        # to_allow = self.current
        le = LinkExtractor(deny=to_avoid)
        ex_links = le.extract_links(response)
        for href in ex_links:
            # print(href.url)
            url = response.urljoin(href.url)
            if self.current in url:  # self.current is assumed to be set elsewhere on the spider
                self.links.append(url)
                # print(url)
                yield response.follow(url, callback=self.parse)
I tried using another parse_landing_page(self, response) function and yielding it from the parse() function, but it didn't work.
Solution
Scrapy has this functionality built in as Feed Exports. In order to use the feature all you have to do is yield a dictionary from your parse method and then specify where to save the contents on the command line or in the settings for your spider.
For example:
import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "myspider"
    links = []
    start_urls = ["https://books.toscrape.com/"]
    custom_settings = {
        "FEEDS": {
            "items.csv": {
                "format": "csv",
                "fields": ["link"],
            }
        }
    }

    # Define the `parse` method. This method will be called for each page that the spider crawls.
    def parse(self, response):
        to_avoid = ['tel', 'facebook', 'twitter', 'instagram', 'privacy', 'terms',
                    'contact', 'java', 'cookies', 'policies', 'google', 'mail']
        # to_allow = self.current
        le = LinkExtractor(deny=to_avoid)
        ex_links = le.extract_links(response)
        for href in ex_links:
            # print(href.url)
            url = response.urljoin(href.url)
            if self.current in url:
                yield {'link': url}
                yield response.follow(url, callback=self.parse)
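Since the question also mentions JSON: the same feed configuration can write JSON instead of CSV. This is just the setting from the example above with the file name and format swapped (assuming, as before, that only the link field is wanted):

    custom_settings = {
        "FEEDS": {
            "items.json": {
                "format": "json",
                "fields": ["link"],
            }
        }
    }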
Or instead of using the custom settings you could just use the -o
option on the command line:
scrapy crawl myspider -o items.csv
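As for calling a separate function to process the data on each followed page: you can pass a different callback to response.follow instead of self.parse. The sketch below is one way to combine that with a feed export; parse_landing_page, the shortened deny list, and the title field are hypothetical choices for illustration, not part of the original code.

import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://books.toscrape.com/"]
    custom_settings = {
        "FEEDS": {
            "items.json": {"format": "json"},
        }
    }

    def parse(self, response):
        le = LinkExtractor(deny=['tel', 'facebook', 'twitter', 'mail'])
        for link in le.extract_links(response):
            url = response.urljoin(link.url)
            # Hand each followed page to a separate callback instead of self.parse.
            yield response.follow(url, callback=self.parse_landing_page)

    def parse_landing_page(self, response):
        # Hypothetical per-page processing: record the page URL and its <title>.
        # Anything yielded here is picked up by the feed export as well.
        yield {
            "link": response.url,
            "title": response.css("title::text").get(),
        }

With that in place, scrapy crawl myspider writes one JSON object per followed page to items.json; if you also need to keep crawling deeper, you can yield further response.follow calls from parse_landing_page.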
Answered By - Alexander