Issue
I am collecting a list of webpage links starting from a start URL and then finding all links by following those pages. I am currently saving them as a list in self.links, and I want to know how to save them to a CSV or JSON file after scraping is done. My goal is to call a new function to process the data on each followed page.
import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "myspider"
    links = []
    start_urls = ["https://books.toscrape.com/"]

    # Define the `parse` method. This method will be called for each page that the spider crawls.
    def parse(self, response):
        to_avoid = ['tel', 'facebook', 'twitter', 'instagram', 'privacy', 'terms',
                    'contact', 'java', 'cookies', 'policies', 'google', 'mail']
        # to_allow = self.current
        le = LinkExtractor(deny=to_avoid)
        ex_links = le.extract_links(response)
        for href in ex_links:
            # print(href.url)
            url = response.urljoin(href.url)
            if self.current in url:  # self.current is assumed to be set elsewhere on the spider
                self.links.append(url)
                # print(url)
                yield response.follow(url, callback=self.parse)
I tried using another parse_landing_page(self, response) function and yielding it from the parse() function, but it didn't work.
Solution
Scrapy has this functionality built in as Feed Exports. In order to use the feature all you have to do is yield a dictionary from your parse method and then specify where to save the contents on the command line or in the settings for your spider.
For example:
import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "myspider"
    links = []
    start_urls = ["https://books.toscrape.com/"]
    custom_settings = {
        "FEEDS": {
            "items.csv": {
                "format": "csv",
                "fields": ["link"],
            }
        }
    }

    # Define the `parse` method. This method will be called for each page that the spider crawls.
    def parse(self, response):
        to_avoid = ['tel', 'facebook', 'twitter', 'instagram', 'privacy', 'terms',
                    'contact', 'java', 'cookies', 'policies', 'google', 'mail']
        # to_allow = self.current
        le = LinkExtractor(deny=to_avoid)
        ex_links = le.extract_links(response)
        for href in ex_links:
            # print(href.url)
            url = response.urljoin(href.url)
            if self.current in url:
                yield {'link': url}
                yield response.follow(url, callback=self.parse)
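Since the question also mentions JSON: the same feed configuration can write JSON instead of CSV. This is just the setting from the example above with the file name and format swapped (assuming, as before, that only the link field is wanted):

    custom_settings = {
        "FEEDS": {
            "items.json": {
                "format": "json",
                "fields": ["link"],
            }
        }
    }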
Or instead of using the custom settings you could just use the -o
option on the command line:
scrapy crawl myspider -o items.csv
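As for calling a separate function to process the data on each followed page: you can pass a different callback to response.follow instead of self.parse. The sketch below is one way to combine that with a feed export; parse_landing_page, the shortened deny list, and the title field are hypothetical choices for illustration, not part of the original code.

import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://books.toscrape.com/"]
    custom_settings = {
        "FEEDS": {
            "items.json": {"format": "json"},
        }
    }

    def parse(self, response):
        le = LinkExtractor(deny=['tel', 'facebook', 'twitter', 'mail'])
        for link in le.extract_links(response):
            url = response.urljoin(link.url)
            # Hand each followed page to a separate callback instead of self.parse.
            yield response.follow(url, callback=self.parse_landing_page)

    def parse_landing_page(self, response):
        # Hypothetical per-page processing: record the page URL and its <title>.
        # Anything yielded here is picked up by the feed export as well.
        yield {
            "link": response.url,
            "title": response.css("title::text").get(),
        }

With that in place, scrapy crawl myspider writes one JSON object per followed page to items.json; if you also need to keep crawling deeper, you can yield further response.follow calls from parse_landing_page.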
Answered By - Alexander