Issue
I am practicing with the Scrapy web-crawling package and have a two-part question, because I am struggling a bit with what to do next:
I have a script called spider4Techcrunch.py which contains the following code:

import scrapy
from scrapy import cmdline

class TCSpider(scrapy.Spider):
    name = "techcrunch"

    def start_requests(self):
        urls = [
            "https://techcrunch.com/"
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        SET_SELECTOR = ".post-block__title"
        output = "--BEGIN OUTPUT--"
        print(output)
        for data in response.css(SET_SELECTOR):
            print('--BEGIN DATA--')
            print(data)
            TITLE_SELECTOR = "a ::text"
            URL_SELECTOR = "a ::attr(href)"
            yield {
                'title': data.css(TITLE_SELECTOR).extract_first(),
                'url': data.css(URL_SELECTOR).extract_first(),
            }

scrapy.cmdline.execute("scrapy runspider spider4Techcrunch.py".split())
When I execute the code, everything works and returns results. Where I am struggling is with the yield statement:

1. How do I extract the text results of the key: value pairs for 'title' and 'url' from the yield statement, so that I can save the results into a text file line by line?

2. Is the last line in my code a proper way to execute it? It just looks weird to me that I am executing the Scrapy code by calling the file from within itself. Is there a best practice/better way? (Keeping in mind, I would like this class to be reusable for multiple URLs.)

scrapy.cmdline.execute("scrapy runspider spider4Techcrunch.py".split())
Solution
The preferred way to run a Scrapy application as a script is described in the docs. You can use one of the built-in feed exporters to write the yielded items to a file. In your case, the solution will look like this (for Scrapy version 2.1):
import scrapy
from scrapy.crawler import CrawlerProcess

class TCSpider(scrapy.Spider):
    name = "techcrunch"

    def start_requests(self):
        urls = [
            "https://techcrunch.com/"
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        SET_SELECTOR = ".post-block__title"
        output = "--BEGIN OUTPUT--"
        print(output)
        for data in response.css(SET_SELECTOR):
            print('--BEGIN DATA--')
            print(data)
            TITLE_SELECTOR = "a ::text"
            URL_SELECTOR = "a ::attr(href)"
            yield {
                'title': data.css(TITLE_SELECTOR).extract_first(),
                'url': data.css(URL_SELECTOR).extract_first(),
            }

process = CrawlerProcess(settings={
    "FEEDS": {
        # the feed exporter writes every yielded item to this file
        "items.json": {"format": "json"},
        # "items.jl": {"format": "jsonlines"},  # alternative: one JSON object per line
    },
})
process.crawl(TCSpider)
process.start()
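If the goal is a plain text file with one record per line, the commented-out jsonlines feed is the closer fit: each yielded item is written as a single JSON object on its own line. A minimal sketch of reading such a feed back, assuming it was written to items.jl:

import json

# read the jsonlines feed produced by the spider, one item per line
with open("items.jl", encoding="utf-8") as feed:
    for line in feed:
        item = json.loads(line)
        print(item["title"], item["url"])

Alternatively, you can keep the run logic out of the file entirely and export the feed from the shell with scrapy runspider spider4Techcrunch.py -o items.jl.

As for reusing the class for multiple URLs: process.crawl() forwards extra keyword arguments to the spider's __init__, so the start URLs can be passed in from the outside instead of being hard-coded. A sketch of that approach; the urls argument name is illustrative, not a Scrapy built-in:

import scrapy
from scrapy.crawler import CrawlerProcess

class TCSpider(scrapy.Spider):
    name = "techcrunch"

    def __init__(self, urls=None, **kwargs):
        super().__init__(**kwargs)
        # "urls" is a hypothetical spider argument, not part of Scrapy itself
        self.urls = urls or []

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for data in response.css(".post-block__title"):
            yield {
                'title': data.css("a ::text").extract_first(),
                'url': data.css("a ::attr(href)").extract_first(),
            }

process = CrawlerProcess(settings={
    "FEEDS": {"items.jl": {"format": "jsonlines"}},
})
process.crawl(TCSpider, urls=["https://techcrunch.com/"])
process.start()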
Answered By - Georgiy