Issue
I created a simple Scrapy project that scrapes a web page and saves the data to PostgreSQL. I can get all the scraped data in my parse method, but the pipeline does not get called to save the data to the database. Here is my spider's parse method:
def parse(self, response):
    links = response.css('a::attr(href)').getall()
    if links is not None:
        for link in links:
            yield response.follow(link, callback=self.parse)
    else:
        loader = ItemLoader(item=TestItem(), selector=response)
        quote = response.css('div.quote p::text').get()
        loader.add_value('quote', title)
        yield loader.load_item()
Here is the TestItem:
class TestItem(scrapy.Item):
    quote = scrapy.Field()
Here is the pipeline:
class TestPipeline:
    def process_item(self, item, spider):
        logging.log(logging.INFO, item)
        print(item)
        quote = Quote(text=item.quote)
        db.session.add(quote)
        db.session.commit()
        return item
Finally, the pipeline is registered in settings:
ITEM_PIPELINES = {
    'Test.pipelines.TestPipeline': 300,
}
Any help is welcome.
I printed the item in the pipeline. Here is the output:
2021-07-16 09:54:33 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: Quote)
2021-07-16 09:54:33 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-16 09:54:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-07-16 09:54:33 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'Quote',
'NEWSPIDER_MODULE': 'Quote.spiders',
'SPIDER_MODULES': ['Quote.spiders']}
2021-07-16 09:54:33 [scrapy.extensions.telnet] INFO: Telnet Password: 88264e27a8108b1f
2021-07-16 09:54:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2021-07-16 09:54:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-07-16 09:54:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-07-16 09:54:34 [scrapy.middleware] INFO: Enabled item pipelines:
['Quote.pipelines.QuotePipeline']
2021-07-16 09:54:34 [scrapy.core.engine] INFO: Spider opened
2021-07-16 09:54:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-07-16 09:54:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2021-07-16 09:54:34 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.azquotes.com/quotes/topics/inspirational.html/> from <GET http://azquotes.com/quotes/topics/inspirational.html/>
2021-07-16 09:54:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.azquotes.com/quotes/topics/inspirational.html/> (referer: None)
2021-07-16 09:54:36 [scrapy.core.engine] INFO: Closing spider (finished)
2021-07-16 09:54:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 494,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 23417,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 2.771838,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 7, 16, 0, 54, 36, 828890),
'httpcompression/response_bytes': 107443,
'httpcompression/response_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2021, 7, 16, 0, 54, 34, 57052)}
2021-07-16 09:54:36 [scrapy.core.engine] INFO: Spider closed (finished)
Solution
It is being called.
import scrapy
from scrapy.loader import ItemLoader

from ..items import TestItem


class Something(scrapy.Spider):
    name = "something"
    start_urls = ['http://azquotes.com/quotes/topics/inspirational.html']

    def parse(self, response):
        links = response.css('a::attr(href)').getall()
        # if links is not None:
        #     for link in links:
        #         yield response.follow(link, callback=self.parse)
        # else:
        loader = ItemLoader(item=TestItem(), selector=response)
        quote = response.css('div.quote p::text').get()
        title = "some_title"
        loader.add_value('quote', title)
        yield loader.load_item()
This will print the following (I added print("called") to the process_item function):
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.azquotes.com/quotes/topics/inspirational.html> (referer: None)
[root] INFO: {'quote': ['some_title']}
{'quote': ['some_title']}
called
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.azquotes.com/quotes/topics/inspirational.html>
{'quote': ['some_title']}
When you yield an item, the process_item function is called, but you're getting some errors because you try to follow some non-existing pages (like <a href="#">).
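If you do want to keep following links from parse yourself, one way to avoid those errors is to skip hrefs that don't point at real pages before calling response.follow. This is only a sketch, not part of the original answer, and the filter conditions are assumptions you may need to adjust for the site:
import scrapy


class Something(scrapy.Spider):
    name = "something"
    start_urls = ['https://www.azquotes.com/quotes/topics/inspirational.html']

    def parse(self, response):
        for link in response.css('a::attr(href)').getall():
            # Skip "#" anchors and javascript:/mailto: hrefs that are not
            # real pages (assumed conditions; adjust them for your site).
            if not link or link.startswith(('#', 'javascript:', 'mailto:')):
                continue
            yield response.follow(link, callback=self.parse)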
From what I can see, you're trying to scrape all the quotes from the site. If that's correct, then this is what you need to do:
- Make a function that will scrape all the quotes from just one of the pages.
- Create crawling rules for the links you want to follow (see the Scrapy documentation on crawling rules).
- Change to a crawl spider:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import TestItem


class Something(CrawlSpider):
    name = "something"
    start_urls = ['http://azquotes.com/quotes/topics/inspirational.html']

    rules = (Rule(LinkExtractor(........), callback='parse_item'),)

    def parse_item(self, response):
        ........
        ........
        ........
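For example, here is a minimal sketch of how that skeleton could be filled in. It is only an illustration: the allowed_domains value, the LinkExtractor allow pattern and the div.quote selector are assumptions based on the selectors in the question, so check them against the actual page markup:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader

from ..items import TestItem


class Something(CrawlSpider):
    name = "something"
    allowed_domains = ['azquotes.com']
    start_urls = ['https://www.azquotes.com/quotes/topics/inspirational.html']

    # Assumption: the topic and pagination pages live under /quotes/topics/;
    # adjust the allow pattern to the links you actually want to follow.
    rules = (
        Rule(LinkExtractor(allow=r'/quotes/topics/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Assumption: each quote sits in a div.quote element, as in the
        # selector from the question.
        for quote in response.css('div.quote'):
            loader = ItemLoader(item=TestItem(), selector=quote)
            loader.add_css('quote', 'p::text')
            yield loader.load_item()
With follow=True the CrawlSpider keeps following the extracted links for you, so parse_item only has to extract the quotes on each page and yield the items, and every yielded item goes through process_item in the pipeline.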
Answered By - SuperUser