Issue
I can't see any errors in the log, but only 108 items were scraped even though there are far more items to be scraped. My guess is that it's a pagination problem, but I have no idea how to solve it.
Here is the shortened log I got:
2020-10-22 11:59:17 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: digi_allbooks)
2020-10-22 11:59:17 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.5 (default, Aug 5 2020, 09:44:06) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.19041-SP0
2020-10-22 11:59:17 [scrapy.crawler] INFO: Overridden settings: {'AUTOTHROTTLE_MAX_DELAY': 120, 'AUTOTHROTTLE_START_DELAY': 60, 'BOT_NAME': 'digi_allbooks', 'FEED_FORMAT': 'xml', 'FEED_URI': '99-08-01.xml', 'NEWSPIDER_MODULE': 'digi_allbooks.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['digi_allbooks.spiders']}
2020-10-22 11:59:17 [scrapy.extensions.telnet] INFO: Telnet Password: 2583fc44c0155dc4
2020-10-22 11:59:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-10-22 11:59:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'digi_allbooks.middlewares.UserAgentRotatorMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-22 11:59:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-22 11:59:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-22 11:59:17 [scrapy.core.engine] INFO: Spider opened
2020-10-22 11:59:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-22 11:59:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-22 11:59:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikala.com/robots.txt> (referer: None)
2020-10-22 11:59:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.digikala.com/search/category-book/> from <GET https://www.digikala.com/search/category-book>
2020-10-22 11:59:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikala.com/search/category-book/> (referer: None)
2020-10-22 11:59:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.digikala.com/search/category-book/>
{'title': 'کتاب 1984 اثر جورج اورول نشر شاهدخت پاییز', 'star': 4.6, 'discounted_percent': 69, 'discounted_price': 19900, 'original_price': 65000, 'discounted_amount': 45100, 'url': 'https://www.digikala.com/product/dkp-2824939/کتاب-1984-اثر-جورج-اورول-نشر-شاهدخت-پاییز'}
2020-10-22 11:59:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.digikala.com/search/category-book/?sortby=4&pageno=2>
{'title': 'کتاب ملت عشق اثر الیف شافاک', 'star': 4.4, 'discounted_percent': 43, 'discounted_price': 39900, 'original_price': 70000, 'discounted_amount': 30100, 'url': 'https://www.digikala.com/product/dkp-565603/کتاب-ملت-عشق-اثر-الیف-شافاک'}
2020-10-22 11:59:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.digikala.com/search/category-book/?sortby=4&pageno=2>
{'title': 'کتاب زنان زیرک اثر شری آرگو', 'star': 4.5, 'discounted_percent': 16, 'discounted_price': 29400, 'original_price': 35000, 'discounted_amount': 5600, 'url': 'https://www.digikala.com/product/dkp-413298/کتاب-زنان-زیرک-اثر-شری-آرگو'}
2020-10-22 11:59:21 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-22 11:59:21 [scrapy.extensions.feedexport] INFO: Stored xml feed (108 items) in: 99-08-01.xml
2020-10-22 11:59:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2133,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 270896,
'downloader/response_count': 5,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 22, 8, 29, 21, 140213),
'item_scraped_count': 108,
'log_count/DEBUG': 113,
'log_count/INFO': 10,
'offsite/filtered': 1,
'request_depth_max': 3,
'response_received_count': 4,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2020, 10, 22, 8, 29, 17, 564090)}
2020-10-22 11:59:21 [scrapy.core.engine] INFO: Spider closed (finished)
And here is my shortened spider:
import logging

import scrapy


class AllbooksSpider(scrapy.Spider):
    name = 'allbooks'
    allowed_domains = ['www.digikala.com']

    def start_requests(self):
        yield scrapy.Request(url='https://www.digikala.com/search/category-book',
                             callback=self.parse)

    def parse(self, response):
        original_price = 0
        try:
            for product in response.xpath("//ul[@class='c-listing__items js-plp-products-list']/li"):
                title = product.xpath(".//div/div[2]/div/div/a/text()").get()
                if product.xpath(".//div/div[2]/div[3]/div/div/del/text()"):
                    original_price = int(str(product.xpath(".//div/div[2]/div[3]/div/div/del/text()").get().strip()).replace(',', ''))
                    discounted_amount = original_price - discounted_price
                else:
                    original_price = print("not available")
                    discounted_amount = print("not available")
                yield {
                    'title': title,
                    'discounted_amount': discounted_amount
                }
            next_page = response.xpath('//*[@class="c-pager__item"]/../following-sibling::*//@href').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page))
        except AttributeError:
            logging.error("The element didn't exist")
Can you help me understand what the problem is and how to solve it?
Thank you!!
Solution
The problem is your next page link selector.
Your current code finds the first pagination link that is not currently active and then follows the one after it.
As a result, these are the links your spider tries to follow:
https://www.digikala.com/search/category-book/?sortby=4&pageno=3
https://www.digikala.com/search/category-book/?sortby=4&pageno=2
javascript:
One way to fix this is to follow the link after the currently active one (@class="c-pager__item is-active").
Another way, resulting in simpler code, is to follow every pagination link and let the dupefilter do its job:
for link in response.css(".c-pager__item"):
    yield response.follow(link)
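Because every result page links to the same set of pagination pages, most of these requests are duplicates, and Scrapy's built-in duplicate request filter (the default RFPDupeFilter) drops any request it has already seen, so each page is still crawled exactly once. Note that response.follow() accepts an &lt;a&gt; element selector directly (assuming the .c-pager__item elements are anchor tags, as they appear to be here) and resolves relative URLs for you, so no urljoin() call is needed.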
Answered By - stranac