Friday, January 26, 2024

[FIXED] Scrapy not scraping next page

Labels: python, scrapy

Issue

I'm new to spiders, and my crawler won't scrape the next page. After the first page's data, my crawl log shows 'DEBUG: Crawled (200) <GET https://reedsy.com/robots.txt> (referer: None)' twice, and the next line is:

[scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://reedsy.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

Thanks in advance for the help!

import scrapy

class PublisherSpider(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=']
   
    def parse(self, response):
        for publishers in response.css('div.panel-body'):
            publisher = publishers.css('h3.text-heavy::text').get()
            url = publishers.css('a.text-blue::attr(href)').get()
            if publisher and url:
                yield {"Publisher": publisher.strip(), "url": url}
                
        next_page = response.css('a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback = self.parse)

Along with the code shown above, I've also tried:

next_page = response.css('a').attrib['href']
yield response.follow(next_page, callback = self.parse, dont_filter = True)
next_page = response.css('a::attr(href)').extract()
next_page = response.css('a::attr(href)').extract_first()

Solution

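Your next_page CSS selector isn't specific enough. response.css('a::attr(href)').get() returns the href of the first <a> element it finds on the page, which is not the pagination link. Judging by the duplicate-request line in your log, that first link appears to point to https://reedsy.com/, so the spider follows it once and later requests for the same URL are dropped by the duplicate filter. With an XPath expression you can instead target the rel attribute of the actual next-page link at the bottom of the page.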

For example:

import scrapy

class PublisherSpider(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=']

    def parse(self, response):
        # Each div.panel-body is treated as one publisher entry.
        for publishers in response.css('div.panel-body'):
            publisher = publishers.css('h3.text-heavy::text').get()
            url = publishers.css('a.text-blue::attr(href)').get()
            if publisher and url:
                yield {"Publisher": publisher.strip(), "url": url}

        # Follow only the pagination link marked rel="next". .get() returns
        # None when no such link exists, so no further request is made.
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

OUTPUT

{'Publisher': 'Akashic Books', 'url': 'http://www.akashicbooks.com/submissions/'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Chicago Review Press', 'url': 'https://www.chicagoreviewpress.com/information-for-authors--amp--agents-pages-100.php'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Atria Publishing Group', 'url': 'https://www.atriabooks.biz/'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Yale University Press', 'url': 'https://yalebooks.yale.edu/about-us/editors#submissions'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Kensington Publishing', 'url': 'https://www.kensingtonbooks.com/pages/submissions/'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Third World Press Foundation', 'url': 'https://thirdworldpressfoundation.org/submit-a-manuscript-2/'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Dafina', 'url': 'https://www.kensingtonbooks.com/pages/submissions/'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'University of Illinois Press', 'url': 'https://www.press.uillinois.edu/authors/proposal.html'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Arsenal Pulp Press', 'url': 'https://arsenalpulp.com/About-Arsenal-Pulp-Press/Submission-Guidelines'}
2023-05-03 22:13:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'University of Georgia Press', 'url': 'https://ugapress.org/resources/frequently-asked-questions/'}
2023-05-03 22:13:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=> (referer: https://blog.reedsy.com/publishers/african-american/?accepts_submissions=true&formats=&publisher_size=)
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Rosen Publishing', 'url': 'https://www.rosenpublishing.com/faqs'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Peepal Tree', 'url': 'https://peepaltreepress.submittable.com/submit'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'RedBone Press', 'url': 'https://www.redbonepress.com/pages/frontpage'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Just Us Books', 'url': 'https://justusbooks.com/pages/resource-center/submission-guidelines.html'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Good2Go Publishing', 'url': 'https://www.good2gopublishing.com/submissions'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Royalty Publishing House', 'url': 'https://www.royaltypublishinghouse.com/submissions/'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Black Classic Press', 'url': 'http://www.blackclassicbooks.com/manuscript-submission/'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Noemi Press', 'url': 'http://www.noemipress.org/contest/'}
2023-05-03 22:13:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.reedsy.com/publishers/african-american/page/2/?accepts_submissions=true&formats=&publisher_size=>
{'Publisher': 'Wayne State University Press', 'url': 'https://www.wsupress.wayne.edu/authors'}
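
The log above shows the second request going to the /page/2/ URL with the first page as its referer. response.follow accepts relative URLs and resolves them against the URL of the current response, so next_page does not need to be absolute. Here is a minimal sketch of that resolution using the standard library's urljoin. The relative href is a hypothetical value chosen to match the log, and the real attribute on the page may look different.

```python
from urllib.parse import urljoin

# URL of the first page, taken from start_urls.
current_page = (
    "https://blog.reedsy.com/publishers/african-american/"
    "?accepts_submissions=true&formats=&publisher_size="
)

# Hypothetical value of the rel="next" href (an assumption, not scraped).
relative_href = (
    "/publishers/african-american/page/2/"
    "?accepts_submissions=true&formats=&publisher_size="
)

# A relative href takes its scheme and host from the current page.
resolved = urljoin(current_page, relative_href)

# An absolute href is left unchanged by the same call.
absolute_href = "https://blog.reedsy.com/publishers/african-american/page/2/"
resolved_absolute = urljoin(current_page, absolute_href)

print(resolved)
print(resolved_absolute)
```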

Answered By - Alexander

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
