Sunday, June 26, 2022

[FIXED] Runtime Request URL change not working scrapy

Labels: python-3.x, scrapy

Issue

I have written a script in Python using Scrapy. The code is supposed to fetch all the pages that exist. It works on the first page load: when Scrapy starts, the script logic correctly gives us page number 2. But after requesting page 2, the XPath no longer matches anything on the newly loaded page, so I cannot continue this way and collect all the page numbers.

Here is the code snippet:

import scrapy
from scrapy import Spider

class PostsSpider(Spider):

    name = "posts"
    start_urls = [
        'https://www.boston.com/category/news/'
    ]

    def parse(self, response):
        print("first time")
        print(response)
        results = response.xpath("//*[contains(@id, 'load-more')]/@data-next-page").extract_first()
        print(results)
        if results is not None:
            for result in results:
                page_number = 'page/' + result
                new_url = self.start_urls[0] + page_number
                print(new_url)
                yield scrapy.Request(url=new_url, callback=self.parse)
        else:
            print("last page")

Solution

This happens because the site does not issue a new GET request for a page URL when it loads the next page. Instead, it makes an AJAX call to an API that returns JSON.
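To make the point above concrete, here is a minimal sketch of handling that kind of JSON response outside of Scrapy, using only the standard library. The payload is a made-up sample: only the `data`, `html`, and `nextPage` keys come from the code in this answer, while the article links, the `LinkCollector` class, and the `parse_load_more` helper are hypothetical names introduced for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical response body. Only the key names ("data", "html", "nextPage")
# mirror what the spider in this answer reads; the content is invented.
SAMPLE_BODY = json.dumps({
    "data": {
        "html": '<article><a href="https://example.com/a">A</a></article>'
                '<article><a href="https://example.com/b">B</a></article>',
        "nextPage": 3,
    }
})


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_load_more(body):
    """Return (links, next_page) from a JSON body shaped like SAMPLE_BODY."""
    data = json.loads(body)["data"]
    collector = LinkCollector()
    collector.feed(data["html"])  # the HTML arrives as a string inside the JSON
    return collector.links, data.get("nextPage")


links, next_page = parse_load_more(SAMPLE_BODY)
print(links, next_page)
```

The HTML fragment is nested inside the JSON as a string, which is why an XPath query run directly against such a response finds nothing until that string is parsed separately.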

I made some adjustments to your code so it should work properly now. I am assuming that there is something other than the next page number you are trying to extract from each page, so I wrapped the HTML string in a scrapy.Selector so you can use XPath and the like on it. This script will crawl a lot of pages very quickly, so you might want to adjust your settings to slow it down.

import scrapy
from scrapy import Spider
from scrapy.selector import Selector

class PostsSpider(Spider):

    name = "posts"
    ajaxurl = "https://www.boston.com/wp-json/boston/v1/load-more?taxonomy=category&term_id=779&search_query=&author=&orderby=&page=%s&_wpnonce=f43ab1aae4&ad_count=4&redundant_ids=25129871,25130264,25129873,25129799,25128140,25126233,25122755,25121853,25124456,25129584,25128656,25123311,25128423,25128100,25127934,25127250,25126228,25126222"
    start_urls = [
        'https://www.boston.com/category/news/'
    ]

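    def parse(self, response):
        new_url = None
        try:
            # The "load more" endpoint returns JSON; the start URL returns HTML.
            json_result = response.json()
        except ValueError:
            # Not JSON, so this is the initial HTML page: read the next page
            # number from the "load more" element.
            next_page = response.xpath("//*[contains(@id, 'load-more')]/@data-next-page").get()
            if next_page is not None:
                # Use the whole string; looping over it would split "10" into "1" and "0".
                new_url = self.ajaxurl % next_page
        else:
            html = json_result['data']['html']
            selector = Selector(text=html, type="html")
            # ... do some xpath stuff with selector.xpath.....
            # Assumption: "nextPage" is missing or empty once the last page is reached.
            next_page = json_result["data"].get("nextPage")
            if next_page:
                new_url = self.ajaxurl % next_page
        if new_url:
            print(new_url)
            yield scrapy.Request(url=new_url, callback=self.parse)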
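As for slowing the crawl down, one possible approach is a set of per-spider settings. The keys below are standard Scrapy setting names, but the values are illustrative assumptions rather than recommendations for this site, and `THROTTLE_SETTINGS` is just a name chosen for this sketch.

```python
# Standard Scrapy setting names; the values are illustrative assumptions.
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 1.0,                # seconds to wait between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # limit parallel requests to one site
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 1.0,      # initial delay used by AutoThrottle
    "AUTOTHROTTLE_MAX_DELAY": 10.0,       # upper bound when the server is slow
}
```

To apply it to this spider only, assign the dictionary to the `custom_settings` class attribute of `PostsSpider` (for example `custom_settings = THROTTLE_SETTINGS`). The same keys can instead go into the project's settings.py to affect every spider.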

Answered By - alexpdev

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
