Issue
I am trying to scrape booking.com with scrapy. The problem occurs when I try to implement pagination: I am trying to get the URL of the next page, but scrapy retrieves a different URL (I get it through the shell), which results in "page not found" when I paste it into Chrome. And when I export the output to JSON, no pagination URL is retrieved at all. Does anyone have any suggestions? Maybe I should shorten the first URL.
I tried to set a canonicalize=False rule, but it didn't do anything.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookingSpider(scrapy.Spider):
    name = "BookingScrape"
    start_urls = [
        'https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaFCIAQGYAQm4ARfIAQzYAQHoAQH4AQuIAgGoAgO4Aomk4OkFwAIB&lang=en-gb&sid=163b31478fa340d233204d1dcbb259ec&sb=1&src=searchresults&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Fsearchresults.en-gb.html%3Faid%3D304142%3Blabel%3Dgen173nr-1FCAEoggI46AdIM1gEaFCIAQGYAQm4ARfIAQzYAQHoAQH4AQuIAgGoAgO4Aomk4OkFwAIB%3Bsid%3D163b31478fa340d233204d1dcbb259ec%3Btmpl%3Dsearchresults%3Bcheckin_month%3D9%3Bcheckin_monthday%3D10%3Bcheckin_year%3D2019%3Bcheckout_month%3D9%3Bcheckout_monthday%3D12%3Bcheckout_year%3D2019%3Bclass_interval%3D1%3Bdest_id%3D15754%3Bdest_type%3Dlandmark%3Bdtdisc%3D0%3Bfrom_sf%3D1%3Bgroup_adults%3D2%3Bgroup_children%3D0%3Binac%3D0%3Bindex_postcard%3D0%3Blabel_click%3Dundef%3Blandmark%3D15754%3Bno_rooms%3D1%3Boffset%3D0%3Bpostcard%3D0%3Broom1%3DA%252CA%3Bsb_price_type%3Dtotal%3Bshw_aparth%3D1%3Bslp_r_match%3D0%3Bsrc%3Dsearchresults%3Bsrc_elem%3Dsb%3Bsrpvid%3Da3bf35ea467d01b9%3Bss%3DKensington%2520High%2520Street%3Bss_all%3D0%3Bssb%3Dempty%3Bsshis%3D0%3Bssne%3DKensington%2520High%2520Street%3Bssne_untouched%3DKensington%2520High%2520Street%26%3B&ss=Kensington+High+Street&is_ski_area=0&ssne=Kensington+High+Street&ssne_untouched=Kensington+High+Street&landmark=15754&checkin_year=2019&checkin_month=9&checkin_monthday=10&checkout_year=2019&checkout_month=9&checkout_monthday=12&group_adults=2&group_children=0&no_rooms=1&from_sf=1']

    # Note: rules are only honoured by CrawlSpider; because this class
    # subclasses scrapy.Spider, this Rule (and its canonicalize=False)
    # is silently ignored.
    rules = (
        Rule(LinkExtractor(allow=('CINE&OBRA&-1&29',), canonicalize=False),
             callback='parse_item', follow=False),
    )

    def parse(self, response):
        for hotel in response.css("h3.sr-hotel__title"):
            yield {
                'hotel_name': hotel.css("span.sr-hotel__name::text").extract_first(),
                'link': hotel.css("h3.sr-hotel__title a::attr(href)").extract_first(),
                # This selector is scoped to the <h3> above, so it never
                # matches the page-level pagination element.
                'pagination': hotel.css('li.bui-pagination__item.bui-pagination__next-arrow a::attr(href)').extract_first()
            }
        for a in response.css('li.bui-pagination__item.bui-pagination__next-arrow a'):
            yield response.follow(a, callback=self.parse)
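(For reference: the next-page link sits outside the individual hotel blocks, so it is usually extracted once from the whole response rather than inside the per-hotel loop. A minimal sketch reusing the selectors above, not verified against the live page:)

def parse(self, response):
    for hotel in response.css("h3.sr-hotel__title"):
        yield {
            'hotel_name': hotel.css("span.sr-hotel__name::text").extract_first(),
            'link': hotel.css("a::attr(href)").extract_first(),
        }
    # Select pagination from the full response; response.follow()
    # resolves the relative href against the current page URL.
    next_page = response.css(
        'li.bui-pagination__item.bui-pagination__next-arrow a::attr(href)').extract_first()
    if next_page:
        yield response.follow(next_page, callback=self.parse)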
The URL I receive through the shell (which does not take me to the next page), compared against the expected one:
# Received:
https://www.booking.com/searchresults.en-gb.html" data-page-next class="bui-pagination__link paging-next ga_sr_gotopage_2_207" title="Next page">\n<svg class="bk-icon -iconset-navarrow_right bui-pagination__icon" height="18" role="presentation" width="18" viewbox="0 0 128 128" aria-hidden="true"><path d="M54.3 96a4 4 0 0 1-2.8-6.8L76.7 64 51.5 38.8a4 4 0 0 1 5.7-5.6L88 64 57.2 94.8a4 4 0 0 1-2.9 1.2z
# Expected:
https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaFCIAQGYAQm4ARfIAQzYAQHoAQH4AQuIAgGoAgO4Aomk4OkFwAIB&sid=163b31478fa340d233204d1dcbb259ec&tmpl=searchresults&checkin_month=9&checkin_monthday=10&checkin_year=2019&checkout_month=9&checkout_monthday=12&checkout_year=2019&class_interval=1&dest_id=15754&dest_type=landmark&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&landmark=15754&no_rooms=1&postcard=0&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&src=searchresults&src_elem=sb&srpvid=7246436ad3b000a5&ss=Kensington%20High%20Street&ss_all=0&ssb=empty&sshis=0&ssne=Kensington%20High%20Street&ssne_untouched=Kensington%20High%20Street&rows=15&offset=15
Solution
I had to change my user agent to Chrome in the Scrapy settings, so that the web page I am scraping would not block me. Joseph, thank you for your solution.
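A minimal sketch of that change, assuming a standard project-level settings.py (the exact Chrome version string below is illustrative; any recent one works):

# settings.py
# Present the spider as a regular Chrome browser so booking.com serves
# normal markup instead of blocking the request.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/76.0.3809.132 Safari/537.36')

The same value can also be set per spider via the custom_settings dict on the spider class instead of globally in settings.py.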
Answered By - Karolis Pakalnis