Issue
I am trying to scrape booking.com with scrapy. The problem occurs when I try to implement pagination: I am trying to get the URL of the next page, but scrapy retrieves a different URL (I get it through the shell), which results in "page not found" when I paste it into Chrome. And when I export the output to JSON, no pagination URL is retrieved at all. Does anyone have any suggestions? Maybe I should shorten the first URL.
I tried to set a canonicalize=False rule, but it didn't do anything.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookingSpider(scrapy.Spider):
    name = "BookingScrape"
    start_urls = [
        'https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaFCIAQGYAQm4ARfIAQzYAQHoAQH4AQuIAgGoAgO4Aomk4OkFwAIB&lang=en-gb&sid=163b31478fa340d233204d1dcbb259ec&sb=1&src=searchresults&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Fsearchresults.en-gb.html%3Faid%3D304142%3Blabel%3Dgen173nr-1FCAEoggI46AdIM1gEaFCIAQGYAQm4ARfIAQzYAQHoAQH4AQuIAgGoAgO4Aomk4OkFwAIB%3Bsid%3D163b31478fa340d233204d1dcbb259ec%3Btmpl%3Dsearchresults%3Bcheckin_month%3D9%3Bcheckin_monthday%3D10%3Bcheckin_year%3D2019%3Bcheckout_month%3D9%3Bcheckout_monthday%3D12%3Bcheckout_year%3D2019%3Bclass_interval%3D1%3Bdest_id%3D15754%3Bdest_type%3Dlandmark%3Bdtdisc%3D0%3Bfrom_sf%3D1%3Bgroup_adults%3D2%3Bgroup_children%3D0%3Binac%3D0%3Bindex_postcard%3D0%3Blabel_click%3Dundef%3Blandmark%3D15754%3Bno_rooms%3D1%3Boffset%3D0%3Bpostcard%3D0%3Broom1%3DA%252CA%3Bsb_price_type%3Dtotal%3Bshw_aparth%3D1%3Bslp_r_match%3D0%3Bsrc%3Dsearchresults%3Bsrc_elem%3Dsb%3Bsrpvid%3Da3bf35ea467d01b9%3Bss%3DKensington%2520High%2520Street%3Bss_all%3D0%3Bssb%3Dempty%3Bsshis%3D0%3Bssne%3DKensington%2520High%2520Street%3Bssne_untouched%3DKensington%2520High%2520Street%26%3B&ss=Kensington+High+Street&is_ski_area=0&ssne=Kensington+High+Street&ssne_untouched=Kensington+High+Street&landmark=15754&checkin_year=2019&checkin_month=9&checkin_monthday=10&checkout_year=2019&checkout_month=9&checkout_monthday=12&group_adults=2&group_children=0&no_rooms=1&from_sf=1']

    # Note: rules are only honoured by CrawlSpider; because this class
    # subclasses scrapy.Spider, this Rule (and its canonicalize=False)
    # is silently ignored.
    rules = (
        Rule(LinkExtractor(allow=('CINE&OBRA&-1&29',), canonicalize=False),
             callback='parse_item', follow=False),
    )

    def parse(self, response):
        for hotel in response.css("h3.sr-hotel__title"):
            yield {
                'hotel_name': hotel.css("span.sr-hotel__name::text").extract_first(),
                'link': hotel.css("h3.sr-hotel__title a::attr(href)").extract_first(),
                # This selector is scoped to the <h3> above, so it never
                # matches the page-level pagination element.
                'pagination': hotel.css('li.bui-pagination__item.bui-pagination__next-arrow a::attr(href)').extract_first()
            }
        for a in response.css('li.bui-pagination__item.bui-pagination__next-arrow a'):
            yield response.follow(a, callback=self.parse)
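(For reference: the next-page link sits outside the individual hotel blocks, so it is usually extracted once from the whole response rather than inside the per-hotel loop. A minimal sketch reusing the selectors above, not verified against the live page:)

def parse(self, response):
    for hotel in response.css("h3.sr-hotel__title"):
        yield {
            'hotel_name': hotel.css("span.sr-hotel__name::text").extract_first(),
            'link': hotel.css("a::attr(href)").extract_first(),
        }
    # Select pagination from the full response; response.follow()
    # resolves the relative href against the current page URL.
    next_page = response.css(
        'li.bui-pagination__item.bui-pagination__next-arrow a::attr(href)').extract_first()
    if next_page:
        yield response.follow(next_page, callback=self.parse)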
The URL I receive through the shell (which does not take me to the next page), compared against the expected one:
# Received:
https://www.booking.com/searchresults.en-gb.html" data-page-next class="bui-pagination__link paging-next ga_sr_gotopage_2_207" title="Next page">\n<svg class="bk-icon -iconset-navarrow_right bui-pagination__icon" height="18" role="presentation" width="18" viewbox="0 0 128 128" aria-hidden="true"><path d="M54.3 96a4 4 0 0 1-2.8-6.8L76.7 64 51.5 38.8a4 4 0 0 1 5.7-5.6L88 64 57.2 94.8a4 4 0 0 1-2.9 1.2z
# Expected:
https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaFCIAQGYAQm4ARfIAQzYAQHoAQH4AQuIAgGoAgO4Aomk4OkFwAIB&sid=163b31478fa340d233204d1dcbb259ec&tmpl=searchresults&checkin_month=9&checkin_monthday=10&checkin_year=2019&checkout_month=9&checkout_monthday=12&checkout_year=2019&class_interval=1&dest_id=15754&dest_type=landmark&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&landmark=15754&no_rooms=1&postcard=0&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&src=searchresults&src_elem=sb&srpvid=7246436ad3b000a5&ss=Kensington%20High%20Street&ss_all=0&ssb=empty&sshis=0&ssne=Kensington%20High%20Street&ssne_untouched=Kensington%20High%20Street&rows=15&offset=15
Solution
I had to change my user agent to Chrome in the Scrapy settings, so that the web page I am scraping would not block me. Joseph, thank you for your solution.
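A minimal sketch of that change, assuming a standard project-level settings.py (the exact Chrome version string below is illustrative; any recent one works):

# settings.py
# Present the spider as a regular Chrome browser so booking.com serves
# normal markup instead of blocking the request.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/76.0.3809.132 Safari/537.36')

The same value can also be set per spider via the custom_settings dict on the spider class instead of globally in settings.py.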
Answered By - Karolis Pakalnis