Issue
I'm trying to extract the list of cities from this TripAdvisor page: https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html
I want to use only Scrapy and XPath. What I have tried:
def parse(self, response):
    cities = response.xpath('//div[@id="LOCATION_LIST"]')
    for links in cities:
        loader = ItemLoader(AdvisorItem(), selector=links)
        loader.add_xpath('cities', './/ul[@class="geoList"]/li/span[@class="state"]//text()')
        loader.add_xpath('cities_url', './/ul[@class="geoList"]/li/a//@href')
        yield loader.load_item()
This returns only one result, West Yorkshire, which turns out not to be on that page, so I'm unsure where it's coming from. How do I get the right XPath for the city names and links for all the entries on that page?
Solution
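You can select the list items with this XPath:
    //*[@class="geoList"]/li
It selects one li element per city. Relative to each of those elements,
    .//a/text()
and
    .//a/@href
select the city name and the link, respectively. Note that @href already yields the attribute value, so no text() step is appended to it.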
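To illustrate why the loop should run over the li elements rather than over the single LOCATION_LIST container, here is a minimal sketch. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the link text and href are read with .text and .get() instead of text() and @href. The SAMPLE markup is a hypothetical, simplified stand-in for the real page, which may well be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup (not the real HTML).
SAMPLE = """
<body>
  <div id="LOCATION_LIST">
    <ul class="geoList">
      <li><a href="/Restaurants-g186408-Bradford_West_Yorkshire_England.html">Bradford Restaurants</a></li>
      <li><a href="/Restaurants-g186258-Plymouth_Devon_England.html">Plymouth Restaurants</a></li>
    </ul>
  </div>
</body>
"""

body = ET.fromstring(SAMPLE)

# Selecting the container gives a single element, so a loop over it
# runs once and can yield at most one item.
containers = body.findall(".//div[@id='LOCATION_LIST']")

# Selecting the list items gives one element per city.
rows = body.findall(".//*[@class='geoList']/li")

items = []
for li in rows:
    link = li.find(".//a")  # the anchor inside this list item
    items.append({"city": link.text, "link": link.get("href")})

print(len(containers), len(rows))
print(items)
```

With the question's code, the loop body runs once for the one container, which is consistent with getting a single item back; which value ends up in that item depends on the item's processors, which are not shown in the question.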
An example implementation in Scrapy:
import scrapy


class TripSpider(scrapy.Spider):
    name = 'trip'
    allowed_domains = ["tripadvisor.co.uk"]
    start_urls = ['https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html']

    def parse(self, response):
        # One <li> per city inside the element with class "geoList"
        cities = response.xpath('//*[@class="geoList"]/li')
        for city in cities:
            # The href is relative, so prepend the site root
            url = city.xpath(".//a/@href").get()
            abs_url = f'https://www.tripadvisor.co.uk{url}'
            yield {
                'city': city.xpath(".//a/text()").get(),
                'link': abs_url,
            }
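As a possible alternative to building the absolute URL with an f-string, the relative href can be resolved against the page URL. The sketch below uses only the standard library's urllib.parse.urljoin; PAGE_URL and absolute_link are names introduced here for illustration. Inside a spider, response.urljoin(url) does the same resolution against the response's own URL.

```python
from urllib.parse import urljoin

# The page being scraped, taken from start_urls in the spider.
PAGE_URL = "https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html"


def absolute_link(href):
    """Return an absolute URL for href, or None when no href was extracted."""
    if href is None:
        # .get() returns None when the XPath matches nothing; avoid
        # producing a bogus link in that case.
        return None
    return urljoin(PAGE_URL, href)


print(absolute_link("/Restaurants-g186408-Bradford_West_Yorkshire_England.html"))
```

Unlike the f-string, this does not turn a missing href into a URL ending in "None", and it also copes with hrefs that are already absolute.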
Output:
{'city': 'Bradford Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186408-Bradford_West_Yorkshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Plymouth Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186258-Plymouth_Devon_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Southend-on-Sea Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g503790-Southend_on_Sea_Essex_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Swansea Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186466-Swansea_Swansea_County_South_Wales_Wales.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Aberdeen Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186487-Aberdeen_Aberdeenshire_Scotland.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Coventry Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186403-Coventry_West_Midlands_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Portsmouth Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186298-Portsmouth_Hampshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Kingston-upon-Hull Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186317-Kingston_upon_Hull_East_Riding_of_Yorkshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Oxford Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186361-Oxford_Oxfordshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Isle of Wight Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186308-Isle_of_Wight_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Doncaster Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187067-Doncaster_South_Yorkshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Reading Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186363-Reading_Berkshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Cambridge Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186225-Cambridge_Cambridgeshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Milton Keynes Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187055-Milton_Keynes_Buckinghamshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Derby Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187048-Derby_Derbyshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Stockport Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g528793-Stockport_Greater_Manchester_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Northampton Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186349-Northampton_Northamptonshire_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Bolton Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187053-Bolton_Greater_Manchester_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Bath Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186370-Bath_Somerset_England.html'}
2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
{'city': 'Preston Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187062-Preston_Lancashire_England.html'}
2021-12-08 23:58:50 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-08 23:58:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 345,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 103132,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 3.084321,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 8, 17, 58, 50, 809225),
'httpcompression/response_bytes': 384303,
'httpcompression/response_count': 1,
'item_scraped_count': 20,
Answered By - Fazlul