Issue
I wrote a script to scrape data from a website. I'm having trouble collecting each firm's website URL because the @href is a redirect link. How can I convert the redirect URL into the actual website URL it redirects to?
import scrapy
import logging


class AppSpider(scrapy.Spider):
    name = 'app'
    allowed_domains = ['www.houzz.in']
    start_urls = ['https://www.houzz.in/professionals/searchDirectory?topicId=26721&query=Design-Build+Firms&location=Mumbai+City+District%2C+India&distance=100&sort=4']

    def parse(self, response):
        lists = response.xpath('//li[@class="hz-pro-search-results__item"]/div/div[@class="hz-pro-search-result__info"]/div/div/div/a')
        for data in lists:
            link = data.xpath('.//@href').get()
            yield scrapy.Request(url=link, callback=self.parse_houses, meta={'Links': link})
        next_page = response.xpath('(//a[@class="hz-pagination-link hz-pagination-link--next"])[1]/@href').extract_first()
        if next_page:
            yield response.follow(response.urljoin(next_page), callback=self.parse)

    def parse_houses(self, response):
        link = response.request.meta['Links']
        firm_name = response.xpath('//div[@class="hz-profile-header__title"]/h1/text()').get()
        name = response.xpath('//div[@class="profile-meta__val"]/text()').get()
        phone = response.xpath('//div[@class="hz-profile-header__contact-info text-right mrm"]/a/span/text()').get()
        website = response.xpath('(//div[@class="hz-profile-header__contact-info text-right mrm"]/a)[2]/@href').get()
        yield {
            'Links': link,
            'Firm_name': firm_name,
            'Name': name,
            'Phone': phone,
            'Website': website
        }
Solution
You have to make a request to the target URL to see where it leads.
In your case, a HEAD request is enough: it does not download the body of the target URL, which saves bandwidth and speeds up your script.
    def parse_houses(self, response):
        link = response.request.meta['Links']
        firm_name = response.xpath('//div[@class="hz-profile-header__title"]/h1/text()').get()
        name = response.xpath('//div[@class="profile-meta__val"]/text()').get()
        phone = response.xpath('//div[@class="hz-profile-header__contact-info text-right mrm"]/a/span/text()').get()
        website = response.xpath('(//div[@class="hz-profile-header__contact-info text-right mrm"]/a)[2]/@href').get()
        # Scrapy follows redirects by default; turn that off for this request so
        # the 3xx response (with its Location header) reaches the callback.
        yield scrapy.Request(
            url=website,
            method="HEAD",
            callback=self.get_final_link,
            meta={
                'dont_redirect': True,
                'handle_httpstatus_list': [301, 302, 303, 307, 308],
                'data': {
                    'Links': link,
                    'Firm_name': firm_name,
                    'Name': name,
                    'Phone': phone,
                    'Website': website
                }
            }
        )

    def get_final_link(self, response):
        data = response.meta['data']
        # Header values are bytes. If there is no Location header, keep the scraped href.
        location = response.headers.get('Location')
        if location:
            data['Website'] = location.decode()
        yield data
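Scrapy returns header values as bytes, and a Location value may be a relative path rather than a full URL. Here is a minimal sketch of a helper that normalises it, using only the standard library. The name resolve_location and the example URLs are made up for illustration; they are not part of Scrapy or of the site.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location_header):
    """Return the absolute redirect target, or None if there is no Location header.

    request_url     -- the URL that was requested (the redirect link)
    location_header -- the raw Location value, as bytes or str, or None
    """
    if not location_header:
        return None
    if isinstance(location_header, bytes):
        # latin-1 maps every byte to a character, so decoding cannot fail
        location_header = location_header.decode("latin-1")
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return urljoin(request_url, location_header.strip())
```

In the callback this could be called as resolve_location(response.url, response.headers.get('Location')).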
If your goal is just to get the website, the actual link is also available in the source code of each listing page. You can grab it with a regex, with no need to visit the redirect URL.
    # Requires "import re" at the top of the module.
    def parse_houses(self, response):
        link = response.request.meta['Links']
        firm_name = response.xpath('//div[@class="hz-profile-header__title"]/h1/text()').get()
        name = response.xpath('//div[@class="profile-meta__val"]/text()').get()
        phone = response.xpath('//div[@class="hz-profile-header__contact-info text-right mrm"]/a/span/text()').get()
        # The page source contains the real site as "url": "..."; take the first match.
        website = re.findall(r'"url": "(.*?)"', response.text)[0]
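Indexing the findall result with [0] raises IndexError on a listing that has no "url" field. Below is a hedged sketch of a more defensive variant. The helper name extract_website is hypothetical, and the handling of JSON-escaped slashes (\/) assumes the value sits inside embedded JSON, which may not match the real page.

```python
import re

# Same pattern as above, but tolerant of varying whitespace after the colon
URL_FIELD = re.compile(r'"url":\s*"(.*?)"')


def extract_website(page_source):
    """Return the first "url" value found in the page source, or None."""
    match = URL_FIELD.search(page_source)
    if match is None:
        return None
    # Embedded JSON often escapes "/" as "\/"; undo that if present
    return match.group(1).replace("\\/", "/")
```

Inside parse_houses this would be used as website = extract_website(response.text).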
Answered By - Umair Ayub