Tuesday, November 9, 2021

[FIXED] How to extract the website URL from the redirect link with Scrapy Python

python, scrapy, web-scraping

Issue

I wrote a script to scrape data from a website. I am having trouble collecting each firm's website URL, because the @href is a redirect link. How can I convert the redirect URL into the actual website it redirects to?

import scrapy
import logging


class AppSpider(scrapy.Spider):
    name = 'app'
    allowed_domains = ['www.houzz.in']
    start_urls = ['https://www.houzz.in/professionals/searchDirectory?topicId=26721&query=Design-Build+Firms&location=Mumbai+City+District%2C+India&distance=100&sort=4']

    def parse(self, response):
        lists = response.xpath('//li[@class="hz-pro-search-results__item"]/div/div[@class="hz-pro-search-result__info"]/div/div/div/a')
        for data in lists:
            link = data.xpath('.//@href').get()

            yield scrapy.Request(url=link, callback=self.parse_houses, meta={'Links': link})

        next_page = response.xpath('(//a[@class="hz-pagination-link hz-pagination-link--next"])[1]/@href').extract_first()
        if next_page:
            yield response.follow(response.urljoin(next_page), callback=self.parse)

    def parse_houses(self, response):
        link = response.request.meta['Links']

        firm_name = response.xpath('//div[@class="hz-profile-header__title"]/h1/text()').get()
        name = response.xpath('//div[@class="profile-meta__val"]/text()').get()
        phone = response.xpath('//div[@class="hz-profile-header__contact-info text-right mrm"]/a/span/text()').get()
        website = response.xpath('(//div[@class="hz-profile-header__contact-info text-right mrm"]/a)[2]/@href').get()

        yield {
            'Links': link,
            'Firm_name': firm_name,
            'Name': name,
            'Phone': phone,
            'Website': website
        }

Solution

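You have to make a request to that target URL to see where it leads.

In your case a HEAD request is enough: it does not download the body of the target URL, which saves bandwidth and speeds up your script. Note that Scrapy follows redirects by default, so the Location header only reaches your callback if the redirect is not followed (for example, by setting the dont_redirect meta key on the request).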

def parse_houses(self, response):
    link = response.request.meta['Links']

    firm_name = response.xpath('//div[@class="hz-profile-header__title"]/h1/text()').get()
    name = response.xpath('//div[@class="profile-meta__val"]/text()').get()
    phone = response.xpath('//div[@class="hz-profile-header__contact-info text-right mrm"]/a/span/text()').get()
    website = response.xpath('(//div[@class="hz-profile-header__contact-info text-right mrm"]/a)[2]/@href').get()

    # Send a HEAD request to the redirect link without following it, so the
    # 3xx response (and its Location header) is passed to the callback.
    yield scrapy.Request(
        url=website,
        method="HEAD",
        callback=self.get_final_link,
        meta={
            'dont_redirect': True,
            'handle_httpstatus_list': [301, 302, 303, 307, 308],
            'data': {
                'Links': link,
                'Firm_name': firm_name,
                'Name': name,
                'Phone': phone,
                'Website': website,
            },
        },
    )


def get_final_link(self, response):
    data = response.meta['data']
    # Header values are bytes; keep the redirect link if no Location was sent.
    location = response.headers.get('Location')
    if location:
        data['Website'] = location.decode()
    yield data
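
The Location header of a redirect response may hold a relative URL, and Scrapy exposes header values as bytes. A minimal sketch, using only the standard library, of how such a value could be normalized into an absolute URL before storing it. The helper name resolve_location and the URLs used with it are assumptions for illustration, not part of the original answer.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Return the absolute redirect target, or None if there is no Location value."""
    if not location:
        return None
    # Header values may arrive as bytes (as in Scrapy); decode before joining.
    if isinstance(location, bytes):
        location = location.decode("latin-1")
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the URL that was requested.
    return urljoin(request_url, location)
```

Inside a Scrapy callback, response.urljoin(...) performs the same kind of joining against the response URL.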

If your goal is just to get the website, the actual link is also available in the source code of each listing page. You can grab it with a regex, with no need to visit the redirect URL.

def parse_houses(self, response):
    link = response.request.meta['Links']

    firm_name = response.xpath('//div[@class="hz-profile-header__title"]/h1/text()').get()
    name = response.xpath('//div[@class="profile-meta__val"]/text()').get()
    phone = response.xpath('//div[@class="hz-profile-header__contact-info text-right mrm"]/a/span/text()').get()
    # Needs `import re` at the top of the module; website is None if the pattern is absent.
    match = re.search(r'"url": "(.*?)"', response.text)
    website = match.group(1) if match else None
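
The regex approach can be tried outside the spider. The sketch below applies the same pattern to a made-up page fragment; the helper name extract_website and the sample string are assumptions for illustration and do not reproduce the real markup of the listing pages.

```python
import re

# Same pattern as in the answer: the first "url": "..." pair in the page source.
URL_PATTERN = re.compile(r'"url": "(.*?)"')


def extract_website(page_source):
    """Return the first captured URL, or None when the pattern is absent."""
    match = URL_PATTERN.search(page_source)
    return match.group(1) if match else None


# Made-up page fragment, for illustration only.
sample = '<script>{"name": "Some Firm", "url": "https://firm.example.org"}</script>'
print(extract_website(sample))
print(extract_website("<html></html>"))
```

Returning None instead of indexing the result with [0] keeps the spider from raising an IndexError on pages where the pattern does not appear.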

Answered By - Umair Ayub

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
