Issue
I am trying to scrape this website for the address and contact details, but I don't know why I am getting None as output. The data I want is present in the response, yet I can't extract it. Please tell me where I am going wrong; I have wasted plenty of time and am just stuck.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MobilesSpider(CrawlSpider):
    name = 'mobiles'
    allowed_domains = ['www.vcsdata.com']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'

    def set_user_agent(self, request, response):
        request.headers['User-Agent'] = self.user_agent
        return request

    def start_requests(self):
        yield scrapy.Request(url='https://www.vcsdata.com/companies_gurgaon.html',
                             headers={
                                 'User_Agent': self.user_agent
                             })

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//div/a[@class="text-dark"]')), callback='parse_item', follow=True, process_request='set_user_agent'),
    )

    def parse_item(self, response):
        data = response.url
        print(data)
        address = response.xpath('/html/body/div/section[2]/div/div/div[1]/div[2]/div[2]/div/div/div[1]/h6/text()').get()
        print(address)
Solution
You might have a mistake in your XPath selector. In any case, I would advise you to avoid using an absolute XPath from the document root. Although it can work, it is quite fragile: even a minor change in the HTML will break your parsing. By using // instead, you get a shorter and more reliable selector, e.g.:

response.xpath('//h6[contains(., "Address")]/text()').get()
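For example, your parse_item callback could anchor on visible text rather than on the full document structure. The sketch below is only illustrative; the "Address" and "Contact" labels are assumptions about the markup of the vcsdata.com detail pages, so adjust the expressions to match what you actually see in the response:

    def parse_item(self, response):
        # Select by nearby text content instead of a root-to-leaf path.
        # The label strings below are assumptions about the page markup.
        address = response.xpath('//h6[contains(., "Address")]/text()').get()
        contact = response.xpath('//h6[contains(., "Contact")]/text()').get()
        yield {
            'url': response.url,
            'address': address.strip() if address else None,
            'contact': contact.strip() if contact else None,
        }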
Also, instead of having a set_user_agent method, you could define the User-Agent in the Scrapy settings (e.g. in the settings.py file or via the custom_settings class attribute):

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
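If you prefer to keep it in the spider itself, a minimal sketch using custom_settings would look like this (with that in place, the set_user_agent method and the process_request='set_user_agent' argument are no longer needed):

    class MobilesSpider(CrawlSpider):
        name = 'mobiles'
        allowed_domains = ['www.vcsdata.com']
        # Applies to every request made by this spider.
        custom_settings = {
            'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
        }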
Answered By - Thiago Curvelo