Saturday, October 15, 2022

[FIXED] Scrapy is not scraping the whole page but only some part of it


Issue

I am trying to scrape job profiles from the British Petroleum (BP) website. Initially the site's robots.txt blocked the spider, but after I set ROBOTSTXT_OBEY = False it started working. Now, however, it does not scrape the whole page. Below is my code:

import scrapy


class exxonmobilSpider(scrapy.Spider):
    name = "bp"
    start_urls = ['https://www.bp.com/en/global/corporate/careers/search-and-apply.html?query=data+scientist']

    def parse(self, response):
        name = response.xpath('//h3[@class="Hit_hitTitle__3MFk3"]')
        print(name)
        print(len(name))

In the browser, that XPath matches the h3 tag, but when I run the code I get an empty list. I cross-checked by selecting all the li or div tags and counting them, and found that only some of the tags were being scraped: the page shows 55 li tags in total, but the count in the response is lower. Does anyone know why Scrapy is scraping only part of the page rather than the whole page?

Solution

In the hope that OP will include a minimal reproducible example in their next question, here is a way of getting those jobs. Bear in mind that the jobs are pulled from an API by JavaScript in the page, so the initial HTML that Scrapy downloads does not contain them. You need to either render the page with Splash or scrapy-playwright, or scrape the API directly. We will do the latter. The API URL can be found in the browser's dev tools, under the Network tab.

import scrapy


class BpscrapeSpider(scrapy.Spider):
    name = 'bpscrape'
    allowed_domains = ['algolianet.com', 'bp.com']

    def start_requests(self):
        # Headers copied from the browser's request to the search API
        headers = {
            'x-algolia-application-id': 'RF87OIMXXP',
            'x-algolia-api-key': 'f4f167340049feccfcf6141fb7b90a5d',
            'Origin': 'https://www.bp.com',
            'content-type': 'application/x-www-form-urlencoded',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
        }

        api_url='https://rf87oimxxp-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.9.1)%3B%20Browser%3B%20JS%20Helper%20(3.4.4)%3B%20react%20(17.0.2)%3B%20react-instantsearch%20(6.11.0)'
        payload = '{"requests":[{"indexName":"candidatematcher_bp_navapp_prod","params":"highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&filters=type%3A%20Professionals&hitsPerPage=100&query=data%20scientist&maxValuesPerFacet=20&page=0&facets=%5B%22country%22%2C%22group%22%5D&tagFilters="}]}'
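        # The search endpoint expects a POST with the query in the request body
        yield scrapy.Request(
            url=api_url,
            headers=headers,
            body=payload,
            callback=self.parse,
            method="POST")

    def parse(self, response):
        # Each hit in the first result set is one job posting
        data = response.json()['results'][0]['hits']
        for x in data:
            yield x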
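
The payload string in the spider is copied verbatim from the browser, which makes it awkward to change the search term or request another page. As a minimal sketch using only the standard library, the same body can be assembled from a dictionary. The helper name build_payload and its arguments are illustrative and not part of the original answer; the parameter values are the ones that appear in the payload above.

```python
import json
from urllib.parse import quote, urlencode

# Index name taken from the payload in the answer above
INDEX_NAME = "candidatematcher_bp_navapp_prod"


def build_payload(query, page=0, hits_per_page=100):
    """Return the JSON request body for one page of search results.

    Hypothetical helper: it mirrors the hand-written payload, with the
    search term, page number and page size made configurable.
    """
    params = {
        "highlightPreTag": "<ais-highlight-0000000000>",
        "highlightPostTag": "</ais-highlight-0000000000>",
        "filters": "type: Professionals",
        "hitsPerPage": hits_per_page,
        "query": query,
        "maxValuesPerFacet": 20,
        "page": page,
        "facets": '["country","group"]',
        "tagFilters": "",
    }
    # quote_via=quote encodes spaces as %20 (not +), as in the original string
    encoded = urlencode(params, quote_via=quote)
    return json.dumps({"requests": [{"indexName": INDEX_NAME, "params": encoded}]})


if __name__ == "__main__":
    print(build_payload("data scientist"))
```

In the spider, payload = build_payload("data scientist") could then stand in for the literal string. Requesting further pages by increasing page is an assumption about how the search API paginates and should be checked against the actual responses.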

Run the spider with scrapy crawl bpscrape -o bpdsjobs.json to get a JSON file with all 26 jobs. You will need to do some data cleaning, as the JSON response is quite comprehensive and contains a lot of HTML tags, among other things.
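
One possible approach to that cleaning step, sketched with the standard library only: strip the markup from string fields and keep just the fields of interest. The field names "title" and "description" are assumptions made for illustration; inspect one item of the real response to see which keys it actually carries.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities like &amp;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.parts).split())


def clean_hit(hit, keep=("title", "description")):
    """Keep only the chosen keys of a hit and strip markup from string values.

    The default key names are hypothetical placeholders.
    """
    return {
        key: strip_html(value) if isinstance(value, str) else value
        for key, value in hit.items()
        if key in keep
    }


if __name__ == "__main__":
    sample = {
        "title": "<em>Data</em> Scientist",
        "description": "<p>Build <b>models</b> &amp; pipelines</p>",
        "_highlightResult": {"ignored": True},
    }
    print(clean_hit(sample))
```

In the spider's parse method, yield clean_hit(x) in place of yield x would write the reduced items to the output file. Joining text nodes with a space is a simplification and can split a word that is only partly wrapped in a tag.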

For Scrapy documentation, please see https://docs.scrapy.org/en/latest/

Answered By - Barry the Platipus

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
