Issue
I am trying to scrape job profiles from the British Petroleum (BP) website. Initially robots.txt prevented the spider from scraping, but after I set ROBOTSTXT_OBEY = False it started working. However, it is now not scraping the whole page. Below is my code:
import scrapy

class exxonmobilSpider(scrapy.Spider):
    name = "bp"
    start_urls = ['https://www.bp.com/en/global/corporate/careers/search-and-apply.html?query=data+scientist']

    def parse(self, response):
        name = response.xpath('//h3[@class="Hit_hitTitle__3MFk3"]')
        print(name)
        print(len(name))
As you can see in the image, that XPath matches the h3 tag in the browser, but when I run the code I get an empty list. I then cross-checked by printing all the li or div tags and counting them, and found that only some of the tags were being scraped. Does anyone have any idea why Scrapy is scraping only part of the page and not the full page? I am attaching the comparison image too. You can see that the page has 55 li tags in total, but compare that with the length of the "name" variable from the response.
Solution
In the hope that OP will include a minimal reproducible example in their next question, here is a way of getting those jobs. Bear in mind that the jobs are pulled from an API by JavaScript in the page, so the HTML that Scrapy downloads does not contain them. You need to either render the page with Splash or scrapy-playwright, or scrape the API directly. We will do the latter. The API URL can be found in the browser's Dev Tools, under the Network tab.
import scrapy


class BpscrapeSpider(scrapy.Spider):
    name = 'bpscrape'
    allowed_domains = ['algolianet.com', 'bp.com']

    def start_requests(self):
        # Headers copied from the browser's request to the search API.
        headers = {
            'x-algolia-application-id': 'RF87OIMXXP',
            'x-algolia-api-key': 'f4f167340049feccfcf6141fb7b90a5d',
            'Origin': 'https://www.bp.com',
            'content-type': 'application/x-www-form-urlencoded',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
        }
        api_url = 'https://rf87oimxxp-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.9.1)%3B%20Browser%3B%20JS%20Helper%20(3.4.4)%3B%20react%20(17.0.2)%3B%20react-instantsearch%20(6.11.0)'
        # JSON body; "params" is a URL-encoded query string (query, page, hitsPerPage, ...).
        payload = '{"requests":[{"indexName":"candidatematcher_bp_navapp_prod","params":"highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&filters=type%3A%20Professionals&hitsPerPage=100&query=data%20scientist&maxValuesPerFacet=20&page=0&facets=%5B%22country%22%2C%22group%22%5D&tagFilters="}]}'
        yield scrapy.Request(
            url=api_url,
            headers=headers,
            body=payload,
            callback=self.parse,
            method="POST",
        )

    def parse(self, response):
        # Each hit is a dict describing one job posting.
        data = response.json()['results'][0]['hits']
        for x in data:
            yield x
Run with
    scrapy crawl bpscrape -o bpdsjobs.json
to get a JSON file with all 26 jobs.
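The hard-coded payload string in the spider above is awkward to edit when the search term or page changes. As a rough sketch, the same body could be assembled from a dict. The helper name build_payload and its arguments are hypothetical; the index name and parameter values are copied from the payload above.

```python
import json
from urllib.parse import quote, urlencode

# Index name copied from the payload in the spider above.
INDEX_NAME = "candidatematcher_bp_navapp_prod"


def build_payload(query, page=0, hits_per_page=100):
    """Return the JSON request body for one search (hypothetical helper)."""
    params = {
        "highlightPreTag": "<ais-highlight-0000000000>",
        "highlightPostTag": "</ais-highlight-0000000000>",
        "filters": "type: Professionals",
        "hitsPerPage": hits_per_page,
        "query": query,
        "maxValuesPerFacet": 20,
        "page": page,
        # Compact JSON list, matching the original facets value.
        "facets": json.dumps(["country", "group"], separators=(",", ":")),
        "tagFilters": "",
    }
    # quote (rather than the default quote_plus) encodes spaces as %20,
    # and safe="" also percent-encodes "/", as in the original string.
    encoded = urlencode(params, quote_via=quote, safe="")
    return json.dumps({"requests": [{"indexName": INDEX_NAME, "params": encoded}]})
```

In start_requests, body=build_payload("data scientist") could then stand in for the literal string. Raising the page argument should fetch later result pages if a search ever returns more than hitsPerPage hits, though that is an assumption worth checking against the actual response.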
You will need to do some data cleaning, as that JSON response is quite comprehensive and contains a lot of HTML tags, etc.
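As one possible starting point for that cleaning, here is a minimal sketch that strips HTML tags from the top-level string fields of a hit, using only the standard library. The helper names strip_tags and clean_hit are hypothetical, and the field names in the example ("title", "id") are made up for illustration; the real hits will have their own keys.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(value):
    """Remove HTML tags from a string and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    return " ".join("".join(parser.parts).split())


def clean_hit(hit):
    """Strip tags from top-level string values; other values pass through."""
    return {k: strip_tags(v) if isinstance(v, str) else v for k, v in hit.items()}


# Field names below are illustrative only, not taken from the real response.
example = {"title": "<h3>Data <b>Scientist</b> &amp; Analyst</h3>", "id": 7}
print(clean_hit(example))
```

In the spider's parse method, yield clean_hit(x) could replace yield x. Nested dicts and lists are left untouched by this sketch, so deeper fields would need their own handling.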
For Scrapy documentation, please see https://docs.scrapy.org/en/latest/
Answered By - Barry the Platipus