Issue
Good afternoon all,
I have run into a small issue trying to scrape data from a job posting site; I am also somewhat new to Python and Scrapy as a whole.
I have a script that extracts data from various Indeed postings. The crawler seems to complete without errors, but it will not extract data from URLs that respond with a 301 or 302 status code.
I have pasted the script and log below.
Any help would be appreciated.
import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["indeed.com"]
    start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]

    def parse(self, response):
        handle_httpstatus_list = [True]
        jobs = response.xpath('//div[@class="title"]')
        for job in jobs:
            title = job.xpath('a//@title').extract_first()
            posting_link = job.xpath('a//@href').extract_first()
            posting_url = "https://indeed.com" + posting_link
            job_location = job.xpath('div//@data-rc-loc').extract_first()
            yield Request(posting_url, callback=self.parse_page,
                          meta={'title': title, 'posting_url': posting_url,
                                'job_location': job_location})
        relative_next_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        absolute_next_url = "https://indeed.com" + relative_next_url
        yield Request(absolute_next_url, callback=self.parse)

    def parse_page(self, response):
        posting_url = response.meta.get('posting_url')
        title = response.meta.get('title')
        job_location = response.meta.get('job_location')
        job_name = response.xpath('//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title"]/text()').extract_first()
        job_description_1 = response.xpath('//div[@class="jobsearch-jobDescriptionText"]/ul').extract()
        posted_on_date = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]/text()').extract_first()
        job_location = response.xpath('//*[@class="jobsearch-InlineCompanyRating icl-u-xs-mt--xs jobsearch-DesktopStickyContainer-companyrating"]/div[3]/text()').extract_first()
        job_description_2 = response.xpath('//div[@class="jobsearch-JobComponent-description icl-u-xs-mt--md "]/text()').extract_first()
        yield {'title': title,
               'posting_url': posting_url,
               'job_name': job_name,
               'job_location': job_location,
               'job_description_1': job_description_1,
               'posted_on_date': posted_on_date,
               'job_description_2': job_description_2,
               'job_location': job_location
               }
2019-09-29 12:37:53 [scrapy.core.engine] INFO: Closing spider (finished)
2019-09-29 12:37:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1860897,
'downloader/request_count': 1616,
'downloader/request_method_count/GET': 1616,
'downloader/response_bytes': 13605809,
'downloader/response_count': 1616,
'downloader/response_status_count/200': 360,
'downloader/response_status_count/301': 758,
'downloader/response_status_count/302': 498,
'dupefilter/filtered': 9,
'elapsed_time_seconds': 28.657843,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 9, 29, 19, 37, 53, 776779),
'item_scraped_count': 337,
'log_count/DEBUG': 1954,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'memusage/max': 54546432,
'memusage/startup': 54546432,
'request_depth_max': 20,
'response_received_count': 360,
'robotstxt/request_count': 3,
'robotstxt/response_count': 3,
'robotstxt/response_status_count/200': 3,
'scheduler/dequeued': 1612,
'scheduler/dequeued/memory': 1612,
'scheduler/enqueued': 1612,
'scheduler/enqueued/memory': 1612,
'spider_exceptions/TypeError': 1,
'start_time': datetime.datetime(2019, 9, 29, 19, 37, 25, 118936)}
2019-09-29 12:37:53 [scrapy.core.engine] INFO: Spider closed (finished)
Solution
I just ran a quick test of your scraper and it seems to me that it's actually working as it's supposed to.
EDIT: To make my explanation clearer: you cannot scrape data from 301 or 302 redirects, because they are just that: redirects. If you request a URL that gets redirected, Scrapy handles the redirect automatically and scrapes the data from the page you are redirected to. It is the final destination of the redirect chain that gives you the 200 response.
If you follow the logic presented below, you will see that Scrapy requests the URL http://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3, but gets redirected to https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3. It is this final page that you are able to scrape. (You can verify this yourself by opening the initial URL and comparing it to the final URL you end up on.)
To repeat: you will not be able to scrape anything from the 301 and 302 redirects themselves (there is nothing there to scrape), only from the final page that returns a 200 response.
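As an aside, the handle_httpstatus_list = [True] line inside your parse method has no effect: it is just a local variable that is never used. If you genuinely wanted Scrapy to deliver the raw 301/302 responses to a callback instead of following them, you would disable the redirect middleware per request via the standard request meta keys. A minimal sketch (the callback name is hypothetical; you almost certainly do not want this for your use case, since the redirect bodies contain no job data):

```python
# handle_httpstatus_list as a bare local variable inside parse() does nothing.
# To actually receive 301/302 responses in a callback, pass these meta keys:
redirect_capture_meta = {
    "dont_redirect": True,                 # RedirectMiddleware skips this request
    "handle_httpstatus_list": [301, 302],  # let these statuses reach the callback
}

# In a spider you would then write (parse_redirect is a hypothetical callback):
#   yield scrapy.Request(url, callback=self.parse_redirect,
#                        meta=redirect_capture_meta)
# and read the redirect target from response.headers[b"Location"].
```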
I have attached a suggested version of your scraper that saves both the requested URL and the actually scraped URL. Everything looks fine to me; your scraper works as it is supposed to. (Note, however, that indeed.com will only serve you up to 19 pages of search results, which limits you to 190 scraped items.)
I hope this makes better sense now.
Here is one example from the output. Scrapy first requests the original (http) URL, which is redirected with a 301 to the https version:
2019-09-30 10:37:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3> from <GET http://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3>
That URL is in turn redirected with a 302 to the final viewjob page:
2019-09-30 10:37:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3> from <GET https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3>
which is then crawled successfully with a 200 response:
2019-09-30 10:37:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3> (referer: None)
And finally, we can scrape the data:
2019-09-30 10:37:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3>
{'title': 'General Manager', 'posting_url': 'https://indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3', 'job_name': 'General Manager', 'job_location': 'Plano, TX 75024', 'job_description_1': [], 'posted_on_date': ' - 30+ days ago', 'job_description_2': None}
So the data is scraped from the final page that received a 200 response. Note that in the scraped item, posting_url is the value passed in via the meta attribute, not the URL that was actually scraped. This may be what you want, but if you want to save the URL that was actually scraped, use posting_url = response.url instead. EDIT: See the suggested update below.
Suggested code update:
import scrapy


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["indeed.com"]
    start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]

    def parse(self, response):
        jobs = response.xpath('//div[@class="title"]')
        for job in jobs:
            title = job.xpath('a//@title').extract_first()
            posting_link = job.xpath('a//@href').extract_first()
            referer_url = "https://indeed.com" + posting_link
            yield scrapy.Request(url=referer_url,
                                 callback=self.parse_page,
                                 meta={'title': title,
                                       'referer_url': referer_url})
        relative_next_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if relative_next_url:
            absolute_next_url = "https://indeed.com" + relative_next_url
            yield scrapy.Request(absolute_next_url, callback=self.parse)
        else:
            self.logger.info('No more pages found.')

    def parse_page(self, response):
        referer_url = response.meta.get('referer_url')
        title = response.meta.get('title')
        # response.url is the final URL after any redirects, i.e. the page
        # that was actually scraped
        posting_url = response.url
        job_name = response.xpath('//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title"]/text()').extract_first()
        job_description_1 = response.xpath('//div[@class="jobsearch-jobDescriptionText"]/ul').extract()
        posted_on_date = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]/text()').extract_first()
        job_location = response.xpath('//*[@class="jobsearch-InlineCompanyRating icl-u-xs-mt--xs jobsearch-DesktopStickyContainer-companyrating"]/div[3]/text()').extract_first()
        job_description_2 = response.xpath('//div[@class="jobsearch-JobComponent-description icl-u-xs-mt--md "]/text()').extract_first()
        yield {'title': title,
               'posting_url': posting_url,
               'referer_url': referer_url,
               'job_name': job_name,
               'job_location': job_location,
               'job_description_1': job_description_1,
               'posted_on_date': posted_on_date,
               'job_description_2': job_description_2}
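One more small robustness suggestion, not part of the original answer: building URLs with string concatenation like "https://indeed.com" + posting_link breaks if the site ever returns an absolute href. Scrapy's response.urljoin(posting_link) handles both cases; it is a thin wrapper around the standard library's urljoin, as this sketch shows:

```python
from urllib.parse import urljoin

# urljoin resolves a relative href against the page URL, and leaves an
# already-absolute href untouched; response.urljoin(...) in Scrapy behaves
# the same way relative to response.url.
base = "https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"

relative = urljoin(base, "/rc/clk?jk=69995bf12d9f2f9a")
absolute = urljoin(base, "https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a")

print(relative)   # https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a
print(absolute)   # https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a
```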
Answered By - Tor Stava