Issue
Good afternoon all,
I have run into a small issue trying to scrape data from a job posting site; I am also somewhat new to Python and Scrapy as a whole.
I have a script that extracts data from various Indeed postings. The crawler seems to complete without errors, but it will not extract data from URLs that respond with a 301 or 302 status code.
I have pasted the script and log below.
Any help would be appreciated.
import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["indeed.com"]
    start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]

    def parse(self, response):
        handle_httpstatus_list = [True]
        jobs = response.xpath('//div[@class="title"]')
        for job in jobs:
            title = job.xpath('a//@title').extract_first()
            posting_link = job.xpath('a//@href').extract_first()
            posting_url = "https://indeed.com" + posting_link
            job_location = job.xpath('div//@data-rc-loc').extract_first()
            yield Request(posting_url, callback=self.parse_page,
                          meta={'title': title, 'posting_url': posting_url,
                                'job_location': job_location})
        relative_next_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        absolute_next_url = "https://indeed.com" + relative_next_url
        yield Request(absolute_next_url, callback=self.parse)

    def parse_page(self, response):
        posting_url = response.meta.get('posting_url')
        title = response.meta.get('title')
        job_location = response.meta.get('job_location')
        job_name = response.xpath('//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title"]/text()').extract_first()
        job_description_1 = response.xpath('//div[@class="jobsearch-jobDescriptionText"]/ul').extract()
        posted_on_date = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]/text()').extract_first()
        job_location = response.xpath('//*[@class="jobsearch-InlineCompanyRating icl-u-xs-mt--xs jobsearch-DesktopStickyContainer-companyrating"]/div[3]/text()').extract_first()
        job_description_2 = response.xpath('//div[@class="jobsearch-JobComponent-description icl-u-xs-mt--md "]/text()').extract_first()
        yield {'title': title,
               'posting_url': posting_url,
               'job_name': job_name,
               'job_location': job_location,
               'job_description_1': job_description_1,
               'posted_on_date': posted_on_date,
               'job_description_2': job_description_2,
               'job_location': job_location
               }
2019-09-29 12:37:53 [scrapy.core.engine] INFO: Closing spider (finished)
2019-09-29 12:37:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1860897,
'downloader/request_count': 1616,
'downloader/request_method_count/GET': 1616,
'downloader/response_bytes': 13605809,
'downloader/response_count': 1616,
'downloader/response_status_count/200': 360,
'downloader/response_status_count/301': 758,
'downloader/response_status_count/302': 498,
'dupefilter/filtered': 9,
'elapsed_time_seconds': 28.657843,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 9, 29, 19, 37, 53, 776779),
'item_scraped_count': 337,
'log_count/DEBUG': 1954,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'memusage/max': 54546432,
'memusage/startup': 54546432,
'request_depth_max': 20,
'response_received_count': 360,
'robotstxt/request_count': 3,
'robotstxt/response_count': 3,
'robotstxt/response_status_count/200': 3,
'scheduler/dequeued': 1612,
'scheduler/dequeued/memory': 1612,
'scheduler/enqueued': 1612,
'scheduler/enqueued/memory': 1612,
'spider_exceptions/TypeError': 1,
'start_time': datetime.datetime(2019, 9, 29, 19, 37, 25, 118936)}
2019-09-29 12:37:53 [scrapy.core.engine] INFO: Spider closed (finished)
Solution
I just ran a quick test of your scraper and it seems to me that it's actually working as it's supposed to.
EDIT: To make my explanation clearer: you cannot scrape data from 301 or 302 redirects, because they are just that: redirects. If you request a URL that gets redirected, Scrapy handles the redirect automatically and scrapes the data from the page you are redirected to. It is the final destination of the redirect chain that gives you the 200 response.
If you follow the logic presented below, you will see that Scrapy requests the URL http://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3, but gets redirected to https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3. It is this final page that you are able to scrape. (You can verify this yourself by opening the initial URL and comparing it to the final URL you end up on.)
To repeat: you will not be able to scrape anything from the 301 and 302 redirects themselves (there is nothing there to scrape), only from the final page that returns a 200 response.
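As an aside, the handle_httpstatus_list = [True] line inside your parse method has no effect: it is just a local variable that is never used. If you genuinely wanted Scrapy to deliver the raw 301/302 responses to a callback instead of following them, you would disable the redirect middleware per request via the standard request meta keys. A minimal sketch (the callback name is hypothetical; you almost certainly do not want this for your use case, since the redirect bodies contain no job data):

```python
# handle_httpstatus_list as a bare local variable inside parse() does nothing.
# To actually receive 301/302 responses in a callback, pass these meta keys:
redirect_capture_meta = {
    "dont_redirect": True,                 # RedirectMiddleware skips this request
    "handle_httpstatus_list": [301, 302],  # let these statuses reach the callback
}

# In a spider you would then write (parse_redirect is a hypothetical callback):
#   yield scrapy.Request(url, callback=self.parse_redirect,
#                        meta=redirect_capture_meta)
# and read the redirect target from response.headers[b"Location"].
```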
I have attached a suggested version of your scraper that saves both the requested URL and the actually scraped URL. Everything looks fine to me; your scraper works as it is supposed to. (Note, however, that indeed.com will only serve you up to 19 pages of search results, which limits you to 190 scraped items.)
I hope this makes better sense now.
Here is one example from the output. Scrapy first requests the original (http) URL, which is redirected with a 301 to the https version:
2019-09-30 10:37:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3> from <GET http://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3>
That URL is in turn redirected with a 302 to the final viewjob page:
2019-09-30 10:37:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3> from <GET https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3>
which is then crawled successfully with a 200 response:
2019-09-30 10:37:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3> (referer: None)
And finally, we can scrape the data:
2019-09-30 10:37:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a&from=serp&vjs=3>
{'title': 'General Manager', 'posting_url': 'https://indeed.com/rc/clk?jk=69995bf12d9f2f9a&fccid=b87e01ade6c824ee&vjs=3', 'job_name': 'General Manager', 'job_location': 'Plano, TX 75024', 'job_description_1': [], 'posted_on_date': ' - 30+ days ago', 'job_description_2': None}
So the data is scraped from the final page that received a 200 response. Note that in the scraped item, posting_url is the value passed in via the meta attribute, not the URL that was actually scraped. This may be what you want, but if you want to save the URL that was actually scraped, use posting_url = response.url instead. EDIT: See the suggested update below.
Suggested code update:
import scrapy


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["indeed.com"]
    start_urls = ["https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"]

    def parse(self, response):
        jobs = response.xpath('//div[@class="title"]')
        for job in jobs:
            title = job.xpath('a//@title').extract_first()
            posting_link = job.xpath('a//@href').extract_first()
            referer_url = "https://indeed.com" + posting_link
            yield scrapy.Request(url=referer_url,
                                 callback=self.parse_page,
                                 meta={'title': title,
                                       'referer_url': referer_url})
        relative_next_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if relative_next_url:
            absolute_next_url = "https://indeed.com" + relative_next_url
            yield scrapy.Request(absolute_next_url, callback=self.parse)
        else:
            self.logger.info('No more pages found.')

    def parse_page(self, response):
        referer_url = response.meta.get('referer_url')
        title = response.meta.get('title')
        # response.url is the final URL after any redirects, i.e. the page
        # that was actually scraped
        posting_url = response.url
        job_name = response.xpath('//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title"]/text()').extract_first()
        job_description_1 = response.xpath('//div[@class="jobsearch-jobDescriptionText"]/ul').extract()
        posted_on_date = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]/text()').extract_first()
        job_location = response.xpath('//*[@class="jobsearch-InlineCompanyRating icl-u-xs-mt--xs jobsearch-DesktopStickyContainer-companyrating"]/div[3]/text()').extract_first()
        job_description_2 = response.xpath('//div[@class="jobsearch-JobComponent-description icl-u-xs-mt--md "]/text()').extract_first()
        yield {'title': title,
               'posting_url': posting_url,
               'referer_url': referer_url,
               'job_name': job_name,
               'job_location': job_location,
               'job_description_1': job_description_1,
               'posted_on_date': posted_on_date,
               'job_description_2': job_description_2}
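One more small robustness suggestion, not part of the original answer: building URLs with string concatenation like "https://indeed.com" + posting_link breaks if the site ever returns an absolute href. Scrapy's response.urljoin(posting_link) handles both cases; it is a thin wrapper around the standard library's urljoin, as this sketch shows:

```python
from urllib.parse import urljoin

# urljoin resolves a relative href against the page URL, and leaves an
# already-absolute href untouched; response.urljoin(...) in Scrapy behaves
# the same way relative to response.url.
base = "https://www.indeed.com/jobs?q=%22owner+operator%22&l=dallas"

relative = urljoin(base, "/rc/clk?jk=69995bf12d9f2f9a")
absolute = urljoin(base, "https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a")

print(relative)   # https://www.indeed.com/rc/clk?jk=69995bf12d9f2f9a
print(absolute)   # https://www.indeed.com/viewjob?jk=69995bf12d9f2f9a
```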
Answered By - Tor Stava