Issue
I'm sure this question has been asked, but I can't find the answer for the life of me. What I'm after is the Scrapy equivalent of response.history from the requests package. I have a dataframe with an ID and a URL. For each URL, I want to return the final response, the final URL, and the response history. Using requests, the code looks like this:
    import requests
    import pandas as pd
    import numpy as np

    def get_responses(url):
        try:
            r = requests.head(url, allow_redirects=True)
            r.raise_for_status()
            return [r.status_code, r.url, '; '.join([str(resp.status_code) for resp in r.history])]
        except requests.exceptions.HTTPError as errh:
            return [errh, None, None]
        except requests.exceptions.ConnectionError as errc:
            return ['error connecting', None, None]
        except requests.exceptions.Timeout as errt:
            return ['timeout error', None, None]
        except requests.exceptions.RequestException as err:
            return ['oops: something else', None, None]

    df[['response', 'response_url', 'response_history']] = [get_responses(x) for x in df['formatted_url'].values]
Output looks like this, which is what I want:
id | formatted_url | response | response_url | response_history |
---|---|---|---|---|
1 | http://WWW.BARHARBORINFO.COM | 200 | https://www.visitbarharbor.com/ | 301; 301 |
The problem is that I have a few thousand websites I'm trying to get this info for and this takes too long to run. Enter scrapy. I've figured out how to get the same output from scrapy with the exception of the response history. Is there a way to do this in scrapy?
Solution
You can definitely do this, and it looks like Scrapy's redirect middleware already supports it. Even if it didn't, overriding the redirect middleware to add this functionality wouldn't be too hard.
As you can see in the middleware code:
    redirected.meta['redirect_times'] = redirects
    redirected.meta['redirect_ttl'] = ttl - 1
    redirected.meta['redirect_urls'] = request.meta.get('redirect_urls', []) + [request.url]
    redirected.meta['redirect_reasons'] = request.meta.get('redirect_reasons', []) + [reason]
Here, redirected is the new request that the middleware creates for the redirect target. Its meta is carried through to the response that is ultimately passed to the request callback you control.
From what I see, response.status is passed as the redirect reason, so in your callback method you can read:
    response.meta['redirect_reasons']
and you'll get the "redirect history" for that particular request.
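To line this up with the requests version in the question, here is a minimal sketch of how a callback could turn those meta keys into the same three-column row. The helper name summarize_redirects is hypothetical (it is not part of Scrapy), and it assumes the redirect middleware is enabled so that redirect_reasons is filled in when a redirect happened.

```python
def summarize_redirects(status, final_url, meta):
    """Return [status, final_url, history] in the shape used in the question.

    `meta` is expected to look like Scrapy's response.meta; the
    'redirect_reasons' key is absent when no redirect took place.
    """
    reasons = meta.get("redirect_reasons", [])
    history = "; ".join(str(reason) for reason in reasons)
    return [status, final_url, history]


# Values taken from the table in the question; the meta dict is
# hand-written here to stand in for what the middleware would set.
row = summarize_redirects(
    200,
    "https://www.visitbarharbor.com/",
    {"redirect_reasons": [301, 301]},
)
print(row)  # [200, 'https://www.visitbarharbor.com/', '301; 301']
```

Inside a spider callback, this would presumably be called as summarize_redirects(response.status, response.url, response.meta). The URLs visited along the way are available separately in response.meta['redirect_urls'].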
Answered By - eLRuLL