Issue
I'm trying to detect whether a request that I send to many websites has been redirected or not. First, let me give you some example data.
import pandas as pd

redirected_urls = [
    "http://www.tagesschau.de/inland/vw-schalte-hapke-101.html",
    "http://de.reuters.com/article/deutschland-volkswagen-idDEKCN10V0H3"
]
healthy_urls = [
    "http://www.focus.de/finanzen/news/wirtschaftsticker/machtkampf-zwischen-vw-und-zulieferern-stoppt-autoproduktion_id_5842241.html",
    "https://www.bild.de/news/aktuelles/news/vw-kuendigt-harte-gangart-gegen-lieferstopp-47400500.bild.html"
]
redirected_df = pd.DataFrame({'URL': redirected_urls})
healthy_df = pd.DataFrame({'URL': healthy_urls})
redirected_df contains the links that actually get redirected, while the links in healthy_df are not redirected. As mentioned in this post, I tried setting allow_redirects=False, then realized that all the links I'm using get redirected somehow, although I still get to see the actual news article. The response code for all of them is 200, meaning the final request succeeded. Then I checked response.history: for almost all of the links I get [<Response [301]>]. Using BeautifulSoup(response._content).find('link', {'rel': 'canonical'}), they all have values.
I would like to save this information in my dataframe like this: _df.at[k,'Is_Redirected'] = 1 if response.history else 0. For all links, as mentioned above, I get 1 (True).
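To see why every link is flagged, it can help to print each hop rather than only testing whether response.history is empty. The following is a minimal sketch: describe_redirects is a hypothetical helper name, and the response objects are hand-built stand-ins with made-up example.com URLs so that the snippet runs offline. A real requests.Response exposes the same history, status_code and url attributes.

```python
from types import SimpleNamespace


def describe_redirects(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Hand-built stand-ins for requests.Response objects (assumption: illustrative
# URLs only), mimicking a single http -> https redirect.
first_hop = SimpleNamespace(status_code=301, url="http://example.com/article", history=[])
final = SimpleNamespace(
    status_code=200,
    url="https://example.com/article",
    history=[first_hop],
)

for status, url in describe_redirects(final):
    print(status, url)
```

In a chain like this one, the only difference between the first and the last URL is the scheme, which would explain a non-empty history even though the same article is displayed.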
The code that I'm using:
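import requests
import func_timeout

def send_two_requests(_url):
    try:
        headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"}
        # allow_redirects=True follows redirects and records each hop in response.history
        response = requests.get(_url, headers=headers, allow_redirects=True, timeout=10)
        return response
    except Exception:
        # Second attempt through send_request (a helper that is not shown here), capped at 5 seconds
        return func_timeout.func_timeout(timeout=5, func=send_request, args=[_url])

# The column is named 'URL' in the example dataframes above
for k, link in enumerate(_df['URL']):
    response = send_two_requests(link)
    if response is not None:
        # A non-empty history means at least one redirect was followed
        _df.at[k, 'Is_Redirected'] = 1 if response.history else 0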
Is there any way I can distinguish the links that actually work from the ones that get redirected?
Solution
I just wanted to see whether the links get redirected or not. I noticed that many of the URLs I have change only in a minor way: for example, most of them use http while the actual website is served over https. Such a link is therefore flagged as redirected, even though it is not really redirected and I'm seeing the same page. So I checked the difference between the response URL and the URL that I have with the difflib library. If the two are at least 99% similar, I don't consider the URL redirected.
import difflib

# ratio() returns a similarity score between 0.0 and 1.0
if difflib.SequenceMatcher(None, response.url, _df.at[k, 'URL']).ratio() >= 0.99:
    _df.at[k, 'is_redirected'] = 0
else:
    _df.at[k, 'is_redirected'] = 1
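One caveat with a fixed 99% threshold: a single inserted character (the "s" in https) gives a ratio of roughly 2n/(2n+1) for an original URL of n characters, which falls below 0.99 once the URL is shorter than about 50 characters. As a possible alternative, and not the method used in this answer, the two URLs could be normalized before comparing them. The sketch below uses only the standard library; normalize, is_redirected and similarity are hypothetical helper names, and ignoring the scheme, a leading "www." and a trailing slash is an assumption about which differences should count as minor.

```python
import difflib
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to host + path (+ query), ignoring the scheme,
    a leading 'www.' and a trailing slash (assumed to be 'minor' changes)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def is_redirected(original_url, final_url):
    """True only when the URLs still differ after normalization."""
    return normalize(original_url) != normalize(final_url)


def similarity(original_url, final_url):
    """The difflib ratio used in the answer, shown here for comparison."""
    return difflib.SequenceMatcher(None, original_url, final_url).ratio()


# Made-up URLs for illustration
print(is_redirected("http://example.com/a", "https://www.example.com/a/"))
print(is_redirected("http://example.com/a", "https://example.com/b"))
print(similarity("http://example.com/a", "https://example.com/a"))
```

With real data, is_redirected(_df.at[k, 'URL'], response.url) could replace the ratio check. Whether dropping "www." or the trailing slash is appropriate depends on the sites involved.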
Answered By - Mostafa Bouzari