Issue
I am trying to parse a website with Python 3.6 using the HTML parser, but it throws an error as follows:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found
The code I wrote is below:
from urllib.request import urlopen as uo
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter--')
html = uo(url, context=ctx).read()
soup = BeautifulSoup(html,"html.parser")
print(soup)
#retrieve all the anchor tags
#tags = soup('a')
Can someone tell me why it is throwing this error, what it means, and how to solve it?
Solution
As stated in the comments:
That site sets a cookie and then redirects to /Home.aspx. To avoid the redirect loop on this site, you must have a 24-character ASP.NET_SessionId cookie set.
import urllib.request

# Pre-set a 24-character ASP.NET_SessionId cookie so the server
# stops redirecting while trying to establish a session.
opener = urllib.request.build_opener()
opener.addheaders.append(('Cookie', 'ASP.NET_SessionId=garbagegarbagegarbagelol'))
f = opener.open("http://apnakhata.raj.nic.in/")
html = f.read()
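If you would rather not hardcode a fake session id, a cookie-aware opener from the standard library should also break the loop, assuming the server only needs its own cookie echoed back on the redirect (a minimal sketch, not tested against this site):

import http.cookiejar
import urllib.request

# Let urllib remember whatever ASP.NET_SessionId the server sets
# and send it back automatically on the redirect to /Home.aspx.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
f = opener.open("http://apnakhata.raj.nic.in/")
html = f.read()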
However, I'd just use requests.
import requests
r = requests.get('http://apnakhata.raj.nic.in/')
html = r.text
requests saves cookies to a RequestsCookieJar by default, so the initial request is followed by just one redirect instead of a loop. You can see it here:
>>> r.history
[<Response [302]>]
>>> r.history[0].cookies
<RequestsCookieJar[Cookie(version=0, name='ASP.NET_SessionId', value='ph0chopmjlpi1dg0f3xtbacu', port=None, port_specified=False, domain='apnakhata.raj.nic.in', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)]>
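If you want to keep the BeautifulSoup approach from the question (the commented-out soup('a')), you can feed the requests response straight into it. A minimal sketch, assuming you only need the anchor tags:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://apnakhata.raj.nic.in/')
soup = BeautifulSoup(r.text, 'html.parser')
# Retrieve all the anchor tags, as in the original script
for tag in soup('a'):
    print(tag.get('href'))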
To scrape the page, you can use requests_html, created by the same author.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://apnakhata.raj.nic.in/')
Getting links is extremely easy:
>>> r.html.absolute_links
{'http://apnakhata.raj.nic.in/',
'http://apnakhata.raj.nic.in/Cyberlist.aspx',
...
'http://apnakhata.raj.nic.in/rev_phone.aspx'}
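If you also need the link text or other attributes, r.html.find('a') returns element objects exposing .text and .attrs (a sketch; the output depends on the live page):
>>> for a in r.html.find('a'):
...     print(a.text, a.attrs.get('href'))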
Answered By - radzak