Issue
I am trying to parse a website with Python 3.6 using the HTML parser, but it throws an error as follows:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found
The code I wrote is below:
from urllib.request import urlopen as uo
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter--')
html = uo(url, context=ctx).read()
soup = BeautifulSoup(html,"html.parser")
print(soup)
#retrieve all the anchor tags
#tags = soup('a')
Can someone tell me why it is throwing this error, what it means, and how to solve it?
Solution
As stated in the comments:
That site sets a cookie and then redirects to /Home.aspx. To avoid the redirect loop on this site, you must have a 24-character ASP.NET_SessionId cookie set.
import urllib.request

# Pre-set a 24-character ASP.NET_SessionId cookie so the server
# stops redirecting while trying to establish a session.
opener = urllib.request.build_opener()
opener.addheaders.append(('Cookie', 'ASP.NET_SessionId=garbagegarbagegarbagelol'))
f = opener.open("http://apnakhata.raj.nic.in/")
html = f.read()
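If you would rather not hardcode a fake session id, a cookie-aware opener from the standard library should also break the loop, assuming the server only needs its own cookie echoed back on the redirect (a minimal sketch, not tested against this site):

import http.cookiejar
import urllib.request

# Let urllib remember whatever ASP.NET_SessionId the server sets
# and send it back automatically on the redirect to /Home.aspx.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
f = opener.open("http://apnakhata.raj.nic.in/")
html = f.read()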
However, I'd just use requests.
import requests
r = requests.get('http://apnakhata.raj.nic.in/')
html = r.text
requests saves cookies to a RequestsCookieJar by default, so the initial request is followed by just one redirect instead of a loop. You can see it here:
>>> r.history
[<Response [302]>]
>>> r.history[0].cookies
<RequestsCookieJar[Cookie(version=0, name='ASP.NET_SessionId', value='ph0chopmjlpi1dg0f3xtbacu', port=None, port_specified=False, domain='apnakhata.raj.nic.in', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)]>
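If you want to keep the BeautifulSoup approach from the question (the commented-out soup('a')), you can feed the requests response straight into it. A minimal sketch, assuming you only need the anchor tags:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://apnakhata.raj.nic.in/')
soup = BeautifulSoup(r.text, 'html.parser')
# Retrieve all the anchor tags, as in the original script
for tag in soup('a'):
    print(tag.get('href'))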
To scrape the page, you can use requests_html, created by the same author.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://apnakhata.raj.nic.in/')
Getting links is extremely easy:
>>> r.html.absolute_links
{'http://apnakhata.raj.nic.in/',
'http://apnakhata.raj.nic.in/Cyberlist.aspx',
...
'http://apnakhata.raj.nic.in/rev_phone.aspx'}
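If you also need the link text or other attributes, r.html.find('a') returns element objects exposing .text and .attrs (a sketch; the output depends on the live page):
>>> for a in r.html.find('a'):
...     print(a.text, a.attrs.get('href'))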
Answered By - radzak