Issue
I have written a Python program for web scraping in a Jupyter notebook:
from bs4 import BeautifulSoup
import lxml.html as lh
import requests
url = 'https://mumbai7.com/postal-codes-in-mumbai/'
page = requests.get(url)
# Store the contents of the website under doc
doc = lh.fromstring(page.content)
# Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
# Create an empty list
col = []
i = 0
# For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
This gives me an error:

IndexError
---> 16 for t in tr_elements[0]:
IndexError: list index out of range

How do I solve this error?
I am also attaching the link to the Jupyter notebook: https://github.com/chirayupd/Mumbai_Neighbourhood_Analytics/blob/main/Neighbourhood.ipynb
Solution
A few things.
Firstly, you could just grab the whole table with pandas and use .columns
to retrieve the headers:
import pandas as pd
# read_html returns a list of DataFrames, one per table found on the page
df = pd.read_html('https://mumbai7.com/postal-codes-in-mumbai/')[0]
print(list(df.columns))
print(df)
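If you still want the (header, empty list) pairs the question builds into col, they can be taken straight from the DataFrame's columns (a quick sketch based on the code above):
col = [(name, []) for name in df.columns]
print(col)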
Secondly, with requests
you need an appropriate user-agent header, e.g.
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
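To see why the missing header causes the IndexError in the question, you can compare the two requests. This is a quick sketch, and the exact status code returned without a user-agent depends on the site, but a blocked or error page contains no <tr> rows, so tr_elements[0] fails:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://mumbai7.com/postal-codes-in-mumbai/'

# Without a user-agent the site may serve a blocked/error page with no table rows
r = requests.get(url)
print(r.status_code, len(bs(r.content, 'lxml').select('tr')))

# With a browser-like user-agent the table rows should come back
r = requests.get(url, headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
print(r.status_code, len(bs(r.content, 'lxml').select('tr')))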
Thirdly, tr_elements[0]
will look at the first row; you can then add an additional call to retrieve the th
elements from that row, so a rewrite might look like the following:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://mumbai7.com/postal-codes-in-mumbai/', headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
soup = bs(r.content, 'lxml')
tr_elements = soup.select('tr')
col = []
i = 0
for th in tr_elements[0].select('th'):  # header row
    i += 1
    name = th.get_text()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
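If you then want to fill those empty lists with the table data, the remaining rows can be walked the same way (a sketch continuing from the code above, assuming each data row has one td cell per header):
for tr in tr_elements[1:]:  # data rows
    for (_, values), td in zip(col, tr.select('td')):
        values.append(td.get_text(strip=True))
print(col[0])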
Now, you can abbreviate that somewhat, as follows, by using a type selector to pull only the th
elements within the single table on the page, combined with enumerate (starting from 1) to remove the need for your counter variable:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://mumbai7.com/postal-codes-in-mumbai/', headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
soup = bs(r.content, 'lxml')
for i, th in enumerate(soup.select('th'), 1):
    print(i, th.text)
print()
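And if the end goal is a tabular structure anyway, the scraped headers and rows can be dropped straight into a pandas DataFrame (a sketch continuing from the soup above, assuming the first row holds the th headers and the remaining rows hold td cells of equal length):
import pandas as pd

rows = soup.select('tr')
headers = [th.get_text(strip=True) for th in rows[0].select('th')]
data = [[td.get_text(strip=True) for td in tr.select('td')] for tr in rows[1:]]
df = pd.DataFrame(data, columns=headers)
print(df.head())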
Answered By - QHarr