Issue
I have written a Python program for web scraping in a Jupyter notebook:
from bs4 import BeautifulSoup
import lxml.html as lh
import requests
url = 'https://mumbai7.com/postal-codes-in-mumbai/'
page = requests.get(url)
# Store the contents of the website under doc
doc = lh.fromstring(page.content)
# Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
# Create an empty list
col = []
i = 0
# For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
This gives me an error:

IndexError
---> 16 for t in tr_elements[0]:
IndexError: list index out of range

How do I solve this error?
I am also attaching the link to the Jupyter notebook: https://github.com/chirayupd/Mumbai_Neighbourhood_Analytics/blob/main/Neighbourhood.ipynb
Solution
A few things.
Firstly, you could just grab the whole table with pandas and use .columns
to retrieve the headers:
import pandas as pd
# read_html returns a list of DataFrames, one per table found on the page
df = pd.read_html('https://mumbai7.com/postal-codes-in-mumbai/')[0]
print(list(df.columns))
print(df)
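If you still want the (header, empty list) pairs the question builds into col, they can be taken straight from the DataFrame's columns (a quick sketch based on the code above):
col = [(name, []) for name in df.columns]
print(col)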
Secondly, with requests
you need an appropriate user-agent header, e.g.
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
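To see why the missing header causes the IndexError in the question, you can compare the two requests. This is a quick sketch, and the exact status code returned without a user-agent depends on the site, but a blocked or error page contains no <tr> rows, so tr_elements[0] fails:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://mumbai7.com/postal-codes-in-mumbai/'

# Without a user-agent the site may serve a blocked/error page with no table rows
r = requests.get(url)
print(r.status_code, len(bs(r.content, 'lxml').select('tr')))

# With a browser-like user-agent the table rows should come back
r = requests.get(url, headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
print(r.status_code, len(bs(r.content, 'lxml').select('tr')))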
Thirdly, tr_elements[0]
will look at the first row; you can then add an additional call to retrieve the th
elements from that row, so a rewrite might look like the following:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://mumbai7.com/postal-codes-in-mumbai/', headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
soup = bs(r.content, 'lxml')
tr_elements = soup.select('tr')
col = []
i = 0
for th in tr_elements[0].select('th'):  # header row
    i += 1
    name = th.get_text()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
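If you then want to fill those empty lists with the table data, the remaining rows can be walked the same way (a sketch continuing from the code above, assuming each data row has one td cell per header):
for tr in tr_elements[1:]:  # data rows
    for (_, values), td in zip(col, tr.select('td')):
        values.append(td.get_text(strip=True))
print(col[0])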
Now, you can abbreviate that somewhat, as follows, by using a type selector to pull only the th
elements within the single table on the page, combined with enumerate (starting from 1) to remove the need for your counter variable:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://mumbai7.com/postal-codes-in-mumbai/', headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
soup = bs(r.content, 'lxml')
for i, th in enumerate(soup.select('th'), 1):
    print(i, th.text)
print()
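And if the end goal is a tabular structure anyway, the scraped headers and rows can be dropped straight into a pandas DataFrame (a sketch continuing from the soup above, assuming the first row holds the th headers and the remaining rows hold td cells of equal length):
import pandas as pd

rows = soup.select('tr')
headers = [th.get_text(strip=True) for th in rows[0].select('th')]
data = [[td.get_text(strip=True) for td in tr.select('td')] for tr in rows[1:]]
df = pd.DataFrame(data, columns=headers)
print(df.head())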
Answered By - QHarr