Sunday, January 14, 2024

[FIXED] How to fix cannot set rows in Mismatched columns error, in my web scraping code?

Labels: beautifulsoup, dataframe, pandas, python, web-scraping

Issue

from bs4 import BeautifulSoup
import requests #importing beautifulsoup and requests

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'

page = requests.get(url)

soup = BeautifulSoup(page.text, 'html') #storing beautiful soup in soup variable

print(soup) # to see what it displays

soup.find('table') # finds the first tag labelled table

soup.find('table', class_ = 'wikitable sortable') # trying to specify tables

table = soup.find_all('table')[0] # this is the one I want

print(table) # to see what it displays

table.find_all('th') # find all the th (column headings) tags in the table.

world_titles = table.find_all('th') # so i can just type world_titles instead of table.find_all('th') all the time

world_table_titles = [title.text.strip() for title in world_titles] # removing \n and making data clean

print(world_table_titles) # seeing what it displays

import pandas as pd # importing pandas

df = pd.DataFrame(columns = world_table_titles) # making a dataframe

df

column_data = table.find_all('tr') # finding rows within my table

for row in column_data[2:]: # [2:] because the first two just displayed []
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data] #clean version of row_data
    
    length = len(df)  
    df.loc[length] = individual_row_data
    print(individual_row_data)

I was trying to scrape a table from a Wikipedia page when I got this error. What have I done wrong, and how can I fix it?

I looked for solutions on YouTube and elsewhere online but found nothing that helped.

Solution

You could get your DataFrame directly with pandas.read_html() and then make adjustments to it:

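import pandas as pd

# read in the first table on the page
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue')[0]

# flatten the multi-level column headers: drop 'Unnamed' placeholders and
# repeated labels (dict.fromkeys keeps the original order, unlike set)
df.columns = [' '.join(e for e in dict.fromkeys(c) if 'Unnamed' not in e) for c in df.columns]

# check the result
df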
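
To see what the header clean-up does without fetching the page, here is a minimal sketch on a hand-built DataFrame. The column tuples and the single data row are made up for illustration; they only mimic the kind of two-level header that read_html() tends to produce for tables with stacked header rows. The sketch de-duplicates with dict.fromkeys so the word order is preserved.

```python
import pandas as pd

# Hypothetical two-level header, similar in shape to what read_html() returns
# for a table with two header rows. Values are invented for illustration.
columns = pd.MultiIndex.from_tuples([
    ('Rank', 'Rank'),                 # same label repeated on both levels
    ('Unnamed: 1_level_0', 'Name'),   # placeholder for an empty header cell
    ('Revenue', 'USD millions'),      # two genuinely different labels
])
df = pd.DataFrame([[1, 'Company A', 100]], columns=columns)

# Join the levels into one string, skipping placeholders and repeated labels.
df.columns = [
    ' '.join(dict.fromkeys(e for e in c if 'Unnamed' not in e))
    for c in df.columns
]

print(list(df.columns))  # ['Rank', 'Name', 'Revenue USD millions']
```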

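However, if you want to stick with BeautifulSoup directly, check your selection of world_titles: the ResultSet is much longer than you might think. table.find_all('th') collects every th in the whole table, including any header cells inside the data rows, so the DataFrame ends up with more columns than a row has values, and that is why the assignment to df.loc fails. The following is much closer to what you expect: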

world_titles = table.tr.find_all('th')
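
The difference between the two selections can be checked on a small inline table. This is only a sketch: the HTML below is made up, and it assumes that each data row starts with a th cell (for the rank) followed by td cells, which may or may not match the live Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Made-up miniature table: each data row starts with a <th> (the rank),
# followed by ordinary <td> cells.
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Industry</th></tr>
  <tr><th>1</th><td>Company A</td><td>Retail</td></tr>
  <tr><th>2</th><td>Company B</td><td>Energy</td></tr>
</table>
"""

table = BeautifulSoup(html, 'html.parser').find('table')

all_th = table.find_all('th')         # every <th> anywhere in the table
header_th = table.tr.find_all('th')   # only the <th> cells of the first row
first_row_td = table.find_all('tr')[1].find_all('td')

print(len(all_th))        # 5: three real headers plus two rank cells
print(len(header_th))     # 3
print(len(first_row_td))  # 2: the rank is a <th>, so it is not a <td>
```

With five titles but only two values per row, assigning a row through df.loc cannot succeed. Even with the three real headers, one column is still left over.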

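But you will then run into another issue: the header row and the data rows still do not line up, because there are columns without any text, which leads to lists of different lengths. So collect the rows in a list first and build the DataFrame at the end, skipping the first header: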

...
world_titles = table.tr.find_all('th')  # header cells of the first row only
world_table_titles = [title.text.strip() for title in world_titles]

column_data = table.find_all('tr')

data = []

for row in column_data[2:]: # [2:] because the first two just displayed []
    row_data = row.find_all('td')
    # use a separate loop variable so it does not shadow the data list
    data.append([cell.text.strip() for cell in row_data])

# skip the first header so the column count matches the cells per row
pd.DataFrame(data, columns=world_table_titles[1:])
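
A self-contained variation on the same idea, runnable without network access, might look like the sketch below. The inline HTML is invented and again assumes the first cell of every data row is a th. Option 2 is not part of the original answer: it is an alternative that keeps the first column by selecting th and td cells together, so no header has to be dropped.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table with the assumed layout: a <th> rank cell in every data row.
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Industry</th></tr>
  <tr><th>1</th><td>Company A</td><td>Retail</td></tr>
  <tr><th>2</th><td>Company B</td><td>Energy</td></tr>
</table>
"""

table = BeautifulSoup(html, 'html.parser').find('table')
rows = table.find_all('tr')

# Headers come from the first row only.
titles = [th.get_text(strip=True) for th in rows[0].find_all('th')]

# Option 1: only <td> cells, so the first header must be dropped.
td_only = [[c.get_text(strip=True) for c in r.find_all('td')] for r in rows[1:]]
df_td = pd.DataFrame(td_only, columns=titles[1:])

# Option 2 (alternative, not from the answer): take <th> and <td> cells
# together, so every header has a matching value.
all_cells = [[c.get_text(strip=True) for c in r.find_all(['th', 'td'])] for r in rows[1:]]
df_all = pd.DataFrame(all_cells, columns=titles)

print(df_td.shape)   # (2, 2)
print(df_all.shape)  # (2, 3)
```

Building the DataFrame once from a list of rows also avoids growing it row by row with df.loc, which is slower on larger tables.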

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
