Issue
from bs4 import BeautifulSoup
import requests #importing beautifulsoup and requests
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html') #storing beautiful soup in soup variable
print(soup) # to see what it displays
soup.find('table') # finding every tag labelled table
soup.find('table', class_ = 'wikitable sortable') # trying to specify tables
table = soup.find_all('table')[0] # this is the one I want
print(table) # to see what it displays
table.find_all('th') # find all the th (column headings) tags in the table.
world_titles = table.find_all('th') # so i can just type world_titles instead of table.find_all('th') all the time
world_table_titles = [title.text.strip() for title in world_titles] # removing \n and making data clean
print(world_table_titles) # seeing what it displays
import pandas as pd # importing pandas
df = pd.DataFrame(columns = world_table_titles) # making a dataframe
df
column_data = table.find_all('tr') # finding rows within my table
for row in column_data[2:]: # [2:] because the first two just displayed []
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data] #clean version of row_data
    length = len(df)
    df.loc[length] = individual_row_data
print(individual_row_data)
I was trying to scrape the table from a Wikipedia page and got this error. What have I done wrong, and how can I fix it?
I tried to find solutions on YouTube and elsewhere online but found nothing that helped.
Solution
You could get your DataFrame directly with pandas.read_html() and then make adjustments to it:
import pandas as pd
#read in the first table
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue')[0]
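# clean the column headers: join the header levels, dropping repeats and 'Unnamed' placeholders
# (dict.fromkeys keeps the original order, which set() does not guarantee)
df.columns = [' '.join(e for e in dict.fromkeys(c) if 'Unnamed' not in e) for c in df.columns]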
# check the result
df
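To see what the header-cleaning step does without fetching the page, here is a minimal sketch on a hand-built frame. The two-level column names and the data are made up for illustration and are not the real Wikipedia headers; the assumption is that read_html returns MultiIndex columns in which some levels repeat or are auto-named 'Unnamed: ...'.

```python
import pandas as pd

# Hypothetical two-level header, similar in shape to what read_html can return
columns = pd.MultiIndex.from_tuples([
    ("Rank", "Rank"),
    ("Name", "Unnamed: 1_level_1"),
    ("Revenue", "USD millions"),
])
df = pd.DataFrame([[1, "Acme", 100], [2, "Globex", 90]], columns=columns)

def flatten(levels):
    # dict.fromkeys drops repeated labels while keeping their original order
    unique = dict.fromkeys(levels)
    return " ".join(label for label in unique if "Unnamed" not in label)

df.columns = [flatten(c) for c in df.columns]
print(list(df.columns))  # ['Rank', 'Name', 'Revenue USD millions']
```

Using dict.fromkeys rather than set is a small design choice here: both remove duplicates, but only the former guarantees that 'Revenue' stays in front of 'USD millions'.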
However, if you want to go with BeautifulSoup directly, check your selection of world_titles: the ResultSet is much longer than you might think. The following is much closer to what you expect:
world_titles = table.tr.find_all('th')
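A small offline sketch of the difference, assuming (as one plausible reason for the longer ResultSet) that the table also uses th cells as row labels in its body. The HTML below is made up for illustration and is not the Wikipedia markup:

```python
from bs4 import BeautifulSoup

# Made-up table: one header row, plus a <th> used as a row label in each body row
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><th>1</th><td>Acme</td><td>100</td></tr>
  <tr><th>2</th><td>Globex</td><td>90</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

all_th = table.find_all("th")        # every <th> anywhere in the table
header_th = table.tr.find_all("th")  # only the <th> cells of the first row

print(len(all_th), len(header_th))   # 5 3
print([th.text.strip() for th in header_th])
```

Building the DataFrame columns from all_th would give five column names for rows that only carry two td cells, which is the kind of length mismatch that makes the row assignment fail.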
But you will run into another issue: there are columns without any text, which leads to lists of different lengths. So check out:
...
world_titles = table.tr.find_all('th')
world_table_titles = [title.text.strip() for title in world_titles]
column_data = table.find_all('tr')
data = []
for row in column_data[2:]: # [2:] because the first two just displayed []
    row_data = row.find_all('td')
    # use a separate loop variable so it does not shadow the data list
    data.append([cell.text.strip() for cell in row_data])
pd.DataFrame(data,columns=world_table_titles[1:])
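If you would rather not hard-code the [2:] slice, one option is to keep only the rows whose number of td cells matches the columns you intend to use. This is a minimal sketch on made-up HTML, assuming the first column lives in th cells (which is why the titles are sliced with [1:]); the real page may be structured differently:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up table: the first column is a <th> row label,
# and one row is a spacer without any <td> cells
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><th colspan="3"></th></tr>
  <tr><th>1</th><td>Acme</td><td>100</td></tr>
  <tr><th>2</th><td>Globex</td><td>90</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

titles = [th.text.strip() for th in table.tr.find_all("th")]
expected = len(titles) - 1  # assumption: the first column is <th>, not <td>

rows = []
for tr in table.find_all("tr")[1:]:
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) == expected:  # skip rows that would not fit the columns
        rows.append(cells)

df = pd.DataFrame(rows, columns=titles[1:])
print(df)
```

Collecting the rows in a list and building the DataFrame once also avoids growing it row by row with df.loc.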
Answered By - HedgeHog