Friday, August 19, 2022

[FIXED] Wikitable scraping using Python

Labels: beautifulsoup, pandas, python

Issue

I am trying to scrape all the tables on the Wikipedia page linked below to get each episode's number and title. My code stops after the first table and never reaches the second one. Can anyone shed some light on this?

wiki_link : https://en.wikipedia.org/wiki/One_Piece_(season_20)#Episode_list

The data in the table looks like this:

Basically, I am trying to fetch the row data for the columns No. overall [n1] and Title [n2], for example:

892 The Land of Wano! To the Samurai Country where Cherry Blossoms Flutter!

Required output in CSV:

The code:

from bs4 import BeautifulSoup
import pandas as pd
import requests
 
url = "https://en.wikipedia.org/wiki/One_Piece_(season_20)#Episode_list"

response = requests.get(url)

soup = BeautifulSoup(response.text,'lxml')

table = soup.find('table',{'class':'wikitable plainrowheaders wikiepisodetable'}).tbody
rows = table.find_all('tr')
columns = [v.text.replace('\n','') for v in rows[0].find_all('th')]
#print(len(rows))
#print(table)

df = pd.DataFrame(columns=columns)
print(df)

for i in range(1,len(rows)):
    tds = rows[i].find_all('td')
    print(len(tds))
    if len(tds)==6:
        values1 = [tds[0].text, tds[1].text, tds[2].text, tds[3].text.replace('\n','').replace('\xa0','')]
        epi = values1[0]
        title = values1[1].split('Transcription:')
        titles = title[0]
        print(f'{epi}|{titles}')
    else:
        values2 = [td.text.replace('\n','').replace('\xa0','') for td in tds]

Solution

If you would like to stay with your BeautifulSoup approach, select every <tr> with the class vevent and iterate over the ResultSet to build a list of dicts, which you can then use to create a DataFrame:

[
    {
        'No.overall':r.th.text,
        'Title':r.select('td:nth-of-type(2)')[0].text.split('Transcription:')[0]
    } 
    for r in soup.select('tr.vevent')
]
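
One likely reason the original loop stopped early is that soup.find(...) returns only the first matching table, whereas soup.select('tr.vevent') collects matching rows from every table in the document. Below is a minimal sketch of that difference. It assumes a made-up two-table snippet: SAMPLE_HTML, its episode titles, and the choice of the built-in html.parser are placeholders for illustration, not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page with two episode tables.
SAMPLE_HTML = """
<table class="wikitable"><tbody>
<tr class="vevent"><th>892</th><td>1</td><td>"Episode A"Transcription: "A"</td></tr>
</tbody></table>
<table class="wikitable"><tbody>
<tr class="vevent"><th>893</th><td>2</td><td>"Episode B"Transcription: "B"</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first matching table, so rows in later tables are never visited.
first_table = soup.find("table", class_="wikitable")
rows_from_find = first_table.find_all("tr", class_="vevent")

# select() searches the whole document and returns every matching row.
rows_from_select = soup.select("tr.vevent")

episodes = [
    {
        "No.overall": row.th.get_text(strip=True),
        # The second <td> holds the title; drop the transcription that follows it.
        "Title": row.select_one("td:nth-of-type(2)").get_text().split("Transcription:")[0].strip(),
    }
    for row in rows_from_select
]

print(len(rows_from_find), len(rows_from_select))
print(episodes)
```

With this sample, the find() route sees one row while select() sees both, which matches the symptom described in the question.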

Example

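from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/One_Piece_(season_20)#Episode_list"
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(requests.get(url).text, 'lxml')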

pd.DataFrame(
    [
        {
            'No.overall':r.th.text,
            'Title':r.select('td:nth-of-type(2)')[0].text.split('Transcription:')[0]
        } 
        for r in soup.select('tr.vevent')
    ]
)

Output

No.overall	Title
892	"The Land of Wano! To the Samurai Country where Cherry Blossoms Flutter!"
893	"Otama Appears! Luffy vs. Kaido's Army!"
894	"He'll Come! The Legend of Ace in the Land of Wano!"
895	"Side Story! The World's Greatest Bounty Hunter, Cidre!"
896	"Side Story! Clash! Luffy vs. the King of Carbonation!"

...
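
The question asked for CSV output, which the example above does not show. A minimal sketch of that last step could look like the following. It assumes the rows have the same shape as the output above (only the first two are copied here), strips the surrounding double quotes from each title as an optional extra, and writes to an in-memory buffer; a real run would more likely pass a file name of your choosing to to_csv.

```python
import io

import pandas as pd

# Two rows copied from the output above, in the same list-of-dicts shape.
records = [
    {"No.overall": "892",
     "Title": '"The Land of Wano! To the Samurai Country where Cherry Blossoms Flutter!"'},
    {"No.overall": "893",
     "Title": '"Otama Appears! Luffy vs. Kaido\'s Army!"'},
]

df = pd.DataFrame(records)

# Optional: remove the literal double quotes that surround each scraped title.
df["Title"] = df["Title"].str.strip('"')

# Write CSV without the DataFrame index; StringIO stands in for a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Passing index=False keeps the row index out of the file, so the CSV contains only the two requested columns.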

Answered By - HedgeHog

This answer, collected from Stack Overflow and tested by the PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
