Issue
I am trying to build a dataframe with two columns. The first column should contain the names of football leagues, and the second the names of the teams in those leagues.
I can scrape and parse the data, but because there are multiple team names for each league name, I keep getting ValueError: arrays must all be same length.
Here's my code:
league_names = soup.find_all(class_='panel-title')
team_names = soup.find_all('a', class_="odds")
a = [data.text.strip() for data in league_names]
b = [data.text.strip() for data in team_names]
df = pd.DataFrame({'league_names':a, 'team_names':b}, columns=['league_names','team_names'])
Here's the desired output:
league_names | team_names |
---|---|
Albania Championship | Dinamo Tirana - Skenderbeu Korce |
Albania Championship | KF Teuta - FK Egnatia |
Albania Championship | Vllaznia Shkoder - FK Kukesi |
Here's the relevant HTML:
<div class="panel">
<div class="panel-heading">
<h4 class="panel-title">
<a class="" title="Click to expand Albania Championship" data-toggle="collapse" href="#_l10041047" aria-expanded="true">
Albania Championship </a>
</h4>
</div>
<div id="_l10041047" class="panel-collapse collapse in" aria-expanded="true" style="">
<ul class="nav list-group">
<li>
<a class="odds" onclick="loadEventData('119024520',this)" title="Dinamo Tirana - Skenderbeu Korce">Dinamo Tirana - Skenderbeu Korce</a>
</li>
<li>
<a class="odds" onclick="loadEventData('119024522',this)" title="KF Teuta - FK Egnatia">KF Teuta - FK Egnatia</a>
</li>
<li>
<a class="odds" onclick="loadEventData('119024524',this)" title="Vllaznia Shkoder - FK Kukesi">Vllaznia Shkoder - FK Kukesi</a>
</li>
</ul>
</div>
</div>
Solution
To avoid ending up with several lists of different lengths, change your scraping strategy. Based on your example, select every <a> with class odds inside the panels and pair each one with the <h4> that precedes it:
data = []
for link in soup.select('div.panel a.odds'):
    data.append({
        'league': link.find_previous('h4').text.strip(),
        'teams': link.text
    })
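Why the row-by-row approach sidesteps the error can be seen in a minimal sketch. The lists below are hand-written stand-ins for the scraped values (an assumption for illustration, nothing is fetched): a dict of lists needs columns of equal length, while a list of dicts carries one complete row per entry.

```python
import pandas as pd

# Stand-ins for the two scraped lists: one league heading, three fixtures.
leagues = ['Albania Championship']
fixtures = [
    'Dinamo Tirana - Skenderbeu Korce',
    'KF Teuta - FK Egnatia',
    'Vllaznia Shkoder - FK Kukesi',
]

# Dict of lists: every column must have the same length, so this raises ValueError.
try:
    pd.DataFrame({'league_names': leagues, 'team_names': fixtures})
    raised = False
except ValueError:
    raised = True

# List of dicts: each dict is one row, so the columns always line up.
rows = [{'league': leagues[0], 'teams': fixture} for fixture in fixtures]
df = pd.DataFrame(rows)
```

Here the league name is repeated by hand; in the scraping loop above, find_previous('h4') does that repetition for each link.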
Example
import pandas as pd
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4 class="panel-title">
<a class="" title="Click to expand Albania Championship" data-toggle="collapse" href="#_l10041047" aria-expanded="true">
Albania Championship </a>
</h4>
</div>
<div id="_l10041047" class="panel-collapse collapse in" aria-expanded="true" style="">
<ul class="nav list-group">
<li>
<a class="odds" onclick="loadEventData('119024520',this)" title="Dinamo Tirana - Skenderbeu Korce">Dinamo Tirana - Skenderbeu Korce</a>
</li>
<li>
<a class="odds" onclick="loadEventData('119024522',this)" title="KF Teuta - FK Egnatia">KF Teuta - FK Egnatia</a>
</li>
<li>
<a class="odds" onclick="loadEventData('119024524',this)" title="Vllaznia Shkoder - FK Kukesi">Vllaznia Shkoder - FK Kukesi</a>
</li>
</ul>
</div>
</div>
'''
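# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')

# Build one row per link, paired with the closest preceding league heading
data = []
for link in soup.select('div.panel a.odds'):
    data.append({
        'league': link.find_previous('h4').text.strip(),
        'teams': link.text
    })

pd.DataFrame(data)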
Output
league | teams |
---|---|
Albania Championship | Dinamo Tirana - Skenderbeu Korce |
Albania Championship | KF Teuta - FK Egnatia |
Albania Championship | Vllaznia Shkoder - FK Kukesi |
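A variation on the same idea, sketched under the assumption that every league sits in its own non-nested div.panel with a single .panel-title heading: loop over the panels, read the league name once, then collect that panel's links. The trimmed HTML and the second league in it are hypothetical, added only to show that rows stay attached to the right league.

```python
from bs4 import BeautifulSoup

# Trimmed markup; the second panel is a hypothetical league for illustration.
html = '''
<div class="panel">
  <h4 class="panel-title"><a>Albania Championship </a></h4>
  <ul>
    <li><a class="odds">KF Teuta - FK Egnatia</a></li>
    <li><a class="odds">Vllaznia Shkoder - FK Kukesi</a></li>
  </ul>
</div>
<div class="panel">
  <h4 class="panel-title"><a>Hypothetical League </a></h4>
  <ul>
    <li><a class="odds">Team A - Team B</a></li>
  </ul>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
for panel in soup.select('div.panel'):
    title = panel.select_one('.panel-title')
    if title is None:
        # Skip panels without a heading instead of failing on None.
        continue
    league = title.get_text(strip=True)
    # Only links inside this panel are paired with this league.
    for link in panel.select('a.odds'):
        rows.append({'league': league, 'teams': link.get_text(strip=True)})
```

Passing rows to pd.DataFrame should give the same two-column layout as above. Whether this reads better than find_previous is a matter of taste; it mainly makes the panel boundary explicit.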
Answered By - HedgeHog