Issue
Hi guys,
I am basically new to coding in general, so bear with me.
I am trying to retrieve the table headers for this table: https://www.transfermarkt.co.uk/manchester-united-fc/leistungsdaten/verein/985/reldata/%262022/plus/1
First I tried with pandas, but I could not get my data, so I learned about Beautiful Soup and tried my luck with it.
Some of the headers are plain text, and I could get those pretty easily using this:
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.transfermarkt.co.uk/manchester-united-fc/leistungsdaten/verein/985/reldata/%262022/plus/1'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get(url, headers=headers)
response.content
soup = bs(response.content, 'html.parser')
soup.prettify().splitlines()
tabela_equipa = soup.find('table', {'class': 'items'} )
headers_tabela = [th.text.encode("utf-8") for th in tabela_equipa.select("tr th")]
print(headers_tabela)
Output: [b'#', b'player', b'Age', b'Nat.', b'In squad', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'PPG', b'\xc2\xa0']
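A side note on that output, as a minimal sketch: the `b'...'` prefix is there because `.encode("utf-8")` turns each header into `bytes`, and `b'\xc2\xa0'` is the UTF-8 encoding of a non-breaking space, which is the only text inside the icon cells. The variable names below are just for illustration.

```python
# The bytes that show up for every icon-only header cell.
nbsp_bytes = b"\xc2\xa0"

# Decoding them gives a single non-breaking space character (U+00A0).
decoded = nbsp_bytes.decode("utf-8")
print(decoded == "\u00a0")

# str.strip() treats U+00A0 as whitespace, so nothing is left afterwards.
stripped = decoded.strip()
print(repr(stripped))
```

So those cells are not corrupted; they just contain no visible text, and the column name has to come from somewhere else in the markup.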
The thing is that most of those headers are icons, and the info I need is actually in the span title. That is where my problem lies: I have not been able to find anywhere how to get all that info to build my table headers, so that I can then scrape the rest of the table.
Does anyone know a way of doing it? I have been trying for 4 days without success before posting here.
Then I tried to get all the spans using this code:
thead = soup.thead
Theaders = thead.find_all('span')
print(Theaders)
Output:
[<span class="icons_sprite icon-einsaetze-table-header sort-link-icon" title="Appearances"> </span>, <span class="icons_sprite icon-tor-table-header sort-link-icon" title="Goals"> </span>, <span class="icons_sprite icon-vorlage-table-header sort-link-icon" title="Assists"> </span>, <span class="icons_sprite icon-gelbekarte-table-header sort-link-icon" title="Yellow cards"> </span>, <span class="icons_sprite icon-gelbrotekarte-table-header sort-link-icon" title="Second yellow cards"> </span>, <span class="icons_sprite icon-rotekarte-table-header sort-link-icon" title="Red cards"> </span>, <span class="icons_sprite icon-einwechslungen-table-header sort-link-icon" title="Substitutions on"> </span>, <span class="icons_sprite icon-auswechslungen-table-header sort-link-icon" title="Substitutions off"> </span>, <span class="icons_sprite icon-minuten-table-header sort-link-icon" title="Minutes played"> </span>]
I thought I was getting close, since I could see that all the info I needed was there. But then I hit a wall: I can get one span title, but not all of them in a list:
thead = soup.thead
Theaders = thead.find('span')['title']
print(Theaders)
Output: Appearances
thead = soup.thead
Theaders = thead.find_all('span')['title']
print(Theaders)
Output:
---> 23 Theaders = thead.find_all('span')['title']
24 print(Theaders)
TypeError: list indices must be integers or slices, not str
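The error above comes from the fact that `find_all` returns a list-like collection of tags rather than a single tag, so `['title']` has to be applied to each element in turn. A minimal sketch, run against a cut-down copy of the span markup printed earlier (the sample HTML is trimmed for illustration and is not the full page):

```python
from bs4 import BeautifulSoup

# Trimmed sample of the header markup; only three of the icon columns.
html = """
<table><thead><tr>
<th><span class="icons_sprite sort-link-icon" title="Appearances">&nbsp;</span></th>
<th><span class="icons_sprite sort-link-icon" title="Goals">&nbsp;</span></th>
<th><span class="icons_sprite sort-link-icon" title="Assists">&nbsp;</span></th>
</tr></thead></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all gives back many tags, so index each one individually.
spans = soup.thead.find_all("span")
titles = [span["title"] for span in spans]
print(titles)
```

`find_all` returns matches in document order, so the titles come out in the same left-to-right order as the columns.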
And even then I will run into the problem of the titles not being in the same order as in the original table.
Maybe I am just being dumb, but any help would be much appreciated.
Solution
from bs4 import BeautifulSoup
import requests

url = 'https://www.transfermarkt.co.uk/manchester-united-fc/leistungsdaten/verein/985/reldata/%262022/plus/1'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')  # the 'lxml' parser needs the lxml package installed

# Each sortable column header is wrapped in an <a class="sort-link">.
html_headers = soup.find_all('a', {'class': 'sort-link'})
headers_list = []
for i in html_headers:
    span = i.find('span')
    if span is None:
        # Text header: use the link text itself.
        headers_list.append(i.get_text())
    else:
        # Icon header: the column name is in the span's title attribute.
        headers_list.append(span['title'])
print(headers_list)
Answered By - LLaP
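A possible variation on the answer above, offered only as a sketch: walking the `th` cells instead of the `sort-link` anchors also picks up header cells that have no link, and keeps one name per column in table order. The sample markup below is an assumption made up for illustration (in particular the unlinked `#` cell); the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down header row mixing text and icon columns.
html = """
<table class="items"><thead><tr>
<th>#</th>
<th><a class="sort-link">player</a></th>
<th><a class="sort-link"><span title="Appearances">&nbsp;</span></a></th>
<th><a class="sort-link"><span title="Goals">&nbsp;</span></a></th>
<th><a class="sort-link">PPG</a></th>
</tr></thead></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "items"})

column_names = []
for th in table.select("thead th"):
    # Prefer a span that carries a title attribute; otherwise use the cell text.
    span = th.find("span", title=True)
    if span is not None:
        column_names.append(span["title"])
    else:
        column_names.append(th.get_text(strip=True))

print(column_names)
```

Because the loop visits the cells in document order, the resulting list lines up with the columns of each data row.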