Friday, November 11, 2022

[FIXED] How can I get the table headers if they are not all text with Python and BeautifulSoup?

beautifulsoup, python, python-requests, web-scraping

Issue

Hi guys,

I am basically new to coding in general, so bear with me.

I am trying to retrieve the table headers for this table: https://www.transfermarkt.co.uk/manchester-united-fc/leistungsdaten/verein/985/reldata/%262022/plus/1

First I tried with pandas but could not get my data, so I learned about BeautifulSoup and tried my luck with it.

The problem is that only some headers are text. I could get those pretty easily using this:

from bs4 import BeautifulSoup as bs
import requests

url = 'https://www.transfermarkt.co.uk/manchester-united-fc/leistungsdaten/verein/985/reldata/%262022/plus/1'

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

response = requests.get(url, headers=headers)

response.content

soup = bs(response.content, 'html.parser')

soup.prettify().splitlines()

tabela_equipa = soup.find('table', {'class': 'items'} )

headers_tabela = [th.text.encode("utf-8") for th in tabela_equipa.select("tr th")]

print(headers_tabela)

Output: [b'#', b'player', b'Age', b'Nat.', b'In squad', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'\xc2\xa0', b'PPG', b'\xc2\xa0']

The thing is that most of those headers are icons, and the info I need is actually in the span's title attribute. That is where my problem lies: I cannot find anywhere how to get all that info to build my table headers so that I can then scrape the rest of the table.

Does anyone know a way of doing it? I have been trying for 4 days without success before posting here.

Then I tried to get all the spans using this code:

thead = soup.thead
Theaders = thead.find_all('span')
print(Theaders)

Output:

[<span class="icons_sprite icon-einsaetze-table-header sort-link-icon" title="Appearances"> </span>, <span class="icons_sprite icon-tor-table-header sort-link-icon" title="Goals"> </span>, <span class="icons_sprite icon-vorlage-table-header sort-link-icon" title="Assists"> </span>, <span class="icons_sprite icon-gelbekarte-table-header sort-link-icon" title="Yellow cards"> </span>, <span class="icons_sprite icon-gelbrotekarte-table-header sort-link-icon" title="Second yellow cards"> </span>, <span class="icons_sprite icon-rotekarte-table-header sort-link-icon" title="Red cards"> </span>, <span class="icons_sprite icon-einwechslungen-table-header sort-link-icon" title="Substitutions on"> </span>, <span class="icons_sprite icon-auswechslungen-table-header sort-link-icon" title="Substitutions off"> </span>, <span class="icons_sprite icon-minuten-table-header sort-link-icon" title="Minutes played"> </span>]

Getting close, I thought, as I could see all the info I needed was there. But then I hit a wall: I can get one span title, but not all of them in a list:

thead = soup.thead
Theaders = thead.find('span')['title']
print(Theaders)

Output: Appearances

thead = soup.thead
Theaders = thead.find_all('span')['title']
print(Theaders)

Output:

---> 23 Theaders = thead.find_all('span')['title']
     24 print(Theaders)

TypeError: list indices must be integers or slices, not str

And even then I will run into the problem of the titles not being in the same order as in the original table.

Maybe I am just being dumb, but any help would be much appreciated.

Solution

from bs4 import BeautifulSoup
import requests

url = 'https://www.transfermarkt.co.uk/manchester-united-fc/leistungsdaten/verein/985/reldata/%262022/plus/1'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')

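# Collect the sort links that wrap the column headers
html_headers = soup.find_all('a', {'class': 'sort-link'})

headers_list = []
for link in html_headers:
    span = link.find('span')
    if span is None:
        # Plain-text header such as "Age" or "PPG"
        headers_list.append(link.get_text())
    else:
        # Icon header: the label is stored in the span's title attribute
        headers_list.append(span['title'])

print(headers_list)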
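
As a supplementary sketch, the same idea can be tried offline on a small HTML sample, which also shows why the question's `find_all('span')['title']` raised a TypeError: `find_all` returns a list-like result, so the `title` has to be read from each element in turn. The markup in `SAMPLE_HTML` and the helper name `extract_headers` are assumptions made for illustration (modelled on the output in the question, not copied from the live page). Walking the `th` cells of the table, rather than the spans alone, keeps the labels in column order.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup modelled on the output shown in the question;
# the real page has more columns and attributes.
SAMPLE_HTML = """
<table class="items">
  <thead>
    <tr>
      <th><a class="sort-link" href="#">#</a></th>
      <th><a class="sort-link" href="#">player</a></th>
      <th><a class="sort-link" href="#">Age</a></th>
      <th><a class="sort-link" href="#"><span class="icons_sprite" title="Appearances">&nbsp;</span></a></th>
      <th><a class="sort-link" href="#"><span class="icons_sprite" title="Goals">&nbsp;</span></a></th>
      <th><a class="sort-link" href="#">PPG</a></th>
    </tr>
  </thead>
</table>
"""


def extract_headers(table):
    """Return one label per <th>, in column order (illustrative helper)."""
    labels = []
    for th in table.select("thead th"):
        # Only match spans that actually carry a title attribute
        span = th.find("span", title=True)
        if span is not None:
            labels.append(span["title"])
        else:
            labels.append(th.get_text(strip=True))
    return labels


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="items")

# find_all returns a list-like ResultSet, so index each element, not the list
span_titles = [span["title"] for span in table.find_all("span", title=True)]

header_labels = extract_headers(table)
print(span_titles)
print(header_labels)
```

On the real page, `table` would come from the parsed response instead of `SAMPLE_HTML`. Scoping the search to the `items` table should also avoid picking up any unrelated `sort-link` anchors elsewhere on the page, if there are any.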

Answered By - LLaP

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
