Friday, January 5, 2024

[FIXED] How to get only the text inside a HTML table which is between 'a' tags?

January 05, 2024 beautifulsoup, html, python, python-requests, web-scraping

Issue

I'm working on a hobby project and can't figure out how to solve this problem. I'm trying to scrape a web page whose HTML looks like this:

<table class="wikitable mw-collapsible mw-made-collapsible" width="100%">
<tbody>
<tr>
<th style="background-color: #8B0000; color: white">Quirk
</th>
<th style="background-color: #8B0000; color: white">Usage
</th>
<th style="background-color: #8B0000; color: white">User(s)
</th>
<th style="background-color: #8B0000; color: white"><span class="mw-collapsible-toggle mw-collapsible-toggle-default mw-collapsible-toggle-expanded" role="button" tabindex="0" aria-expanded="true"><a class="mw-collapsible-text">Collapse</a></span>Type
</th></tr>
<tr style="">
<th><a href="/wiki/Acid" title="Acid">Acid</a>
</th>
<th><a href="https://static.wikia.nocookie.net/bokunoheroacademia/images/1/18/Acid.png/revision/latest?cb=20221024220200" class="image"><img alt="Acid.png" src="https://static.wikia.nocookie.net/bokunoheroacademia/images/1/18/Acid.png/revision/latest/scale-to-width-down/150?cb=20221024220200" decoding="async" loading="lazy" data-image-name="Acid.png" data-image-key="Acid.png" data-src="https://static.wikia.nocookie.net/bokunoheroacademia/images/1/18/Acid.png/revision/latest/scale-to-width-down/150?cb=20221024220200" class=" lazyloaded" width="150" height="84"></a>
</th>
<th><a href="/wiki/Mina_Ashido" title="Mina Ashido">Mina Ashido</a>
</th>
<th>Emitter
...

Here's my code:

import requests
from bs4 import BeautifulSoup

url = 'https://myheroacademia.fandom.com/wiki/Quirk#List_of_Quirks'
response = requests.get(url)
html_content = response.content

soup = BeautifulSoup(html_content, 'html.parser')

tables = soup.find_all('table', {'class': 'wikitable'})
quirks_table = tables[0]

if quirks_table:
    format_string = '{:<40} | {:<60} | {:<30}'
    print(format_string.format('Quirk Name', 'Users', 'Type'))
    print('-' * 135)

    rows = quirks_table.find_all('tr')
    for row in rows[1:]:
        columns = row.find_all('td')
        if len(columns) == 4:
            quirk_elem = columns[0].find('a')
            quirk_name = quirk_elem.text.split('\n')[0].strip()

            user_elem = columns[1].find('a')
            user_name = user_elem.text.split('\n')[0].strip()

            quirk_type_elem = columns[3].find('a')
            quirk_type = quirk_type_elem.text.split('\n')[0].strip()

            print(format_string.format(quirk_name, user_name, quirk_type))
        else:
            print('Could not find all columns in row:', row)
else:
    print('Could not find quirks table')

I'm trying to get only the text between the 'a' tags in the HTML, but the result looks like this:

Could not find all columns in row: <tr>
<th><a href="/wiki/Zoom" title="Zoom">Zoom</a>
</th>
<th><a class="image" href="https://static.wikia.nocookie.net/bokunoheroacademia/images/7/72/Zoom_Anime.gif/revision/latest?cb=20170525025842"><img alt="Zoom Anime.gif" class="lazyload" data-image-key="Zoom_Anime.gif" data-image-name="Zoom Anime.gif" data-src="https://static.wikia.nocookie.net/bokunoheroacademia/images/7/72/Zoom_Anime.gif/revision/latest/scale-to-width-down/150?cb=20170525025842" decoding="async" height="84" loading="lazy" src="data:image/gif;base64,R0lGODlhAQABAIABAAAAAP///yH5BAEAAAEALAAAAAABAAEAQAICTAEAOw%3D%3D" width="150"/></a>
</th>
<th><a href="/wiki/Mei_Hatsume" title="Mei Hatsume">Mei Hatsume</a>
</th>
<th>Mutant
</th></tr>

What do I need to change in my code to get only the text between the 'a' tags?

Solution

The main issue is that your <tr> elements contain no <td> cells, only <th> cells. row.find_all('td') therefore always returns an empty list (not None), len(columns) == 4 is never true, and every row falls into the else branch. Switch to row.find_all('th') instead.
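As a quick check of that behaviour, here is a minimal offline sketch. The HTML is a trimmed copy of one row from the question (the image cell is left out), and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed row from the question: every cell is a <th>, none is a <td>.
html = """
<table class="wikitable"><tbody>
<tr>
<th><a href="/wiki/Acid" title="Acid">Acid</a>
</th>
<th><a href="/wiki/Mina_Ashido" title="Mina Ashido">Mina Ashido</a>
</th>
<th>Emitter
</th></tr>
</tbody></table>
"""

row = BeautifulSoup(html, "html.parser").find("tr")

td_cells = row.find_all("td")  # nothing matches: an empty result, not None
th_cells = row.find_all("th")  # the cells that are actually in the row

print(len(td_cells), len(th_cells))
```

Because the result is an empty list rather than None, the original code raises no error at this point; it simply never enters the `if len(columns) == 4` branch.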

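To avoid an AttributeError you also have to change

quirk_type_elem = columns[3].find('a')

to

quirk_type_elem = columns[3]

because there is no <a> in the <th> that holds the type, so find('a') returns None there. Also note that, in the HTML you posted, columns[1] holds the image link; the user name is in columns[2].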
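Putting those changes together, a corrected version of the loop might look like the sketch below. It runs against a trimmed copy of the table from the question (the image URLs are replaced by "#" placeholders) instead of the live page, and it swaps `.find('a').text` for `get_text()`, which is my substitution rather than part of the original answer. The `results` list is also just an illustrative name.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question; image URLs are placeholders.
html = """
<table class="wikitable">
<tbody>
<tr>
<th>Quirk</th><th>Usage</th><th>User(s)</th><th>Type</th></tr>
<tr>
<th><a href="/wiki/Acid" title="Acid">Acid</a>
</th>
<th><a href="#" class="image"><img alt="Acid.png" src="#"></a>
</th>
<th><a href="/wiki/Mina_Ashido" title="Mina Ashido">Mina Ashido</a>
</th>
<th>Emitter
</th></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
quirks_table = soup.find("table", {"class": "wikitable"})

results = []
for row in quirks_table.find_all("tr")[1:]:       # skip the header row
    columns = row.find_all("th")                   # cells are <th>, not <td>
    if len(columns) != 4:
        continue
    quirk_name = columns[0].get_text(strip=True)
    # columns[1] is the image cell, so the user(s) live in columns[2]
    user_name = columns[2].get_text(separator=", ", strip=True)
    # the type cell is plain text with no <a> inside it
    quirk_type = columns[3].get_text(strip=True)
    results.append((quirk_name, user_name, quirk_type))

for quirk_name, user_name, quirk_type in results:
    print(f"{quirk_name:<40} | {user_name:<60} | {quirk_type:<30}")
```

Using `get_text()` on the cell works whether or not the text is wrapped in a link, which is why it sidesteps the None problem in the type column.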

You could also use another strategy: CSS selectors and the stripped_strings generator (a property, not a method) let you select the rows more precisely and reach your goal, as long as the page keeps the same pattern:

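data = []
# rows containing a link, from the table directly after the "List of Quirks" heading
for e in soup.select('h2:has(#List_of_Quirks) + table tr:has(a)'):
    s = list(e.stripped_strings)  # all non-empty text fragments in the row
    data.append(
        # first string: quirk name, last: type, everything in between: user(s)
        dict(zip(['q_name','u_name','q_type'],[s[0],','.join(s[1:-1]),s[-1]] ))
    )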

Example

import requests
from bs4 import BeautifulSoup

url = 'https://myheroacademia.fandom.com/wiki/Quirk#List_of_Quirks'
response = requests.get(url)
html_content = response.content

soup = BeautifulSoup(html_content, 'html.parser')
data = []
for e in soup.select('h2:has(#List_of_Quirks) + table tr:has(a)'):
    s = list(e.stripped_strings)
    data.append(
        dict(zip(['q_name','u_name','q_type'],[s[0],','.join(s[1:-1]),s[-1]] ))
    )
data

Output

[{'q_name': 'Acid', 'u_name': 'Mina Ashido', 'q_type': 'Emitter'},
 {'q_name': 'Acid Sweat', 'u_name': 'Masaru Bakugo', 'q_type': 'Emitter'},
 {'q_name': 'Air Cannon',
  'u_name': 'Unknown former user,All For One,Tomura Shigaraki',
  'q_type': 'Emitter'},
 {'q_name': 'Air Walk',
  'u_name': 'Unknown former user,All For One,(Formerly),Lady Nagant',
  'q_type': 'Emitter'},...]
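To see why a row with several users ends up as one comma-joined `u_name`, here is an offline sketch of the same selector and `stripped_strings` logic. The markup is hypothetical: it is modelled on the structure shown in the question, and the real page may well differ (for example in how multiple users are separated inside a cell).

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; not copied from the live page.
html = """
<h2><span id="List_of_Quirks">List of Quirks</span></h2>
<table class="wikitable"><tbody>
<tr><th>Quirk</th><th>Usage</th><th>User(s)</th><th>Type</th></tr>
<tr>
<th><a href="/wiki/Acid">Acid</a></th>
<th><a href="#" class="image"><img alt="Acid.png" src="#"></a></th>
<th><a href="/wiki/Mina_Ashido">Mina Ashido</a></th>
<th>Emitter</th></tr>
<tr>
<th><a href="/wiki/Air_Cannon">Air Cannon</a></th>
<th></th>
<th>Unknown former user<br><a href="/wiki/All_For_One">All For One</a><br><a href="/wiki/Tomura_Shigaraki">Tomura Shigaraki</a></th>
<th>Emitter</th></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# tr:has(a) skips the header row, which contains no link
for tr in soup.select("h2:has(#List_of_Quirks) + table tr:has(a)"):
    s = list(tr.stripped_strings)  # non-empty text fragments, in document order
    rows.append({
        "q_name": s[0],              # first fragment: quirk name
        "u_name": ",".join(s[1:-1]), # everything in between: user(s)
        "q_type": s[-1],             # last fragment: type
    })

print(rows)
```

The image cell contributes no strings, so for a single-user row `s` has exactly three entries; every extra text fragment in the user cell lands in `s[1:-1]` and gets joined into `u_name`.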

Answered By - HedgeHog

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
