Issue
I'm working on a hobby project and can't figure out how to solve this problem. I'm trying to scrape a web page. The HTML looks like this:
<table class="wikitable mw-collapsible mw-made-collapsible" width="100%">
<tbody>
<tr>
<th style="background-color: #8B0000; color: white">Quirk
</th>
<th style="background-color: #8B0000; color: white">Usage
</th>
<th style="background-color: #8B0000; color: white">User(s)
</th>
<th style="background-color: #8B0000; color: white"><span class="mw-collapsible-toggle mw-collapsible-toggle-default mw-collapsible-toggle-expanded" role="button" tabindex="0" aria-expanded="true"><a class="mw-collapsible-text">Collapse</a></span>Type
</th></tr>
<tr style="">
<th><a href="/wiki/Acid" title="Acid">Acid</a>
</th>
<th><a href="https://static.wikia.nocookie.net/bokunoheroacademia/images/1/18/Acid.png/revision/latest?cb=20221024220200" class="image"><img alt="Acid.png" src="https://static.wikia.nocookie.net/bokunoheroacademia/images/1/18/Acid.png/revision/latest/scale-to-width-down/150?cb=20221024220200" decoding="async" loading="lazy" data-image-name="Acid.png" data-image-key="Acid.png" data-src="https://static.wikia.nocookie.net/bokunoheroacademia/images/1/18/Acid.png/revision/latest/scale-to-width-down/150?cb=20221024220200" class=" lazyloaded" width="150" height="84"></a>
</th>
<th><a href="/wiki/Mina_Ashido" title="Mina Ashido">Mina Ashido</a>
</th>
<th>Emitter
...
Here's my code:
import requests
from bs4 import BeautifulSoup
url = 'https://myheroacademia.fandom.com/wiki/Quirk#List_of_Quirks'
response = requests.get(url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')
tables = soup.find_all('table', {'class': 'wikitable'})
quirks_table = tables[0]
if quirks_table:
    format_string = '{:<40} | {:<60} | {:<30}'
    print(format_string.format('Quirk Name', 'Users', 'Type'))
    print('-' * 135)
    rows = quirks_table.find_all('tr')
    for row in rows[1:]:
        columns = row.find_all('td')
        if len(columns) == 4:
            quirk_elem = columns[0].find('a')
            quirk_name = quirk_elem.text.split('\n')[0].strip()
            user_elem = columns[1].find('a')
            user_name = user_elem.text.split('\n')[0].strip()
            quirk_type_elem = columns[3].find('a')
            quirk_type = quirk_type_elem.text.split('\n')[0].strip()
            print(format_string.format(quirk_name, user_name, quirk_type))
        else:
            print('Could not find all columns in row:', row)
else:
    print('Could not find quirks table')
I'm trying to get only the text between the 'a' tags in the HTML, but the result looks like this:
Could not find all columns in row: <tr>
<th><a href="/wiki/Zoom" title="Zoom">Zoom</a>
</th>
<th><a class="image" href="https://static.wikia.nocookie.net/bokunoheroacademia/images/7/72/Zoom_Anime.gif/revision/latest?cb=20170525025842"><img alt="Zoom Anime.gif" class="lazyload" data-image-key="Zoom_Anime.gif" data-image-name="Zoom Anime.gif" data-src="https://static.wikia.nocookie.net/bokunoheroacademia/images/7/72/Zoom_Anime.gif/revision/latest/scale-to-width-down/150?cb=20170525025842" decoding="async" height="84" loading="lazy" src="%3D%3D" width="150"/></a>
</th>
<th><a href="/wiki/Mei_Hatsume" title="Mei Hatsume">Mei Hatsume</a>
</th>
<th>Mutant
</th></tr>
What do I need to change in my code to get only the text between the 'a' tags?
Solution
The main issue is that there is no <td> in your <tr>, so row.find_all('td') always returns an empty list (not None, which is what find() returns). Its length is 0, so the check len(columns) == 4 in your if-statement never matches. Switch to row.find_all('th') instead.
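A quick way to confirm this is to run both lookups on a single row. The snippet below is only a minimal sketch on a hand-written row (the HTML string is made up for illustration, not fetched from the wiki). It also shows that find_all() hands back an empty list when nothing matches, while find() is the call that returns None:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one table row (assumption: not the real page).
row_html = "<table><tr><th>Acid</th><th>Emitter</th></tr></table>"
row = BeautifulSoup(row_html, "html.parser").find("tr")

td_cells = row.find_all("td")  # no <td> in this row, so the list is empty
th_cells = row.find_all("th")  # the cells are actually <th>

print(len(td_cells))   # 0
print(len(th_cells))   # 2
print(row.find("td"))  # None
```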
To avoid an error, you also have to change
quirk_type_elem = columns[3].find('a')
to
quirk_type_elem = columns[3]
because there is no <a> inside the <th> that holds the type, so find('a') returns None there and .text fails on it.
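Put together, the loop could look roughly like the sketch below. It runs on a trimmed copy of the row from the question rather than on the live page, and extract_rows and SAMPLE_HTML are hypothetical names used only for this illustration. One more detail from the posted HTML: the second cell (columns[1]) is the Usage image, so the user name is read from columns[2] here.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (image URLs shortened to "#").
SAMPLE_HTML = """
<table class="wikitable">
<tr><th>Quirk</th><th>Usage</th><th>User(s)</th><th>Type</th></tr>
<tr>
<th><a href="/wiki/Acid" title="Acid">Acid</a>
</th>
<th><a href="#" class="image"><img alt="Acid.png" src="#"></a>
</th>
<th><a href="/wiki/Mina_Ashido" title="Mina Ashido">Mina Ashido</a>
</th>
<th>Emitter
</th></tr>
</table>
"""


def extract_rows(html):
    """Return (quirk, user, type) tuples from the first wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable"})
    results = []
    if table is None:
        return results
    for row in table.find_all("tr")[1:]:  # skip the header row
        columns = row.find_all("th")      # the cells are <th>, not <td>
        if len(columns) != 4:
            continue
        quirk_name = columns[0].get_text(strip=True)
        user_name = columns[2].get_text(strip=True)   # columns[1] is the image
        quirk_type = columns[3].get_text(strip=True)  # plain text, no <a>
        results.append((quirk_name, user_name, quirk_type))
    return results


rows = extract_rows(SAMPLE_HTML)
for quirk_name, user_name, quirk_type in rows:
    print('{:<40} | {:<60} | {:<30}'.format(quirk_name, user_name, quirk_type))
```

Cells that list several users would need a separator, for example get_text(', ', strip=True); the sample row has only one user, so that case is not exercised here.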
You could also use another strategy: CSS selectors and the stripped_strings generator (a property, so it is used without parentheses) let you select more specifically and reach your goal, as long as the page keeps the same pattern:
for e in soup.select('h2:has(#List_of_Quirks) + table tr:has(a)'):
    s = list(e.stripped_strings)
    data.append(
        dict(zip(['q_name','u_name','q_type'],[s[0],','.join(s[1:-1]),s[-1]] ))
    )
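To see what the selector and stripped_strings are doing, here is a small sketch that runs the same loop on inline HTML instead of the live page. SAMPLE_HTML is hand-written for this illustration; in particular, the markup of the multi-user cell is an assumption and may differ from what the wiki actually serves.

```python
from bs4 import BeautifulSoup

# Hand-written sample (assumption): a heading with the anchor id, directly
# followed by the table, as the selector expects.
SAMPLE_HTML = """
<h2><span class="mw-headline" id="List_of_Quirks">List of Quirks</span></h2>
<table class="wikitable">
<tr><th>Quirk</th><th>Usage</th><th>User(s)</th><th>Type</th></tr>
<tr>
<th><a href="/wiki/Acid">Acid</a></th>
<th><a class="image" href="#"><img alt="Acid.png" src="#"></a></th>
<th><a href="/wiki/Mina_Ashido">Mina Ashido</a></th>
<th>Emitter</th>
</tr>
<tr>
<th><a href="/wiki/Air_Cannon">Air Cannon</a></th>
<th><a class="image" href="#"><img alt="Air Cannon.png" src="#"></a></th>
<th>Unknown former user<br><a href="#">All For One</a><br><a href="#">Tomura Shigaraki</a></th>
<th>Emitter</th>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
data = []

# 'h2:has(#List_of_Quirks) + table' picks the table right after the heading;
# 'tr:has(a)' keeps only rows containing a link, which skips this header row.
for e in soup.select('h2:has(#List_of_Quirks) + table tr:has(a)'):
    # All non-empty text fragments of the row, whitespace stripped.
    # The image cell contributes nothing because it holds no text.
    s = list(e.stripped_strings)
    data.append({
        'q_name': s[0],             # first fragment: quirk name
        'u_name': ','.join(s[1:-1]),  # everything in between: user(s)
        'q_type': s[-1],            # last fragment: quirk type
    })

print(data)
```

The slicing relies on the first fragment being the quirk name and the last one being the type, so any row that breaks that pattern would need extra handling.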
Example
import requests
from bs4 import BeautifulSoup

url = 'https://myheroacademia.fandom.com/wiki/Quirk#List_of_Quirks'
response = requests.get(url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')

data = []
# Rows that contain a link, in the table right after the "List of Quirks" heading
for e in soup.select('h2:has(#List_of_Quirks) + table tr:has(a)'):
    # Non-empty text fragments of the row: first = quirk, last = type, rest = users
    s = list(e.stripped_strings)
    data.append(
        dict(zip(['q_name','u_name','q_type'],[s[0],','.join(s[1:-1]),s[-1]] ))
    )
data
Output
[{'q_name': 'Acid', 'u_name': 'Mina Ashido', 'q_type': 'Emitter'},
 {'q_name': 'Acid Sweat', 'u_name': 'Masaru Bakugo', 'q_type': 'Emitter'},
 {'q_name': 'Air Cannon',
  'u_name': 'Unknown former user,All For One,Tomura Shigaraki',
  'q_type': 'Emitter'},
 {'q_name': 'Air Walk',
  'u_name': 'Unknown former user,All For One,(Formerly),Lady Nagant',
  'q_type': 'Emitter'},...]
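If you still want the aligned table from the question, the list of dicts can be fed back into the same format string. This is just a sketch: the data list below is copied from the first two entries of the output above so the snippet runs without a network request, and lines is a name introduced only for this example.

```python
# First two entries copied from the output above.
data = [
    {'q_name': 'Acid', 'u_name': 'Mina Ashido', 'q_type': 'Emitter'},
    {'q_name': 'Acid Sweat', 'u_name': 'Masaru Bakugo', 'q_type': 'Emitter'},
]

# Same column widths as in the question: 40, 60 and 30 characters.
format_string = '{:<40} | {:<60} | {:<30}'

print(format_string.format('Quirk Name', 'Users', 'Type'))
print('-' * 135)

# Build one formatted line per quirk, then print them.
lines = [
    format_string.format(d['q_name'], d['u_name'], d['q_type'])
    for d in data
]
for line in lines:
    print(line)
```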
Answered By - HedgeHog