Issue
I am trying to scrape the text of some elements in a table using requests and BeautifulSoup, specifically the country names and the 2-letter country codes from this website.
Here is my code, which I have progressively walked back:
import requests
import bs4

res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')  # name the parser explicitly
for i in range(3):
    row = soup.find(f'#row{i} td')
    print(row)  # printing to check progress for now
I had hoped to go row by row and walk the tags to get the strings this way (over range(249)). However, soup.find() doesn't appear to work; it just prints empty results. soup.select(), however, works fine:
for i in range(3):
    row = soup.select(f'#row{i} td')
    print(row)
Why does soup.find() not work as expected here?
Solution
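.find() expects a tag name (plus optional attribute filters), not a CSS selector, so soup.find('#row0 td') searches for a tag literally named "#row0 td" and returns None. In addition, .find() deals only with the first occurrence of an element, while .select() / .find_all() will give you a ResultSet you can iterate over.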
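The difference can be sketched on a small inline snippet. The HTML below is a made-up stand-in for the real page (only the id names mirror the question), so treat it as an illustration rather than the site's actual markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the table in the question.
html = """
<table id="countriesTable">
  <tr id="row0"><td>Afghanistan</td><td>AF</td></tr>
  <tr id="row1"><td>Albania</td><td>AL</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() treats its first argument as a tag name, so this looks for a tag
# literally called "#row0 td" and finds nothing.
missing = soup.find("#row0 td")

# select() parses the same string as a CSS selector.
cells = soup.select("#row0 td")

# The find()-style equivalent filters by tag name and attribute instead.
row = soup.find("tr", id="row0")
cells_via_find = row.find_all("td")

print(missing)                                   # None
print([td.get_text() for td in cells])           # ['Afghanistan', 'AF']
print([td.get_text() for td in cells_via_find])  # ['Afghanistan', 'AF']
```

If only a single element is needed, soup.select_one() is the CSS-selector counterpart of .find().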
There are many ways to reach your goal, but the basic pattern is mostly the same: select the rows of the table and iterate over them.
In this first case I selected the table by its id and, close to your initial approach, the <tr> elements also by their id, using a CSS selector with [id^="row"], which matches an id attribute whose value starts with row. In addition, I used .stripped_strings to extract the text from the elements, stored it in a list, and picked the values by index:
for row in soup.select('#countriesTable tr[id^="row"]'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
or, more precisely, selecting all <tr> in the <tbody> of the tag with id countriesTable:
for row in soup.select('#countriesTable tbody tr'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
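Both selectors can be tried offline on a small sample. The markup here is hypothetical (three columns instead of the real page's layout), so the name and code sit at indices 1 and 2 rather than 2 and 3; the right indices always depend on the actual table:

```python
from bs4 import BeautifulSoup

# Made-up sample table; only the id names follow the answer above.
html = """
<table id="countriesTable">
  <thead><tr><th>#</th><th>Name</th><th>ISO 2</th></tr></thead>
  <tbody>
    <tr id="row0"><td>1</td><td> Afghanistan </td><td>AF</td></tr>
    <tr id="row1"><td>2</td><td><a href="#">Albania</a></td><td>AL</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

def name_and_code(selector):
    """Collect (name, code) pairs from every row matched by the selector."""
    pairs = []
    for row in soup.select(selector):
        # stripped_strings yields each text fragment with whitespace trimmed
        cells = list(row.stripped_strings)
        pairs.append((cells[1], cells[2]))
    return pairs

by_id_prefix = name_and_code('#countriesTable tr[id^="row"]')
by_tbody = name_and_code('#countriesTable tbody tr')

print(by_id_prefix)  # [('Afghanistan', 'AF'), ('Albania', 'AL')]
print(by_tbody)      # [('Afghanistan', 'AF'), ('Albania', 'AL')]
```

One caveat with this approach: .stripped_strings skips empty cells entirely, so a row with a blank cell yields a shorter list and the indices shift. Iterating over row.find_all('td') and calling .get_text(strip=True) on each cell keeps the positions stable.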
...
An alternative, and in my opinion the best way to scrape tables, is pandas.read_html(), which relies on an HTML parser such as lxml or beautifulsoup under the hood and does most of the work for you:
import pandas as pd
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,:]
or to get only the two specific columns:
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,[1,2]]
 | Name | ISO 2 |
---|---|---|
0 | Afghanistan | AF |
1 | Åland Islands | AX |
2 | Albania | AL |
3 | Algeria | DZ |
4 | American Samoa | AS |
5 | Andorra | AD |
...
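The same chain can be tried without network access on an inline sample. The table below is invented for illustration: the empty Notes column and the trailing repeated-header row are assumptions about why the answer uses dropna() and iloc[:-1], not a description of the real page. read_html() also needs lxml, or beautifulsoup plus html5lib, installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample; the empty column and the trailing row are assumptions.
html = """
<table id="countriesTable">
  <thead><tr><th>#</th><th>Name</th><th>ISO 2</th><th>Notes</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Afghanistan</td><td>AF</td><td></td></tr>
    <tr><td>2</td><td>Albania</td><td>AL</td><td></td></tr>
    <tr><td>#</td><td>Name</td><td>ISO 2</td><td></td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per table matching attrs.
tables = pd.read_html(StringIO(html), attrs={"id": "countriesTable"})

df = (
    tables[0]
    .dropna(axis=1, how="all")  # drop columns whose cells are all empty
    .iloc[:-1, [1, 2]]          # drop the last row, keep Name and ISO 2
)
print(df)
```

One thing to watch for with country codes: with the default keep_default_na=True, a cell containing the text NA (the ISO 2 code for Namibia) is likely to be read as a missing value, so it is worth checking that row in the result.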
Answered By - HedgeHog