Issue
I'm trying to build a simple web scraping tool.
Right now I'm having trouble extracting data from each row because the opening <tr> tag is missing.
(Only the opening <tr> tag is missing; the closing </tr> tag is still there.)
Below is my code:
from bs4 import BeautifulSoup
import requests
url = "https://companiesmarketcap.com/dow-jones/largest-companies-by-market-cap/"
data = requests.get(url).text
print(data)
In the printed HTML, the opening <tr> tag is missing; only the closing </tr> tag exists for each row:
<tbody>
(There's supposed to be a <tr> tag here!)
<td class="fav"><img alt="favorite icon" src="/img/fav.svg?v2" data-id="2"></td>
</td><td class="rank-td td-right" data-sort="1">1
</td><td class="name-td">
<div class="logo-container"><img loading="lazy" class="company-logo" alt="Apple logo" src="/img/company-logos/64/AAPL.png" data-img-path="/img/company-logos/64/AAPL.png" data-img-dark-path="/img/company-logos/64/AAPL.D.png"></div>
<div class="name-div"><a href="/apple/marketcap/"><div class="company-name">Apple</div>
<div class="company-code"><span class="rank d-none"></span>AAPL</div>
</a></div></td><td class="td-right" data-sort="2891576508416">$2.891 T</td><td class="td-right" data-sort="18592">$185.92</td><td data-sort="18" class="rh-sm"><span class="percentage-green"><svg class="a" viewBox="0 0 12 12"><path d="M10 8H2l4-4 4 4z"></path></svg>0.18%</span></td><td class="p-0 sparkline-td red"><svg><path d="M0,21 5,18 10,22 15,14 20,16 25,12 30,8 35,14 40,11 45,3 50,3 55,4 60,8 65,6 70,10 75,11 80,13 85,13 90,14 95,14 100,13 105,16 110,16 115,31 120,34 125,39 130,41 135,31 140,32 145,30 150,31 155,30" /></svg></td><td>🇺🇸 <span class="responsive-hidden">USA</span></td>
</tr>
Thank you!
I tried the following:
soup = BeautifulSoup(data, "lxml")
table = soup.find("table")
# print(table)
rows = table.find_all("tr")
but it doesn't work because, again, the opening <tr> tag is missing.
Solution
The issue is that the page's HTML is malformed. To parse it the way a browser does, use the html5lib parser:
import requests
from bs4 import BeautifulSoup
url = 'https://companiesmarketcap.com/dow-jones/largest-companies-by-market-cap/'
soup = BeautifulSoup(requests.get(url).content, 'html5lib')
for tr in soup.table.select('tr'):
    tds = [t for td in tr.select('td') if (t := td.get_text(strip=True, separator=' '))]
    if len(tds) == 6:
        print(*tds, sep='\t')
Prints:
1 Apple AAPL $2.891 T $185.92 0.18% 🇺🇸 USA
2 Microsoft MSFT $2.887 T $388.47 1.00% 🇺🇸 USA
3 Visa V $542.91 B $264.17 0.05% 🇺🇸 USA
4 JPMorgan Chase JPM $488.72 B $169.05 0.73% 🇺🇸 USA
5 UnitedHealth UNH $482.35 B $521.51 3.37% 🇺🇸 USA
6 Walmart WMT $434.31 B $161.32 0.13% 🇺🇸 USA
7 Johnson & Johnson JNJ $390.91 B $162.39 0.77% 🇺🇸 USA
8 Procter & Gamble PG $354.94 B $150.60 0.06% 🇺🇸 USA
...
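To see why the parser choice matters without hitting the site, here is a minimal sketch on a made-up snippet. The markup and the extract_rows helper are hypothetical and only imitate the problem from the question (a closing </tr> with no opening <tr>). It assumes html5lib is installed alongside bs4 (pip install html5lib).

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup: every row has a closing </tr>
# but no opening <tr>, like the page in the question.
html = (
    "<table><tbody>"
    "<td>1</td><td>Apple</td></tr>"
    "<td>2</td><td>Microsoft</td></tr>"
    "</tbody></table>"
)


def extract_rows(markup, parser):
    """Parse markup with the given parser and return the cell text of each <tr>."""
    soup = BeautifulSoup(markup, parser)
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")
    ]


# html.parser does not invent the missing <tr>, so no rows are found.
rows_plain = extract_rows(html, "html.parser")

# html5lib follows the HTML5 error-recovery rules and inserts the implied <tr>.
rows_html5 = extract_rows(html, "html5lib")

print("html.parser:", rows_plain)
print("html5lib:   ", rows_html5)
```

html5lib is noticeably slower than lxml or html.parser, so it makes sense to reach for it only when the markup is broken in a way the faster parsers don't repair.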
Answered By - Andrej Kesely