Saturday, January 27, 2024

[FIXED] Only <tr> tag is missing using requests library

beautifulsoup, python, web-scraping

Issue

I'm trying to build a simple web scraping tool. Right now I'm having trouble extracting data from each row because the opening <tr> tag is missing. (Only the opening <tr> tag is missing; the closing </tr> tag is still there.)

Below is my code

from bs4 import BeautifulSoup
import requests

url = "https://companiesmarketcap.com/dow-jones/largest-companies-by-market-cap/"
data = requests.get(url).text
print(data)

In the output, the opening <tr> tag is missing from every row; only the closing </tr> tag is present:

<tbody>
((THERES SUPPOSED TO BE A <tr> TAG HERE))))!!!
<td class="fav"><img alt="favorite icon" src="/img/fav.svg?v2" data-id="2"></td>
</td><td class="rank-td td-right" data-sort="1">1
</td><td class="name-td">
<div class="logo-container"><img loading="lazy" class="company-logo" alt="Apple logo" src="/img/company-logos/64/AAPL.png" data-img-path="/img/company-logos/64/AAPL.png" data-img-dark-path="/img/company-logos/64/AAPL.D.png"></div>
<div class="name-div"><a href="/apple/marketcap/"><div class="company-name">Apple</div>
<div class="company-code"><span class="rank d-none"></span>AAPL</div>
</a></div></td><td class="td-right" data-sort="2891576508416">$2.891 T</td><td class="td-right" data-sort="18592">$185.92</td><td data-sort="18" class="rh-sm"><span class="percentage-green"><svg class="a" viewBox="0 0 12 12"><path d="M10 8H2l4-4 4 4z"></path></svg>0.18%</span></td><td class="p-0 sparkline-td red"><svg><path d="M0,21 5,18 10,22 15,14 20,16 25,12 30,8 35,14 40,11 45,3 50,3 55,4 60,8 65,6 70,10 75,11 80,13 85,13 90,14 95,14 100,13 105,16 110,16 115,31 120,34 125,39 130,41 135,31 140,32 145,30 150,31 155,30" /></svg></td><td>🇺🇸 <span class="responsive-hidden">USA</span></td>
</tr>

Thank you!

I tried the following:

soup = BeautifulSoup(data, "lxml")
table = soup.find("table")
# print(table)
rows = table.find_all("tr")

but it doesn't work because, again, the opening <tr> tag is missing.

Solution

The issue is that the page's HTML is malformed. To parse it the way a browser does, use the html5lib parser (a separate package, installed with pip install html5lib):

import requests
from bs4 import BeautifulSoup

url = 'https://companiesmarketcap.com/dow-jones/largest-companies-by-market-cap/'

soup = BeautifulSoup(requests.get(url).content, 'html5lib')

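for tr in soup.table.select('tr'):
    # Collect the text of every non-empty cell in this row
    tds = [t for td in tr.select('td') if (t:=td.get_text(strip=True, separator=' '))]
    # Data rows have exactly 6 non-empty cells; skip anything else
    if len(tds) == 6:
        print(*tds, sep='\t')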

Prints:

1       Apple AAPL      $2.891 T        $185.92 0.18%   🇺🇸 USA
2       Microsoft MSFT  $2.887 T        $388.47 1.00%   🇺🇸 USA
3       Visa V  $542.91 B       $264.17 0.05%   🇺🇸 USA
4       JPMorgan Chase JPM      $488.72 B       $169.05 0.73%   🇺🇸 USA
5       UnitedHealth UNH        $482.35 B       $521.51 3.37%   🇺🇸 USA
6       Walmart WMT     $434.31 B       $161.32 0.13%   🇺🇸 USA
7       Johnson & Johnson JNJ   $390.91 B       $162.39 0.77%   🇺🇸 USA
8       Procter & Gamble PG     $354.94 B       $150.60 0.06%   🇺🇸 USA

...
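
To see why html5lib helps without fetching the live page, the repair can be reproduced on a small inline string. This is only a minimal sketch: the sample markup below is a made-up stand-in for the real page (each row lacks its opening <tr>, as in the question), and it assumes bs4 and html5lib are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the page: every row is missing its
# opening <tr>, while the closing </tr> is still there.
html = (
    "<table><tbody>"
    "<td>1</td><td>Apple</td></tr>"
    "<td>2</td><td>Microsoft</td></tr>"
    "</tbody></table>"
)

# html5lib follows the HTML5 parsing rules, which open an implied <tr>
# whenever a <td> appears directly inside <tbody>.
soup = BeautifulSoup(html, "html5lib")

# Each recovered <tr> becomes a list of its cell texts.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]
print(rows)
```

With this sample, find_all("tr") should return two rows, one per closing </tr>, because the parser inserts the missing opening tags while building the tree.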

Answered By - Andrej Kesely

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
