Issue
I use the same Python code to extract match results for this federation, and it works on every page except Group 3 (two example links below).
The code looks for all tables with soup.find_all("table", class_="table-sm"), which returns a variable number of tables depending on the number of matches on the page. The match tables begin at table 4, and those are the ones I use to get the results.
When the same code is applied to Group 3, it returns only 2 tables: the second one contains all the content below the page header.
import requests
from bs4 import BeautifulSoup

url_gr_2 = 'https://rfetm.es/resultados/2022-2023/view.php?liga=Mg==&grupo=2&subgrupo=S&jornada=1&sexo=M'
url_gr_3 = 'https://rfetm.es/resultados/2022-2023/view.php?liga=Mg==&grupo=3&subgrupo=S&jornada=1&sexo=M'

# Group 2: parsed as expected
req = requests.get(url_gr_2)
soup = BeautifulSoup(req.content, features="lxml")  # pass the response body, not the Response object
data = soup.find_all("table", class_="table-sm")
print(len(data))
# returns 9

# Group 3: same code, far fewer tables
req = requests.get(url_gr_3)
soup = BeautifulSoup(req.content, features="lxml")
data = soup.find_all("table", class_="table-sm")
print(len(data))
# returns 2
I tested the structure of the Group 3 page with a Chrome selector extension and the browser console, and the same selector matches 8 elements there, but not in bs4.
I also tried to iterate over the HTML elements of data[1] for Group 3, but bs4 does not return any elements.
I compared the HTML of both pages but could not find any significant difference.
What I expect is to be able to extract the results of this league as I do with all the other leagues on this site.
Solution
Use html.parser or html5lib as the parser for this page. lxml is stricter and does not parse this page the way a browser does.
import requests
from bs4 import BeautifulSoup

url = 'https://rfetm.es/resultados/2022-2023/view.php?liga=Mg==&grupo=3&subgrupo=S&jornada=1&sexo=M'
soup = BeautifulSoup(requests.get(url).content, 'html.parser') # <-- use `html.parser`
data = soup.find_all("table", class_="table-sm")
print(len(data))
Prints:
8
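To check whether the parser is what changes the result on a given page, one option is to run the same find_all call with each parser and compare the counts. The following is only a minimal sketch: the helper name count_tables and the SAMPLE_HTML string are hypothetical and not part of the original code, and parsers that are not installed are reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, well-formed sample used only for illustration.
SAMPLE_HTML = (
    '<html><body>'
    '<table class="table-sm"><tr><td>1</td></tr></table>'
    '<table class="table table-sm"><tr><td>2</td></tr></table>'
    '<table class="other"><tr><td>3</td></tr></table>'
    '</body></html>'
)


def count_tables(html, css_class="table-sm",
                 parsers=("lxml", "html.parser", "html5lib")):
    """Return {parser_name: number of matching tables} for the given HTML.

    A parser that is not installed maps to None.
    """
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            counts[parser] = None
            continue
        counts[parser] = len(soup.find_all("table", class_=css_class))
    return counts


if __name__ == "__main__":
    print(count_tables(SAMPLE_HTML))
```

For a real page, pass requests.get(url).content as the html argument. If the counts differ between parsers, the page markup is probably malformed in a way that each parser repairs differently.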
Answered By - Andrej Kesely