Issue
I am trying to extract the table from this website:
https://serviciosede.mineco.gob.es/indeco/reports/verSerieGraf.aspx/?codigo=230400&frec=-1
The table starts with the column headers "Fecha" and "Valor". I used the following HTML excerpt to identify the table:
<table cellspacing="0" cellpadding="0" cols="3" border="0" style="border-collapse:collapse;" class="A43f22eb6dc4f401d849b49a4e6cd447e46">
Despite that, the following code prints None:
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
driver4 = webdriver.Chrome()
goe_url = "https://serviciosede.mineco.gob.es/indeco/reports/verSerieGraf.aspx/?codigo=230400&frec=-1"
driver4.maximize_window()
driver4.get(goe_url)
time.sleep(5)
content4 = driver4.page_source.encode('utf-8').strip()
soup4 = BeautifulSoup(content4, "html.parser")
my_table = soup4.find('table',{'class':'A6dc892817a884dcfbbecd587b047171f46'})
print(my_table)
Does anybody know how I can fix this, and what code I should write so that I don't get None when printing my_table? Maybe I am querying the wrong HTML element. I am not sure where the problem is.
Thanks in advance.
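One detail in the snippet above may matter: the class in the HTML excerpt (A43f22eb6dc4f401d849b49a4e6cd447e46) is not the class passed to find (A6dc892817a884dcfbbecd587b047171f46). If the site regenerates that class on each render, which these two values suggest but do not prove, an exact class match will fail even when the table is present. A minimal sketch of an alternative is to locate the rows by their header text instead. It uses only the standard library, the sample markup is invented for illustration, and RowCollector and rows_after_header are hypothetical helper names. It can only help once the table is actually part of the HTML being parsed.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, appended when their </tr> is seen
        self._open = []        # stack of rows still being read (nested tables)
        self._in_cell = False  # True while inside the innermost <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._open.append([])
            self._in_cell = False
        elif tag in ("td", "th") and self._open:
            self._open[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._open:
            self.rows.append(self._open.pop())

    def handle_data(self, data):
        if self._in_cell and self._open and self._open[-1]:
            self._open[-1][-1] += data.strip()


def rows_after_header(html, header=("Fecha", "Valor")):
    """Return the rows that follow the header row, ignoring empty cells."""
    parser = RowCollector()
    parser.feed(html)
    # Drop empty spacer cells so ["", "Fecha", "Valor"] still matches the header.
    rows = [[cell for cell in row if cell] for row in parser.rows]
    if list(header) not in rows:
        return []
    start = rows.index(list(header)) + 1
    result = []
    for row in rows[start:]:
        if len(row) != len(header):
            break  # stop at the first row that no longer looks like data
        result.append(row)
    return result


# Invented sample markup; the real page's structure is an assumption.
sample = """
<table class="A43f22eb6dc4f401d849b49a4e6cd447e46" cols="3">
  <tr><td></td><td>Fecha</td><td>Valor</td></tr>
  <tr><td></td><td>8/2023</td><td>8518</td></tr>
  <tr><td></td><td>7/2023</td><td>12676</td></tr>
</table>
"""

data_rows = rows_after_header(sample)
print(data_rows)
```

Because the lookup keys on the visible header text, it does not depend on whatever class name the server emits for a given page load.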
Solution
There's a .NET backend on this website and you can mimic the request without selenium.
Here's how:
from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
# Used to parse the payload into a dictionary
# (needs `import json` and `from urllib.parse import parse_qsl`)
# print(json.dumps(dict(parse_qsl(payload)), indent=4))
url = "https://serviciosede.mineco.gob.es/indeco/reports/verSerieGraf.aspx/?codigo=230400&frec=-1"
payload_data = {
    "__EVENTTARGET": "ReportViewer1$_ctl9$Reserved_AsyncLoadTarget",
    "__VIEWSTATE": "",
    "__VIEWSTATEGENERATOR": "4B866612",
    "__EVENTVALIDATION": "",
    "ReportViewer1:_ctl11": "standards",
    "ReportViewer1:AsyncWait:HiddenCancelField": "False",
    "ReportViewer1:ToggleParam:collapse": "false",
    "ReportViewer1:_ctl7:collapse": "false",
    "ReportViewer1:_ctl9:VisibilityState:_ctl0": "None",
    "ReportViewer1:_ctl9:ReportControl:_ctl4": "100"
}
with requests.Session() as s:
    soup = BeautifulSoup(s.get(url).text, "lxml")
    # Get the viewstate and eventvalidation values first
    payload_data["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]
    payload_data["__EVENTVALIDATION"] = soup.select_one("#__EVENTVALIDATION")["value"]
    # Now the table should be in the source HTML
    table_data = s.post(url, data=payload_data)
# Do some pandas magic to get the table data
df = pd.read_html(StringIO(table_data.text))[-3]
df = df.drop(df.columns[0], axis=1)
df.dropna(inplace=True)
df.columns = df.iloc[0]
df = df.iloc[1:]
print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))
Output:
+---------+---------+
| Fecha | Valor |
|---------+---------|
| 8/2023 | 8518 |
| 7/2023 | 12676 |
| 6/2023 | 11967 |
| 5/2023 | 12059 |
| 4/2023 | 9740 |
| 3/2023 | 13057 |
| 2/2023 | 11409 |
| 1/2023 | 10430 |
| 12/2022 | 9943 |
| 11/2022 | 13934 |
| 10/2022 | 13292 |
| 9/2022 | 12948 |
| 8/2022 | 7854 |
| 7/2022 | 11507 |
| 6/2022 | 12232 |
| 5/2022 | 10114 |
| 4/2022 | 8575 |
| 3/2022 | 13742 |
| 2/2022 | 10709 |
| 1/2022 | 10994 |
| 12/2021 | 11236 |
| 11/2021 | 13042 |
| 10/2021 | 12482 |
| 9/2021 | 14193 |
| 8/2021 | 8084 |
| 7/2021 | 12583 |
| 6/2021 | 11702 |
| 5/2021 | 11750 |
| 4/2021 | 11945 |
| 3/2021 | 11595 |
| 2/2021 | 10465 |
| 1/2021 | 9698 |
| 12/2020 | 9316 |
| 11/2020 | 11836 |
| 10/2020 | 9845 |
| 9/2020 | 11212 |
| 8/2020 | 7190 |
| 7/2020 | 9349 |
| 6/2020 | 9767 |
| 5/2020 | 8619 |
| 4/2020 | 5899 |
| 3/2020 | 8384 |
| 2/2020 | 11551 |
| 1/2020 | 10796 |
| 12/2019 | 10165 |
| 11/2019 | 9427 |
| 10/2019 | 12212 |
| 9/2019 | 10816 |
| 8/2019 | 6863 |
| 7/2019 | 15208 |
| 6/2019 | 12110 |
| 5/2019 | 12661 |
| 4/2019 | 12189 |
| 3/2019 | 12236 |
| 2/2019 | 11758 |
| 1/2019 | 11731 |
+---------+---------+
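The key step above is the round trip of the hidden form fields: ASP.NET Web Forms pages keep their state in hidden inputs such as __VIEWSTATE and __EVENTVALIDATION, and a postback is only accepted when those values are sent back. As a rough, offline illustration of that step, the sketch below collects every hidden input from a page and adds the __EVENTTARGET. It uses only the standard library, the sample form and its values are made up, and HiddenInputs and build_postback are hypothetical names. It also reads fields by name rather than by id, which assumes the two match for these fields.

```python
from html.parser import HTMLParser


class HiddenInputs(HTMLParser):
    """Collect name -> value for every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <input ... /> tags are routed here as well.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(page_html, event_target):
    """Return a form payload: all hidden fields plus the postback target."""
    parser = HiddenInputs()
    parser.feed(page_html)
    payload = dict(parser.fields)
    payload["__EVENTTARGET"] = event_target
    return payload


# Made-up form for illustration; real field values are long opaque strings.
sample_page = """
<form method="post" id="form1">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwxMjM0NTY3ODk7Oz4=" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="4B866612" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAgK1" />
  <input type="text" name="q" value="ignored" />
</form>
"""

payload = build_postback(sample_page, "ReportViewer1$_ctl9$Reserved_AsyncLoadTarget")
for name, value in payload.items():
    print(name, "=", value)
```

Collecting all hidden inputs at once, instead of naming each one, may be a little more tolerant of pages that add further state fields, though the ReportViewer-specific keys would still have to be supplied by hand as in the answer's payload_data.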
Answered By - baduker