Issue
I'm trying to get a table using BeautifulSoup, but I'm getting an error when using the find method.
I want to get the headers of a table from here: https://stooq.pl/t/?i=513&v=1&l=1
The id of the table I'm interested in is fth1, and the HTML looks like this:
<table class="fth1" id="fth1" width="100%" cellspacing="0" cellpadding="3" border="0">
  <thead style="background-color:e9e9e9">
    <tr align="center">
      <th id="f13">
        <a href="t/?i=513&v=1&o=1">Symbol</a>
      </th>
      <th id="f13">
        <a href="t/?i=513&v=1&o=2">Nazwa</a>
      </th>
      ...
My Python script:
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
page = requests.get('https://stooq.pl/t/?i=513&v=1&l=1')
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('table', {'id': "fth1"})
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
print(headers)
I got the error:
Traceback (most recent call last):
  File "/home/.../script.py", line 25, in <module>
    for i in table1.find_all('th'):
AttributeError: 'NoneType' object has no attribute 'find_all'
I found that the variable table1 is None.
I've tried using html.parser and html5lib instead of lxml, but with no success.
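A quick way to narrow it down seems to be checking the raw response itself, something like this (the exact output depends on what the server sends back):
import requests
page = requests.get('https://stooq.pl/t/?i=513&v=1&l=1')
print(page.status_code)        # the request itself may come back as 200...
print('fth1' in page.text)     # ...yet the table id can still be missing from the returned HTML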
What am I doing wrong to get this error?
Solution
You can still scrape the site; to do so, you need to copy your cookies/headers from your browser and inject them into the request. If you go to the Network tab in your browser's developer tools, find the HTML document and inspect it, or right-click it and copy as cURL, you can then convert that to Python.
Your request would then look something like this (but with your cookies):
import requests
from bs4 import BeautifulSoup
cookies = {
    'cookie_uu': '',
    'privacy': '',
    'PHPSESSID': '',
    'uid': '',
    'cookie_user': '',
    '_ga': '',
    '_gid': '',
    '__gads': '',
    'FCCDCF': '',
    'FCNEC': '',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.5',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
params = {
    'i': '513',
    'v': '1',
    'l': '1',
}
response = requests.get('https://stooq.pl/t/', params=params, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
table = soup.find('table', {'id': "fth1"})
headers = [i.text for i in table.find_all('th')]
print(headers)
This returns:
['Symbol', 'Nazwa', 'Otwarcie', 'Max', 'Min', 'Kurs', 'Zmiana', 'Wolumen', 'Obrót', 'Data', '']
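If you also want the data rows and not just the headers, you can keep going from the same table element. A rough sketch, assuming the data rows use plain <td> cells and pairing each cell with the headers scraped above:
rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row only contains <th> cells, so it is skipped here
        rows.append(dict(zip(headers, cells)))
print(rows[:3])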
Answered By - Sam