Issue
I wrote a program that scrapes the table at the URL below. Extracting the table with BeautifulSoup works fine, but when I pass the result to pd.read_html() to convert the HTML into a DataFrame, I get an error and I don't understand why.
Here is the code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
header = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
url = "https://www.worldometers.info/coronavirus/#countries"
req = requests.get(url, headers=header)
# check the server response (client access permission)
""" Expected: response status 200, access not forbidden """
# BeautifulSoup step: extract the data found in the HTML page at the link
soup = BeautifulSoup(req.content, 'lxml')
tables = soup.find('table',{'id':'main_table_countries_today'})
df = pd.read_html(tables)
print(df)
After running it:
Traceback (most recent call last):
File "c:\Users\pc\Desktop\manual scrap\scrap 1\covid find all data.py", line 18, in <module>
df = pd.read_html(tables)
File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\pandas\util\_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\pandas\io\html.py", line 1085, in read_html
return _parse(
File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\pandas\io\html.py", line 893, in _parse
tables = p.parse_tables()
File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\pandas\io\html.py", line 213, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\pandas\io\html.py", line 717, in _build_doc
r = parse(self.io, parser=parser)
File "C:\Users\pc\AppData\Roaming\Python\Python39\site-packages\lxml\html\__init__.py", line 939, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "src\lxml\etree.pyx", line 3521, in lxml.etree.parse
File "src\lxml\parser.pxi", line 1875, in lxml.etree._parseDocument
TypeError: 'NoneType' object is not callable
The program should print the table data when it runs, using pd.read_html().
Solution
Convert the soup to a string before passing it into .read_html():
import requests
from bs4 import BeautifulSoup
import pandas as pd
header = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}
url = "https://www.worldometers.info/coronavirus/#countries"
req = requests.get(url, headers=header)
# check the server response (client access permission)
""" Expected: response status 200, access not forbidden """
# BeautifulSoup step: extract the data found in the HTML page at the link
soup = BeautifulSoup(req.content, "lxml")
tables = soup.find("table", {"id": "main_table_countries_today"})
df = pd.read_html(str(tables).upper())[0] # <-- convert the soup to str first
print(df)
Prints:
# COUNTRY,OTHER TOTALCASES NEWCASES TOTALDEATHS NEWDEATHS TOTALRECOVERED NEWRECOVERED ACTIVECASES SERIOUS,CRITICAL TOT CASES/1M POP DEATHS/1M POP TOTALTESTS TESTS/ 1M POP POPULATION CONTINENT 1 CASEEVERY X PPL 1 DEATHEVERY X PPL 1 TESTEVERY X PPL NEW CASES/1M POP NEW DEATHS/1M POP ACTIVE CASES/1M POP
0 NaN NORTH AMERICA 48516635 142956.0 1000760.0 2587.0 37927770.0 60945.0 9588105.0 32950.0 NaN NaN NaN NaN NaN NORTH AMERICA NaN NaN NaN NaN NaN NaN
1 NaN ASIA 70629576 250387.0 1043218.0 3825.0 65875219.0 257280.0 3711139.0 40409.0 NaN NaN NaN NaN NaN ASIA NaN NaN NaN NaN NaN NaN
2 NaN SOUTH AMERICA 36987244 27316.0 1132455.0 757.0 34786052.0 1198.0 1068737.0 15195.0 NaN NaN NaN NaN NaN SOUTH AMERICA NaN NaN NaN NaN NaN NaN
3 NaN EUROPE 55607330 135394.0 1176585.0 1575.0 50739008.0 129944.0 3691737.0 11644.0 NaN NaN NaN NaN NaN EUROPE NaN NaN NaN NaN NaN NaN
4 NaN AFRICA 7902675 11584.0 197645.0 305.0 7027951.0 16775.0 677079.0 4464.0 NaN NaN NaN NaN NaN AFRICA NaN NaN NaN NaN NaN NaN
5 NaN OCEANIA 166231 1768.0 2196.0 8.0 118215.0 1108.0 45820.0 237.0 NaN NaN NaN NaN NaN AUSTRALIA/OCEANIA NaN NaN NaN NaN NaN NaN
6 NaN NaN 721 NaN 15.0 NaN 706.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN WORLD 219810412 569405.0 4552874.0 9057.0 196474921.0 467250.0 18782617.0 104899.0 28200.0 584.1 NaN NaN NaN ALL NaN NaN NaN NaN NaN NaN
8 1.0 USA 40449279 115125.0 661200.0 1237.0 31175601.0 37665.0 8612478.0 25675.0 121372.0 1984.0 588098779.0 1764643.0 3.332678e+08 NORTH AMERICA 8.0 504.0 1.0 345.00 4.00 25843.0
9 2.0 INDIA 32902293 45430.0 439900.0 341.0 32056062.0 34965.0 406331.0 8944.0 23572.0 315.0 524868734.0 376027.0 1.395828e+09 ASIA 42.0 3173.0 3.0 33.00 0.20 291.0
10 3.0 BRAZIL 20830495 26280.0 581914.0 686.0 19775873.0 NaN 472708.0 8318.0 97192.0 2715.0 56897224.0 265475.0 2.143224e+08 SOUTH AMERICA 10.0 368.0 4.0 123.00 3.00 2206.0
11 4.0 RUSSIA 6956318 18985.0 184812.0 798.0 6218048.0 18669.0 553458.0 2300.0 47644.0 1266.0 179500000.0 1229389.0 1.460075e+08 EUROPE 21.0 790.0 1.0 130.00 5.00 3791.0
...and so on.
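The fix above can be tried offline with a minimal sketch. The inline HTML here is a made-up stand-in for the downloaded page (only the table id is taken from the question). A likely explanation for the original TypeError is that a BeautifulSoup Tag treats unknown attribute access as a child-tag lookup, so `tables.read` evaluates to None instead of raising AttributeError, and the parser then tries to call it. Wrapping the string in `io.StringIO` is an optional extra: recent pandas versions (2.1 and later, as far as I know) warn when given a literal HTML string.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the real page (assumption for illustration).
html = """
<html><body>
<table id="main_table_countries_today">
  <thead><tr><th>Country</th><th>TotalCases</th></tr></thead>
  <tbody>
    <tr><td>USA</td><td>100</td></tr>
    <tr><td>India</td><td>200</td></tr>
  </tbody>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "lxml")
table = soup.find("table", {"id": "main_table_countries_today"})

# A Tag answers unknown attributes with a child-tag search, so this is None.
print(table.read)

# str(table) serializes the Tag back to HTML markup;
# StringIO turns that markup into a file-like object for read_html.
df = pd.read_html(StringIO(str(table)))[0]
print(df)
```

Note that read_html always returns a list of DataFrames, which is why the `[0]` index is needed even when only one table is parsed.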
Answered By - Andrej Kesely
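As a possible alternative (not part of the answer itself), pd.read_html() can select a table by its HTML attributes through the `attrs` parameter, which may make the BeautifulSoup step unnecessary. This is a sketch on made-up inline HTML; with the real page you would pass the response text (for example `req.text`) instead of the `html` variable, which is an assumption here.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second has the id we want.
html = """
<html><body>
<table id="other_table">
  <tr><th>A</th></tr>
  <tr><td>1</td></tr>
</table>
<table id="main_table_countries_today">
  <tr><th>Country</th><th>TotalCases</th></tr>
  <tr><td>USA</td><td>100</td></tr>
</table>
</body></html>
"""

# attrs restricts parsing to tables whose attributes match the given dict.
tables = pd.read_html(StringIO(html), attrs={"id": "main_table_countries_today"})
df_alt = tables[0]
print(df_alt)
```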