Issue
I'm getting a ValueError when trying to parse a page with BeautifulSoup and html5lib in Jupyter:
import pandas as pd
import requests
import html5lib
url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"
r = requests.get(url)
df_list = pd.read_html(r.text) # this parses all the tables in webpages to a list
df = df_list[0]
df.head()
ValueError Traceback (most recent call last)
Cell In[1], line 9
6 url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"
8 r = requests.get(url)
----> 9 df_list = pd.read_html(r.text) # this parses all the tables in webpages to a list
10 df = df_list[0]
11 df.head()
File D:\Drivers\Anaconda\lib\site-packages\pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
325 if len(args) > num_allow_args:
326 warnings.warn(
327 msg.format(arguments=_format_argument_list(allow_args)),
328 FutureWarning,
329 stacklevel=find_stack_level(),
330 )
--> 331 return func(*args, **kwargs)
File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:1205, in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only, extract_links)
1201 validate_header_arg(header)
1203 io = stringify_path(io)
-> 1205 return _parse(
1206 flavor=flavor,
1207 io=io,
1208 match=match,
1209 header=header,
1210 index_col=index_col,
1211 skiprows=skiprows,
1212 parse_dates=parse_dates,
1213 thousands=thousands,
1214 attrs=attrs,
1215 encoding=encoding,
1216 decimal=decimal,
1217 converters=converters,
1218 na_values=na_values,
1219 keep_default_na=keep_default_na,
1220 displayed_only=displayed_only,
1221 extract_links=extract_links,
1222 )
File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:1006, in _parse(flavor, io, match, attrs, encoding, displayed_only, extract_links, **kwargs)
1004 else:
1005 assert retained is not None # for mypy
-> 1006 raise retained
1008 ret = []
1009 for table in tables:
File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:986, in _parse(flavor, io, match, attrs, encoding, displayed_only, extract_links, **kwargs)
983 p = parser(io, compiled_match, attrs, encoding, displayed_only, extract_links)
985 try:
--> 986 tables = p.parse_tables()
987 except ValueError as caught:
988 # if `io` is an io-like object, check if it's seekable
989 # and try to rewind it before trying the next parser
990 if hasattr(io, "seekable") and io.seekable():
File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:262, in _HtmlFrameParser.parse_tables(self)
254 def parse_tables(self):
255 """
256 Parse and return all tables from the DOM.
257
(...)
260 list of parsed (header, body, footer) tuples from tables.
261 """
--> 262 tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
263 return (self._parse_thead_tbody_tfoot(table) for table in tables)
File D:\Drivers\Anaconda\lib\site-packages\pandas\io\html.py:618, in _BeautifulSoupHtml5LibFrameParser._parse_tables(self, doc, match, attrs)
615 tables = doc.find_all(element_name, attrs=attrs)
617 if not tables:
--> 618 raise ValueError("No tables found")
620 result = []
621 unique_tables = set()
ValueError: No tables found
I've also tried parsing the page in Jupyter with
BeautifulSoup(html.text, 'html.parser')
but the result doesn't match what the browser shows: the tables are missing from the parsed content.
I read that this is possible with Selenium or PyCharm.
It should apparently also be possible with pandas and html5lib, but I've never used them and don't know what the approach should be.
Is there something specific about html5lib? Is there anything wrong with my simple code? Are there other ways to parse tables in a web page, such as lxml? Where should I look to decide?
Solution
The data is in the page, but it is turned into a table by JavaScript. pandas cannot execute JavaScript, so read_html never sees that table and raises "No tables found". Since you're already importing the requests package, here is one way of obtaining that GDP data: use requests to retrieve the page, BeautifulSoup to parse the HTML response and isolate the element holding the data, and json to parse that element and get the actual data:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import json
url = "https://worldpopulationreview.com/countries/countries-by-gdp/#worldCountries"
r = requests.get(url)
soup = bs(r.text, 'html.parser')
elem_w_data = soup.select_one('script[id="__NEXT_DATA__"]').text
df = pd.json_normalize(json.loads(elem_w_data)['props']['pageProps']['data'])
print(df)
Result in terminal:
pop id imfGDP unGDP country gdpPerCapita continent
0 3.399966e+05 840 2.669515e+13 18624475000000 United States 7.851594e+04 North America
1 5.050000e-03 840 2.669515e+13 18624475000000 United States 5.286168e+12 North America
2 1.425671e+06 156 2.186548e+13 11218281029298 China 1.533697e+04 Asia
3 -1.500000e-04 156 2.186548e+13 11218281029298 China -1.457699e+14 Asia
4 1.232945e+05 392 5.291351e+12 4936211827875 Japan 4.291635e+04 Asia
... ... ... ... ... ... ... ...
419 8.260000e-03 788 0.000000e+00 41703561397 Tunisia 5.048857e+09 Africa
420 4.606200e+01 796 0.000000e+00 917550492 Turks and Caicos Islands 1.991990e+04 North America
421 7.860000e-03 796 0.000000e+00 917550492 Turks and Caicos Islands 1.167367e+08 North America
422 3.674463e+04 804 0.000000e+00 93270354852 Ukraine 2.538339e+03 Europe
423 -7.448000e-02 804 0.000000e+00 93270354852 Ukraine -1.252287e+09 Europe
424 rows × 7 columns
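Before choosing a parser, it can help to check what the raw HTML actually contains. The following is a minimal sketch using only the standard library; it is a diagnostic aid, not part of the approach above. The NextDataFinder class, the inspect_html helper, and the sample_html string (with its made-up "Exampleland" row) are all hypothetical stand-ins for r.text. The ['props']['pageProps']['data'] path is the one used in the code above and may stop working if the site changes its layout.

```python
import json
from html.parser import HTMLParser


class NextDataFinder(HTMLParser):
    """Counts <table> tags and captures the text of <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self.table_count = 0
        self.payload = None        # raw JSON text, set once the script tag is seen
        self._in_target = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_count += 1
        elif tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._in_target = True
            self.payload = ""

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_target:
            self.payload += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_target = False


def inspect_html(html):
    """Return (number of <table> tags, parsed __NEXT_DATA__ dict or None)."""
    finder = NextDataFinder()
    finder.feed(html)
    finder.close()
    data = json.loads(finder.payload) if finder.payload else None
    return finder.table_count, data


# Hypothetical stand-in for r.text: no <table>, but an embedded JSON payload.
sample_html = """
<html><body>
<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"data": [{"country": "Exampleland", "imfGDP": 1.5}]}}}
</script>
</body></html>
"""

table_count, next_data = inspect_html(sample_html)
rows = next_data["props"]["pageProps"]["data"]
print("tables found:", table_count)
print("rows in payload:", rows)
```

If the table count is zero while the payload is present, read_html has nothing to parse in the static HTML and the embedded JSON is the place to look. If both are missing, the data is probably loaded by a separate request, and a browser-driving tool such as Selenium may be needed.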
Relevant documentation: pandas, requests, BeautifulSoup.
Answered By - Barry the Platipus