Issue
I am trying to extract multiple tables from the BLS website and am hitting a roadblock of sorts. I can extract the tables, I looked at other SO posts for that - (How to read an html table with multiple tbodies with python pandas' read_html?). The tables that I extract however are not formatted correctly and the data is all over the place. Here is what I have done till now;
import urllib
import pandas as pd
from bs4 import BeautifulSoup
def new_func():
url = input('Please enter the BLS publication that you want to scrape table from:')
return url
url = new_func()
data = urllib.request.urlopen(url).read()
sp = BeautifulSoup(data,'html.parser')
#unwrap multiple table tags in the html
for table in sp.findChildren(attrs={'id': 'regular'}):
for c in table.children:
if c.name in ['tbody', 'thead']:
c.unwrap()
#dataset creation
data_pandas = pd.read_html(str(sp), flavor="bs4",thousands=',',decimal='.')
#clean dataset
df = pd.concat(data_pandas, axis=0)#to convert lists of pd.read_html to dataframe
#export to csv
df.to_csv(input('Specify .csv filename:'))
The code works for a page with a single table but when it is a url like - https://www.bls.gov/regions/new-york-new-jersey/news-release/2022/occupationalemploymentandwages_kingston_20220616.htm, it seems to stop working and messes up the formatting.
I apologize if my code is messy or redundant in certain blocks - this is my first foray into Python and am struggling to figure this out. Any help or guidance on what I should look to solve this issue is appreciated.
Solution
If possible, it's probably easier to download the data as Excel files and process them directly: https://www.bls.gov/oes/current/oes_28740.htm and https://www.bls.gov/oes/current/oes_nat.htm (from "Related Links" on the news release page).
The main reason it's failing with multiple tables is that you merge all the tables into one, which doesn't work nicely if they don't have the same format. Replace the pd.concat()
call with df = data_pandas[0]
to get the first table, and/or ask the user which table they want and pass that to the match
argument of pd.read_html()
.
Here's an improved version that also cleans up some of the extra formatting (footnotes, dollar signs, etc.):
import re
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
def new_func():
url = input("Please enter the BLS publication that you want to scrape table from: ")
return url
url = new_func()
data = urllib.request.urlopen(url).read()
sp = BeautifulSoup(data, "html.parser")
tables = sp.find_all("table", class_="regular")
for table in tables:
# merge multiple tbody tags into the first one
all_tbodies = table.find_all("tbody", recursive=False)
if len(all_tbodies) > 1:
main_tbody = all_tbodies[0]
for tbody in all_tbodies[1:]:
main_tbody.append(tbody.extract())
tbody.unwrap()
# find footnote markers and remove footer
footnote_markers = []
for tfoot in table.find_all("tfoot", recursive=False):
for s in tfoot.strings:
if " " in s:
parts = s.split(maxsplit=1)
footnote_markers.append(parts[0])
tfoot.decompose()
# remove footnote links
for footnoteid in table.select("a > .footnoteid"):
footnoteid.parent.decompose()
# strip footnote markers and dollar signs from table cells
if footnote_markers:
footnote_re = re.compile(r"(\s*(" + "|".join(map(re.escape, footnote_markers)) + "))+$")
for td in table.find_all("td"):
if td.string:
td.string = footnote_re.sub("", td.string).lstrip("$")
match = ".+"
if len(tables) > 1:
match = re.escape(
input("Multiple tables found, please enter the table name to export (e.g. Table A): ")
)
# dataset creation
data_pandas = pd.read_html(str(sp), match=match, flavor="bs4", thousands=",", decimal=".")
df = data_pandas[0]
# export to csv
df.to_csv(input("Specify .csv filename: "))
Answered By - yut23
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.