Issue
I'm still new to Python, and thanks to everyone for the earlier help. I am trying to parse a web-scraped bs4 element that contains no tables into a DataFrame. The data I need sits inside a 'pre' tag. I thought read_html with the right attrs would work, but I'm getting a None value from the bs4 element.
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
url = 'https://www.usbr.gov/pn-bin/instant.pl?parameter=CHRO%20q&syer=2022&smnth=7&sdy=12&eyer=2022&emnth=7&edy=19&format=2'
response = requests.get(url, headers=headers)  # reply from the website
soup = BeautifulSoup(response.text, 'html5lib')  # parse the HTML with the html5lib parser
data = soup.select('pre')[1]  # select the second <pre> block, which holds the needed data
# print(data.text.strip())  # prints the data
tables = pd.read_html(data, attrs={'pre': 'table'})  # fails: the <pre> holds plain text, not a <table>
df1 = pd.DataFrame(tables, index=None)
Solution
Since the <pre> block is plain text rather than an HTML table, read_html cannot parse it. Instead, wrap the text in io.StringIO and feed it to pandas read_csv():
import io
from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
url = 'https://www.usbr.gov/pn-bin/instant.pl?parameter=CHRO%20q&syer=2022&smnth=7&sdy=12&eyer=2022&emnth=7&edy=19&format=2'
response = requests.get(url, headers=headers)  # reply from the website
soup = BeautifulSoup(response.text, 'html5lib')  # parse the HTML with the html5lib parser
data = soup.select('pre')[1]  # select the second <pre> block, which holds the needed data
# print(data.text.strip())  # prints the data
df = pd.read_csv(io.StringIO(data.text))  # read_csv accepts any file-like object
df = df.xs('BEGIN DATA', axis=1, drop_level=True)  # keep the column headed by the 'BEGIN DATA' marker
print(df.iloc[:-1])  # drop the trailing 'END DATA' row
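The core idea can be tested without the network call: read_csv accepts any file-like object, so io.StringIO turns the scraped text into one. A minimal offline sketch, using a made-up snippet that mimics the feed's BEGIN DATA / END DATA wrapper (the column names and values here are assumptions for illustration, not the real feed):

```python
import io
import pandas as pd

# Hypothetical stand-in for data.text scraped from the <pre> block
pre_text = """BEGIN DATA
DATE       TIME   CHRO Q
07/12/2022 00:00  10.60
07/12/2022 00:15  10.60
END DATA"""

# Skip the marker and header lines, drop the END DATA footer,
# and let runs of whitespace delimit the three columns
df = pd.read_csv(io.StringIO(pre_text),
                 skiprows=2, skipfooter=1,
                 sep=r"\s+", engine="python",
                 names=["DATE", "TIME", "FLOW"])
print(df)
```

Naming the columns explicitly (names=) sidesteps the feed's two-word "CHRO Q" header, which whitespace splitting would otherwise break into two columns; skipfooter requires the python engine.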
Output:
DATE TIME CHRO Q
07/12/2022 00:00 10.60
07/12/2022 00:15 10.60
07/12/2022 00:30 10.60
07/12/2022 00:45 10.60
...
07/19/2022 22:45 9.36
07/19/2022 23:00 9.36
07/19/2022 23:15 9.36
07/19/2022 23:30 9.36
07/19/2022 23:45 9.36
Length: 769
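Once the rows are in a DataFrame, the DATE and TIME text columns can be combined into a proper DatetimeIndex for time-series work. A hedged sketch on a hypothetical frame shaped like the output above (the FLOW column name is an assumption):

```python
import pandas as pd

# Hypothetical frame shaped like the parsed output above
df = pd.DataFrame({
    "DATE": ["07/12/2022", "07/12/2022"],
    "TIME": ["00:00", "00:15"],
    "FLOW": [10.60, 10.60],
})

# Combine the date and time strings into a DatetimeIndex
df.index = pd.to_datetime(df["DATE"] + " " + df["TIME"],
                          format="%m/%d/%Y %H:%M")
df = df.drop(columns=["DATE", "TIME"])
print(df)
```

With a DatetimeIndex in place, resampling (e.g. df.resample("h").mean()) and date-range slicing work directly.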
Answered By - F.Hoque