Issue
I am new to Beautiful Soup. I am working on scraping some Excel files from a source.
URL to the data: https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date=
Original data source: https://droughtmonitor.unl.edu/DmData/GISData.aspx/
My main objective is to scrape the data from this URL and convert it into a data frame covering all the files at the original data source URL, and also to have any newly added files downloaded automatically and appended.
from bs4 import BeautifulSoup
import requests
import json
import pandas as pd
url2 = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date=2022-06-21'
r2 = requests.get(url2)
soup = BeautifulSoup(r2.text,'html.parser')
raw_data = [data.text for data in soup]
The above code gives me this output:
["MapDate,FIPS,County,State,Nothing,D0,D1,D2,D3,D4,ValidStart,ValidEnd\r\n20220621,01001,Autauga County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01003,Baldwin County,AL,81.22,18.78,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01005,Barbour County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01007,Bibb County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01009,Blount County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01011,Bullock County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01013,Butler County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01015,Calhoun County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01017,Chambers County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01019,Cherokee County,AL,69.27,30.73,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01021,Chilton County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n
I want the initial 12 values (MapDate,FIPS,County,State,Nothing,D0,D1,D2,D3,D4,ValidStart,ValidEnd) to be my column names, and the rest to be the values associated with them.
Also, the original data source URL has values from the year 2000 to 2022. I need all the data in the same format and in a single CSV.
Finally, I need the code written in such a way that it will automatically extract any new data added to the source.
Can someone guide me on this?
Solution
The server sends a CSV file, so you don't need BeautifulSoup. You can use io.StringIO with pandas.read_csv():
import requests
import pandas as pd
import io
url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date=2022-06-21'
response = requests.get(url)
fh = io.StringIO(response.text) # create file in memory
df = pd.read_csv(fh)
print(df)
Or you can use io.StringIO with the csv module:
import requests
import csv
import io
url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date=2022-06-21'
response = requests.get(url)
fh = io.StringIO(response.text) # create file in memory
data = list(csv.reader(fh))
print(data)
EDIT:
You can even use the URL directly with pandas:
import pandas as pd
url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date=2022-06-21'
df = pd.read_csv(url)
print(df)
EDIT:
Now you only need a list of dates and a for-loop to read every CSV and collect the frames in a list. Later you can use pandas.concat() to convert this list into a single DataFrame.
Pandas doc: Merge, join, concatenate and compare
Minimal working example:
import pandas as pd
# --- before loop ---
all_dates = ["2022-06-21", "2022-06-14", "2022-06-07"]
all_dfs = []
# url without `2022-06-21` at the end
url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date='
# --- loop ---
for date in all_dates:
    print('date:', date)
    df = pd.read_csv(url + date)
    all_dfs.append(df)
# --- after loop ---
full_df = pd.concat(all_dfs)
print(full_df)
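To get everything into the single CSV file the question asks for, the combined frame can be written out with to_csv(). A minimal sketch (the two small frames stand in for the per-date results, and the filename is just an example):

```python
import pandas as pd

# two small frames standing in for the per-date results from the loop
df_a = pd.DataFrame({"FIPS": ["01001"], "D0": [0.00]})
df_b = pd.DataFrame({"FIPS": ["01003"], "D0": [18.78]})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
full_df = pd.concat([df_a, df_b], ignore_index=True)

# write all rows to one CSV; index=False drops the row numbers
full_df.to_csv("drought_all.csv", index=False)
```

With ignore_index=True the concatenated frame does not keep the duplicated 0-based indexes from each weekly file.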
To get the list of dates you could scrape them from the webpage, but that may require Selenium instead of Beautiful Soup, because the page uses JavaScript to add the dates to the page.
Or you can use DevTools (tab: Network, filter: XHR) to see what URL JavaScript uses to get the dates, and fetch them with requests:
import requests
# without header `Content-Type` it sends `HTML` instead of `JSON`
headers = {
    # 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    # 'X-Requested-With': 'XMLHttpRequest',
    # 'Referer': 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/',
    'Content-Type': 'application/json; charset=utf-8',
}
url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/ReturnDMWeeks'
response = requests.get(url, headers=headers)
#print(response.text)
data = response.json()
all_dates = data['d']
all_dates = [f"{d[:4]}-{d[4:6]}-{d[6:]}" for d in all_dates]
print(all_dates)
Result
['2022-06-21', '2022-06-14', '2022-06-07', ..., '2000-01-18', '2000-01-11', '2000-01-04']
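To automatically pick up new data added to the source, one approach is to compare the date list from ReturnDMWeeks against the dates already stored locally and download only the missing weeks. A minimal sketch of that comparison (missing_dates is a hypothetical helper; in practice the stored dates would come from the ValidStart column of your saved CSV):

```python
def missing_dates(all_dates, existing_dates):
    """Return the dates available on the server that have not been
    downloaded yet, preserving the server's newest-first order."""
    have = set(existing_dates)
    return [d for d in all_dates if d not in have]

# example: the server reports three weeks, two are already stored
server_dates = ["2022-06-21", "2022-06-14", "2022-06-07"]
stored_dates = ["2022-06-14", "2022-06-07"]
print(missing_dates(server_dates, stored_dates))  # ['2022-06-21']
```

Running the download loop only over these missing dates, then appending the new rows to the existing CSV, keeps the file up to date without re-downloading 20+ years of weekly files.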
Answered By - furas