Sunday, April 10, 2022

[FIXED] Click a date range button and crawl one HTML table in Python

Labels: beautifulsoup, dataframe, pandas, python-3.x, python-requests

Issue

I am trying to crawl the data of a small table from https://oilprice.com/rig-count (the process was illustrated in a figure). My code so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://oilprice.com/rig-count'
# html = urllib.request.urlopen(url)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

contents = soup.find_all('div', {'class': 'info_table'})
print(contents[0].children)

rows = []
for child in contents[0].children:
    row = []
    for td in child:
        print(td)  # does not work after this line
        try:
            row.append(td.text.replace('\n', ''))
        except:
            continue
if len(row) > 0:
    rows.append(row)
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)

Since the output of contents is a large chunk of HTML, I don't know how to extract the data correctly and save it as a DataFrame. Could someone share an answer or give me some tips? Thanks.

Solution

You can treat each direct child div of the info_table container as a row and every div inside it as a cell:

table = soup.find('div', {'class': 'info_table'})
data = [[cell.text.strip() for cell in row.find_all('div')]
            for row in table.find_all('div', recursive=False)]
df = pd.DataFrame(data[1:], columns=data[0])

Output:

>>> df
              Date Oil Rigs Gas Rigs Total Rigs Frac Spread Production Million Bpd
0     4th Mar 2022      519      130        650         280                       
1    25th Feb 2022      522      127        650         290                       
2    18th Feb 2022      520      124        645         283                  11.60
3    11th Feb 2022      516      118        635         275                  11.60
4     4th Feb 2022      497      116        613         264                  11.60
..             ...      ...      ...        ...         ...                    ...
358  26th Dec 2014     1499      340       1840         367                   9.12
359  19th Dec 2014     1536      338       1875         415                   9.13
360  12th Dec 2014     1546      346       1893         411                   9.14
361   5th Dec 2014     1575      344       1920         428                   9.12
362  21st Nov 2014     1574      355       1929         452                   9.08

[363 rows x 6 columns]
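
To see why the loop over .children in the question gets stuck while find_all('div', recursive=False) works, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for the real page (the actual markup may differ, and the class name "row" is made up); only the nesting of div elements matters. The cell values are copied from the output above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup standing in for the real page; the live HTML may differ.
html = """
<div class="info_table">
  <div class="row"><div>Date</div><div>Oil Rigs</div><div>Gas Rigs</div></div>
  <div class="row"><div>4th Mar 2022</div><div>519</div><div>130</div></div>
  <div class="row"><div>25th Feb 2022</div><div>522</div><div>127</div></div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', {'class': 'info_table'})

# .children also yields the whitespace between tags as text nodes,
# which is what trips up the nested loop in the question.
child_types = [type(child).__name__ for child in table.children]

# recursive=False keeps only the direct <div> children, i.e. the rows.
data = [[cell.text.strip() for cell in row.find_all('div')]
        for row in table.find_all('div', recursive=False)]
df = pd.DataFrame(data[1:], columns=data[0])

print(child_types)
print(df)
```

Note that row.find_all('div') is itself recursive, so this relies on the cells not containing further nested div elements; if they did, passing recursive=False there as well would probably be the safer choice.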

Update

A lazy way to let pandas infer the data types is to write the rows out as tab-separated text and read them back with read_csv:

import io

table = soup.find('div', {'class': 'info_table'})
data = ['\t'.join(cell.text.strip() for cell in row.find_all('div'))
            for row in table.find_all('div', recursive=False)]
buf = io.StringIO()
buf.write('\n'.join(data))
buf.seek(0)

df = pd.read_csv(buf, sep='\t', parse_dates=['Date'])

Output:

>>> df
          Date  Oil Rigs  Gas Rigs  Total Rigs  Frac Spread  Production Million Bpd
0   2022-03-04       519       130         650          280                     NaN
1   2022-02-25       522       127         650          290                     NaN
2   2022-02-18       520       124         645          283                   11.60
3   2022-02-11       516       118         635          275                   11.60
4   2022-02-04       497       116         613          264                   11.60
..         ...       ...       ...         ...          ...                     ...
358 2014-12-26      1499       340        1840          367                    9.12
359 2014-12-19      1536       338        1875          415                    9.13
360 2014-12-12      1546       346        1893          411                    9.14
361 2014-12-05      1575       344        1920          428                    9.12
362 2014-11-21      1574       355        1929          452                    9.08

[363 rows x 6 columns]

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 363 entries, 0 to 362
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Date                    363 non-null    datetime64[ns]
 1   Oil Rigs                363 non-null    int64         
 2   Gas Rigs                363 non-null    int64         
 3   Total Rigs              363 non-null    int64         
 4   Frac Spread             363 non-null    int64         
 5   Production Million Bpd  360 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(4)
memory usage: 17.1 KB
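
If you would rather skip the text round trip, the columns can also be converted explicitly. The following is only a sketch: it assumes English month abbreviations and the ordinal day format shown above, and it uses a small hand-made frame with values copied from the output instead of fetching the page.

```python
import pandas as pd

# Sample rows copied from the output above; every value is still a string.
df = pd.DataFrame({
    'Date': ['4th Mar 2022', '18th Feb 2022', '21st Nov 2014'],
    'Oil Rigs': ['519', '520', '1574'],
    'Production Million Bpd': ['', '11.60', '9.08'],
})

# Drop the ordinal suffix (st, nd, rd, th) after the day number, then parse.
dates = df['Date'].str.replace(r'(?<=\d)(?:st|nd|rd|th)\b', '', regex=True)
df['Date'] = pd.to_datetime(dates, format='%d %b %Y')

# errors='coerce' turns empty strings into NaN instead of raising.
for col in ['Oil Rigs', 'Production Million Bpd']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print(df.dtypes)
```

This is more verbose than the read_csv approach, but it makes the date format explicit instead of relying on inference.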

Answered By - Corralien

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
