Issue
The code below gets the HTML data into a list. I am trying to scrape a specific attribute called data-append-csv (for example, data-append-csv="abbotco01") from the Baseball Reference page (see the link in the code):
Current Code:
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
import os.path
import requests
r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.content, "html.parser") # try lxml
[x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x]
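For context, Baseball Reference wraps many of its tables inside HTML comments, which is why the code above filters for Comment nodes. A minimal, self-contained sketch (toy HTML, not the real page) of how BeautifulSoup exposes a commented-out table:

```python
from bs4 import BeautifulSoup, Comment

# Toy HTML mimicking a table hidden inside an HTML comment,
# the way Baseball Reference ships its stats tables.
html = '<div><!-- <table id="div_players_standard_batting"><tr><td>x</td></tr></table> --></div>'
soup = BeautifulSoup(html, "html.parser")

# Comment is a NavigableString subclass, so find_all(string=...) reaches it.
comments = [c for c in soup.find_all(string=lambda t: isinstance(t, Comment))
            if 'id="div_players_standard_batting"' in c]

# Re-parse the comment's text to get a normal, queryable soup.
inner = BeautifulSoup(comments[0], "html.parser")
print(inner.find("table")["id"])  # → div_players_standard_batting
```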
Current Environment Settings:
dependencies:
- python=3.9.7
- beautifulsoup4=4.11.1
- jupyterlab=3.3.2
- pandas=1.4.2
- pyodbc=4.0.32
The end goal: Be able to have a pandas dataframe that has each element of data-append-csv from the html table.
index | data-append-csv |
---|---|
0 | abbotco01 |
1 | abreual01 |
2 | abreubr01 |
etc.
Solution
First extract the comment string, convert it into a BeautifulSoup object, and use .select('[data-append-csv]'):
table = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0]
[(a.find_previous('th').text, a.get('data-append-csv')) for a in BeautifulSoup(table, 'html.parser').select('[data-append-csv]')]
To ensure a correct join back to your original data, scrape the rank (Rk) as well: if some rows lack the attribute, the two dataframes will have different lengths, and the rank lets you align them:
(a.find_previous('th').text, a.get('data-append-csv'))
Now you could create your dataframe from your list:
pd.DataFrame([(a.find_previous('th').text, a.get('data-append-csv')) for a in BeautifulSoup(table, 'html.parser').select('[data-append-csv]')], columns=['Rk','data-append-csv'], dtype='object')
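To see the selector and rank pairing in isolation, here is a minimal sketch on toy markup (hypothetical rows, mimicking the row structure of the real table): the CSS attribute selector [data-append-csv] matches any element carrying that attribute, and find_previous('th') walks back to the rank cell of the same row.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Toy rows mimicking Baseball Reference markup (hypothetical data).
table = """
<table>
  <tr><th>1</th><td><a data-append-csv="abbotco01" href="#">Cory Abbott</a></td></tr>
  <tr><th>2</th><td><a data-append-csv="abreual01" href="#">Albert Abreu</a></td></tr>
</table>
"""
inner = BeautifulSoup(table, "html.parser")

# Pair each anchor's rank (the preceding <th>) with its data-append-csv value.
pairs = [(a.find_previous("th").text, a.get("data-append-csv"))
         for a in inner.select("[data-append-csv]")]
df = pd.DataFrame(pairs, columns=["Rk", "data-append-csv"], dtype="object")
print(df)
```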
Example
Join your data to your initial dataframe and check the last column:
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
import requests
r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.text, "html.parser")
table = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0]
### create and clean dataframe 1
df1 = pd.read_html(table)[0]
df1 = df1[(~df1.Rk.isna()) & (df1.Rk != 'Rk')]
df1.set_index('Rk', inplace=True)
### create and clean dataframe 2
df2 = pd.DataFrame([(a.find_previous('th').text, a.get('data-append-csv')) for a in BeautifulSoup(table, 'html.parser').select('[data-append-csv]')], columns=['Rk','data-append-csv'], dtype='object')
df2.set_index('Rk', inplace=True)
### join both dataframe
df1.join(df2).reset_index()
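The reason the index-based join matters can be sketched with toy frames (hypothetical data): if a row in df1 has no data-append-csv anchor, df2 is shorter, and joining on the Rk index fills the gap with NaN instead of silently misaligning by position.

```python
import pandas as pd

# Toy frames: Rk=3 has no data-append-csv anchor, so df2 is shorter than df1.
df1 = pd.DataFrame({"Rk": ["1", "2", "3"],
                    "Name": ["Cory Abbott", "Albert Abreu", "League Average"]}).set_index("Rk")
df2 = pd.DataFrame({"Rk": ["1", "2"],
                    "data-append-csv": ["abbotco01", "abreual01"]}).set_index("Rk")

# Index-based join keeps every df1 row; missing ids become NaN.
joined = df1.join(df2).reset_index()
print(joined)
```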
Output
Rk | Name | Age | Tm | Lg | G | PA | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | BA | OBP | SLG | OPS | OPS+ | TB | GDP | HBP | SH | SF | IBB | Pos Summary | data-append-csv | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Fernando Abad* | 35 | BAL | AL | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 1 | abadfe01 |
1 | 2 | Cory Abbott | 25 | CHC | NL | 8 | 3 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.333 | 0.333 | 0.333 | 0.667 | 81 | 1 | 0 | 0 | 0 | 0 | 0 | /1H | abbotco01 |
2 | 3 | Albert Abreu | 25 | NYY | AL | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 1 | abreual01 |
3 | 4 | Bryan Abreu | 24 | HOU | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 1 | abreubr01 |
4 | 5 | José Abreu | 34 | CHW | AL | 152 | 659 | 566 | 86 | 148 | 30 | 2 | 30 | 117 | 1 | 0 | 61 | 143 | 0.261 | 0.351 | 0.481 | 0.831 | 124 | 272 | 28 | 22 | 0 | 10 | 3 | *3D/5 | abreujo02 |
....
Answered By - HedgeHog