Saturday, September 17, 2022

[FIXED] How to scrape text from HTML to dataframe removing header and footer extra information?

Tags: beautifulsoup, dataframe, pandas, python, web-scraping

Issue

I would like to extract focal mechanism information from the GCMT catalog (https://www.globalcmt.org/). Eventually I plan to automate this in Python, so that I can pull the earthquake information directly into Python for plotting and analysis without going through the GCMT web page.

Here's the code I have so far with an example URL:

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.globalcmt.org/cgi-bin/globalcmt-cgi-bin/CMT5/form?itype=ymd&yr=1976&mo=1&day=1&oyr=1976&omo=1&oday=1&jyr=1976&jday=1&ojyr=1976&ojday=1&otype=nd&nday=365&lmw=0&umw=10&lms=0&ums=10&lmb=0&umb=10&llat=-90&ulat=90&llon=-180&ulon=180&lhd=0&uhd=1000&lts=-9999&uts=9999&lpe1=0&upe1=90&lpe2=0&upe2=90&list=6"
# Fetch the search results page once and parse it.
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html5lib")

# Dump all visible text in the body, one text node per line.
text = soup.body.get_text(separator="\n", strip=True)
print(text)

This prints:

Global CMT Catalog
Search criteria:
Start date: 1976/1/1   End date: 1976/12/30
-90 <=lat<= 90          -180 <=lon<= 180 
0 <=depth<= 1000         -9999 <=time shift<= 9999
0 <=mb<= 10        0<=Ms<= 10           0<=Mw<= 10
0 <=tension plunge<= 90         0 <=null plunge<= 90
Results
Output in
GMT
psmeca (GMT v>3.3) format
Columns: lon lat depth mrr mtt mpp mrt mrp mtp iexp name
-176.96 -29.25 48 7.68 0.09 -7.77 1.39 4.52 -3.26 26 X Y 010176A        
-75.14 -13.42 85 -1.78 -0.59 2.37 -1.28 1.97 -2.90 24 X Y 010576A        
159.50 51.45 15 1.10 -0.30 -0.80 1.05 1.24 -0.56 25 X Y 010676A
...

I'm still new to Python and web scraping. I would like to extract only the data table, starting at the column line (Columns: lon lat depth mrr mtt mpp mrt mrp mtp iexp name) and stopping before the footer (End of events found with given criteria.) and everything after it.

The output should contain the column names: lon lat depth mrr mtt mpp mrt mrp mtp iexp name

followed by the data rows (e.g.): -176.96 -29.25 48 7.68 0.09 -7.77 1.39 4.52 -3.26 26 X Y 010176A

Solution

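You could create a list of dicts from the header and the values:

data = []

# The second <pre> block holds the event rows; the text node just before it
# is the "Columns: ..." line.
pre = soup.select_one('pre:nth-of-type(2)')
header = pre.find_previous(string=True).split()[1:]  # drop the leading "Columns:" token
header[10:10] = ['x', 'y']                            # make room for the extra "X Y" tokens

for l in pre.text.splitlines():
    d = l.split()
    if not d:
        continue  # skip blank lines
    # d[10:13] = [' '.join(d[10:13])]  # alternative: merge "X Y <name>" into one field
    # del d[10:12]                      # alternative: drop "X" and "Y"
    data.append(dict(zip(header, d)))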

The tricky part, in my opinion, is handling the last elements of each row so that the values do not mismatch the headers.

Assuming "X Y ..." belong together:

d[10:13] = [' '.join([str(x) for x in d[10:13]])]

or if they are not needed simply delete them:

del d[10:12]

or adjust the headers instead:

header[10:10] = ['x','y']
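
The three options can be compared side by side on a single row. The following is a minimal sketch that hard-codes the first sample row from the question, so it runs without network access. The variable names (columns, row_merged, row_dropped, row_wide) are illustrative and not part of the original answer.

```python
# Sample row and column line copied from the question's output.
line = "-176.96 -29.25 48 7.68 0.09 -7.77 1.39 4.52 -3.26 26 X Y 010176A        "
columns = "lon lat depth mrr mtt mpp mrt mrp mtp iexp name".split()

# Option 1: merge "X Y <name>" into a single name field.
merged = line.split()
merged[10:13] = [" ".join(merged[10:13])]
row_merged = dict(zip(columns, merged))

# Option 2: drop the "X" and "Y" tokens and keep only the event name.
dropped = line.split()
del dropped[10:12]
row_dropped = dict(zip(columns, dropped))

# Option 3: keep every token and widen the header instead.
wide_columns = columns.copy()
wide_columns[10:10] = ["x", "y"]
row_wide = dict(zip(wide_columns, line.split()))

print(row_merged["name"])
print(row_dropped["name"])
print(list(row_wide)[-3:])
```

In all three cases the number of values matches the number of column names, which is what keeps zip from silently truncating the row.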

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.globalcmt.org/cgi-bin/globalcmt-cgi-bin/CMT5/form?itype=ymd&yr=1976&mo=1&day=1&oyr=1976&omo=1&oday=1&jyr=1976&jday=1&ojyr=1976&ojday=1&otype=nd&nday=365&lmw=0&umw=10&lms=0&ums=10&lmb=0&umb=10&llat=-90&ulat=90&llon=-180&ulon=180&lhd=0&uhd=1000&lts=-9999&uts=9999&lpe1=0&upe1=90&lpe2=0&upe2=90&list=6"
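# Fetch the page once and parse it.
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html5lib")

data = []

# The second <pre> block holds the event rows; the text node just before it
# is the "Columns: ..." line.
pre = soup.select_one('pre:nth-of-type(2)')
header = pre.find_previous(string=True).split()[1:]  # drop the leading "Columns:" token
header[10:10] = ['x', 'y']                            # make room for the extra "X Y" tokens

for l in pre.text.splitlines():
    d = l.split()
    if not d:
        continue  # skip blank lines
    # d[10:13] = [' '.join(d[10:13])]  # alternative: merge "X Y <name>" into one field
    # del d[10:12]                      # alternative: drop "X" and "Y"
    data.append(dict(zip(header, d)))

df = pd.DataFrame(data)
print(df)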

Output

	lon	lat	depth	mrr	mtt	mpp	mrt	mrp	mtp	iexp	x	y	name
0	-176.96	-29.25	48	7.68	0.09	-7.77	1.39	4.52	-3.26	26	X	Y	010176A
1	-75.14	-13.42	85	-1.78	-0.59	2.37	-1.28	1.97	-2.9	24	X	Y	010576A
2	159.5	51.45	15	1.1	-0.3	-0.8	1.05	1.24	-0.56	25	X	Y	010676A
3	167.81	-15.97	174	-1.7	2.29	-0.59	-2.33	-1.23	2.01	25	X	Y	010976A
4	-16.29	66.33	15	-0.51	-2.86	3.37	0.05	-0.78	-0.86	25	X	Y	011376A
5	-177.04	-29.69	47	4.78	-0.49	-4.3	0.83	3.62	-1.32	27	X	Y	011476A
6	-176.75	-28.72	18	2.56	0.18	-2.74	3.58	6.77	-1.23	27	X	Y	011476B
7	-176.62	-28.61	15	2.34	0.24	-2.58	0.62	3.71	-0.68	25	X	Y	011476C
8	-176.63	-30.25	15	1.44	0.06	-1.5	0.3	1.18	-0.46	25	X	Y	011576A

...
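
All values in the frame above are strings. As a possible variation, the rows can also be cut out of the plain page text (the output of get_text in the question) and converted to numbers along the way. This is only a sketch under stated assumptions: parse_psmeca_text is a hypothetical helper, the footer is assumed to start with "End of events" as quoted in the question, and the sample text is abbreviated from the question's output rather than fetched from the site.

```python
def parse_psmeca_text(text):
    """Return (columns, rows) from the page text, skipping header and footer."""
    columns, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Columns:"):
            columns = line.split()[1:]   # drop the leading "Columns:" token
            continue
        if columns is None or not line:
            continue                     # still in the page header, or a blank line
        if line.startswith("End of events"):
            break                        # footer reached (assumed marker)
        tokens = line.split()
        if len(tokens) != len(columns) + 2:
            continue                     # not a data row in the expected layout
        del tokens[10:12]                # drop the "X" and "Y" tokens
        values = [float(t) for t in tokens[:-1]] + [tokens[-1]]
        rows.append(dict(zip(columns, values)))
    return columns, rows


# Abbreviated sample of the text shown in the question.
sample = """Global CMT Catalog
Results
Columns: lon lat depth mrr mtt mpp mrt mrp mtp iexp name
-176.96 -29.25 48 7.68 0.09 -7.77 1.39 4.52 -3.26 26 X Y 010176A
-75.14 -13.42 85 -1.78 -0.59 2.37 -1.28 1.97 -2.90 24 X Y 010576A
159.50 51.45 15 1.10 -0.30 -0.80 1.05 1.24 -0.56 25 X Y 010676A
End of events found with given criteria.
"""

columns, rows = parse_psmeca_text(sample)
print(columns)
print(rows[0])
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as data in the example above.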

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
