Wednesday, October 19, 2022

[FIXED] Web scraping div by attribute

Tags: beautifulsoup, python, selenium, web-scraping

Issue

I have a target website with this code:

<div data-v-afa58544="" class="pa-0 col col-4">
          0.51
        </div>

I'm trying to read that '0.51', but I don't know how to reference the div through its data-v-afa58544 attribute. How can I get that value with Selenium (or BeautifulSoup)? I would appreciate any help.

Edit:

Can't access the element with the following code:

v = soup.find('div[data-v-afa58544]')
p = v.select_one('.pa-0 col col-4')

Even when I try v = soup.select_one, or v = soup.find_all and iterate over the result, I always get None.

Solution

The text of the div with the data-v-afa58544 attribute and class="pa-0 col col-4" is 0.51.

Solution (using bs4), with either of these selectors:

v = soup.select_one(".pa-0.col.col-4").get_text(strip=True)

v = soup.select_one('div[data-v-afa58544]').get_text(strip=True)

Example:

html = '''
<div data-v-afa58544="" class="pa-0 col col-4">
          0.51
        </div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
#print(soup.prettify())
v = soup.select_one('.pa-0')
# select_one returns None when nothing matches, so guard before reading the text
p = v.get_text(strip=True) if v else None
print(p)

Output:

0.51
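
The attempts in the question return None for two reasons that are easy to miss. find() matches tag names and attribute filters, not CSS selectors, and the selector '.pa-0 col col-4' contains spaces, so it searches for nested col elements rather than one element with three classes. The sketch below reuses the HTML from the question to show the difference; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<div data-v-afa58544="" class="pa-0 col col-4">
          0.51
        </div>
'''
soup = BeautifulSoup(html, 'html.parser')

# find() treats its first argument as a tag name, so this looks for a tag
# literally named 'div[data-v-afa58544]' and finds nothing.
wrong_find = soup.find('div[data-v-afa58544]')

# Spaces in a CSS selector mean "descendant": this looks for a <col> inside a
# <col> inside an element with class pa-0, which does not exist here.
wrong_select = soup.select_one('.pa-0 col col-4')

# find() can match on the attribute through the attrs argument;
# True means "the attribute is present, whatever its value".
by_attr = soup.find('div', attrs={'data-v-afa58544': True})

# Chaining the classes with dots requires all three on the same element.
by_class = soup.select_one('div.pa-0.col.col-4')

print(wrong_find, wrong_select)
print(by_attr.get_text(strip=True), by_class.get_text(strip=True))
```

Attributes of the form data-v-... are typically generated by Vue's scoped styles and may change when the site is rebuilt, so the class-based selector is probably the more stable of the two.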

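Update: the values can also be read from a JSON file on the same site with requests, without parsing any HTML: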

import requests
import pandas as pd

headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
    }

url = 'https://www.river.go.jp/kawabou/file/files/tmlist/rn/20221004/0845/1614200100163.json'
r = requests.get(url,headers=headers)
df = pd.json_normalize(r.json()['min10Values'])
print(df)

Output:

           obsTime     rn10m  rn10mCcd  rnInc  rnIncCcd
0   2022/10/04 08:40      1         0      6         0
1   2022/10/04 08:30      0         0      5         0
2   2022/10/04 08:20      1         0      5         0
3   2022/10/04 08:10      0         0      4         0
4   2022/10/04 08:00      1         0      4         0
5   2022/10/04 07:50      0         0      3         0
6   2022/10/04 07:40      1         0      3         0
7   2022/10/04 07:30      0         0      2         0
8   2022/10/04 07:20      0         0      2         0
9   2022/10/04 07:10      0         0      2         0
10  2022/10/04 07:00      0         0      2         0
11  2022/10/04 06:50      0         0      2         0
12  2022/10/04 06:40      1         0      2         0
13  2022/10/04 06:30      0         0      1         0
14  2022/10/04 06:20      0         0      1         0
15  2022/10/04 06:10      0         0      1         0
16  2022/10/04 06:00      0         0      1         0
17  2022/10/04 05:50      0         0      1         0
18  2022/10/04 05:40      1         0      1         0
19  2022/10/04 05:30      0         0      0         0
20  2022/10/04 05:20      0         0      0         0
21  2022/10/04 05:10      0         0      0         0
22  2022/10/04 05:00      0         0      0         0
23  2022/10/04 04:50      0         0      0         0
24  2022/10/04 04:40      0         0      0         0
25  2022/10/04 04:30      0         0      0         0
26  2022/10/04 04:20      0         0      0         0
27  2022/10/04 04:10      0         0      0         0
28  2022/10/04 04:00      0         0      0         0
29  2022/10/04 03:50      0         0      0         0
30  2022/10/04 03:40      0         0      0         0
31  2022/10/04 03:30      0         0      0         0
32  2022/10/04 03:20      0         0      0         0
33  2022/10/04 03:10      0         0      0         0
34  2022/10/04 03:00      0         0      0         0
35  2022/10/04 02:50      0         0      0         0
36  2022/10/04 02:40      0         0      0         0
37  2022/10/04 02:30      0         0      0         0
38  2022/10/04 02:20      0         0      0         0
39  2022/10/04 02:10      0         0      0         0
40  2022/10/04 02:00      0         0      0         0
41  2022/10/04 01:50      0         0      0         0
42  2022/10/04 01:40      0         0      0         0
43  2022/10/04 01:30      0         0      0         0
44  2022/10/04 01:20      0         0      0         0
45  2022/10/04 01:10      0         0      0         0
46  2022/10/04 01:00      0         0      0         0
47  2022/10/04 00:50      0         0      0         0
48  2022/10/04 00:40      0         0      0         0
49  2022/10/04 00:30      0         0      0         0
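
The obsTime column comes back as plain strings. If you need to sort or filter by time, one option is to convert it with pd.to_datetime. The sketch below uses three rows copied from the output above in place of the live response, so it runs without network access. The format string and the integer value types are assumptions based on how the table is printed.

```python
import pandas as pd

# Three rows copied from the printed output, standing in for
# r.json()['min10Values'] so the example runs offline.
sample = [
    {'obsTime': '2022/10/04 08:40', 'rn10m': 1, 'rn10mCcd': 0, 'rnInc': 6, 'rnIncCcd': 0},
    {'obsTime': '2022/10/04 08:30', 'rn10m': 0, 'rn10mCcd': 0, 'rnInc': 5, 'rnIncCcd': 0},
    {'obsTime': '2022/10/04 08:20', 'rn10m': 1, 'rn10mCcd': 0, 'rnInc': 5, 'rnIncCcd': 0},
]

df = pd.json_normalize(sample)

# Parse the timestamp strings (format assumed from the printed values).
df['obsTime'] = pd.to_datetime(df['obsTime'], format='%Y/%m/%d %H:%M')

# Oldest observation first.
df = df.sort_values('obsTime').reset_index(drop=True)
print(df)
```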

Selenium and bs4:

from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service

webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url ='https://www.river.go.jp/kawabou/pcfull/tm?itmkndCd=4&ofcCd=20757&obsCd=62&isCurrent=true&fld=0'


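driver.get(url)
driver.maximize_window()
time.sleep(8)  # fixed pause to give the page time to render
soup = BeautifulSoup(driver.page_source, "html.parser")

e = []
# each matching span holds one "label:value" pair
for row in soup.select('div[class="tm-pc-detail-border-line pt-1"] > span'):
    v = row.get_text(strip=True).split(':')
    # remove the trailing 'marrow_upward' text from the value
    d = {v[0]: v[1].replace('marrow_upward', '')}
    e.append(d)
print(e)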

Output:

[{'水位': '0.58'}, {'時間雨量': '3.0mm'}, {'10分雨量': '1.0mm'}, {'降り始めからの雨量': '8.0mm'}]

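The keys are the Japanese labels from the page: 水位 (water level), 時間雨量 (hourly rainfall), 10分雨量 (10-minute rainfall) and 降り始めからの雨量 (rainfall since the rain began). To read only the first span, the water level: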

e=[] 
row = soup.select('div[class="tm-pc-detail-border-line pt-1"] > span')[0]
v= row.get_text(strip=True).split(':')
d={
    v[0]:v[1].replace('marrow_upward','')
    }
e.append(d)
print(e)

Output:

[{'水位': '0.59'}]
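
Both snippets index v[1] after split(':'), which raises IndexError for any span whose text has no colon. A slightly more defensive variant is sketched below as a hypothetical helper, parse_span_text, which is not part of the original answer. The sample strings are assumptions modelled on the output above, not text captured from the live page.

```python
def parse_span_text(text, suffix='marrow_upward'):
    """Split 'label:value' text into a (label, value) pair.

    Returns None when the text has no colon instead of raising IndexError.
    """
    label, sep, value = text.partition(':')
    if not sep:
        return None
    return label.strip(), value.replace(suffix, '').strip()


# Sample strings are assumed, modelled on the output shown above.
samples = ['水位:0.58marrow_upward', '時間雨量:3.0mm', 'no colon here']

# Skip anything that could not be parsed and collect the rest in one dict.
parsed = [pair for pair in map(parse_span_text, samples) if pair is not None]
result = dict(parsed)
print(result)
```

With real data you would pass row.get_text(strip=True) for each span. Collecting the pairs into a single dict, rather than a list of one-key dicts, makes lookups such as result['水位'] direct.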

Answered By - Fazlul

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
