Sunday, April 10, 2022

[FIXED] webscraping - get the first and the second value from the div tag with multiple values separated by comma

Labels: beautifulsoup, python

Issue

I have a tag that looks like this:

<div class="small text-gray mb-2">
    <div>
        Pierre M
        <!-- -->
        , 
        <!-- -->
        08/18/2018
        <!-- --> 
        <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">
            updated 
            <!-- -->
            03/11/2021
        </div>
      </div>
       <div>Long Range 4dr Sedan (electric DD)</div>
</div>

I would like to get only the name and surname, i.e. "Pierre M", and the date "08/18/2018".

I tried this code:

from bs4 import BeautifulSoup
soup = BeautifulSoup()
data = []

for e in content_list:
    data.append({
        'reviewer-name':e.select_one('div').text,
        'reviewe-date':e.select_one('div').text,
    })

But this takes all the text inside that tag, so I get:

'reviewe-date': 'John Schreiber, 10/06/2018 updated 10/08/2019Long Range 4dr Sedan (electric DD)',
'reviewer-name': 'John Schreiber, 10/06/2018 updated 10/08/2019Long Range 4dr Sedan (electric DD)'

Solution

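You could use find_all(text=True, recursive=False) to collect only the text nodes that are direct children of the inner <div>, which skips the nested "updated" <div> in your specific case (in Beautiful Soup 4.4.0 and later, string=True is the preferred spelling of text=True):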

for e in soup.select('div.small'):
    data.append({
        'reviewer-name':''.join(e.div.find_all(text=True, recursive=False)).split(',')[0].strip(),
        'reviewe-date':''.join(e.div.find_all(text=True, recursive=False)).split(',')[-1].strip(),
    })
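
As a self-contained check of this first approach, the sketch below runs it against a trimmed copy of the markup from the question. It assumes the built-in html.parser and uses string=True instead of text=True; the helper name direct_text is hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = '''
<div class="small text-gray mb-2">
    <div>
        Pierre M
        <!-- -->
        ,
        <!-- -->
        08/18/2018
        <!-- -->
        <div class="d-inline-block rounded">
            updated
            <!-- -->
            03/11/2021
        </div>
    </div>
    <div>Long Range 4dr Sedan (electric DD)</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')


def direct_text(tag):
    # Hypothetical helper: join only the strings that are direct children of
    # `tag`, so text inside nested tags (the "updated" badge) is left out.
    return ''.join(tag.find_all(string=True, recursive=False))


data = []
for e in soup.select('div.small'):
    # Split once on the first comma: name on the left, date on the right.
    name, _, date = direct_text(e.div).partition(',')
    data.append({
        'reviewer-name': name.strip(),
        'reviewe-date': date.strip(),
    })

print(data)
```

Joining the direct strings once and splitting with partition(',') avoids running the same find_all call twice per review.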

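Alternatively, look for the child <div> that holds the "updated" date, save its text if needed, and remove it from the tree with decompose() before reading the remaining text. The walrus operator (:=) used here requires Python 3.8 or later; on older versions, use a regular assignment followed by an if statement: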

for e in soup.select('div.small'):
    if (u := e.select_one('div.rounded')):
        updated = u.text.split('updated')[-1].strip()
        u.decompose()
    else: 
        updated = None
    data.append({
        'reviewer-name':e.div.text.split(',')[0].strip(),
        'reviewe-date':e.div.text.split(',')[-1].strip(),
        'reviewe-updated':updated
    })
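
For interpreters older than Python 3.8, the same logic can be sketched without the walrus operator. This is only an illustration under stated assumptions: the markup is a compacted copy of the question's HTML, the second review ("Jane D") is made-up sample data added to exercise the branch without an "updated" badge, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

html = '''
<div class="small text-gray mb-2">
    <div>Pierre M<!-- -->, <!-- -->08/18/2018<!-- --> <div class="rounded">updated <!-- -->03/11/2021</div></div>
    <div>Long Range 4dr Sedan (electric DD)</div>
</div>
<div class="small text-gray mb-2">
    <div>Jane D<!-- -->, <!-- -->01/02/2020</div>
    <div>Long Range 4dr Sedan (electric DD)</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
data = []

for e in soup.select('div.small'):
    # Plain assignment plus an explicit None check replaces `if (u := ...)`.
    u = e.select_one('div.rounded')
    if u is not None:
        updated = u.get_text().split('updated')[-1].strip()
        u.decompose()  # remove the badge so its text no longer appears in e.div
    else:
        updated = None

    name, _, date = e.div.get_text().partition(',')
    data.append({
        'reviewer-name': name.strip(),
        'reviewe-date': date.strip(),
        'reviewe-updated': updated,
    })

print(data)
```

Note that decompose() modifies the parsed tree in place, so the badge has to be read before it is removed.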

Example

from bs4 import BeautifulSoup
html = '''
<div class="small text-gray mb-2">
    <div>
        Pierre M
        <!-- -->
        , 
        <!-- -->
        08/18/2018
        <!-- --> 
        <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">
            updated 
            <!-- -->
            03/11/2021
        </div>
      </div>
       <div>Long Range 4dr Sedan (electric DD)</div>
</div>
'''


soup = BeautifulSoup(html, 'html.parser')
data = []

for e in soup.select('div.small'):
    if (u := e.select_one('div.rounded')):
        updated = u.text.split('updated')[-1].strip()
        u.decompose()
    else: 
        updated = None
    data.append({
        'reviewer-name':e.div.text.split(',')[0].strip(),
        'reviewe-date':e.div.text.split(',')[-1].strip(),
        'reviewe-updated':updated
    })

print(data)

Output

[{'reviewer-name': 'Pierre M', 'reviewe-date': '08/18/2018', 'reviewe-updated': '03/11/2021'}]

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
