Issue
I have a tag which looks like this:
<div class="small text-gray mb-2">
    <div>
        Pierre M
        <!-- -->
        ,
        <!-- -->
        08/18/2018
        <!-- -->
        <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">
            updated
            <!-- -->
            03/11/2021
        </div>
    </div>
    <div>Long Range 4dr Sedan (electric DD)</div>
</div>
I would like to get only the reviewer's name ("Pierre M") and the date ("08/18/2018").
I tried this code:
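from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html holds the page markup
data = []
# content_list holds the review <div> elements selected from soup
for e in content_list:
    data.append({
        'reviewer-name': e.select_one('div').text,
        'reviewe-date': e.select_one('div').text,
    })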
But this takes all of the text inside the tag, so I get:
'reviewe-date': 'John Schreiber, 10/06/2018 updated 10/08/2019Long Range 4dr Sedan (electric DD)',
'reviewer-name': 'John Schreiber, 10/06/2018 updated 10/08/2019Long Range 4dr Sedan (electric DD)'
Solution
You could use find_all(text=True, recursive=False) to get only the text nodes that are direct children of the first inner <div>. In your specific case this leaves out the text of the nested "updated" block:
for e in soup.select('div.small'):
    data.append({
        'reviewer-name': ''.join(e.div.find_all(text=True, recursive=False)).split(',')[0].strip(),
        'reviewe-date': ''.join(e.div.find_all(text=True, recursive=False)).split(',')[-1].strip(),
    })
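To make the effect of recursive=False easier to see, here is a minimal, self-contained sketch of the same idea. It assumes a recent Beautiful Soup 4, where string= is the current name for the older text= argument, and the built-in html.parser. The helper name direct_text is hypothetical, and skipping Comment nodes is an extra precaution that is not part of the original answer.

```python
from bs4 import BeautifulSoup, Comment

html = '''
<div class="small text-gray mb-2">
  <div>
    Pierre M
    <!-- -->
    ,
    <!-- -->
    08/18/2018
    <!-- -->
    <div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">
      updated
      <!-- -->
      03/11/2021
    </div>
  </div>
  <div>Long Range 4dr Sedan (electric DD)</div>
</div>
'''


def direct_text(tag):
    """Join the text nodes that are direct children of `tag` (hypothetical helper)."""
    nodes = tag.find_all(string=True, recursive=False)
    # Comment is a NavigableString subclass, so filter it out explicitly.
    return ''.join(n for n in nodes if not isinstance(n, Comment))


soup = BeautifulSoup(html, 'html.parser')
data = []
for e in soup.select('div.small'):
    # Split once on the first comma: name on the left, date on the right.
    name, _, date = direct_text(e.div).partition(',')
    data.append({
        'reviewer-name': name.strip(),
        'reviewe-date': date.strip(),
    })

print(data)
```

Because the nested "updated" block is a child tag rather than a direct text node, its text is never collected, so no cleanup of the tree is needed with this approach.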
An alternative would be to check for the child <div> containing "updated", save its text if needed, and decompose() it from the DOM. Note that the walrus operator requires Python 3.8 or later; on older versions use a standard assignment followed by an if statement:
for e in soup.select('div.small'):
    if (u := e.select_one('div.rounded')):
        updated = u.text.split('updated')[-1].strip()
        u.decompose()
    else:
        updated = None
    data.append({
        'reviewer-name': e.div.text.split(',')[0].strip(),
        'reviewe-date': e.div.text.split(',')[-1].strip(),
        'reviewe-updated': updated
    })
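For Python versions before 3.8, the same logic can be written with a plain assignment and an if check. The following is only a sketch under that assumption: the function name parse_review is hypothetical, the markup is shortened, the second review (without an "updated" block) is made-up sample data, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup


def parse_review(e):
    """Extract name, date and optional update date from one review block (hypothetical helper)."""
    u = e.select_one('div.rounded')  # plain assignment instead of the walrus operator
    if u is not None:
        updated = u.get_text().split('updated')[-1].strip()
        u.decompose()  # remove the block so its text no longer appears in e.div
    else:
        updated = None
    name, _, date = e.div.get_text().partition(',')
    return {
        'reviewer-name': name.strip(),
        'reviewe-date': date.strip(),
        'reviewe-updated': updated,
    }


# Shortened markup; the second review is invented to exercise the else branch.
html = '''
<div class="small"><div>Pierre M<!-- -->,<!-- -->08/18/2018<!-- -->
<div class="rounded">updated<!-- -->03/11/2021</div></div>
<div>Long Range 4dr Sedan (electric DD)</div></div>
<div class="small"><div>Sample Reviewer<!-- -->,<!-- -->01/02/2020</div>
<div>Long Range 4dr Sedan (electric DD)</div></div>
'''

soup = BeautifulSoup(html, 'html.parser')
data = [parse_review(e) for e in soup.select('div.small')]
print(data)
```

Keep in mind that decompose() modifies the parsed tree, so anything needed from the "updated" block has to be read before it is removed.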
Example
from bs4 import BeautifulSoup
html = '''
<div class="small text-gray mb-2">
<div>
Pierre M
<!-- -->
,
<!-- -->
08/18/2018
<!-- -->
<div class="d-inline-block px-0_25 text-white bg-primary-darker rounded">
updated
<!-- -->
03/11/2021
</div>
</div>
<div>Long Range 4dr Sedan (electric DD)</div>
</div>
'''
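# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
data = []
for e in soup.select('div.small'):
    # Save the "updated" date if present, then remove that block from the tree
    if (u := e.select_one('div.rounded')):
        updated = u.text.split('updated')[-1].strip()
        u.decompose()
    else:
        updated = None
    data.append({
        'reviewer-name': e.div.text.split(',')[0].strip(),
        'reviewe-date': e.div.text.split(',')[-1].strip(),
        'reviewe-updated': updated
    })
print(data)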
Output
[{'reviewer-name': 'Pierre M', 'reviewe-date': '08/18/2018', 'reviewe-updated': '03/11/2021'}]
Answered By - HedgeHog