Issue
Apologies in advance for such a long and basic question!
Given the following three HTML snippets, which are part of a larger document:
<html>
<body>
<span _ngcontent-ont-c199="" class="font-weight-bold">
<span _ngcontent-ont-c199="" class="ng-star-inserted">
<span _ngcontent-ont-c199="" translate="">
nro
</span>
4 A.
</span>
<!-- -->
<span _ngcontent-ont-c199="" class="ng-star-inserted">
6.12.1939
</span>
<!-- -->
</span>
<span _ngcontent-ont-c199="" class="ng-star-inserted">
, JR 10
</span>
<!-- -->
<!-- -->
<span _ngcontent-ont-c199="" class="ng-star-inserted">
:
<span _ngcontent-ont-c199="" translate="">
sivu
</span>
1
</span>
</body>
</html>
and
<html>
<body>
<span _ngcontent-evu-c199="" class="font-weight-bold">
<!-- -->
<span _ngcontent-evu-c199="" class="ng-star-inserted">
1905
</span>
<!-- -->
</span>
<span _ngcontent-evu-c199="" class="ng-star-inserted">
, Aksel Paul
</span>
<!-- -->
<span _ngcontent-evu-c199="" class="ng-star-inserted">
, Helsinki
</span>
<!-- -->
<span _ngcontent-evu-c199="" class="ng-star-inserted">
:
<span _ngcontent-evu-c199="" translate="">
page
</span>
63
</span>
</body>
</html>
and
<html>
<body>
<span _ngcontent-ejj-c199="" class="badge badge-secondary ng-star-inserted">
22
</span>
<span _ngcontent-dna-c199="" class="font-weight-bold">
<span _ngcontent-dna-c199="" class="ng-star-inserted">
<span _ngcontent-dna-c199="" translate="">
nro
</span>
12 ZZ
</span>
<span _ngcontent-dna-c199="" class="ng-star-inserted">
10.2016
</span>
</span>
<span _ngcontent-ejj-c199="" class="ng-star-inserted">
, Arbetarförlaget Ab
</span>
<!-- -->
<span _ngcontent-ejj-c199="" class="ng-star-inserted">
, Stockholm
</span>
<!-- -->
<span _ngcontent-ejj-c199="" class="ng-star-inserted">
:
<span _ngcontent-ejj-c199="" translate="">
sida
</span>
20
</span>
</body>
</html>
I would like to extract six pieces of information from each snippet (None where a field is not available), in the order given by the following list:
desired_list = ["badge", "issue", "date", "publisher", "city", "page"]
So I have the following code (very inefficient, using a for loop):
from bs4 import BeautifulSoup

desired_list = [None]*6 # initialize with [None, None, None, None, None, None]
soup = BeautifulSoup(html, "lxml") # html is one of html_1, html_2, html_3
fwb = soup.find("span", class_="font-weight-bold")
issue_date = fwb.select("span.ng-star-inserted") # 1 or 2 elements: ['nro XX extension', 'DD.MM.YYYY']
for el in issue_date:
    element = el.text.split()
    if "nro" in element:
        desired_list[1] = " ".join(element) # handling issue: nro XX extension
    else:
        desired_list[2] = " ".join(element) # handling date: DD.MM.YYYY
badge = soup.find("span", class_="badge badge-secondary ng-star-inserted")
if badge:
    desired_list[0] = " ".join(badge.text.split()) # handling badge
Currently, I can only extract the first three components of desired_list, namely badge, issue and date:
[None, 'nro 4 A.', '6.12.1939', None, None, None] # html_1
[None, None, '1905', None, None, None] # html_2
['22', 'nro 12 ZZ', '10.2016', None, None, None] # html_3
whereas the desired list for each HTML snippet should look like this:
[None, 'nro 4 A.', '6.12.1939', 'JR 10', None, 'sivu 1'] # html_1
[None, None, '1905', 'Aksel Paul', 'Helsinki', 'page 63'] # html_2
['22', 'nro 12 ZZ', '10.2016', 'Arbetarförlaget Ab', 'Stockholm', 'sida 20'] # html_3
I don't know how to change my code to retrieve all 6 fields from the HTML snippets above, since the fields occur irregularly: any of them can be missing from a given snippet. I would really appreciate it if someone could recommend a smarter and more efficient way of handling this.
I am aware of soup.find_all("span", class_="ng-star-inserted"). However, the problem is that find_all does not always return a list of length 6 that I could enumerate!
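To illustrate the problem, here is a minimal sketch using shortened versions of the first and third snippets (Angular attributes and comments removed). The helper name inserted_spans is only for illustration, and html.parser is used so that nothing besides bs4 is needed:

```python
from bs4 import BeautifulSoup

# Shortened copies of html_1 and html_3 from above.
html_1 = """<html><body>
<span class="font-weight-bold">
  <span class="ng-star-inserted"><span translate="">nro</span> 4 A.</span>
  <span class="ng-star-inserted">6.12.1939</span>
</span>
<span class="ng-star-inserted">, JR 10</span>
<span class="ng-star-inserted">: <span translate="">sivu</span> 1</span>
</body></html>"""

html_3 = """<html><body>
<span class="badge badge-secondary ng-star-inserted">22</span>
<span class="font-weight-bold">
  <span class="ng-star-inserted"><span translate="">nro</span> 12 ZZ</span>
  <span class="ng-star-inserted">10.2016</span>
</span>
<span class="ng-star-inserted">, Arbetarförlaget Ab</span>
<span class="ng-star-inserted">, Stockholm</span>
<span class="ng-star-inserted">: <span translate="">sida</span> 20</span>
</body></html>"""


def inserted_spans(html):
    """Return the whitespace-normalized text of every span.ng-star-inserted."""
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.find_all("span", class_="ng-star-inserted")
    return [" ".join(s.text.split()) for s in spans]


for name, html in [("html_1", html_1), ("html_3", html_3)]:
    texts = inserted_spans(html)
    print(name, len(texts), texts)
```

The number of matches differs between snippets, and the same position can hold a different field, so the position in the find_all result cannot be mapped directly onto desired_list.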
Cheers,
Solution
You might just define a list of selectors corresponding to desired_list:
selectors = [
    'span.badge.badge-secondary.ng-star-inserted', # badge
    'span.font-weight-bold span.ng-star-inserted:has(span[translate])', # issue
    'span.font-weight-bold span.ng-star-inserted:last-child', # date
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ")', # publisher
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ") + span.ng-star-inserted:-soup-contains(", ")', # city
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(":")', # page
]
and then
desired_list = [
    None if s[0] is None else (
        s[0].get_text(' ', strip=True)[2:] if '-soup-contains' in s[1] # drop the leading ", " or ": "
        else s[0].get_text(' ', strip=True)
    ) for s in [(soup.select_one(sel), sel) for sel in selectors]
]
should give you what you're looking for.
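Putting the two pieces together, a minimal self-contained sketch might look like the following. The names extract_fields, FIELDS and SELECTORS, and the shortened markup (Angular attributes removed), are my own for illustration. It uses the built-in html.parser and assumes the installed soupsieve is recent enough to recognize :-soup-contains:

```python
from bs4 import BeautifulSoup

FIELDS = ["badge", "issue", "date", "publisher", "city", "page"]

# One selector per field, in the same order as FIELDS.
SELECTORS = [
    'span.badge.badge-secondary.ng-star-inserted',
    'span.font-weight-bold span.ng-star-inserted:has(span[translate])',
    'span.font-weight-bold span.ng-star-inserted:last-child',
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ")',
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ")'
    ' + span.ng-star-inserted:-soup-contains(", ")',
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(":")',
]


def extract_fields(html, parser="html.parser"):
    """Return one value per entry in FIELDS, or None where nothing matches."""
    soup = BeautifulSoup(html, parser)
    result = []
    for sel in SELECTORS:
        tag = soup.select_one(sel)
        if tag is None:
            result.append(None)
            continue
        text = tag.get_text(" ", strip=True)
        if "-soup-contains" in sel:
            # These fields start with ", " or ": "; drop that prefix.
            text = text[2:]
        result.append(text)
    return result


# Shortened copy of the second snippet from the question.
html_2 = """<html><body>
<span class="font-weight-bold">
  <!-- -->
  <span class="ng-star-inserted"> 1905 </span>
  <!-- -->
</span>
<span class="ng-star-inserted"> , Aksel Paul </span>
<!-- -->
<span class="ng-star-inserted"> , Helsinki </span>
<!-- -->
<span class="ng-star-inserted"> : <span translate=""> page </span> 63 </span>
</body></html>"""

print(dict(zip(FIELDS, extract_fields(html_2))))
```

Keeping the selectors in a list that parallels the field names means a missing element simply yields None at its position, so no length check on the result is needed.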
Please note that you might need to install and use the html5lib parser (which can be a bit slower than lxml):
soup = BeautifulSoup(html, "html5lib") # html_1, html_2, html_3
Otherwise, pseudo-classes like :has and :-soup-contains might raise errors or just return nothing. That said, after installing html5lib, I noticed the selectors started working no matter which parser I used, including lxml, although I did need to install html5lib first. That suggests the parser itself is not the deciding factor: as far as I know, select() and select_one() are handled by the soupsieve package, so an outdated bs4 or soupsieve install is a more likely cause.
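If the selectors fail on your machine, one way to check the installed packages is sketched below. It assumes, as far as I understand it, that bs4 hands CSS selectors to soupsieve regardless of the parser; supports_soup_contains is a hypothetical helper name, not part of either library:

```python
import bs4
import soupsieve


def supports_soup_contains():
    """Return True if the installed soupsieve accepts ':-soup-contains'."""
    try:
        # Compiling the selector is enough to find out whether it is recognized.
        soupsieve.compile('span:-soup-contains("x")')
    except soupsieve.SelectorSyntaxError:
        return False
    return True


print("beautifulsoup4:", bs4.__version__)
print("soupsieve:", soupsieve.__version__)
print(":-soup-contains supported:", supports_soup_contains())
```

If the last line reports False, upgrading beautifulsoup4 and soupsieve is probably worth trying before switching parsers.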
Answered By - Driftr95