Friday, November 18, 2022

[FIXED] BeautifulSoup extract desired information in custom-made list from different HTML snippets

Labels: beautifulsoup, python, web-scraping

Issue

Apologies in advance for such a long and basic question!

Given the following three HTML snippets, each of which is part of a larger page:

<html>
 <body>
  <span _ngcontent-ont-c199="" class="font-weight-bold">
   <span _ngcontent-ont-c199="" class="ng-star-inserted">
    <span _ngcontent-ont-c199="" translate="">
     nro
    </span>
    4 A.
   </span>
   <!-- -->
   <span _ngcontent-ont-c199="" class="ng-star-inserted">
    6.12.1939
   </span>
   <!-- -->
  </span>
  <span _ngcontent-ont-c199="" class="ng-star-inserted">
   , JR 10
  </span>
  <!-- -->
  <!-- -->
  <span _ngcontent-ont-c199="" class="ng-star-inserted">
   :
   <span _ngcontent-ont-c199="" translate="">
    sivu
   </span>
   1
  </span>
 </body>
</html>

and

<html>
 <body>
  <span _ngcontent-evu-c199="" class="font-weight-bold">
   <!-- -->
   <span _ngcontent-evu-c199="" class="ng-star-inserted">
    1905
   </span>
   <!-- -->
  </span>
  <span _ngcontent-evu-c199="" class="ng-star-inserted">
   , Aksel Paul
  </span>
  <!-- -->
  <span _ngcontent-evu-c199="" class="ng-star-inserted">
   , Helsinki
  </span>
  <!-- -->
  <span _ngcontent-evu-c199="" class="ng-star-inserted">
   :
   <span _ngcontent-evu-c199="" translate="">
    page
   </span>
   63
  </span>
 </body>
</html>

and

<html>
 <body>
  <span _ngcontent-ejj-c199="" class="badge badge-secondary ng-star-inserted">
   22
  </span>
  <span _ngcontent-dna-c199="" class="font-weight-bold">
   <span _ngcontent-dna-c199="" class="ng-star-inserted">
    <span _ngcontent-dna-c199="" translate="">
     nro
    </span>
    12 ZZ
   </span>
   <span _ngcontent-dna-c199="" class="ng-star-inserted">
    10.2016
   </span>
  </span>
  <span _ngcontent-ejj-c199="" class="ng-star-inserted">
   , Arbetarförlaget Ab
  </span>
  <!-- -->
  <span _ngcontent-ejj-c199="" class="ng-star-inserted">
   , Stockholm
  </span>
  <!-- -->
  <span _ngcontent-ejj-c199="" class="ng-star-inserted">
   :
   <span _ngcontent-ejj-c199="" translate="">
    sida
   </span>
   20
  </span>
 </body>
</html>

I would like to extract six pieces of information from each snippet (None where a field is missing), in the order given by the following list:

desired_list = ["badge", "issue", "date", "publisher", "city", "page"]

So far I have the following code (rather inefficient, since it relies on a for loop):

from bs4 import BeautifulSoup

desired_list = [None]*6 # initialize with [None, None, None, None, None, None]

soup = BeautifulSoup(html, "lxml") # html_1, html_2, html_3

fwb = soup.find("span", class_="font-weight-bold")
issue_date = fwb.select("span.ng-star-inserted") # up to 2 elements: ['nro XX extension', 'DD.MM.YYYY']

for el in issue_date:
    element = el.text.split()
    if "nro" in element:
        desired_list[1] = " ".join(element) # handling issue: nro XX extension
    else:
        desired_list[2] = " ".join(element) # handling date: DD.MM.YYYY

badge = soup.find("span", class_="badge badge-secondary ng-star-inserted")
if badge:
    desired_list[0] = " ".join(badge.text.split()) # handling badge

Currently, I can only extract the first three components of desired_list, namely badge, issue, and date:

[None, 'nro 4 A.', '6.12.1939', None, None, None] # html_1
[None, None, '1905', None, None, None]            # html_2
['22', 'nro 12 ZZ', '10.2016', None, None, None]  # html_3

Whereas the desired list for each HTML snippet should look like this:

[None, 'nro 4 A.', '6.12.1939', 'JR 10', None, 'sivu 1']                     # html_1
[None, None, '1905', 'Aksel Paul', 'Helsinki', 'page 63']                    # html_2
['22', 'nro 12 ZZ', '10.2016', 'Arbetarförlaget Ab', 'Stockholm', 'sida 20'] # html_3

I don't know how to change my code to retrieve all 6 fields from the HTML snippets above, because which fields are present varies unpredictably and some information may be missing. I would really appreciate it if someone could recommend a smarter and more efficient way of handling this.

I am aware of soup.find_all("span", class_="ng-star-inserted"). However, the problem is that find_all does not always return a list of length 6 that I could enumerate!

Cheers,

Solution

You might just define a list of selectors corresponding to desired_list:

selectors = [
    'span.badge.badge-secondary.ng-star-inserted',                                             # badge
    'span.font-weight-bold span.ng-star-inserted:has(span[translate])',                        # issue
    'span.font-weight-bold span.ng-star-inserted:last-child',                                  # date
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ")',                      # publisher
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ") + span.ng-star-inserted:-soup-contains(", ")', # city
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(":")',                       # page
]

and then

desired_list = [None if s[0] is None else (
    s[0].get_text(' ', strip=True)[2:] if '-soup-contains' in s[1] 
    else s[0].get_text(' ', strip=True)
) for s in [(soup.select_one(sel), sel) for sel in selectors]]

should give you what you're looking for.
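For readers who find the one-line comprehension hard to follow, here is a minimal sketch of the same idea written as a loop. The names FIELDS, SELECTORS and extract_fields are hypothetical and not part of the original answer. The built-in "html.parser" is used only so the sketch runs without extra parsers, and html_1 is a condensed copy of the first snippet from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical names for illustration; the order matches desired_list.
FIELDS = ["badge", "issue", "date", "publisher", "city", "page"]

# Same selectors as in the answer, in the same order as FIELDS.
SELECTORS = [
    'span.badge.badge-secondary.ng-star-inserted',
    'span.font-weight-bold span.ng-star-inserted:has(span[translate])',
    'span.font-weight-bold span.ng-star-inserted:last-child',
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ")',
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ")'
    ' + span.ng-star-inserted:-soup-contains(", ")',
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(":")',
]


def extract_fields(html, parser="html.parser"):
    """Return the six fields in FIELDS order, with None for missing ones."""
    soup = BeautifulSoup(html, parser)
    values = []
    for selector in SELECTORS:
        tag = soup.select_one(selector)
        if tag is None:
            values.append(None)
            continue
        text = tag.get_text(" ", strip=True)
        # Spans matched via :-soup-contains start with ", " or ": ";
        # drop that two-character separator, as the comprehension does.
        if "-soup-contains" in selector:
            text = text[2:]
        values.append(text)
    return values


# Condensed copy of the first snippet from the question (assumption:
# the Angular attributes are irrelevant to the selectors and are omitted).
html_1 = """
<html><body>
<span class="font-weight-bold">
 <span class="ng-star-inserted"><span translate="">nro</span> 4 A.</span><!-- -->
 <span class="ng-star-inserted">6.12.1939</span><!-- -->
</span>
<span class="ng-star-inserted">, JR 10</span><!-- --><!-- -->
<span class="ng-star-inserted">: <span translate="">sivu</span> 1</span>
</body></html>
"""

result = extract_fields(html_1)
print(dict(zip(FIELDS, result)))
```

Pairing the result with FIELDS via zip makes it easier to see which selector produced which value. Note that the [2:] slice assumes the separator is always exactly two characters (", " or ": "), which holds for the three snippets shown but may not hold for every page.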

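Please note that you might need to install and use the html5lib parser (which can be a bit slower than lxml):

soup = BeautifulSoup(html, "html5lib") # html_1, html_2, html_3

Otherwise, pseudo-classes like :has and :-soup-contains might raise errors or just return nothing. [Although, after installing html5lib, I noticed the selectors started working no matter which parser I used, including lxml, but I did need to install html5lib first. That suggests the parser itself was not the cause: these pseudo-classes are handled by soupsieve, the selector library behind select(), and :-soup-contains needs soupsieve 2.1 or newer, so upgrading soupsieve may be enough.]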
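If the selectors misbehave, a quick check like the one below may help narrow down the cause. It is only a sketch: it assumes the pseudo-classes are resolved by soupsieve (the library Beautiful Soup uses for select() and select_one()) rather than by the parser, and the tiny snippet is made up for the test.

```python
from importlib.metadata import version

from bs4 import BeautifulSoup

# Report the installed versions of the two packages involved in select().
print("beautifulsoup4:", version("beautifulsoup4"))
print("soupsieve:", version("soupsieve"))

# Tiny made-up document, parsed with the parser bundled with Python.
snippet = '<p><span class="a">, Helsinki</span><span class="b"><i>x</i></span></p>'
soup = BeautifulSoup(snippet, "html.parser")

# Both pseudo-classes used in the answer's selectors.
contains_hit = soup.select_one('span:-soup-contains(", ")')
has_hit = soup.select_one("span:has(i)")
print(contains_hit, has_hit)
```

If both lookups return a tag here, the pseudo-classes work without html5lib or lxml, and any remaining differences between parsers are more likely due to how each one builds the tree.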

Answered By - Driftr95

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
