Saturday, September 17, 2022

[FIXED] How to get text from HTML with a slightly weird structure?

Labels: beautifulsoup, html, parsing, python, web-scraping

Issue

I have a web page that contains an HTML structure like this:

<div class="ui-rectframe">
    <p class="ui-li-desc"></p>
    <h4 class="ui-li-heading">Qualifications</h4>
    MBBS (University of Singapore, Singapore) 1978
    <br>
    MCFP (Family Med) (College of Family Physicians, Singapore) 1984
    <br>
    Dip Geriatric Med (NUS, Singapore) 2012
    <br>
    GDPM (NUS, Singapore) 2015
    <br>
    <h4 class="ui-li-heading">Type of first registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Type of current registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Practising Certificate Start Date</h4>
    01/01/2022<br>
    <h4 class="ui-li-heading">Practising Certificate End Date</h4>
    31/12/2023<br>
    <p></p><br>
</div>

I need to extract the qualifications as a list:

[ 'MBBS (University of Singapore, Singapore) 1978', 'MCFP (Family Med) (College of Family Physicians, Singapore) 1984', 'Dip Geriatric Med (NUS, Singapore) 2012', 'GDPM (NUS, Singapore) 2015' ]

How can I do that using a CSS selector or XPath? I am able to extract all the text items inside the parent div, but I can't separate the qualifications from the other values, such as Type of first registration.

Solution

You could extract a list of the headers and a list of all stripped_strings, then use a function to separate the strings into sections by checking each one against the headers:

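def create_dict(strings, headers):
    # Walk through the strings once; reaching a header closes the section before it.
    idx = 0
    d = {}
    for header in headers:
        sublist = []
        # Collect everything up to (but not including) the current header.
        while strings[idx] != header:
            sublist.append(strings[idx])
            idx += 1
        if sublist:
            # For this markup, sublist[0] is the section's header and the rest are its values.
            d[sublist[0]] = sublist[1:]
    return d

# Header texts and all non-empty strings inside the div, in document order.
h = [e.get_text(strip=True) for e in soup.select('div h4')]
s = list(soup.div.stripped_strings)

create_dict(s, h)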

Output:

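Note - The results are stored in a dict, so the other sections can be picked from it as well if necessary. Because a section is only stored once the next header is reached, the last section (Practising Certificate End Date) does not appear: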

{'Qualifications': ['MBBS (University of Singapore, Singapore) 1978',
  'MCFP (Family Med) (College of Family Physicians, Singapore) 1984',
  'Dip Geriatric Med (NUS, Singapore) 2012',
  'GDPM (NUS, Singapore) 2015'],
 'Type of first registration / date': ['Full Registration (14/06/1979)'],
 'Type of current registration / date': ['Full Registration (14/06/1979)'],
 'Practising Certificate Start Date': ['01/01/2022']}
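
The output above omits the final section, because create_dict only stores a section when it reaches the following header. A minimal sketch of a variant that keeps the last section too is shown here. The name group_by_headers is hypothetical, the input lists are copied from the sample, and it assumes no value is identical to a header text:

```python
def group_by_headers(strings, headers):
    """Map each header to the strings that follow it, up to the next header."""
    header_set = set(headers)
    sections = {}
    current = None
    for text in strings:
        if text in header_set:
            # A header starts a new (initially empty) section.
            current = text
            sections[current] = []
        elif current is not None:
            # Any other string belongs to the most recent header.
            sections[current].append(text)
    return sections


# Same values that soup.select('div h4') and soup.div.stripped_strings
# yield for the sample HTML, written out so the sketch runs on its own.
headers = [
    'Qualifications',
    'Type of first registration / date',
    'Type of current registration / date',
    'Practising Certificate Start Date',
    'Practising Certificate End Date',
]
strings = [
    'Qualifications',
    'MBBS (University of Singapore, Singapore) 1978',
    'MCFP (Family Med) (College of Family Physicians, Singapore) 1984',
    'Dip Geriatric Med (NUS, Singapore) 2012',
    'GDPM (NUS, Singapore) 2015',
    'Type of first registration / date',
    'Full Registration (14/06/1979)',
    'Type of current registration / date',
    'Full Registration (14/06/1979)',
    'Practising Certificate Start Date',
    '01/01/2022',
    'Practising Certificate End Date',
    '31/12/2023',
]

result = group_by_headers(strings, headers)
print(result['Practising Certificate End Date'])  # ['31/12/2023']
```

Strings that appear before the first header are ignored in this sketch, which matches the sample, where nothing precedes Qualifications.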

Example

from bs4 import BeautifulSoup

html = '''
<div class="ui-rectframe">
    <p class="ui-li-desc"></p>
    <h4 class="ui-li-heading">Qualifications</h4>
    MBBS (University of Singapore, Singapore) 1978
    <br>
    MCFP (Family Med) (College of Family Physicians, Singapore) 1984
    <br>
    Dip Geriatric Med (NUS, Singapore) 2012
    <br>
    GDPM (NUS, Singapore) 2015
    <br>
    <h4 class="ui-li-heading">Type of first registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Type of current registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Practising Certificate Start Date</h4>
    01/01/2022<br>
    <h4 class="ui-li-heading">Practising Certificate End Date</h4>
    31/12/2023<br>
    <p></p><br>
</div>
'''
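# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')

def create_dict(strings, headers):
    # Walk through the strings once; reaching a header closes the section before it.
    idx = 0
    d = {}
    for header in headers:
        sublist = []
        # Collect everything up to (but not including) the current header.
        while strings[idx] != header:
            sublist.append(strings[idx])
            idx += 1
        if sublist:
            # For this markup, sublist[0] is the section's header and the rest are its values.
            d[sublist[0]] = sublist[1:]
    return d

# Header texts and all non-empty strings inside the div, in document order.
h = [e.get_text(strip=True) for e in soup.select('div h4')]
s = list(soup.div.stripped_strings)

create_dict(s, h)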
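
If only one section is needed, such as the qualifications asked for in the question, another option is to start at the matching h4 and walk its siblings until the next h4. This is a rough sketch and is not taken from the answer above: section_values is a hypothetical helper, the HTML is a shortened copy of the sample, and it assumes the values are bare text nodes directly after the heading:

```python
from bs4 import BeautifulSoup, Tag

# Shortened copy of the sample markup from the question.
html = '''
<div class="ui-rectframe">
    <p class="ui-li-desc"></p>
    <h4 class="ui-li-heading">Qualifications</h4>
    MBBS (University of Singapore, Singapore) 1978
    <br>
    GDPM (NUS, Singapore) 2015
    <br>
    <h4 class="ui-li-heading">Practising Certificate End Date</h4>
    31/12/2023<br>
    <p></p><br>
</div>
'''


def section_values(soup, heading):
    """Return the text nodes between the given h4 heading and the next h4."""
    h4 = soup.find('h4', string=heading)
    if h4 is None:
        return []
    values = []
    for node in h4.next_siblings:
        if isinstance(node, Tag):
            if node.name == 'h4':
                break  # the next heading ends this section
            continue  # skip <br>, <p> and any other tags
        text = node.strip()
        if text:
            values.append(text)
    return values


soup = BeautifulSoup(html, 'html.parser')
print(section_values(soup, 'Qualifications'))
```

Because the walk stops at the next h4 or at the end of the parent div, the last section can be read the same way, for example section_values(soup, 'Practising Certificate End Date').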

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
