Saturday, December 4, 2021

[FIXED] Find the corresponding siblings in meta content using Beautiful Soup

Tags: beautifulsoup, python, python-3.x

Issue

The original html looks like:

  <meta content="A" name="citation_author"/>
  <meta content="Axxxxx" name="citation_author_institution"/>
  <meta content="Aorcid" name="citation_author_orcid"/>
  <meta content="B" name="citation_author"/>
  <meta content="Bxxx1" name="citation_author_institution"/>
  <meta content="Bxxx2" name="citation_author_institution"/>
  <meta content="C" name="citation_author"/>
  <meta content="D" name="citation_author"/>
  <meta content="Dorcid" name="citation_author_orcid"/>
  <meta content="E" name="citation_author"/>
  <meta content="Eyyyyy" name="citation_email"/>

The output should look like this:

name	institution	orcid	email
A	Axxxxx	Aorcid
B	Bxxx1; Bxxx2
C
D		Dorcid
E			Eyyyyy

I'm using Python 3.7.

I tried using 'find_all' to get all the author names, then find_next_sibling('meta', {'name': 'xxx'}) to get the corresponding columns for each author. Taking ORCID as an example: since authors B and C don't have an ORCID, the code I wrote returns D's ORCID for both of them.

AU_names = soup.find_all('meta', {'name': 'citation_author'})
for name in AU_names:
    AU_name = name.attrs['content']
    ORCID = name.find_next_sibling('meta', {'name': 'citation_author_orcid'})
    ORCID = ORCID.attrs['content'] if ORCID else ''
    print(AU_name, ORCID)

Could anyone help me? Thank you!
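
The behaviour described above can be reproduced with a small sketch. It assumes the sample HTML from the question is wrapped in a hypothetical <doc> root and parsed with the built-in html.parser; the variable names are illustrative only. find_next_sibling keeps searching through all later siblings and does not stop at the next citation_author tag, which is why B and C pick up D's ORCID.

```python
from bs4 import BeautifulSoup

# Sample from the question, wrapped in a hypothetical <doc> root element.
html = """<doc>
<meta content="A" name="citation_author"/>
<meta content="Axxxxx" name="citation_author_institution"/>
<meta content="Aorcid" name="citation_author_orcid"/>
<meta content="B" name="citation_author"/>
<meta content="Bxxx1" name="citation_author_institution"/>
<meta content="Bxxx2" name="citation_author_institution"/>
<meta content="C" name="citation_author"/>
<meta content="D" name="citation_author"/>
<meta content="Dorcid" name="citation_author_orcid"/>
<meta content="E" name="citation_author"/>
<meta content="Eyyyyy" name="citation_email"/>
</doc>"""

soup = BeautifulSoup(html, "html.parser")

# Same lookup as in the question: the search is not bounded by the next author.
naive = {}
for author in soup.find_all("meta", {"name": "citation_author"}):
    orcid = author.find_next_sibling("meta", {"name": "citation_author_orcid"})
    naive[author["content"]] = orcid["content"] if orcid else ""

for name, orcid in naive.items():
    print(name, orcid)
```

Authors B and C end up with D's ORCID, and only E (who has no ORCID anywhere after it) comes back empty.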

Solution

Well, this turned out to be an interesting question...

Try something along these lines:

from bs4 import BeautifulSoup as bs
import pandas as pd

# we need to wrap the html in a root element
orc = """<doc>[your html above]</doc>"""

soup = bs(orc, 'lxml')
targets = ["citation_author_institution", "citation_author_orcid", "citation_email"]
entries = []
for met in soup.select('meta[name="citation_author"]'):
    # collect the meta tags that follow this author, stopping at the next author
    person = []
    for m in met.find_all_next('meta'):
        if m.get('name') == "citation_author":
            break
        person.append(m)
    entry = [met.attrs['content']]
    for target in targets:
        # all values of this field for the current author (B has two institutions)
        values = [p.attrs['content'] for p in person if p.get('name') == target]
        # join repeated values with a space; 'NA' accounts for missing data (e.g. author C)
        entry.append(' '.join(values) if values else 'NA')
    entries.append(entry)

# finally, create the dataframe
columns = ['name', 'inst', 'orcid', 'email']
df = pd.DataFrame(entries, columns=columns)
print(df)

Output:

   name  inst         orcid   email
0  A     Axxxxx       Aorcid  NA
1  B     Bxxx1 Bxxx2  NA      NA
2  C     NA           NA      NA
3  D     NA           Dorcid  NA
4  E     NA           NA      Eyyyyy
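
If the exact layout from the question is wanted (repeated values joined with "; " and missing values left empty rather than 'NA'), a single pass over the tags may be enough. The following is only a sketch under that assumption: the function name group_by_author, the FIELDS mapping and the column names are illustrative and not part of the answer above. It walks the meta tags in document order and attaches each one to the most recent author, so it does not rely on sibling relationships.

```python
from bs4 import BeautifulSoup

# Sample HTML from the question.
html = """
<meta content="A" name="citation_author"/>
<meta content="Axxxxx" name="citation_author_institution"/>
<meta content="Aorcid" name="citation_author_orcid"/>
<meta content="B" name="citation_author"/>
<meta content="Bxxx1" name="citation_author_institution"/>
<meta content="Bxxx2" name="citation_author_institution"/>
<meta content="C" name="citation_author"/>
<meta content="D" name="citation_author"/>
<meta content="Dorcid" name="citation_author_orcid"/>
<meta content="E" name="citation_author"/>
<meta content="Eyyyyy" name="citation_email"/>
"""

# Hypothetical mapping from meta name to output column.
FIELDS = {
    "citation_author_institution": "institution",
    "citation_author_orcid": "orcid",
    "citation_email": "email",
}


def group_by_author(markup):
    """Return one dict per author, in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for tag in soup.find_all("meta"):
        name = tag.get("name")
        if name == "citation_author":
            # A new author starts a new row; later tags attach to it.
            row = {"name": tag.get("content", "")}
            row.update({column: [] for column in FIELDS.values()})
            rows.append(row)
        elif name in FIELDS and rows:
            rows[-1][FIELDS[name]].append(tag.get("content", ""))
    # Join repeated values with "; " and leave missing values empty.
    return [
        {key: (value if key == "name" else "; ".join(value)) for key, value in row.items()}
        for row in rows
    ]


rows = group_by_author(html)
for row in rows:
    print("\t".join(row.values()))
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows) if a dataframe is needed.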

Answered By - Jack Fleeting

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
