Issue
The original HTML looks like this:
<meta content="A" name="citation_author"/>
<meta content="Axxxxx" name="citation_author_institution"/>
<meta content="Aorcid" name="citation_author_orcid"/>
<meta content="B" name="citation_author"/>
<meta content="Bxxx1" name="citation_author_institution"/>
<meta content="Bxxx2" name="citation_author_institution"/>
<meta content="C" name="citation_author"/>
<meta content="D" name="citation_author"/>
<meta content="Dorcid" name="citation_author_orcid">
<meta content="E" name="citation_author"/>
<meta content="Eyyyyy" name="citation_email"/>
The output should look like this:
name | institution | orcid | email
---|---|---|---
A | Axxxxx | Aorcid |
B | Bxxx1; Bxxx2 | |
C | | |
D | | Dorcid |
E | | | Eyyyyy
I'm using Python 3.7.
I tried using 'find_all' to get all the author names, then find_next_sibling('meta', {'name': 'xxx'}) to get the corresponding fields for each author. But take ORCID as an example: authors B and C don't have an ORCID, so the code I wrote returns D's ORCID for them, because find_next_sibling keeps searching past the next citation_author tag.
AU_names = soup.find_all('meta', {'name': 'citation_author'})
for name in AU_names:
    AU_name = name.attrs['content']
    ORCID = name.find_next_sibling('meta', {'name': 'citation_author_orcid'})
    ORCID = ORCID.attrs['content'] if ORCID else ''
    print(AU_name, ORCID)
Could anyone help me? Thank you!
Solution
Well, this turned out to be an interesting question...
Try something along these lines:
from bs4 import BeautifulSoup as bs
import pandas as pd

orc = """<doc>[your html above]</doc>"""  # we need to wrap the html in a root element
soup = bs(orc, 'lxml')
targets = ["citation_author_institution", "citation_author_orcid", "citation_email"]
entries = []
for met in soup.select('meta[name=citation_author]'):
    # collect every tag that follows this author, stopping at the next author
    person = []
    for m in met.findAllNext():
        row = []
        if m.get('name') == "citation_author":
            break
        else:
            row.append(m)
            person.append(row)
    if len(person) == 0:  # this is to account for authors like C, with no data
        entry = [met.attrs['content'], 'NA', 'NA', 'NA']
        entries.append(entry)
    else:
        entry = [met.attrs['content']]
        for target in targets:
            mini_t = []
            for p in person:
                if p[0].get('name') == target:
                    # mini_t.append(p[0].attrs['content'])  # old version
                    mini_t.append(p[0].attrs['content'] + ' ')  # edited version
            entry.append(mini_t)
        for tf in entry:
            if len(tf) == 0:  # this is to account for missing data
                tf.append('NA')
        # entry = [' '.join(tf) for tf in entry]  # convert all inside lists to text - old version
        entry = [''.join(tf) for tf in entry]  # edited version
        entries.append(entry)
# finally, create the dataframe
columns = ['name', 'inst', 'orcid', 'email']
df = pd.DataFrame(entries, columns=columns)
print(df)
Output:
  name         inst   orcid   email
0    A       Axxxxx  Aorcid      NA
1    B  Bxxx1 Bxxx2      NA      NA
2    C           NA      NA      NA
3    D           NA  Dorcid      NA
4    E           NA      NA  Eyyyyy
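The same grouping idea can also be written as a single pass over the tags in document order: each citation_author starts a new row, and every other known field is attached to the most recent row. The sketch below is not part of the answer above. To stay dependency-free it swaps in the standard library's html.parser in place of BeautifulSoup and pandas, and the names MetaCollector, FIELDS and group_by_author are hypothetical helpers introduced only for illustration. It joins multiple institutions with "; " as the question asked, and leaves missing fields as empty strings rather than 'NA'.

```python
from html.parser import HTMLParser

HTML = """
<meta content="A" name="citation_author"/>
<meta content="Axxxxx" name="citation_author_institution"/>
<meta content="Aorcid" name="citation_author_orcid"/>
<meta content="B" name="citation_author"/>
<meta content="Bxxx1" name="citation_author_institution"/>
<meta content="Bxxx2" name="citation_author_institution"/>
<meta content="C" name="citation_author"/>
<meta content="D" name="citation_author"/>
<meta content="Dorcid" name="citation_author_orcid">
<meta content="E" name="citation_author"/>
<meta content="Eyyyyy" name="citation_email"/>
"""

# Hypothetical mapping from meta name to output column (an assumption).
FIELDS = {
    "citation_author_institution": "institution",
    "citation_author_orcid": "orcid",
    "citation_email": "email",
}


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags in document order."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.pairs.append((d["name"], d["content"]))


def group_by_author(pairs):
    """Start a new row at each citation_author; attach later fields to it."""
    rows = []
    for name, content in pairs:
        if name == "citation_author":
            rows.append({"name": content, "institution": [], "orcid": [], "email": []})
        elif name in FIELDS and rows:
            rows[-1][FIELDS[name]].append(content)
    # Join repeated values (e.g. two institutions) into one cell.
    return [
        {key: value if key == "name" else "; ".join(value) for key, value in row.items()}
        for row in rows
    ]


collector = MetaCollector()
collector.feed(HTML)
rows = group_by_author(collector.pairs)
for row in rows:
    print(row)
```

If you already have a soup object, the pairs could presumably be built from soup.find_all('meta') instead of the parser class, and the resulting list of dicts can be passed to pd.DataFrame directly.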
Answered By - Jack Fleeting