Issue
I am trying to do NLP on a dataset consisting of rows like the following:
00001 B 74457
00002 C 12804123 16026213 14627885
00004 A 15329425 9058342 11279767
where the first element in each row is an identifier, the second is a label (one of only three values, $A, B, C$), and the remaining numbers, for example 12804123, are the ids of XML files. Each XML file contains data such as text, location, etc. Based on this I need to extract the data from the XML files and use it to build a model. So first of all I want to extract some of the data from an XML file and make a data frame of structured data. An example of the XML file is below. When I run the command pd.read_xml(xml) it gives
medlinecitation pubmeddata
0 NaN NaN
Is there any example from Kaggle or any other source that I can follow to do the analysis?
74457.xml = '''
<pubmedarticleset>
<pubmedarticle>
<medlinecitation owner="NLM" status="MEDLINE">
<pmid version="1"> 74457 </pmid>
<datecreated>
<year> 1978 </year>
<month> 03 </month>
<day> 21 </day>
</datecreated>
<datecompleted>
<year> 1978 </year>
<month> 03 </month>
<day> 21 </day>
</datecompleted>
<daterevised>
<year> 2007 </year>
<month> 11 </month>
<day> 15 </day>
</daterevised>
<article pubmodel="Print">
<journal>
<issn issntype="Print"> 0140-6736 </issn>
<journalissue citedmedium="Print">
<volume> 1 </volume>
<issue> 7984 </issue>
<pubdate>
<year> 1976 </year>
<month> Sep </month>
<day> 4 </day>
</pubdate>
</journalissue>
<title> Lancet </title>
<isoabbreviation> Lancet </isoabbreviation>
</journal>
<articletitle>
Prophylactic treatment of alcoholism by lithium carbonate. A controlled study.
</articletitle>
<pagination>
<medlinepgn> 481-2 </medlinepgn>
</pagination>
<abstract>
<abstracttext>
Lithium therapy has been shown to have a therapeutic influence in reducing the drinking and incapacity by alcohol in depressive alcoholics in a prospective double-blind placebo-controlled trial conducted over one year, but it had no significant effect on non-depressed patients. Patients in the trial treated by placebo had significantly greater alcoholic morbidity if they were depressive than if they were non-depressive.
</abstracttext>
</abstract>
<authorlist completeyn="Y">
<author validyn="Y">
<lastname> Merry </lastname>
<forename> J </forename>
<initials> J </initials>
</author>
<author validyn="Y">
<lastname> Reynolds </lastname>
<forename> C M </forename>
<initials> CM </initials>
</author>
<author validyn="Y">
<lastname> Bailey </lastname>
<forename> J </forename>
<initials> J </initials>
</author>
<author validyn="Y">
<lastname> Coppen </lastname>
<forename> A </forename>
<initials> A </initials>
</author>
</authorlist>
<language> eng </language>
<publicationtypelist>
<publicationtype> Clinical Trial </publicationtype>
<publicationtype> Comparative Study </publicationtype>
<publicationtype> Journal Article </publicationtype>
<publicationtype> Randomized Controlled Trial </publicationtype>
</publicationtypelist>
</article>
<medlinejournalinfo>
<country> ENGLAND </country>
<medlineta> Lancet </medlineta>
<nlmuniqueid> 2985213R </nlmuniqueid>
<issnlinking> 0140-6736 </issnlinking>
</medlinejournalinfo>
<chemicallist>
<chemical>
<registrynumber> 0 </registrynumber>
<nameofsubstance> Placebos </nameofsubstance>
</chemical>
<chemical>
<registrynumber> 7439-93-2 </registrynumber>
<nameofsubstance> Lithium </nameofsubstance>
</chemical>
</chemicallist>
<citationsubset> AIM </citationsubset>
<citationsubset> IM </citationsubset>
<meshheadinglist>
<meshheading>
<descriptorname majortopicyn="N"> Adult </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Alcohol Drinking </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Alcoholism </descriptorname>
<qualifiername majortopicyn="Y"> drug therapy </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Clinical Trials as Topic </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Depression </descriptorname>
<qualifiername majortopicyn="N"> chemically induced </qualifiername>
<qualifiername majortopicyn="Y"> prevention &amp; control </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Double-Blind Method </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Drug Evaluation </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Female </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Humans </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Lithium </descriptorname>
<qualifiername majortopicyn="Y"> therapeutic use </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Male </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Middle Aged </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Placebos </descriptorname>
</meshheading>
</meshheadinglist>
</medlinecitation>
<pubmeddata>
<history>
<pubmedpubdate pubstatus="pubmed">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
</pubmedpubdate>
<pubmedpubdate pubstatus="medline">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
<hour> 0 </hour>
<minute> 1 </minute>
</pubmedpubdate>
<pubmedpubdate pubstatus="entrez">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
<hour> 0 </hour>
<minute> 0 </minute>
</pubmedpubdate>
</history>
<publicationstatus> ppublish </publicationstatus>
<articleidlist>
<articleid idtype="pubmed"> 74457 </articleid>
</articleidlist>
</pubmeddata>
</pubmedarticle>
</pubmedarticleset>'''
Please help me understand what is happening. How can I turn this into a data frame?
Solution
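pd.read_xml only reads the immediate children and attributes of the nodes selected by xpath (by default "./*", i.e. the children of the root). Here that is the single pubmedarticle node, whose children medlinecitation and pubmeddata contain only nested elements and no text of their own, hence the NaN values. One way around this is to point xpath at the nodes you actually need: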
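import pandas as pd

# Attributes and direct text children of <medlinecitation>;
# columns that hold only nested elements come back as NaN and are dropped.
try:
    medlinecitation = pd.read_xml("74457.xml", xpath=".//medlinecitation").dropna(
        axis=1
    )
except ValueError:
    # Raised when the xpath matches no nodes
    medlinecitation = pd.DataFrame()

# One row per <pubmedpubdate> entry in the history
try:
    pubmedpubdate = pd.read_xml("74457.xml", xpath=".//pubmedpubdate")
except ValueError:
    pubmedpubdate = pd.DataFrame()

# Align both frames on the row index and forward-fill the citation columns.
# .ffill() replaces fillna(method="ffill"), which is deprecated in recent pandas.
df = pd.merge(
    left=medlinecitation,
    right=pubmedpubdate,
    how="outer",
    left_index=True,
    right_index=True,
).ffill()

print(df)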
# Output
  owner   status     pmid citationsubset pubstatus  year  month  day  hour  \
0   NLM  MEDLINE  74457.0             IM    pubmed  1976      9    4   NaN
1   NLM  MEDLINE  74457.0             IM   medline  1976      9    4   0.0
2   NLM  MEDLINE  74457.0             IM    entrez  1976      9    4   0.0

   minute
0     NaN
1     1.0
2     0.0
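For the NLP step, the nested text fields (title, abstract, MeSH descriptors) are probably more useful than the date columns. Below is a minimal sketch of pulling them out with the standard-library xml.etree.ElementTree rather than pd.read_xml. The helper names text_of, article_to_record and parse_articles, and the trimmed SAMPLE_XML with its shortened abstract, are made up for illustration and are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for 74457.xml (assumption: only a few fields are kept
# and the abstract is shortened for brevity).
SAMPLE_XML = """
<pubmedarticleset>
  <pubmedarticle>
    <medlinecitation owner="NLM" status="MEDLINE">
      <pmid version="1"> 74457 </pmid>
      <article pubmodel="Print">
        <articletitle> Prophylactic treatment of alcoholism by lithium carbonate. A controlled study. </articletitle>
        <abstract>
          <abstracttext> Lithium therapy has been shown to have a therapeutic influence. </abstracttext>
        </abstract>
      </article>
      <meshheadinglist>
        <meshheading><descriptorname majortopicyn="N"> Adult </descriptorname></meshheading>
        <meshheading><descriptorname majortopicyn="N"> Alcoholism </descriptorname></meshheading>
      </meshheadinglist>
    </medlinecitation>
  </pubmedarticle>
</pubmedarticleset>
"""


def text_of(node, path):
    """Return the stripped text of the first match for path, or None."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def article_to_record(article):
    """Flatten one <pubmedarticle> element into a plain dict."""
    return {
        "pmid": text_of(article, ".//pmid"),
        "title": text_of(article, ".//articletitle"),
        "abstract": text_of(article, ".//abstracttext"),
        # Repeated elements are collected into a list instead of one column each
        "mesh_terms": [
            d.text.strip() for d in article.iter("descriptorname") if d.text
        ],
    }


def parse_articles(xml_text):
    """Return one record per <pubmedarticle> found in the XML string."""
    root = ET.fromstring(xml_text)
    return [article_to_record(a) for a in root.iter("pubmedarticle")]


records = parse_articles(SAMPLE_XML)
print(records[0]["pmid"], records[0]["mesh_terms"])
```

Passing the resulting list of dicts to pd.DataFrame(records) should give one row per article, which can then be matched against the XML ids in the label file.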
Answered By - Laurent