Issue
I have an XML file with a few thousand records, from which I want to extract:
- The city: tag 110 code c (for example Berlin)
- The library code: tag 110 code g (for example D-Bbbf)
I would like to get a dataframe with each city next to its library code. If the library code (code="g") does not exist, I would like NaN or something else that indicates that there is no value. For example:
df = {'Cities': [Berlin, London], 'Codes': [D-Bbbf, NaN]}
This is a piece of the XML:
<marc:record>
<marc:controlfield tag="001">39612</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:controlfield tag="008">161109n|||||||a||| a</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record><marc:record>
<marc:controlfield tag="001">30006648</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">The National Archives</marc:subfield>
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
This is what I have tried:
# Import BeautifulSoup
from bs4 import BeautifulSoup

Data = {'Cities': [],
        'Code': []}

# Read the XML file
with open('oefen.xml', 'r', encoding="utf8") as f_in:
    soup = BeautifulSoup(f_in.read(), 'html.parser')

for record in soup.find_all(tag="110"):
    find = record.find_all('[code="g"]')
    for code in record:
        if find is not None:
            City = record.select_one('[code="c"]')  # select city
            Code = record.select_one('[code="g"]')  # select code
            Data['Cities'].append(City.get_text(strip=True))
            Data['Code'].append(Code.get_text(strip=True))
        else:
            print(NaN)
print(Data)
Solution
I don't think it is necessary to work with these separate lists; it is easier to use one list of dicts. While iterating over your records, check whether the element you are looking for is available, and append its text if it is or None if it is not:
for record in soup.find_all('marc:record'):
    data.append({
        'City': e.get_text(strip=True) if (e := record.select_one('[code="c"]')) else None,  # select city
        'Code': e.get_text(strip=True) if (e := record.select_one('[code="g"]')) else None   # select code
    })
Example
xml='''
<marc:record>
<marc:controlfield tag="001">39612</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:controlfield tag="008">161109n|||||||a||| a</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record><marc:record>
<marc:controlfield tag="001">30006648</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">The National Archives</marc:subfield>
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
'''
# Import BeautifulSoup and pandas
from bs4 import BeautifulSoup
import pandas as pd

data = []
soup = BeautifulSoup(xml, 'lxml')

for record in soup.find_all('marc:record'):
    data.append({
        'City': e.get_text(strip=True) if (e := record.select_one('[code="c"]')) else None,  # select city
        'Code': e.get_text(strip=True) if (e := record.select_one('[code="g"]')) else None   # select code
    })

pd.DataFrame(data)
Output
City | Code |
---|---|
Berlin | D-Bbbf |
London | None |
EDIT
If you are not using Python 3.8 or later, where the walrus operator (:=) was introduced, this is an alternative way to do the check:
...
data = []
soup = BeautifulSoup(xml, 'lxml')

for record in soup.find_all('marc:record'):
    # select_one() returns None when nothing matches, so .get_text() raises AttributeError
    try:
        city = record.select_one('[code="c"]').get_text(strip=True)
    except AttributeError:
        city = None
    try:
        code = record.select_one('[code="g"]').get_text(strip=True)
    except AttributeError:
        code = None
    data.append({
        'City': city,
        'Code': code
    })

pd.DataFrame(data)
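If you would rather not depend on BeautifulSoup, the same idea can be sketched with the standard library's xml.etree.ElementTree. This is only a rough sketch under assumptions: the excerpt in the question shows neither a root element nor a namespace declaration, so the marc:collection wrapper and the MARC21 slim namespace URI below are guesses about the full file. Unlike the selectors above, it only looks inside the tag="110" datafield, and it relies on findtext() returning None when a subfield is missing.

```python
import xml.etree.ElementTree as ET

# Assumption: the real file wraps the records in a root element that declares
# the "marc" prefix; the excerpt in the question shows neither.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

xml = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
<marc:record>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record>
<marc:record>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">The National Archives</marc:subfield>
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
</marc:collection>"""

root = ET.fromstring(xml)

data = []
for record in root.findall("marc:record", NS):
    # Only look inside the tag="110" datafield of this record
    field = record.find('marc:datafield[@tag="110"]', NS)
    if field is None:
        continue  # record has no 110 field at all
    data.append({
        # findtext() returns None when no matching subfield exists
        "City": field.findtext('marc:subfield[@code="c"]', namespaces=NS),
        "Code": field.findtext('marc:subfield[@code="g"]', namespaces=NS),
    })

print(data)
```

The resulting list of dicts can be passed to pd.DataFrame(data) in the same way as in the examples above.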
Answered By - HedgeHog