Issue
In my XML file [studentinfo.xml] some tags have namespace prefixes, is there a way to loop through the xml file and parse tag content [all sibling and child tags] without defining the URI/URL for namespace?
If you have another way of parsing the xml file not using pandas I am open to any and all solutions.
<?xml version="1.0" encoding="UTF-8"?>
<stu:StudentBreakdown>
<stu:Studentdata>
<stu:StudentScreening>
<st:name>Sam Davies</st:name>
<st:age>15</st:age>
<st:hair>Black</st:hair>
<st:eyes>Blue</st:eyes>
<st:grade>10</st:grade>
<st:teacher>Draco Malfoy</st:teacher>
<st:dorm>Innovation Hall</st:dorm>
</stu:StudentScreening>
<stu:StudentScreening>
<st:name>Cassie Stone</st:name>
<st:age>14</st:age>
<st:hair>Science</st:hair>
<st:grade>9</st:grade>
<st:teacher>Luna Lovegood</st:teacher>
</stu:StudentScreening>
<stu:StudentScreening>
<st:name>Derek Brandon</st:name>
<st:age>17</st:age>
<st:eyes>green</st:eyes>
<st:teacher>Ron Weasley</st:teacher>
<st:dorm>Hogtie Manor</st:dorm>
</stu:StudentScreening>
</stu:Studentdata>
</stu:StudentBreakdown>
below is my code:
import pandas as pd
from bs4 import BeautifulSoup
with open('studentinfo.xml', 'r') as f:
file = f.read()
def parse_xml(file):
soup = BeautifulSoup(file, 'xml')
df1 = pd.DataFrame(columns=['StudentName', 'Age', 'Hair', 'Eyes', 'Grade', 'Teacher', 'Dorm'])
all_items = soup.find_all('info')
items_length = len(all_items)
for index, info in enumerate(all_items):
StudentName = info.find('<st:name>').text
Age = info.find('<st:age>').text
Hair = info.find('<st:hair>').text
Eyes = info.find('<st:eyes>').text
Grade = info.find('<st:grade>').text
Teacher = info.find('<st:teacher>').text
Dorm = info.find('<st:dorm>').text
row = {
'StudentName': StudentName,
'Age': Age,
'Hair': Hair,
'Eyes': Eyes,
'Grade': Grade,
'Teacher': Teacher,
'Dorm': Dorm
}
df1 = df1.append(row, ingore_index=True)
print(f'Appending row %s of %s' %(index+1, items_length))
return df1
Desired Output:
Name | age | hair | eyes | grade | teacher | dorm | |
---|---|---|---|---|---|---|---|
0 | Sam Davies | 15 | Black | Blue | 10 | Draco Malfoy | Innovation Hall |
1 | Cassie Stone | 14 | Science | N/A | 9 | Luna Lovegood | N/A |
2 | Derek Brandon | 17 | N/A | green | N/A | Ron Weasley | Hogtie Manor |
Solution
You were about 90% there. I just fixed up a couple of things:
all_items
: to findStudentScreening
instead ofinfo
info.find()
statements : dealt with missing valuespd.concat()
: instead ofdf1.append()
- called the function parse_xml at the end
Here is the code:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
# Read in the XML file
with open('studentinfo.xml', 'r') as f:
file = f.read()
def parse_xml(file):
soup = BeautifulSoup(file, 'xml')
df1 = pd.DataFrame(columns=['StudentName', 'Age', 'Hair', 'Eyes', 'Grade', 'Teacher', 'Dorm'])
all_items = soup.find_all('StudentScreening')
for index, info in enumerate(all_items):
row = {
'StudentName': info.find('name').text if info.find('name') else np.nan,
'Age': info.find('age').text if info.find('age') else np.nan,
'Hair': info.find('hair').text if info.find('hair') else np.nan,
'Eyes': info.find('eyes').text if info.find('eyes') else np.nan,
'Grade': info.find('grade').text if info.find('grade') else np.nan,
'Teacher': info.find('teacher').text if info.find('teacher') else np.nan,
'Dorm': info.find('dorm').text if info.find('dorm') else np.nan
}
df1 = pd.concat([df1, pd.Series(row).to_frame().T], ignore_index=True)
return df1
print(parse_xml(file))
OUTPUT:
StudentName Age Hair Eyes Grade Teacher Dorm
0 Sam Davies 15 Black Blue 10 Draco Malfoy Innovation Hall
1 Cassie Stone 14 Science NaN 9 Luna Lovegood NaN
2 Derek Brandon 17 NaN green NaN Ron Weasley Hogtie Manor
Answered By - ScottC
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.