Friday, November 11, 2022

[FIXED] Is there a way to use pandas.read_xml() without a URI/URL for namespaces?

Labels: beautifulsoup, pandas, parsing, python, xml

Issue

In my XML file [studentinfo.xml], some tags have namespace prefixes. Is there a way to loop through the file and parse the tag content [all sibling and child tags] without defining a URI/URL for each namespace?

If you have another way of parsing the XML file that does not use pandas, I am open to any and all solutions.

<?xml version="1.0" encoding="UTF-8"?>
<stu:StudentBreakdown>
<stu:Studentdata>
    <stu:StudentScreening>
        <st:name>Sam Davies</st:name>
        <st:age>15</st:age>
        <st:hair>Black</st:hair>
        <st:eyes>Blue</st:eyes>
        <st:grade>10</st:grade>
        <st:teacher>Draco Malfoy</st:teacher>
        <st:dorm>Innovation Hall</st:dorm>
    </stu:StudentScreening>
    <stu:StudentScreening>
        <st:name>Cassie Stone</st:name>
        <st:age>14</st:age>
        <st:hair>Science</st:hair>
        <st:grade>9</st:grade>
        <st:teacher>Luna Lovegood</st:teacher>
    </stu:StudentScreening>
    <stu:StudentScreening>
        <st:name>Derek Brandon</st:name>
        <st:age>17</st:age>
        <st:eyes>green</st:eyes>
        <st:teacher>Ron Weasley</st:teacher>
        <st:dorm>Hogtie Manor</st:dorm>
    </stu:StudentScreening>
</stu:Studentdata>
</stu:StudentBreakdown>

Below is my code:

import pandas as pd
from bs4 import BeautifulSoup
with open('studentinfo.xml', 'r') as f:
    file = f.read()  

def parse_xml(file):
    soup = BeautifulSoup(file, 'xml')
    df1 = pd.DataFrame(columns=['StudentName', 'Age', 'Hair', 'Eyes', 'Grade', 'Teacher', 'Dorm'])
    all_items = soup.find_all('info')
    items_length = len(all_items)
    for index, info in enumerate(all_items):
        StudentName = info.find('<st:name>').text
        Age = info.find('<st:age>').text
        Hair = info.find('<st:hair>').text
        Eyes = info.find('<st:eyes>').text
        Grade = info.find('<st:grade>').text
        Teacher = info.find('<st:teacher>').text
        Dorm = info.find('<st:dorm>').text
      row = {
            'StudentName': StudentName,
            'Age': Age,
            'Hair': Hair,
            'Eyes': Eyes,
            'Grade': Grade,
            'Teacher': Teacher,
            'Dorm': Dorm
        }
        
        df1 = df1.append(row, ingore_index=True)
        print(f'Appending row %s of %s' %(index+1, items_length))
    
    return df1

Desired Output:

	Name	age	hair	eyes	grade	teacher	dorm
0	Sam Davies	15	Black	Blue	10	Draco Malfoy	Innovation Hall
1	Cassie Stone	14	Science	N/A	9	Luna Lovegood	N/A
2	Derek Brandon	17	N/A	green	N/A	Ron Weasley	Hogtie Manor

Solution

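You were about 90% there. I fixed a few things:

all_items: search for StudentScreening instead of info, which does not exist in the file
info.find() statements: look tags up by bare name ('name', not '<st:name>') and fall back to NaN when a tag is missing
pd.concat(): used instead of df1.append(), which was deprecated and then removed in pandas 2.0
parse_xml(): actually called at the end, so the result gets printed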

Here is the code:

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

# Read in the XML file
with open('studentinfo.xml', 'r') as f:
    file = f.read()  

def parse_xml(file):
    soup = BeautifulSoup(file, 'xml')
    df1 = pd.DataFrame(columns=['StudentName', 'Age', 'Hair', 'Eyes', 'Grade', 'Teacher', 'Dorm'])
    # One StudentScreening element per student
    all_items = soup.find_all('StudentScreening')
    for info in all_items:
        # Use NaN for any tag that is missing from this record
        row = {
            'StudentName': info.find('name').text if info.find('name') else np.nan,
            'Age': info.find('age').text if info.find('age') else np.nan,
            'Hair': info.find('hair').text if info.find('hair') else np.nan,
            'Eyes': info.find('eyes').text if info.find('eyes') else np.nan,
            'Grade': info.find('grade').text if info.find('grade') else np.nan,
            'Teacher': info.find('teacher').text if info.find('teacher') else np.nan,
            'Dorm': info.find('dorm').text if info.find('dorm') else np.nan
        }

        # Turn the dict into a one-row frame and add it to the result
        df1 = pd.concat([df1, pd.Series(row).to_frame().T], ignore_index=True)

    return df1


print(parse_xml(file))

OUTPUT:

     StudentName Age     Hair   Eyes Grade        Teacher             Dorm
0     Sam Davies  15    Black   Blue    10   Draco Malfoy  Innovation Hall
1   Cassie Stone  14  Science    NaN     9  Luna Lovegood              NaN
2  Derek Brandon  17      NaN  green   NaN    Ron Weasley     Hogtie Manor
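
A side note that is not part of the original answer: since the question is open to approaches that do not use pandas, the same rows can probably be built with the standard library alone. The sketch below assumes the prefixes appear only on element names (not on attributes or inside CDATA). It strips them with a regular expression and then reads each record with xml.etree.ElementTree. The names strip_prefixes, parse_students, FIELDS and the inlined two-record XML_TEXT are illustrative, not taken from the question.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of studentinfo.xml from the question (first two records only).
XML_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<stu:StudentBreakdown>
<stu:Studentdata>
    <stu:StudentScreening>
        <st:name>Sam Davies</st:name>
        <st:age>15</st:age>
        <st:hair>Black</st:hair>
        <st:eyes>Blue</st:eyes>
        <st:grade>10</st:grade>
        <st:teacher>Draco Malfoy</st:teacher>
        <st:dorm>Innovation Hall</st:dorm>
    </stu:StudentScreening>
    <stu:StudentScreening>
        <st:name>Cassie Stone</st:name>
        <st:age>14</st:age>
        <st:hair>Science</st:hair>
        <st:grade>9</st:grade>
        <st:teacher>Luna Lovegood</st:teacher>
    </stu:StudentScreening>
</stu:Studentdata>
</stu:StudentBreakdown>
"""

# Child tags to read from each record, in output order.
FIELDS = ["name", "age", "hair", "eyes", "grade", "teacher", "dorm"]


def strip_prefixes(xml_text):
    """Drop 'prefix:' from opening and closing element names.

    Hypothetical helper: it does not touch prefixed attributes or CDATA.
    """
    return re.sub(r"(</?)[A-Za-z_][\w.-]*:", r"\1", xml_text)


def parse_students(xml_text):
    """Return one dict per StudentScreening element, with 'N/A' for missing tags."""
    root = ET.fromstring(strip_prefixes(xml_text))
    rows = []
    for record in root.iter("StudentScreening"):
        rows.append({field: record.findtext(field, default="N/A") for field in FIELDS})
    return rows


rows = parse_students(XML_TEXT)
for row in rows:
    print(row)
```

Passing the resulting list to pd.DataFrame(rows) should give a table like the one above. Collecting the rows in a list first also avoids calling pd.concat() once per record.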

Answered By - ScottC
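
On the title question itself: the XML parsers behind pandas.read_xml() generally reject a prefix that has no declared URI, so one possible workaround (not taken from the answer above) is to remove the prefixes from the text before pandas sees it. This is a minimal sketch that assumes pandas 1.3 or later, where read_xml() exists, and uses parser="etree" so that lxml is not required. XML_TEXT is an inlined, shortened stand-in for studentinfo.xml, and the regular expression only handles prefixes on element names.

```python
import re
from io import StringIO

import pandas as pd

# Shortened stand-in for studentinfo.xml (first two records from the question).
XML_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<stu:StudentBreakdown>
<stu:Studentdata>
    <stu:StudentScreening>
        <st:name>Sam Davies</st:name>
        <st:age>15</st:age>
        <st:hair>Black</st:hair>
        <st:eyes>Blue</st:eyes>
        <st:grade>10</st:grade>
        <st:teacher>Draco Malfoy</st:teacher>
        <st:dorm>Innovation Hall</st:dorm>
    </stu:StudentScreening>
    <stu:StudentScreening>
        <st:name>Cassie Stone</st:name>
        <st:age>14</st:age>
        <st:hair>Science</st:hair>
        <st:grade>9</st:grade>
        <st:teacher>Luna Lovegood</st:teacher>
    </stu:StudentScreening>
</stu:Studentdata>
</stu:StudentBreakdown>
"""

# Drop 'prefix:' from element names so that no namespace URI is needed.
cleaned = re.sub(r"(</?)[A-Za-z_][\w.-]*:", r"\1", XML_TEXT)

# Each StudentScreening element becomes one row; its child tags become columns.
df = pd.read_xml(StringIO(cleaned), xpath=".//StudentScreening", parser="etree")
print(df)
```

Tags that are absent from a record should come back as missing values, which matches the N/A cells in the desired output. The columns keep the tag names (name, age, ...), so df.rename(columns=...) would be needed to get headers such as StudentName.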

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
