Issue
I'm using Python 3 with beautifulsoup4, pandas, and Counter to convert an XML file to CSV.
There are several thousand products in this XML, and I have trouble with one particular problem.
Many of the products in the XML are children of a parent product, but the parent product itself is not in the XML.
Each of these child products has a special parent tag with the same value (the parent id), so we can tell they are children:
<parent>x</parent>
The XML structure is the following:
<product>
<id>1</id>
<parent>x</parent>
</product>
<product>
<id>2</id>
<parent>x</parent>
</product>
<product>
<id>3</id>
<parent>x</parent>
</product>
<product>
<id>4</id>
</product>
<product>
<id>5</id>
<parent>y</parent>
</product>
<product>
<id>6</id>
<parent>y</parent>
</product>
You can see that the product with id 4 doesn't have a parent tag, so it is not a child. The first 3 products share one parent with value x, and the last two products share another parent with value y, and so on for more than 7000 products.
For my purpose I need to replace the value of each <parent/>
tag with the first id of the same value. My desired outcome:
<product>
<id>1</id>
<parent>1</parent>
</product>
<product>
<id>2</id>
<parent>1</parent>
</product>
<product>
<id>3</id>
<parent>1</parent>
</product>
<product>
<id>4</id>
</product>
<product>
<id>5</id>
<parent>5</parent>
</product>
<product>
<id>6</id>
<parent>5</parent>
</product>
Here is what I have done so far. I also need to convert each value to its respective column and row in the CSV.
#test.py
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

def parse_xml(xml_data):
    # Initializing soup variable
    soup = BeautifulSoup(xml_data, 'xml')
    # Creating columns for the table
    df = pd.DataFrame(columns=['id', 'parent'])
    # Here I get all duplicates of the same tag
    lst = soup.select('parent')
    d = Counter(lst)
    resultparent = [k for k, v in d.items() if v > 1]

    # I split into separate text to get all duplicate x and y as one
    def a():
        for index, i in enumerate(resultparent):
            a = i.text
            return a

    # I also get the id of every x and y
    def b():
        for index, i in enumerate(resultparent):
            c = i.find_previous('id').text
            return c

    # now I start writing the csv
    all_products = soup.findAll('product')
    product_length = len(all_products)
    for index, product in enumerate(all_products):
        parent = product.find('parent')
        if parent is None:
            parent = ""
        else:
            parent = parent.text
        # here I wanted to check the found duplicate values against the existing
        # ones; I hoped that if the parent tag is x it would be replaced with 1
        # (doesn't work)
        if parent == a():
            parent = b()
        product_id = product.find('id').text
        # then I write everything to the csv
        row = [{
            'id': product_id,
            'parent': parent}]
        df = df.append(row, ignore_index=True)
        print('Appending row %s of %s' % (index+1, product_length))
    return df

df = parse_xml(xml_data)
df.to_csv('test.csv')
The code above doesn't work correctly: it replaces only the first values x x x with 1 1 1, but doesn't replace the y values and the rest when writing the CSV. Thank you for your help.
Solution
Given the following input.xml file:
<products>
<product>
<id>1</id>
<parent>x</parent>
</product>
<product>
<id>2</id>
<parent>x</parent>
</product>
<product>
<id>3</id>
<parent>x</parent>
</product>
<product>
<id>4</id>
</product>
<product>
<id>5</id>
<parent>y</parent>
</product>
<product>
<id>6</id>
<parent>y</parent>
</product>
</products>
Here is one way to get the correct matches:
import pandas as pd

df = pd.read_xml("input.xml")

# Group the ids by parent value and keep the first id of each group,
# giving a {parent value -> first id} mapping.
pairs = (
    df.groupby("parent")
    .agg(list)
    .pipe(lambda df_: df_["id"].apply(lambda x: str(x[0])))
    .to_dict()
)
print(pairs)  # {'x': '1', 'y': '5'}
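Equivalently, the same first-id mapping can be built without pandas by walking the products in document order and remembering the first `<id>` seen for each distinct `<parent>` value. A minimal sketch (using a shortened inline version of the input for self-containment):

```python
import xml.etree.ElementTree as ET

def first_id_per_parent(root):
    # Walk products in document order; remember the first <id> seen
    # for each distinct <parent> value.
    pairs = {}
    for product in root:
        parent = product.find("parent")
        if parent is not None and parent.text not in pairs:
            pairs[parent.text] = product.find("id").text
    return pairs

# Shortened inline version of input.xml for demonstration.
root = ET.fromstring(
    "<products>"
    "<product><id>1</id><parent>x</parent></product>"
    "<product><id>2</id><parent>x</parent></product>"
    "<product><id>4</id></product>"
    "<product><id>5</id><parent>y</parent></product>"
    "</products>"
)
print(first_id_per_parent(root))  # {'x': '1', 'y': '5'}
```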
And then, using Python standard library XML module:
import xml.etree.ElementTree as ET

tree = ET.parse("input.xml")
for product in tree.getroot():
    for child in product:
        if child.tag == "parent":
            # Replace the parent code with the first id that used it
            child.text = pairs[child.text]
tree.write("output.xml")
In output.xml:
<products>
<product>
<id>1</id>
<parent>1</parent>
</product>
<product>
<id>2</id>
<parent>1</parent>
</product>
<product>
<id>3</id>
<parent>1</parent>
</product>
<product>
<id>4</id>
</product>
<product>
<id>5</id>
<parent>5</parent>
</product>
<product>
<id>6</id>
<parent>5</parent>
</product>
</products>
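The question also asks for a CSV with id and parent columns. Once the mapping is known, the replacement and the CSV export can be done directly in pandas; this is a sketch, with the XML inlined for self-containment and test.csv reused as the filename from the question:

```python
import io
import pandas as pd

xml_data = """<products>
<product><id>1</id><parent>x</parent></product>
<product><id>2</id><parent>x</parent></product>
<product><id>3</id><parent>x</parent></product>
<product><id>4</id></product>
<product><id>5</id><parent>y</parent></product>
<product><id>6</id><parent>y</parent></product>
</products>"""

# One row per product; parent is NaN where the tag is absent.
df = pd.read_xml(io.StringIO(xml_data), parser="etree")

# First id seen for each parent value, then replace the codes.
pairs = df.dropna(subset=["parent"]).groupby("parent")["id"].first().to_dict()
df["parent"] = df["parent"].map(pairs)

df.to_csv("test.csv", index=False)
```

Products without a parent tag keep an empty parent cell in the CSV, matching the desired outcome above.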
Answered By - Laurent