Issue
I have parsed XML from a website and found that the root has two branches (children).
How can I separate the two branches into two lists of dictionaries?
Here's my code so far:
import pandas as pd
import xml.etree.ElementTree as ET
import requests
url = "http://cs.stir.ac.uk/~soh/BD2spring2022/assignmentdata.php"
params = {'data':'spurpyr'}
response = requests.get(url, params=params)
tree = response.content
#extract the root element as separate variable, and display the root tag.
root = ET.fromstring(tree)
print(root.tag)
#Get attributes of root
root_attr = root.attrib
print(root_attr)
#Finding children of root
for child in root:
    print(child.tag, child.attrib)
#extract the two children of the root element into another two separate variables, and display their tags as well
child_dict = []
for child in root:
    child_dict.append(child.tag)
tweets_branch = child_dict[0]
cities_branch = child_dict[1]
#the elements in the entire tree
[elem.tag for elem in root.iter()]
#specify both the encoding and decoding of the document you are displaying as the string
print(ET.tostring(root, encoding='utf8').decode('utf8'))
Solution
Use the beautifulsoup module. To parse the tweets and cities into lists of dictionaries, you can use this example:
import requests
from bs4 import BeautifulSoup
url = "http://cs.stir.ac.uk/~soh/BD2spring2022/assignmentdata.php"
params = {"data": "spurpyr"}
soup = BeautifulSoup(requests.get(url, params=params).content, "xml")
tweets = []
for t in soup.select("tweets > tweet"):
    tweets.append({"id": t["id"], **{x.name: x.text for x in t.find_all()}})
cities = []
for c in soup.select("cities > city"):
    cities.append({"id": c["id"], **{x.name: x.text for x in c.find_all()}})
print(tweets)
print(cities)
Prints:
[
    {
        "id": "16620625 5686",
        "Name": "Kenyon Conley",
        "Phone": "0327 103 9485",
        "Email": "[email protected]",
        "Location": "45.5333, -73.2833",
        "GenderID": "male",
        "Tweet": "#FollowFriday @DanielleMorrill - She's with @Seattle20 and @Twilio. Also fun to talk to. #entrepreneur",
        "City": "Saint-Basile-le-Grand",
        "Country": "Canada",
        "Age": "34",
    },
    {
        "id": "16310427-5502",
        "Name": "Griffin Norton",
        "Phone": "0306 178 7917",
        "Email": "[email protected]",
        "Location": "52.0000, 84.9833",
        "GenderID": "male",
        "Tweet": "!!!Veryy Bored!!! ~~Craving Million's Of MilkShakes~~",
        "City": "Belokurikha",
        "Country": "Russia",
        "Age": "33",
    },
...
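Since the question already uses xml.etree.ElementTree, the same result can probably be reached without an extra dependency. The following is only a sketch: SAMPLE_XML, its root tag "data", and the helper name branch_to_dicts are made up for illustration, and the structure (a "tweets" branch holding "tweet" records and a "cities" branch holding "city" records, each with an "id" attribute) is assumed from the selectors used above. With the real feed you would pass response.content to ET.fromstring instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the structure implied by the selectors above;
# the real feed may contain other fields or nesting.
SAMPLE_XML = """<data>
  <tweets>
    <tweet id="t1"><Name>Ann</Name><Age>34</Age></tweet>
    <tweet id="t2"><Name>Bob</Name><Age>33</Age></tweet>
  </tweets>
  <cities>
    <city id="c1"><City>Stirling</City><Country>UK</Country></city>
  </cities>
</data>"""


def branch_to_dicts(branch):
    """Build one dict per record: its attributes plus the text of each direct child."""
    return [
        {**record.attrib, **{field.tag: field.text for field in record}}
        for record in branch
    ]


root = ET.fromstring(SAMPLE_XML)

# Look the branches up by tag name rather than by position in the root.
tweets = branch_to_dicts(root.find("tweets"))
cities = branch_to_dicts(root.find("cities"))

print(tweets)
print(cities)
```

Note that iterating over a record visits only its direct children, whereas find_all() in the BeautifulSoup version also descends into nested elements, so the two approaches match only when the records are flat.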
Answered By - Andrej Kesely