Issue
I have been asking for help with importing JSON-formatted data from URLs (I am new to dealing with JSON) and received a great answer to my previous question.
However, I have encountered a complication. Some of my property names contain spaces. For example, "Property1" and several other property names from my previous question might actually be "Property1_word1 Property1_word2." The current solution preserves only the first word of a property name. I could get away with that at first but now need all words. If anyone could please point me to any tips, I would be grateful. I haven't managed to find any so far.
Edit (providing all information here so that there's no need to refer to previous posts):
I want to import data from a website. First I save the contents (below) of the website as a file. In my previous question, each property name was made up of only one word. Now I'm dealing with property names that are made up of multiple words. I have provided an example below, where Property1, Property4, and Property8 have names with multiple words.
{
"payload": {
"allShortcutsEnabled": false,
"fileTree": {
"": {
"items": [
{
"name": "thing",
"path": "thing",
"contentType": "directory"
},
{
"name": ".repurlignore",
"path": ".repurlignore",
"contentType": "file"
},
{
"name": "README.md",
"path": "README.md",
"contentType": "file"
},
{
"name": "thing2",
"path": "thing2",
"contentType": "file"
},
{
"name": "thing3",
"path": "thing3",
"contentType": "file"
},
{
"name": "thing4",
"path": "thing4",
"contentType": "file"
},
{
"name": "thing5",
"path": "thing5",
"contentType": "file"
},
{
"name": "thing6",
"path": "thing6",
"contentType": "file"
},
{
"name": "thing7",
"path": "thing7",
"contentType": "file"
},
{
"name": "thing8",
"path": "thing8",
"contentType": "file"
},
{
"name": "thing9",
"path": "thing9",
"contentType": "file"
},
{
"name": "thing10",
"path": "thing10",
"contentType": "file"
},
{
"name": "thing11",
"path": "thing11",
"contentType": "file"
}
],
"totalCount": 500
}
},
"fileTreeProcessingTime": 5.262188,
"foldersToFetch": [],
"reducedMotionEnabled": null,
"repo": {
"id": 1234567,
"defaultBranch": "main",
"name": "repository",
"ownerLogin": "contributor",
"currentUserCanPush": false,
"isFork": false,
"isEmpty": false,
"createdAt": "2023-10-31",
"ownerAvatar": "https://avatars.repurlusercontent.com/u/98765432?v=1",
"public": true,
"private": false,
"isOrgOwned": false
},
"symbolsExpanded": false,
"treeExpanded": true,
"refInfo": {
"name": "main",
"listCacheKey": "v0:13579",
"canEdit": false,
"refType": "branch",
"currentOid": "identifier"
},
"path": "thing2",
"currentUser": null,
"blob": {
"rawLines": [
" C_1H_4 Methane ",
" 5.00000 Property1_word1 Property1_word2 ",
" 20.00000 Property2 ",
" 500.66500 Property3 ",
" 100.00000 Property4_word1 Property4_word2 ",
" -4453.98887 Property5 ",
" 100.48200 Property6 ",
" 59.75258 Property7 ",
" 5.33645 Property8_word1 Property8_word2 ",
" 0.00000 Property9 ",
" 645.07777 Property10 ",
" 0.00000 Property11 ",
" 0.00000 Property12 ",
" 0.00000 Property13 ",
" 0.00000 Property14 ",
" 0.00000 Property15 ",
" 0.00000 Property16 ",
" 0.00000 Property17 ",
" 0.00000 Property18 ",
" 0.00000 Property19 ",
" 0.00000 Property20 ",
" 0.00000 Property21 ",
" 0.00000 Property22 ",
" 0.00000 Property23 ",
" 0.00000 Property24 ",
" 0.00000 Property25 ",
" 0.57876 Property26 ",
" 4.00000 Property27 ",
" 0.00000 Property28 ",
" 0.00000 Property29 ",
" 0.00000 Property30 ",
" 0.00000 Property31 ",
" 0.00000 Property32 ",
" 1.00000 Property33 ",
" 0.00000 Property34 ",
" 26.00000 Property35 ",
" 1.44571 Property36 ",
" 1.08756 Property37 ",
" 0.00000 Property38 ",
" 0.00000 Property39 ",
" 0.00000 Property40 ",
" 6.00000 Property41 ",
" 9.00000 Property42 ",
" 0.00000 Property43 "
],
"stylingDirectives": [
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[]
],
"csv": null,
"csvError": null,
"dependabotInfo": {
"showConfigurationBanner": false,
"configFilePath": null,
"networkDependabotPath": "/contributor/repository/network/updates",
"dismissConfigurationNoticePath": "/settings/dismiss-notice/dependabot_configuration_notice",
"configurationNoticeDismissed": null,
"repoAlertsPath": "/contributor/repository/security/dependabot",
"repoSecurityAndAnalysisPath": "/contributor/repository/settings/security_analysis",
"repoOwnerIsOrg": false,
"currentUserCanAdminRepo": false
},
"displayName": "thing2",
"displayUrl": "https://repurl.com/contributor/repository/blob/main/thing2?raw=true",
"headerInfo": {
"blobSize": "3.37 KB",
"deleteInfo": {
"deleteTooltip": "You must be signed in to make or propose changes"
},
"editInfo": {
"editTooltip": "XXX"
},
"ghDesktopPath": "https://desktop.repurl.com",
"repurlLfsPath": null,
"onBranch": true,
"shortPath": "5678",
"siteNavLoginPath": "/login?return_to=identifier",
"isCSV": false,
"isRichtext": false,
"toc": null,
"lineInfo": {
"truncatedLoc": "33",
"truncatedSloc": "33"
},
"mode": "executable file"
},
"image": false,
"isCodeownersFile": null,
"isPlain": false,
"isValidLegacyIssueTemplate": false,
"issueTemplateHelpUrl": "https://docs.repurl.com/articles/about-issue",
"issueTemplate": null,
"discussionTemplate": null,
"language": null,
"languageID": null,
"large": false,
"loggedIn": false,
"newDiscussionPath": "/contributor/repository/issues/new",
"newIssuePath": "/contributor/repository/issues/new",
"planSupportInfo": {
"repoOption1": null,
"repoOption2": null,
"requestFullPath": "/contributor/repository/blob/main/thing2",
"repoOption4": null,
"repoOption5": null,
"repoOption6": null,
"repoOption7": null
},
"repoOption8": {
"repoOption9": "/settings/dismiss-notice/repoOption10",
"releasePath": "/contributor/repository/releases/new=true",
"repoOption11": false,
"repoOption12": false
},
"rawBlobUrl": "https://repurl.com/contributor/repository/raw/main/thing2",
"repoOption13": false,
"richText": null,
"renderedFileInfo": null,
"shortPath": null,
"tabSize": 8,
"topBannersInfo": {
"overridingGlobalFundingFile": false,
"universalPath": null,
"repoOwner": "contributor",
"repoName": "repository",
"repoOption14": false,
"citationHelpUrl": "https://docs.repurl.com/en/repurl/archiving/about",
"repoOption15": false,
"repoOption16": null
},
"truncated": false,
"viewable": true,
"workflowRedirectUrl": null,
"symbols": {
"timedOut": false,
"notAnalyzed": true,
"symbols": []
}
},
"collabInfo": null,
"collabMod": false,
"wtsdf_signifier": {
"/contributor/repository/branches": {
"post": "identifier"
},
"/repos/preferences": {
"post": "identifier"
}
}
},
"title": "repository/thing2 at main \\u0000 contributor/repository"
}
Here is the code that deals with property names made up of one word (the command that strips whitespace means that only the first word of names made up of multiple words is imported):
import json
import pandas as pd
# Read the saved JSON file; the with block closes it automatically
with open("yourJson.json", "r") as f:
    data = json.load(f)
# Get what we want to extract from the json
to_extract = data["payload"]["blob"]["rawLines"]
# Remove useless whitespace
stripped = [e.strip() for e in to_extract]
trimmed = [" ".join(e.split()) for e in stripped]
# Transform the list of string to a dict
as_dict = {e.split(' ')[0]: e.split(' ')[1] for e in trimmed}
# Load the dict with pandas
df = pd.DataFrame(as_dict.items(), columns=['Value', 'Property'])
I have experimented with various solutions (e.g., not stripping whitespace, specifying the exact property names associated with the data I need), but I am so unfamiliar with JSON that I cannot make sense of the resulting errors.
Solution
Let's break down your example to just two lines of data.
to_extract = [
" C_1H_4 Methane ",
" 5.00000 Property1_word1 Property1_word2 ",
]
stripped = [e.strip() for e in to_extract]
trimmed = [" ".join(e.split()) for e in stripped]
print(f"{trimmed=}")
This gives us the cleaned data:
trimmed=['C_1H_4 Methane', '5.00000 Property1_word1 Property1_word2']
In the next part of your code you split the strings in this list and construct the dictionary. Let's see what we get here:
for e in trimmed:
    print(e.split(' '))
The resulting lists look like this:
['C_1H_4', 'Methane']
['5.00000', 'Property1_word1', 'Property1_word2']
As you can see, the second string was split into a list with 3 parts, and the third part (index 2) gets lost in your code. You could join the parts together again, but there is an easier way. The split method has a maxsplit parameter that we can use to do only one split.
for e in trimmed:
    print(e.split(' ', 1))
Both lists now have only 2 entries.
['C_1H_4', 'Methane']
['5.00000', 'Property1_word1 Property1_word2']
So you just have to change your old code
as_dict = {e.split(' ')[0]: e.split(' ')[1] for e in trimmed}
to
as_dict = {e.split(' ')[0]: e.split(' ', 1)[1] for e in trimmed}
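For completeness, here is a minimal sketch of the whole pipeline with that change applied. It assumes a shortened, inline JSON string (the hypothetical raw_json below) in place of the saved file, keeping only the keys the code reads and three of the rawLines entries; with the real file you would load it with json.load as before.

```python
import json

# Hypothetical, shortened stand-in for the saved file:
# only the keys used below and three of the original lines are kept.
raw_json = """
{
  "payload": {
    "blob": {
      "rawLines": [
        "  C_1H_4        Methane  ",
        "  5.00000       Property1_word1 Property1_word2  ",
        "  20.00000      Property2  "
      ]
    }
  }
}
"""

data = json.loads(raw_json)
to_extract = data["payload"]["blob"]["rawLines"]

# Collapse runs of whitespace so that a single space separates the fields
trimmed = [" ".join(line.split()) for line in to_extract]

# maxsplit=1 splits only at the first space, so multi-word names stay whole
as_dict = {e.split(" ", 1)[0]: e.split(" ", 1)[1] for e in trimmed}

print(as_dict)
```

The only difference from the original script is the second argument to split; the rest of the steps are unchanged.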
Additionally: I don't like that we call split two times. And splitting first and then rejoining the strings when constructing trimmed seems to be too much work too. We can throw out the intermediate creation of stripped and trimmed and boil all of this down to a single line (passing None as the separator makes split treat any run of whitespace as one delimiter):
as_dict = dict(line.strip().split(None, 1) for line in to_extract)
The result is:
{'C_1H_4': 'Methane', '5.00000': 'Property1_word1 Property1_word2'}
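One caveat that may matter for the full data set: the dictionary is keyed by the first column, and many lines in the question share the same value (for example 0.00000). Dictionary keys are unique, so later lines overwrite earlier ones. The sketch below uses three lines taken from the question to show the effect, and collects (value, property) pairs in a list instead; the name rows is only an illustration, not part of the original code.

```python
to_extract = [
    "  5.00000  Property1_word1 Property1_word2  ",
    "  0.00000  Property9  ",
    "  0.00000  Property11  ",
]

# Keyed by the numeric string: the two 0.00000 lines collapse into one entry
as_dict = dict(line.strip().split(None, 1) for line in to_extract)

# A list of (value, property) pairs keeps every line
rows = [tuple(line.strip().split(None, 1)) for line in to_extract]

print(len(as_dict))
print(len(rows))
```

If you need every property in the DataFrame, a list like rows can be passed directly, e.g. pd.DataFrame(rows, columns=['Value', 'Property']).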
Answered By - Matthias