Issue
I have a very large JSON file with thousands of rows that look like this (scraped):
[
{"result": ["/results/1138/dundalk-aw/2022-03-11/806744", "/results/1138/dundalk-aw/2022-03-11/806745", "/results/1138/dundalk-aw/2022-03-11/806746", "/results/1138/dundalk-aw/2022-03-11/806747", "/results/1138/dundalk-aw/2022-03-11/806748", "/results/1138/dundalk-aw/2022-03-11/806749", "/results/1138/dundalk-aw/2022-03-11/806750", "/results/1138/dundalk-aw/2022-03-11/806751", "/results/14/exeter/2022-03-11/804190", "/results/14/exeter/2022-03-11/804193", "/results/14/exeter/2022-03-11/804194", "/results/14/exeter/2022-03-11/804192", "/results/14/exeter/2022-03-11/804196", "/results/14/exeter/2022-03-11/804191", "/results/14/exeter/2022-03-11/804195", "/results/30/leicester/2022-03-11/804201", "/results/30/leicester/2022-03-11/804200", "/results/30/leicester/2022-03-11/804198", "/results/30/leicester/2022-03-11/804197", "/results/30/leicester/2022-03-11/804199", "/results/30/leicester/2022-03-11/804202", "/results/37/newcastle/2022-03-11/804181", "/results/37/newcastle/2022-03-11/804179", "/results/37/newcastle/2022-03-11/804182", "/results/37/newcastle/2022-03-11/804180", "/results/37/newcastle/2022-03-11/804177", "/results/37/newcastle/2022-03-11/804176", "/results/37/newcastle/2022-03-11/804178", "/results/513/wolverhampton-aw/2022-03-11/804352", "/results/513/wolverhampton-aw/2022-03-11/804353", "/results/513/wolverhampton-aw/2022-03-11/806925", "/results/513/wolverhampton-aw/2022-03-11/804350", "/results/513/wolverhampton-aw/2022-03-11/804354", "/results/513/wolverhampton-aw/2022-03-11/804349", "/results/513/wolverhampton-aw/2022-03-11/804351", "/results/1303/al-ain/2022-03-11/806926", "/results/1244/goulburn/2022-03-11/807045", "/results/869/sakhir/2022-03-11/806948", "/results/1244/goulburn/2022-03-11/807045", "/results/869/sakhir/2022-03-11/806948"]},
{"result": ["/results/8/carlisle/2022-03-10/804174", "/results/8/carlisle/2022-03-10/804172", "/results/8/carlisle/2022-03-10/804170", "/results/8/carlisle/2022-03-10/804175", "/results/8/carlisle/2022-03-10/804171", "/results/8/carlisle/2022-03-10/804173", "/results/8/carlisle/2022-03-10/805620", "/results/1353/newcastle-aw/2022-03-10/804340", "/results/1353/newcastle-aw/2022-03-10/804341", "/results/1353/newcastle-aw/2022-03-10/804338", "/results/1353/newcastle-aw/2022-03-10/804342", "/results/1353/newcastle-aw/2022-03-10/804337", "/results/1353/newcastle-aw/2022-03-10/804339", "/results/394/southwell-aw/2022-03-10/804346", "/results/394/southwell-aw/2022-03-10/804344", "/results/394/southwell-aw/2022-03-10/804345", "/results/394/southwell-aw/2022-03-10/804348", "/results/394/southwell-aw/2022-03-10/806779", "/results/394/southwell-aw/2022-03-10/804343", "/results/394/southwell-aw/2022-03-10/804347", "/results/394/southwell-aw/2022-03-10/806778", "/results/198/thurles/2022-03-10/806623", "/results/198/thurles/2022-03-10/806624", "/results/198/thurles/2022-03-10/806625", "/results/198/thurles/2022-03-10/806626", "/results/198/thurles/2022-03-10/806627", "/results/198/thurles/2022-03-10/806628", "/results/198/thurles/2022-03-10/806629", "/results/90/wincanton/2022-03-10/804183", "/results/90/wincanton/2022-03-10/804186", "/results/90/wincanton/2022-03-10/804188", "/results/90/wincanton/2022-03-10/804185", "/results/90/wincanton/2022-03-10/804187", "/results/90/wincanton/2022-03-10/804184", "/results/90/wincanton/2022-03-10/804189", "/results/219/saint-cloud/2022-03-10/807032", "/results/219/saint-cloud/2022-03-10/806812", "/results/219/saint-cloud/2022-03-10/806837", "/results/219/saint-cloud/2022-03-10/807033", "/results/219/saint-cloud/2022-03-10/807037", "/results/219/saint-cloud/2022-03-10/807041", "/results/219/saint-cloud/2022-03-10/807042", "/results/219/saint-cloud/2022-03-10/807043", "/results/219/saint-cloud/2022-03-10/807044", "/results/219/saint-cloud/2022-03-10/806837", "/results/219/saint-cloud/2022-03-10/807033"]}
]
Now, inside the "result" arrays there are some duplicates; in this case, for example, /results/1244/goulburn/2022-03-11/807045 appears twice.
How could I filter these duplicates out? I found some solutions here on Stack Overflow for checking whether entire "result" arrays are duplicates, but not for checking whether anything inside an array is a duplicate. At least nothing I tried worked, so I guess I am messing something up. I tried for two days but could not figure this one out on my own, and I could not find the answer in the similar questions here on Stack Overflow either. I also have very, very limited Java knowledge.
I looked into converting the JSON into a list and then filtering out the duplicates, but that seemed way too clunky for a large file?
Solution
Once you have all the JSON data loaded, you can map over the rows, remove the duplicated entries with set, and convert back to list to preserve the original structure:
data = [{...}]  # the large JSON data list, loaded e.g. with json.load()
# Turn each "result" array into a set to drop duplicates, then back into a list.
# Note that set() does not preserve the original order of the URLs.
data = list(map(lambda x: {'result': list(set(x['result']))}, data))
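If the original order of the URLs matters, keep in mind that set does not preserve it. As an alternative, here is a minimal sketch of a full round trip that deduplicates while keeping the first occurrence of each URL in place, using dict.fromkeys; the file names scraped.json and deduped.json are only placeholders for illustration:

import json

# Load the scraped data (the file name is just an example).
with open('scraped.json', encoding='utf-8') as f:
    data = json.load(f)

# dict.fromkeys keeps only the first occurrence of each URL and preserves order.
deduped = [{'result': list(dict.fromkeys(row['result']))} for row in data]

# Write the cleaned data back out (again, the file name is an example).
with open('deduped.json', 'w', encoding='utf-8') as f:
    json.dump(deduped, f, indent=2)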
Answered By - anbarquer