Thursday, October 28, 2021

[FIXED] Wrong/weird output by the .sort() and sorted() function in python

pandas, python, sorting

Issue

I have a .csv file that, shown as a table, looks partly like this:

Each row represents an entity, in this case a game. Column "0" holds the links to their dbpedia pages, column "1" holds the labels and column "2" is an index. It starts at 1 and counts up.

What I'd like in the end is a list of just the links, i.e. column "0", but sorted by column "2".

I've done it the same way for a lot of other tables but for this one it seems the method breaks and I don't know why.

import pandas as pd

entities = pd.read_csv("24142265_0_4577466141408796359.csv", header=None)

entitiesUri = [str(ent) for ent in entities[0]]
tmp = entitiesUri.copy()

# Sort 'entitiesUri' by column 2 of 'entities', looking up each link's row via its position in tmp
entitiesUri.sort(key = lambda k: int(entities[2][tmp.index(k)]))

I've created a copy of entitiesUri (tmp) to be sure that the sort() method doesn't mess up when using the list it has to sort in the lambda function.

This is the print of "entitiesUri":

The links were sorted neither by the index nor, it seems, alphabetically. Instead, the same games were somehow bunched together in a seemingly random order.

I also used

entitiesUri = sorted(entitiesUri, key = lambda k: int(entities[2][tmp.index(k)]))

instead of sort() but the results are the same.

The only thing that has worked for me so far is the sort_values() function from Pandas:

entities = entities.sort_values(2)

entitiesUri = [ent for ent in entities[0]]

With the right result:

but this method slows me down a lot. Any ideas why sort() and sorted() break?

I've linked to dropbox where you can download the .csv file if you want to try it out yourself.

https://www.dropbox.com/s/ld8u4td5rk4vn71/24142265_0_4577466141408796359.csv?dl=0

Solution

The reason the first approach does not work as you would expect is that your input has duplicates in the URL column, and list.index() always returns the index of the first matching item.

$ grep The_Elder_Scrolls_V 24142265_0_4577466141408796359.csv
"http://dbpedia.org/resource/The_Elder_Scrolls_V:_Skyrim","the elder scrolls v: skyrim","3"
"http://dbpedia.org/resource/The_Elder_Scrolls_V:_Skyrim","the elder scrolls v: skyrim","1"
"http://dbpedia.org/resource/The_Elder_Scrolls_V:_Skyrim","the elder scrolls v: skyrim","5"
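The behaviour of list.index() with duplicates can be seen on a small made-up list. The names below are placeholders for illustration and are not taken from the .csv file:

```python
# Made-up values standing in for the URL column; "skyrim" appears three times.
rows = ["skyrim", "portal", "skyrim", "skyrim"]

# list.index() stops at the first match, so every duplicate maps to position 0.
positions = [rows.index(name) for name in rows]

print(positions)  # [0, 1, 0, 0]
```

Every duplicate is therefore looked up in the row of its first occurrence, not in its own row.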

So, for example, key = lambda k: int(entities[2][tmp.index(k)]) returns 3 (the value in the last column of the dataframe for the first occurrence of the URL above) for all 3 occurrences in the dataframe.

>>> for e in tmp:
...   if e == 'http://dbpedia.org/resource/The_Elder_Scrolls_V:_Skyrim':
...       print(e, int(entities[2][tmp.index(e)]))
... 
http://dbpedia.org/resource/The_Elder_Scrolls_V:_Skyrim 3
http://dbpedia.org/resource/The_Elder_Scrolls_V:_Skyrim 3
http://dbpedia.org/resource/The_Elder_Scrolls_V:_Skyrim 3

As 3 is the smallest value (you can verify that by removing the if statement from the above listing), the URL appears first and 3 times in the output of sorted() and sort(). Removing the if statement will also make obvious why sorting entitiesUri produces the result you get.
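One possible way around this, sketched here under assumptions, is to avoid the list.index() lookup entirely and pair each URL with the index value from its own row before sorting. The two lists below are made-up stand-ins for entities[0] and entities[2] (with the real file they could probably be obtained via entities[0].tolist() and entities[2].tolist()), and the second URL is hypothetical:

```python
SKYRIM = "http://dbpedia.org/resource/The_Elder_Scrolls_V:_Skyrim"
OTHER = "http://dbpedia.org/resource/Example_Game"  # hypothetical URL

# Stand-ins for column 0 (links) and column 2 (index) of the dataframe.
uris = [SKYRIM, OTHER, SKYRIM, SKYRIM]
ranks = [3, 2, 1, 5]

# Pair every URL with the index from its own row, then sort on the index only.
pairs = sorted(zip(ranks, uris), key=lambda pair: pair[0])
sorted_uris = [uri for _, uri in pairs]

print(sorted_uris)
```

Because each URL carries its own index value, duplicates no longer collapse onto the first occurrence, and no second lookup into the list is needed.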

Answered By - Nikolaos Chatzis

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
