Issue
I have two advanced sorting scenarios.
The two cases are independent, so below is a sample DataFrame followed by the expected result for each case.
import pandas as pd
a={
"ip":['10.10.11.30','10.10.11.30','10.10.11.30', '10.2.2.10', '10.10.2.1', '10.2.2.2'],
"path":['/data/foo/err','/data/foo/zone','/data/foo/err','/data/foo/zone','/data/foo/zone','/data/foo/tmp'],
"date":['25/01/2024','25/01/2024','01/08/2020','23/01/2024','24/01/2024','25/01/2024'],
"count":[3,10,20,5,20,50]
}
df=pd.DataFrame(a)
print(df)
print()
## Output
ip path date count
0 10.10.11.30 /data/foo/err 25/01/2024 3
1 10.10.11.30 /data/foo/zone 25/01/2024 10
2 10.10.11.30 /data/foo/err 01/08/2020 20
3 10.2.2.10 /data/foo/zone 23/01/2024 5
4 10.10.2.1 /data/foo/zone 24/01/2024 20
5 10.2.2.2 /data/foo/tmp 25/01/2024 50
Runnable sample: https://onecompiler.com/python/422hhyqa5
Sorting case 1
Rule(s)
- Order by ip ASC, date ASC, count ASC
Expected output
ip path date count
5 10.2.2.2 /data/foo/tmp 25/01/2024 50
3 10.2.2.10 /data/foo/zone 23/01/2024 5
4 10.10.2.1 /data/foo/zone 24/01/2024 20
2 10.10.11.30 /data/foo/err 01/08/2020 20
0 10.10.11.30 /data/foo/err 25/01/2024 3
1 10.10.11.30 /data/foo/zone 25/01/2024 10
My Attempt
Sorting on multiple columns by their natural order is straightforward (date is of type datetime).
I also managed to sort by ip.
But I could not combine the two: every attempt raised a different error.
df_sorted_by_date = (df.sort_values(by=['date', 'count'],
ascending=[True, True],
ignore_index=True))
df_sorted_by_ip = (df.sort_values(by=["ip"],
key=lambda x: x.str.split(".").apply(lambda y: [int(z) for z in y]),
ignore_index=True))
Sorting case 2
Rule(s)
- rank1: if (path contains 'zone') and (count >= 10), then place 1st and order by ip ASC
- rank2: if (path does NOT contain 'zone') and (count >= 10) and (date = today()), then place 2nd and order by ip ASC
- rank3: the remaining rows are placed last and ordered by ip ASC, then by date ASC if the ip values are equal
Expected output
Let's assume today is 25/01/2024.
ip path date count
4 10.10.2.1 /data/foo/zone 24/01/2024 20 # rank1
1 10.10.11.30 /data/foo/zone 25/01/2024 10 # rank1
5 10.2.2.2 /data/foo/tmp 25/01/2024 50 # rank2
3 10.2.2.10 /data/foo/zone 23/01/2024 5 # rank3
2 10.10.11.30 /data/foo/err 01/08/2020 20 # rank3
0 10.10.11.30 /data/foo/err 25/01/2024 3 # rank3
My Attempt
None. I am not sure how to proceed, this is very advanced for me >_<
Solution
case 1
Perform a natural sort with natsort:
# pip install natsort
from natsort import natsort_key
# ensure datetime
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# sort in desired order
out = df.sort_values(by=['ip', 'date', 'count'], key=natsort_key)
Output:
ip path date count
5 10.2.2.2 /data/foo/tmp 2024-01-25 50
3 10.2.2.10 /data/foo/zone 2024-01-23 5
4 10.10.2.1 /data/foo/zone 2024-01-24 20
2 10.10.11.30 /data/foo/err 2020-08-01 20
0 10.10.11.30 /data/foo/err 2024-01-25 3
1 10.10.11.30 /data/foo/zone 2024-01-25 10
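If installing natsort is not an option, the same ordering can probably be reached with the standard library alone. The following is only a sketch and not part of the answer above: the helper name `ip_aware_key` is made up for illustration, and it assumes every `ip` value is a valid IPv4 address. It relies on `sort_values` calling the `key` function once per column with a Series named after that column.

```python
import ipaddress
import pandas as pd

df = pd.DataFrame({
    "ip": ['10.10.11.30', '10.10.11.30', '10.10.11.30', '10.2.2.10', '10.10.2.1', '10.2.2.2'],
    "path": ['/data/foo/err', '/data/foo/zone', '/data/foo/err', '/data/foo/zone', '/data/foo/zone', '/data/foo/tmp'],
    "date": ['25/01/2024', '25/01/2024', '01/08/2020', '23/01/2024', '24/01/2024', '25/01/2024'],
    "count": [3, 10, 20, 5, 20, 50],
})
# explicit day/month/year format, equivalent to dayfirst=True for this data
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

def ip_aware_key(col):
    # hypothetical helper: only the 'ip' column is transformed,
    # every other column keeps its regular ordering
    if col.name == 'ip':
        # an IPv4 address maps to a single integer that sorts numerically
        return col.map(lambda s: int(ipaddress.IPv4Address(s)))
    return col

out = df.sort_values(by=['ip', 'date', 'count'], key=ip_aware_key)
print(out)
```

Because the date column is passed through unchanged, it keeps its datetime dtype during the sort.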
case 2
Assign a rank computed with numpy.select, then perform a natural sort with natsort:
# pip install natsort
import numpy as np
from natsort import natsort_key
# ensure datetime
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# conditions
m1 = df['path'].str.contains('zone')
m2 = df['count'].ge(10)
m3 = df['date'].eq(pd.Timestamp('2024-01-25')) # in your case: pd.Timestamp('today').normalize()
# create rank based on your conditions
df['rank'] = np.select([m1&m2, m2&m3&~m1],
['rank1', 'rank2'],
'rank3')
# sort in desired order
out = df.sort_values(by=['rank', 'ip', 'date', 'count'], key=natsort_key)
NB. If needed, you can always drop the rank column afterwards.
Output:
ip path date count rank
4 10.10.2.1 /data/foo/zone 2024-01-24 20 rank1
1 10.10.11.30 /data/foo/zone 2024-01-25 10 rank1
5 10.2.2.2 /data/foo/tmp 2024-01-25 50 rank2
3 10.2.2.10 /data/foo/zone 2024-01-23 5 rank3
2 10.10.11.30 /data/foo/err 2020-08-01 20 rank3
0 10.10.11.30 /data/foo/err 2024-01-25 3 rank3
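The three rules can also be wrapped in a function that takes "today" as a parameter and removes the helper column before returning. This is a sketch under assumptions, not the answer's own code: the function name `rank_and_sort` and the integer ranks 1/2/3 are illustrative choices, and the ip ordering uses the standard-library `ipaddress` module instead of natsort, so it assumes valid IPv4 addresses.

```python
import ipaddress
import numpy as np
import pandas as pd

def rank_and_sort(df, today):
    """Order df by the three ranking rules; 'today' is a Timestamp at midnight."""
    m1 = df['path'].str.contains('zone')
    m2 = df['count'].ge(10)
    m3 = df['date'].eq(today)
    # 1 = zone with count >= 10, 2 = non-zone with count >= 10 dated today, 3 = the rest
    rank = np.select([m1 & m2, m2 & m3 & ~m1], [1, 2], default=3)

    def key(col):
        # only the ip column needs a custom ordering
        if col.name == 'ip':
            return col.map(lambda s: int(ipaddress.IPv4Address(s)))
        return col

    return (df.assign(rank=rank)
              .sort_values(by=['rank', 'ip', 'date', 'count'], key=key)
              .drop(columns='rank'))

df = pd.DataFrame({
    "ip": ['10.10.11.30', '10.10.11.30', '10.10.11.30', '10.2.2.10', '10.10.2.1', '10.2.2.2'],
    "path": ['/data/foo/err', '/data/foo/zone', '/data/foo/err', '/data/foo/zone', '/data/foo/zone', '/data/foo/tmp'],
    "date": ['25/01/2024', '25/01/2024', '01/08/2020', '23/01/2024', '24/01/2024', '25/01/2024'],
    "count": [3, 10, 20, 5, 20, 50],
})
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

# fixed date so the result matches the expected output of the question
out = rank_and_sort(df, today=pd.Timestamp('2024-01-25'))
print(out)
```

For a real run, `pd.Timestamp.today().normalize()` could be passed as `today`; the `normalize()` call drops the time of day so that the equality test against midnight dates can match.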
handling NaT (bug?)
natsort seems to replace missing values with -inf, which cannot be compared with timestamps. One workaround is to use your own wrapper that leaves datetime columns untouched:
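import numpy as np
from natsort import natsort_key
def mykey(s):
    # leave datetime columns as they are so that NaT is handled by pandas
    if np.issubdtype(s.dtype, np.datetime64):
        return s
    else:
        return natsort_key(s)
out = df.sort_values(by=['ip', 'date', 'count'], key=mykey)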
Output:
ip path date count
5 10.2.2.2 /data/foo/tmp 2024-01-25 50
6 10.2.2.2 /data/foo/tmp NaT 50
3 10.2.2.10 /data/foo/zone 2024-01-23 5
4 10.10.2.1 /data/foo/zone 2024-01-24 20
2 10.10.11.30 /data/foo/err 2020-08-01 20
0 10.10.11.30 /data/foo/err 2024-01-25 3
1 10.10.11.30 /data/foo/zone 2024-01-25 10
Alternatively, convert the dates to integers:
out = df.loc[df.astype({'date': int}).sort_values(by=['ip', 'date', 'count'], key=natsort_key).index]
Output:
ip path date count
6 10.2.2.2 /data/foo/tmp NaT 50
5 10.2.2.2 /data/foo/tmp 2024-01-25 50
3 10.2.2.10 /data/foo/zone 2024-01-23 5
4 10.10.2.1 /data/foo/zone 2024-01-24 20
2 10.10.11.30 /data/foo/err 2020-08-01 20
0 10.10.11.30 /data/foo/err 2024-01-25 3
1 10.10.11.30 /data/foo/zone 2024-01-25 10
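The outputs in this section contain a row 6 with a missing date that is not part of the original sample. The sketch below shows one way such a row might be built, and how the `na_position` parameter of `sort_values` decides where the NaT lands when the date column keeps its datetime dtype. It is an illustration only: the values of row 6 are read off the outputs above, the helper name `key` is arbitrary, and the ip ordering uses the standard-library `ipaddress` module rather than natsort.

```python
import ipaddress
import pandas as pd

df = pd.DataFrame({
    "ip": ['10.10.11.30', '10.10.11.30', '10.10.11.30', '10.2.2.10', '10.10.2.1', '10.2.2.2', '10.2.2.2'],
    "path": ['/data/foo/err', '/data/foo/zone', '/data/foo/err', '/data/foo/zone', '/data/foo/zone', '/data/foo/tmp', '/data/foo/tmp'],
    "date": ['25/01/2024', '25/01/2024', '01/08/2020', '23/01/2024', '24/01/2024', '25/01/2024', None],
    "count": [3, 10, 20, 5, 20, 50, 50],
})
# the missing value in row 6 becomes NaT
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

def key(col):
    # datetime and integer columns pass through unchanged
    if col.name == 'ip':
        return col.map(lambda s: int(ipaddress.IPv4Address(s)))
    return col

cols = ['ip', 'date', 'count']
nat_last = df.sort_values(by=cols, key=key)                        # default na_position='last'
nat_first = df.sort_values(by=cols, key=key, na_position='first')
print(nat_last.index.tolist())
print(nat_first.index.tolist())
```

Note that `na_position` applies to every sort column, so it would also move missing ip or count values if there were any.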
Answered By - mozway