Issue
I have the following dataframe,
df = pd.DataFrame(np.random.randint(0,1000, (5,6)), \
columns = pd.MultiIndex.from_product([['CPC', 'Conversions'], ['April', 'June', 'May']])).rename_axis(index = {None : 'idx'})
df
CPC Conversions
April June May April June May
idx
0 663 964 971 663 76 927
1 405 217 754 370 306 34
2 474 229 664 354 66 885
3 73 538 139 417 876 855
4 619 618 618 455 134 805
I wanted to access the months column level in right format so I did this
df.loc[:, (slice(None), ['April', 'May', 'June'])]
CPC Conversions CPC Conversions CPC Conversions
April April May May June June
idx
0 806 202 963 975 110 55
1 263 884 442 563 216 694
2 462 361 780 412 858 670
3 742 756 525 33 477 826
4 579 332 91 802 829 231
It's weird how the result comes while it works perfectly fine when I use two months instead of three (see below)
df.loc[:, (slice(None), ['April', 'May'])]
CPC Conversions
April May April May
idx
0 856 619 180 593
1 64 403 929 80
2 973 285 803 967
3 769 405 701 267
4 940 368 863 717
Can anyone tell me what's wrong, it seems buggy
to me.
Currently what I'm doing is creating a MultiIndex.from_product
(just as I used for creating this example) with ordered months and replace the existing column with it. However, I don't want to do that since the other column levels might change over time.
Solution
It's not buggy, it's just unexpected:
>>> df.columns.levels
FrozenList([['CPC', 'Conversions'], ['April', 'June', 'May']])
>>> df.columns.codes
FrozenList([[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
>>> df.columns.is_monotonic
True
We can see 'April' has the code 0, 'June' the code 1 and 'May' the code 2.
Now you slice your dataframe by second level:
# Your code (unordered codes)
>>> df.loc[:, (slice(None), ['April', 'May', 'June'])]
CPC Conversions CPC Conversions CPC Conversions
April April May May June June
...
but if your slice your dataframe according the ordered codes, it works as expected:
>>> df.loc[:, (slice(None), ['April', 'June', 'May'])]
CPC Conversions
April June May April June May
...
Check some properties:
>>> df.loc[:, (slice(None), ['April', 'May', 'June'])].columns.is_monotonic
False
>>> df.loc[:, (slice(None), ['April', 'June', 'May'])].columns.is_monotonic
True
>>> df.loc[:, (slice(None), ['April', 'May', 'June'])].columns.codes
FrozenList([[0, 1, 0, 1, 0, 1], [0, 0, 2, 2, 1, 1]])
>>> df.loc[:, (slice(None), ['April', 'June', 'May'])].columns.codes
FrozenList([[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
If you use sort_index
after your slice, the dataframe is sorted as its creation:
>>> df.loc[:, (slice(None), ['April', 'May', 'June'])].sort_index(axis=1)
CPC Conversions
April June May April June May
...
Sorting MultiIndex
is based on codes
.
Answered By - Corralien
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.