Wednesday, June 15, 2022

[FIXED] pandas: divide a dataframe based on a condition in certain columns and rows

Tags: dataframe, divide, pandas, python, rows

Issue

I have a dataframe (much larger than this example) as follows, where each distinct pair of values in the first two columns is repeated across 5 consecutive rows.

import pandas as pd
df = pd.DataFrame({'text':['the weather is nice','the weather is nice','the weather is nice','the weather is nice','the weather is nice',
                        'the house is beautiful','the house is beautiful','the house is beautiful','the house is beautiful','the house is beautiful',
                        'the day is long','the day is long','the day is long','the day is long','the day is long'],
               'reference':['weather','weather','weather','weather','weather',
                            'house','house','house','house','house',
                            'day','day','day','day','day'],
               'id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})

I would like to split this pandas dataframe into two dataframes so that, for each block of 5 repeated rows, the first two consecutive rows go into the first dataframe and the remaining three go into the second, as follows.

The desired output:

first df:

                      text reference  id
0      the weather is nice   weather   1
1      the weather is nice   weather   2
2   the house is beautiful     house   6
3   the house is beautiful     house   7
4         the day is long       day  11
5         the day is long       day  12

second df:
                      text reference  id
0      the weather is nice   weather   3
1      the weather is nice   weather   4
2      the weather is nice   weather   5
3   the house is beautiful     house   8
4   the house is beautiful     house   9
5   the house is beautiful     house  10
6         the day is long       day  13
7         the day is long       day  14
8         the day is long       day  15

Obviously, selecting every n-th row does not work (e.g. df.iloc[::3, :] or df[df.index % 3 == 0]), so I would like to know how to produce the output shown above.
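To make the problem with stride-based selection concrete, here is a minimal sketch on a reduced frame (the single `id` column and the name `strided` are illustrative assumptions, not part of the original question). A step of 3 picks isolated rows across the whole frame instead of the first two rows of each block of 5:

```python
import pandas as pd

# Reduced stand-in for the question's data: 15 rows, ids 1..15.
df = pd.DataFrame({'id': list(range(1, 16))})

# Takes every third row, ignoring the blocks of 5 entirely.
strided = df.iloc[::3]

print(strided['id'].tolist())  # [1, 4, 7, 10, 13]
```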

Solution

If you want to group on the value of reference (first 2 items vs rest):

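# cumcount() numbers the rows within each reference group starting at 0,
# so gt(1) is True from the third row of each group onward
mask = df.groupby('reference').cumcount().gt(1)
# groups[0] holds the first two rows per group (mask False), groups[1] the rest
groups = [g for k, g in df.groupby(mask)]

# or manually
# df1 = df[~mask]
# df2 = df[mask]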
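As a self-contained sketch of the grouping approach, the snippet below rebuilds a reduced version of the question's data and splits it. The shortened frame, the names `first_two` and `rest`, and the `reset_index(drop=True)` call (used here to renumber the rows as in the desired output) are assumptions added for illustration:

```python
import pandas as pd

# Reduced stand-in for the question's data: each reference repeated 5 times.
df = pd.DataFrame({
    'reference': ['weather'] * 5 + ['house'] * 5 + ['day'] * 5,
    'id': list(range(1, 16)),
})

# cumcount() gives each row its position within its reference group (0..4);
# gt(1) marks positions 2, 3 and 4, i.e. everything after the first two rows.
mask = df.groupby('reference').cumcount().gt(1)

first_two = df[~mask].reset_index(drop=True)  # first 2 rows of each group
rest = df[mask].reset_index(drop=True)        # remaining 3 rows of each group

print(first_two['id'].tolist())  # [1, 2, 6, 7, 11, 12]
print(rest['id'].tolist())       # [3, 4, 5, 8, 9, 10, 13, 14, 15]
```

Because the mask is computed per group, this variant should also cope with groups of unequal size, as long as the rows of each group are meant to be split as "first two vs. rest".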

Using position:

import numpy as np

# True from the third row of every block of 5 rows onward
mask = (np.arange(len(df)) % 5) > 1

# or with a range index
# mask = (df.index % 5) > 1

# then same as above using groupby or slicing
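The position-based variant can be sketched as follows. It assumes every block has exactly 5 rows and that the first 2 of each block belong in the first frame; the names `block_size`, `head`, `first_two` and `rest`, and the letter index, are hypothetical and only serve to show that the row labels play no role:

```python
import numpy as np
import pandas as pd

# Reduced stand-in with a non-numeric index to show that only position matters.
df = pd.DataFrame({'id': list(range(1, 16))}, index=list('abcdefghijklmno'))

block_size = 5  # assumed: every block has exactly 5 rows
head = 2        # assumed: the first 2 rows of each block go to the first frame

# Position of each row inside its block: 0, 1, 2, 3, 4, 0, 1, ...
pos_in_block = np.arange(len(df)) % block_size
mask = pos_in_block >= head

first_two = df[~mask]
rest = df[mask]

print(first_two['id'].tolist())  # [1, 2, 6, 7, 11, 12]
print(rest['id'].tolist())       # [3, 4, 5, 8, 9, 10, 13, 14, 15]
```

Unlike the grouping approach, this relies purely on row order, so it only gives the intended split when the blocks are contiguous and all the same length.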

Answered By - mozway

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
