Monday, January 15, 2024

[FIXED] Pandas data frame use interpolate() partitioning with specific columns

January 15, 2024 dataframe, fillna, interpolation, pandas, python No comments

Issue

I have the following Pandas data frame (called df).

+--------+--------+------+--------+
| Person | Animal | Year | Number |
+--------+--------+------+--------+
| John   | Dogs   | 2000 | 2      |
| John   | Dogs   | 2001 | 2      |
| John   | Dogs   | 2002 | 2      |
| John   | Dogs   | 2003 | 2      |
| John   | Dogs   | 2004 | 2      |
| John   | Dogs   | 2005 | 2      |
| John   | Cats   | 2000 | 1      |
| John   | Cats   | 2001 | NaN    |
| John   | Cats   | 2002 | NaN    |
| John   | Cats   | 2003 | 4      |
| John   | Cats   | 2004 | 5      |
| John   | Cats   | 2005 | 6      |
| Peter  | Dogs   | 2000 | NaN    |
| Peter  | Dogs   | 2001 | 1      |
| Peter  | Dogs   | 2002 | NaN    |
| Peter  | Dogs   | 2003 | 5      |
| Peter  | Dogs   | 2004 | 5      |
| Peter  | Dogs   | 2005 | 5      |
| Peter  | Cats   | 2000 | NaN    |
| Peter  | Cats   | 2001 | 4      |
| Peter  | Cats   | 2002 | 4      |
| Peter  | Cats   | 2003 | 4      |
| Peter  | Cats   | 2004 | 4      |
| Peter  | Cats   | 2005 | 4      |
+--------+--------+------+--------+

My target is to get the following, which means using the interpolate method to fill the NaN values, but based on the other column value. In other words, it should

partition the df using the Person and Animal columns
order by Year (asc)
apply the interpolate method

+--------+--------+------+--------+
| Person | Animal | Year | Number |
+--------+--------+------+--------+
| John   | Dogs   | 2000 | 2      |
| John   | Dogs   | 2001 | 2      |
| John   | Dogs   | 2002 | 2      |
| John   | Dogs   | 2003 | 2      |
| John   | Dogs   | 2004 | 2      |
| John   | Dogs   | 2005 | 2      |
| John   | Cats   | 2000 | 1      |
| John   | Cats   | 2001 | 2      |
| John   | Cats   | 2002 | 3      |
| John   | Cats   | 2003 | 4      |
| John   | Cats   | 2004 | 5      |
| John   | Cats   | 2005 | 6      |
| Peter  | Dogs   | 2000 | NaN    |
| Peter  | Dogs   | 2001 | 1      |
| Peter  | Dogs   | 2002 | 3      |
| Peter  | Dogs   | 2003 | 5      |
| Peter  | Dogs   | 2004 | 5      |
| Peter  | Dogs   | 2005 | 5      |
| Peter  | Cats   | 2000 | NaN    |
| Peter  | Cats   | 2001 | 4      |
| Peter  | Cats   | 2002 | 4      |
| Peter  | Cats   | 2003 | 4      |
| Peter  | Cats   | 2004 | 4      |
| Peter  | Cats   | 2005 | 4      |
+--------+--------+------+--------+

What I have done

I can filter for each Person and each Animal and then apply the interpolate methods. Finally, merge all together, but this sounds dull and long if you have many columns.

Solution

You can try:

df['Number'] = (df.sort_values('Year', ascending=True)
                  .groupby(['Person', 'Animal'])['Number']
                  .transform(lambda x: x.interpolate()))
print(df)

# Output
   Person Animal  Year  Number
0    John   Dogs  2000     2.0
1    John   Dogs  2001     2.0
2    John   Dogs  2002     2.0
3    John   Dogs  2003     2.0
4    John   Dogs  2004     2.0
5    John   Dogs  2005     2.0
6    John   Cats  2000     1.0
7    John   Cats  2001     2.0  # interpolate
8    John   Cats  2002     3.0  # interpolate
9    John   Cats  2003     4.0
10   John   Cats  2004     5.0
11   John   Cats  2005     6.0
12  Peter   Dogs  2000     NaN
13  Peter   Dogs  2001     1.0
14  Peter   Dogs  2002     3.0
15  Peter   Dogs  2003     5.0
16  Peter   Dogs  2004     5.0
17  Peter   Dogs  2005     5.0
18  Peter   Cats  2000     NaN
19  Peter   Cats  2001     4.0
20  Peter   Cats  2002     4.0
21  Peter   Cats  2003     4.0
22  Peter   Cats  2004     4.0
23  Peter   Cats  2005     4.0

For multiple columns, just use the same operation:

cols = ['Number']  # list of columns
df[cols] = (df.sort_values('Year', ascending=True)
              .groupby(['Person', 'Animal'])[cols]
              .transform(lambda x: x.interpolate()))

Answered By - Corralien

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, January 15, 2024

[FIXED] Pandas data frame use interpolate() partitioning with specific columns

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels