Tuesday, October 19, 2021

[FIXED] Using pandas to fill gaps only, and not NaNs on the ends

October 19, 2021 numpy, pandas, python No comments

Issue

I have some housing price data that spans about 8 months, and tracks the price as houses come onto the market up until they are sold. There are a couple gaps in the data in the middle that I'd like to fill in, but I'd like to leave the NaNs on the end of each untouched.

To use a simple example, let's say we have house1 that comes on the market for 200000 on 'Day 4', and sells for 190000 on 'Day 9'. And we have house2 that stays at 180000 for Days 1 - 12 and doesn't sell in that time window. But, something went wrong on days 6 and 7 and I lost the data:

house1 = [NaN, NaN, NaN, 200000, 200000, NaN, NaN, 200000, 190000, NaN, NaN, NaN]
house2 = [180000, 180000, 180000, 180000, 180000, NaN, NaN, 180000, 180000, 180000, 180000, 180000]

Now imagine instead of regular arrays these were columns in Pandas Dataframes indexed by date.

The trouble is, the function I would normally use to fill the gaps here would be DataFrame.fillna() using either the backfill or ffill methods. If I use ffill, house1 returns this:

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, 190000]

Which fills the gap, but also incorrectly fills the data past the day of sale. If I use backfill instead, I get this:

house1 = [200000, 200000, 200000, 200000, 200000, 200000, 200000, 200000, 190000, NaN, NaN, NaN]

Again, it fills the gap, but this time it also fills the front end of the data. If I use 'limit=2' with ffill, then what I get is:

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, NaN]

Once again, it fills the gap, but then it also begins to fill the data beyond the end of where the 'real' data ends.

My solution so far was to write the following function:

def fillGaps(houseDF):
    """Fills up holes in the housing data"""

    def fillColumns(column):
        filled_col = column
        lastValue = None
        # Keeps track of if we are dealing with a gap in numbers
        gap = False
        i = 0
        for currentValue in filled_col:
            # Loops over all the nans before the numbers begin
            if not isANumber(currentValue) and lastValue is None:
                pass
            # Keeps track of the last number we encountered before a gap
            elif isANumber(currentValue) and (gap is False):
                lastIndex = i
                lastValue = currentValue
            # Notes when we encounter a gap in numbers
            elif not isANumber(currentValue):
                gap = True
            # Fills in the gap
            elif isANumber(currentValue):
                gapIndicies = range(lastIndex + 1, i)
                for j in gapIndicies:
                    filled_col[j] = lastValue
                gap = False
            i += 1
        return filled_col

    filled_df = houseDF.apply(fillColumns, axis=0)
    return filled_df

It simply skips all the NaNs in front, fills in the gaps (defined by groups of NaNs between real values), and doesn't fill in NaNs on the end.

Is there a cleaner way to do this, or a built-in pandas function I'm unaware of?

Solution

You can use fillna on certain parts of the Series. Based on your description, fillna should only fill up the NaNs after the first non-NaN, and before the last non-NaN:

import numpy as np
import pandas as pd


def fill_column(house):
    house = house.copy()
    non_nans = house[~house.apply(np.isnan)]
    start, end = non_nans.index[0], non_nans.index[-1]
    house.ix[start:end] = house.ix[start:end].fillna(method='ffill')
    return house


house1 = pd.Series([np.nan, np.nan, np.nan, 200000, 200000, np.nan, np.nan, 200000, 190000, np.nan, np.nan, np.nan])
print fill_column(house1)

Output:

0        NaN
1        NaN
2        NaN
3     200000
4     200000
5     200000
6     200000
7     200000
8     190000
9        NaN
10       NaN
11       NaN

Note that this assumes that the Series contains at least two non-NaNs, corresponding to the prices on the first and last day.

Answered By - YS-L

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, October 19, 2021

[FIXED] Using pandas to fill gaps only, and not NaNs on the ends

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels