Issue
I have some housing price data that spans about 8 months, and tracks the price as houses come onto the market up until they are sold. There are a couple gaps in the data in the middle that I'd like to fill in, but I'd like to leave the NaNs on the end of each untouched.
To use a simple example, let's say we have house1 that comes on the market for 200000 on 'Day 4', and sells for 190000 on 'Day 9'. And we have house2 that stays at 180000 for Days 1 - 12 and doesn't sell in that time window. But, something went wrong on days 6 and 7 and I lost the data:
house1 = [NaN, NaN, NaN, 200000, 200000, NaN, NaN, 200000, 190000, NaN, NaN, NaN]
house2 = [180000, 180000, 180000, 180000, 180000, NaN, NaN, 180000, 180000, 180000, 180000, 180000]
Now imagine instead of regular arrays these were columns in Pandas Dataframes indexed by date.
The trouble is, the function I would normally use to fill the gaps here would be DataFrame.fillna() using either the backfill or ffill methods. If I use ffill, house1 returns this:
house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, 190000]
Which fills the gap, but also incorrectly fills the data past the day of sale. If I use backfill instead, I get this:
house1 = [200000, 200000, 200000, 200000, 200000, 200000, 200000, 200000, 190000, NaN, NaN, NaN]
Again, it fills the gap, but this time it also fills the front end of the data. If I use 'limit=2' with ffill, then what I get is:
house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, NaN]
Once again, it fills the gap, but then it also begins to fill the data beyond the end of where the 'real' data ends.
My solution so far was to write the following function:
def fillGaps(houseDF):
"""Fills up holes in the housing data"""
def fillColumns(column):
filled_col = column
lastValue = None
# Keeps track of if we are dealing with a gap in numbers
gap = False
i = 0
for currentValue in filled_col:
# Loops over all the nans before the numbers begin
if not isANumber(currentValue) and lastValue is None:
pass
# Keeps track of the last number we encountered before a gap
elif isANumber(currentValue) and (gap is False):
lastIndex = i
lastValue = currentValue
# Notes when we encounter a gap in numbers
elif not isANumber(currentValue):
gap = True
# Fills in the gap
elif isANumber(currentValue):
gapIndicies = range(lastIndex + 1, i)
for j in gapIndicies:
filled_col[j] = lastValue
gap = False
i += 1
return filled_col
filled_df = houseDF.apply(fillColumns, axis=0)
return filled_df
It simply skips all the NaNs in front, fills in the gaps (defined by groups of NaNs between real values), and doesn't fill in NaNs on the end.
Is there a cleaner way to do this, or a built-in pandas function I'm unaware of?
Solution
You can use fillna
on certain parts of the Series. Based on your description, fillna
should only fill up the NaNs after the first non-NaN, and before the last non-NaN:
import numpy as np
import pandas as pd
def fill_column(house):
house = house.copy()
non_nans = house[~house.apply(np.isnan)]
start, end = non_nans.index[0], non_nans.index[-1]
house.ix[start:end] = house.ix[start:end].fillna(method='ffill')
return house
house1 = pd.Series([np.nan, np.nan, np.nan, 200000, 200000, np.nan, np.nan, 200000, 190000, np.nan, np.nan, np.nan])
print fill_column(house1)
Output:
0 NaN
1 NaN
2 NaN
3 200000
4 200000
5 200000
6 200000
7 200000
8 190000
9 NaN
10 NaN
11 NaN
Note that this assumes that the Series contains at least two non-NaNs, corresponding to the prices on the first and last day.
Answered By - YS-L
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.