Issue
I want to interpolate (linear interpolation) my data, but there are no NA values to fill because the missing rows are absent entirely.
Here is my data, with many missing rows:
timestamp | id | strength |
---|---|---|
1383260400000 | 1 | -0.3803901328171995 |
1383261000000 | 1 | -0.42196042219455937 |
1383265200000 | 1 | -0.460714706261982 |
My expected output:
timestamp | id | strength |
---|---|---|
1383260400000 | 1 | -0.3803901328171995 |
1383261000000 | 1 | -0.42196042219455937 |
1383261600000 | 1 | Linear interpolated data |
1383262200000 | 1 | Linear interpolated data |
1383262800000 | 1 | Linear interpolated data |
1383263400000 | 1 | Linear interpolated data |
1383264000000 | 1 | Linear interpolated data |
1383264600000 | 1 | Linear interpolated data |
1383265200000 | 1 | -0.460714706261982 |
The timestamps start at 1383260400000 and end at 1383343800000, and the other ids (the id column runs from 1 to 2025) have the same issue.
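For reference, consecutive expected timestamps differ by 1383261000000 − 1383260400000 = 600,000 ms, i.e. 10 minutes, so the data should sit on a 10-minute grid. A quick sketch (the variable name is only illustrative) counting how many rows a full grid over the stated range contains:

import pandas as pd

# full 10-minute grid from the first to the last timestamp in the question
full_range = pd.date_range(pd.to_datetime(1383260400000, unit='ms'),
                           pd.to_datetime(1383343800000, unit='ms'),
                           freq='10Min')
print(len(full_range))   # 140 timestamps per id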
Solution
The idea is to convert the timestamps to datetimes, set them as a DatetimeIndex, and then, per id, add the missing datetimes with Series.asfreq and fill them with interpolate inside a lambda function:
import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
# per id: add the missing 10-minute rows with asfreq, then fill them with interpolate
f = lambda x: x.asfreq('10Min').interpolate()
df = df.set_index('timestamp').groupby('id')['strength'].apply(f).reset_index()
print (df)
id timestamp strength
0 1 2013-10-31 23:00:00 -0.380390
1 1 2013-10-31 23:10:00 -0.421960
2 1 2013-10-31 23:20:00 -0.427497
3 1 2013-10-31 23:30:00 -0.433033
4 1 2013-10-31 23:40:00 -0.438569
5 1 2013-10-31 23:50:00 -0.444106
6 1 2013-11-01 00:00:00 -0.449642
7 1 2013-11-01 00:10:00 -0.455178
8 1 2013-11-01 00:20:00 -0.460715
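To see what asfreq contributes before interpolate runs, here is a small sketch using only the three rows from the question (sample is an illustrative name): asfreq inserts the missing 10-minute rows as NaN, and interpolate then fills them linearly.

import pandas as pd

sample = pd.DataFrame({'timestamp': [1383260400000, 1383261000000, 1383265200000],
                       'id': [1, 1, 1],
                       'strength': [-0.3803901328171995, -0.42196042219455937, -0.460714706261982]})
sample['timestamp'] = pd.to_datetime(sample['timestamp'], unit='ms')
s = sample.set_index('timestamp')['strength']

print(s.asfreq('10Min'))                 # missing rows appear as NaN
print(s.asfreq('10Min').interpolate())   # NaN rows filled linearly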
Finally, if you need the original timestamp format (milliseconds since the epoch):

import numpy as np

# datetimes are stored as nanoseconds, so integer division by 10**6 gives milliseconds back
df['timestamp'] = df['timestamp'].astype(np.int64) // 1000000
print (df)
id timestamp strength
0 1 1383260400000 -0.380390
1 1 1383261000000 -0.421960
2 1 1383261600000 -0.427497
3 1 1383262200000 -0.433033
4 1 1383262800000 -0.438569
5 1 1383263400000 -0.444106
6 1 1383264000000 -0.449642
7 1 1383264600000 -0.455178
8 1 1383265200000 -0.460715
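The integer division works because pandas stores datetimes as nanoseconds since the epoch, so // 1000000 converts nanoseconds back to milliseconds. A quick check on the first value (a sketch, not part of the original answer):

import pandas as pd

ts = pd.to_datetime(1383260400000, unit='ms')
print(ts.value)              # 1383260400000000000 (nanoseconds since the epoch)
print(ts.value // 1000000)   # 1383260400000 (milliseconds again)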
EDIT:
# data from the question
df = pd.DataFrame({'timestamp': [1383260400000, 1383261000000, 1383265200000],
                   'id': [1, 1, 1],
                   'strength': [-0.3803901328171995, -0.42196042219455937, -0.460714706261982]})
print (df)
timestamp id strength
0 1383260400000 1 -0.380390
1 1383261000000 1 -0.421960
2 1383265200000 1 -0.460715
The solution creates all datetimes with date_range, adds the missing rows for each id with DataFrame.reindex and a MultiIndex, and finally interpolates per id:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')

# full 10-minute range covering the whole period
r = pd.date_range(pd.to_datetime(1383260400000, unit='ms'),
                  pd.to_datetime(1383343800000, unit='ms'),
                  freq='10Min')

# cartesian product of all datetimes and all ids
ids = df['id'].unique()
mux = pd.MultiIndex.from_product([r, ids], names=['timestamp', 'id'])

# reindex adds the missing rows as NaN, interpolate then fills them per id
f = lambda x: x.interpolate()
df = (df.set_index(['timestamp', 'id'])
        .reindex(mux)
        .groupby('id')['strength']
        .transform(f)
        .reset_index())
print (df)
timestamp id strength
0 2013-10-31 23:00:00 1 -0.380390
1 2013-10-31 23:10:00 1 -0.421960
2 2013-10-31 23:20:00 1 -0.427497
3 2013-10-31 23:30:00 1 -0.433033
4 2013-10-31 23:40:00 1 -0.438569
.. ... .. ...
135 2013-11-01 21:30:00 1 -0.460715
136 2013-11-01 21:40:00 1 -0.460715
137 2013-11-01 21:50:00 1 -0.460715
138 2013-11-01 22:00:00 1 -0.460715
139 2013-11-01 22:10:00 1 -0.460715
[140 rows x 3 columns]
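The same MultiIndex approach handles all ids at once. Below is a minimal sketch with a hypothetical second id and made-up strength values, only to illustrate that every id gets its own 140-row grid and is interpolated independently:

import pandas as pd

sample = pd.DataFrame({'timestamp': [1383260400000, 1383261000000, 1383265200000,
                                     1383260400000, 1383265200000],
                       'id': [1, 1, 1, 2, 2],
                       'strength': [-0.38, -0.42, -0.46, 0.10, 0.80]})
sample['timestamp'] = pd.to_datetime(sample['timestamp'], unit='ms')

r = pd.date_range(pd.to_datetime(1383260400000, unit='ms'),
                  pd.to_datetime(1383343800000, unit='ms'),
                  freq='10Min')
mux = pd.MultiIndex.from_product([r, sample['id'].unique()],
                                 names=['timestamp', 'id'])

out = (sample.set_index(['timestamp', 'id'])
             .reindex(mux)
             .groupby('id')['strength']
             .transform(lambda x: x.interpolate())
             .reset_index())
print(out.groupby('id').size())   # 140 rows for each id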
Answered By - jezrael