Issue
I am grabbing data from a MongoDB database and converting it into a Pandas dataframe for additional operations to be done later. The MongoDB database contains a bunch of time-based entries and due to how they are stored, each sample for each channel is its own document. Some of these channels always sample at the same time while others are on different schedules. Below is a quick example of a document.
timestamp: 2024-01-05T08:16:30.848+00:00
metaData:
    deviceId: "123"
    channelName: "Channel1"
_id: 659c23016ad87924ff552882
Channel1: 10345
So when I try to grab a few channels from the database using something like
b = pd.DataFrame(list(
    timeCol.find(
        {'metaData.deviceId': '123',
         'metaData.channelName': {'$in': ['Channel1', 'Channel2', 'Channel3', 'Channel4', 'Channel5']}},
        {'_id': 0, 'metaData': 0}
    ).sort('timestamp')
))
I get a dataframe that looks something like the one below:
timestamp Channel1 Channel2 Channel3 Channel4 Channel5
0 2024-01-05 20:27:31.340 0.0 NaN NaN NaN NaN
1 2024-01-05 20:27:31.382 1.0 NaN NaN NaN NaN
2 2024-01-05 20:27:31.400 NaN 2456 NaN NaN NaN
3 2024-01-05 20:27:31.400 NaN NaN 10.231 NaN NaN
4 2024-01-05 20:27:31.400 NaN NaN NaN 2.4 NaN
But it has many more entries, because I'm usually interested in a timespan of a few hours. Anyway, as you can see, Channels 2-5 typically share a timestamp, but Channel1 samples at a higher rate.
Is there any way I can set the timestamp column as the index and have Pandas collapse duplicate timestamps into single rows, correctly filling in the other columns?
I know I could probably do this by creating a series for each channel and then merging/joining them, but that would require a separate DB call per channel, and I'd prefer to limit DB calls for speed and efficiency. I could request some changes to the DB, but this is how the data is broadcast (separate messages for each channel/device), and nothing guarantees channels will share timestamps, although that appears to happen more often than not for certain channels. There are also additional channels broadcast at a much higher rate that I need to work into my analysis, but I plan to query those separately and add them in later.
Thanks!
Solution
If you want to merge the common timestamps, use groupby.first:
out = df.groupby('timestamp').first()
Output:
Channel1 Channel2 Channel3 Channel4 Channel5
timestamp
2024-01-05 20:27:31.340 0.0 NaN NaN NaN NaN
2024-01-05 20:27:31.382 1.0 NaN NaN NaN NaN
2024-01-05 20:27:31.400 NaN 2456.0 10.231 2.4 NaN
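For reference, here is a minimal runnable sketch of that approach; the frame is rebuilt by hand to mimic the example above (Channel5, which is all-NaN in the sample, is omitted for brevity):

```python
import pandas as pd

# Mimic the query result: Channels 2-4 share the .400 timestamp,
# while Channel1 samples on its own schedule.
df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2024-01-05 20:27:31.340', '2024-01-05 20:27:31.382',
        '2024-01-05 20:27:31.400', '2024-01-05 20:27:31.400',
        '2024-01-05 20:27:31.400',
    ]),
    'Channel1': [0.0, 1.0, None, None, None],
    'Channel2': [None, None, 2456.0, None, None],
    'Channel3': [None, None, None, 10.231, None],
    'Channel4': [None, None, None, None, 2.4],
})

# first() keeps the first non-NaN value per column within each timestamp
# group, so the three rows sharing 20:27:31.400 collapse into one.
out = df.groupby('timestamp').first()
print(out)
```

This is a single pass over the frame, so it works on the result of one DB call with no per-channel querying.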
If you want to aggregate at a specific frequency (e.g. 100ms), use a resample aggregation (for example with mean here):
df['timestamp'] = pd.to_datetime(df['timestamp'])
out = df.set_index('timestamp').resample('100ms').mean()
Output:
Channel1 Channel2 Channel3 Channel4 Channel5
timestamp
2024-01-05 20:27:31.300 0.5 NaN NaN NaN NaN
2024-01-05 20:27:31.400 NaN 2456.0 10.231 2.4 NaN
Variant with the first timestamp as origin:
df['timestamp'] = pd.to_datetime(df['timestamp'])
out = df.set_index('timestamp').resample('100ms', origin='start').mean()
Output:
Channel1 Channel2 Channel3 Channel4 Channel5
timestamp
2024-01-05 20:27:31.340       0.5    2456.0    10.231       2.4       NaN
Answered By - mozway