Issue
I am grabbing data from a MongoDB database and converting it into a Pandas dataframe for additional operations to be done later. The MongoDB database contains a bunch of time-based entries and due to how they are stored, each sample for each channel is its own document. Some of these channels always sample at the same time while others are on different schedules. Below is a quick example of a document.
timestamp: 2024-01-05T08:16:30.848+00:00
metaData:
    deviceId: "123"
    channelName: "Channel1"
_id: 659c23016ad87924ff552882
Channel1: 10345
So when I try to grab a few channels from the database using something like
b = pd.DataFrame(list(
    timeCol.find(
        {'metaData.deviceId': '123',
         'metaData.channelName': {'$in': ['Channel1', 'Channel2', 'Channel3', 'Channel4', 'Channel5']}},
        {'_id': 0, 'metaData': 0}
    ).sort('timestamp')
))
I get a dataframe that looks something like the one below:
timestamp Channel1 Channel2 Channel3 Channel4 Channel5
0 2024-01-05 20:27:31.340 0.0 NaN NaN NaN NaN
1 2024-01-05 20:27:31.382 1.0 NaN NaN NaN NaN
2 2024-01-05 20:27:31.400 NaN 2456 NaN NaN NaN
3 2024-01-05 20:27:31.400 NaN NaN 10.231 NaN NaN
4 2024-01-05 20:27:31.400 NaN NaN NaN 2.4 NaN
But it has many more entries, because I'm usually interested in a timespan of a few hours. Anyway, as you can see, Channels 2-5 typically share a timestamp, but Channel1 samples at a higher rate.
Is there any way I can set the timestamp column as the index and have Pandas collapse duplicate timestamps into single rows, correctly filling in the other columns?
I know I could probably do this by creating a series for each channel and then merging/joining them, but that would require a separate DB call per channel, and I'd prefer to limit DB calls for speed and efficiency. I could request some changes to the DB, but this is how the data is broadcast (separate messages for each channel/device), and nothing guarantees channels will share timestamps, although that appears to happen more often than not for certain channels. There are also additional channels broadcast at a much higher rate that I need to work into my analysis, but I plan to query those separately and add them in later.
Thanks!
Solution
If you want to merge the common timestamps, use groupby.first:
out = df.groupby('timestamp').first()
Output:
Channel1 Channel2 Channel3 Channel4 Channel5
timestamp
2024-01-05 20:27:31.340 0.0 NaN NaN NaN NaN
2024-01-05 20:27:31.382 1.0 NaN NaN NaN NaN
2024-01-05 20:27:31.400 NaN 2456.0 10.231 2.4 NaN
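For reference, here is a minimal runnable sketch of that approach; the frame is rebuilt by hand to mimic the example above (Channel5, which is all-NaN in the sample, is omitted for brevity):

```python
import pandas as pd

# Mimic the query result: Channels 2-4 share the .400 timestamp,
# while Channel1 samples on its own schedule.
df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2024-01-05 20:27:31.340', '2024-01-05 20:27:31.382',
        '2024-01-05 20:27:31.400', '2024-01-05 20:27:31.400',
        '2024-01-05 20:27:31.400',
    ]),
    'Channel1': [0.0, 1.0, None, None, None],
    'Channel2': [None, None, 2456.0, None, None],
    'Channel3': [None, None, None, 10.231, None],
    'Channel4': [None, None, None, None, 2.4],
})

# first() keeps the first non-NaN value per column within each timestamp
# group, so the three rows sharing 20:27:31.400 collapse into one.
out = df.groupby('timestamp').first()
print(out)
```

This is a single pass over the frame, so it works on the result of one DB call with no per-channel querying.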
If you want to aggregate at a specific frequency (e.g. 100ms), use a resample aggregation (for example with mean here):
df['timestamp'] = pd.to_datetime(df['timestamp'])
out = df.set_index('timestamp').resample('100ms').mean()
Output:
Channel1 Channel2 Channel3 Channel4 Channel5
timestamp
2024-01-05 20:27:31.300 0.5 NaN NaN NaN NaN
2024-01-05 20:27:31.400 NaN 2456.0 10.231 2.4 NaN
Variant with the first timestamp as origin:
df['timestamp'] = pd.to_datetime(df['timestamp'])
out = df.set_index('timestamp').resample('100ms', origin='start').mean()
Output:
Channel1 Channel2 Channel3 Channel4 Channel5
timestamp
2024-01-05 20:27:31.340       0.5    2456.0    10.231       2.4       NaN
Answered By - mozway