Issue
I have a DataFrame which looks as follows:
id time activity
4 1596213715048 [{"name":"STILL","conf":100}]
4 1596213739171 [{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]
4 1596213755797 [{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]
6 1596214842817 [{"name":"STILL","conf":100}]
6 1596214931090 [{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]
8 1596214957246 [{"name":"STILL","conf":100}]
9 1596215304418 [{"name":"STILL","conf":100}]
I would like to split the activity column according to name. The resulting DataFrame should look like:
id time IN_VEHICLE ON_BICYCLE ON_FOOT WALKING RUNNING TILTING STILL UNKNOWN
4 1596213715048 0 0 0 0 0 0 100 0
4 1596213739171 8 9 19 19 0 0 54 3
4 1596213755797 1 0 0 0 0 0 97 2
6 1596214842817 0 0 0 0 0 0 100 0
6 1596214931090 28 8 15 15 0 0 34 3
8 1596214957246 0 0 0 0 0 0 100 0
9 1596215304418 0 0 0 0 0 0 100 0
How can this split be done? The resulting columns are fixed, but if an entry in the activity string does not exist as a column in the resulting DataFrame, an error should be thrown.
Solution
- This answer is about 8x faster than the other solution on the test dataframe from the timing section (rows = 1000000): 31.8 s vs. 4min 28s.
- The other implementation works, but uses .apply twice and a dict comprehension, which are slow compared to vectorized methods.
Explanation
- .apply(literal_eval) converts the 'activity' column from strings to a python literal (e.g. lists of dicts; '[{"name":"STILL","conf":100}]' → [{"name":"STILL","conf":100}]).
- .explode separates the dicts in each list into separate rows.
- Extract the keys and values in the 'activity' column into separate columns and then .join the columns back to df.
  - The timing analysis of this answer shows the fastest way to extract a column of single level dicts to a dataframe is with pd.DataFrame(df.pop('activity').values.tolist())
- .pivot the df into a wide format.
- Change dfp.columns.name from 'name' to None.
  - This is cosmetic, and can be removed.
- This was performed in pandas 1.2.0
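As a small illustration of the parse, explode and expand steps, the sketch below runs them on a single sample value from the question. The variable names (s, parsed, small) are only illustrative.

```python
from ast import literal_eval

import pandas as pd

# one 'activity' value from the question, still a string
s = '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]'

# literal_eval turns the string into a list of dicts
parsed = literal_eval(s)
print(type(parsed), len(parsed))

# a one-row frame; explode gives each dict its own row
small = pd.DataFrame({'id': [4], 'activity': [parsed]}).explode('activity', ignore_index=True)

# building a DataFrame from the list of dicts yields 'name' and 'conf' columns
small = small.join(pd.DataFrame(small.pop('activity').values.tolist()))
print(small)
```

After these steps each row holds one (name, conf) pair, which is the long format that .pivot needs.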
import pandas as pd
from ast import literal_eval
# test data
data = {'id': [4, 4, 4, 6, 6, 8, 9], 'time': [1596213715048, 1596213739171, 1596213755797, 1596214842817, 1596214931090, 1596214957246, 1596215304418], 'activity': ['[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":100}]']}
df = pd.DataFrame(data)
# function to transform column of strings
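def test(df):
    # parse each string into a list of dicts (note: this assigns back to the caller's df)
    df.activity = df.activity.apply(literal_eval)
    # one row per dict
    df = df.explode('activity', ignore_index=True)
    # expand the dicts into 'name' and 'conf' columns
    df = df.join(pd.DataFrame(df.pop('activity').values.tolist()))
    # pivot to wide format; activities missing from a row become 0
    dfp = df.pivot(index=['id', 'time'], columns='name', values='conf').fillna(0).astype(int).reset_index()
    # cosmetic: remove the 'name' label from the columns index
    dfp.columns.rename(None, inplace=True)
    return dfp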
# call the function
test(df)
# result
id time IN_VEHICLE ON_BICYCLE ON_FOOT STILL UNKNOWN WALKING
0 4 1596213715048 0 0 0 100 0 0
1 4 1596213739171 8 9 19 54 3 19
2 4 1596213755797 1 0 0 97 2 0
3 6 1596214842817 0 0 0 100 0 0
4 6 1596214931090 28 8 15 34 3 15
5 8 1596214957246 0 0 0 100 0 0
6 9 1596215304418 0 0 0 100 0 0
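The question also asks for a fixed set of columns and an error when an unexpected activity name appears, and the result above only contains the names present in the data. A minimal sketch of one way to handle this, assuming the eight column names from the question's expected output; EXPECTED, enforce_columns and wide are hypothetical names, not part of the original answer.

```python
import pandas as pd

# activity columns taken from the question's expected output (an assumption)
EXPECTED = ['IN_VEHICLE', 'ON_BICYCLE', 'ON_FOOT', 'WALKING',
            'RUNNING', 'TILTING', 'STILL', 'UNKNOWN']


def enforce_columns(dfp, key_cols=('id', 'time'), expected=EXPECTED):
    """Raise on unexpected activity names, then add missing columns as 0."""
    key_cols = list(key_cols)
    unexpected = [c for c in dfp.columns if c not in key_cols and c not in expected]
    if unexpected:
        raise ValueError(f'unexpected activity names: {unexpected}')
    # reindex puts the columns in the fixed order and fills absent ones with 0
    return dfp.reindex(columns=key_cols + list(expected), fill_value=0)


# stand-in for one row of the pivoted result
wide = pd.DataFrame({'id': [4], 'time': [1596213755797],
                     'IN_VEHICLE': [1], 'STILL': [97], 'UNKNOWN': [2]})
print(enforce_columns(wide))
```

Under these assumptions it could be applied to the pivoted frame as enforce_columns(test(df)).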
%%timeit testing
import numpy as np
import random
import pandas as pd
import json
from ast import literal_eval
# test data with 1000000 rows
np.random.seed(365)
random.seed(365)
rows = 1000000
activity = ['[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":54},{"name":"ON_FOOT","conf":19},{"name":"WALKING","conf":19},{"name":"ON_BICYCLE","conf":9},{"name":"IN_VEHICLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":34},{"name":"IN_VEHICLE","conf":28},{"name":"ON_FOOT","conf":15},{"name":"WALKING","conf":15},{"name":"ON_BICYCLE","conf":8},{"name":"UNKNOWN","conf":3}]', '[{"name":"STILL","conf":100}]', '[{"name":"STILL","conf":100}]']
data = {'time': pd.bdate_range('2021-01-15', freq='s', periods=rows),
        'id': np.random.randint(10, size=(rows)),
        'activity': [random.choice(activity) for _ in range(rows)]}
df = pd.DataFrame(data)
# test the function in this answer
%%timeit -r1 -n1 -q -o
test(df)
[out]:
<TimeitResult : 31.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
# test the implementation from the other answer
def flatten_json_to_dict(s):
    return {obj['name']: obj['conf'] for obj in json.loads(s)}

def nick(df):
    expanded = df['activity'].apply(flatten_json_to_dict).apply(pd.Series)
    df = df.join(expanded)
    df = df.drop('activity', axis=1)
    df = df.fillna(0)
    return df
%%timeit -r1 -n1 -q -o
nick(df)
[out]:
<TimeitResult : 4min 28s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
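%%timeit is an IPython cell magic, so the cells above only run in a notebook. Outside one, a rough comparison could be made with time.perf_counter, as sketched below. The functions explode_and_pivot and apply_series mirror test and nick from this post; those names, time_once, sample and the small row count are assumptions chosen so the sketch runs quickly, and absolute timings will differ by machine and pandas version.

```python
import json
import random
import time
from ast import literal_eval

import numpy as np
import pandas as pd


def explode_and_pivot(df):
    # approach from this answer; works on a copy so the input stays untouched
    df = df.copy()
    df['activity'] = df['activity'].apply(literal_eval)
    df = df.explode('activity', ignore_index=True)
    df = df.join(pd.DataFrame(df.pop('activity').values.tolist()))
    dfp = df.pivot(index=['id', 'time'], columns='name', values='conf').fillna(0).astype(int).reset_index()
    dfp.columns.name = None
    return dfp


def apply_series(df):
    # approach from the other answer
    expanded = df['activity'].apply(
        lambda s: {obj['name']: obj['conf'] for obj in json.loads(s)}).apply(pd.Series)
    return df.join(expanded).drop('activity', axis=1).fillna(0)


def time_once(func, df):
    # wall-clock time of a single call, in seconds
    start = time.perf_counter()
    func(df)
    return time.perf_counter() - start


random.seed(365)
np.random.seed(365)
rows = 2000  # kept small on purpose
activity = ['[{"name":"STILL","conf":100}]',
            '[{"name":"STILL","conf":97},{"name":"UNKNOWN","conf":2},{"name":"IN_VEHICLE","conf":1}]']
sample = pd.DataFrame({'time': pd.date_range('2021-01-15', freq='s', periods=rows),
                       'id': np.random.randint(10, size=rows),
                       'activity': [random.choice(activity) for _ in range(rows)]})

for func in (explode_and_pivot, apply_series):
    print(func.__name__, f'{time_once(func, sample):.3f} s')
```

A single run is noisy, so for anything beyond a rough check the call can be repeated and the best or mean time taken.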
Answered By - Trenton McKinney