Issue
Background: I have logs which are generated during the testing of the devices after manufacture. Each device has a serial number and a corresponding csv log file with all the data. Something like this.
DATE,TESTSTEP,READING,LIMIT,RESULT
01/01/2019 07:37:17.432 AM,1,23,10,FAIL
01/01/2019 07:37:23.661 AM,2,3,3,PASS
So there are many such log files. Each with the test data. I have the the serial number of devices which failed in field. I want to create a model using these log files. And then use it to predict if the given device has a chance of failing in field given its log file.
Till now as a part of learning, I have worked with data like housing price. Every row was complete. Depending on area, number of rooms etc, it was easy to define a model for expected selling price.
Here I wish to find a way to somehow flatten all the logs into a single row. I am thinking of having something like:
DATE_1,TESTSTEP_1,READING_1,LIMIT_1,RESULT_1,DATE_2,TESTSTEP_2,READING_2,LIMIT_2,RESULT_2
1/1/2019 07:37:17.432 AM,1,23,10,FAIL,01/01/2019 07:37:23.661 AM,2,3,3,PASS
Is this the right way to deal with this kind of data?
If so, then does Pandas has any inbuilt support for this?
I will be using scikit-learn to create models.
Solution
First convert columns to ordered CategoricalIndex
for same order of columns in output, convert DATE
column by to_datetime
and convert datetimes to dates by Series.dt.date
with cumcount
for counter, create MultiIndex
by set_index
, reshape by unstack
and sort second level of MultiIndex in columns
by sort_index
. Last flatten it by list comprehension
with reset_index
:
df['DATE'] = pd.to_datetime(df['DATE'])
dates = df['DATE'].dt.date
df.columns = pd.CategoricalIndex(df.columns,categories=df.columns, ordered=True)
g = df.groupby(dates).cumcount().add(1)
df = df.set_index([dates, g]).unstack().sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(drop=True)
print (df)
DATE_1 TESTSTEP_1 READING_1 LIMIT_1 RESULT_1 \
0 2019-01-01 07:37:17.432 1 23 10 FAIL
DATE_2 TESTSTEP_2 READING_2 LIMIT_2 RESULT_2
0 2019-01-01 07:37:23.661 2 3 3 PASS
If need also date
s in separate first column:
df['DATE'] = pd.to_datetime(df['DATE'])
dates = df['DATE'].dt.date
df.columns = pd.CategoricalIndex(df.columns,categories=df.columns, ordered=True)
g = df.groupby(dates).cumcount().add(1)
df = df.set_index([dates.rename('DAT'), g]).unstack().sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
DAT DATE_1 TESTSTEP_1 READING_1 LIMIT_1 RESULT_1 \
0 2019-01-01 2019-01-01 07:37:17.432 1 23 10 FAIL
DATE_2 TESTSTEP_2 READING_2 LIMIT_2 RESULT_2
0 2019-01-01 07:37:23.661 2 3 3 PASS
Answered By - jezrael
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.