Issue
I have a dataset of chat conversations that looks like this (where message_id
is an index over all of the messages in the database):
| message_id | to_user | from_user | message |
|------------|---------|-----------|--------------|
| 123 | al | sal | hi |
| 871 | al | sal | hey |
| 989 | al | bob | me too |
| 900 | sal | al | hello |
| 107 | bob | al | i'm bob |
| 242 | sal | al | how are you? |
| 101 | al | bob | hi, i'm al |
| 898 | sal | al | i'm good |
I want to sort this table so that it reflects the conversations between each pair of people. It should first group all of the messages exchanged between a to_user and each from_user they chatted with, and then sort each of those conversations by message_id so that it reflects the back-and-forth:
| message_id | to_user | from_user | message |
|------------|---------|-----------|--------------|
| 101 | al | bob | hi, i'm al |
| 107 | bob | al | i'm bob |
| 989 | al | bob | me too |
| 123 | al | sal | hi |
| 242 | sal | al | how are you? |
| 871 | al | sal | hey |
| 898 | sal | al | i'm good |
| 900 | sal | al | hello |
How would I accomplish this in Pandas?
Solution
We can use np.sort to sort the values across each row, which gives us two columns that identify the participants but not the direction of the message. We then sort by conversation first and message id second with DataFrame.sort_values:
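# Sort each (to_user, from_user) pair within its row so direction no longer matters
df[['person_a', 'person_b']] = np.sort(df[['to_user', 'from_user']])
# Group by conversation first, then order each conversation by message_id
df = df.sort_values(['person_a', 'person_b', 'message_id'], ignore_index=True)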
   message_id  to_user  from_user       message  person_a  person_b
0         101       al        bob    hi, i'm al        al       bob
1         107      bob         al       i'm bob        al       bob
2         989       al        bob        me too        al       bob
3         123       al        sal            hi        al       sal
4         242      sal         al  how are you?        al       sal
5         871       al        sal           hey        al       sal
6         898      sal         al      i'm good        al       sal
7         900      sal         al         hello        al       sal
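To see why the row-wise sort removes direction, here is a minimal sketch on a made-up two-row frame (`pairs` is a hypothetical name, not part of the original data). It assumes the values are plain strings, so each pair is ordered alphabetically:

```python
import numpy as np
import pandas as pd

# Two messages from the same conversation, sent in opposite directions.
pairs = pd.DataFrame({
    'to_user': ['al', 'bob'],
    'from_user': ['bob', 'al'],
})

# axis=1 sorts the values within each row, so both rows
# end up as the same alphabetically ordered pair.
ordered = np.sort(pairs.to_numpy(), axis=1)
print(ordered.tolist())  # [['al', 'bob'], ['al', 'bob']]
```

Because both rows now carry the same pair, sorting on those two columns keeps messages from one conversation together regardless of who sent them.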
We can drop these additional columns after we're done with them:
df[['person_a', 'person_b']] = np.sort(df[['to_user', 'from_user']])
df = df.sort_values(
    ['person_a', 'person_b', 'message_id'], ignore_index=True
).drop(columns=['person_a', 'person_b'])
df:
   message_id  to_user  from_user       message
0         101       al        bob    hi, i'm al
1         107      bob         al       i'm bob
2         989       al        bob        me too
3         123       al        sal            hi
4         242      sal         al  how are you?
5         871       al        sal           hey
6         898      sal         al      i'm good
7         900      sal         al         hello
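If this is needed in more than one place, the same steps could be wrapped in a small helper. The following is only a sketch: the function name `sort_conversations` is hypothetical, and it uses `DataFrame.assign` with temporary columns so the input frame is left unmodified. The sample data matches the setup used for the output shown here.

```python
import numpy as np
import pandas as pd


def sort_conversations(df: pd.DataFrame) -> pd.DataFrame:
    """Group messages by unordered participant pair, then order by message_id."""
    # Row-wise sort so ('al', 'bob') and ('bob', 'al') map to the same pair.
    pairs = np.sort(df[['to_user', 'from_user']].to_numpy(), axis=1)
    return (
        df.assign(person_a=pairs[:, 0], person_b=pairs[:, 1])
          .sort_values(['person_a', 'person_b', 'message_id'], ignore_index=True)
          .drop(columns=['person_a', 'person_b'])
    )


df = pd.DataFrame({
    'message_id': [123, 871, 989, 900, 107, 242, 101, 898],
    'to_user': ['al', 'al', 'al', 'sal', 'bob', 'sal', 'al', 'sal'],
    'from_user': ['sal', 'sal', 'bob', 'al', 'al', 'al', 'bob', 'al'],
    'message': ['hi', 'hey', 'me too', 'hello', "i'm bob", 'how are you?',
                "hi, i'm al", "i'm good"]
})

result = sort_conversations(df)
print(result['message_id'].tolist())
# [101, 107, 989, 123, 242, 871, 898, 900]
```

The sort keys are ordered pair first and message_id last. Putting message_id first would simply order the whole table by id and break the grouping.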
Setup and imports (edited to match output):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'message_id': [123, 871, 989, 900, 107, 242, 101, 898],
    'to_user': ['al', 'al', 'al', 'sal', 'bob', 'sal', 'al', 'sal'],
    'from_user': ['sal', 'sal', 'bob', 'al', 'al', 'al', 'bob', 'al'],
    'message': ['hi', 'hey', 'me too', 'hello', "i'm bob", 'how are you?',
                "hi, i'm al", "i'm good"]
})
Answered By - Henry Ecker