Issue
I have a dataframe column with different device models. I want to generate summary for common models by grouping similar devices together. I don't have a list of exact brands/words to use difflib SequenceMatcher.
Main challenge I'm facing: - I don't have a fixed list of products (devices) - it's a million rows long data frame with devices ranging from Playstations and Nintendos to all the phones and tablets.
This is how my data looks like:
device_model | volume of requests (100k) |
---|---|
iphone XR | 1.6151 |
ipad | 1.4768 |
galaxy s21 5g | 1.4443 |
ultra 5g galaxy s21 | 1.3395 |
samsung galaxy s9 | 1.2902 |
iphone 11 | 1.2696 |
galaxy s10 | 1.2613 |
a32 5g galaxy | 1.209 |
galaxy s10e | 1.0994 |
apple iphone 13 pro max | 1.0285 |
a12 galaxy | 1.0089 |
galaxy s22 ultra 5g | 1.0054 |
samsung galaxy s20 fe 5g | 0.9686 |
iphone 13 pro | 0.9116 |
galaxy s10+ | 0.8795 |
S22 galaxy 5g | 0.8279 |
galaxy s9 plus | 0.8175 |
5g galaxy s21 plus | 0.8155 |
apple ipad air | 0.7958 |
This is how I would like my output to be
device_model | total request volume (100k) |
---|---|
iPhone | 4.8248 |
iPad | 2.2726 |
Galaxy | 13.967 |
I tried to apply topic modelling using Spacy & Gensim, and N-grams, but it doesn't work reliably for such small words. Looking for a quick and simple way.
Solution
Found a neat solution (I think), loved the puzzle:
It finds the most common word in the column and assign this word as device for the rows where the most common word appears. Then, for the remaining rows, it finds the most common word and assigns this word as device for these rows. This repeats untill the complete column device
is filled.
from collections import Counter
df['device'] = ''
while sum(df['device'] == '') > 0:
most_common_word = Counter(" ".join(df[df['device'] == '']["device_model"]).split()).most_common(1)[0][0]
df.loc[df.device_model.str.contains(most_common_word) & (df.device == ''), 'device'] = most_common_word
print(most_common_word)
Returns:
galaxy
iphone
ipad
Then apply pd.groupby
:
df.groupby('device')['volume of requests (100k)'].sum()
Output:
device
galaxy 13.9670
ipad 2.2726
iphone 4.8248
Name: volume of requests (100k), dtype: float64
Note that this does not work when (for example) the word pro is the most common word which is quite possible. However for the dataset given, it works. These über common words need to be removed first, otherwise it is quite hard for a program to know that pro is not a device
Answered By - T C Molenaar
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.