Issue
I have the following dataset:
+-------+--------+---------------+----------+
| store | loc_id | competitor_id | distance |
+-------+--------+---------------+----------+
| 1 | 235467 | 567 | 1 |
| . | . | . | . |
| . | . | . | . |
| . | . | . | . |
| 1200 | 236667 | 7890 | 70 |
+-------+--------+---------------+----------+
Where:
store
: has the unique numbers for each store - total of 1200
loc_id
: has the unique location number for each store - total of 1200
competitors
: has a variety of competitors IDs. - total of 600
distance
: has the distance a competitor is from the store in miles - up to 70 miles
This dataset is massive, around 5 million rows.
The competitors
can repeat per store since they might be in different distances, and also repeat thorough different stores meaning:
+-------+--------+---------------+----------+
| store | loc_id | competitor_id | distance |
+-------+--------+---------------+----------+
| 1 | 235467 | 567 | 1 |
| 1 | 235467 | 567 | 20 |
| 65 | 235532 | 567 | 5 |
+-------+--------+---------------+----------+
I need to aggroup the occurrences of each competitor within buckets of distances where the resulting dataset would be the following:
+-------+--------+---------------+---------------+----------------+----------+
| store | loc_id | competitor_id | under_10miles | ten_to_20miles | above_20 |
+-------+--------+---------------+---------------+----------------+----------+
| 1 | 235467 | 567 | 2 | 0 | 15 |
| . | . | . | . | . | . |
| . | . | . | . | . | . |
| . | . | . | . | . | . |
| 1200 | 236667 | 7890 | 1 | 5 | 0 |
+-------+--------+---------------+---------------+----------------+----------+
meaning:
There are two 567 competitor, under 10 miles distance, from the store number 1.
I am using Jupyter notebook, so any help in Python or Pandas would be highly appreciated.
Solution
You can use groupby().value_counts()
:
(df.groupby(['store', 'loc_id', 'competitor_id'])
['distance'].value_counts(bins=[-1, 10, 20, np.inf])
.unstack(level=-1, fill_values=0)
)
Or pd.crosstab
with pd.cut
:
pd.crosstab([df['store'], df['loc_id'], df['competitor_id']],
pd.cut(df['distance'], bins=[-1,10,20, np.inf],
labels=['<10', '10-20', '>20'])
)
Answered By - Quang Hoang
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.