Thursday, November 3, 2022

[FIXED] Grouping and bucketing a dataset

November 03, 2022 group-by, jupyter-notebook, pandas, python No comments

Issue

I have the following dataset:

+-------+--------+---------------+----------+
| store | loc_id | competitor_id | distance |
+-------+--------+---------------+----------+
| 1     | 235467 | 567           | 1        |
| .     | .      | .             | .        |
| .     | .      | .             | .        |
| .     | .      | .             | .        |
| 1200  | 236667 | 7890          | 70       |
+-------+--------+---------------+----------+

Where:

store: has the unique numbers for each store - total of 1200

loc_id: has the unique location number for each store - total of 1200

competitors: has a variety of competitors IDs. - total of 600

distance: has the distance a competitor is from the store in miles - up to 70 miles

This dataset is massive, around 5 million rows.

The competitors can repeat per store since they might be in different distances, and also repeat thorough different stores meaning:

+-------+--------+---------------+----------+
| store | loc_id | competitor_id | distance |
+-------+--------+---------------+----------+
| 1     | 235467 | 567           | 1        |
| 1     | 235467 | 567           | 20       |
| 65    | 235532 | 567           | 5        |
+-------+--------+---------------+----------+

I need to aggroup the occurrences of each competitor within buckets of distances where the resulting dataset would be the following:

+-------+--------+---------------+---------------+----------------+----------+
| store | loc_id | competitor_id | under_10miles | ten_to_20miles | above_20 | 
+-------+--------+---------------+---------------+----------------+----------+
| 1     | 235467 | 567           | 2             | 0              | 15       |
| .     | .      | .             | .             | .              | .        |
| .     | .      | .             | .             | .              | .        |
| .     | .      | .             | .             | .              | .        |
| 1200  | 236667 | 7890          | 1             | 5              | 0        |
+-------+--------+---------------+---------------+----------------+----------+

meaning:

There are two 567 competitor, under 10 miles distance, from the store number 1.

I am using Jupyter notebook, so any help in Python or Pandas would be highly appreciated.

Solution

You can use groupby().value_counts():

(df.groupby(['store', 'loc_id', 'competitor_id'])
   ['distance'].value_counts(bins=[-1, 10, 20, np.inf])
   .unstack(level=-1, fill_values=0)
)

Or pd.crosstab with pd.cut:

pd.crosstab([df['store'], df['loc_id'], df['competitor_id']],
            pd.cut(df['distance'], bins=[-1,10,20, np.inf],
                   labels=['<10', '10-20', '>20'])
           )

Answered By - Quang Hoang

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Thursday, November 3, 2022

[FIXED] Grouping and bucketing a dataset

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels