Saturday, February 5, 2022

[FIXED] Select number of values from column based on condition in a different df column

Labels: pandas, random

Issue

I am working on creating a dummy dataset for testing a cloud storage and dashboard system for a university. I am currently trying to assign courses to each student ID for a given term; this would be the course enrollment step in real life. Most students take a full load of 4 classes, and some take 3, 2, or 1 class, with decreasing probability.

I have two pandas DataFrames, 'courses' and 'students_master'.

'courses' has 1100 rows and looks like this:

  subject_id course_id SECTION_SUBJECT        SECTION_SUBJECT_DESC  \
0        HCH   HCH-101            HPCH  Community Health Promotion   
1        HCH   HCH-102            HPCH  Community Health Promotion   
2        HCH   HCH-103            HPCH  Community Health Promotion   
3        HCH   HCH-104            HPCH  Community Health Promotion   
4        HCH   HCH-105            HPCH  Community Health Promotion

'students_master' has 27054 rows and looks like this:

   ID_year_id  cohort      ids  level num_classes
0       22180  2013FA  1001269      4           4
1       49919  2013FA  1000206      4           4
2       48206  2013FA  1000524      4           2
3       40649  2013FA  1000233      4           3
4       29733  2013FA  1000533      4           2

At this point I am trying to create a new column, students_master['selections'], where I use the number, 1-4, in the 'num_classes' column to randomly select a number of course_ids from courses['course_id']. The resulting column values would be small lists like [HCH-101, TWI-302,...]

When I use this piece of code:

list(courses['course_id'].sample(4))

it works, and results in:

['EVS-406', 'BFN-201', 'ATS-105', 'BOL-103']

I have tried using .apply as well as basic for loops with no luck. I think the most promising method is to 'vectorize', so I wrote this np.select statement:

selections=[]
conditions = [
        (students_master['num_classes']==4),
        (students_master['num_classes']==3),
        (students_master['num_classes']==2),
        (students_master['num_classes']==1)
]
choices = [
        ([list(courses['course_id'].sample(4))]),
        ([list(courses['course_id'].sample(3))]),
        ([list(courses['course_id'].sample(2))]),
        ([list(courses['course_id'].sample(1))])
]


selections.append(np.select(conditions, choices))

and it gets the error: "shape mismatch: objects cannot be broadcast to a single shape"

Any advice on how to solve this problem is greatly appreciated.

Solution

You can use apply with np.random.choice and replace=False, which ensures that no course is repeated within a single student's selection:

selection = students_master['num_classes'].apply(lambda x: np.random.choice(courses['course_id'], x, replace=False))
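
To make the apply approach easier to try out, here is a minimal sketch on small stand-in data. The two frames below are hypothetical, cut-down versions of 'courses' and 'students_master' (only the columns the selection needs), and the fixed seed and the conversion of each result to a plain list are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed so repeated runs give the same draw

# Hypothetical stand-ins for the real 'courses' and 'students_master' frames
courses = pd.DataFrame(
    {"course_id": ["HCH-101", "HCH-102", "HCH-103", "HCH-104", "HCH-105"]}
)
students_master = pd.DataFrame(
    {
        "ids": [1001269, 1000206, 1000524, 1000233, 1000533],
        "num_classes": [4, 4, 2, 3, 2],
    }
)

# Draw a fresh sample for every row; replace=False prevents a course from
# appearing twice in the same student's selection.
students_master["selections"] = students_master["num_classes"].apply(
    lambda n: list(np.random.choice(courses["course_id"], n, replace=False))
)

print(students_master)
```

Because the lambda runs once per row, each student gets an independent draw, so two students with the same 'num_classes' will usually end up with different course lists.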

Answered By - Quang Hoang
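
As a side note on the error in the question: np.select most likely fails because it needs to bring every entry of 'choices' to one common shape, and the four nested lists have shapes (1, 4), (1, 3), (1, 2) and (1, 1), which cannot be broadcast together. The sketch below reproduces that shape conflict with empty placeholder arrays; the arrays are stand-ins, not the real sampled course lists.

```python
import numpy as np

# Placeholder arrays with the same shapes as the question's four choices:
# one nested list per class count gives (1, 4), (1, 3), (1, 2) and (1, 1).
choices = [np.empty((1, k), dtype=object) for k in (4, 3, 2, 1)]

try:
    np.broadcast_arrays(*choices)
    broadcast_ok = True
except ValueError as err:
    # Shapes (1, 4) and (1, 3) disagree on the last axis, so broadcasting fails.
    broadcast_ok = False
    print(err)
```

Even with compatible shapes, each sample in 'choices' is drawn only once, so every student with the same 'num_classes' would receive the identical course list. That is why a per-row draw fits this problem better.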

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
