Issue
I have a dataframe of my city that contains the people's evaluation of neighbourhood problems and looks something like that:
personID hood problems problemTop1 problemTop2 problemTop3 other columns
0001 A ['cars', 'air', 'trash', 'parks', ...] 'trash' 'cars' 'air' ...
0002 D ['cars', 'air', 'trash', 'parks', ...] 'cars' 'air' 'trash' ...
0003 A ['cars', 'air', 'trash', 'parks', ...] 'cars' 'parks' 'air' ...
0004 M ['cars', 'air', 'trash', 'parks', ...] 'parks' 'cars' 'air' ...
...
Here is a short version of the dataframe with only the important columns. I would like to group the dataframe by neighbourhoods (hood
) and give each of the possible problems a score. The score should be a weighted sum of how problematic the people perceived the problem and as an example should look like that - if in neighbourhood E
10 people perceived trash
as problemTop1
, 7 people as problemTop2
and 3 as problemTop3
, then
E_score_trash = 10*3 + 7*2 + 3*1 = 47
My final output should show for each neighbourhood each problem's score, like this:
>>> df_scores_by_hood
hood 'air' 'cars' 'parks' 'trash'
A 386 871 321 984
B 436 719 504 620
C 383 230 848 601
D 881 524 231 783
...
Is there an easy way to do this with pandas, numpy or other functions? Tnx
Solution
I found a solution, but I am not happy with it - it's not really pandasic, it's not vectorized (it does loops) and I have the feeling there should be something simpler for my problem. Do you know how to improve it?
# list of all the problems
problems = ['cars', 'air', 'trash',...]
for problem in problems:
df[problem] = 0
df[problem] = df.apply(lambda row: 3 if row['problemTop1']==problem
else 2 if row['problemTop2']==problem
else 1 if row['problemTop3']==problem
else 0, axis=1)
Answered By - NeStack
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.