Issue
I'm new to Python and trying to compare multiple columns within a pandas DataFrame. The columns represent months, and I want to look across a period of x months and flag whether a given row has ever had a value of 2 or more.
Below is the code I used to create my example df:
import numpy as np
import pandas as pd
arr_random = np.random.randint(low=0, high=5, size=(100,26))
arr_random
col_names = []
i = 0
while i <= 25:
    col_names.append('mth_'+str(i))
    i = i + 1
rand_df = pd.DataFrame(arr_random, index = None, columns = col_names)
I want my flags to be such that: 1 = 2+, 0 = <2, -1 = missing data (I normally set NaN values to -1). Below is the code I'm using to do this:
review_months = [12, 18, 24]
for x in review_months: #check this
    rand_df['TWOPLUS_'+str(x)+'M'] = -1
    for i in range(x):
        rand_df['TWOPLUS_'+str(x)+'M'] = rand_df[['TWOPLUS_'+str(x)+'M', 'mth_'+str(i+1)]].max(axis = 1)
        conditions = [ rand_df['TWOPLUS_'+str(x)+'M'] >= 2, rand_df['TWOPLUS_'+str(x)+'M'] < 2, rand_df['mth_'+str(i)] == -1 ]
        choices = [ 1 , 0, -1 ]
        rand_df['TWOPLUS_'+str(x)+'M'] = np.select(conditions, choices, default=np.nan)
The only problem is, I'm not getting the status if the row has EVER had 2 or more in one of the columns over the given time period, it only gives back if they currently have 2 or more for the specified column.
Solution
You could use the following code to check whether each row has ever had a value of 2 or greater:
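for month in (12, 18, 24):
    # Columns mth_0 .. mth_{month}; True where a value is 2 or more
    rand_df[f'TWOPLUS_{month}M'] = (rand_df.loc[:, rand_df.columns[:month+1]] >= 2).any(axis=1).astype(int)
    # Assign the result back rather than calling fillna with inplace=True on a column selection
    rand_df[f'TWOPLUS_{month}M'] = rand_df[f'TWOPLUS_{month}M'].fillna(-1)
rand_df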
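It selects the columns up to the desired month (mth_0 through mth_{month}) and checks whether each value is at least 2. any(axis=1) then checks whether any value in each row is True, and astype(int) converts the result to 1 if so, else 0.
The fillna(-1) call is meant to replace NaNs with -1. Note, however, that a NaN compares as False against >= 2, so the flag column never contains NaN and a row with missing months ends up as 0 rather than -1. On the example data, which has no missing values, this makes no difference.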
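To make the individual steps easier to follow, here is a minimal sketch on a small hand-made frame. The `demo` frame and its values are made up for illustration and are not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-row frame; values chosen only to illustrate each step.
demo = pd.DataFrame(
    {
        "mth_0": [0, 1, np.nan],
        "mth_1": [1, 0, 1],
        "mth_2": [2, 1, 0],
    }
)

# Element-wise comparison gives a boolean frame; NaN >= 2 evaluates to False.
at_least_two = demo >= 2

# any(axis=1) collapses each row to True if any column in that row is True.
ever_two_plus = at_least_two.any(axis=1)

# astype(int) maps True -> 1 and False -> 0.
flags = ever_two_plus.astype(int)
print(flags.tolist())  # [1, 0, 0]
```

The last row contains a NaN but still gets a 0, which is why a later fillna(-1) finds nothing to fill.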
Here's a link to the documentation for the any method https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html and for using pandas .loc https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
Output
index | mth_0 | mth_1 | mth_2 | mth_3 | mth_4 | mth_5 | mth_6 | mth_7 | mth_8 | mth_9 | ... | mth_19 | mth_20 | mth_21 | mth_22 | mth_23 | mth_24 | mth_25 | TWOPLUS_12M | TWOPLUS_18M | TWOPLUS_24M |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 1 | 4 | 3 | 4 | 0 | 0 | 2 | ... | 4 | 4 | 0 | 1 | 0 | 1 | 2 | 1 | 1 | 1 |
1 | 0 | 0 | 3 | 4 | 4 | 1 | 0 | 1 | 4 | 2 | ... | 0 | 2 | 2 | 0 | 3 | 3 | 1 | 1 | 1 | 1 |
2 | 1 | 1 | 0 | 1 | 2 | 4 | 2 | 0 | 0 | 0 | ... | 3 | 2 | 2 | 1 | 3 | 4 | 0 | 1 | 1 | 1 |
3 | 3 | 3 | 2 | 4 | 1 | 0 | 4 | 4 | 0 | 2 | ... | 0 | 2 | 2 | 2 | 2 | 0 | 4 | 1 | 1 | 1 |
4 | 1 | 0 | 4 | 3 | 0 | 2 | 2 | 1 | 0 | 1 | ... | 3 | 0 | 1 | 2 | 1 | 3 | 0 | 1 | 1 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 3 | 1 | 0 | 1 | 4 | 3 | 2 | 1 | 0 | 2 | ... | 4 | 4 | 2 | 1 | 2 | 2 | 0 | 1 | 1 | 1 |
96 | 2 | 2 | 1 | 3 | 2 | 4 | 0 | 4 | 0 | 3 | ... | 2 | 2 | 3 | 4 | 4 | 1 | 4 | 1 | 1 | 1 |
97 | 3 | 1 | 1 | 4 | 4 | 3 | 0 | 0 | 2 | 0 | ... | 2 | 3 | 3 | 3 | 2 | 3 | 4 | 1 | 1 | 1 |
98 | 2 | 2 | 4 | 3 | 3 | 1 | 3 | 2 | 2 | 0 | ... | 1 | 0 | 3 | 2 | 3 | 1 | 1 | 1 | 1 | 1 |
99 | 4 | 3 | 0 | 1 | 3 | 4 | 0 | 3 | 0 | 4 | ... | 3 | 3 | 0 | 4 | 3 | 3 | 0 | 1 | 1 | 1 |
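The example data contains no missing values, so no row in the output above receives -1. If the -1 flag is needed, one possible approach is sketched below. It rests on assumptions that are not stated in the question: a month counts as missing if it is NaN or equal to -1, and a row is flagged -1 only when every month in the window is missing. The names `two_plus_flag`, `missing_marker` and the `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd

def two_plus_flag(df, month_cols, missing_marker=-1):
    """Return 1 if any month is >= 2, -1 if every month is missing, else 0."""
    window = df[month_cols]
    # Assumption: both NaN and the marker value count as missing.
    is_missing = window.isna() | (window == missing_marker)
    ever_two_plus = (window >= 2).any(axis=1).to_numpy()
    all_missing = is_missing.all(axis=1).to_numpy()
    # np.select takes the first condition that matches for each row;
    # the default covers rows with observed data that never reached 2.
    return np.select([ever_two_plus, all_missing], [1, -1], default=0)

# Hypothetical data covering each of the three outcomes.
demo = pd.DataFrame(
    {
        "mth_0": [0, -1, np.nan, 1],
        "mth_1": [3, -1, np.nan, np.nan],
        "mth_2": [1, -1, -1, 0],
    }
)
demo["TWOPLUS_2M"] = two_plus_flag(demo, ["mth_0", "mth_1", "mth_2"])
print(demo["TWOPLUS_2M"].tolist())  # [1, -1, -1, 0]
```

For the question's frame, the same function could be called once per review period with `rand_df.columns[:month+1]` as `month_cols`. Whether a partly missing window should count as 0 or -1 is a design choice that depends on the data.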
Answered By - renzo21