Issue
Hi I have the following dataset:
EID | company | SD | ED |
---|---|---|---|
B12345 | A11 | 1/1/2021 | 3/1/2021 |
B12345 | B11 | 1/1/2021 | 1/20/2021 |
B12345 | C11 | 1/21/2021 | 2/1/2021 |
B12345 | C11 | 2/2/2021 | 3/1/2021 |
B12346 | A11 | 1/1/2010 | 12/31/2021 |
B12346 | B11 | 1/1/2011 | 12/31/2015 |
B12346 | C11 | 1/1/2022 | 12/31/2022 |
As you may have observed the dates here for EID B12345 both companies B11 and C11 respectively are inclusive of dates for company A11. Similarly for EID B12346 dates for company B11 is inclusive of date for company A11. So what I want to do here is for all the EIDs that I have in the data frame I want to check for each company whether the Start Date (SD) and End Date (ED) are inclusive of previous company or not. And if they are inclusive then the final dates should be taken of the company whose dates lie in maximum range. And if the dates are different then take the data as it is. So the output should contain the dates which are for company A11.
Basically what I want to check here is whether there is any overlap between the two date columns for each eid with respect to company and if there is then take the maximum range else keep the data as it is. Also its not just comparing the company with the previous company, but also with the company before that if the previous company should be eliminated. The expected output is as follows:
EID | company | SD | ED |
---|---|---|---|
B12345 | A11 | 1/1/2021 | 3/1/2021 |
B12346 | A11 | 1/1/2010 | 12/31/2021 |
B12346 | C11 | 1/1/2022 | 12/31/2022 |
Below is the code I have tried:
group = (df['SD'] <= df.groupby(['EID'['END'].shift()).groupby([df['EID']]).cumsum()
df = df.loc[df.groupby([df['EID'], group])['company'].idxmax()].sort_index()
the error that I am getting here is:
TypeError: reduction operation 'argmax' not allowed for this dtype
I have been unable to understand what this error means. I even tried to change the data type of company column but still not able to fix it.
Any leads would be appreciated. Thanks.!
Solution
Your question is not complete. You not only want to compare the company with the previous company, but also with the company before that if the previous company should be eliminated.
Otherwise, it would be a simple comparison:
df.groupby('EID').apply(
lambda group: group[~(
(group.SD >= group.SD.shift()) &
(group.ED <= group.ED.shift())
)]
)
However, this results in the following:
EID company SD ED
B12345 A11 1/1/2021 3/1/2021
B12345 C11 1/21/2021 2/1/2021
B12345 C11 2/2/2021 3/1/2021
B12346 A11 1/1/2010 12/31/2021
B12346 C11 1/1/2022 12/31/2022
Now, you want to compare C11 with A11. You can do this with a (not so elegant) while loop
:
different = True
while different:
df_orig = df.copy()
df = df.groupby('EID').apply(lambda group: group[~((group.SD >= group.SD.shift()) & (group.ED <= group.ED.shift()))]).reset_index(drop=True)
different = not df.equals(df_orig)
This will keep doing the same trick, and only stops if the resulting dataframe equals the original dataframe.
Resulting in:
EID company SD ED
B12345 A11 1/1/2021 3/1/2021
B12346 A11 1/1/2010 12/31/2021
B12346 C11 1/1/2022 12/31/2022
Answered By - Paul
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.