Issue
Below is the data that I have, which has 3 columns:
- ID - Member ID
- Company : Company Name
- Year - Year of Joining the company
import pandas as pd
import numpy as np
data = {'ID':[1,1,1,2,2,3,3,3,3,3,4,4,4],
'Company':['Google','Microsoft','LinkedIn','Youtube','Google','Google','Microsoft','Youtube','Google','Microsoft','Microsoft','Google','LinkedIn'],
'Year':[2001,2004,2009,2001,2009,1999,2000,2003,2006,2010,2010,2012,2020]}
FullData = pd.DataFrame(data)
FullData -
ID Company Year
1 Google 2001
1 Microsoft 2004
1 LinkedIn 2009
2 Youtube 2001
2 Google 2009
3 Google 1999
3 Microsoft 2000
3 Youtube 2003
3 Google 2006
3 Microsoft 2010
4 Microsoft 2010
4 Google 2012
4 LinkedIn 2020
Below I have grouped the data by ID and ranked it according to the Year
FullData['Rank'] = FullData.groupby('ID')['Year'].rank(method='first').astype(int)
FullData
ID Company Year Rank
1 Google 2001 1
1 Microsoft 2004 2
1 LinkedIn 2009 3
2 Youtube 2001 1
2 Google 2009 2
3 Google 1999 1
3 Microsoft 2000 2
3 Youtube 2003 3
3 Google 2006 4
3 Microsoft 2010 5
4 Microsoft 2010 1
4 Google 2012 2
4 LinkedIn 2020 3
Now I need to get only the Member ID's who have joined Microsoft right after google. I need to get only the records partitioned or grouped by ID which has Company Google and Microsoft and the Rank of Google is followed by Microsoft consecutively. (Accepted Output --> Google - Rank 1 and Microsoft -Rank 2 or Google - Rank 4 and Microsoft -Rank 5 and so on..)
Below is the sample of desired output
ID Company Year Rank
1 Google 2001 1
1 Microsoft 2004 2
3 Google 1999 1
3 Microsoft 2000 2
3 Google 2006 4
3 Microsoft 2010 5
OR Count of Unique ID's
Count of Unique ID's/Members who worked for Google prior to Microsoft = 2
Any help is appreciated. Thanks a million in advance
Solution
Use boolean indexing:
def myfunc(df):
m1 = (df['Company'].eq('Google') & df['Company'].shift(-1).eq('Microsoft'))
m2 = (df['Rank'].eq(df['Rank'].shift(-1) - 1))
return df[(m1 & m2) | (m1.shift() & m2.shift())]
out = FullData[FullData['Company'].isin(['Google', 'Microsoft'])] \
.groupby('ID').apply(myfunc).droplevel(0)
print(out)
# Output:
ID Company Year Rank
0 1 Google 2001 1
1 1 Microsoft 2004 2
5 3 Google 1999 1
6 3 Microsoft 2000 2
8 3 Google 2006 4
9 3 Microsoft 2010 5
For the unique count, use out['ID'].nunique()
Answered By - Corralien
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.