Issue
I can't figure out how to achieve a certain task in my python script.
I have a dataframe that contains media coverage for a specific topic. One of my columns names the author of the specific article for example...
I am trying to create a pivot table to show the total count for each journalist in the column like so
datajournalist = company1_topline.pivot_table(index='AuthorUsername', values='ContentID', aggfunc= np.count_nonzero )
which would give me something like
AuthorUsername count_nonzero
Aaron Mehta 1
Aamer Madhani 1
Aamer Madhani ; Ben Fox 1
What I am looking for is a way to have the pivot table also count the names that appear in multiple value cells, to get a true count of each author. So for example the row that has "Aamer Madhani ; Ben Fox" would also count towards the count for just "Aamer Madhani" so the row for just "Aamer Madhani" would have a count of 2 instead of 1 etc.... Is there a way to do this? In excel this can be achieved with a SUMIF, but I do not know how to do this with Python/Pandas.
Desired output
AuthorUsername count_nonzero
Aaron Mehta 1
Aamer Madhani 2
Aamer Madhani ; Ben Fox 1
If anyone can point me in the right direction it would be much appreciated.
Solution
With your DataFrame having the AuthorUsername
column like this:
AuthorUsername
0 Aaron Mehta
1 Aamer Madhani
2 Aamer Madhani ; Ben Fox
You can do:
import collections
# Remove leading and trailing spaces (if any).
df['AuthorUsername'] = df['AuthorUsername'].str.strip()
# Get unique authors and their counts.
authors_counts = collections.Counter(df['AuthorUsername'].str.split('\s*;\s*').sum())
# Add to new column.
real_counts = collections.defaultdict(lambda: 1, authors_counts)
df['count_nonzero'] = [real_counts[a] for a in df['AuthorUsername']]
print(df)
Result:
AuthorUsername count_nonzero
0 Aaron Mehta 1
1 Aamer Madhani 2
2 Aamer Madhani ; Ben Fox 1
Edit after comments - sum more metrics:
After the comments, here is a more versatile version to also sum the Metrics
column, and potentially others.
Input dataframe:
AuthorUsername Metrics
0 Aaron Mehta 1.3
1 Aamer Madhani 2.0
2 Aamer Madhani ; Ben Fox 0.5
Code:
df['AuthorUsername'] = df['AuthorUsername'].str.strip()
df['single_authors'] = df['AuthorUsername'].str.split('\s*;\s*')
df['count_nonzero'] = 1
single_metrics = df.explode('single_authors').groupby('single_authors').sum()
multiple_metrics = df[df['single_authors'].map(len) > 1].groupby('AuthorUsername').sum()
all_metrics = pd.concat([single_metrics, multiple_metrics]).rename_axis('AuthorUsername').reset_index()
df = df.drop(columns=['Metrics', 'count_nonzero', 'single_authors']).merge(all_metrics, how='left', on='AuthorUsername')
print(df)
Result:
AuthorUsername Metrics count_nonzero
0 Aaron Mehta 1.3 1
1 Aamer Madhani 2.5 2
2 Aamer Madhani ; Ben Fox 0.5 1
Answered By - user2246849
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.