Issue
Example of the dataframe:

import pandas as pd
import numpy as np

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie',
                       'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'],
       'Days_Sold': [3, 4, 1, 2, 2, 1]}
cake = pd.DataFrame(data=cup)
cake
I calculated the most common words in the dataframe (with stop words already removed):

from collections import Counter
Counter(" ".join(cake['Description']).split()).most_common()
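On this example data, that returns:

[('cupcake', 3), ('cookie', 3), ('organic', 3), ('strawberry', 2), ('blueberry', 2), ('grape', 1), ('lemon', 1)]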
I put this into a new dataframe and reset the index:

count = pd.DataFrame(Counter(" ".join(cake['Description']).split()).most_common())
count.columns = ['Words', 'Values']
count.index = np.arange(1, len(count) + 1)
count.head()
The Values column is in the 'count' dataframe and Days_Sold is in the 'cake' dataframe. What I would like to do now: for each common word in the 'count' dataframe (like cupcake), find every row in the 'cake' dataframe whose description contains that word and total how long those products took to sell, going through every common word in 'count' until it's done. For cupcake the answer should come out to 8 (3 + 4 + 1).
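For a single word, the figure I'm after can be checked directly, e.g.:

cake.loc[cake['Description'].str.contains('cupcake'), 'Days_Sold'].sum()   # returns 8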
My actual dataframe is over 3,000 rows (and not exactly about cakes), the descriptions are longer, and I need over 40 common words, adjustable to my needs. That's why I can't type in each word by hand. I believe this requires a nested for loop, but I am stuck on it:
for day in cake:
    for top in count:
        top = count.Words
        day = cake.loc[cake['CleanDescr'] == count, ['Days_Sold']]
The error says: 'int' object is not iterable
Thank you!
Update:
Thank you so much to everyone helping me on this large project. I am posting my solution to #3, adjusted from the answer by Mark Moretto.
# Split and explode Description
df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()
df
# Merge counts onto the exploded words ('count' keeps its words in the 'Words' column)
df_freq = pd.merge(df, count, left_on="Description", right_on="Words")
df_freq
# Left join cake DataFrame onto df_freq by index values.
df_freq = (pd.merge(df_freq, cake, left_on="index", right_index=True)
.loc[:, ["Description_x", "Values", "Days_Sold"]]
.rename(columns={"Description_x": "Description"})
)
df_freq
# Group by word and take the mean of the numeric fields
df_metrics = df_freq.groupby("Description").mean().round(4)
df_metrics
df_metrics.sort_values(by='Values', ascending=False).head(5)
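To get my original target instead (total Days_Sold per common word, e.g. 8 for cupcake), the same df_freq can be aggregated with sum() rather than mean():

df_freq.groupby("Description")[["Days_Sold"]].sum()   # cupcake -> 8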
Solution
Another way, though I didn't remove any stop words or filter by frequency.
# <...your starter code for dataframe creation...>
# Split and explode Description
df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()
# Get count of words
df_counts = (df.groupby("Description")
.size()
.reset_index()
.rename(columns={0: "word_count"})
)
# Merge counts to main DataFrame
df_freq = pd.merge(df, df_counts, on="Description")
# Left join cake DataFrame onto df_freq by index values.
df_freq = (pd.merge(df_freq, cake, left_on="index", right_index=True)
.loc[:, ["Description_x", "word_count", "Days_Sold"]]
.rename(columns={"Description_x": "Description"})
)
# Group by Description and return max result for value fields
df_metrics = df_freq.groupby("Description").max()
print(df_metrics)
Output:
             word_count  Days_Sold
Description
blueberry             2          4
cookie                3          2
cupcake               3          4
grape                 1          2
lemon                 1          1
organic               3          2
strawberry            2          3
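If you only need the N most frequent words (the question mentions 40+, adjustable), one option is to sort and slice the result:

top_n = 40   # adjustable to your needs
df_metrics.sort_values("word_count", ascending=False).head(top_n)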
Answered By - Mark Moretto