Issue
I need to check whether two or more words in a list are similar. To do this, I am using the Jaro-Winkler similarity as follows:
from similarity.jarowinkler import JaroWinkler
word1='sweet chili'
word2='sriracha chilli'
jarowinkler = JaroWinkler()
print(jarowinkler.similarity(word1, word2))
It seems to detect the similarity between words, but I need to set a threshold so that only words with a similarity above 80% are selected. My difficulty, however, is in checking all the words within a data frame's column:
Words
sweet chili
sriracha chilli
tomato
mayonnaise
water
milk
still water
sparkling water
wine
chicken
beef
...
What I would like to do is:
- starting with the first element, check the similarity between it and the others; if the similarity is greater than a threshold (80%), save it in a new array;
- check the second element (sriracha chilli) as above;
- and so on.
Could you please tell me how to run such a loop?
Solution
- With the given data
- Using the strsim package
- If the real dataframe has many columns, consider making a dataframe with just the Words column: new_df = pd.DataFrame({'Words': df.Words})
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from similarity.jarowinkler import JaroWinkler
import numpy as np
df = pd.DataFrame({'Words': ['sweet chili', 'sriracha chilli', 'tomato', 'mayonnaise ', 'water', 'milk', 'still water', 'sparkling water', 'wine', 'chicken ', 'beef']})
# create the Jaro-Winkler scorer
jarowinkler = JaroWinkler()
# remove whitespace
df.Words = df.Words.str.strip()
# create column of matching values for each word
words = df.Words.tolist()
for word in words:
    df[word] = df.Words.apply(lambda x: jarowinkler.similarity(x, word))
| | Words | sweet chili | sriracha chilli | tomato | mayonnaise | water | milk | still water | sparkling water | wine | chicken | beef |
|---:|:----------------|--------------:|------------------:|---------:|-------------:|---------:|---------:|--------------:|------------------:|---------:|----------:|---------:|
| 0 | sweet chili | 1 | 0.605772 | 0.419192 | 0.39697 | 0.513131 | 0 | 0.515152 | 0.460101 | 0.560606 | 0.322511 | 0.560606 |
| 1 | sriracha chilli | 0.605772 | 1 | 0.411111 | 0.388889 | 0.344444 | 0.438889 | 0.460101 | 0.488889 | 0.438889 | 0.529365 | 0 |
| 2 | tomato | 0.419192 | 0.411111 | 1 | 0.488889 | 0.411111 | 0.472222 | 0.590909 | 0.411111 | 0 | 0 | 0 |
| 3 | mayonnaise | 0.39697 | 0.388889 | 0.488889 | 1 | 0.433333 | 0.45 | 0.460606 | 0.544444 | 0.45 | 0.328571 | 0 |
| 4 | water | 0.513131 | 0.344444 | 0.411111 | 0.433333 | 1 | 0 | 0.430303 | 0.511111 | 0.633333 | 0.447619 | 0.483333 |
| 5 | milk | 0 | 0.438889 | 0.472222 | 0.45 | 0 | 1 | 0.560606 | 0.538889 | 0.5 | 0.595238 | 0 |
| 6 | still water | 0.515152 | 0.460101 | 0.590909 | 0.460606 | 0.430303 | 0.560606 | 1 | 0.749854 | 0.44697 | 0.489177 | 0 |
| 7 | sparkling water | 0.460101 | 0.488889 | 0.411111 | 0.544444 | 0.511111 | 0.538889 | 0.749854 | 1 | 0.544444 | 0.431746 | 0 |
| 8 | wine | 0.560606 | 0.438889 | 0 | 0.45 | 0.633333 | 0.5 | 0.44697 | 0.544444 | 1 | 0.595238 | 0.5 |
| 9 | chicken | 0.322511 | 0.529365 | 0 | 0.328571 | 0.447619 | 0.595238 | 0.489177 | 0.431746 | 0.595238 | 1 | 0 |
| 10 | beef | 0.560606 | 0 | 0 | 0 | 0.483333 | 0 | 0 | 0 | 0.5 | 0 | 1 |
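To make explicit what each cell in the table measures, here is a from-scratch sketch of the textbook Jaro-Winkler similarity in plain Python. It is not the strsim implementation: the function names `jaro` and `jaro_winkler`, the prefix scale of 0.1, and the four-character prefix cap are the commonly quoted defaults and are assumptions here, so scores may differ from the table above.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1] (illustrative sketch, not strsim)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters of both strings in order; every position
    # where they disagree is half a transposition.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(jaro_winkler('sweet chili', 'sriracha chilli'))
```

The prefix boost is why two strings that start with the same letters score higher than their plain Jaro similarity would suggest.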
See values greater than 80%
- None except the exact matches (each word compared with itself)
df.set_index('Words', inplace=True)
np.where(df[words] > 0.8, df[words], np.nan)
array([[ 1., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, 1., nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, 1., nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, 1., nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, 1., nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, 1., nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, 1., nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, 1., nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, 1., nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, 1., nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 1.]])
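The loop described in the question (for each word, keep the other words whose similarity exceeds the threshold) can be sketched without the full matrix. The helper names `similar_words` and `stand_in_similarity` are hypothetical, and the default scorer is `difflib.SequenceMatcher` from the standard library, which is not Jaro-Winkler; it is used only so the sketch runs without extra packages.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def stand_in_similarity(a: str, b: str) -> float:
    """Stand-in scorer from the standard library (NOT Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def similar_words(
    words: List[str],
    similarity: Callable[[str, str], float] = stand_in_similarity,
    threshold: float = 0.8,
) -> Dict[str, List[str]]:
    """Map each word to the other words scoring above the threshold."""
    result: Dict[str, List[str]] = {}
    for i, word in enumerate(words):
        # Skip the word itself (i == j), which would always score 1.0.
        result[word] = [
            other
            for j, other in enumerate(words)
            if i != j and similarity(word, other) > threshold
        ]
    return result


words = ['sweet chili', 'sriracha chilli', 'tomato', 'water', 'still water']
print(similar_words(words, threshold=0.8))
```

To use the scorer from this answer instead, pass it in explicitly, for example `similar_words(words, similarity=jarowinkler.similarity)`. Duplicate entries in `words` share one dictionary key, so deduplicate the list first if that matters.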
Add a heatmap
mask = np.zeros_like(df[words])
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(df[words], mask=mask, square=True, cmap="YlGnBu")
Answered By - Trenton McKinney