Wednesday, December 1, 2021

[FIXED] How to check term similarity within a pandas column with similarity.jarowinkler

Issue

I need to check whether two or more words in a list are similar. To do this, I am using the Jaro-Winkler similarity as follows:

from similarity.jarowinkler import JaroWinkler

word1='sweet chili'
word2='sriracha chilli'

jarowinkler = JaroWinkler()
print(jarowinkler.similarity(word1, word2))

It seems to detect the similarity between words, but I need to set a threshold so that only words that are at least 80% similar are selected. My difficulty, however, is checking all the words within a data frame's column:

Words

sweet chili
sriracha chilli
tomato
mayonnaise 
water
milk
still water
sparkling water
wine
chicken 
beef
...

What I would like to do is:
- starting with the first element, check the similarity between it and all the others; if the similarity is greater than a threshold (80%), save the match in a new array;
- check the second element (sriracha chilli) in the same way;
- and so on.

Could you please tell me how to run such a loop?

Solution

- This works with the given data and uses the strsim package (imported as similarity).
- If the real dataframe has many columns, consider making a dataframe with just the Words column, because the loop below adds one new column per word:
  new_df = pd.DataFrame({'Words': df.Words})

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from similarity.jarowinkler import JaroWinkler
import numpy as np

df = pd.DataFrame({'Words': ['sweet chili', 'sriracha chilli', 'tomato', 'mayonnaise ', 'water', 'milk', 'still water', 'sparkling water', 'wine', 'chicken ', 'beef']})

# create the Jaro-Winkler similarity object
jarowinkler = JaroWinkler()

# remove leading and trailing whitespace
df.Words = df.Words.str.strip()

# create column of matching values for each word
words = df.Words.tolist()

for word in words:
    df[word] = df.Words.apply(lambda x: jarowinkler.similarity(x, word))

|    | Words           |   sweet chili |   sriracha chilli |   tomato |   mayonnaise |    water |     milk |   still water |   sparkling water |     wine |   chicken |     beef |
|---:|:----------------|--------------:|------------------:|---------:|-------------:|---------:|---------:|--------------:|------------------:|---------:|----------:|---------:|
|  0 | sweet chili     |      1        |          0.605772 | 0.419192 |     0.39697  | 0.513131 | 0        |      0.515152 |          0.460101 | 0.560606 |  0.322511 | 0.560606 |
|  1 | sriracha chilli |      0.605772 |          1        | 0.411111 |     0.388889 | 0.344444 | 0.438889 |      0.460101 |          0.488889 | 0.438889 |  0.529365 | 0        |
|  2 | tomato          |      0.419192 |          0.411111 | 1        |     0.488889 | 0.411111 | 0.472222 |      0.590909 |          0.411111 | 0        |  0        | 0        |
|  3 | mayonnaise      |      0.39697  |          0.388889 | 0.488889 |     1        | 0.433333 | 0.45     |      0.460606 |          0.544444 | 0.45     |  0.328571 | 0        |
|  4 | water           |      0.513131 |          0.344444 | 0.411111 |     0.433333 | 1        | 0        |      0.430303 |          0.511111 | 0.633333 |  0.447619 | 0.483333 |
|  5 | milk            |      0        |          0.438889 | 0.472222 |     0.45     | 0        | 1        |      0.560606 |          0.538889 | 0.5      |  0.595238 | 0        |
|  6 | still water     |      0.515152 |          0.460101 | 0.590909 |     0.460606 | 0.430303 | 0.560606 |      1        |          0.749854 | 0.44697  |  0.489177 | 0        |
|  7 | sparkling water |      0.460101 |          0.488889 | 0.411111 |     0.544444 | 0.511111 | 0.538889 |      0.749854 |          1        | 0.544444 |  0.431746 | 0        |
|  8 | wine            |      0.560606 |          0.438889 | 0        |     0.45     | 0.633333 | 0.5      |      0.44697  |          0.544444 | 1        |  0.595238 | 0.5      |
|  9 | chicken         |      0.322511 |          0.529365 | 0        |     0.328571 | 0.447619 | 0.595238 |      0.489177 |          0.431746 | 0.595238 |  1        | 0        |
| 10 | beef            |      0.560606 |          0        | 0        |     0        | 0.483333 | 0        |      0        |          0        | 0.5      |  0        | 1        |

see values greater than 80%

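With this data, no pair scores above 0.8 except each word compared with itself (the diagonal). The highest off-diagonal score is 0.749854, for still water and sparkling water.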

df.set_index('Words', inplace=True)

np.where(df[words] > 0.8, df[words], np.nan)

array([[ 1., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       [nan,  1., nan, nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan,  1., nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan,  1., nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan,  1., nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan,  1., nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan,  1., nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan,  1., nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan, nan,  1., nan, nan],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan,  1., nan],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,  1.]])
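
The question asks for a list of similar words per element rather than a masked array. A minimal sketch of one way to get that from the similarity matrix is shown here. The function name similar_words and the small frame toy are hypothetical, made up for illustration; the three scores are copied from the table above. With the real data, df[words] (after set_index('Words')) could be passed instead of toy.

```python
import pandas as pd


def similar_words(sim, threshold=0.8):
    """Map each row label to the other labels whose score exceeds threshold.

    sim is assumed to be a square DataFrame of similarity scores whose
    index and columns hold the same words (hypothetical helper).
    """
    result = {}
    for word, row in sim.iterrows():
        # keep scores above the threshold, skipping the word's match with itself
        above = (row > threshold) & (row.index != word)
        result[word] = row[above].index.tolist()
    return result


# toy matrix: scores copied from the table above
labels = ['water', 'still water', 'sparkling water']
toy = pd.DataFrame(
    [[1.0, 0.430303, 0.511111],
     [0.430303, 1.0, 0.749854],
     [0.511111, 0.749854, 1.0]],
    index=labels,
    columns=labels,
)

print(similar_words(toy, threshold=0.8))
print(similar_words(toy, threshold=0.7))
```

At 0.8 every list is empty, which agrees with the array above. Lowering the threshold to 0.7 pairs still water with sparkling water, so the threshold may need tuning for data like this.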

add a heatmap

# hide the upper triangle (including the diagonal) so each pair is drawn once
mask = np.zeros_like(df[words], dtype=bool)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(df[words], mask=mask, square=True, cmap="YlGnBu")

Answered By - Trenton McKinney

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
