Issue
I have a dataframe of individual tweets (id, text, author_id, nn_list), where nn_list is a list of other tweet indices previously identified as potential nearest neighbours. For each tweet, I have to compute the cosine similarity between its row in the tf-idf matrix and the row of every entry in its nn_list, but my current approach is quite slow. The current code looks something like this:
for index, row in data_df.iterrows():
    nn_distance = float("inf")  # reset the best distance for each tweet
    for candidate in row["nn_list"]:
        candidate_cos = float("%.2f" % pairwise_distances(tfidf_matrix[candidate], tfidf_matrix[index], metric='cosine'))
        if candidate_cos < nn_distance:
            current_nn_candidate = candidate
            nn_distance = candidate_cos
Is there a significantly faster way to calculate this?
Solution
The following code should work, assuming the range of IDs is not too large:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({"nn_list": [[1, 2], [1, 2, 3], [1, 2, 3, 7], [11, 12, 13], [2, 1]]})

# Data consistent with https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
df["data"] = df["nn_list"].apply(lambda x: np.repeat(1, len(x)))
df["row"] = df.index
df["row_ind"] = df[['row', 'nn_list']].apply(lambda x: np.repeat(x[0], len(x[1])), axis=1)
df["col_ind"] = df['nn_list'].apply(lambda x: np.array(x))

m = csr_matrix(
    (np.concatenate(df['data']),
     (np.concatenate(df['row_ind']), np.concatenate(df['col_ind']))))

cosine_similarity(m)
Will return:
array([[1. , 0.81649658, 0.70710678, 0. , 1. ],
[0.81649658, 1. , 0.8660254 , 0. , 0.81649658],
[0.70710678, 0.8660254 , 1. , 0. , 0.70710678],
[0. , 0. , 0. , 1. , 0. ],
[1. , 0.81649658, 0.70710678, 0. , 1. ]])
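For the original problem, the same idea applies directly to the tf-idf matrix: one call to cosine_similarity gives the full tweet-by-tweet matrix, and the inner loop collapses to a single argmax over each row's candidate indices. A minimal sketch with a toy stand-in for tfidf_matrix (the data and nn_lists below are illustrative, not from the question):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the real tf-idf matrix (4 tweets, 3 terms)
tfidf_matrix = csr_matrix(np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
], dtype=float))

# Candidate nearest neighbours per tweet index (illustrative)
nn_lists = {0: [1, 2, 3], 1: [0, 3], 2: [0, 1]}

# One pass over the whole matrix instead of one pairwise call per candidate
sim = cosine_similarity(tfidf_matrix)

results = {}
for index, candidates in nn_lists.items():
    c = np.asarray(candidates)
    # highest similarity == lowest cosine distance
    results[index] = int(c[np.argmax(sim[index, c])])

print(results)  # nearest candidate for each tweet
```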
If you have a larger range of IDs, I recommend using Spark, or have a look at cosine similarity on large sparse matrices with numpy.
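Before reaching for Spark, a middle ground is to compute the similarity in row blocks, so only a small dense slice exists in memory at any time while the sparse product stays in SciPy. This is a sketch rather than the answer's code; blockwise_nearest and block_size are made-up names:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

def blockwise_nearest(m, block_size=1000):
    """For each row of sparse matrix m, return the index of its most
    cosine-similar other row, densifying only block_size rows at a time."""
    m = normalize(m)  # L2-normalise rows so dot products become cosine similarities
    nearest = np.empty(m.shape[0], dtype=int)
    for start in range(0, m.shape[0], block_size):
        # Sparse-sparse product, then densify just this block of rows
        block = (m[start:start + block_size] @ m.T).toarray()
        for i in range(block.shape[0]):
            block[i, start + i] = -1.0  # mask self-similarity
        nearest[start:start + block_size] = block.argmax(axis=1)
    return nearest
```

For the question's setting, one would restrict the argmax to each row's nn_list instead of scanning all rows.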
Answered By - dokteurwho