Issue
I'm doing research on keyword importance in the annual reports of cloud providers. I have already extracted the text from the PDFs, and I'm mostly interested in the importance of the word "cloud" in them.
So I decided to use the TF-IDF algorithm to measure the importance of the keyword across multiple documents. However, I'm not a data scientist, I'm a software engineer, and I don't know whether my solution makes sense. Here is what I have:
import glob
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

x = [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]


def get_line_and_trend(directory_path):
    # Find the report text files in the year subfolders
    text_files = glob.glob(f"{directory_path}/**/*.txt", recursive=True)
    text_titles = [Path(text).stem for text in text_files]

    # TF-IDF
    tfidf_vectorizer = TfidfVectorizer(input="filename", stop_words="english")
    tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

    # Search for the word "cloud"
    df = pd.DataFrame(
        tfidf_vector.toarray(),
        index=text_titles,
        # get_feature_names() was removed in scikit-learn 1.2
        columns=tfidf_vectorizer.get_feature_names_out(),
    ).sort_index()
    tfidf_slice = df[["cloud"]]
    # round() returns a new DataFrame; without assignment this call has no effect
    tfidf_slice.round(decimals=2)

    # Fit a straight trend line (degree-1 polynomial) through the values
    z = np.polyfit(x, tfidf_slice["cloud"], 1)
    p = np.poly1d(z)
    return tfidf_slice["cloud"], p(x)
I pass in a folder that contains the annual reports (.txt), organized in one subfolder per year (2014-2021). With that I can plot the values and their trend lines with the following code:
import matplotlib.pyplot as plt

alibaba_cloud, trendline = get_line_and_trend("./alibaba")
plt.plot(x, alibaba_cloud, color='b')
plt.plot(x, trendline, "b--")
google_cloud, trendline = get_line_and_trend("./google")
plt.plot(x, google_cloud, color='r')
plt.plot(x, trendline, "r--")
plt.legend(["Alibaba", "Trend line", "Google", "Trend line"])
plt.title("TF-IDF for \"cloud\" in annual reports")
plt.show()
So my questions are:
- Does it make sense to use TF-IDF to track the importance of a keyword over time? Should I use something else?
- Does the chart really represent what I'm trying to do?
Solution
If I understand what you want, I think the answer is "not really".
The purpose of the IDF factor is to provide a normalized weight for each term. If the term occurs in all your documents, you are ranking it as less important than other terms that do not.
In other words, the chart is not wrong in essence; it shows how the use of the term has increased over the years. But the Y axis is basically meaningless in isolation: within one corpus the IDF of a given term is a single constant, and scaling by it just obscures the number you actually want to explore, that is, the frequency of your term of interest.
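A small sketch of that point, again on an invented corpus. It assumes `norm=None` so that only the IDF scaling is visible; with the default `norm="l2"`, `TfidfVectorizer` additionally rescales each document's vector to unit length, which makes the plotted value depend on every other term in that document as well.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (invented for illustration); the last document lacks "cloud".
docs = [
    "cloud revenue grew",
    "cloud storage and cloud compute",
    "cloud cloud cloud everywhere",
    "hardware sales declined",
]

# Raw occurrence counts of "cloud" per document.
counter = CountVectorizer().fit(docs)
raw_cloud = counter.transform(docs).toarray()[:, counter.vocabulary_["cloud"]]

# TF-IDF without per-document length normalization.
tfidf = TfidfVectorizer(norm=None)
weighted = tfidf.fit_transform(docs).toarray()
weighted_cloud = weighted[:, tfidf.vocabulary_["cloud"]]

# The IDF of "cloud" is one number for the whole corpus.
idf_cloud = tfidf.idf_[tfidf.vocabulary_["cloud"]]

# Each TF-IDF value is just the raw count times that constant.
print(np.allclose(weighted_cloud, raw_cloud * idf_cloud))
```

Plotting `weighted_cloud` therefore gives the same curve as plotting `raw_cloud`, only with a different scale on the Y axis.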
If you were to compare two different terms, IDF would make sense: it down-weights really common words (like "the" and "and") so that their relative use in a specific document can be compared with the relative frequency of less common words (like "fraught" and "outwardly") in the same document on a normalized scale.
It sounds to me like the number you care about is simply the term frequency. One normalization that could make sense is to divide by the length of the document, so that if "cloud" occurs twice in a document with 20,000 words, it is not counted as more significant than if it occurs only once in a smaller document which contains only 10,000 words.
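A minimal sketch of that length-normalized frequency, using only the standard library. The function name `relative_frequency` and the simple regex tokenizer are assumptions made for illustration; a real report may need more careful tokenization.

```python
import re
from collections import Counter


def relative_frequency(text, term):
    """Return occurrences of `term` divided by the total number of word tokens."""
    # Hypothetical, deliberately simple tokenizer: lowercase alphanumeric runs.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[term.lower()] / len(tokens)


# Invented sample sentence: 2 occurrences of "cloud" out of 7 tokens.
sample = "Cloud revenue grew. Our cloud business expanded."
print(relative_frequency(sample, "cloud"))
```

Applied to each year's report text, this yields one value per year that can be plotted against `x` in the same way as before, and the values stay comparable between providers because they do not depend on what else is in the corpus.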
Answered By - tripleee