Issue
I have a very large dataset of roughly 6 million records that looks like this snippet:
import pandas as pd

data = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'TEXT': [
        "Mouthwatering BBQ ribs cheese, and coleslaw.",
        "Delicious pizza with pepperoni and extra cheese.",
        "Spicy Thai curry with cheese and jasmine rice.",
        "Tiramisu dessert topped with cocoa powder.",
        "Sushi rolls with fresh fish and soy sauce.",
        "Freshly baked chocolate chip cookies.",
        "Homemade lasagna with layers of cheese and pasta.",
        "Gourmet burgers with all the toppings and extra cheese.",
        "Crispy fried chicken with mashed potatoes and extra cheese.",
        "Creamy tomato soup with a grilled cheese sandwich."
    ],
    'DATE': [
        '2023-02-01', '2023-02-01', '2023-02-01', '2023-02-01', '2023-02-02',
        '2023-02-02', '2023-02-01', '2023-02-01', '2023-02-02', '2023-02-02'
    ]
})
I want to generate bigrams and trigrams from the column 'TEXT'. I'm interested in two types of n-grams, for both bigrams and trigrams: those that start with 'extra' and those that don't. Once I have those, I want to summarize them by unique 'DATE', counting the number of unique IDs each n-gram appears in. This means that if an n-gram appears more than once within an ID, I count it only once, because I want to know how many different IDs it ultimately appeared in.
I'm very new to Python. I come from the R world, where there is a library called quanteda that uses C code and parallel computing. Searching for those n-grams looks something like this:
corpus_food %>%
  tokens(remove_punct = TRUE) %>%
  tokens_ngrams(n = 2) %>%
  tokens_select(pattern = "^extra", valuetype = "regex") %>%
  dfm() %>%
  dfm_group(groups = lubridate::date(DATE)) %>%
  textstat_frequency()
yielding my desired results:
       feature frequency rank docfreq group
1 extra_cheese         3    1       2   all
My desired result would look like this:
ngram | nunique | group |
---|---|---|
cheese and | 3 | 1/02/2023 |
and extra | 2 | 1/02/2023 |
extra cheese | 2 | 1/02/2023 |
and extra cheese | 2 | 1/02/2023 |
mouthwatering bbq | 1 | 1/02/2023 |
bbq ribs | 1 | 1/02/2023 |
ribs cheese | 1 | 1/02/2023 |
and coleslaw | 1 | 1/02/2023 |
mouthwatering bbq ribs | 1 | 1/02/2023 |
bbq ribs cheese | 1 | 1/02/2023 |
ribs cheese and | 1 | 1/02/2023 |
cheese and coleslaw | 1 | 1/02/2023 |
delicious pizza | 1 | 1/02/2023 |
pizza with | 1 | 1/02/2023 |
with pepperoni | 1 | 1/02/2023 |
pepperoni and | 1 | 1/02/2023 |
delicious pizza with | 1 | 1/02/2023 |
pizza with pepperoni | 1 | 1/02/2023 |
with pepperoni and | 1 | 1/02/2023 |
pepperoni and extra | 1 | 1/02/2023 |
spicy thai | 1 | 1/02/2023 |
thai curry | 1 | 1/02/2023 |
I am in no way comparing the two languages; Python and R are both amazing. At the moment, though, I'm interested in a straightforward and fast way to achieve these results in Python, and I'm open to any approach that is faster or more efficient. I'm new to Python.
So far I have found a way to create the bigrams and trigrams, but I have no idea how to select those that start with "extra" and those that don't. Also, the n-gram creation itself is taking over an hour, so I'll take any advice on how to reduce that time.
Workaround:
from nltk.util import bigrams, trigrams
from nltk.tokenize import word_tokenize

# Tokenize each document, then build the n-gram tuples
data['bigrams'] = data['TEXT'].apply(lambda x: list(bigrams(word_tokenize(x))))
data['trigrams'] = data['TEXT'].apply(lambda x: list(trigrams(word_tokenize(x))))
Reading through some posts, some people suggest using the gensim library. Would that be a good direction?
Solution
It is easy to find n-grams using scikit-learn's CountVectorizer with the ngram_range argument.
You can create a document-term matrix with n-grams of size 2 and 3 only, append it to your original dataset, and then do the pivoting and aggregation with pandas to find what you need.
First, we'll build the document-term matrix and append it to our original data:
# Perform the count vectorization, keeping bigrams and trigrams only
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(2, 3))
X = cv.fit_transform(data['TEXT'])

# Create dataframe of document-term matrix
cv_df = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

# Append to original data
df = pd.concat([data, cv_df], axis=1)
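At this point cv_df has one column per bigram and trigram, so you can already peek at the vocabulary. As a quick sanity check (the extra_cols name is just illustrative), you could list the n-grams that start with "extra":

# N-grams in the fitted vocabulary that start with "extra"
extra_cols = [c for c in cv.get_feature_names_out() if c.startswith('extra')]
print(extra_cols)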
Then we group by ID and DATE, stack into long format, and filter to rows where the count is greater than zero. This gives the ID-DATE combinations in which each 2- or 3-gram appears, and we can then count the unique IDs for each:
# Group and pivot: sum the n-gram counts per ID-DATE pair, then stack to long format
# (numeric_only=True keeps the TEXT column out of the sum)
pivoted_df = df.groupby(['ID', 'DATE']).sum(numeric_only=True).stack().reset_index()
pivoted_df.columns = ['ID', 'DATE', 'ngram', 'count']

# Keep only the n-grams which appear for each ID-DATE combo, then count unique IDs
pivoted_df = pivoted_df[pivoted_df['count'] > 0]
pivoted_df.groupby(['DATE', 'ngram'])['ID'].nunique().reset_index(name='nunique')
Finally, we can create additional columns for the n-gram size and for whether or not the n-gram starts with extra, and use them for filtering:
# Add additional columns for ngram size
pivoted_df['ngram_size'] = pivoted_df['ngram'].str.split().str.len()
# Add additional column for starting with extra
pivoted_df['extra'] = pivoted_df['ngram'].str.startswith('extra')
# Find all the 2-grams that start with "extra"
pivoted_df[(pivoted_df['extra']) & (pivoted_df['ngram_size']==2)]
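To go from here to a summary shaped like your desired output, you can aggregate the filtered rows by DATE and n-gram while keeping the size and "extra" flags around for slicing. This is just a sketch built on the columns created above; the summary, extra_ngrams, and other_ngrams names are illustrative:

# Count, for each DATE and n-gram, how many distinct IDs it appears in
summary = (
    pivoted_df
    .groupby(['DATE', 'ngram', 'ngram_size', 'extra'])['ID']
    .nunique()
    .reset_index(name='nunique')
    .sort_values(['DATE', 'nunique'], ascending=[True, False])
)

# Bigrams and trigrams that start with "extra" ...
extra_ngrams = summary[summary['extra']]

# ... and those that don't
other_ngrams = summary[~summary['extra']]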
That being said, with 6M records you have a large dataset, and with this approach you will definitely run into memory issues. You will probably want to filter your data down to what you are most interested in to start with, and also make sure you use the min_df parameter of the CountVectorizer to keep the vocabulary, and therefore the document-term matrix, tractable.
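As a rough sketch of memory-friendlier settings (the min_df value here is just a placeholder you would tune for your corpus), you can also pass binary=True, since only presence per ID matters for counting unique IDs, and keep the matrix sparse instead of densifying it:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer(
    ngram_range=(2, 3),
    min_df=50,      # drop n-grams appearing in fewer than 50 documents (tune this)
    binary=True,    # presence/absence is enough when counting unique IDs
)
X = cv.fit_transform(data['TEXT'])

# Keep the document-term matrix sparse rather than calling .toarray() on 6M rows
cv_df = pd.DataFrame.sparse.from_spmatrix(X, columns=cv.get_feature_names_out())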
Answered By - NLP from scratch