Issue
I am trying to count the number of occurrences of each character within a large dataset. For example, if the data were the numpy array ['A', 'AB', 'ABC'], then I would want {'A': 3, 'B': 2, 'C': 1} as the output. I currently have an implementation that looks like this:
import string
import numpy as np

char_count = {}
for c in string.printable:
    char_count[c] = np.char.count(data, c).sum()  # one full pass over data per character
The issue I am having is that this takes too long for my data. I have ~14,000,000 different strings that I would like to count and this implementation is not efficient for that amount of data. Any help is appreciated!
Solution
Another way, using collections.Counter:
import collections

c = collections.Counter()
for thing in data:
    c.update(thing)  # each string is iterated character by character
The basic advantage: it iterates over the data only once, rather than once for each character in string.printable as in the question.
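For illustration, here is a minimal sketch of the same single-pass idea wrapped in a function. It is not part of the original answer: the names count_chars and alphabet are hypothetical, and a plain list stands in for the numpy array (the sketch assumes the data is a one-dimensional sequence of strings). The optional alphabet argument is one way to reproduce the question's output shape, where every character of string.printable gets an entry even if its count is 0.

```python
from collections import Counter
from itertools import chain
import string


def count_chars(strings, alphabet=None):
    """Count every character across an iterable of strings in one pass.

    `count_chars` and `alphabet` are hypothetical names for this sketch.
    """
    # chain.from_iterable streams the characters of each string in turn,
    # so no large concatenated string is built in memory.
    counts = Counter(chain.from_iterable(strings))
    if alphabet is not None:
        # Mirror the question's output: one entry per character in
        # `alphabet`, with 0 for characters that never occur.
        return {ch: counts[ch] for ch in alphabet}
    return dict(counts)


data = ["A", "AB", "ABC"]  # stand-in for the question's numpy array
print(count_chars(data))  # {'A': 3, 'B': 2, 'C': 1}
```

Calling count_chars(data, string.printable) would instead return one entry per printable character. Note that without alphabet, characters outside string.printable are counted too, which differs from the loop in the question.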
Answered By - wwii