Issue
Following the gensim word2vec embedding tutorial, I trained a simple word2vec model:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, size=100, window=5, min_count=1, workers=4)
model.save("/content/word2vec.model")
I would like to visualize it using the Embedding Projector in TensorBoard. There is another straightforward tutorial in the gensim documentation. I ran the following in Colab, which failed with the traceback below:
!python3 -m gensim.scripts.word2vec2tensor -i /content/word2vec.model -o /content/my_model
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/gensim/scripts/word2vec2tensor.py", line 94, in <module>
    word2vec2tensor(args.input, args.output, args.binary)
  File "/usr/local/lib/python3.7/dist-packages/gensim/scripts/word2vec2tensor.py", line 68, in word2vec2tensor
    model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_model_path, binary=binary)
  File "/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py", line 1438, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/usr/local/lib/python3.7/dist-packages/gensim/models/utils_any2vec.py", line 172, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "/usr/local/lib/python3.7/dist-packages/gensim/utils.py", line 355, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Please note that I first checked this exact same question from 2018, but the accepted answer no longer works because both gensim and tensorflow have been updated, so I considered it worth asking again in Q4 2021.
Solution
Saving the model in the original C word2vec implementation format resolves the issue:
model.wv.save_word2vec_format("/content/word2vec.model")
Full example:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, size=100, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("/content/word2vec.model")
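For reference, the text variant of the word2vec format is simple: a header line with the vocabulary size and the vector dimensionality, then one line per word followed by its components. Below is a minimal standard-library sketch of that layout. The helper names and the toy vectors are made up for illustration and are not part of gensim.

```python
import io


def write_word2vec_text(vectors, fout):
    """Write {word: [floats]} in word2vec text layout (hypothetical helper)."""
    dim = len(next(iter(vectors.values())))
    # Header line: "<vocab_size> <vector_size>"
    fout.write(f"{len(vectors)} {dim}\n")
    for word, vec in vectors.items():
        # One line per word: the word, then its space-separated components
        fout.write(word + " " + " ".join(repr(float(x)) for x in vec) + "\n")


def read_word2vec_text(fin):
    """Read the layout written above back into a dict (hypothetical helper)."""
    vocab_size, dim = map(int, fin.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = fin.readline().rstrip("\n").split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        assert len(vectors[parts[0]]) == dim
    return vectors


# Toy data, not real embeddings
toy = {"human": [0.1, 0.2, 0.3], "computer": [0.4, 0.5, 0.6]}
buf = io.StringIO()
write_word2vec_text(toy, buf)
buf.seek(0)
roundtrip = read_word2vec_text(buf)
```

The first line of the file is what the loader reads as the header, which is why the script needs a file written in this format.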
There are two formats for storing word2vec models in gensim: the keyed vector format from the original word2vec implementation, and a format that additionally stores hidden weights, vocabulary frequencies, and more. Examples and details can be found in the documentation. The script word2vec2tensor.py expects the original format and loads the model with load_word2vec_format, as the traceback in the question shows.
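The 0x80 byte in the error message fits this explanation: model.save() pickles the model, and a pickle stream written with protocol 2 or later starts with the byte 0x80, which is not a valid UTF-8 start byte, so decoding the header line fails. Here is a rough sketch, using only the standard library, for checking which kind of file you have before running the script. looks_like_word2vec_text is a hypothetical helper, not a gensim function, and it only recognizes the text variant of the format; the pickled dict is a stand-in for a saved model.

```python
import os
import pickle
import tempfile


def looks_like_word2vec_text(path):
    """Return True if the first line parses as '<vocab_size> <vector_size>'."""
    with open(path, "rb") as fin:
        header = fin.readline()
    try:
        fields = header.decode("utf-8").split()
    except UnicodeDecodeError:
        # Same failure mode as in the traceback: the header is not UTF-8 text
        return False
    return len(fields) == 2 and all(f.isdigit() for f in fields)


tmpdir = tempfile.mkdtemp()

# Stand-in for a model.save() output: any object pickled with protocol >= 2
pickled_path = os.path.join(tmpdir, "pickled.model")
with open(pickled_path, "wb") as fout:
    pickle.dump({"stand-in": "for a saved model"}, fout, protocol=2)

# Stand-in for a save_word2vec_format() output in text mode
text_path = os.path.join(tmpdir, "vectors.txt")
with open(text_path, "w", encoding="utf-8") as fout:
    fout.write("2 3\nhuman 0.1 0.2 0.3\ncomputer 0.4 0.5 0.6\n")

with open(pickled_path, "rb") as fin:
    first_byte = fin.read(1)  # the pickle protocol marker

pickled_ok = looks_like_word2vec_text(pickled_path)
text_ok = looks_like_word2vec_text(text_path)
```

If the check fails on your file, it was most likely written with model.save() and needs to be re-exported with model.wv.save_word2vec_format() first.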
Answered By - user1635327