Issue
I'm trying to follow this page to create a wiki corpus, but I'm using a Jupyter notebook: https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html
This is my code:
import sys
from gensim.test.utils import datapath
from gensim.corpora import WikiCorpus

path_to_wiki_dump = datapath("enwiki-latest-pages-articles.xml.bz2")
wiki = WikiCorpus(path_to_wiki_dump)
output = open('wiki_en.txt', 'w', encoding='utf-8')
i = 0
for text in wiki.get_texts():
    output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
    i = i + 1
    if (i % 10000 == 0):
        print('Processed ' + str(i) + ' articles')
output.close()
print('Processing complete!')
The error I got was:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/anaconda3/lib/python3.8/site-packages/gensim/test/test_data/enwiki-latest-pages-articles.xml.bz2'
All the files are in one place, so I'm not sure what's wrong.
Solution
Did you ever download the file enwiki-latest-pages-articles.xml.bz2 somehow, somewhere?
Did you specifically place it at the path /opt/anaconda3/lib/python3.8/site-packages/gensim/test/test_data/enwiki-latest-pages-articles.xml.bz2?
If not, the datapath() function you're using won't construct the right path. (That particular function is meant to find a directory of test data bundled with Gensim, and shouldn't really be used to construct paths to your own downloaded/created files!)
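To see why the traceback points inside site-packages, here is a rough sketch of the behavior described above. The helper name sketch_datapath is hypothetical and is not Gensim's own code; it only mimics the path-joining implied by the error message, using the install location copied from that message.

```python
import posixpath


def sketch_datapath(gensim_dir, fname):
    """Illustrative stand-in (not Gensim's code): point at the bundled test-data folder."""
    # The filename is simply appended to the package's own test_data directory,
    # regardless of where the file actually lives on disk.
    return posixpath.join(gensim_dir, "test", "test_data", fname)


# Install location taken from the traceback in the question.
gensim_dir = "/opt/anaconda3/lib/python3.8/site-packages/gensim"
built_path = sketch_datapath(gensim_dir, "enwiki-latest-pages-articles.xml.bz2")
print(built_path)
```

The result never depends on the notebook's working directory, which is why keeping the dump next to the notebook does not help.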
Instead of using that function, you should just specify the actual path, local to the Jupyter notebook server, where you put the file, as a string argument to WikiCorpus.
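A minimal sketch of that approach, assuming the dump has already been downloaded somewhere the notebook server can read. The helper names resolve_dump_path and write_wiki_text are made up for illustration; only WikiCorpus and get_texts() come from the question's code.

```python
from pathlib import Path


def resolve_dump_path(path_str):
    """Return an absolute Path to the dump, or raise a clear error if it is missing."""
    path = Path(path_str).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Wikipedia dump not found at: {path}")
    return path


def write_wiki_text(dump_path, out_path, report_every=10000):
    """Stream articles from the dump into a text file, one article per line."""
    # Imported here so the path check above works even without Gensim installed.
    from gensim.corpora import WikiCorpus

    # Pass the real location of the file as a plain string, not datapath(...).
    wiki = WikiCorpus(str(resolve_dump_path(dump_path)))
    count = 0
    with open(out_path, "w", encoding="utf-8") as output:
        for tokens in wiki.get_texts():
            output.write(" ".join(tokens) + "\n")
            count += 1
            if count % report_every == 0:
                print(f"Processed {count} articles")
    return count
```

If the dump sits in the notebook's working directory, a call such as write_wiki_text("enwiki-latest-pages-articles.xml.bz2", "wiki_en.txt") should be enough; otherwise pass the full path to wherever you saved it. Checking the path first gives an immediate, readable error instead of one raised deep inside Gensim.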
Answered By - gojomo