Issue
I am pretty new to Python. For this task, I am trying to import a text file, add <s> and </s> around each line, and remove punctuation from the text. I tried the method from "How to strip punctuation from a text file".
import string

def readFile():
    translate_table = dict((ord(char), None) for char in string.punctuation)
    with open('out_file.txt', 'w') as out_file:
        with open('moviereview.txt') as file:
            for line in file:
                line = ' '.join(line.split(' '))
                line = line.translate(translate_table)
                out_file.write("<s>" + line.rstrip('\n') + "</s>" + '\n')
    return out_file
However, I get an error saying:
TypeError: expected a string or other character buffer object
My thought is that after I split and join the line, I get a list of strings, so I cannot use str.translate() to process it. But it seems like everyone else does the same thing and it works, e.g. the example code starting at line 13 of https://appliedmachinelearning.blog/2017/04/30/language-identification-from-texts-using-bi-gram-model-pythonnltk/.
So I am really confused, can anyone help? Thanks!
Solution
On Python 2, only unicode types have a translate method that takes a dict. If you intend to work with arbitrary text, the simplest solution here is to just use the Python 3 version of open on Py2; it will seamlessly decode your inputs and produce unicode instead of str.
As of Python 2.6+, replacing the normal built-in open with the Python 3 version is simple. Just add:
from io import open
to the imports at the top of your file. You can also remove line = ' '.join(line.split(' ')); that's definitionally a no-op (it splits on single spaces to make a list, then rejoins on single spaces). You may also want to add:
from __future__ import unicode_literals
to the very top of your file (before all of your code); that will make all of your uses of plain quotes automatically unicode literals, not str literals (prefix actual binary data with b to make it a str literal on Py2, bytes literal on Py3).
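Putting those pieces together, here is a minimal sketch of how the fixed code might look. The names strip_punctuation and tag_lines, the file-path parameters, and the utf-8 encoding are assumptions for illustration and are not from the question. The sketch uses explicit u'' prefixes in place of the unicode_literals import so that it can be pasted anywhere in a file; with the __future__ import at the very top, the prefixes could be dropped.

```python
from io import open  # Python 3 style open(); on Py3 this is the built-in open
import string

# Dict-based table: maps each punctuation code point to None (i.e. delete it).
# This form works with unicode on Py2 and str on Py3.
TRANSLATE_TABLE = dict((ord(char), None) for char in string.punctuation)


def strip_punctuation(line):
    """Return line with all ASCII punctuation removed."""
    return line.translate(TRANSLATE_TABLE)


def tag_lines(in_path, out_path, encoding='utf-8'):
    """Copy in_path to out_path, stripping punctuation and wrapping
    each line in <s>...</s>. The encoding default is an assumption."""
    with open(in_path, encoding=encoding) as src:
        with open(out_path, 'w', encoding=encoding) as dst:
            for line in src:
                cleaned = strip_punctuation(line.rstrip(u'\n'))
                dst.write(u"<s>" + cleaned + u"</s>" + u"\n")
```

Calling tag_lines('moviereview.txt', 'out_file.txt') would then correspond to the readFile function in the question. The tags are added after translating, so their angle brackets are not stripped.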
The above solution is best if you can swing it, because it will make your code work correctly on both Python 2 and Python 3. If you can't do it for whatever reason, then you need to change your translate call to use the API Python 2's str.translate expects, which means removing the definition of translate_table entirely (it's not needed) and just doing:
line = line.translate(None, string.punctuation)
For Python 2's str.translate, the first argument is a one-to-one mapping table for all values from 0 to 255 inclusive (None if no mapping is needed), and the second argument is a string of characters to delete (which string.punctuation already provides).
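If the same file has to run on both interpreters without switching to io.open, one possible sketch is to pick the translate call by version. The Python 3 branch is an addition not covered above: it uses str.maketrans('', '', chars), whose third argument lists characters to delete. The function name strip_punctuation is hypothetical.

```python
import string
import sys

if sys.version_info[0] == 2:
    def strip_punctuation(line):
        # Py2 str.translate(table, deletechars): no mapping, delete punctuation
        return line.translate(None, string.punctuation)
else:
    # Py3: build a table once; the third argument lists characters to delete
    _DELETE_PUNCT = str.maketrans('', '', string.punctuation)

    def strip_punctuation(line):
        return line.translate(_DELETE_PUNCT)
```

Note that on Python 2 this branch only handles byte str values; unicode input would still need the dict-based table.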
Answered By - ShadowRanger