Issue
I need to extract specific names in Arabic/Persian (something like proper nouns in English) using Python's re library.
Example (the word "شرکت" means "company", and we want to extract the company's name):
input: شرکت تست گستران خلیج فارس
output: تست گستران خلیج فارس
I've seen [this answer], and replacing "university" with "شرکت" in that example would be fine, but I don't understand how to match keywords written in Arabic script with a regex when a call like this one does not work:
re.match("شرکت", "\u0634\u0631\u06A9\u062A") # returns None
Solution
Python 2 does not treat plain string literals as Unicode by default (for example, when you paste non-ASCII letters or write a \u escape in the code). You have to be explicit about it:
re.match(u"شرکت", u"\u0634\u0631\u06A9\u062A")
Otherwise, the Arabic literal is stored as its encoded bytes, which are different from the Unicode code points, and the string on the right contains literal backslashes, because Python 2 does not recognize \u as a valid escape in a plain (byte) string.
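The difference between code points and encoded bytes can be seen directly. The following is only a minimal sketch, assuming the source file is saved as UTF-8; the variable names are illustrative and not part of the original answer. It uses u"" prefixes so that it runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# The keyword written with escapes, so the code points are explicit.
keyword = u"\u0634\u0631\u06A9\u062A"

# The same word pasted as text (assumed to use the same four code points).
pasted = u"شرکت"

# The UTF-8 bytes that a plain Python 2 literal would hold instead.
keyword_bytes = keyword.encode("utf-8")

print(len(keyword))        # 4 code points
print(len(keyword_bytes))  # 8 bytes, two per letter in UTF-8

# With both sides as Unicode strings, the pattern can match the text.
# This is True when the pasted word uses exactly the code points above.
print(re.match(pasted, keyword) is not None)
```

Comparing the two lengths shows why a byte string and a Unicode string holding "the same" word never match each other in Python 2.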
Another option is to import from the future. In Python 3 every string literal is parsed as Unicode, which makes the u"..." prefix somewhat obsolete:
from __future__ import unicode_literals
With this import at the top of the module, string literals are parsed as Unicode without the u"" prefix.
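Applied to the original question, the extraction itself might look like the sketch below. The helper name extract_company_name and the pattern are assumptions made for illustration, not taken from the linked answer: the pattern simply captures everything that follows the keyword and some whitespace on the same line.

```python
# -*- coding: utf-8 -*-
import re

# The keyword "company", written with escapes to keep the code points explicit.
KEYWORD = u"\u0634\u0631\u06A9\u062A"

# Hypothetical pattern: the keyword, whitespace, then the rest of the line.
# re.UNICODE makes \s Unicode-aware on Python 2; it is the default on Python 3.
PATTERN = re.compile(KEYWORD + u"\\s+(.+)", re.UNICODE)


def extract_company_name(text):
    """Return the text following the keyword, or None if it is absent."""
    match = PATTERN.search(text)
    return match.group(1).strip() if match else None


# Build the sample from KEYWORD so that both sides share the same code points.
sample = KEYWORD + u" تست گستران خلیج فارس"
print(extract_company_name(sample))
print(extract_company_name(u"no keyword here"))  # None
```

A real extractor would likely need a stricter pattern, for example one that stops at punctuation, but that depends on the input text.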
Answered By - kabanus