Issue
I have to check strings in Japanese that are encoded in double-byte characters (naturally the files aren't in Unicode and I have to keep them in Shift-JIS). Many of these strings contain digits that are also double-byte characters (１２３４５６７８９) instead of standard single-byte digits (0-9). As such, the usual methods of searching for digits won't work (using [0-9] in a regex, or \d, for example).
The only way I've found to make it work is to create a tuple and iterate over the tuple in a string to look for a match, but is there a more effective way of doing this?
This is an example of the output I get when searching for double byte numbers:
>>> s = "２34" # "２" is a double-byte integer
>>> if u"2" in s:
	print "y"
>>> if u"２" in s:
	print "y"
y
>>> print s[0]
>>> print s[:2]
２
>>> print s[:3]
２3
Any advice would be greatly appreciated!
Solution
First of all, the comments are right: for the sake of your sanity, you should only ever work with unicode inside your Python code, decoding Shift-JIS input as it comes in and encoding back to Shift-JIS if that's what you need to output:
text = incoming_bytes.decode("shift_jis")
# ... do stuff ...
outgoing_bytes = text.encode("shift_jis")
See: Convert text at the border.
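For anyone on Python 3, where str is already unicode, the same round trip can be sketched roughly as below. The sample string (a full-width "２" followed by ASCII "34") and the names sample, byte_len and char_len are assumptions made for illustration; real code would read incoming_bytes from a file rather than build it.

```python
# Sketch for Python 3. The sample is an assumption: U+FF12 (full-width 2)
# followed by the ASCII digits 3 and 4.
sample = "\uff12" + "34"

# Stand-in for bytes read from a Shift-JIS file.
incoming_bytes = sample.encode("shift_jis")

# Decode at the input border; from here on we work with characters, not bytes.
text = incoming_bytes.decode("shift_jis")

byte_len = len(incoming_bytes)  # the full-width digit takes two bytes
char_len = len(text)            # but counts as one character once decoded
first_char = text[0]            # indexing no longer splits the double-byte digit

# Encode at the output border.
outgoing_bytes = text.encode("shift_jis")

print(byte_len, char_len, first_char == "\uff12", outgoing_bytes == incoming_bytes)
# prints: 4 3 True True
```

Once the text is decoded, slicing and the in operator work per character, which is what went wrong with s[0] and s[:2] in the question.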
Now that you're doing it right re: unicode and encoded bytestrings, it's straightforward to get either "any digit" or "any double width digit" with a regex:
>>> import re
>>> s = u"２34"
>>> digit = re.compile(r"\d", re.U)
>>> for d in re.findall(digit, s):
...     print d,
...
２ 3 4
>>> wdigit = re.compile(u"[０-９]+")
>>> for wd in re.findall(wdigit, s):
...     print wd,
...
２
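In case the re.U flag is unfamiliar to you: it makes \d, \w and friends follow the Unicode character properties database, and it's documented in the Python re module docs.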
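If you're on Python 3, a rough equivalent of the two patterns could look like the sketch below. It assumes the text has already been decoded from Shift-JIS; the names any_digit, wide_digit and wide_positions are made up for the example, and the full-width range is written with escapes (U+FF10 to U+FF19) so the snippet doesn't depend on the source file's encoding.

```python
import re

# Assumed sample, already decoded: full-width 2, then ASCII 3 and 4.
text = "\uff12" + "34"

# For str patterns in Python 3, \d is Unicode-aware by default,
# so it matches both ASCII and full-width digits.
any_digit = re.compile(r"\d")

# Explicit range covering only the full-width digits U+FF10 through U+FF19.
wide_digit = re.compile("[\uff10-\uff19]")

all_digits = any_digit.findall(text)
wide_only = wide_digit.findall(text)

# Character offsets of the full-width digits, handy for reporting where they occur.
wide_positions = [m.start() for m in wide_digit.finditer(text)]

print(len(all_digits), len(wide_only), wide_positions)
# prints: 3 1 [0]
```

The explicit range is the one to use when you specifically need to tell full-width digits apart from ordinary ones; \d alone can't make that distinction.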
Answered By - Zero Piraeus