Issue
I am using Python 3, and I know about hex, int, chr, ord, the '\uxxxx' escape and the '\U00xxxxxx' escape, and that Unicode has 1,114,112 codepoints (U+0000 through U+10FFFF)...
How can I check whether a Unicode codepoint is valid? That is, whether it is unambiguously mapped to an authoritatively defined character.
For example, codepoint 720 is valid; it is 0x2d0 in hex, and U+02D0 points to ː:
In [135]: hex(720)
Out[135]: '0x2d0'
In [136]: '\u02d0'
Out[136]: 'ː'
And 888 is not valid:
In [137]: hex(888)
Out[137]: '0x378'
In [138]: '\u0378'
Out[138]: '\u0378'
And 127744 is valid:
In [139]: chr(127744)
Out[139]: '🌀'
And 0xe0000 is invalid:
In [140]: '\U000e0000'
Out[140]: '\U000e0000'
I have come up with a rather hacky solution. If a codepoint is valid, converting it to a character gives either the decoded character or an '\xhh' escape sequence; otherwise the result keeps the undecoded escape sequence, exactly as written. So I can take the return value of chr and check whether it starts with '\u' or '\U'...
Now for the hacky part: chr doesn't decode invalid codepoints, but it doesn't raise an exception either, and the escape sequences have a length of 1 because they are treated as a single character. So I have to repr the return value and check that instead...
I have used this method to identify all invalid codepoints:
In [130]: invalid = []
In [131]: for i in range(1114112):
...: if any(f'{chr(i)!r}'.startswith(j) for j in ("'\\U", "'\\u")):
...: invalid.append(i)
In [132]: from pathlib import Path
In [133]: invalid = [(hex(i).removeprefix('0x'), i) for i in invalid]
In [134]: Path('D:/invalid_unicode.txt').write_text(',\n'.join(map(repr, invalid)))
Out[134]: 18574537
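The same repr-based check can be written more compactly as a single comprehension (same heuristic as the loop above, just without writing to a file):

```python
# A codepoint is flagged when repr(chr(i)) keeps the '\u'/'\U' escape
# instead of showing the character itself or an '\xhh' escape.
invalid = [i for i in range(0x110000)
           if repr(chr(i)).startswith(("'\\u", "'\\U"))]
print(len(invalid))
```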
Can anyone offer a better solution?
Solution
I believe the most straightforward approach is to use unicodedata.category().
The examples from the OP are unassigned codepoints, which have a category of Cn ("Other, not assigned").
>>> import unicodedata as ud
>>> ud.category('\u02d0')
'Lm'
>>> ud.category('\u0378') # unassigned
'Cn'
>>> ud.category(chr(127744))
'So'
>>> ud.category('\U000e0000') # unassigned
'Cn'
It also works for the control characters in the ASCII range:
>>> ud.category('\x00')
'Cc'
Further categories for invalid codepoints (according to comments) are Cs ("Other, surrogate") and Co ("Other, private use"):
>>> ud.category('\ud800') # lower surrogate
'Cs'
>>> ud.category('\uf8ff') # private use
'Co'
So a function for codepoint validity (as per the OP's definition) could look like this:
def is_valid(char):
    return ud.category(char) not in ('Cn', 'Cs', 'Co')
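A quick sanity check against the examples from the question (the definition is repeated here so the snippet runs on its own):

```python
import unicodedata as ud

def is_valid(char):
    # Reject unassigned (Cn), surrogate (Cs) and private-use (Co) codepoints.
    return ud.category(char) not in ('Cn', 'Cs', 'Co')

print(is_valid('\u02d0'), is_valid(chr(127744)))  # True True
print(is_valid('\u0378'), is_valid('\ud800'))     # False False
```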
Important caveat: Python's unicodedata module embeds a certain version of Unicode, so the information is potentially out of date.
For example, in my installation of Python 3.8, the Unicode version is 12.1.0, so it doesn't know about codepoints assigned in later versions of Unicode:
>>> ud.unidata_version
'12.1.0'
>>> ud.category('\U0001fae0') # melting face emoji added in Unicode v14
'Cn'
If you need a more recent version of Unicode than the one of your Python version, you probably need to fetch an appropriate table directly from Unicode.
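As a sketch of what fetching the table yourself could look like: the UCD's UnicodeData.txt (from unicode.org) lists one semicolon-separated record per assigned codepoint, with large blocks encoded as paired '<..., First>' / '<..., Last>' range entries. A minimal parser (the sample records below mimic the real file's layout) might be:

```python
def parse_unicode_data(lines):
    """Build the set of assigned codepoints from UnicodeData.txt records.

    Range blocks (e.g. CJK ideographs) appear as paired '<..., First>'
    and '<..., Last>' entries and are expanded here.
    """
    assigned = set()
    range_start = None
    for line in lines:
        fields = line.split(';')
        if len(fields) < 2:
            continue
        cp = int(fields[0], 16)   # field 0: codepoint in hex
        name = fields[1]          # field 1: character name
        if name.endswith('First>'):
            range_start = cp
        elif name.endswith('Last>'):
            assigned.update(range(range_start, cp + 1))
            range_start = None
        else:
            assigned.add(cp)
    return assigned

# Tiny inline sample in the UnicodeData.txt format; the real file lives at
# https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
sample = [
    "02D0;MODIFIER LETTER TRIANGULAR COLON;Lm;0;L;;;;;N;;;;;",
    "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
    "4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;",
]
assigned = parse_unicode_data(sample)
print(0x2D0 in assigned, 0x378 in assigned, 0x3500 in assigned)
```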
Answered By - lenz