Issue
I am porting some Python 2 code that calls split() on strings, so I need to know its exact behavior. The documentation states that when you do not specify the sep argument, "runs of consecutive whitespace are regarded as a single separator".
Unfortunately, it does not specify which characters that would be. There are some obvious contenders (like space, tab, and newline), but Unicode contains plenty of other candidates.
Which characters are considered to be whitespace by split()?
Since the answer might be implementation-specific, I'm targeting CPython.
(Note: I researched the answer to this myself since I couldn't find it anywhere, so I'll be posting it here, hopefully for the benefit of others.)
Solution
Unfortunately, it depends on whether your string is an str or a unicode (at least, in CPython - I don't know whether this behavior is actually mandated by a specification anywhere).
If it is an str, the answer is straightforward:
0x09 Tab
0x0a Newline
0x0b Vertical Tab
0x0c Form Feed
0x0d Carriage Return
0x20 Space
Source: these are the characters with PY_CTF_SPACE in Python/pyctype.c, which are used by Py_ISSPACE, which is used by STRINGLIB_ISSPACE, which is used by split_whitespace.
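The list above can be checked empirically by probing every byte value and seeing whether split() treats it as a separator. The following is only a minimal sketch, not anything taken from the CPython source: the helper name byte_string_separators is made up for illustration, and the b"..." literals are used so that the probe exercises str on Python 2 and, assuming bytes.split() kept the same whitespace set, bytes on Python 3.

```python
def byte_string_separators():
    """Return the byte values that a no-argument split() treats as whitespace."""
    found = []
    for code in range(256):
        # bytes(bytearray([code])) builds a one-byte string on Python 2 and 3
        # (on Python 2, bytes is simply an alias for str).
        ch = bytes(bytearray([code]))
        # If ch is whitespace, the probe string splits into exactly two fields.
        if (b"a" + ch + b"b").split() == [b"a", b"b"]:
            found.append(code)
    return found


print([hex(code) for code in byte_string_separators()])
# expected: ['0x9', '0xa', '0xb', '0xc', '0xd', '0x20']
```

Probing through split() itself, rather than through isspace(), tests the behavior the question is actually about.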
If it is a unicode, there are 29 characters, which in addition to the above are:
U+001c through U+001f: File/Group/Record/Unit Separator
U+0085: Next Line
U+00a0: Non-Breaking Space
U+1680: Ogham Space Mark
U+2000 through U+200a: various fixed-size spaces (e.g. Em Space), but note that Zero-Width Space is not included
U+2028: Line Separator
U+2029: Paragraph Separator
U+202f: Narrow No-Break Space
U+205f: Medium Mathematical Space
U+3000: Ideographic Space
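The same kind of probe can be run over text strings to see which code points split() accepts as separators. This is a rough sketch under a few assumptions: the helper name text_separators and its limit parameter are hypothetical, only the Basic Multilingual Plane is scanned, and u"..." literals are used so that the code means unicode on Python 2 and str on Python 3. The exact result depends on the Unicode database bundled with the interpreter, so it may not match the list above on every version.

```python
try:
    unichr  # Python 2 has a separate function for text characters
except NameError:
    unichr = chr  # Python 3: chr() already returns text


def text_separators(limit=0x10000):
    """Return code points below `limit` that a no-argument split() treats as whitespace."""
    found = []
    for code in range(limit):
        ch = unichr(code)
        # If ch is whitespace, the probe string splits into exactly two fields.
        if (u"a" + ch + u"b").split() == [u"a", u"b"]:
            found.append(code)
    return found


separators = text_separators()
print(len(separators))
print(0x200B in separators)  # Zero-Width Space: is it treated as a separator?
```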
Note that the first four (U+001c through U+001f) are also valid ASCII characters, which means that an ASCII-only string might split differently depending on whether it is an str or a unicode!
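To make the difference concrete, here is a small illustration with the same ASCII-only content held once as a byte string and once as a text string. The sample values are invented for this example; on Python 3 the comparison is between bytes and str rather than str and unicode, on the assumption that the two whitespace sets carried over unchanged.

```python
# The same ASCII-only content, once as a byte string and once as a text string.
sample_bytes = b"alpha\x1cbeta gamma"
sample_text = u"alpha\x1cbeta gamma"

# Byte string: only the space separates, so 0x1c stays inside the first field.
print(sample_bytes.split() == [b"alpha\x1cbeta", b"gamma"])

# Text string: 0x1c (File Separator) also counts as whitespace, giving three fields.
print(sample_text.split() == [u"alpha", u"beta", u"gamma"])
```

When porting, this matters wherever a Python 2 str becomes a Python 3 str: the call site looks identical, but the set of separators is the larger one.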
Source: these are the characters listed in _PyUnicode_IsWhitespace, which is used by Py_UNICODE_ISSPACE, which is used by STRINGLIB_ISSPACE (it looks like they use the same function implementations for both str and unicode, but compile them separately for each type, with certain macros implemented differently). The docstring describes this set of characters as follows:
Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs'
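The rule in that docstring can be restated with the standard unicodedata module, which exposes both the bidirectional type and the general category of a character. This is a sketch of the rule as quoted, not the code CPython actually runs; the function name matches_docstring_rule is made up for illustration.

```python
import unicodedata


def matches_docstring_rule(ch):
    """Check the quoted rule: bidirectional type WS, B or S, or category Zs."""
    return (unicodedata.bidirectional(ch) in ("WS", "B", "S")
            or unicodedata.category(ch) == "Zs")


# Compare the rule with isspace() for a few characters from the lists above,
# plus Zero-Width Space, which should fail both checks.
for ch in (u"\x1c", u"\u00a0", u"\u2028", u"\u200b"):
    print("U+%04X rule=%s isspace=%s"
          % (ord(ch), matches_docstring_rule(ch), ch.isspace()))
```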
Answered By - Aasmund Eldhuset