Issue
According to the documentation:
"BufferedProtocol implementations allow explicit manual allocation and control of the receive buffer. Event loops can then use the buffer provided by the protocol to avoid unnecessary data copies. This can result in noticeable performance improvement for protocols that receive big amounts of data. Sophisticated protocol implementations can significantly reduce the number of buffer allocations."
Since my program reads large amounts of data from a socket connection and knows how much data it should receive, I was using a custom BufferedProtocol
in the hope of avoiding unnecessary copies of the data. But an exception in my get_buffer()
method led me to discover the ugly truth through the traceback:
The buffer is not actually being used! Instead the data is still copied, and is in fact copied an extra time into the buffer I provided!
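For context, the kind of protocol I have in mind looks roughly like this (a simplified, illustrative sketch rather than my actual code; the class name and the way the expected size is obtained are made up):

import asyncio

class KnownSizeProtocol(asyncio.BufferedProtocol):
    """Preallocate a single buffer of known size and receive into it."""

    def __init__(self, expected_size: int):
        self._buf = bytearray(expected_size)          # allocated once, up front
        self._view = memoryview(self._buf)
        self._pos = 0
        self.done = asyncio.get_running_loop().create_future()

    def get_buffer(self, sizehint: int) -> memoryview:
        # Hand the event loop the unfilled tail of the preallocated buffer.
        return self._view[self._pos:]

    def buffer_updated(self, nbytes: int) -> None:
        self._pos += nbytes
        if self._pos >= len(self._buf) and not self.done.done():
            self.done.set_result(self._buf)

    def connection_lost(self, exc) -> None:
        if not self.done.done():
            self.done.set_exception(exc or ConnectionResetError("closed early"))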
The get_buffer()
and buffer_updated()
methods are being called by protocols._feed_data_to_buffered_proto(),
which receives a bytes
object and then feeds it into the buffer I provided. So instead of reducing the number of times the data is copied, this copies the data one extra time!
After digging further into the guts of asyncio
I found that the actual reading of data from the raw socket is indeed done with socket.recv_into(),
but it is not received directly into the buffer I provided; instead it is later COPIED into the buffer I provided.
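To be clear about what I expected: with a plain blocking socket the zero-copy pattern looks roughly like this (illustrative sketch only), and I assumed the event loop would do the asynchronous equivalent with the buffer returned by get_buffer():

import socket

def recv_exactly(sock: socket.socket, nbytes: int) -> bytearray:
    """Read exactly nbytes from sock, receiving directly into one buffer."""
    buf = bytearray(nbytes)
    view = memoryview(buf)
    received = 0
    while received < nbytes:
        n = sock.recv_into(view[received:])   # kernel writes straight into buf
        if n == 0:
            raise ConnectionError("connection closed before all data arrived")
        received += n
    return buf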
Is this correct? And if so, WTF?! Or am I completely missing something obvious?
If my understanding is correct, how do I fix it? What would it take to make asyncio avoid the extra copying by using my buffer directly in socket.recv_into()? Can this be done without completely monkey-patching asyncio or rewriting the event loop?
...
Regarding calls for a "minimal reproducible example":
Raise an exception inside BufferedProtocol.get_buffer()
on a connection created with loop.create_connection(),
and the traceback will lead you to the following function inside asyncio.protocols
(a hedged sketch of such a reproduction follows right after the function):
def _feed_data_to_buffered_proto(proto, data):
    data_len = len(data)
    while data_len:
        buf = proto.get_buffer(data_len)
        buf_len = len(buf)
        if not buf_len:
            raise RuntimeError('get_buffer() returned an empty buffer')
        if buf_len >= data_len:
            buf[:data_len] = data
            proto.buffer_updated(data_len)
            return
        else:
            buf[:buf_len] = data[:buf_len]
            proto.buffer_updated(buf_len)
            data = data[buf_len:]
            data_len = len(data)
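For reference, a reproduction along the lines described above might look like this (the endpoint and request are placeholders, so treat it as a sketch; the exception should surface via the loop's exception handler with _feed_data_to_buffered_proto() in the traceback when the ProactorEventLoop is used):

import asyncio

class ExplodingProtocol(asyncio.BufferedProtocol):
    """Raise inside get_buffer() so the traceback shows who calls it."""

    def get_buffer(self, sizehint: int) -> memoryview:
        raise RuntimeError("boom - look at this traceback")

    def buffer_updated(self, nbytes: int) -> None:
        pass

async def main():
    loop = asyncio.get_running_loop()
    # Placeholder endpoint: any server that sends some bytes back will do.
    transport, _ = await loop.create_connection(ExplodingProtocol, "example.com", 80)
    transport.write(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    await asyncio.sleep(2)   # give the response time to arrive and trigger get_buffer()
    transport.close()

asyncio.run(main())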
Not only does _feed_data_to_buffered_proto() fail to avoid unnecessary copying; in the worst case it drastically increases the number of times the data is copied!
And no, data
is not a memoryview
as you might have hoped, it is a bytes
object!
And in case you don't believe me that this creates many more unnecessary copy operations, please see this SO question regarding slicing bytes
objects.
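If you want to check this yourself, a quick (and rough) way is to compare slicing a bytes object with slicing a memoryview; the exact numbers will vary, but the difference is hard to miss:

import sys

data = bytes(100 * 1024 * 1024)        # 100 MiB of zeros

tail = data[1:]                        # new bytes object: the data is copied
view = memoryview(data)[1:]            # new view over the same memory: no copy

print(sys.getsizeof(tail))             # ~100 MiB - a full copy was made
print(sys.getsizeof(view))             # a couple of hundred bytes - just the view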
Solution
After spending a lot of time reading through the asyncio code and related issues I think I can at least partially answer the question myself, although I would still appreciate anyone smarter than me chiming in and elaborating or correcting me if I'm wrong:
The issue: There is an issue with the WSARecv()
Windows function used by ProactorEventLoop
which can cause data loss, and for reasons I don't fully understand this has so far prevented the buffer returned by get_buffer()
from being used directly when the ProactorEventLoop
is used; hence the helper function _feed_data_to_buffered_proto()
was created.
Workaround: If the event loop policy is instead set to WindowsSelectorEventLoopPolicy,
then the BufferedProtocol
behaves as expected, i.e. the buffer returned by get_buffer()
is used directly in recv_into().
This however comes with some drawbacks (see Platform Support): the SelectorEventLoop
on Windows doesn't work with pipes or subprocesses!
In my case, since I'm not using pipes or subprocesses, I solved it by adding the line: asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
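In other words, the policy just has to be set before the event loop is created; roughly like this (main() stands in for whatever coroutine sets up the connection):

import asyncio
import sys

async def main():
    ...  # loop.create_connection() with the BufferedProtocol goes here

if __name__ == "__main__":
    if sys.platform == "win32":
        # Must happen before asyncio.run() creates the event loop.
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    asyncio.run(main())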
So it seems that my issue is mainly with Windows and not with Python, what a surprise.
I still think it would be nice if this were mentioned in the docs, so that I didn't have to find out the hard way.
Answered By - IHaveAName