Issue
I'm trying to read data from a text file sent to my API built using fastapi. The file's template is always the same and consists of three columns of numbers:
I tried solving the problem with the following code using numpy:
from fastapi import FastAPI, UploadFile
import numpy as np

app = FastAPI()

@app.post("/uploadfile/")
async def create_upload_file(file: UploadFile):
    file_data = await file.read()
    print(len(file_data))
    print(file_data)
    deserialized_bytes = np.frombuffer(file_data, float)
    print(deserialized_bytes)
I got the following error:
ValueError: buffer size must be a multiple of element size
When printing file_data I got the following:
b'0.01\t1.008298628\t-0.007582043\n0.012589254\t1.007411741\t-0.008969602\n0.015848932\t1.00632129\t-0.010491102\n0.019952623\t1.005019534\t-0.012152029\n0.025118864\t1.00349648\t-0.013967763\n0.031622777\t1.001734774\t-0.015961535\n0.039810717\t0.999706432\t-0.018160753\n0.050118723\t0.997371077\t-0.020592808\n0.063095734\t0.994675168\t-0.023280535\n0.079432823\t0.991552108\t-0.026237042\n0.1\t0.987923699\t-0.029459556\n0.125892541\t0.983703902\t-0.032922359\n0.158489319\t0.978806153\t-0.036569606\n0.199526231\t0.973155266\t-0.040309894\n0.251188643\t0.96670398\t-0.044015666\n0.316227766\t0.959452052\t-0.047531126\n0.398107171\t0.95146288\t-0.050691362\n0.501187234\t0.942870343\t-0.05335184\n0.630957344\t0.933869168\t-0.055422075\n0.794328235\t0.924687135\t-0.056892752\n1\t0.915545303\t-0.057845825\n1.258925412\t0.906618399\t-0.058443323\n1.584893192\t0.898007292\t-0.05889944\n1.995262315\t0.88972925\t-0.05944639\n2.511886432\t0.88172395\t-0.060304263\n3.16227766\t0.873868464\t-0.06166047\n3.981071706\t0.865994233\t-0.063658897\n5.011872336\t0.857901613\t-0.066395558\n6.309573445\t0.849370763\t-0.069916657\n7.943282347\t0.840170081\t-0.074215864\n10\t0.830064778\t-0.079229225\n12.58925412\t0.818828635\t-0.084828018\n15.84893192\t0.80626166\t-0.090811856\n19.95262315\t0.792215036\t-0.09690624\n25.11886432\t0.776622119\t-0.102770212\n31.6227766\t0.759530416\t-0.10801952\n39.81071706\t0.741125379\t-0.112267806\n50.11872336\t0.7217348\t-0.115182125\n63.09573445\t0.701805205\t-0.116541504\n79.43282347\t0.681849867\t-0.116282042\n100\t0.662379027\t-0.114513698\n125.8925412\t0.643830642\t-0.111502997\n158.4893192\t0.626519742\t-0.107628211\n199.5262315\t0.61061625\t-0.103322294\n251.1886432\t0.596150187\t-0.099019949\n316.227766\t0.583035191\t-0.095119782\n398.1071706\t0.571098913\t-0.091964781\n501.1872336\t0.560111017\t-0.089838356\n630.9573445\t0.549803457\t-0.08897027\n794.3282347\t0.539881289\t-0.089546481\n1000\t0.530024653\t-0.091717904\n1258.925412\t0.519883761\t-
0.095604233\n1584.893192\t0.509069473\t-0.101289726\n1995.262315\t0.497142868\t-0.10880829\n2511.886432\t0.48360875\t-0.118115658\n3162.27766\t0.4679204\t-0.12904786\n3981.071706\t0.449505702\t-0.141268753\n5011.872336\t0.427826294\t-0.154216509\n6309.573445\t0.402477916\t-0.167069625\n7943.282347\t0.373326923\t-0.17876369\n10000\t0.340652906\t-0.188091157\n12589.25412\t0.305239137\t-0.193894836\n15848.93192\t0.268343956\t-0.195318306\n19952.62315\t0.231521442\t-0.192025496\n25118.86432\t0.196333603\t-0.184290646\n31622.7766\t0.164060493\t-0.17291366\n39810.71706\t0.135516045\t-0.15900294\n50118.72336\t0.111014755\t-0.143723646\n63095.73445\t0.090459787\t-0.128100619\n79432.82347\t0.073486071\t-0.11291483\n100000\t0.059599333\t-0.098684088\n125892.5412\t0.04827997\t-0.085696257\n158489.3192\t0.03904605\t-0.074063717\n199526.2315\t0.031483308\t-0.063778419\n251188.6432\t0.025253597\t-0.054757832\n316227.766\t0.020091636\t-0.046879426\n398107.1706\t0.01579663\t-0.040005154\n501187.2336\t0.012222187\t-0.033998672\n630957.3445\t0.009265484\t-0.028737822\n794328.2347\t0.006855045\t-0.024123472\n1000000\t0.00493654\t-0.020083713\n'
and the length of file_data is 2897, which is not divisible by 8 as it should be.
Thinking that the problem came from the tabs and newlines in the file, I tried removing the newlines and replacing the tabs with spaces, but I ended up getting different numbers than the ones in the file.
I don't quite understand how to convert file_data from bytes to a numpy array using the numpy library, rather than writing an entire parsing function of my own, which would be possible but much more complicated.
What would be the right way to read the data into an array? If you can help me find a quick way to put each column into a separate array automatically, with no additional loop, that would be great.
Solution
Why the error
frombuffer reads raw, "binary" data. So if you are trying to read float64, for example, it just reads packets of 64 bits (the internal representation of a float64) and fills a numpy array of float64 with them.
For example
np.frombuffer(b'\x00\x01\x02\x03', dtype=np.uint8)
# → array([0, 1, 2, 3], dtype=uint8)
# because each byte is the representation of one uint8 integer
np.frombuffer(b'\x00\x01\x02\x03', dtype=np.uint16)
# → array([256, 770], dtype=uint16) on my machine
# because each pair of bytes makes the 16 bits of a uint16 integer.
# The first pair, 0 1, is 00000000 00000001 in binary; on a little-endian
# machine that is read as 00000001 00000000 = 256 in decimal.
# (On a big-endian machine it would have been 1.)
# Then 2=00000010, 3=00000011. So on my little-endian machine that is
# 00000011 00000010 = 512+256+2 = 770.
# (On a big-endian machine it would have been 00000010 00000011 = 512+2+1 = 515.)
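As the comments note, the uint16 result depends on the machine's byte order. One way to make it deterministic is to pin the endianness in the dtype string; a small sketch using numpy's explicit-endianness codes ('<u2' for little-endian uint16, '>u2' for big-endian):

```python
import numpy as np

# Pin the byte order explicitly so the result no longer depends
# on the machine running the code.
data = b'\x00\x01\x02\x03'
little = np.frombuffer(data, dtype='<u2')  # little-endian uint16
big = np.frombuffer(data, dtype='>u2')     # big-endian uint16
print(little)  # [256 770]
print(big)     # [  1 515]
```

With an explicit byte order, both values match the hand computations above on any machine.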
Etc. I could continue with examples of float32 and so on, but that would be longer to detail, and useless, since understanding frombuffer is not really what you want. The point is, it is not what you think it is. In practice, frombuffer is for reading a numpy array back from bytes that were produced by a previous .tobytes() (np.array([256, 770], dtype=np.uint16).tobytes() == b'\x00\x01\x02\x03' on a little-endian machine), or by some equivalent code in another library or language (for example, if a C program fwrites the contents of a float * array, you could get the np.float32 values back into a numpy array with frombuffer).
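The round trip described above can be sketched in a few lines (the dtype must match on both sides, since the bytes themselves carry no type information):

```python
import numpy as np

# tobytes() serializes the raw memory of the array;
# frombuffer() reinterprets those bytes as an array again.
original = np.array([256, 770], dtype=np.uint16)
raw = original.tobytes()        # 4 bytes: 2 elements x 2 bytes each
restored = np.frombuffer(raw, dtype=np.uint16)
print(restored)  # [256 770]
```

This is the scenario frombuffer is designed for: raw machine representations, not human-readable text.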
But of course, you can use frombuffer only if you have a consistent number of bytes. So, for a uint16, the number of bits has to be a multiple of 16, that is, the number of bytes has to be a multiple of 2. And in your case, for a float64, the number of bits has to be a multiple of 64, so the number of bytes must be a multiple of 8. Which is not the case. Which is lucky for you: if your data had happened to contain a multiple of 8 bytes (a 12.5% probability), it would have "worked" without error, and you would have had a hard time understanding why, with no error message at all, you end up with a numpy array containing numbers that are not the right ones. (Just add 7 spaces at the end of your file to see for yourself...)
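That silent-failure scenario is easy to reproduce. Eight ASCII characters happen to be exactly one float64 worth of bytes, so frombuffer "succeeds" on them; a small sketch (the literal b'0.012345' is an arbitrary illustrative value, not from the uploaded file):

```python
import numpy as np

# 8 bytes of ASCII text, not a serialized float64.
text = b'0.012345'

# No error: the length is a multiple of 8, so frombuffer happily
# reinterprets the character codes as the bits of one float64.
arr = np.frombuffer(text, dtype=np.float64)
print(arr)  # one float, but nothing like 0.012345
```

The result is whatever number those character codes spell out as IEEE-754 bits, which is why a size that merely happens to divide evenly is worse than an error.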
What to do then
The bytes you are trying to parse are obviously in ASCII format: decimal representations of real numbers, separated by tabulations (\t) and line feeds (\n). This is sometimes called the tsv (tab-separated values) format.
So what you need is a function that reads and parses the tsv format. Those are not "ready to use" bytes representing numbers; it is a human-readable format.
numpy.loadtxt does just that.
Its normal usage is to open files, but it can also parse data directly, as long as you feed it an array (or generator) of lines.
So, your file_data is a bytestring containing lines (each of them containing numbers separated by tabs) separated by line feeds. Just split it on the b'\n' separator to get an array of lines, and give that array to np.loadtxt.
tl;dr
deserialized_bytes = np.loadtxt(file_data.split(b'\n'))
is what you want
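As for getting each column into its own array with no extra loop: loadtxt's unpack=True option returns the transposed result, so the columns can be tuple-unpacked directly. A sketch using a two-line inline sample of the same tab-separated layout (the names x, y, z are arbitrary):

```python
import numpy as np

# Same tab-separated, newline-terminated layout as the uploaded file,
# truncated to two rows for the example.
file_data = (b'0.01\t1.008298628\t-0.007582043\n'
             b'0.012589254\t1.007411741\t-0.008969602\n')

# unpack=True transposes the parsed array, so each column
# lands in its own 1-D array without any manual loop.
x, y, z = np.loadtxt(file_data.split(b'\n'), unpack=True)
print(x)  # [0.01       0.01258925]
```

Inside the endpoint, the same line with the real file_data gives you the three columns directly.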
Answered By - chrslg