Issue
I'm trying to read data from a text file sent to my API built using fastapi. The file's template is always the same and consists of three columns of numbers:
I tried solving the problem with the following code using numpy:
from fastapi import FastAPI, UploadFile
import numpy as np

app = FastAPI()

@app.post("/uploadfile/")
async def create_upload_file(file: UploadFile):
    file_data = await file.read()
    print(len(file_data))
    print(file_data)
    deserialized_bytes = np.frombuffer(file_data, float)
    print(deserialized_bytes)
I got the following error:
ValueError: buffer size must be a multiple of element size
When printing file_data I got the following:
b'0.01\t1.008298628\t-0.007582043\n0.012589254\t1.007411741\t-0.008969602\n0.015848932\t1.00632129\t-0.010491102\n0.019952623\t1.005019534\t-0.012152029\n0.025118864\t1.00349648\t-0.013967763\n0.031622777\t1.001734774\t-0.015961535\n0.039810717\t0.999706432\t-0.018160753\n0.050118723\t0.997371077\t-0.020592808\n0.063095734\t0.994675168\t-0.023280535\n0.079432823\t0.991552108\t-0.026237042\n0.1\t0.987923699\t-0.029459556\n0.125892541\t0.983703902\t-0.032922359\n0.158489319\t0.978806153\t-0.036569606\n0.199526231\t0.973155266\t-0.040309894\n0.251188643\t0.96670398\t-0.044015666\n0.316227766\t0.959452052\t-0.047531126\n0.398107171\t0.95146288\t-0.050691362\n0.501187234\t0.942870343\t-0.05335184\n0.630957344\t0.933869168\t-0.055422075\n0.794328235\t0.924687135\t-0.056892752\n1\t0.915545303\t-0.057845825\n1.258925412\t0.906618399\t-0.058443323\n1.584893192\t0.898007292\t-0.05889944\n1.995262315\t0.88972925\t-0.05944639\n2.511886432\t0.88172395\t-0.060304263\n3.16227766\t0.873868464\t-0.06166047\n3.981071706\t0.865994233\t-0.063658897\n5.011872336\t0.857901613\t-0.066395558\n6.309573445\t0.849370763\t-0.069916657\n7.943282347\t0.840170081\t-0.074215864\n10\t0.830064778\t-0.079229225\n12.58925412\t0.818828635\t-0.084828018\n15.84893192\t0.80626166\t-0.090811856\n19.95262315\t0.792215036\t-0.09690624\n25.11886432\t0.776622119\t-0.102770212\n31.6227766\t0.759530416\t-0.10801952\n39.81071706\t0.741125379\t-0.112267806\n50.11872336\t0.7217348\t-0.115182125\n63.09573445\t0.701805205\t-0.116541504\n79.43282347\t0.681849867\t-0.116282042\n100\t0.662379027\t-0.114513698\n125.8925412\t0.643830642\t-0.111502997\n158.4893192\t0.626519742\t-0.107628211\n199.5262315\t0.61061625\t-0.103322294\n251.1886432\t0.596150187\t-0.099019949\n316.227766\t0.583035191\t-0.095119782\n398.1071706\t0.571098913\t-0.091964781\n501.1872336\t0.560111017\t-0.089838356\n630.9573445\t0.549803457\t-0.08897027\n794.3282347\t0.539881289\t-0.089546481\n1000\t0.530024653\t-0.091717904\n1258.925412\t0.519883761\t-
0.095604233\n1584.893192\t0.509069473\t-0.101289726\n1995.262315\t0.497142868\t-0.10880829\n2511.886432\t0.48360875\t-0.118115658\n3162.27766\t0.4679204\t-0.12904786\n3981.071706\t0.449505702\t-0.141268753\n5011.872336\t0.427826294\t-0.154216509\n6309.573445\t0.402477916\t-0.167069625\n7943.282347\t0.373326923\t-0.17876369\n10000\t0.340652906\t-0.188091157\n12589.25412\t0.305239137\t-0.193894836\n15848.93192\t0.268343956\t-0.195318306\n19952.62315\t0.231521442\t-0.192025496\n25118.86432\t0.196333603\t-0.184290646\n31622.7766\t0.164060493\t-0.17291366\n39810.71706\t0.135516045\t-0.15900294\n50118.72336\t0.111014755\t-0.143723646\n63095.73445\t0.090459787\t-0.128100619\n79432.82347\t0.073486071\t-0.11291483\n100000\t0.059599333\t-0.098684088\n125892.5412\t0.04827997\t-0.085696257\n158489.3192\t0.03904605\t-0.074063717\n199526.2315\t0.031483308\t-0.063778419\n251188.6432\t0.025253597\t-0.054757832\n316227.766\t0.020091636\t-0.046879426\n398107.1706\t0.01579663\t-0.040005154\n501187.2336\t0.012222187\t-0.033998672\n630957.3445\t0.009265484\t-0.028737822\n794328.2347\t0.006855045\t-0.024123472\n1000000\t0.00493654\t-0.020083713\n'
and the length of file_data is 2897, which is not divisible by 8 as it should be.
Thinking that the problem came from the tabs and newlines in the file, I tried removing the newlines and replacing the tabs with spaces, but I ended up getting different numbers than the ones in the file.
I don't quite understand how to convert file_data from bytes to a numpy array using the numpy library, rather than writing an entire parsing function of my own, which would be possible but much more complicated.
What would be the right way to read the data into an array? If you can help me find a quick way to put each column into a separate array automatically, with no additional loop, that would be great.
Solution
Why the error
frombuffer reads raw, "binary" data. So if you are trying to read float64, for example, it just reads packets of 64 bits (the internal representation of a float64) and fills a numpy array of float64 with them.
For example
np.frombuffer(b'\x00\x01\x02\x03', dtype=np.uint8)
# → array([0, 1, 2, 3], dtype=uint8)
# because each byte is the representation of one uint8 integer
np.frombuffer(b'\x00\x01\x02\x03', dtype=np.uint16)
# → array([256, 770], dtype=uint16) on my machine
# because each pair of bytes makes the 16 bits of a uint16 integer.
# The first pair, 0 1, is 00000000 00000001 in binary; on a little-endian
# machine that is read as 00000001 00000000 = 256 in decimal.
# (On a big-endian machine it would have been 1.)
# Then 2=00000010, 3=00000011. So on my little-endian machine that is
# 00000011 00000010 = 512+256+2 = 770.
# (On a big-endian machine it would have been 00000010 00000011 = 512+2+1 = 515.)
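As the comments note, the uint16 result depends on the machine's byte order. One way to make it deterministic is to pin the endianness in the dtype string; a small sketch using numpy's explicit-endianness codes ('<u2' for little-endian uint16, '>u2' for big-endian):

```python
import numpy as np

# Pin the byte order explicitly so the result no longer depends
# on the machine running the code.
data = b'\x00\x01\x02\x03'
little = np.frombuffer(data, dtype='<u2')  # little-endian uint16
big = np.frombuffer(data, dtype='>u2')     # big-endian uint16
print(little)  # [256 770]
print(big)     # [  1 515]
```

With an explicit byte order, both values match the hand computations above on any machine.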
Etc. I could continue with examples of float32 and so on, but that would be longer to detail, and useless, since understanding frombuffer is not really what you want. The point is, it is not what you think it is. In practice, frombuffer is for reading a numpy array back from bytes that were produced by a previous .tobytes() (np.array([256, 770], dtype=np.uint16).tobytes() == b'\x00\x01\x02\x03' on a little-endian machine), or by some equivalent code in another library or language (for example, if a C program fwrites the contents of a float * array, you could get the np.float32 values back into a numpy array with frombuffer).
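The round trip described above can be sketched in a few lines (the dtype must match on both sides, since the bytes themselves carry no type information):

```python
import numpy as np

# tobytes() serializes the raw memory of the array;
# frombuffer() reinterprets those bytes as an array again.
original = np.array([256, 770], dtype=np.uint16)
raw = original.tobytes()        # 4 bytes: 2 elements x 2 bytes each
restored = np.frombuffer(raw, dtype=np.uint16)
print(restored)  # [256 770]
```

This is the scenario frombuffer is designed for: raw machine representations, not human-readable text.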
But of course, you can use frombuffer only if you have a consistent number of bytes. So, for a uint16, the number of bits has to be a multiple of 16, that is, the number of bytes has to be a multiple of 2. And in your case, for a float64, the number of bits has to be a multiple of 64, so the number of bytes must be a multiple of 8. Which is not the case. Which is lucky for you: if your data had happened to contain a multiple of 8 bytes (a 12.5% probability), it would have "worked" without error, and you would have had a hard time understanding why, with no error message at all, you end up with a numpy array containing numbers that are not the right ones. (Just add 7 spaces at the end of your file to see for yourself...)
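That silent-failure scenario is easy to reproduce. Eight ASCII characters happen to be exactly one float64 worth of bytes, so frombuffer "succeeds" on them; a small sketch (the literal b'0.012345' is an arbitrary illustrative value, not from the uploaded file):

```python
import numpy as np

# 8 bytes of ASCII text, not a serialized float64.
text = b'0.012345'

# No error: the length is a multiple of 8, so frombuffer happily
# reinterprets the character codes as the bits of one float64.
arr = np.frombuffer(text, dtype=np.float64)
print(arr)  # one float, but nothing like 0.012345
```

The result is whatever number those character codes spell out as IEEE-754 bits, which is why a size that merely happens to divide evenly is worse than an error.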
What to do then
The bytes you are trying to parse are obviously in ASCII format: decimal representations of real numbers, separated by tabulations (\t) and line feeds (\n). This is sometimes called the tsv (tab-separated values) format.
So what you need is a function that reads and parses the tsv format. Those are not "ready to use" bytes representing numbers; it is a human-readable format.
numpy.loadtxt does just that.
Its normal usage is to open files, but it can also parse data directly, as long as you feed it an array (or generator) of lines.
So, your file_data is a bytestring containing lines (each of them containing numbers separated by tabs) separated by line feeds. Just split it on the b'\n' separator to get an array of lines, and give that array to np.loadtxt.
tl;dr
deserialized_bytes = np.loadtxt(file_data.split(b'\n'))
is what you want
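As for getting each column into its own array with no extra loop: loadtxt's unpack=True option returns the transposed result, so the columns can be tuple-unpacked directly. A sketch using a two-line inline sample of the same tab-separated layout (the names x, y, z are arbitrary):

```python
import numpy as np

# Same tab-separated, newline-terminated layout as the uploaded file,
# truncated to two rows for the example.
file_data = (b'0.01\t1.008298628\t-0.007582043\n'
             b'0.012589254\t1.007411741\t-0.008969602\n')

# unpack=True transposes the parsed array, so each column
# lands in its own 1-D array without any manual loop.
x, y, z = np.loadtxt(file_data.split(b'\n'), unpack=True)
print(x)  # [0.01       0.01258925]
```

Inside the endpoint, the same line with the real file_data gives you the three columns directly.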
Answered By - chrslg