Tuesday, January 9, 2024

[FIXED] Dealing with PyArrow decimal128 precision

Labels: dataframe, decimal, pandas, pyarrow, python

Issue

[some example data at the end] I've just started working with PyArrow, so forgive me if I'm missing something obvious here.

I have a project that I'm updating to (hopefully) better handle calculations on money. Mostly, these calculations multiply a normal money amount by a percentage, like 9.94 * 0.04.

I had been using pandas v1.4.x and just had all the money as floats and was not consistent with rounding, which caused headaches. In the example above, I would want 9.94 * 0.04 = 0.40, using normal rounding to two digits.
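To make the target concrete, here is a minimal sketch of that rounding using only the standard-library decimal module. The variable names are illustrative, and ROUND_HALF_UP is an assumption about what "normal rounding" means here.

```python
from decimal import Decimal, ROUND_HALF_UP

amount = Decimal("9.94")
rate = Decimal("0.04")

# Decimal multiplication is exact: the scales add up (2 + 2 = 4 places).
exact = amount * rate

# Round to two places, with ties going away from zero
# (ROUND_HALF_UP is an assumption about the desired rounding rule).
rounded = exact.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(exact)    # 0.3976
print(rounded)  # 0.40
```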

I was going to start forcing decimal.Decimal objects in everywhere instead of floats, when I saw that pyarrow has a builtin decimal128 datatype that should work much better with pandas.

So, now I'm getting a lot of the following exception:

pyarrow.lib.ArrowInvalid: Rescaling Decimal128 value would cause data loss

I'm also getting changes to precision that, while not raising exceptions, I don't think I want.

For example, I have a pandas dataframe with a column called 'Pay Rate' with a dtype of pa.decimal128(12,2). When I do df['Pay Rate'] * decimal.Decimal('0.04'), the result is of type pa.decimal128(15,4). I'm assuming it is merging together the precisions of the two things being multiplied in a way that is reasonable but that I don't want. (Note: If I just do df['Pay Rate'] * 0.04, the result is a double[pyarrow] type.)
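The reported result types are consistent with Arrow's documented rule for decimal multiplication, where the result precision is p1 + p2 + 1 and the result scale is s1 + s2. The helper below is hypothetical (it is not a PyArrow function) and only reproduces that arithmetic. It assumes the Decimal('0.04') scalar is inferred as decimal128(2, 2), which matches the (15, 4) result described above.

```python
def multiply_result_type(p1, s1, p2, s2):
    """Hypothetical helper: (precision, scale) of a decimal product,
    following the rule precision = p1 + p2 + 1, scale = s1 + s2."""
    return (p1 + p2 + 1, s1 + s2)


# decimal128(12, 2) column times a scalar assumed to be decimal128(2, 2)
column_times_scalar = multiply_result_type(12, 2, 2, 2)

# decimal128(12, 2) column times another decimal128(12, 2) column
column_times_column = multiply_result_type(12, 2, 12, 2)

print(column_times_scalar)  # (15, 4)
print(column_times_column)  # (25, 4)
```

Under this rule the scale grows with every multiplication, which is why a cast back to two decimal places has to discard digits.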

I want my transformations to end with columns of type decimal128(12,2), so I'm also trying df['my_col'] = df['my_col'].astype(pd.ArrowDtype(pa.decimal128(12,2))), and that sometimes gives me the error above about data loss.

It makes sense to me that there is data loss, because I am indeed telling it to just drop off some decimal places, but really what I want is for it to round and then, yeah, drop them.

Is there some switch or function to handle this that I'm missing?

some example data

import pandas as pd
import pyarrow as pa
from decimal import Decimal

data = {'col1': {0: Decimal('39.60'), 1: Decimal('39.60'), 2: Decimal('21.60'), 3: Decimal('7.20'), 4: Decimal('18.00'), 5: Decimal('18.00'), 6: Decimal('72.00'), 7: Decimal('30.60'), 8: Decimal('36.00'), 9: Decimal('41.40')}, 'col2': {0: Decimal('0.98'), 1: Decimal('1.00'), 2: Decimal('0.97'), 3: Decimal('0.46'), 4: Decimal('0.52'), 5: Decimal('1.00'), 6: Decimal('1.00'), 7: Decimal('1.00'), 8: Decimal('1.00'), 9: Decimal('1.00')}}
df = pd.DataFrame(data,dtype=pd.ArrowDtype(pa.decimal128(12, 2)))
df['col3'] = df['col1'] * df['col2']
#df['col3'] has a dtype of decimal128(25,4)
df['col3'].astype(pd.ArrowDtype(pa.decimal128(12, 2)))
#raises exception

Solution

You can call round before casting:

df['col3'].round(2).astype(pd.ArrowDtype(pa.decimal128(12, 2)))

Answered By - 0x26res

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
