Issue
I am reading a large Parquet file with int, string and date columns. When using dtype_backend="pyarrow" instead of the default dtype_backend="numpy_nullable", I get 15.6 GB instead of 14.6 GB according to df.info(). Furthermore, I have seen an even larger relative overhead with pyarrow on other datasets.
Code:
pd.read_parquet("df.parquet", dtype_backend="numpy_nullable").info()
dtypes: Int16(1), Int32(2), datetime64ns, UTC, string(1), timedelta64ns memory usage: 14.6 GB
pd.read_parquet("df.parquet", dtype_backend="pyarrow").info()
dtypes: duration[ns]pyarrow, int16pyarrow, int32pyarrow, stringpyarrow, timestamp[ns, tz=UTC]pyarrow memory usage: 15.6 GB
Is this the expected behaviour or do I have to tweak other parameters as well?
I'm using pandas[parquet] ~= 2.1.3
Solution
I believe pd.DataFrame.info gives you a shallow measurement of memory when using numpy as a backend, so it won't give you an accurate figure for string columns (their Python string objects are not counted). The memory usage reported for the pyarrow backend, on the other hand, is accurate. You should use memory_usage(deep=True) to compare the two fairly:
import pandas as pd

df = pd.DataFrame({"col1": ["abc", "efg"]})

(
    # numpy/object backend: shallow and deep measurements differ,
    # because shallow only counts the object pointers, not the strings
    df.memory_usage().sum(),
    df.memory_usage(deep=True).sum(),
    # pyarrow backend: shallow and deep report the same number
    df.astype({"col1": "string[pyarrow]"}).memory_usage().sum(),
    df.astype({"col1": "string[pyarrow]"}).memory_usage(deep=True).sum(),
)
Gives me: 144, 248, 142, 142
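Applied back to your case, a sketch along these lines (assuming the same df.parquet file from your question) would show whether the numpy_nullable figure grows once string memory is counted properly:

import pandas as pd

# Read the same file with both backends and compare shallow vs deep
# memory usage. "df.parquet" is the file from the question.
np_df = pd.read_parquet("df.parquet", dtype_backend="numpy_nullable")
pa_df = pd.read_parquet("df.parquet", dtype_backend="pyarrow")

print("numpy_nullable, shallow:", np_df.memory_usage().sum())
print("numpy_nullable, deep:   ", np_df.memory_usage(deep=True).sum())
print("pyarrow, shallow:       ", pa_df.memory_usage().sum())
print("pyarrow, deep:          ", pa_df.memory_usage(deep=True).sum())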
Answered By - 0x26res