Issue
I am trying to understand how to set up a sparse pandas matrix to minimize memory usage and retain precision of all values. I did not find the answers in the pandas Sparse documentation. Below is an example which illustrates my questions:
1. Why does a Sparse(int32) dataframe take as much memory as a Sparse(float32) dataframe? Is there any advantage in specifying a Sparse(int) dtype if this is the case?
2. How does pandas decide what specific Sparse(int) dtype to use, e.g. int8 or int32? Given the example below (please see dataframes sdf_int32 and sdf_high_int32), it appears Sparse(int32) is always chosen, regardless of whether Sparse(int8) might be more memory-efficient or Sparse(int32) might truncate some values.
3. Is the only way to avoid truncation and achieve minimum memory usage to specify a Sparse(intNN) or Sparse(floatNN) dtype for each column?
import numpy as np
import pandas as pd
# Generate binary dense matrix with low density
df = pd.DataFrame()
for col in ['col1', 'col2', 'col3']:
    df[col] = np.where(np.random.random_sample(100_000_000) > 0.98, 1, 0)
df.name = 'Dense'
# Replace one column by values too high for int32 dtype
df_high = df.copy()
df_high['col1'] = df_high['col1'] * 100_000_000_000
# Convert df to sparse of various dtypes
sdf_float32 = df.astype(pd.SparseDtype('float32', 0))
sdf_float32.name = 'Sparse, float32'
sdf_int8 = df.astype(pd.SparseDtype('int8', 0))
sdf_int8.name = 'Sparse, int8'
sdf_int32 = df.astype(pd.SparseDtype('int', 0))
sdf_int32.name = 'Sparse, int32'
sdf_int64 = df.astype(pd.SparseDtype('int64', 0))
sdf_int64.name = 'Sparse, int64'
# Convert df_high to Sparse(int)
sdf_high_int32 = df_high.astype(pd.SparseDtype('int', 0))
sdf_high_int32.dtypes
sdf_high_int32['col1'].value_counts()
sdf_high_int32.name = 'Sparse, int32 highval'
# Print info for all dataframes
print(f" {df.name} Dataframe; Memory size: {df.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB, {df['col1'].dtype}")
for data in [sdf_float32, sdf_int8, sdf_int32, sdf_high_int32, sdf_int64]:
print(f" {data.name} Dataframe; Memory size: {data.memory_usage(deep=True).sum() / 1024**2:.1f} MB,"
f"Density {data.sparse.density:.5%}, {data['col1'].dtype}")
"""
Dense Dataframe; Memory size: 1144.4 MB, int32
Sparse, float32 Dataframe; Memory size: 45.8 MB,Density 1.99980%, Sparse[float32, 0]
Sparse, int8 Dataframe; Memory size: 28.6 MB,Density 1.99980%, Sparse[int8, 0]
Sparse, int32 Dataframe; Memory size: 45.8 MB,Density 1.99980%, Sparse[int32, 0]
Sparse, int32 highval Dataframe; Memory size: 45.8 MB,Density 1.99980%, Sparse[int32, 0]
Sparse, int64 Dataframe; Memory size: 68.7 MB,Density 1.99980%, Sparse[int64, 0]
"""
# Show truncated values for sdf_high_int32
print(f"Values for sdf_high_int32, col1: \n {sdf_high_int32['col1'].value_counts()}")
"""
Values for sdf_high_int32, col1:
col1
0 98001473
1215752192 1998527
Name: count, dtype: int64
"""
Solution
There are really two questions here: one about sparse storage and one about the int dtype. On the first, here is the pandas documentation on sparse data structures:
pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical “mostly 0”. Rather, you can view these objects as being “compressed” where any data matching a specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.
That means that only the value chosen not to be stored (0 in your case) is omitted. The other values are stored in the datatype you have chosen. float32 and int32 both use 32 bits to represent a value, so they consume the same amount of memory; the difference is in which values they can represent and at what precision. The same holds for int64 versus float64.

Since you only store 0s and 1s in your case, you can just as well pick int8 for storing df.
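You can see the effect directly: a sparse column physically stores only the non-fill values, so its memory scales with their count times the itemsize of the dtype. A minimal sketch (the series size and variable names here are made up for the illustration):

import numpy as np
import pandas as pd

# ~2% ones in a million-element series, as in the question.
s = pd.Series(np.where(np.random.random_sample(1_000_000) > 0.98, 1, 0))

s_f32 = s.astype(pd.SparseDtype('float32', 0))
s_i8 = s.astype(pd.SparseDtype('int8', 0))

# .sparse.sp_values holds only the stored (non-fill) values.
print(len(s_f32.sparse.sp_values), s_f32.sparse.sp_values.dtype)  # ~20000 values, float32
print(s_f32.memory_usage(deep=True))  # 4 bytes per stored value, plus the index
print(s_i8.memory_usage(deep=True))   # 1 byte per stored value, plus the index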
Now, answering your int question. Your platform seems to interpret int as int32; on my platform, int is equivalent to int64. NumPy is responsible for this, and here is a bit from the NumPy documentation:
numpy.int [...] numpy.int64, or numpy.int32
...
For np.int a direct replacement with np.int_ or int is also good and will not change behavior, but the precision will continue to depend on the computer and operating system. If you want to be more explicit and review the current use, you have the following alternatives: [...]
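To see what int resolves to on your machine (a quick check, not part of the question's code):

import numpy as np
import pandas as pd

# The platform's default integer width decides what SparseDtype('int', 0) becomes.
print(np.dtype(int))             # int32 on your platform, int64 on mine
print(pd.SparseDtype('int', 0))  # Sparse[int32, 0] or Sparse[int64, 0] accordingly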
Because in your case int32 was chosen, you see the values 0 (obviously) and 1215752192. The latter is 100_000_000_000 stored in an int32: the value overflows and is stored as 100_000_000_000 % (2**32) (run this in Python), which gives 1215752192.
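You can reproduce the truncation directly with a NumPy cast (a minimal sketch):

import numpy as np

# Values that do not fit into int32 wrap around modulo 2**32 when cast.
arr = np.array([100_000_000_000], dtype=np.int64)
print(arr.astype(np.int32))     # [1215752192]
print(100_000_000_000 % 2**32)  # 1215752192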
BTW, here are the relevant parts from my Python interpreter:
>>> df_high['col1'] = df_high['col1'] * 100_000_000_000
>>> sdf_int32 = df.astype(pd.SparseDtype('int', 0))
>>> sdf_int32.dtypes
col1 Sparse[int64, 0]
col2 Sparse[int64, 0]
col3 Sparse[int64, 0]
dtype: object
>>> sdf_high_int32 = df_high.astype(pd.SparseDtype('int', 0))
>>> sdf_high_int32.dtypes
col1 Sparse[int64, 0]
col2 Sparse[int64, 0]
col3 Sparse[int64, 0]
dtype: object
>>> 100_000_000_000 % (2**32)
1215752192
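So yes, if you need different widths for different columns, specify them per column. One way is to pass a dict of SparseDtypes to astype; a minimal sketch based on the dataframes above (the variable name sdf_mixed is just for illustration):

# Per-column sparse dtypes avoid both truncation and wasted bytes.
sdf_mixed = df_high.astype({
    'col1': pd.SparseDtype('int64', 0),  # needs 64 bits for 100_000_000_000
    'col2': pd.SparseDtype('int8', 0),   # 0/1 values fit in 8 bits
    'col3': pd.SparseDtype('int8', 0),
})
print(sdf_mixed.dtypes)
print(f"{sdf_mixed.memory_usage(deep=True).sum() / 1024**2:.1f} MB")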
Answered By - Tom