Issue
I'm working turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.
Alternatively, is there some kind of sparse pivot capability outside of pandas?
edit: here is an example of a non-sparse pivot
import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]
frame
person thing count
0 me a 1
1 you a 1
2 him b 1
3 you c 1
4 him d 1
5 me d 1
frame.pivot('person','thing')
count
thing a b c d
person
him NaN 1 NaN 1
me 1 NaN NaN 1
you 1 NaN 1 NaN
This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.
http://docs.scipy.org/doc/scipy/reference/sparse.html
Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.
Solution
The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)
row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)), \
shape=(person_c.categories.size, thing_c.categories.size))
>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
[1, 0, 0, 1],
[1, 0, 1, 0]], dtype=int64)
dfs = pd.SparseDataFrame(sparse_matrix, \
index=person_c.categories, \
columns=thing_c.categories, \
default_fill_value=0)
>>> dfs
a b c d
him 0 1 0 1
me 1 0 0 1
you 1 0 1 0
The main changes were:
.astype()
no longer accepts "categorical". You have to create a CategoricalDtype object.sort()
doesn't work anymore
Other changes were more superficial:
- using the category sizes instead of a length of the uniqued Series objects, just because I didn't want to make another object unnecessarily
- the data input for the
csr_matrix
(frame["count"]
) doesn't need to be a list object - pandas
SparseDataFrame
accepts a scipy.sparse object directly now
Answered By - Alnilam
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.