Issue
I am analyzing a group of stocks that share many intrinsic features, and I am adding external datasets that could expand the data points in the original dataset. I have the following dataframe, using a made-up example in Pandas:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#A = INTEL, #B = IBM, #C = MSFT, #D = AAPL, #E=AIG, #F=GS
df = pd.DataFrame({'A': ['IBM', 'INTEL', 'MSFT', 'INTEL', 'AAPL',
                         'INTEL', 'MSFT', 'IBM', 'INTEL', 'AAPL'],
                   'B': np.random.randn(10),
                   'C': np.random.randn(10),
                   'D': np.random.randn(10),
                   'E': np.random.randn(10)})
which produces the following dataset:
My real dataset might contain more than 100 features (columns). The question: is there a Pythonic way to visualize the salient features of the dataset so that I can work with a reduced matrix?
Solution
I don't know much about your data, but assuming this is a time series analysis, I would build a correlation matrix across all the features you have and consider merging features that are very highly correlated. If you take that approach, make sure the correlations hold over time, and check for serial correlation.
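As a rough sketch of the correlation idea, the snippet below builds a correlation matrix, lists feature pairs whose absolute correlation exceeds a cutoff, and drops one feature from each pair. The synthetic data, the 0.9 threshold, and the choice to drop the second feature of each pair are all assumptions made for illustration; they are not part of the original question. The last line is a simple first look at serial correlation using the lag-1 autocorrelation of each column.

```python
import numpy as np
import pandas as pd

# Synthetic numeric features (assumed data): C is built to be almost a copy of B.
rng = np.random.default_rng(0)
n = 200
base = rng.standard_normal(n)
features = pd.DataFrame({
    "B": base,
    "C": base + 0.05 * rng.standard_normal(n),
    "D": rng.standard_normal(n),
    "E": rng.standard_normal(n),
})

# Pairwise Pearson correlation between all columns.
corr = features.corr()

# Keep only the upper triangle so each pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cutoff for "very high" correlation
high_pairs = [
    (row, col, float(upper.loc[row, col]))
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and abs(upper.loc[row, col]) > threshold
]

# Drop the second feature of every highly correlated pair (one possible policy).
redundant = sorted({col for _, col, _ in high_pairs})
reduced = features.drop(columns=redundant)

# Lag-1 autocorrelation per feature, as a quick serial-correlation check.
lag1 = features.apply(lambda s: s.autocorr(lag=1))

print(high_pairs)
print(list(reduced.columns))
print(lag1)
```

To check that a correlation holds over time, one option is to repeat the same computation on successive windows of the data and compare the results.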
If you want a quick visualization of the features, I would use a RadViz plot like this:
pd.plotting.radviz(df, "A")  # pd.tools.plotting.radviz in pandas versions before 0.20
Which generates this:
With your made-up dataset, I could say, for example, that eliminating the data points below the D-B segment would reduce the size of your matrix while still capturing many of the features. Or you might want to focus on the values below the D-B segment because they represent anomalies in your field of study, and so on.
I have not found much documentation about RadViz in the official Pandas library, but I find it useful to quickly look at salient features of some datasets or as a quick visual data mining tool. There is a good paper about identification of clusters in multidimensional data and the RadViz algorithm here.
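Since the documentation is thin, here is a minimal sketch of the projection that a RadViz plot is generally understood to perform: each feature gets an anchor evenly spaced on the unit circle, each column is min-max normalized, and each row is placed at the weighted average of the anchors, with the normalized values as weights. The function name radviz_coordinates and the tiny demo frame are hypothetical, and the sketch assumes every column has a non-zero range and every row has a non-zero weight total.

```python
import numpy as np
import pandas as pd

def radviz_coordinates(frame):
    """Return 2-D RadViz-style coordinates and the feature anchors (illustrative helper)."""
    # Min-max normalize each column to [0, 1]; assumes max > min for every column.
    lo = frame.min()
    normalized = (frame - lo) / (frame.max() - lo)

    # One anchor per feature, evenly spaced around the unit circle.
    m = normalized.shape[1]
    angles = 2 * np.pi * np.arange(m) / m
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Each row lands at the weighted mean of the anchors; assumes row totals > 0.
    weights = normalized.to_numpy()
    totals = weights.sum(axis=1, keepdims=True)
    xy = weights @ anchors / totals
    return pd.DataFrame(xy, columns=["x", "y"], index=frame.index), anchors

# Hypothetical data: row 0 is dominated by D, row 1 by B, row 2 is balanced.
demo = pd.DataFrame({
    "B": [0.0, 1.0, 0.5],
    "C": [0.0, 0.0, 0.5],
    "D": [1.0, 0.0, 0.5],
    "E": [0.0, 0.0, 0.5],
})
points, anchors = radviz_coordinates(demo)
print(points.round(3))
```

A row dominated by one feature is pulled toward that feature's anchor, while a row whose opposing features balance out sits near the center, which is why points clustered near one side of the circle hint at which features matter most for them.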
Hope my answer helps.
Answered By - Luis Miguel