Tuesday, May 17, 2022

[FIXED] basic feature selection or dimensionality reduction previous to machine learning

Tags: matrix, numpy, pandas, python

Issue

I am analyzing a group of stocks that share many intrinsic features, and I am also adding external datasets that could expand the data points in the original dataset. I have the following dataframe, built from a made-up example in pandas:

# IPython/Jupyter magic; drop this line when running as a plain script
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Column A holds the ticker; columns B to E are random placeholder features
df = pd.DataFrame({'A': ['IBM', 'INTEL', 'MSFT', 'INTEL',
                         'AAPL', 'INTEL', 'MSFT', 'IBM', 'INTEL', 'AAPL'],
                   'B': np.random.randn(10),
                   'C': np.random.randn(10),
                   'D': np.random.randn(10),
                   'E': np.random.randn(10)})

which produces the following dataset:

[Image: the resulting 10-row dataframe]

My real dataset might contain more than 100 features (columns). The question: is there a Pythonic way to visualize the salient features of the dataset so that I can work with a reduced matrix?

Solution

I don't know much about your data, but assuming it is a time series analysis, I would compute a correlation matrix across all your features and consider merging features that are very highly correlated. With that approach, however, you need to make sure the correlations hold over time, and you should also check for serial correlation.
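As a rough sketch of that idea, the snippet below flags one column from each highly correlated pair, compares the correlation matrix across two halves of the sample, and computes the lag-1 autocorrelation of what remains. The data, the column names `f0` to `f4`, the helper `highly_correlated`, and the 0.9 threshold are all made up for illustration. It also drops the redundant column instead of merging it, which is a simplification of what is suggested above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 250 observations of five numeric features,
# where f1 is deliberately built as a noisy copy of f0.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.standard_normal((250, 5)),
                        columns=["f0", "f1", "f2", "f3", "f4"])
features["f1"] = features["f0"] + 0.05 * rng.standard_normal(250)


def highly_correlated(frame, threshold=0.9):
    """List columns whose |correlation| with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


to_drop = highly_correlated(features, threshold=0.9)
reduced = features.drop(columns=to_drop)

# Do the correlations hold over time? Compare the two halves of the sample.
half = len(features) // 2
corr_drift = (features.iloc[:half].corr() - features.iloc[half:].corr()).abs()

# Serial correlation: lag-1 autocorrelation of each remaining feature.
lag1_autocorr = reduced.apply(lambda s: s.autocorr(lag=1))

print("dropped:", to_drop)
print(corr_drift.round(2))
print(lag1_autocorr.round(2))
```

Large entries in `corr_drift` would suggest that a correlation seen in one period does not carry over to the other, in which case merging or dropping the pair is harder to justify.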

If you want a quick visualization of the features, I would use a RadViz plot like this:

# pd.tools.plotting was removed from pandas; the function now lives in pd.plotting
pd.plotting.radviz(df, "A")

which generates this: [Image: RadViz plot of the dataframe]

With your made-up dataset, I could say, for example, that eliminating the data points below the D-B segment would reduce the size of your matrix while still capturing many of the features. Or maybe you want to focus on the values below the D-B segment because they represent anomalies in your field of study.
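To select the points on one side of the D-B segment in code rather than by eye, one option is to recompute the RadViz positions by hand. The sketch below assumes the usual RadViz layout, which I believe matches what pandas draws: each feature is min-max scaled, the anchors are spread evenly around the unit circle in column order, and each row sits at the weighted average of the anchors. The helper `radviz_coordinates` and the seeded data are hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def radviz_coordinates(frame, class_column):
    """Return (row positions, anchor positions) for a RadViz layout (hypothetical helper)."""
    numeric = frame.drop(columns=class_column)
    # Min-max scale every feature to [0, 1].
    scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())
    m = scaled.shape[1]
    angles = 2 * np.pi * np.arange(m) / m
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    weights = scaled.to_numpy()
    # Each row is the weighted average of the anchors. A row that holds the
    # minimum of every feature has zero total weight and would come out as NaN.
    xy = weights @ anchors / weights.sum(axis=1, keepdims=True)
    coords = pd.DataFrame(xy, columns=["x", "y"], index=frame.index)
    anchor_frame = pd.DataFrame(anchors, columns=["x", "y"], index=numeric.columns)
    return coords, anchor_frame


# Same shape as the question's dataframe, seeded so the result is repeatable.
rng = np.random.default_rng(42)
df = pd.DataFrame({'A': ['IBM', 'INTEL', 'MSFT', 'INTEL',
                         'AAPL', 'INTEL', 'MSFT', 'IBM', 'INTEL', 'AAPL'],
                   'B': rng.standard_normal(10),
                   'C': rng.standard_normal(10),
                   'D': rng.standard_normal(10),
                   'E': rng.standard_normal(10)})

coords, anchors = radviz_coordinates(df, "A")

# The sign of the cross product says which side of the D -> B line a point is on.
# With four features, B sits at (1, 0) and D at (-1, 0), so negative means "below".
d, b = anchors.loc["D"], anchors.loc["B"]
cross = (b["x"] - d["x"]) * (coords["y"] - d["y"]) - (b["y"] - d["y"]) * (coords["x"] - d["x"])
below = df[cross < 0]
above = df[cross >= 0]

print(below)
```

Whether `below` or `above` is the subset worth keeping depends on the analysis: one could be dropped to shrink the matrix, or studied on its own as a set of candidate anomalies.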

I have not found much about RadViz in the official pandas documentation, but I find it useful for quickly looking at the salient features of a dataset, or as a quick visual data mining tool. There is a good paper about the identification of clusters in multidimensional data and the RadViz algorithm here.

Hope my answer helps.

Answered By - Luis Miguel

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
