Issue
I have a dataset with a variable number of columns (the number of columns in each row is determined by a particular value in that row).
Here is the current method I am using:
pd.read_csv(file_path, names=list(range(100))).dropna(axis=1, how='all')
This drops all columns that are completely empty.
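A small sketch (with hypothetical data and a hypothetical file replaced by an in-memory CSV) reproduces the problem: dropna(axis=1, how='all') removes every all-empty column, including ones sitting in the middle of the data.

```python
import io

import pandas as pd

# Hypothetical ragged CSV: column 1 is empty in every row,
# column 4 is empty in some rows only.
csv = "abc,,def,20,x\nxyz,,ghi,10,\n"

# names=list(range(100)) pads every row out to 100 columns,
# then dropna removes all columns that are entirely NaN...
df = pd.read_csv(io.StringIO(csv), names=list(range(100))).dropna(axis=1, how='all')

# ...but the interior all-empty column 1 is dropped too,
# not just the trailing padding columns.
print(df.columns.tolist())  # column 1 no longer appears
```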
The only problem is that there can be columns in the middle that consist entirely of empty values. E.g.:
abc | | def | 20 | 1 | 2 | ..... | x | | |
def | | ghi | 10 | 1 | 2 | ..... | | | |
ghi | | jkl | 20 | 1 | 2 | ..... | y | | |
Here, I want to keep the 2nd column even if it is completely empty, but remove the completely empty columns at the end. Basically, this should be converted to:
abc | | def | 20 | 1 | 2 | ..... | x
def | | ghi | 10 | 1 | 2 | ..... |
ghi | | jkl | 20 | 1 | 2 | ..... | y
As the dataframe has thousands of rows, looping over them would be too slow. Can anyone suggest how to solve this?
Solution
Assuming this example input as df:
0 1 2 3 4 5
0 1 NaN 3.0 4 NaN NaN
1 1 NaN 3.0 4 NaN NaN
2 1 NaN NaN 4 NaN NaN
3 1 NaN 3.0 4 NaN NaN
you can compute whether each column contains any non-NaN value using df.notna().any(axis=0) (or any other method, if you prefer a threshold or a different condition), which gives (as an array): [True, False, True, True, False, False].
Then the trick is to apply cumsum to the reversed array: the trailing False values stay at zero, while every earlier position is filled in by the running sum:
mask = df.notna().any(axis=0)[::-1].cumsum()[::-1].astype(bool)
# [ True, True, True, True, False, False]
which you can use to slice the columns:
>>> df.loc[:,mask] # or df.loc(1)[mask]
0 1 2 3
0 1 NaN 3.0 4
1 1 NaN 3.0 4
2 1 NaN NaN 4
3 1 NaN 3.0 4
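Putting the answer together as a self-contained sketch, with the example frame rebuilt by hand:

```python
import numpy as np
import pandas as pd

# The example input from the answer.
df = pd.DataFrame({
    0: [1, 1, 1, 1],
    1: [np.nan] * 4,
    2: [3.0, 3.0, np.nan, 3.0],
    3: [4, 4, 4, 4],
    4: [np.nan] * 4,
    5: [np.nan] * 4,
})

# Keep every column up to and including the last non-empty one;
# only the trailing all-NaN columns are dropped.
mask = df.notna().any(axis=0)[::-1].cumsum()[::-1].astype(bool)
out = df.loc[:, mask]
print(out.columns.tolist())  # [0, 1, 2, 3]
```

The interior all-NaN column 1 is preserved, while columns 4 and 5 are removed.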
Answered By - mozway