Tuesday, February 22, 2022

[FIXED] Removing rows of duplicate headers or strings same columns and blank lines in pandas in python

Labels: dataframe, pandas, python, python-3.x, spyder

Issue

I have a sample data file (Data_sample_truncated.txt) that I truncated from a much larger data set. It has 3 fields: "Index", "Time" and "RxIn.Density[x, ::]", where x is an integer that can cover any range; in this data it runs from 0 to 15. The combination of the 3 fields is unique, but for different "Index" values the "Time" and "RxIn.Density[x, ::]" values can be the same or different. Each new "Index" value is preceded by a blank line and a header line that is almost identical to the first one, except that x in "RxIn.Density[x, ::]" increases with each new "Index" value. This is the format I get when exporting the data from ADS (circuit simulation software).

Now I want to format the data so that everything is merged under 3 unique column fields: "Index", "Time" and "RxIn.Density". In other words, I want to remove the string [x, ::] from the 3rd column header in the new dataframe. Here is the sample final data file that I want after formatting (Data-format_I_want_after_formatting.txt). So I want the following:

The blank lines (or rows) to be removed
All the other header lines to be removed keeping the top header only and changing the 3rd column header to "RxIn.Density"
Keeping all the data merged under the unique column fields - "Index", "Time" and "RxIn.Density", even if the data values are duplicate.

My Python code is below:

import pandas as pd

#create DataFrame from the whitespace-separated file with columns index, time and v
df = pd.read_csv('Data_sample_truncated.txt', sep=r"\s+", names=['index','time','v'])

#boolean mask for identify columns of new df   
m = df['v'].str.contains('RxIn')

#new column by replace NaNs by forward filling
df['g'] = df['v'].where(m).ffill()

#get original ordering for new columns
#cols = df['g'].unique()

#remove rows with same values in v and g columns
#df = df[df['v'] != df['g']]

df = df.drop_duplicates(subset=['index', 'time'], keep=False)

df.to_csv('target.txt', index=False, sep='\t')

The generated target.txt file is not what I wanted. Can anyone explain what is wrong with my code and how to fix it so that I get my intended formatting?
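
One point worth checking, offered as an observation rather than a full diagnosis of the actual file: the stated goal is to keep rows even when values repeat, yet the code calls drop_duplicates with keep=False, which discards every row whose key occurs more than once. The toy frame below is hypothetical and is not taken from the question's data file; it only illustrates that behaviour.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "index": [0, 0, 1, 1],
    "time": [0.0, 0.0, 0.0, 1.0],
    "v": [1.5, 2.5, 1.5, 3.5],
})

# keep=False removes ALL rows whose (index, time) pair occurs more than once,
# so both rows with index=0, time=0.0 disappear.
dropped_all = toy.drop_duplicates(subset=["index", "time"], keep=False)

# keep="first" (the default) retains one row for each repeated pair.
kept_first = toy.drop_duplicates(subset=["index", "time"], keep="first")

print(len(toy), len(dropped_all), len(kept_first))
```

If repeated values must survive, filtering out only the unwanted header rows is usually a safer route than any form of de-duplication.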

I am using Spyder 3.2.6 (Anaconda) with the bundled 64-bit Python 3.6.4.

Solution

You can simply filter out the rows that you do not want:

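import pandas as pd

# Whitespace splitting breaks the header "RxIn.Density[0, ::]" into two tokens, giving 4 columns
df = pd.read_csv('Data_sample_truncated.txt', sep=r"\s+")
df.columns = ["index","time","RxIn.Density","1"]
# Drop the spare column created by the split header
del df["1"]
# Remove the repeated header rows, which still contain "Rx" in the third column
df = df[~df["RxIn.Density"].str.contains("Rx")].reset_index(drop=True)
df.to_csv('target.txt', index=False, sep='\t')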

Answered By - Dmitriy Kisil
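
As a supplementary sketch (not part of the answer above), the same filtering idea can be tried on a small in-memory sample. The sample text, the merge_blocks helper and the "extra" column name are all hypothetical and only mimic the layout described in the question. This variant also converts the surviving values back to numbers, since the repeated header rows force pandas to read every column as text.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the layout described in the question:
# a header per block, a blank line between blocks, and a changing [x, ::] suffix.
SAMPLE = """\
Index Time RxIn.Density[0, ::]
0 0.0 1.5
0 1.0 2.5

Index Time RxIn.Density[1, ::]
1 0.0 1.5
1 1.0 3.5
"""


def merge_blocks(source):
    """Merge all blocks under one header (hypothetical helper)."""
    df = pd.read_csv(
        source,
        sep=r"\s+",
        header=None,  # treat every line, including headers, as data
        # "extra" absorbs the "::]" token that the split header produces
        names=["Index", "Time", "RxIn.Density", "extra"],
        dtype=str,  # keep everything as text until header rows are removed
    )
    # Header rows are the ones whose first field is the literal word "Index"
    is_header = df["Index"] == "Index"
    df = df.loc[~is_header, ["Index", "Time", "RxIn.Density"]]
    # Convert the remaining text back to numbers and renumber the rows
    return df.apply(pd.to_numeric).reset_index(drop=True)


result = merge_blocks(io.StringIO(SAMPLE))
print(result)
```

Passing a file path instead of the StringIO object should work the same way, and result.to_csv('target.txt', index=False, sep='\t') would write the merged table as in the answer. Blank lines are skipped by read_csv by default, and repeated data values are kept because nothing is de-duplicated.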

This answer was collected from Stack Overflow and tested by the PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
