Thursday, June 9, 2022

[FIXED] Comma values in a column of CSV - not working code

June 09, 2022 csv, pandas, python, python-3.x No comments

Issue

I have CSV file with the below data.

**Source data:**

CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE  CO., XYZ,20

Expected output:

But python pandas is not reading the data properly due to comma values in Name column. I used below link. But not working. XYZ is truncating in ABC (CEF) CO.,XYZ.

pattern = '[:;\?\.<\'/]' # I use \ to ignore characters that are used in regex :)
df['Name_Clean'] = df['Name'].str.replace(pattern, '').str.strip()

Unable to remove special characters ;:??/?<

Solution

i tried below which is one doable solution using Read a table of fixed-width formatted lines into DataFrame with pandas.read_fwf() method and assign a temporary column name col1.

You Raw Data:

$ cat source_data.csv
CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE  CO., XYZ,20

DataFrame:

>>> df =  pd.read_fwf('source_data.csv', names=['col1'])
>>> df
                        col1
0            CODE,Name,Value
1    1,ABC (CEF) CO.,XYZ,500
2  2,GOOD VALUE  CO., XYZ,20

Solution:

So, when you will use str.extract, you will see NaN values which you can drop with dropna() and then use rename to assign the desired column Names as extracted names are Just integers ..

   >>> df.col1.str.extract('(\d+)\,(\D+)\,(\d+)')
     0                     1    2
0  NaN                   NaN  NaN
1    1     ABC (CEF) CO.,XYZ  500
2    2  GOOD VALUE  CO., XYZ   20

Desired:

>>> df.col1.str.extract('(\d+)\,(\D+)\,(\d+)').dropna().rename(columns={0:'CODE', 1:'Name', 2:'Value'}).dropna()
  CODE                  Name Value
1    1     ABC (CEF) CO.,XYZ   500
2    2  GOOD VALUE  CO., XYZ    20

In case you want to rename the columns names creating a dict then try..

>>> cols={0:'CODE', 1:'Name', 2:'Value'}
>>> df.col1.str.extract('(\d+)\,(\D+)\,(\d+)').dropna().rename(columns=cols).dropna()
  CODE                  Name Value
1    1     ABC (CEF) CO.,XYZ   500
2    2  GOOD VALUE  CO., XYZ    20

Regex Explanation:

'(\d+)\,(\D+)\,(\d+)'


1st Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

\, matches the character , literally (case sensitive)

2nd Capturing Group (\D+)
\D+ matches any character that\'s not a digit (equal to [^0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

\, matches the character , literally (case sensitive)

3rd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

Hope this will help.

Answered By - Karn Kumar

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Thursday, June 9, 2022

[FIXED] Comma values in a column of CSV - not working code

Issue

Solution

You Raw Data:

DataFrame:

Solution:

Regex Explanation:

0 comments:

Post a Comment

Popular Posts

Labels