Issue
I have CSV file with the below data.
**Source data:**
CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE CO., XYZ,20
But python pandas is not reading the data properly due to comma values in Name column. I used below link. But not working. XYZ is truncating in ABC (CEF) CO.,XYZ.
pattern = '[:;\?\.<\'/]' # I use \ to ignore characters that are used in regex :)
df['Name_Clean'] = df['Name'].str.replace(pattern, '').str.strip()
Unable to remove special characters ;:??/?<
Solution
i tried below which is one doable solution using Read a table of fixed-width formatted lines into DataFrame
with pandas.read_fwf()
method and assign a temporary column name col1
.
You Raw Data:
$ cat source_data.csv
CODE,Name,Value
1,ABC (CEF) CO.,XYZ,500
2,GOOD VALUE CO., XYZ,20
DataFrame:
>>> df = pd.read_fwf('source_data.csv', names=['col1'])
>>> df
col1
0 CODE,Name,Value
1 1,ABC (CEF) CO.,XYZ,500
2 2,GOOD VALUE CO., XYZ,20
Solution:
So, when you will use str.extract
, you will see NaN
values which you can drop with dropna()
and then use rename
to assign the desired column Names as extracted names are Just integers ..
>>> df.col1.str.extract('(\d+)\,(\D+)\,(\d+)')
0 1 2
0 NaN NaN NaN
1 1 ABC (CEF) CO.,XYZ 500
2 2 GOOD VALUE CO., XYZ 20
Desired:
>>> df.col1.str.extract('(\d+)\,(\D+)\,(\d+)').dropna().rename(columns={0:'CODE', 1:'Name', 2:'Value'}).dropna()
CODE Name Value
1 1 ABC (CEF) CO.,XYZ 500
2 2 GOOD VALUE CO., XYZ 20
OR
In case you want to rename the columns names creating a dict then try..
>>> cols={0:'CODE', 1:'Name', 2:'Value'}
>>> df.col1.str.extract('(\d+)\,(\D+)\,(\d+)').dropna().rename(columns=cols).dropna()
CODE Name Value
1 1 ABC (CEF) CO.,XYZ 500
2 2 GOOD VALUE CO., XYZ 20
Regex Explanation:
'(\d+)\,(\D+)\,(\d+)'
1st Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\, matches the character , literally (case sensitive)
2nd Capturing Group (\D+)
\D+ matches any character that\'s not a digit (equal to [^0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\, matches the character , literally (case sensitive)
3rd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Hope this will help.
Answered By - Karn Kumar
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.