Issue
I have a Csv file which has merged columns in it and pandas dataframe is showing it in same way. need to split column as required out put
Csv Sample
my current input from csv file is:
"Date InformationIdNo.","Date out","Dr.","Cr."
"01 FEB Mart Purchase MATRSC203255H","30 DEC 21","-3,535.61","0","250 - PQRT14225","","",""
"01 FEB Cash Sales CCTR220307AXCDV","30 DEC 21","-34.33","0","20000 - DEFG12","","",""
"01 FEB TransferFT22032FQWE3","01 FEB 21","0","7,436.93","","","",""
also need to Index-1 with information column of 0th index
Required output needed :
| | Date | Information | IdNo. | Date out | Dr. | Cr. | Balance |
|---|-----------|-------------------------------|-----------------|-----------|-----------|-----------|------------|
| 0 | 01 FEB 21 | Mart Purchase 250 - PQRT14225 | MATRSC203255H | 30 DEC 21 | -3,535.61 | 0 | -3,978.61 |
| 1 | 01 FEB 21 | Cash Sales 20000 - DEFG1220 | MATRSC203255H | 30 DEC 21 | -34.33 | 0 | -3,944.29 |
| 2 | 01 FEB 21 | Transfer | FT22032FQWE3 | 01 FEB 21 | 0 | 7,426.93 | 3,482.65 |
CSV file when opened in notepad
Solution
I believe the idea of the code below is quite clear. First we need to correct data in the csv file to valid csv (comma separated value) format. After that we can create dataframe.
'data.csv' file content
"Date InformationIdNo."," Date out ","Dr."," Cr."
"01 FEB 21 Mart Purchase MATRSC203255H","30 DEC 21","-3,535.61","0"
"250 - PQRT14225","","",""
"01 FEB 21 Cash Sales CCTR220307AXCDV","30 DEC 21","-34.33","0"
"20000 - DEFG12","","",""
"01 FEB 21 TransferFT22032FQWE3"," 01 FEB 21","0","7,426.93"
"","","",""
"","","",""
"","","",""
"","","",""
"","","",""
Possible (quick) solution is the following:
#pip install pandas
import re
import pandas as pd
from io import StringIO
with open("data.csv", "r", encoding='utf-8') as file:
raw_data = file.read()
# convert txt to valid csv (comma separated values) format
raw_data = raw_data.replace(' - ', '-')
raw_data = raw_data.replace('Date InformationIdNo.', 'Date","Information","IdNo.')
raw_data = raw_data.replace('" Cr."', '"Cr","Information_add"')
raw_data = re.sub('(\d{2} [A-Z]{3} \d{2})', r'\1","', raw_data)
raw_data = re.sub('\n"([A-Z0-9-]+)","","",""\n', r',"\1"\n', raw_data)
raw_data = re.sub(r',""{2,}', '', raw_data)
raw_data = re.sub('([A-Z0-9]{3,}",")', r'","\1","', raw_data)
raw_data = re.sub(',""+', r'', raw_data)
raw_data = re.sub('\n""+', r'', raw_data)
# # create dataframe and replace NaN with ""
df = pd.read_csv(StringIO(raw_data), sep=",")
df.fillna("", inplace=True)
# merge columns and drop temporary column
df['Information'] = df['Information'] + df['Information_add']
df.drop(['Information_add'], axis=1, inplace=True)
# cleanup column headers
df.columns = [name.strip() for name in df.columns]
# convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'].str.title().str.strip(), format="%d %b %y", dayfirst=True)
df['Date out'] = pd.to_datetime(df['Date out'].str.title().str.strip(), format="%d %b %y", dayfirst=True)
df
Returns
Answered By - GreyMurav
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.