Issue
I have a csv file in which the first line reads something like the following:
Pyscip_V1.11 Ref: #001=XYZ_0[1234] #50=M3_0[112] #51=M3_1[154] #52=M3_2[254]...
and so on. What I'd like to do is create filtered value such that the first column is
Ref
and it takes all the values after the # sign like 001,50,51,52... The second column name is ID and it takes the value after =
like XYZ_0,M3_0,M3_1,M3_2,M3_3... And finally make a third column which takes all the values present in the square brackets like 1234,112,154,254,...
header_pattern = r'Pyscip_V(\d+\.\d+) Ref:'
version_match = re.search(header_pattern, first_line.iloc[0, 0])
version_number = version_match.group(1) if version_match else ''
matches = re.findall(r'#(\d+)=(\w+_\d)\[([\d]+)\]', first_line.iloc[0, 0])
parsed_df = []
for match in matches:
row_dict = {
'Ref': match[0] if match[0] else '',
'ID': match[1] if match[1] else '',
'Ser_No': match[2] if match[2] else ''
}
parsed_df.append(row_dict)
new_df = pd.DataFrame(parsed_df)
However, I only get enpty dataframe. What seems to be the problem here?
Edit: the data from 3rd row looks like the following:
ID Date XYZ_0 M3_0 M3_1 M3_2
1 22.12.2023 12.6 0.5 1.2 2.3
The expected outcome is
Ref ID Num
001 XYZ_0 1234
50 M3_0 112
51 M3_1 154
Solution
I would open the csv file, extract the first line and process it, only after read the rest of the CSV with pandas.
For that, your original approach and regex are fine.
import re
import pandas as pd
with open('my_csv.csv') as f:
first_line = next(f)
header_df = pd.DataFrame(re.findall(r'#(\d+)=(\w+_\d)\[([\d]+)\]',
first_line),
columns=['Ref', 'ID', 'Num'])
data_df = pd.read_csv(f, sep=r'\s+')
print(header_df)
print(data_df)
Output:
# header_df
Ref ID Num
0 001 XYZ_0 1234
1 50 M3_0 112
2 51 M3_1 154
3 52 M3_2 254
# data_df
ID Date XYZ_0 M3_0 M3_1 M3_2
0 1 22.12.2023 12.6 0.5 1.2 2.3
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.