Wednesday, January 24, 2024

[FIXED] create filtered value using pandas

Labels: pandas, python

Issue

I have a CSV file whose first line reads something like the following:

Pyscip_V1.11 Ref: #001=XYZ_0[1234] #50=M3_0[112] #51=M3_1[154] #52=M3_2[254]...

and so on. What I'd like to do is build a DataFrame from this line. The first column, Ref, should hold the values after each # sign (001, 50, 51, 52, ...). The second column, ID, should hold the value after each = sign (XYZ_0, M3_0, M3_1, M3_2, M3_3, ...). The third column should hold the values inside the square brackets (1234, 112, 154, 254, ...). This is what I tried:

header_pattern = r'Pyscip_V(\d+\.\d+) Ref:'

version_match = re.search(header_pattern, first_line.iloc[0, 0])
version_number = version_match.group(1) if version_match else ''

matches = re.findall(r'#(\d+)=(\w+_\d)\[([\d]+)\]', first_line.iloc[0, 0])

parsed_df = []
for match in matches:
    row_dict = {
        'Ref': match[0] if match[0] else '',
        'ID': match[1] if match[1] else '',
        'Ser_No': match[2] if match[2] else ''
    }
    parsed_df.append(row_dict)

new_df = pd.DataFrame(parsed_df)

However, I only get an empty DataFrame. What seems to be the problem here?

Edit: the data from the 3rd row onward looks like the following:

ID  Date        XYZ_0  M3_0  M3_1  M3_2
1   22.12.2023  12.6   0.5   1.2   2.3

The expected outcome is:

Ref  ID     Num
001  XYZ_0  1234
50   M3_0   112
51   M3_1   154

Solution

I would open the CSV file, extract the first line and process it, and only then read the rest of the file with pandas.

For that, your original approach and regex are fine.
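As a quick check that the pattern itself is sound, it can be run on the sample header line from the question. This is only a sketch: the `sample` string is copied from the question and cut off after the fourth entry, and the names `sample` and `pattern` are introduced here for illustration.

```python
import re

# Sample header line copied from the question (truncated after the fourth entry).
sample = "Pyscip_V1.11 Ref: #001=XYZ_0[1234] #50=M3_0[112] #51=M3_1[154] #52=M3_2[254]"

# Three capture groups: digits after '#', the ID after '=', digits inside '[...]'.
pattern = r'#(\d+)=(\w+_\d)\[([\d]+)\]'

# findall returns one (Ref, ID, Num) tuple per '#...' entry.
matches = re.findall(pattern, sample)
print(matches)
# [('001', 'XYZ_0', '1234'), ('50', 'M3_0', '112'), ('51', 'M3_1', '154'), ('52', 'M3_2', '254')]
```

One thing to keep in mind: `_\d` accepts exactly one digit after the last underscore. If the real file contains IDs with a longer suffix (something the sample does not show), `_\d+` would be needed instead.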

import re
import pandas as pd

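with open('my_csv.csv') as f:
    # Consume the first line; the file handle now points at the second line.
    first_line = next(f)
    # Each match is a (Ref, ID, Num) tuple, which maps directly onto the columns.
    header_df = pd.DataFrame(re.findall(r'#(\d+)=(\w+_\d)\[([\d]+)\]',
                                        first_line),
                             columns=['Ref', 'ID', 'Num'])

    # Read the remaining whitespace-separated lines from the same handle.
    data_df = pd.read_csv(f, sep=r'\s+')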

print(header_df)
print(data_df)

Output:

# header_df
   Ref     ID   Num
0  001  XYZ_0  1234
1   50   M3_0   112
2   51   M3_1   154
3   52   M3_2   254

# data_df
   ID        Date  XYZ_0  M3_0  M3_1  M3_2
0   1  22.12.2023   12.6   0.5   1.2   2.3
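To try the approach above without a file on disk, the same steps can be run against an in-memory buffer. This is a minimal sketch under stated assumptions: `csv_text` is assembled from the samples in the question, the blank second line is a guess (the question only says the data starts at the 3rd row), and `io.StringIO` stands in for the opened file. It also recovers the version number with the `header_pattern` regex from the question.

```python
import io
import re

import pandas as pd

# Stand-in for the CSV file, assembled from the question's samples.
# The blank second line is an assumption, not something the question shows.
csv_text = (
    "Pyscip_V1.11 Ref: #001=XYZ_0[1234] #50=M3_0[112] #51=M3_1[154] #52=M3_2[254]\n"
    "\n"
    "ID  Date    XYZ_0  M3_0   M3_1  M3_2\n"
    "1   22.12.2023  12.6  0.5 1.2   2.3\n"
)

f = io.StringIO(csv_text)

# Consume the header line; the buffer position moves to the second line.
first_line = next(f)

# Version number, using the header pattern from the question.
version_match = re.search(r'Pyscip_V(\d+\.\d+) Ref:', first_line)
version_number = version_match.group(1) if version_match else ''

# One (Ref, ID, Num) tuple per '#...' entry becomes one row.
header_df = pd.DataFrame(
    re.findall(r'#(\d+)=(\w+_\d)\[([\d]+)\]', first_line),
    columns=['Ref', 'ID', 'Num'],
)

# read_csv continues from the current position and skips blank lines by default.
data_df = pd.read_csv(f, sep=r'\s+')

print(version_number)
print(header_df)
print(data_df)
```

All three columns of `header_df` hold strings, which preserves the leading zeros in `001`. Converting `Ref` or `Num` to integers would drop them.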

Answered By - mozway

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
