Monday, November 7, 2022

[FIXED] Get values based on variable character position and length

November 07, 2022 dataframe, pandas, python No comments

Issue

Hope you can help me figure out the best solution to this. I have this file with a certain column('RAW') which contains data like this - STARTNA03N010110FIRST_NAME0211MIDDLE_NAME0309LAST_NAME0501S060829041987070110803PHL090101001M.....END

Basically after START, a segment for Name details should be extracted. With segment identifier as NA03N01, the 0110 that followed indicates First Name data with 01 as identifier for First Name while 10 is the number of characters that I should take to get the actual First_Name value. Then 0211 is for Middle Name with 02 as identifier of Middle Name, and 11 as the number of characters of actual Middle_Name. 03 is for last name with 09 as the number of characters to get. Please take note that middle name is not mandatory, hence some records won't contain 02XX value.

Notice that there is no 04XX, that is because that pertains to Mother's Maiden Name, which again is not mandatory. 0501S is for Civil Status with 05 as CS identifier followed by 01 which means the value contains 1 character, which turned out to be S (Single).

There are other segments after the Name (NA03N01) like ID(ID03I01), Address(PA03A01), Employment(EA03E01), etc. that is why this particular column has to be parsed correctly. Again, each segment varies as there are variables that are not mandatory.

I tried to make columns based on segments:

rows_2d = ["NA03N", "ID03I", "PA03A","EA03E"]

all_codes = []

for code in rows_2d:
    for i in range(20):
        all_codes.append(code + ("{0:0=2d}".format(i+1)))

I also defined a checker:

def isInt(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

And the method for separating the actual values:

def get_next_characters_w_columns(code, string, columns):
    
    done = 0
    output = ""
    
    code_index = string.find(code)
    
    if code_index == -1:
        return "None"
    
    a = code_index + len(code)
    while(done < len(columns)):   
        if isInt(string[a:a+2]):
            fieldNo = int(string[a:a+2])
            temp_output = string[a+4:a+4+int(string[a+2:a+4])]
            a = a + int(string[a+2:a+4]) + 4
            
            if fieldNo in columns:
                output = output + temp_output
                done += 1
                if done < len(columns):
                    output = output + " "
                
            if fieldNo > max(columns):
                return output
        else:
            return output
    return output

` I am calling the method by this:

def get_info(row, code, place=[i+1 for i in range(35)]):
    if "START" in row["RAW"]:
        info = get_next_characters_w_columns(code, row["RAW"], place)
        if (info == "-1") or (info == "-5"):
            return "No Info"
        else:
            return info
    else:
        return "None"

` This is my output

My real intention is to get the values directly from RAW column (and put them in separate columns - FNAME, MNAME, LNAME, etc.) since the identifiers and length of characters are already there. The second option is to take individual values from each table I generated, but I guess this will pose another challenge since there are values that are not mandatory.

Sorry, this has been a long one, but I hope I presented it clearly. Thanks in advance!

Solution

You might want to have a look at regular expressions as they allow to extract only certain set/number of characters, or to implement restrictions like 0-1 times a certain pattern.

Update - Approach of Iteratively building up the regex similiar to what your code is already doing:

import re
raw = 'STARTNA03N010110FIRST_NAME0211MIDDLE_NAME0309LAST_NAME0501S060829041987070110803PHL090101001MEND'

raw1 = 'STARTNA03N010110FIRST_NAME0309LAST_NAME0501S060829041987070110803PHL090101001MEND'

pattern = r'STARTNA03N01'
segment = r'(\d{2})(\d{2})'
while re.search(pattern + segment, raw):
    pattern += segment
    match = re.search(pattern, raw)
    count = match.groups()[-1]
    pattern = f'{pattern}(\w' + '{' + count +'})'
    match = re.search(pattern, raw)

print(match.groups()[::3])
print(match.groups()[2::3])
# Output
('01', '02', '03', '05', '06', '07', '08', '09', '10')
('FIRST_NAME', 'MIDDLE_NAME', 'LAST_NAME', 'S', '29041987', '1', 'PHL', '0', 'M')

# Output for raw1
('01', '03')
('FIRST_NAME', 'LAST_NAME')

count = match.groups()[-1] identifies the number of following characters to use it for the next regex. All in all its comparable to your code, just more compact. Also, if you change the start pattern to one of your all_codes elements, you should get immediately the corresponding fields.

Update - providing code for my last comments

def get_code_elements(code, raw):
    pattern = code
    segment = r'(\d{2})(\d{2})'
    match = []
    while re.search(pattern + segment, raw):
        pattern += segment
        match = re.search(pattern, raw)
        count = match.groups()[-1]
        pattern = f'{pattern}(\w' + '{' + count +'})'
        match = re.search(pattern, raw)
    if match:
        return match.groups()
    else:
        return []
        # alternatively return a result dictionary
        # return {a: b for a,b in zip(match.groups()[::3], match.groups()[2::3])}

raw = 'STARTNA03N010110FIRST_NAME0211MIDDLE_NAME0309LAST_NAME0501S060829041987070110803PHL090101001MEND'
all_codes = ['STARTNA03N01', 'ID03I01', 'PA03A01']
for code in all_codes:
    elements = get_code_elements(code, raw)
    # process elements as you need

As I see it your codes are all unique and will not be present more than once in your raw string. That means your raw string can be processed as described above.

Answered By - Flow

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, November 7, 2022

[FIXED] Get values based on variable character position and length

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels