Issue
Hope you can help me figure out the best solution to this. I have this file with a certain column('RAW') which contains data like this - STARTNA03N010110FIRST_NAME0211MIDDLE_NAME0309LAST_NAME0501S060829041987070110803PHL090101001M.....END
Basically after START, a segment for Name details should be extracted. With segment identifier as NA03N01, the 0110 that followed indicates First Name data with 01 as identifier for First Name while 10 is the number of characters that I should take to get the actual First_Name value. Then 0211 is for Middle Name with 02 as identifier of Middle Name, and 11 as the number of characters of actual Middle_Name. 03 is for last name with 09 as the number of characters to get. Please take note that middle name is not mandatory, hence some records won't contain 02XX value.
Notice that there is no 04XX, that is because that pertains to Mother's Maiden Name, which again is not mandatory. 0501S is for Civil Status with 05 as CS identifier followed by 01 which means the value contains 1 character, which turned out to be S (Single).
There are other segments after the Name (NA03N01) like ID(ID03I01), Address(PA03A01), Employment(EA03E01), etc. that is why this particular column has to be parsed correctly. Again, each segment varies as there are variables that are not mandatory.
I tried to make columns based on segments:
`
rows_2d = ["NA03N", "ID03I", "PA03A","EA03E"]
all_codes = []
for code in rows_2d:
for i in range(20):
all_codes.append(code + ("{0:0=2d}".format(i+1)))
`
I also defined a checker:
def isInt(s):
try:
int(s)
return True
except ValueError:
return False
And the method for separating the actual values:
`
def get_next_characters_w_columns(code, string, columns):
done = 0
output = ""
code_index = string.find(code)
if code_index == -1:
return "None"
a = code_index + len(code)
while(done < len(columns)):
if isInt(string[a:a+2]):
fieldNo = int(string[a:a+2])
temp_output = string[a+4:a+4+int(string[a+2:a+4])]
a = a + int(string[a+2:a+4]) + 4
if fieldNo in columns:
output = output + temp_output
done += 1
if done < len(columns):
output = output + " "
if fieldNo > max(columns):
return output
else:
return output
return output
` I am calling the method by this:
`
def get_info(row, code, place=[i+1 for i in range(35)]):
if "START" in row["RAW"]:
info = get_next_characters_w_columns(code, row["RAW"], place)
if (info == "-1") or (info == "-5"):
return "No Info"
else:
return info
else:
return "None"
My real intention is to get the values directly from RAW column (and put them in separate columns - FNAME, MNAME, LNAME, etc.) since the identifiers and length of characters are already there. The second option is to take individual values from each table I generated, but I guess this will pose another challenge since there are values that are not mandatory.
Sorry, this has been a long one, but I hope I presented it clearly. Thanks in advance!
Solution
You might want to have a look at regular expressions as they allow to extract only certain set/number of characters, or to implement restrictions like 0-1 times a certain pattern.
Update - Approach of Iteratively building up the regex similiar to what your code is already doing:
import re
raw = 'STARTNA03N010110FIRST_NAME0211MIDDLE_NAME0309LAST_NAME0501S060829041987070110803PHL090101001MEND'
raw1 = 'STARTNA03N010110FIRST_NAME0309LAST_NAME0501S060829041987070110803PHL090101001MEND'
pattern = r'STARTNA03N01'
segment = r'(\d{2})(\d{2})'
while re.search(pattern + segment, raw):
pattern += segment
match = re.search(pattern, raw)
count = match.groups()[-1]
pattern = f'{pattern}(\w' + '{' + count +'})'
match = re.search(pattern, raw)
print(match.groups()[::3])
print(match.groups()[2::3])
# Output
('01', '02', '03', '05', '06', '07', '08', '09', '10')
('FIRST_NAME', 'MIDDLE_NAME', 'LAST_NAME', 'S', '29041987', '1', 'PHL', '0', 'M')
# Output for raw1
('01', '03')
('FIRST_NAME', 'LAST_NAME')
count = match.groups()[-1]
identifies the number of following characters to use it for the next regex.
All in all its comparable to your code, just more compact. Also, if you change the start pattern to one of your all_codes
elements, you should get immediately the corresponding fields.
Update - providing code for my last comments
def get_code_elements(code, raw):
pattern = code
segment = r'(\d{2})(\d{2})'
match = []
while re.search(pattern + segment, raw):
pattern += segment
match = re.search(pattern, raw)
count = match.groups()[-1]
pattern = f'{pattern}(\w' + '{' + count +'})'
match = re.search(pattern, raw)
if match:
return match.groups()
else:
return []
# alternatively return a result dictionary
# return {a: b for a,b in zip(match.groups()[::3], match.groups()[2::3])}
raw = 'STARTNA03N010110FIRST_NAME0211MIDDLE_NAME0309LAST_NAME0501S060829041987070110803PHL090101001MEND'
all_codes = ['STARTNA03N01', 'ID03I01', 'PA03A01']
for code in all_codes:
elements = get_code_elements(code, raw)
# process elements as you need
As I see it your code
s are all unique and will not be present more than once in your raw string. That means your raw string can be processed as described above.
Answered By - Flow
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.