Issue
I am trying to convert a bad text file with space delimited values to a clean csv file. Please guide me.
Below is my Data. Data hasn't been properly matched to the output columns in csv.
HP TRA ID CL ID IN/EId No Loop Element Name Freq STATUS Error Severity Error ID Message Report Source
13ZI 20712800032 1 Denied Error HP_DOSOlderTh Date of service is older than 12 months HP
13ZI 20712800032 1 1 Rejected Error CA16 Rejected at level. DupKeyID:0 is a Rejected of DupKeyID:0 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716. HP
13ZI 20712800032 2 1 Rejected Error CA16 Rejected at level. DupKeyID:1 is a Rejected of DupKeyID:1 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716. HP
13ZI 20712800032 3 1 Rejected Error CA16 Rejected at level. DupKeyID:2 is a Rejected of DupKeyID:2 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716. HP
1P8TY0J25 20712805263 1 Denied Error HP_DOSOlderTh Date of service is older than 12 months
I have tried the below code but no luck.
df = pd.read_csv('file1.txt', sep='\t', index_col=False, dtype='object')
df.to_csv(r'Report.csv', index = None)
Also below line. Data hasn't been properly matched to the columns in csv
df = pd.read_csv("file1.txt", sep=r"\s{2,}", engine="python")
df.to_csv(r'Report.csv', index = None)
I am expecting the output like this
Solution
Yes, that's a bad txt file. I wouldn't expect any built-in function to be able to handle it. You might be able to write some custom code to convert it to a proper CSV file. E.g., here's some code that works (very specifically) on the sample input, and you might be able to modify to work on the real file. But it's also possible that the file is basically ambiguous, and you'll have to hand-tweak any conversion result using your knowledge of what the data means.
import sys, re, pprint
table = []
for (i, line) in enumerate(open('input.txt').readlines()):
if line[-1] == '\n': line = line[:-1]
if i == 0:
# header
# In the header, we can assume that no 'cell' would be empty,
# so we can just split on runs of 2 or more spaces.
row = re.split(r' {2,}', line)
table.append(row)
continue
if i == 1:
assert line == ''
# blank line between header and data
continue
# For all other lines, we have to look at runs of 2+ spaces
# and decide what they 'mean'.
def replfunc(mo):
L = len(mo.group(0))
# Some 'Message' values say:
# "Rejected at level. DupKeyID..."
# i.e., there's a run of 2 spaces *within* a cell value.
# Deal with this particular case.
if L == 2:
(start, end) = mo.span()
if (
line[:start].endswith('Rejected at')
and
line[end:].startswith('level.')
):
# Replace it with a single space.
return ' '
# Otherwise, this run of spaces is equivalent to
# one or more field-separators.
# We'll replace it with tabs and then split on tabs.
if L < 2:
assert 0
elif 2 <= L <= 12:
return '\t'
elif L == 17:
return '\t\t'
elif L == 43:
return '\t\t\t'
elif L == 61:
return '\t\t\t\t\t'
elif L == 120:
return '\t'
elif L == 263:
return '\t'
else:
return f'<{L}>'
tabbed_line = re.sub(r'\s{2,}', replfunc, line)
row = tabbed_line.split('\t')
table.append(row)
# ---------------------------
# The rest is just to display the resulting table nicely.
max_n_fields = max(
len(row)
for row in table
)
field_widths = []
for j in range(max_n_fields):
field_width = max(
len(row[j])
for row in table
if j < len(row)
)
field_widths.append(field_width)
for (i, row) in enumerate(table):
for (field, field_width) in zip(row, field_widths):
print(field.ljust(field_width), end='|')
print()
Here's the output:
HP TRA ID|CL ID |IN/EId|No|Loop|Element Name|Freq|STATUS |Error Severity|Error ID |Message |Report Source|
13ZI |20712800032| | | | |1 |Denied |Error |HP_DOSOlderTh|Date of service is older than 12 months |HP ||
13ZI |20712800032| |1 | | |1 |Rejected|Error |CA16 |Rejected at level. DupKeyID:0 is a Rejected of DupKeyID:0 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP ||
13ZI |20712800032| |2 | | |1 |Rejected|Error |CA16 |Rejected at level. DupKeyID:1 is a Rejected of DupKeyID:1 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP ||
13ZI |20712800032| |3 | | |1 |Rejected|Error |CA16 |Rejected at level. DupKeyID:2 is a Rejected of DupKeyID:2 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP ||
1P8TY0J25|20712805263| | | | |1 |Denied |Error |HP_DOSOlderTh|Date of service is older than 12 months |
Answered By - Michael Dyck
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.