Tuesday, January 2, 2024

[FIXED] Fix csv file rows with more columns than others, in Python

January 02, 2024 csv, pandas, python No comments

Issue

I have to upload an Excel file into Teradata. So I took the tab I needed to upload and saved it as a csv file. (I was advised to use Teradata BTEQ after multiple failed attempts at using FastLoad through the Teradata Studio GUI.)

Issues:

Some rows have more columns than others.
When using BTEQ, some characters were misinterpreted.
I can print out some of the counts, but I eventually get the error
UnicodeDecodeError: 'charmap' codec can't decode byte ... in position ...: character maps to <undefined> and I'm not sure what to do about that.

I was advised to use Python to count the commas/delimiters in each row to find the ones with too many columns and fix them, but there are 125,000 rows and 66 columns in each row. (It is very dirty data that was manually entered without much use of Excel's data validation options.)

It would best if I could print out only the row numbers (not line numbers) of the rows that need to be fixed and fix them on the spot with a conditional statement.

The code I have now will print the file path, then the number of columns in each row on a new line, then stop processing and return a UnicodeDecodeError.

Code:

with open('Data.csv', 'r') as csv_file:  
    for line in csv_file:  
        print( line.count(','))

Solution

I'd recommend you start out with the logging method, kind of like was advised. If "fix them on the spot with a conditional statement" means stop reading/importing this row and let you somehow change it before resuming reading/importing... that seems very difficult. Also, you might like to know how much work you're in for before starting.

You only need to print out rows that don't meet expectations. Since you believe the data contains 66 columns, just print out rows that don't meet this expectation.

For a CSV that should have 3 columns:

Col1,Col2,Col3
r1c1,r1c2
r2c1,r2c2,r2c3
r3c1
r4c1
r5c1
r6c1,r6c2,r6c3
r7c1,r7c2,r7c3
r8c1,r8c2
r9c1,r9c2

generate a report like:

| Row # | N cols  |
| ----: | ------: |
|     1 |       2 |
|     3 |       1 |
|     4 |       1 |
|     5 |       1 |
|     8 |       2 |
|     9 |       2 |

import csv

N_COLS = 3

f_out = open("output_flat.csv", "w")
writer = csv.writer(f_out)
writer.writerow(["Row #", "N cols"])

# newline='' necessary if your fields have embedded newlines (good practice anyways)
f_in = open("input.csv", newline="")
reader = csv.reader(f_in)

next(reader)  # skip header

for i, row in enumerate(reader, start=1):
    if len(row) != N_COLS:
        writer.writerow([i, len(row)])

But for bigger data:

Col_1,Col_2,Col_3
r01c1,r01c2
r02c1,r02c2,r02c3
r03c1
r04c1
r05c1
r06c1,r06c2,r06c3
r07c1,r07c2,r07c3
r08c1,r08c2
r09c1,r09c2
r10c1,r10c2,r10c3
r11c1,r11c2,r11c3
r12c1,r12c2,r12c3
r13c1,r13c2,r13c3
r14c1,r14c2,r14c3
r15c1,r15c2,r15c3
r16c1
r17c1,r17c2
r18c1,r18c2
r19c1,r19c2
r20c1,r20c2
r21c1,r21c2
r22c1,r22c2,r22c3
r23c1,r23c2
r24c1,r24c2,r24c3
r25c1,r25c2
r26c1,r26c2,r26c3
r27c1,r27c2
r28c1,r28c2,r28c3
r29c1,r29c2
r30c1,r30c2
r31c1
r32c1,r32c2
r33c1
r34c1,r34c2,r34c3

It might be more helpful to try and condense the bad-ness into ranges, like:

| N cols  | Row start | Row end |
| ------: | --------: | ------: |
|       2 |         1 |         |
|       1 |         3 |       5 |
|       2 |         8 |       9 |
|       1 |        16 |         |
|       2 |        17 |      21 |
|       2 |        23 |         |
|       2 |        25 |         |
|       2 |        27 |         |
|       2 |        29 |      30 |
|       1 |        31 |         |
|       2 |        32 |         |
|       1 |        33 |         |

f_out = open("output_ranges1.csv", "w")
writer = csv.writer(f_out)
writer.writerow(["N cols", "Row start", "Row end"])

def write_row(row: tuple[int, int, int]):
    """
    Write the col ct and the beg and end rows for the range with that col ct;
    unless beg and end are the same (one row), then just write beg row
    """
    if row[1] == row[2]:
        writer.writerow([row[0], row[1], ""])
    else:
        writer.writerow(row)

f_in = open("input.csv", newline="")
reader = csv.reader(f_in)

ncols = len(next(reader))

NO_TRACK = -1

tracking = False
row_num = NO_TRACK
cols_ct = NO_TRACK

i = 0  # scoped outside of loop, enumerate will increment

for i, row in enumerate(reader, start=1):
    _ncols = len(row)

    if _ncols != cols_ct:
        if tracking:
            write_row((cols_ct, row_num, i - 1))  # i-1 for prev row

        if _ncols == ncols:
            tracking = False
            row_num = NO_TRACK
            cols_ct = NO_TRACK
        else:
            tracking = True
            row_num = i
            cols_ct = _ncols

# flush final tracked range
if tracking:
    write_row((cols_ct, row_num, i))

Answered By - Zach Young

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, January 2, 2024

[FIXED] Fix csv file rows with more columns than others, in Python

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels