Issue
I've been dealing with a lot of 4-5 GB CSV files at work over the last few days, and to keep track of how far reading/writing has progressed I wrote a couple of wrapper functions on top of pandas' methods. It all seems to work very well; there is a bit of overhead, but the convenience outweighs most issues.
However, when reading a CSV, the progress bar can only display a correct percentage if I know the number of rows in advance, since that determines how many chunks there will be. The simplest solution I came up with is to load just the 0th column of the CSV before loading the rest and take its size, but this takes a noticeable amount of time on files with millions of rows.
Reading a single column also takes an unreasonably high proportion of the total time: reading one column of a CSV with 125 columns and a few million rows took ~24 seconds, while reading the whole file takes 63 seconds.
This is the function I've been using to read CSVs:
import sys

import pandas as pd
from tqdm import tqdm


def read_csv_with_progressbar(filename: str,
                              chunksize: int = 50000) -> pd.DataFrame:
    # Read only the first column to find out how many rows (and chunks) there are.
    length = pd.read_csv(filename, usecols=[0])
    length = length.values.shape[0]
    total = length // chunksize

    chunk_list = []
    chunks = pd.read_csv(filename, chunksize=chunksize)
    with tqdm(total=total, file=sys.stdout) as pbar:
        for chunk in chunks:
            chunk_list.append(chunk)
            pbar.set_description('Reading source csv file')
            pbar.update(1)

    df = pd.concat(chunk_list, axis=0)
    return df
Is there any way to get the number of rows in a CSV faster than my flawed method?
Solution
Assuming there are no quoted strings (with newlines in them) or other shenanigans in your CSV file, an accurate (but hacky) solution is to not even parse the file but simply count the number of newlines in it:
import numpy as np

chunk = 1024 * 1024  # Process 1 MB at a time.
f = np.memmap("test.csv", mode="r")  # Map the file as raw bytes (uint8), read-only.
num_newlines = sum(np.sum(f[i:i + chunk] == ord('\n'))
                   for i in range(0, len(f), chunk))
del f
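To plug this into the progress bar, remember that the newline count includes the header line, so the number of data rows is one less (assuming a single header row and a trailing newline at the end of the file). Below is a minimal sketch under those assumptions; the count_rows() helper name and the chunk arithmetic are illustrative, not part of the original answer:

import math
import sys

import numpy as np
import pandas as pd
from tqdm import tqdm


def count_rows(filename: str, block: int = 1024 * 1024) -> int:
    # Count newline bytes in 1 MB blocks, then drop one for the header row.
    f = np.memmap(filename, mode="r")
    newlines = sum(np.sum(f[i:i + block] == ord('\n'))
                   for i in range(0, len(f), block))
    del f
    return int(newlines) - 1


def read_csv_with_progressbar(filename: str,
                              chunksize: int = 50000) -> pd.DataFrame:
    # Total number of chunks, rounded up so the bar can reach 100%.
    total = math.ceil(count_rows(filename) / chunksize)
    chunk_list = []
    with tqdm(total=total, file=sys.stdout) as pbar:
        pbar.set_description('Reading source csv file')
        for chunk in pd.read_csv(filename, chunksize=chunksize):
            chunk_list.append(chunk)
            pbar.update(1)
    return pd.concat(chunk_list, axis=0)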
Answered By - orlp