Issue
I have a dataset (around 10 GB) of call records. There's a column with IP addresses that I want to split into four new columns. I'm trying to use:
df['ip'].fillna('0.0.0.0', inplace=True)
df = df.join(df['ip'].apply(lambda x: pd.Series(x.split('.'))))
but it's too slow... the fillna is fast, around 10 seconds, but the split then runs for something like 5 hours...
Is there any better way to do it?
Solution
This answer is outdated, as is this question: the problem identified below was fixed some time ago, and the pandas str.split method should now be fast.
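In modern pandas, the idiomatic one-liner is str.split with expand=True (a minimal sketch on a small made-up frame; the ip0..ip3 column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ip': ['192.168.0.1', '10.0.0.254']})
# expand=True makes str.split return a DataFrame, one column per part,
# which can be assigned to several new columns at once
df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].str.split('.', expand=True)
```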
It turns out that str.split in Pandas (in core/strings.py, as str_split) is actually very slow; it isn't any more efficient than plain Python iteration, since it still loops through the values in Python, offering no speedup whatsoever.
Actually, see below: Pandas' performance here is simply miserable. It's not just a matter of Python vs. C iteration, because doing the same thing with plain Python lists is the fastest method!
Interestingly, though, there's a trick solution that's much faster: write the Series out to text, then read it back in with '.' as the separator:
df[['ip0', 'ip1', 'ip2', 'ip3']] = \
    pd.read_table(StringIO(df['ip'].to_csv(None, index=None)), sep='.', header=None)
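On Python 3 the same trick would look roughly like this (a sketch: io.StringIO replaces the Python 2 StringIO module, read_csv replaces the deprecated read_table, and .to_numpy() sidesteps index/column alignment on assignment):

```python
import io
import pandas as pd

df = pd.DataFrame({'ip': ['192.168.0.1', '10.0.0.254']})
# Write the column out as plain text, then re-read it with '.' as the separator
buf = io.StringIO(df['ip'].to_csv(index=False, header=False))
parts = pd.read_csv(buf, sep='.', header=None, dtype=str)
df[['ip0', 'ip1', 'ip2', 'ip3']] = parts.to_numpy()
```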
To compare, I use Marius' code and generate 20,000 IPs:
import pandas as pd
import random
import numpy as np
from StringIO import StringIO  # Python 2; on Python 3, use `from io import StringIO`

def make_ip():
    return '.'.join(str(random.randint(0, 255)) for n in range(4))

df = pd.DataFrame({'ip': [make_ip() for i in range(20000)]})
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df.ip.str.split('.', return_type='frame')  # old API; modern pandas uses expand=True
# 1 loops, best of 3: 3.06 s per loop
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].apply(lambda x: pd.Series(x.split('.')))
# 1 loops, best of 3: 3.1 s per loop
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = \
pd.read_table(StringIO(df['ip'].to_csv(None,index=None)),sep='.',header=None)
# 10 loops, best of 3: 46.4 ms per loop
Ok, so I wanted to compare all of this to just using a Python list and Python's own split, which one would expect to be slower than the supposedly more efficient Pandas:
iplist = list(df['ip'])
%timeit [ x.split('.') for x in iplist ]
# 100 loops, best of 3: 10 ms per loop
What!? Apparently, the best way to do a simple string operation on a large number of strings is to throw out Pandas entirely. Using Pandas makes the process 400 times slower. If you want to use Pandas, though, you may as well just convert to a Python list and back:
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = \
pd.DataFrame([ x.split('.') for x in list(df['ip']) ])
# 100 loops, best of 3: 18.4 ms per loop
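Spelled out without the %timeit magic, that list-based approach might look like this (illustrative frame; the column names are assumptions, and passing index=df.index keeps the join aligned even on a non-default index):

```python
import pandas as pd

df = pd.DataFrame({'ip': ['192.168.0.1', '10.0.0.254']})
# Split in plain Python, then hand the list-of-lists back to pandas once
parts = pd.DataFrame([x.split('.') for x in list(df['ip'])],
                     columns=['ip0', 'ip1', 'ip2', 'ip3'],
                     index=df.index)
df = df.join(parts)
```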
There's something very wrong here.
Answered By - cge