Issue
I have a dataset (around 10 GB) of call records. There's a column with IP addresses that I want to split into four new columns. I'm trying to use:
df['ip'].fillna('0.0.0.0', inplace=True)
df = df.join(df['ip'].apply(lambda x: pd.Series(x.split('.'))))
but it's too slow... the fillna is fast, around 10 seconds, but the split then runs for something like 5 hours...
Is there any better way to do it?
Solution
This answer is outdated, as is this question: the problem identified below was fixed some time ago, and the pandas str.split method should now be fast.
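In modern pandas, the idiomatic one-liner is str.split with expand=True (a minimal sketch on a small made-up frame; the ip0..ip3 column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ip': ['192.168.0.1', '10.0.0.254']})
# expand=True makes str.split return a DataFrame, one column per part,
# which can be assigned to several new columns at once
df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].str.split('.', expand=True)
```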
It turns out that str.split in Pandas (in core/strings.py, as str_split) is actually very slow; it isn't any more efficient than plain Python iteration, since it still loops through the values in Python, offering no speedup whatsoever.
Actually, see below: Pandas' performance here is simply miserable. It's not just a matter of Python vs. C iteration, because doing the same thing with plain Python lists is the fastest method!
Interestingly, though, there's a trick solution that's much faster: write the Series out to text, then read it back in with '.' as the separator:
df[['ip0', 'ip1', 'ip2', 'ip3']] = \
    pd.read_table(StringIO(df['ip'].to_csv(None, index=None)), sep='.', header=None)
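On Python 3 the same trick would look roughly like this (a sketch: io.StringIO replaces the Python 2 StringIO module, read_csv replaces the deprecated read_table, and .to_numpy() sidesteps index/column alignment on assignment):

```python
import io
import pandas as pd

df = pd.DataFrame({'ip': ['192.168.0.1', '10.0.0.254']})
# Write the column out as plain text, then re-read it with '.' as the separator
buf = io.StringIO(df['ip'].to_csv(index=False, header=False))
parts = pd.read_csv(buf, sep='.', header=None, dtype=str)
df[['ip0', 'ip1', 'ip2', 'ip3']] = parts.to_numpy()
```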
To compare, I use Marius' code and generate 20,000 IPs:
import pandas as pd
import random
import numpy as np
from StringIO import StringIO  # Python 2; on Python 3, use `from io import StringIO`

def make_ip():
    return '.'.join(str(random.randint(0, 255)) for n in range(4))

df = pd.DataFrame({'ip': [make_ip() for i in range(20000)]})
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df.ip.str.split('.', return_type='frame')  # old API; modern pandas uses expand=True
# 1 loops, best of 3: 3.06 s per loop
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].apply(lambda x: pd.Series(x.split('.')))
# 1 loops, best of 3: 3.1 s per loop
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = \
pd.read_table(StringIO(df['ip'].to_csv(None,index=None)),sep='.',header=None)
# 10 loops, best of 3: 46.4 ms per loop
Ok, so I wanted to compare all of this to just using a Python list and Python's own split, which one would expect to be slower than the supposedly more efficient Pandas:
iplist = list(df['ip'])
%timeit [ x.split('.') for x in iplist ]
# 100 loops, best of 3: 10 ms per loop
What!? Apparently, the best way to do a simple string operation on a large number of strings is to throw out Pandas entirely. Using Pandas makes the process 400 times slower. If you want to use Pandas, though, you may as well just convert to a Python list and back:
%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = \
pd.DataFrame([ x.split('.') for x in list(df['ip']) ])
# 100 loops, best of 3: 18.4 ms per loop
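Spelled out without the %timeit magic, that list-based approach might look like this (illustrative frame; the column names are assumptions, and passing index=df.index keeps the join aligned even on a non-default index):

```python
import pandas as pd

df = pd.DataFrame({'ip': ['192.168.0.1', '10.0.0.254']})
# Split in plain Python, then hand the list-of-lists back to pandas once
parts = pd.DataFrame([x.split('.') for x in list(df['ip'])],
                     columns=['ip0', 'ip1', 'ip2', 'ip3'],
                     index=df.index)
df = df.join(parts)
```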
There's something very wrong here.
Answered By - cge