Issue
I am working with Pandas extract()
method to feed the capture group into groupby
. A small example of the result of this process is something like the one illustrated below:
import pandas as pd
import re
from io import StringIO
DATA = StringIO('''
colA;colB;colC
Foo;1;1
Bar;2;2
Foo,Bar;3;3
''')
df = pd.read_csv(DATA, sep=';')
m1 = df['colA'].str.extract('(Bar|Foo)', flags=re.IGNORECASE, expand=False)
for t, _d in df.groupby(m1):
t
_d
# 'Bar'
# colA colB colC
# 1 Bar 2 2
#
# 'Foo'
# colA colB colC
# 0 Foo 1 1
# 2 Foo,Bar 3 3
However, the row with index 2
(third row) is only captured in the Foo
group, whereas I wanted to capture it in both Foo
and Bar
.
Playing around with the extractall()
method seems to capture all matched groups, but apparently cannot be used together with groupby()
because pandas complaints about the grouper not being 1-dimensional: ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
m2 = df['colA'].str.extractall('(Bar|Foo)', flags=re.IGNORECASE)
# 0
# match
# 0 0 Foo
# 1 0 Bar
# 2 0 Foo
# 1 Bar
The desired output for groupby()
would be something like the following:
for t, _d in df.groupby(somematch):
t
_d
# 'Bar'
# colA colB colC
# 1 Bar 2 2
# 2 Foo,Bar 3 3
#
# 'Foo'
# colA colB colC
# 0 Foo 1 1
# 2 Foo,Bar 3 3
Any suggestions are welcome.
Solution
groupby
is designed so that each row belongs to exactly one group, which is not true in your use-case. You'd need to extract the rows manually. One way to do so:
words = ['Foo','Bar']
for word in words:
print(df.loc[df['colA'].str.contains(word)])
Output:
colA colB colC
0 Foo 1 1
2 Foo,Bar 3 3
colA colB colC
1 Bar 2 2
2 Foo,Bar 3 3
Or, if your data is given as in the example, you can use get_dummies
to split the keywords within each cell:
keys = df['colA'].str.lower().str.get_dummies(',')
for key,data in keys.items():
print(df.loc[data.eq(1)])
Answered By - Quang Hoang
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.