Wednesday, February 7, 2024

[FIXED] Pandas groupby with capture groups from extractall

February 07, 2024 pandas, python, python-3.x No comments

Issue

I am working with Pandas extract() method to feed the capture group into groupby. A small example of the result of this process is something like the one illustrated below:

import pandas as pd
import re
from io import StringIO

DATA = StringIO('''
colA;colB;colC
Foo;1;1
Bar;2;2
Foo,Bar;3;3
''')
df = pd.read_csv(DATA, sep=';')

m1 = df['colA'].str.extract('(Bar|Foo)', flags=re.IGNORECASE, expand=False)

for t, _d in df.groupby(m1):
    t
    _d

# 'Bar'
#   colA  colB  colC
# 1  Bar     2     2
#
# 'Foo'
#       colA  colB  colC
# 0      Foo     1     1
# 2  Foo,Bar     3     3

However, the row with index 2 (third row) is only captured in the Foo group, whereas I wanted to capture it in both Foo and Bar.

Playing around with the extractall() method seems to capture all matched groups, but apparently cannot be used together with groupby() because pandas complaints about the grouper not being 1-dimensional: ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional

m2 = df['colA'].str.extractall('(Bar|Foo)', flags=re.IGNORECASE)

#            0
#   match
# 0 0      Foo
# 1 0      Bar
# 2 0      Foo
#   1      Bar

The desired output for groupby() would be something like the following:

for t, _d in df.groupby(somematch):
    t
    _d

# 'Bar'
#       colA  colB  colC
# 1      Bar     2     2
# 2  Foo,Bar     3     3
#
# 'Foo'
#       colA  colB  colC
# 0      Foo     1     1
# 2  Foo,Bar     3     3

Any suggestions are welcome.

Solution

groupby is designed so that each row belongs to exactly one group, which is not true in your use-case. You'd need to extract the rows manually. One way to do so:

words = ['Foo','Bar']
for word in words:
    print(df.loc[df['colA'].str.contains(word)])

Output:

      colA  colB  colC
0      Foo     1     1
2  Foo,Bar     3     3
      colA  colB  colC
1      Bar     2     2
2  Foo,Bar     3     3

Or, if your data is given as in the example, you can use get_dummies to split the keywords within each cell:

keys = df['colA'].str.lower().str.get_dummies(',')
for key,data in keys.items():
    print(df.loc[data.eq(1)])

Answered By - Quang Hoang

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, February 7, 2024

[FIXED] Pandas groupby with capture groups from extractall

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels