Sunday, September 11, 2022

[FIXED] Removing rows from a Data Frame column which contains lists if a specific string is within the list

Labels: pandas, python

Issue

Suppose I have a DataFrame df2 with a column called 'elements', where each row contains a list of objects, as shown below:

print(df2['elements'])

0       [Element B, Element Cr, Element Re]
1       [Element B, Element Rh, Element Sc]
2       [Element B, Element Mo, Element Y]
3       [Element Al, Element B, Element Lu]
4       [Element B, Element Dy, Element Os]

I would like to search through the column and, if a row contains Element Mo (for example), delete the whole row so that the result looks like this:

print(df2['elements'])

0       [Element B, Element Cr, Element Re]
1       [Element B, Element Rh, Element Sc]
2       [Element Al, Element B, Element Lu]
3       [Element B, Element Dy, Element Os]

I'm currently trying to do it with a for loop and if statements like this:

for entry in df2['elements']:
    if 'Element Mo' in entry:
        df2.drop(index=[entry],axis=0, inplace=True)
    else:
        continue

But it does not work and raises KeyError: [] not found in axis.

Update:

I just realized that the if/in approach I showed does not match only exact strings; it also matches strings that contain the target string. For example, with the updated df below:

print(df2['elements'])

0       [Element B, Element Cr, Element Re]
1       [Element B, Element Rh, Element Sc]
2       [Element B, Element Mo, Element Y]
3       [Element Al, Element B, Element Lu]
4       [Element Mop, Element B, Element Lu]      
5       [Element B, Element Dy, Element Os]

If I run a for loop with if/in statements like this:

for ind in df2.index.values:
    entry = df2.loc[ind, 'elements']
    if 'Element Mo' in entry:
        df2.drop(index=ind ,axis=0, inplace=True)

Both rows 2 and 4 will be dropped from the df because the string 'Element Mop' contains the string 'Element Mo', but I don't want this to happen. I tried updating the code above with a regex, as shown below, but it doesn't work.

for ind in df2.index.values:
    entry = df2.loc[ind, 'elements']
    if '\bElement Mo\b' in entry:
        df2.drop(index=ind, axis=0, inplace=True)

Edit #2: Here is the dictionary of the first 25 items of the column:

df2_dict = df2['elements'].head(25).to_dict()

{0: '[Element B, Element Cr, Element Re]', 1: '[Element B, Element Rh, Element Sc]', 2: '[Element B, Element Mo, Element Y]', 3: '[Element Al, Element B, Element Lu]', 4: '[Element B, Element Dy, Element Os]', 5: '[Element B, Element Fe, Element Sc]', 6: '[Element B, Element Cr, Element W]', 7: '[Element B, Element Ni]', 9: '[Element B, Element Pr, Element Re]', 10: '[Element B, Element Cr, Element V]', 11: '[Element B, Element Co, Element Si]', 12: '[Element B, Element Co, Element Yb]', 13: '[Element B, Element Lu, Element Yb]', 14: '[Element B, Element Ru, Element Yb]', 15: '[Element B, Element Mn, Element Pd]', 16: '[Element B, Element Co, Element Tm]', 17: '[Element B, Element Fe, Element W]', 19: '[Element B, Element Ru, Element Y]', 20: '[Element B, Element Ga, Element Ta]', 21: '[Element B, Element Ho, Element Re]', 22: '[Element B, Element Si]', 23: '[Element B, Element Ni, Element Te]', 24: '[Element B, Element Nd, Element S]', 25: '[Element B, Element Ga, Element Rh, Element Sc]', 26: '[Element B, Element Co, Element La]'}

The actual issue here is that if I try to drop rows that contain the string 'Element S' (in row 24), all entries with elements like 'Element Sc' or 'Element Si' are also removed.

Solution

A pandas Series is sort of like a dictionary, where the keys are the index and the values are the series values.

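Iterating over df2['elements'] yields the values, so entry is a value from the column rather than an index label, and drop raises a KeyError because it cannot find that value in the index. You could loop over the index instead and use each label to look up the value, e.g.: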

for ind in df2.index.values:
    entry = df2.loc[ind, "elements"]
    if 'Element Mo' in entry:
        df2.drop(index=ind, axis=0, inplace=True)
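
To make the difference between values and index labels concrete, here is a minimal sketch. The three-row frame is made up for illustration and assumes the cells hold real Python lists; it is not the asker's data.

```python
import pandas as pd

# Hypothetical three-row frame standing in for df2.
df = pd.DataFrame({"elements": [
    ["Element B", "Element Cr"],
    ["Element B", "Element Mo"],
    ["Element Al", "Element B"],
]})

# drop() looks its argument up in the index, so passing a cell value fails.
try:
    df.drop(index=["Element Mo"])
    raised = False
except KeyError:
    raised = True

# Series.items() yields (label, value) pairs, so the labels can be collected
# first and dropped in a single call.
labels = [label for label, entry in df["elements"].items() if "Element Mo" in entry]
result = df.drop(index=labels)

print(raised, labels, list(result.index))
```

Collecting the labels first also avoids modifying the DataFrame while looping over it.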

However, a vectorized solution would be far better. That isn't really possible with a Series of lists (storing lists in cells breaks the pandas data model), but you can at least build a boolean mask and subset the DataFrame once instead of dropping rows one at a time. For example:

in_values = df2["elements"].apply(lambda x: "Element Mo" in x)
dropped = df2.loc[~in_values]
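
As a rough, self-contained version of the same idea, the sketch below assumes the column holds real Python lists; the sample rows are invented for illustration. With real lists, `in` compares whole elements, so a row containing only 'Element Mop' is kept.

```python
import pandas as pd

# Made-up sample rows; each cell is a genuine Python list.
df = pd.DataFrame({"elements": [
    ["Element B", "Element Cr", "Element Re"],
    ["Element B", "Element Mo", "Element Y"],
    ["Element Mop", "Element B", "Element Lu"],
]})

# True where the list contains the exact element "Element Mo".
mask = df["elements"].apply(lambda items: "Element Mo" in items)

# Keep every row where the mask is False; the original index labels survive.
kept = df.loc[~mask]

# Optional: renumber the remaining rows from 0.
renumbered = kept.reset_index(drop=True)
```

Whether to call reset_index depends on whether the original row labels still matter downstream.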

Update

After your edits, it looks like we're actually dealing with strings which look like lists! In that case, you're probably looking for a regular expression that matches a complete "Element", bounded by either whitespace, a comma, or a bracket character. Pandas has a number of string methods, and regular expressions may be passed to pd.Series.str.contains with regex=True (the default).

I'll use the following regular expression to match a string preceded by a [ or a comma, then any amount of whitespace, then Element Mo, followed by any amount of whitespace and either ] or a comma:

r"(?<=[\[,])\s*Element Mo\s*(?=[,\]])"

Pandas uses the same syntax as the built-in Python re module; see that module's documentation for the full mini-language reference.
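
Because the syntax is shared, the pattern can be checked with the re module alone before handing it to pandas. The sketch below is only an illustration: element_pattern is a hypothetical helper (not part of pandas or re) that builds the same pattern for any element name, and the sample strings are taken from the question's dictionary.

```python
import re

def element_pattern(name):
    """Hypothetical helper: build the bounded pattern for one element name."""
    # re.escape guards against names that contain regex metacharacters.
    return r"(?<=[\[,])\s*" + re.escape(name) + r"\s*(?=[,\]])"

pattern = re.compile(element_pattern("Element S"))

rows = [
    "[Element B, Element Nd, Element S]",
    "[Element B, Element Si]",
    "[Element B, Element Ga, Element Rh, Element Sc]",
]

# Only the first string contains "Element S" as a complete element.
hits = [bool(pattern.search(row)) for row in rows]
print(hits)  # [True, False, False]
```

The lookbehind and lookahead check the surrounding delimiters without consuming them, which is what keeps 'Element Si' and 'Element Sc' from matching.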

Applying this as a filter allows us to see the exact matches:

In [12]: df2.elements[df2.elements.str.contains(r"(?<=[\[,])\s*Element Mo\s*(?=[,\]])", regex=True)]
Out[12]:
2    [Element B, Element Mo, Element Y]
Name: elements, dtype: object

Similarly, we can invert the match and exclude any rows matching our filter:

In [13]: df2.elements[~df2.elements.str.contains(r"(?<=[\[,])\s*Element Mo\s*(?=[,\]])", regex=True)]
Out[13]:
0                 [Element B, Element Cr, Element Re]
1                 [Element B, Element Rh, Element Sc]
3                 [Element Al, Element B, Element Lu]
4                 [Element B, Element Dy, Element Os]
5                 [Element B, Element Fe, Element Sc]
6                  [Element B, Element Cr, Element W]
7                             [Element B, Element Ni]
9                 [Element B, Element Pr, Element Re]
10                 [Element B, Element Cr, Element V]
11                [Element B, Element Co, Element Si]
12                [Element B, Element Co, Element Yb]
13                [Element B, Element Lu, Element Yb]
14                [Element B, Element Ru, Element Yb]
15                [Element B, Element Mn, Element Pd]
16                [Element B, Element Co, Element Tm]
17                 [Element B, Element Fe, Element W]
19                 [Element B, Element Ru, Element Y]
20                [Element B, Element Ga, Element Ta]
21                [Element B, Element Ho, Element Re]
22                            [Element B, Element Si]
23                [Element B, Element Ni, Element Te]
24                 [Element B, Element Nd, Element S]
25    [Element B, Element Ga, Element Rh, Element Sc]
26                [Element B, Element Co, Element La]
Name: elements, dtype: object
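
For completeness, here is a self-contained sketch that filters a whole DataFrame of list-like strings for 'Element S', the case raised in the question. The three rows are copied from the question's dictionary. The second half shows an alternative that is my own suggestion rather than part of the approach above: parse each string into a real list and test exact membership. It assumes element names never contain commas or brackets.

```python
import pandas as pd

# Three list-like strings copied from the question's dictionary.
df = pd.DataFrame({"elements": [
    "[Element B, Element Nd, Element S]",
    "[Element B, Element Co, Element Si]",
    "[Element B, Element Ga, Element Rh, Element Sc]",
]})

# Regex approach: match "Element S" only as a complete element.
pattern = r"(?<=[\[,])\s*Element S\s*(?=[,\]])"
regex_mask = df["elements"].str.contains(pattern, regex=True)

# Alternative (assumption: names contain no commas or brackets):
# turn each string into a list, then test exact membership.
parsed = df["elements"].apply(
    lambda s: [item.strip() for item in s.strip("[]").split(",")]
)
list_mask = parsed.apply(lambda items: "Element S" in items)

# Drop the matching rows from the DataFrame and renumber the index.
filtered = df.loc[~regex_mask].reset_index(drop=True)
```

Both masks should agree on data shaped like this. Parsing once may be more convenient if several different elements need to be checked.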

Answered By - Michael Delgado

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
