Issue
Suppose I have a DataFrame df2 with a column called 'elements', where each row contains a list of objects, as shown below:
print(df2['elements'])
0 [Element B, Element Cr, Element Re]
1 [Element B, Element Rh, Element Sc]
2 [Element B, Element Mo, Element Y]
3 [Element Al, Element B, Element Lu]
4 [Element B, Element Dy, Element Os]
I would like to search through the column and if, for example, Element Mo is in that row delete the whole row to look like this:
print(df2['elements'])
0 [Element B, Element Cr, Element Re]
1 [Element B, Element Rh, Element Sc]
2 [Element Al, Element B, Element Lu]
3 [Element B, Element Dy, Element Os]
I'm currently trying to do it with a for loop and if statements like this:
for entry in df2['elements']:
    if 'Element Mo' in entry:
        df2.drop(index=[entry],axis=0, inplace=True)
    else:
        continue
But it is not working and giving me a KeyError: [] not found in axis.
Update:
I just realized that the if/in approach I showed does not search for exact string matches; it also matches strings that contain the target string. For example, with the updated df below:
print(df2['elements'])
0 [Element B, Element Cr, Element Re]
1 [Element B, Element Rh, Element Sc]
2 [Element B, Element Mo, Element Y]
3 [Element Al, Element B, Element Lu]
4 [Element Mop, Element B, Element Lu]
5 [Element B, Element Dy, Element Os]
If I run a for loop with if/in statements like this:
for ind in df2.index.values:
    entry = df2.loc[ind, 'elements']
    if 'Element Mo' in entry:
        df2.drop(index=ind ,axis=0, inplace=True)
Both rows 2 and 4 will be dropped from the df because the string 'Element Mop' contains the string 'Element Mo', but I don't want this to happen. I tried updating the code above to use a regex, as shown below, but it doesn't work.
for ind in df2.index.values:
    entry = df2.loc[ind, 'elements']
    if '\bElement Mo\b' in entry:
        df2.drop(index=ind ,axis=0, inplace=True)
Edit #2: Here is the dictionary of the first 25 items of the column:
df2_dict = df2['elements'].head(25).to_dict()
{0: '[Element B, Element Cr, Element Re]', 1: '[Element B, Element Rh, Element Sc]', 2: '[Element B, Element Mo, Element Y]', 3: '[Element Al, Element B, Element Lu]', 4: '[Element B, Element Dy, Element Os]', 5: '[Element B, Element Fe, Element Sc]', 6: '[Element B, Element Cr, Element W]', 7: '[Element B, Element Ni]', 9: '[Element B, Element Pr, Element Re]', 10: '[Element B, Element Cr, Element V]', 11: '[Element B, Element Co, Element Si]', 12: '[Element B, Element Co, Element Yb]', 13: '[Element B, Element Lu, Element Yb]', 14: '[Element B, Element Ru, Element Yb]', 15: '[Element B, Element Mn, Element Pd]', 16: '[Element B, Element Co, Element Tm]', 17: '[Element B, Element Fe, Element W]', 19: '[Element B, Element Ru, Element Y]', 20: '[Element B, Element Ga, Element Ta]', 21: '[Element B, Element Ho, Element Re]', 22: '[Element B, Element Si]', 23: '[Element B, Element Ni, Element Te]', 24: '[Element B, Element Nd, Element S]', 25: '[Element B, Element Ga, Element Rh, Element Sc]', 26: '[Element B, Element Co, Element La]'}
The actual issue here is that if I try to drop rows that contain the string 'Element S' (in row 24), all entries with elements like 'Element Sc' or 'Element Si' are also removed.
Solution
A pandas Series is sort of like a dictionary, where the keys are the index and the values are the series values.
When you iterate over df2['elements'], entry is a value, not an index label, so drop cannot find it in the index. You could instead loop over the index and use each label to look up the value, e.g.:
for ind in df2.index.values:
    entry = df2.loc[ind, "elements"]
    if 'Element Mo' in entry:
        df2.drop(index=ind, axis=0, inplace=True)
However, it would be far better to use a vectorized solution. This isn't really possible with a series of lists (this really breaks the pandas data model), but you could at least subset your series once instead of iteratively reshaping. For example:
in_values = df2["elements"].apply(lambda x: "Element Mo" in x)
dropped = df2.loc[~in_values]
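As a minimal, self-contained sketch of the mask approach above: the sample rows here are made up for illustration, and they assume each cell holds a real Python list of strings (not a string that looks like a list). Under that assumption, `in` tests whole list items, so 'Element Mop' is not treated as a match for 'Element Mo'.

```python
import pandas as pd

# Hypothetical sample data: each cell is a real Python list of strings.
df2 = pd.DataFrame({
    "elements": [
        ["Element B", "Element Cr", "Element Re"],
        ["Element B", "Element Mo", "Element Y"],
        ["Element Mop", "Element B", "Element Lu"],
    ]
})

# Build the boolean mask once; `in` on a list compares whole items, not substrings.
in_values = df2["elements"].apply(lambda x: "Element Mo" in x)

# Keep the rows where the mask is False, then renumber the index from 0.
dropped = df2.loc[~in_values].reset_index(drop=True)
print(dropped)
```

The reset_index(drop=True) call is optional; it only renumbers the remaining rows so the index looks like the desired output in the question.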
Update
After your edits, it looks like we're actually dealing with strings which look like lists! In that case, you're probably looking for a regular expression to make sure you match a complete "Element", bounded by either whitespace, a comma, or a bracket character. Pandas has a number of string methods, and regular expressions may be passed to pd.Series.str.contains with the flag regex=True.
I'll use the following regular expression, which requires a "[" or "," before the match, then allows any amount of whitespace, then matches Element Mo, followed by any amount of whitespace and either of the characters "]" or ",":
r"(?<=[\[,])\s*Element Mo\s*(?=[,\]])"
Pandas uses the same syntax as the builtin Python re module - see that module's documentation for the full mini-language reference.
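To check the pattern outside pandas, here is a small sketch using only the re module. The helper name element_pattern is hypothetical (it is not part of pandas or re); it wraps an escaped element name in the same lookbehind and lookahead shown above, so the target can be swapped for something like 'Element S'.

```python
import re

def element_pattern(name):
    # Hypothetical helper: escape the name and require "[" or "," before it
    # and "," or "]" after it, allowing whitespace on either side.
    return re.compile(r"(?<=[\[,])\s*" + re.escape(name) + r"\s*(?=[,\]])")

mo = element_pattern("Element Mo")
print(bool(mo.search("[Element B, Element Mo, Element Y]")))    # True
print(bool(mo.search("[Element Mop, Element B, Element Lu]")))  # False

s = element_pattern("Element S")
print(bool(s.search("[Element B, Element Nd, Element S]")))     # True
print(bool(s.search("[Element B, Element Co, Element Si]")))    # False
```

Because the lookahead insists on a comma or closing bracket right after the name, a longer name that merely starts with the target does not match.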
Applying this as a filter allows us to see the exact matches:
In [12]: df2.elements[df2.elements.str.contains(r"(?<=[\[,])\s*Element Mo\s*(?=[,\]])", regex=True)]
Out[12]:
2 [Element B, Element Mo, Element Y]
dtype: object
Similarly, we can invert the match and exclude any rows matching our filter:
In [13]: df2[~df2.elements.str.contains(r"(?<=[\[,])\s*Element Mo\s*(?=[,\]])", regex=True)]
Out[13]:
0 [Element B, Element Cr, Element Re]
1 [Element B, Element Rh, Element Sc]
3 [Element Al, Element B, Element Lu]
4 [Element B, Element Dy, Element Os]
5 [Element B, Element Fe, Element Sc]
6 [Element B, Element Cr, Element W]
7 [Element B, Element Ni]
9 [Element B, Element Pr, Element Re]
10 [Element B, Element Cr, Element V]
11 [Element B, Element Co, Element Si]
12 [Element B, Element Co, Element Yb]
13 [Element B, Element Lu, Element Yb]
14 [Element B, Element Ru, Element Yb]
15 [Element B, Element Mn, Element Pd]
16 [Element B, Element Co, Element Tm]
17 [Element B, Element Fe, Element W]
19 [Element B, Element Ru, Element Y]
20 [Element B, Element Ga, Element Ta]
21 [Element B, Element Ho, Element Re]
22 [Element B, Element Si]
23 [Element B, Element Ni, Element Te]
24 [Element B, Element Nd, Element S]
25 [Element B, Element Ga, Element Rh, Element Sc]
26 [Element B, Element Co, Element La]
dtype: object
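An alternative to the regular expression, sketched here and not part of the original answer, is to parse each string into a real list and then test exact membership. This assumes every cell has the form "[name, name, ...]" with no commas or brackets inside the names; parse_elements is a hypothetical helper and the three sample rows are taken from the dictionary in the question.

```python
import pandas as pd

# Sample rows copied from the question: strings that only look like lists.
df2 = pd.DataFrame({
    "elements": [
        "[Element B, Element Nd, Element S]",
        "[Element B, Element Ga, Element Rh, Element Sc]",
        "[Element B, Element Co, Element Si]",
    ]
})

def parse_elements(text):
    # Remove the outer brackets, split on commas, and trim whitespace around each name.
    return [item.strip() for item in text.strip("[]").split(",")]

parsed = df2["elements"].apply(parse_elements)

# Exact membership on the parsed lists: 'Element S' does not match 'Element Sc' or 'Element Si'.
has_target = parsed.apply(lambda items: "Element S" in items)
result = df2.loc[~has_target].reset_index(drop=True)
print(result)
```

Parsing once also leaves a list-valued Series around for later lookups, at the cost of an extra pass over the column.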
Answered By - Michael Delgado