Tuesday, January 2, 2024

[FIXED] Dropping rows from dataframe if they contain a string within a series


Issue

I've got a dataframe with a column called "text" in which each row holds a series of strings, e.g. [Joe, Biden, Is, President].

I'm trying to drop every row that contains the word "Joe" in the column "text". To do this I wrote:

dfl[~dfl['text'].str.contains("Joe", na=False)]

I thought this would work, but it's just returning the full dataframe again.

I would also like to create a new dataframe with just the rows that contain "Joe" in the text column. Any help with that would be appreciated too!

To clarify, the table looks like:

index	label	score	text	ID
0	NEGATIVE	0.983319103717804	perrosloja,Expresoec,Es,del,partido,social,cristiano,la,lista,6,SenateGOP,SenateDems,JoeBiden,SenBobCasey,AsambleaEcuador,OEADDOT,ONUGeneve,ONUecuador,NoticiasONU,ilo,ONUes,Hay,2,bandos,de,mafias,izquierda,la,derecha,mafias,polc3adticas,ActualidadRT,TelemundoNews,CNNEE,soyfdelrincon,httpstcoYQnZsBKKdF	0
1	NEGATIVE	0.990364134311676	MolinaPvanya,JoeBiden,httpstcojAsJ08durF	1
2	NEGATIVE	0.8683468103408813	Iowa4Nikki,Whoes,Best,Person,PresidentnnDonald,Trump,Robert,F,Kennedy,Vivek,Ramaswamy,rest,presidential,candidates,except,Joe,Biden,good,candidates,due,respect,none,best,best,one,job,United,States,needse280a6	2
3	POSITIVE	0.999308705329895	amazing,JoeBiden,calls,Dick,f09fa4a3,httpstcopsV0uqG8aL	3
4	NEGATIVE	0.7860859036445618	ChrisDJackson,Whoes,Best,Person,PresidentnnDonald,Trump,Robert,F,Kennedy,Vivek,Ramaswamy,rest,presidential,candidates,except,Joe,Biden,good,candidates,due,respect,none,best,best,one,job,United,States,needse280a6	4
5	NEGATIVE	0.9982330799102783	PalBint,JoeBiden,much,blow,around,fake,White,House,probably,none	5
6	POSITIVE	0.842793345451355	thehill,e2809cJoeBiden,got,81million,votes,us,presidential,historye2809d,Even,Barack,Hussein,nnSo,wtf,Bolsheviks,afraid,actual,democracy,nf09fa4a3f09fa4a3f09fa4a3f09fa4a3f09fa4a3	6
7	NEGATIVE	0.998753547668457	tfbow,JimJordan,Weaponization,HouseGOP,JudiciaryGOP,HouseDemocrats,SenateGOP,SenateDems,MaineSenateGOP,2020,Election,Lawfully,CertifiablenJoe,Biden,Win,2020,ElectionnnThe,2022,AZ,Gubernatorial,Election,CertifiablenKatie,Hobbs,Win,2022,ElectionnnThe,Brunsons,Heroes,Others,Follow,LitigationnnDry,WeepyEye	7
8	NEGATIVE	0.9963979721069336	JoeBiden,Hows,youre,boss,httpstcogfuDElJdG1	8
9	NEGATIVE	0.9973702430725098	mitchellvii,Joe,Biden,committed,treason,intentionally,ensuring,stream,illegal,immigrants,remains,unhindered,least,take,ballot	9
10	NEGATIVE	0.9728578925132751	mirandadevine,DavidHo71155831,JoeBiden,BarackObama,AliMayorkas,SecBlinken,belong,prison	10

Here is the output of df.loc[0, 'text']:

['perrosloja', 'Expresoec', 'Es', 'del', 'partido', 'social', 'cristiano', 'la', 'lista', '6', 'SenateGOP', 'SenateDems', 'JoeBiden', 'SenBobCasey', 'AsambleaEcuador', 'OEADDOT', 'ONUGeneve', 'ONUecuador', 'NoticiasONU', 'ilo', 'ONUes', 'Hay', '2', 'bandos', 'de', 'mafias', 'izquierda', 'la', 'derecha', 'mafias', 'polc3adticas', 'ActualidadRT', 'TelemundoNews', 'CNNEE', 'soyfdelrincon', 'httpstcoYQnZsBKKdF']

So yes, I guess it is a list of strings.

Solution

If I understand correctly, each row of the text column contains a list of words. You can try:

m = df.loc[df['text'].notna(), 'text'].map(' '.join).str.contains('Joe', case=False)

joe = df.loc[m[m].index]

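The code above joins the words of each row into a single sentence, then uses str.contains to look for 'Joe' (case-insensitive). Rows where text is NaN are skipped by the notna() filter, and m[m].index keeps only the labels of the rows where the mask is True.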

>>> df['text'].map(' '.join)
0    Joe Biden Sent Spinning Federal Court
1                           Election Cycle
2           Donald Trump running next year
Name: text, dtype: object

>>> df['text'].map(' '.join).str.contains('Joe', case=False)
0     True
1    False
2    False
Name: text, dtype: bool

Output:

>>> joe
                                           text
0  [Joe, Biden, Sent, Spinning, Federal, Court]
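Note that str.contains does substring matching, so a token such as 'JoeBiden' (which appears in the question's table) matches as well. If only the exact token 'Joe' should count, a per-list membership test is one alternative. This is a different technique from the answer above, shown as a minimal sketch on made-up sample rows:

```python
import pandas as pd

# Hypothetical sample rows, loosely modelled on the question's data.
df = pd.DataFrame({
    "text": [
        ["Joe", "Biden", "Sent", "Spinning"],
        ["amazing", "JoeBiden", "calls"],
        ["Election", "Cycle"],
    ]
})

# Substring match: True for 'Joe' and also for 'JoeBiden'.
substring_mask = df["text"].map(" ".join).str.contains("Joe")

# Exact-token match: True only when 'Joe' is itself an element of the list.
token_mask = df["text"].map(lambda words: "Joe" in words)

print(substring_mask.tolist())
print(token_mask.tolist())
```

Which of the two masks is appropriate depends on whether handles like 'JoeBiden' should be treated as mentions of "Joe".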

Details:

Input:

>>> df
                                           text
0  [Joe, Biden, Sent, Spinning, Federal, Court]
1                             [Election, Cycle]
2          [Donald, Trump, running, next, year]
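As far as I can tell, the attempt in the question returns the full dataframe because each element of text is a list rather than a string: str.contains cannot search a list, so every row gets the na value (False), and negating an all-False mask keeps every row. A small sketch on the sample input above that reproduces this:

```python
import pandas as pd

df = pd.DataFrame({
    "text": [
        ["Joe", "Biden", "Sent", "Spinning", "Federal", "Court"],
        ["Election", "Cycle"],
        ["Donald", "Trump", "running", "next", "year"],
    ]
})

# Each element is a list, not a str, so the pattern search cannot run on it;
# pandas falls back to the `na` value (False here) for every row.
naive_mask = df["text"].str.contains("Joe", na=False)

# Negating an all-False mask selects every row, i.e. the full dataframe.
unchanged = df[~naive_mask]
print(len(unchanged) == len(df))
```

Joining each list into one string first, as in the solution, gives str.contains real strings to search.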

Update

If the text column has no NaN values and you want two dataframes, one for each case, you can use this code:

m = df['text'].map(' '.join).str.contains('Joe', case=False)

joe = df[m]
oth = df[~m]
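If the column may contain missing values after all, one possible variation is to map them to an empty string before searching, so the same mask can still split the dataframe in two. The sketch below is self-contained; the to_sentence helper and the NaN row are hypothetical additions for illustration, not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "text": [
        ["Joe", "Biden", "Sent", "Spinning", "Federal", "Court"],
        ["Election", "Cycle"],
        np.nan,  # hypothetical missing row, not in the original sample
        ["Donald", "Trump", "running", "next", "year"],
    ]
})

def to_sentence(words):
    """Join a list of words; treat anything else (e.g. NaN) as empty text."""
    return " ".join(words) if isinstance(words, list) else ""

# Boolean mask: True where the joined text contains 'joe' in any casing.
m = df["text"].map(to_sentence).str.contains("Joe", case=False)

joe = df[m]   # rows mentioning Joe
oth = df[~m]  # everything else, including rows with missing text
```

With this choice, rows whose text is missing land in oth; drop them beforehand with dropna(subset=['text']) if that is not wanted.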

Answered By - Corralien

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
