Issue
I have a column in a dataframe consisting of lists of URLs.
index url_all
0 ['https://google.com/7TU4za', 'http://twitter.com/d']
1 ['https://google.com/7TU4bb', 'facebook.com']
2 ['https://google.com/7TU4bc', 'https://twitter.com/a']
3 ['http://google.com/7TU4ad', 'https://twitter.com/b']
4 ['https://google.com/7TU4ze', 'twitter.com/c']
I want to remove elements from each list if they start with 'http' or 'https'. The desired output is shown below.
index url_all
0 []
1 ['facebook.com']
2 []
3 []
4 ['twitter.com/c']
So far I have tried the following, but it did not work.
df['url_all'] = df['url_all'].apply(lambda lst: [x for x in lst if not x.startswith("'http|'https")])
It gives the output below (for brevity, only the first few rows are shown):
url_all
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, :, /, /, g, o, o, g, ..]]
[[, ', h, t, t, p, s, :, /, /, g, o, o, g, ..]]
How can I do that please?
Solution
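The character-by-character output suggests that each cell holds the string representation of a list rather than an actual list, so the comprehension iterates over single characters. You can use .apply() with ast.literal_eval() to parse each string into a list first, then filter it (note that anything that starts with "https" will also start with "http", per a suggestion from Nick):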
import ast
df['url_all'] = (df['url_all']
.apply(ast.literal_eval)
.apply(lambda lst: [x for x in lst if not x.startswith("http")]))
This outputs:
index url_all
0 0 []
1 1 [facebook.com]
2 2 []
3 3 []
4 4 [twitter.com/c]
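As a rough sketch of the same idea without pandas, the helper below handles a cell that is either a string such as "['a', 'b']" or an already parsed list. The name drop_http_urls and the sample cells are made up for illustration and are not part of the original answer.

```python
import ast


def drop_http_urls(cell):
    """Return the URLs in `cell` that do not start with 'http'."""
    # Assumption: a string cell is the repr of a list, e.g. "['a', 'b']".
    urls = ast.literal_eval(cell) if isinstance(cell, str) else cell
    # "https://..." also starts with "http", so one prefix covers both.
    return [u for u in urls if not u.startswith("http")]


# Hypothetical sample cells: two stored as strings, one as a real list.
cells = [
    "['https://google.com/7TU4za', 'http://twitter.com/d']",
    "['https://google.com/7TU4bb', 'facebook.com']",
    ["https://google.com/7TU4ze", "twitter.com/c"],
]

cleaned = [drop_http_urls(c) for c in cells]
print(cleaned)  # [[], ['facebook.com'], ['twitter.com/c']]
```

On a dataframe, the same helper could be passed directly to the column, as in df['url_all'].apply(drop_http_urls). Note also that str.startswith() matches its argument literally, so a pattern like "'http|'https" is not treated as a regular expression; to test several prefixes, pass a tuple such as ("http://", "https://").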
Answered By - BrokenBenchmark