Issue
I have a multi-column DataFrame of Flickr tags with 41,000 rows, and I want to remove all the <a href> tags from one of the columns. BeautifulSoup is a superb package for parsing HTML documents, but I find it hard to apply it to only one column while leaving the other columns intact, with Python code that is as simple as possible.
This is my code for BeautifulSoup:
from bs4 import BeautifulSoup

def remove_link(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()
My DataFrame looks like this:
column1                            column2                                      column3
<a href="www.asia.com">Breda</a>   <a href="www.stackoverflow.com">result</a>   25,000
But I haven't yet figured out how to apply this function with a lambda, because I want the code to be as simple as possible.
I want to remove the <a href> tags only from column2, so the output should look like this:
column1                            column2   column3
<a href="www.asia.com">Breda</a>   result    25,000
Solution
Apparently, it is as easy as this:
df['column2'] = df['column2'].apply(lambda x: remove_link(x))
I was confused about the remove_link(x) part, which is why I posted here. It works!
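For anyone who wants to try this end to end, here is a minimal sketch. The two-row DataFrame is toy data modelled on the example above (the second row and its values are made up), and remove_link_safe is a hypothetical helper that is not part of the original code: it assumes the real column might contain missing values and simply passes them through instead of handing them to BeautifulSoup.

```python
import pandas as pd
from bs4 import BeautifulSoup


def remove_link(text):
    # Parse the HTML fragment and keep only its visible text.
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()


def remove_link_safe(value):
    # Hypothetical guard: leave non-string cells (None, NaN) untouched.
    if not isinstance(value, str):
        return value
    return remove_link(value)


# Toy data; the second row is invented to show a missing value.
df = pd.DataFrame({
    'column1': ['<a href="www.asia.com">Breda</a>',
                '<a href="www.example.com">Tokyo</a>'],
    'column2': ['<a href="www.stackoverflow.com">result</a>', None],
    'column3': ['25,000', '1,200'],
})

# Only column2 is reassigned, so column1 and column3 stay as they were.
# Passing the function itself is equivalent to lambda x: remove_link_safe(x).
df['column2'] = df['column2'].apply(remove_link_safe)

print(df.loc[0, 'column2'])  # result
print(df.loc[0, 'column1'])  # still contains the <a href> tag
```

If the column is known to hold only strings, df['column2'].apply(remove_link) does the same job without the guard.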
Answered By - Jack Zaki Zakiul Fahmi Jailani