Issue
Please help me strip the unnecessary parts out of a text column.
I have an example dataset:
df = pd.DataFrame({'addressfrom': ['Hüseyinağa, Rexee Hotel, Büyük Bayram Sokak', 'Rixos Premium', '123 Main St, Hotel Hilton Antalya', 'Residence Hotel & SPA, 1234']})
and a list of keywords:
keywords = ['hotel', 'resort', 'hilton', 'novotel', 'rixos', 'palace', 'residence', 'radisson', 'holiday', 'apartments', 'plaza', 'inn', 'club', 'spa']
I'm trying to extract the part of each string that contains a keyword and drop the text around it. The parts are separated by ',' (in some cases by '-'). In the end, I want the following result:
index | addressfrom |
---|---|
0 | Rexee Hotel |
1 | Rixos Premium |
2 | Hotel Hilton Antalya |
3 | Residence Hotel & SPA |
The best I managed to achieve was this:
`import re
import pandas as pd
df = pd.DataFrame({'addressfrom': ['Hüseyinağa, Rexee Hotel, Büyük Bayram Sokak', 'Rixos Premium', '123 Main St, Hotel Hilton Antalya', 'Residence Hotel & SPA, 1234']})
keywords = ['hotel', 'resort', 'hilton', 'novotel', 'rixos', 'palace', 'residence', 'radisson', 'holiday', 'apartments', 'plaza', 'inn', 'club', 'spa']
pattern = f'[^,]*({"|".join(keywords)})[^,]*'
df['addressfrom'] = df['addressfrom'].str.extract(pattern, flags=re.IGNORECASE)
print(df)`
Output:
index | addressfrom |
---|---|
0 | Hotel |
1 | Resort |
2 | Hilton |
3 | Rixos |
Solution
One way to achieve this is to treat the comma as the separator and return the whole comma-delimited part that contains a keyword, rather than only the keyword itself. `str.extract` returns just the capture group, which is why your attempt gave back single keywords; `re.search` with `match.group(0)` returns the full match instead. Something like:
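import re

def extract_keywords(s, keywords):
    # \b keeps short keywords such as 'inn' from matching inside longer words
    pattern = f'[^,]*\\b({"|".join(keywords)})\\b[^,]*'
    match = re.search(pattern, s, flags=re.IGNORECASE)
    # group(0) is the whole comma-delimited part; strip() drops the space left after the comma
    return match.group(0).strip() if match else None

df['addressfrom'] = df['addressfrom'].apply(lambda x: extract_keywords(x, keywords))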
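The question also mentions '-' as a possible separator, which a comma-only pattern does not cover. As a rough sketch (a variation, not part of the code above), the string can be split on either separator and the first part containing a keyword returned. The helper name `extract_segment` is hypothetical, and treating '-' as a separator only when it has whitespace on both sides is an assumption, made so that hyphenated names stay in one piece:

```python
import re

keywords = ['hotel', 'resort', 'hilton', 'novotel', 'rixos', 'palace',
            'residence', 'radisson', 'holiday', 'apartments', 'plaza',
            'inn', 'club', 'spa']

# Whole-word, case-insensitive match for any keyword.
KEYWORD_RE = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b',
    re.IGNORECASE,
)

# Assumption: ',' always separates parts; '-' only when surrounded by spaces.
SEPARATOR_RE = re.compile(r'\s*,\s*|\s+-\s+')


def extract_segment(address):
    """Return the first separator-delimited part containing a keyword, else None."""
    for part in SEPARATOR_RE.split(address):
        if KEYWORD_RE.search(part):
            return part.strip()
    return None


addresses = [
    'Hüseyinağa, Rexee Hotel, Büyük Bayram Sokak',
    'Rixos Premium',
    '123 Main St, Hotel Hilton Antalya',
    'Residence Hotel & SPA, 1234',
]
results = [extract_segment(a) for a in addresses]
print(results)
```

With the DataFrame from the question, the same helper should drop in as `df['addressfrom'] = df['addressfrom'].apply(extract_segment)`. Rows without any keyword come back as `None`.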
Answered By - mandy8055