Issue
I have a piece of code that extracts the text from several PDFs and puts it into a list of lists called pages_text.
Now that I have my text in lists, I'm trying to clean it of special characters using this code:
import re

for i in len(pages_text):
    pages_text[i] = pages_text[i].lower()
    re.sub('™', "", pages_text[i])
    re.sub('[\n]', "", pages_text[i])
    re.sub("'\n'", "", pages_text[i])
    re.sub('[™]', '', pages_text[i])
    re.sub('fl', '', pages_text[i])
    re.sub('\nŒ', '', pages_text[i])
    re.findall(r"\s+", pages_text[i])
print(pages_text)
But it isn't quite working: the special characters aren't being removed.
My question is:
- Can someone help me troubleshoot my cleaning process?
Grateful for any help pointing me in the right direction!
Solution
Python strings are immutable, and re.sub() does not modify them in place: it returns a new string, which you have to assign back in place of the original. (Your loop header also needs range(len(pages_text)) rather than len(pages_text), which raises a TypeError.)
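A minimal illustration of the reassignment, using a made-up string:

```python
import re

s = "Hello™World"
re.sub('™', '', s)      # returns a new string; s itself is unchanged
s = re.sub('™', '', s)  # reassign to keep the cleaned result
print(s)                # HelloWorld
```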
Also, instead of using multiple regular expressions, you can combine these much more efficiently into a single regexp. For example:
special_chars_re = re.compile('[™flŒ\n]')
for idx, line in enumerate(pages_text):
    pages_text[idx] = special_chars_re.sub('', line.lower())
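As a quick sanity check, running this on invented sample strings (pages_text below is made-up data, not the asker's PDF output):

```python
import re

special_chars_re = re.compile('[™flŒ\n]')

pages_text = ["Hi™ One\nTwo", "Page™ Nine"]
for idx, line in enumerate(pages_text):
    pages_text[idx] = special_chars_re.sub('', line.lower())

print(pages_text)  # ['hi onetwo', 'page nine']
```

Note that deleting '\n' joins the words on either side of the line break; substitute ' ' instead of '' if you want to keep a space there.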
As for the rest of your questions: please keep posts to one question at a time, so as not to risk your question being closed as too broad.
Answered By - Iguananaut