Issue
From a PDF file I extract all the text as a string and convert it into a list by splitting on runs of newlines, on runs of two or more whitespace characters, and on every dot (.). Now, if a value in the list consists only of special characters, I want that value to be excluded.
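import re
import PyPDF2

pdfFileObj = open('Python String.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
text = pageObj.extractText()
# Split on runs of newlines, on dots, and on runs of two or more whitespace characters
z = re.split(r"\n+|[.]|\s{2,}", text)
# Drop the empty strings left behind by the split
while "" in z:
    z.remove("")
print(z)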
My output is:
['split()', 'method in Python split a string into a list of strings after breaking the', 'given string by the specified separator', 'Syntax', ':', 'str', 'split(separator, maxsplit)', 'Parameters', ':', 'separator', ':', 'This is a delimiter', ' The string splits at this specified separator', ' If is', 'no', 't provided then any white space is a separator', 'maxsplit', ':', 'It is a number, which tells us to split the string into maximum of provi', 'ded number of times', ' If it is not provided then the default is', '-', '1 that means there', 'is no limit', 'Returns', ':', 'Returns a list of s', 'trings after breaking the given string by the specifie', 'd separator']
Some of these values contain only special characters (for example ':' and '-'), and I want to remove them. Thanks.
Solution
Use a regular expression that tests if a string contains any letters or numbers.
import re
z = [x for x in z if re.search(r'[a-z\d]', x, flags=re.I)]
In the regexp, a-z matches letters and \d matches digits, so [a-z\d] matches any letter or digit (and the re.I flag makes it case-insensitive). So the list comprehension includes any elements of z that contain a letter or digit.
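As a quick check, here is a minimal sketch that applies the filter to a few values copied from the output in the question. The shortened sample list and the names filtered and filtered_unicode are illustrative only. The second variant, based on str.isalnum(), is an alternative not mentioned in the answer; it may be useful if the text can contain non-ASCII letters, which a-z does not cover.

```python
import re

# A few values copied from the output shown in the question.
z = ['split()', 'Syntax', ':', 'str', '-', '1 that means there', 'Returns', ':']

# Keep only the items that contain at least one letter (a-z, any case) or digit.
filtered = [x for x in z if re.search(r'[a-z\d]', x, flags=re.I)]
print(filtered)
# ['split()', 'Syntax', 'str', '1 that means there', 'Returns']

# Alternative (an assumption, not part of the answer): str.isalnum() is True
# for letters and digits in any script, so accented letters are kept as well.
filtered_unicode = [x for x in z if any(ch.isalnum() for ch in x)]
print(filtered_unicode)
```

For this sample both variants give the same result; they only differ when an item's sole alphanumeric characters are outside the ASCII range.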
Answered By - Barmar