Issue
I need to parse specific strings from a free text field in an .xlsx file. I am using Python 2.7 in Spyder.
I escaped the '.' in the regex formulas but I am still getting the same error.
To parse the field, I used pandas to load the .xlsx file into a pandas DataFrame:
import pandas as pd

data = "complaints_data.xlsx"
read_data = pd.read_excel(data)
# dropna() returns a new DataFrame, so the result must be assigned to take effect
read_data = read_data.dropna()
df = pd.DataFrame(read_data)
df['FMEA Assessment'] = df['FMEA Assessment'].replace({',':''}, regex=True)
Then, I used the extract function of pandas to extract my string fields FMEA, Rev and Line using regex patterns.
fmea_pattern = r'(FMEA\s*\d*\d*\d*\d*\d*|fmea\s*\d*\d*\d*\d*\d*|DOC\s*\-*[0]\d*\d*\d*\d*\d*|doc\s*\-*[0]\d*\d*\d*\d*\d*)'
df[['FMEA']] = df['FMEA Assessment'].str.extract(fmea_pattern, expand=True)
rev_pattern = r'(Rev\.*\s+\D{1,2}+|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'
df[['REV']] = df['FMEA Assessment'].str.extract(rev_pattern, expand=True)
line_pattern = r'(line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINES\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINE\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.)'
df[['LINE']] = df['FMEA Assessment'].str.extract(line_pattern, expand=True)
The string fields I need to parse can be inputted in various ways and I accounted for each way in the regex formulas and for each variation of a word; for example, I accounted for line, Line, LINE, lines, Lines, etc. I have tested the regex formulas individually and separately and they are working properly. However, when I combine all of them in the code above, I get the following error message:
Also, is there another way to account for variations of the same word at the same time (lower case, upper case, and title case)?
Solution
The error occurs because rev_pattern uses a possessive quantifier, \D{1,2}+, instead of a regular, non-possessive quantifier.
This is a common mistake when patterns are tested in online PCRE regex testers. Always test your regexps in an environment (or with a regex engine option) that is compatible with your target environment.
Python re (in Python 2.7, and in Python 3 before version 3.11) does not support possessive quantifiers:
{5}+
{5,}+
{5,10}+
++
?+
*+
In this case, you just need to remove the trailing + from \D{1,2}+:
rev_pattern = r'(Rev\.*\s+\D{1,2}|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'
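To check the behaviour locally, here is a minimal sketch using only the standard re module. The helper name compiles and the sample string are made up for illustration. On Python 2.7 and on Python 3 before 3.11, the possessive form is expected to be rejected with a "multiple repeat" error, while newer versions accept it, so the first result depends on the interpreter.

```python
import re

# The trailing + after {1,2} makes the quantifier possessive (PCRE-style syntax).
possessive = r'Rev\.*\s+\D{1,2}+'
# Same pattern with the trailing + removed.
fixed = r'Rev\.*\s+\D{1,2}'


def compiles(pattern):
    """Return True if the re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Depends on the interpreter: rejected before Python 3.11, accepted from 3.11 on.
print(compiles(possessive))
# The corrected pattern compiles everywhere.
print(compiles(fixed))

# Made-up sample text, not taken from the original spreadsheet.
match = re.search(fixed, "FMEA 12345 Rev. AB line 12")
print(match.group(0))
```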
It seems you may just use either of the following:
rev_pattern = r'((?:[Rr]ev|REV)\.*\s+\D{1,2})' # Will only match Rev, REV and rev at the start
rev_pattern = r'(?i)(Rev\.*\s+\D{1,2})' # Will match any case variations of Rev
See the regex demo at Regex101, and note the Python option selected on the left.
Also, note that it is possible to make the whole pattern case insensitive by adding (?i) at the start of the pattern, or by compiling the regex with the re.I or re.IGNORECASE flag. This will "account for variations of the same word at the same time (lower case, upper case and title case)".
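As a rough illustration of the case-insensitive approach, the sketch below compiles a single lower-case alternative with re.IGNORECASE and runs it over a few made-up sample strings (the samples are assumptions, not rows from the original file).

```python
import re

# One alternative is enough once the match is case insensitive.
rev_re = re.compile(r'(rev\.*\s+\D{1,2})', re.IGNORECASE)

# Made-up sample values for illustration only.
samples = ["Rev. A", "REV B", "rev.  C", "no revision here"]

found = []
for text in samples:
    m = rev_re.search(text)
    # Keep the captured text, or None when the row has no revision marker.
    found.append(m.group(1) if m else None)

print(found)
```

With pandas, the same flag can be passed to Series.str.extract through its flags argument, for example flags=re.IGNORECASE, instead of listing every spelling in the pattern.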
NOTE: if you actually need a possessive quantifier, you may emulate one with the help of a positive lookahead and a backreference. However, in Python, you would then need re.finditer to get access to the whole match values.
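A small sketch of that emulation, using digit runs rather than the \D{1,2} from the question so that the difference from a greedy quantifier is visible. The sample strings are made up. The lookahead captures the run, the backreference \1 consumes it, and the engine cannot backtrack into the lookahead afterwards.

```python
import re

text = "123abc 45 6x"

# Emulates the possessive \d++[a-z]: group 1 is captured inside the lookahead,
# then \1 consumes exactly that capture.
emulated = re.compile(r'(?=(\d+))\1[a-z]')

# findall returns only the contents of group 1, not the whole match.
groups_only = emulated.findall(text)

# finditer yields match objects, so group(0) gives the whole match.
whole = [m.group(0) for m in emulated.finditer(text)]

# A greedy \d+ gives back one digit so the final \d can match...
greedy_hit = re.search(r'\d+\d', "12345")
# ...while the emulated possessive version keeps every digit and fails.
possessive_hit = re.search(r'(?=(\d+))\1\d', "12345")

print(groups_only)
print(whole)
print(greedy_hit.group(0), possessive_hit)
```

The extra capturing group is the reason re.findall is not enough here: it reports the group rather than the full match.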
Answered By - Wiktor Stribiżew