Tuesday, November 2, 2021

[FIXED] Free text parsing using long regex formula leading to error: multiple repeat in python? Screenshot included

November 02, 2021 pandas, parsing, python, regex, spyder No comments

Issue

I need to parse specific strings from a free text field in an .xlsx file. I am using Python 2.7 in Spyder.

I escaped the '.' in the regex formulas but I am still getting the same error.

To do that, I used pandas to convert the .xslx file into a pandas dataframe:

data = "complaints_data.xlsx"
read_data = pd.read_excel(data)
read_data.dropna(inplace = False)
df = pd.DataFrame(read_data)
df['FMEA Assessment'] = df['FMEA Assessment'].replace({',':''}, regex=True)

Then, I used the extract function of pandas to extract my string fields FMEA, Rev and Line using regex patterns.

fmea_pattern = r'(FMEA\s*\d*\d*\d*\d*\d*|fmea\s*\d*\d*\d*\d*\d*|DOC\s*\-*[0]\d*\d*\d*\d*\d*|doc\s*\-*[0]\d*\d*\d*\d*\d*)'
df[['FMEA']] = df['FMEA Assessment'].str.extract(fmea_pattern, expand=True)
    
rev_pattern = r'(Rev\.*\s+\D{1,2}+|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'
df[['REV']] = df['FMEA Assessment'].str.extract(rev_pattern, expand=True)
    


line_pattern = r'(line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINES\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINE\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.)'
df[['LINE']] = df['FMEA Assessment'].str.extract(line_pattern, expand=True)

The string fields I need to parse can be inputted in various ways and I accounted for each way in the regex formulas and for each variation of a word; for example, I accounted for line, Line, LINE, lines, Lines, etc. I have tested the regex formulas individually and separately and they are working properly. However, when I combine all of them in the code above, I get the following error message:

Error Message Screenshot

Also, is there another way to account for variations of the same word at the same time(lower case, upper case and title case)?

Solution

The main error in this case is due to the fact you are using a possessive quantifier instead of a regular, non-possessive quantifier.

It is a common mistake when users test their patterns in the online PCRE regex testers. You need to make sure you ALWAYS test your regexps in the environment (or with a regex engine option) that is compatible with your target environment.

Python re does not support possessive quantifiers:

{5}+
{5,}+
{5,10}+
++
?+
*+

In this case, you just need to remove the trailing + from \D{1,2}+:

rev_pattern = r'(Rev\.*\s+\D{1,2}|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'

It seems you may just use

rev_pattern = r'((?:[Rr]ev|REV)\.*\s+\D{1,2})' # Will only match Rev, REV and rev at the start
rev_pattern = r'(?i)(Rev\.*\s+\D{1,2})' # Will match any case variations of Rev

See the regex demo at Regex101, note the Python option selected on the left.

Also, note that it is possible to make the whole pattern case insensitive by adding (?i) at the start of the pattern, or by compiling the regex with re.I or re.IGNORECASE arguments. This will "account for variations of the same word at the same time(lower case, upper case and title case)".

NOTE: if you actually are looking to use a possessive quantifier you may emulate a possessive quantifier with the help of a positive lookahead and a backreference. However, in Python, you would need re.finditer to get access to the whole match values.

Answered By - Wiktor Stribiżew

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, November 2, 2021

[FIXED] Free text parsing using long regex formula leading to error: multiple repeat in python? Screenshot included

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels