Monday, December 4, 2023

[FIXED] List all files containing a string between two specific strings (not on the same line)

Labels: python, regex

Issue

I'd like to recursively find all .md files under the current directory that contain the “Narrow No-Break Space” Unicode character (U+202F) between the two strings \begin{document} and \end{document}, which are possibly (in fact, almost always) not on the same line as the U+202F.

A great addition would be to replace each such U+202F with a normal space.

I already found a way to extract the text between \begin{document} and \end{document} with a Python regexp (which I find easier for multi-line substitutions). I tried to use it just to list the files matching this pattern (planning to chain it with grep afterwards, to at least get the files where the match contains U+202F), but my attempt with:

def finds_files_whose_contents_match_a_regex(filename):
    textfile = open(filename, 'r')
    filetext = textfile.read()
    textfile.close()
    matches = re.findall("\\begin{document}\s*(.*?)\s*\\end{document}", filetext)

for root, dirs, files in os.walk("."):
    for filename in files:
        if filename.endswith(".md"):
            filename=os.path.join(root, filename)
            finds_files_whose_contents_match_a_regex(filename)

but I got unintelligible (for me) errors:

Traceback (most recent call last):
  File "./test-bis.py", line 14, in <module>
    finds_files_whose_contents_match_a_regex(filename)
  File "./test-bis.py", line 8, in finds_files_whose_contents_match_a_regex
    matches = re.findall("\\begin{document}\s*(.*?)\s*\\end{document}", filetext)
  File "/usr/lib64/python3.10/re.py", line 240, in findall
    return _compile(pattern, flags).findall(string)
  File "/usr/lib64/python3.10/re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib64/python3.10/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib64/python3.10/sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib64/python3.10/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib64/python3.10/sre_parse.py", line 526, in _parse
    code = _escape(source, this, state)
  File "/usr/lib64/python3.10/sre_parse.py", line 427, in _escape
    raise source.error("bad escape %s" % escape, len(escape))
re.error: bad escape \e at position 27
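
The traceback can be read as follows: in a normal (non-raw) Python string, "\\e" is the two characters \e by the time the regex engine sees it, and \e is not a valid regex escape. Likewise "\\b" silently becomes \b, a word boundary rather than a literal backslash followed by b. A minimal sketch of the failure and of one possible fix, using a raw string with doubled backslashes plus re.DOTALL so that . also matches newlines (the sample text is made up for illustration):

```python
import re

# Hypothetical sample text, not taken from the question's files
text = "preamble\n\\begin{document}\nline one\nline two\n\\end{document}\n"

# Same pattern string as in the question, written with every backslash
# doubled so Python itself does not warn about "\s". After Python's string
# processing the regex engine receives \begin ... \end, and \e is rejected.
try:
    re.compile("\\begin{document}\\s*(.*?)\\s*\\end{document}")
    compiled = True
except re.error as exc:
    compiled = False
    error_message = str(exc)

print(compiled)  # False

# Raw string: \\ now reaches the regex engine and means one literal backslash.
# re.DOTALL lets (.*?) run across line breaks.
pattern = re.compile(
    r"\\begin\{document\}\s*(.*?)\s*\\end\{document\}", re.DOTALL
)
matches = pattern.findall(text)
print(matches)
```

Escaping the braces is optional here, since {document} is not a valid repetition count, but it makes the intent explicit.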

Solution

Assuming you are correctly reading or decoding an encoded file...

I would do something along these lines.

from pathlib import Path
import re

p = Path('/tmp')  # Use your root path here

def replace_non_break_spaces(fn):
    # Assumption: the files are UTF-8 encoded; adjust the encoding if not
    with open(fn, "r", encoding="utf-8") as f:
        cont = f.read()

    # [\s\S]*? lazily matches any character, newlines included, so the match
    # spans lines; U+202F is replaced only inside the matched document body
    cont_update = re.sub(r"\\begin{document}[\s\S]*?\\end{document}",
        lambda m: m.group(0).replace("\u202F", "!"), cont)

    if cont != cont_update:
        # at this point, write 'cont_update' back to the same file.
        # File is only updated if the re.sub changes the string
        pass

for fn in (x for x in p.glob("**/*.md") if x.is_file()):
    replace_non_break_spaces(fn)

Given your example on regex101 (which I modified as seen):

\documentclass{article}
\usepackage[width=7cm]{geometry}
  <=there are u202F there
\pagestyle{empty}
\begin{document}
Du texte aligné à droite :
  <=there are u202F there
\raggedleft
cet exemple ne brille sans
doute pas par sa complexité.

Clair, non ? 
\end{document}

The result is:

\documentclass{article}
\usepackage[width=7cm]{geometry}
  <=there are u202F there
\pagestyle{empty}
\begin{document}
Du texte aligné à droite :
!!<=there are u202F there
\raggedleft
cet exemple ne brille sans
doute pas par sa complexité.

Clair, non!?
\end{document}

(The non-breaking spaces are replaced with ! for clarity...)
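
To get what the question actually asks for (a normal space rather than !), the same substitution can be tried on a string before touching any file. This is only a sketch: the names NNBSP, DOC_BODY and normalize_body_spaces are made up for illustration, and the sample is a shortened stand-in for the regex101 example.

```python
import re

NNBSP = "\u202f"  # NARROW NO-BREAK SPACE

# Matches one document body, delimiters included, across line breaks
DOC_BODY = re.compile(r"\\begin\{document\}.*?\\end\{document\}", re.DOTALL)

def normalize_body_spaces(text: str) -> str:
    """Replace U+202F with a plain space, but only inside the document body."""
    return DOC_BODY.sub(lambda m: m.group(0).replace(NNBSP, " "), text)

sample = (
    "\\usepackage{geometry}\n"
    f"{NNBSP}preamble, left alone\n"
    "\\begin{document}\n"
    f"Clair, non{NNBSP}?\n"
    "\\end{document}\n"
)
result = normalize_body_spaces(sample)
print(repr(result))
```

The U+202F in the preamble survives, while the one before the question mark becomes an ordinary space.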

In response to the comment that running python test.py doesn't change test.md, here is a version that actually writes the changes back:

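from pathlib import Path
import re

p = Path('/tmp')  # Use your root path here

def replace_non_break_spaces(fn):
    # Assumption: the files are UTF-8 encoded; adjust the encoding if not
    with open(fn, "r", encoding="utf-8") as f:
        cont = f.read()

    # "!" keeps the change visible while testing; use " " for a normal space
    cont_update = re.sub(r"\\begin{document}[\s\S]*?\\end{document}",
        lambda m: m.group(0).replace("\u202F", "!"), cont)

    if cont != cont_update:
        print(f"Updating {fn}")
        # make a backup of the original contents...
        with open(f"{fn}.bak", "w", encoding="utf-8") as f:
            f.write(cont)
        # ...then overwrite the file with the updated text
        with open(fn, "w", encoding="utf-8") as f:
            f.write(cont_update)

for fn in (x for x in p.glob("**/*.md") if x.is_file()):
    replace_non_break_spaces(fn)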

CAREFUL! This code recursively rewrites every matching .md file in the tree (it does write a .bak backup of each file it updates).
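
If the goal is only to list the affected files, as in the question's title, a read-only pass is safer. The following is a sketch under the assumption that the files are UTF-8; files_with_nnbsp_in_body is a hypothetical helper name, and the demo builds its own throwaway directory instead of scanning a real tree.

```python
import re
import tempfile
from pathlib import Path

NNBSP = "\u202f"  # NARROW NO-BREAK SPACE
DOC_BODY = re.compile(r"\\begin\{document\}.*?\\end\{document\}", re.DOTALL)

def files_with_nnbsp_in_body(root):
    """Return sorted .md paths under root whose document body has a U+202F."""
    hits = []
    for path in Path(root).rglob("*.md"):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")  # assumption: UTF-8 files
        if any(NNBSP in m.group(0) for m in DOC_BODY.finditer(text)):
            hits.append(path)
    return sorted(hits)

# Demo in a temporary directory: one file with U+202F inside the body,
# one with U+202F only before \begin{document}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "sub" / "hit.md").write_text(
        "\\begin{document}\nnon\u202f?\n\\end{document}\n", encoding="utf-8")
    (root / "miss.md").write_text(
        "\u202f\n\\begin{document}\nplain\n\\end{document}\n", encoding="utf-8")
    found = [p.name for p in files_with_nnbsp_in_body(root)]
    print(found)
```

Nothing is written to the scanned files, so this can be run first to review the list before applying the replacing version.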

Answered By - dawg

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
