Issue
I'd like to recursively find all .md files under the current directory that contain the “Narrow No-Break Space” Unicode character (U+202F) between the two strings \begin{document} and \end{document}, which are possibly (and in fact usually) not on the same line as the U+202F.
A great addition would be to replace each such U+202F with a normal space.
I already found a way to extract the text between \begin{document} and \end{document} with a Python regexp (which I find easier for multi-line substitutions). I tried to use it just to list the files matching this pattern (planning to chain with grep afterwards, to at least get the files where the match contains U+202F), with the following attempt:
def finds_files_whose_contents_match_a_regex(filename):
    textfile = open(filename, 'r')
    filetext = textfile.read()
    textfile.close()
    matches = re.findall("\\begin{document}\s*(.*?)\s*\\end{document}", filetext)

for root, dirs, files in os.walk("."):
    for filename in files:
        if filename.endswith(".md"):
            filename=os.path.join(root, filename)
            finds_files_whose_contents_match_a_regex(filename)
but I got errors that are unintelligible to me:
Traceback (most recent call last):
  File "./test-bis.py", line 14, in <module>
    finds_files_whose_contents_match_a_regex(filename)
  File "./test-bis.py", line 8, in finds_files_whose_contents_match_a_regex
    matches = re.findall("\\begin{document}\s*(.*?)\s*\\end{document}", filetext)
  File "/usr/lib64/python3.10/re.py", line 240, in findall
    return _compile(pattern, flags).findall(string)
  File "/usr/lib64/python3.10/re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib64/python3.10/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib64/python3.10/sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib64/python3.10/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib64/python3.10/sre_parse.py", line 526, in _parse
    code = _escape(source, this, state)
  File "/usr/lib64/python3.10/sre_parse.py", line 427, in _escape
    raise source.error("bad escape %s" % escape, len(escape))
re.error: bad escape \e at position 27
Solution
Assuming you are correctly reading or decoding an encoded file...
I would do something along these lines.
from pathlib import Path
import re

p = Path('/tmp')  # Use your root path here

def replace_non_break_spaces(fn):
    with open(fn, "r") as f:
        cont = f.read()
    cont_update = re.sub(r"\\begin{document}[\s\S]*?\\end{document}",
                         lambda m: m.group(0).replace("\u202F", "!"), cont)
    if cont != cont_update:
        # at this point, write 'cont_update' back to the same file.
        # File is only updated if the re.sub changes the string
        pass

for fn in (x for x in p.glob("**/*.md") if x.is_file()):
    replace_non_break_spaces(fn)
Given your example on regex101 (which I modified as seen):
\documentclass{article}
\usepackage[width=7cm]{geometry}
<=there are u202F there
\pagestyle{empty}
\begin{document}
Du texte aligné à droite :
<=there are u202F there
\raggedleft
cet exemple ne brille sans
doute pas par sa complexité.
Clair, non ?
\end{document}
The result is:
\documentclass{article}
\usepackage[width=7cm]{geometry}
<=there are u202F there
\pagestyle{empty}
\begin{document}
Du texte aligné à droite :
!!<=there are u202F there
\raggedleft
cet exemple ne brille sans
doute pas par sa complexité.
Clair, non!?
\end{document}
(The narrow no-break spaces between \begin{document} and \end{document} are replaced with ! for clarity; the ones in the preamble are left untouched.)
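As for the traceback in the question: in a normal (non-raw) Python string, "\\b" and "\\e" reach the regex engine as \b and \e. There, \b means a word boundary rather than a literal backslash followed by b, and \e is not a valid escape at all, hence "bad escape \e at position 27". Below is a minimal sketch of a corrected pattern, assuming a raw string with doubled backslashes and re.DOTALL so that . also matches newlines. The sample text and the names DOC_BODY, sample, bodies and has_nnbsp are made up for illustration.

```python
import re

# The pattern from the question, written with "\\s" so that it is the same
# string without relying on an unrecognised "\s" string escape.
try:
    re.compile("\\begin{document}\\s*(.*?)\\s*\\end{document}")
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)

# Raw string: "\\" reaches the regex engine as an escaped, literal backslash.
# re.DOTALL lets "." match newlines, so one match can span several lines.
DOC_BODY = re.compile(r"\\begin\{document\}\s*(.*?)\s*\\end\{document\}", re.DOTALL)

# Hypothetical sample text, only for illustration.
sample = "preamble\n\\begin{document}\nClair, non\u202f?\n\\end{document}\n"

bodies = DOC_BODY.findall(sample)
has_nnbsp = any("\u202f" in body for body in bodies)
print(bodies, has_nnbsp)
```

One caveat with this pattern: for str patterns, \s also matches U+202F, so a narrow no-break space directly after \begin{document} or directly before \end{document} is consumed by \s* and does not end up in the captured group. Matching with [\s\S]*? and looking at the whole match avoids that.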
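Regarding the comment that running python test.py doesn't change test.md: the first version only detects the change and then does nothing with it (note the pass). This version writes the result back to the file: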
from pathlib import Path
import re

p = Path('/tmp')  # Use your root path here

def replace_non_break_spaces(fn):
    with open(fn, "r") as f:
        cont = f.read()
    # "!" is used so the change is visible; use " " to get normal spaces
    cont_update = re.sub(r"\\begin{document}[\s\S]*?\\end{document}",
                         lambda m: m.group(0).replace("\u202F", "!"), cont)
    if cont != cont_update:
        print(f"Updating {fn}")
        # make a backup...
        with open(f"{fn}.bak", "w") as f:
            f.write(cont)
        with open(fn, "w") as f:
            f.write(cont_update)

for fn in (x for x in p.glob("**/*.md") if x.is_file()):
    replace_non_break_spaces(fn)
CAREFUL!!! This code will recursively change every .md file in the tree that contains such a match (it does make a .bak backup of each file it updates).
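If you would rather list the affected files first, as in the first part of the question, a read-only variant is possible. The following is a minimal sketch, not part of the code above. The names has_nnbsp_in_document, find_md_with_nnbsp and nnbsp_to_space are made up for illustration, and it assumes the files are UTF-8 encoded.

```python
import re
from pathlib import Path

NNBSP = "\u202f"  # NARROW NO-BREAK SPACE

# Equivalent to the pattern used above (braces escaped for readability):
# everything from \begin{document} up to the next \end{document}.
DOC_ENV = re.compile(r"\\begin\{document\}[\s\S]*?\\end\{document\}")


def has_nnbsp_in_document(text):
    """Return True if any document environment in `text` contains U+202F."""
    return any(NNBSP in m.group(0) for m in DOC_ENV.finditer(text))


def find_md_with_nnbsp(root):
    """List .md files under `root` with U+202F inside a document environment.

    Assumption: the files are UTF-8 encoded. Nothing is modified.
    """
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        if path.is_file() and has_nnbsp_in_document(path.read_text(encoding="utf-8")):
            hits.append(path)
    return hits


def nnbsp_to_space(text):
    """Replace U+202F by a normal space, only inside document environments."""
    return DOC_ENV.sub(lambda m: m.group(0).replace(NNBSP, " "), text)


# Small in-memory demonstration on a made-up sample.
sample = "a\u202fb\n\\begin{document}\nnon\u202f?\n\\end{document}\n"
print(has_nnbsp_in_document(sample))
print(repr(nnbsp_to_space(sample)))
```

Calling find_md_with_nnbsp(".") returns the matching paths without touching any file, so you can review the list before running the version that writes changes back.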
Answered By - dawg