Issue
I have a file named input.txt:
<hello script="2.5">
<welcome>
<hgsdhjaghjdghjagdjhgjdhgdajhgdajhgdhjjgfkjg
<number new="0x0000-0x3FF" Id="bhi" Range="4" no_id="hello" />
<----jsdjhsdjndkjjdhjdJHksdkjdnknnddnekfgrejgjorgj jregjgkrjglrjgojggjorjg--->
<number new="0x02" Id="bhi" Unit="0" Range="4" info="0x00000012" no_id="hi all" />
<number new="0x04" Id="bhi" Unit="0" Range="4" info="0x0000023f" no_id="dbhwd" />
<---- dfiuhdwiudi iwqdidffenfj odwqjdjqwgru jdqkkjwfkjfwn odHHOIJD JSDNKS nsk---->
<number new="0x06" Id="bhi" Unit="0" Range="4" info="0x00000f22" no_id="sjkdnkl jdsnj" />
<number new="0x08" Id="bhi" Unit="0" Range="4" info="0x00000f1b" no_id="dm o" />
<---bdheuh jwdhjwdkiwh---->
</abc>
<abc data="CS"
<number new="0x32" Id="bhi" Range="4" info="0x000012f5" no_id="he d kd" />
<number new="0x336" Id="bhi" Range="4" info="0x00000df2" no_id="dnkwn" />
<number new="0x428" Id="bhi" Range="4" info="0x0001cbf2" no_id="h nd" />
</abc>
<abc data="CS1">
<number new="0x35" Id="bhi" Range="4" info="0x000013A5" no_id="lm fed" />
<number new="0x326" Id="bhi" Range="4" info="0x00003252" no_id="dk bop" />
<number new="0x466" Id="bhi" Range="4" info="0x00011BF2" no_id="mj ghd" />
</abc>
<abc data="P1">
<number new="0x06" Id="bhi" Unit="0" unit_id="hi_all" Range="2" info="0x0f22" no_id="sjkdnkl jdsnj' />
<number new="0x08" Id="bhi" Unit="0" unit_id="this new" Range="4" info="0x00000f1b" no_id="dm o" />
</abc>
<--adhhj jdwjdkkj jsSDjkasdj jefnflefk kjsjfoekfle kajfofkp ksaokdfpef---->
<---the end of file---->
From this file I need to create another file named output.txt that contains only the new and info values.
This is my current attempt:
import gzip, re

pattern = r'(new=\"\w+\").*(info=\"\w+\")'
with gzip.open("input.txt.gz", "rb") as fin:
    with open("output.txt", "w") as fout:
        for line in fin:
            for match_new, match_info in re.findall(pattern, line.decode('utf-8')):
                fout.write(f'{match_new} {match_info}\n')
This code produces an output file like this:
output.txt
new="0x02" info="0x00000012"
new="0x04" info="0x0000023f"
new="0x06" info="0x00000f22"
new="0x08" info="0x00000f1b"
new="0x32" info="0x000012f5"
new="0x336" info="0x00000df2"
new="0x428" info="0x0001cbf2"
new="0x35" info="0x000013A5"
new="0x326" info="0x00003252"
new="0x466" info="0x00011BF2"
new="0x06" info="0x0f22"
new="0x08" info="0x00000f1b"
However, I need only the new and info values under abc data="CS1", so my output.txt should look like this:
Expected output:
output.txt
new="0x35" info="0x000013A5"
new="0x326" info="0x00003252"
new="0x466" info="0x00011BF2"
How can I solve this?
Solution
OK, I am going to post this, minus the gzip stuff and minus the nice extraction from the regex. It uses a basic state machine to decide when to pay attention to lines and when to ignore them.
text="""
<hello script="2.5">
<welcome>
<hgsdhjaghjdghjagdjhgjdhgdajhgdajhgdhjjgfkjg
<number new="0x0000-0x3FF" Id="bhi" Range="4" no_id="hello" />
<----jsdjhsdjndkjjdhjdJHksdkjdnknnddnekfgrejgjorgj jregjgkrjglrjgojggjorjg--->
<number new="0x02" Id="bhi" Unit="0" Range="4" info="0x00000012" no_id="hi all" />
<number new="0x04" Id="bhi" Unit="0" Range="4" info="0x0000023f" no_id="dbhwd" />
<---- dfiuhdwiudi iwqdidffenfj odwqjdjqwgru jdqkkjwfkjfwn odHHOIJD JSDNKS nsk---->
<number new="0x06" Id="bhi" Unit="0" Range="4" info="0x00000f22" no_id="sjkdnkl jdsnj" />
<number new="0x08" Id="bhi" Unit="0" Range="4" info="0x00000f1b" no_id="dm o" />
<---bdheuh jwdhjwdkiwh---->
</abc>
<abc data="CS"
<number new="0x32" Id="bhi" Range="4" info="0x000012f5" no_id="he d kd" />
<number new="0x336" Id="bhi" Range="4" info="0x00000df2" no_id="dnkwn" />
<number new="0x428" Id="bhi" Range="4" info="0x0001cbf2" no_id="h nd" />
</abc>
<abc data="CS1">
<number new="0x35" Id="bhi" Range="4" info="0x000013A5" no_id="lm fed" />
<number new="0x326" Id="bhi" Range="4" info="0x00003252" no_id="dk bop" />
<number new="0x466" Id="bhi" Range="4" info="0x00011BF2" no_id="mj ghd" />
</abc>
<abc data="P1">
<number new="0x06" Id="bhi" Unit="0" unit_id="hi_all" Range="2" info="0x0f22" no_id="sjkdnkl jdsnj' />
<number new="0x08" Id="bhi" Unit="0" unit_id="this new" Range="4" info="0x00000f1b" no_id="dm o" />
</abc>
<--adhhj jdwjdkkj jsSDjkasdj jefnflefk kjsjfoekfle kajfofkp ksaokdfpef---->
<---the end of file---->
"""
import re

class WaitState:
    "this is a fairly basic state machine"
    LOOKFOR = {'<abc data="CS1">'}
    marker = None
    pattern = re.compile(r'(new=\"\w+\").*(info=\"\w+\")')
    end = None

    def __init__(self):
        self.found = {}

    def feed(self, line):
        line = line.strip()
        if line in self.LOOKFOR:
            # flip into read mode
            self.__class__ = ReadState
            self.marker = line
            self.found[line] = []
            # marker to show how it ends
            self.end = line.split()[0].replace("<", "</") + ">"

class ReadState(WaitState):
    # read state
    def feed(self, line):
        line = line.strip()
        hit = self.pattern.search(line)
        if hit:
            self.found[self.marker].append(line)
        else:
            if line == self.end:
                # closing tag seen: flip back into wait mode
                self.__class__ = WaitState

reader = WaitState()
for line in text.splitlines():
    reader.feed(line)

# printing out output - not necessarily as intended
for k, lines in reader.found.items():
    print(k, ":")
    for line in lines:
        print(" ", line)
Output
(No, it's not what you want: I just grabbed the entire line on a hit. What counts is the state management.)
<abc data="CS1"> :
<number new="0x35" Id="bhi" Range="4" info="0x000013A5" no_id="lm fed" />
<number new="0x326" Id="bhi" Range="4" info="0x00003252" no_id="dk bop" />
<number new="0x466" Id="bhi" Range="4" info="0x00011BF2" no_id="mj ghd" />
P.S. The self.__class__ = XXX stuff works fine for basic state machines in Python. It's a bit odd-looking, but as long as your classes are made for this, it works.
This supports more than just 1 type of tag, but it doesn't support nested tags. That would just be some extra state to look after.
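For readers who have not seen the class-swapping trick before, here is a minimal sketch of it in isolation. The Off and On classes are hypothetical names made up for this illustration; the only point is that assigning to self.__class__ changes which method runs on the next call, while the instance attributes stay where they are.

```python
class Off:
    """One state. All states share the same instance attributes."""

    def __init__(self):
        self.presses = 0

    def press(self):
        self.presses += 1
        # Swap the class of this instance; the next press() call
        # is dispatched to On.press instead of Off.press.
        self.__class__ = On

    def is_on(self):
        return False


class On(Off):
    """The other state. It inherits __init__ and overrides behaviour."""

    def press(self):
        self.presses += 1
        self.__class__ = Off

    def is_on(self):
        return True


switch = Off()
switch.press()
print(type(switch).__name__, switch.presses)  # On 1
switch.press()
print(type(switch).__name__, switch.presses)  # Off 2
```

WaitState and ReadState follow the same shape: the subclass only overrides feed, so the found dictionary created in __init__ survives every swap.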
To write the results out to a file (again as whole lines, rather than using regex named groups to isolate new and info):
lines_out = reader.found.get('<abc data="CS1">', [])
with open("output.txt", "w") as fo:
    for line in lines_out:
        fo.write(f"{line}\n")
which gives:
% cat output.txt
<number new="0x35" Id="bhi" Range="4" info="0x000013A5" no_id="lm fed" />
<number new="0x326" Id="bhi" Range="4" info="0x00003252" no_id="dk bop" />
<number new="0x466" Id="bhi" Range="4" info="0x00011BF2" no_id="mj ghd" />
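The answer leaves out the gzip handling from the question. One possible way to put it back is sketched below, assuming the file is UTF-8 text. The helper feed_gzip and the stand-in EchoReader are hypothetical names for this illustration; any object with a feed(line) method, such as the WaitState reader above, could be passed instead.

```python
import gzip


def feed_gzip(reader, path, encoding="utf-8"):
    """Feed every line of a gzip-compressed text file to reader.feed()."""
    # Mode "rt" makes gzip.open decode the bytes, so each line is a str
    # and no explicit line.decode(...) call is needed.
    with gzip.open(path, "rt", encoding=encoding) as fin:
        for line in fin:
            reader.feed(line)
    return reader


class EchoReader:
    """Hypothetical stand-in for the state machine: it only records lines."""

    def __init__(self):
        self.lines = []

    def feed(self, line):
        self.lines.append(line.strip())
```

With the state machine from this answer, the call would look like feed_gzip(WaitState(), "input.txt.gz"), after which the results are in the returned reader's found attribute.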
The data structure is reader.found: dict[str, list[str]]. To get back what you originally had in your desired output, just modify the lines
    if hit:
        self.found[self.marker].append(line)
to extract new and info from the regex result in hit.
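A minimal sketch of that modification follows. It is one way to do it, not necessarily what the answer's author had in mind: the pattern is rewritten with named groups (new and info are group names chosen here), and SAMPLE is a shortened, made-up excerpt of the input so that the block runs on its own.

```python
import re

# Shortened stand-in for input.txt (illustration only).
SAMPLE = '''\
<abc data="CS">
<number new="0x32" Id="bhi" Range="4" info="0x000012f5" no_id="he d kd" />
</abc>
<abc data="CS1">
<number new="0x35" Id="bhi" Range="4" info="0x000013A5" no_id="lm fed" />
<number new="0x326" Id="bhi" Range="4" info="0x00003252" no_id="dk bop" />
</abc>
'''


class WaitState:
    LOOKFOR = {'<abc data="CS1">'}
    # Same pattern as before, but with named groups.
    pattern = re.compile(r'(?P<new>new="\w+").*(?P<info>info="\w+")')
    marker = None
    end = None

    def __init__(self):
        self.found = {}

    def feed(self, line):
        line = line.strip()
        if line in self.LOOKFOR:
            self.__class__ = ReadState
            self.marker = line
            self.found[line] = []
            self.end = line.split()[0].replace("<", "</") + ">"


class ReadState(WaitState):
    def feed(self, line):
        line = line.strip()
        hit = self.pattern.search(line)
        if hit:
            # Keep only the two captured attributes, not the whole line.
            pair = f"{hit.group('new')} {hit.group('info')}"
            self.found[self.marker].append(pair)
        elif line == self.end:
            self.__class__ = WaitState


reader = WaitState()
for line in SAMPLE.splitlines():
    reader.feed(line)

pairs = reader.found.get('<abc data="CS1">', [])
for pair in pairs:
    print(pair)
```

Each entry in pairs has the new="..." info="..." shape the question asks for, so the file-writing loop shown earlier can be reused unchanged to produce output.txt.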
Answered By - JL Peyret