Monday, November 14, 2022

[FIXED] How to get data under a specific string in a file?


Issue

I have a file named input.txt.

<hello script="2.5">
<welcome>
     <hgsdhjaghjdghjagdjhgjdhgdajhgdajhgdhjjgfkjg
     <number new="0x0000-0x3FF" Id="bhi" Range="4" no_id="hello" />
               
          <----jsdjhsdjndkjjdhjdJHksdkjdnknnddnekfgrejgjorgj jregjgkrjglrjgojggjorjg--->
          <number new="0x02" Id="bhi" Unit="0" Range="4" info="0x00000012" no_id="hi all" />
          <number new="0x04" Id="bhi" Unit="0" Range="4" info="0x0000023f" no_id="dbhwd" />
          <---- dfiuhdwiudi iwqdidffenfj odwqjdjqwgru jdqkkjwfkjfwn odHHOIJD JSDNKS nsk---->
          <number new="0x06" Id="bhi" Unit="0" Range="4" info="0x00000f22" no_id="sjkdnkl jdsnj" />
          <number new="0x08" Id="bhi" Unit="0" Range="4" info="0x00000f1b" no_id="dm o" />
    <---bdheuh jwdhjwdkiwh---->
</abc>
<abc data="CS"
          <number new="0x32" Id="bhi"  Range="4" info="0x000012f5" no_id="he d kd" />
          <number new="0x336" Id="bhi" Range="4" info="0x00000df2" no_id="dnkwn" />
          <number new="0x428" Id="bhi" Range="4" info="0x0001cbf2" no_id="h nd" />
</abc>
<abc data="CS1">
          <number new="0x35" Id="bhi"  Range="4" info="0x000013A5" no_id="lm fed" />
          <number new="0x326" Id="bhi" Range="4" info="0x00003252" no_id="dk bop" />
          <number new="0x466" Id="bhi" Range="4" info="0x00011BF2" no_id="mj ghd" />
</abc>
<abc data="P1">
      <number new="0x06" Id="bhi" Unit="0" unit_id="hi_all" Range="2" info="0x0f22" no_id="sjkdnkl jdsnj' />
       <number new="0x08" Id="bhi" Unit="0" unit_id="this new" Range="4" info="0x00000f1b" no_id="dm o" />
</abc>

<--adhhj jdwjdkkj jsSDjkasdj jefnflefk kjsjfoekfle kajfofkp ksaokdfpef---->
<---the end of file---->

From this file I need to create another file named output.txt that contains only the new and info values.

This is my current attempt:

import gzip, re
pattern = r'(new=\"\w+\").*(info=\"\w+\")'

with gzip.open("input.txt.gz", "rb") as fin:
    with open("output.txt", "w") as fout:
        for line in fin:
            for match_new, match_info in re.findall(pattern, line.decode('utf-8')):
                fout.write(f'{match_new} {match_info}\n')

This code produces an output file, output.txt, like this:

new="0x02" info="0x00000012"
new="0x04" info="0x0000023f"
new="0x06" info="0x00000f22"
new="0x08" info="0x00000f1b"
new="0x32" info="0x000012f5"
new="0x336" info="0x00000df2"
new="0x428" info="0x0001cbf2"
new="0x35" info="0x000013A5"
new="0x326" info="0x00003252"
new="0x466" info="0x00011BF2"
new="0x06" info="0x0f22"
new="0x08" info="0x00000f1b"

However, I need only the new and info values under abc data="CS1", so my expected output.txt should look like this:

new="0x35"  info="0x000013A5"
new="0x326" info="0x00003252"
new="0x466" info="0x00011BF2"

How can I solve this?

Solution

OK, I am going to post this, minus the gzip stuff and minus the nice extraction from the regex. It uses a basic state machine to decide when to pay attention to lines and when to ignore them.

text="""
<hello script="2.5">
<welcome>
     <hgsdhjaghjdghjagdjhgjdhgdajhgdajhgdhjjgfkjg
     <number new="0x0000-0x3FF" Id="bhi" Range="4" no_id="hello" />
               
          <----jsdjhsdjndkjjdhjdJHksdkjdnknnddnekfgrejgjorgj jregjgkrjglrjgojggjorjg--->
          <number new="0x02" Id="bhi" Unit="0" Range="4" info="0x00000012" no_id="hi all" />
          <number new="0x04" Id="bhi" Unit="0" Range="4" info="0x0000023f" no_id="dbhwd" />
          <---- dfiuhdwiudi iwqdidffenfj odwqjdjqwgru jdqkkjwfkjfwn odHHOIJD JSDNKS nsk---->
          <number new="0x06" Id="bhi" Unit="0" Range="4" info="0x00000f22" no_id="sjkdnkl jdsnj" />
          <number new="0x08" Id="bhi" Unit="0" Range="4" info="0x00000f1b" no_id="dm o" />
    <---bdheuh jwdhjwdkiwh---->
</abc>
<abc data="CS"
          <number new="0x32" Id="bhi"  Range="4" info="0x000012f5" no_id="he d kd" />
          <number new="0x336" Id="bhi" Range="4" info="0x00000df2" no_id="dnkwn" />
          <number new="0x428" Id="bhi" Range="4" info="0x0001cbf2" no_id="h nd" />
</abc>
<abc data="CS1">
          <number new="0x35" Id="bhi"  Range="4" info="0x000013A5" no_id="lm fed" />
          <number new="0x326" Id="bhi" Range="4" info="0x00003252" no_id="dk bop" />
          <number new="0x466" Id="bhi" Range="4" info="0x00011BF2" no_id="mj ghd" />
</abc>
<abc data="P1">
      <number new="0x06" Id="bhi" Unit="0" unit_id="hi_all" Range="2" info="0x0f22" no_id="sjkdnkl jdsnj' />
       <number new="0x08" Id="bhi" Unit="0" unit_id="this new" Range="4" info="0x00000f1b" no_id="dm o" />
</abc>

<--adhhj jdwjdkkj jsSDjkasdj jefnflefk kjsjfoekfle kajfofkp ksaokdfpef---->
<---the end of file---->
"""

import re

class WaitState:
    "this is a fairly basic state machine"

    LOOKFOR = {'<abc data="CS1">'}

    marker = None
    pattern = re.compile(r'(new=\"\w+\").*(info=\"\w+\")')
    end = None

    def __init__(self):
        self.found = {}

    def feed(self, line):
        line = line.strip()
        if line in self.LOOKFOR:
            #flip into read mode
            self.__class__ = ReadState
            self.marker = line
            self.found[line] = []
            #marker to show how it ends 
            self.end = line.split()[0].replace("<","</") + ">"

class ReadState(WaitState):
    #read state
    def feed(self, line):
        line = line.strip()
        hit = self.pattern.search(line)
        if hit:
            self.found[self.marker].append(line)
        else:
            if line == self.end:
                self.__class__ = WaitState


reader = WaitState()
for line in text.splitlines():
    reader.feed(line)   

# print the results (whole matching lines, not necessarily the format requested)
for k, lines in reader.found.items():
    print(k, ":")
    for line in lines:
        print("  ",line)

output

(No, it's not what you want: I just grabbed the entire line on a hit. What counts is the state management.)

<abc data="CS1"> :
   <number new="0x35" Id="bhi"  Range="4" info="0x000013A5" no_id="lm fed" />
   <number new="0x326" Id="bhi" Range="4" info="0x00003252" no_id="dk bop" />
   <number new="0x466" Id="bhi" Range="4" info="0x00011BF2" no_id="mj ghd" />

P.S. The self.__class__ = XXX trick works fine for basic state machines in Python. It looks a bit odd, but it works as long as your classes are designed for it.

This supports more than one type of tag, but it doesn't support nested tags. Handling those would just mean some extra state to look after.
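
The end marker in WaitState.feed is derived from the opening line, which is what lets LOOKFOR hold several different tags. A minimal sketch of that derivation as a standalone helper follows. The name closing_tag is hypothetical, and the rstrip(">") step is an addition that is not in the code above: it is there so that an attribute-less tag such as <welcome> does not end up with a doubled ">".

```python
def closing_tag(opening_line):
    """Derive the closing tag for an opening line (hypothetical helper)."""
    # First whitespace-separated token: '<abc', or '<welcome>' when there are no attributes.
    tag = opening_line.strip().split()[0]
    # rstrip(">") is an assumption added here so '<welcome>' yields '</welcome>'.
    return tag.rstrip(">").replace("<", "</") + ">"


for opening in ('<abc data="CS1">', '<abc data="P1">', '<welcome>'):
    print(opening, "->", closing_tag(opening))
```

With something like this in place, adding '<abc data="P1">' to LOOKFOR would collect that section under its own key in reader.found.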

To write the results out to a file (again whole lines, rather than using regex named groups to isolate new and info):

lines_out = reader.found.get('<abc data="CS1">',[])
with open("output.txt","w") as fo:
    for line in lines_out:
        fo.write(f"{line}\n")

which gives:

% cat output.txt     
<number new="0x35" Id="bhi"  Range="4" info="0x000013A5" no_id="lm fed" />
<number new="0x326" Id="bhi" Range="4" info="0x00003252" no_id="dk bop" />
<number new="0x466" Id="bhi" Range="4" info="0x00011BF2" no_id="mj ghd" />
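
The gzip handling from the question was left out above. One way to put it back, sketched under the assumption that the input is a UTF-8 text file compressed as input.txt.gz, is to open it in text mode and hand each line to the state machine. The helper name iter_gzip_lines is hypothetical, and the throwaway file is created only so the sketch runs on its own.

```python
import gzip
import os
import tempfile


def iter_gzip_lines(path, encoding="utf-8"):
    """Yield text lines from a gzip-compressed file, one at a time."""
    # Mode "rt" makes gzip.open decode bytes to str, so no .decode() call is needed.
    with gzip.open(path, "rt", encoding=encoding) as fin:
        for line in fin:
            yield line.rstrip("\n")


# Build a small throwaway .gz file (assumed content, for illustration only).
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "input.txt.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as fout:
    fout.write('<abc data="CS1">\n  <number new="0x35" info="0x000013A5" />\n</abc>\n')

gz_lines = list(iter_gzip_lines(gz_path))
print(gz_lines)
```

In the state-machine code, the loop over text.splitlines() could then become a loop over iter_gzip_lines("input.txt.gz"), calling reader.feed(line) for each line.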

The data structure is reader.found: dict[str, list[str]]. To get back what you originally had in your desired output, just modify the lines

if hit: 
    self.found[self.marker].append(line)

to extract new and info from the regex result in hit.
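
As a rough sketch of that last step, the pattern can use named groups so that only new and info are kept. This is an illustrative standalone variant, not the answer's code: it replaces the two state classes with a simple boolean flag, and the names PATTERN, extract_section and sample are assumptions made for the example.

```python
import re

# Named groups isolate the two attribute values (illustrative variant of the pattern above).
PATTERN = re.compile(r'new="(?P<new>\w+)".*info="(?P<info>\w+)"')


def extract_section(lines, start='<abc data="CS1">', end='</abc>'):
    """Return 'new="..." info="..."' strings for lines between start and end."""
    results = []
    reading = False  # plays the role of WaitState / ReadState
    for raw in lines:
        line = raw.strip()
        if not reading:
            # Stay idle until the opening marker shows up.
            reading = (line == start)
            continue
        if line == end:
            reading = False
            continue
        hit = PATTERN.search(line)
        if hit:
            results.append(f'new="{hit["new"]}" info="{hit["info"]}"')
    return results


sample = """\
<abc data="CS">
    <number new="0x32" Id="bhi" Range="4" info="0x000012f5" no_id="he d kd" />
</abc>
<abc data="CS1">
    <number new="0x35" Id="bhi"  Range="4" info="0x000013A5" no_id="lm fed" />
    <number new="0x326" Id="bhi" Range="4" info="0x00003252" no_id="dk bop" />
</abc>
"""

pairs = extract_section(sample.splitlines())
print("\n".join(pairs))
```

In the class-based version, the equivalent change would be to append the formatted string built from hit instead of appending line.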

Answered By - JL Peyret

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
