Issue
I have a text that contains repeated blocks of text (multiple lines each) of different sizes.
Example:
Lorem Ipsum is simply dummy text
Lorem Ipsum is simply dummy text
Lorem Ipsum is simply dummy text
alpha game beta teta
Anvil Bravil Bruma
alpha game beta teta
Anvil Bravil Bruma
alpha game beta teta
Anvil Bravil Bruma
generator
There are many variations of passages of Lorem Ipsum available,
Lorem Ipsum
is simply dummy text
There are many variations of passages of Lorem Ipsum available,
Lorem Ipsum
is simply dummy text
Instead of having repeated blocks, I want to change to something like:
Lorem Ipsum is simply dummy text == @3 times
==== repeated 3 times ====
alpha game beta teta
Anvil Bravil Bruma
generator == @1 time
======== repeated 2 times ===
There are many variations of passages of Lorem Ipsum available,
Lorem Ipsum
is simply dummy text
===============
A one-line block is marked a little differently, because a one-line block can itself be repeated inside another block (recursively). For example:
Lorem Ipsum
Lorem Ipsum
hamdal
Lorem Ipsum
Lorem Ipsum
hamdal
becomes
==== repeated 2 times ====
Lorem Ipsum == @2 times
hamdal
===========
When a single line is repeated several times, this is easy to do (compare each line to the previous one, kept in memory), but it is hard to find a solution when the repeated block spans multiple lines and the number of lines varies.
I'm thinking of something that does some kind of backtracking, with a maximum limit on the number of lines a candidate block can have, but I haven't worked out the logic, so I'm looking for ideas or some starting code.
Solution
You will need to set a maximum block size to look ahead for repetitions. It could be up to half the total number of lines, but the larger it is, the less efficient the process will be, since every position is tested against every block size up to that maximum.
Advance through the lines by the largest possible repeating block starting at the current line, and recurse into multi-line blocks.
From your expected output I gather that a single non-repeating line following a multi-line block must be flagged with "== @1 times" in order to delimit the end of the previous block.
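To make the look-ahead step concrete on its own, here is a minimal sketch of just the search for the largest repeating block at a given position. The helper name `find_repeat` and its `max_block` parameter are hypothetical and only serve as an illustration; the full function does the same search inline.

```python
def find_repeat(lines, i, max_block=8):
    """Return (size, count) for the largest block starting at lines[i]
    that is immediately repeated, or (0, 0) when nothing repeats there."""
    for size in range(max_block, 0, -1):            # try the largest blocks first
        block = lines[i:i + size]
        # the block must be complete and followed by an identical copy
        if len(block) == size and block == lines[i + size:i + 2 * size]:
            count = 2                               # the block plus its first repeat
            while lines[i + count * size:i + (count + 1) * size] == block:
                count += 1                          # keep counting consecutive copies
            return size, count
    return 0, 0


sample = ["a", "b", "a", "b", "a", "b", "c"]
print(find_repeat(sample, 0))   # (2, 3)
print(find_repeat(sample, 6))   # (0, 0)
```

Trying the sizes in decreasing order is what makes the outer block win over the shorter repetitions it contains; those are then picked up by the recursion.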
def compact(lines,maxBlock=8):
    compacted = []                  # resulting lines
    def addBlock(start,size,count): # add a block
        if size == 1:               # single line
            compacted.append(lines[start])
            if count:               # zero count == no repeat
                compacted[-1] += f" == @{count} times" # flag repeats
        else:                       # multi-line
            compacted.append(f"======== repeated {count} times ===")
            compacted.extend(compact(lines[start:start+size],maxBlock)) # recurse
    i = 0
    force1 = False # force "@1 times" after end of multi-line block
    while i<len(lines):
        # largest block size that repeats at i (0 when nothing repeats)
        size = next(s for s in range(maxBlock,-1,-1)
                    if lines[i:i+s]==lines[i+s:i+2*s])
        if not size:
            addBlock(i,1,1*force1) # No-repeat (except to signal end of block)
            force1 = False
            i += 1
            continue
        # number of consecutive copies of the block
        count = next( (j-i)//size for j in range(i,len(lines)+size,size)
                      if lines[i:i+size] != lines[j:j+size])
        addBlock(i,size,count)
        i += size*count
        force1 = size>1 # will force "@1 times" if next is single line
    return compacted
output:
lines = """Lorem Ipsum is simply dummy text
Lorem Ipsum is simply dummy text
Lorem Ipsum is simply dummy text
alpha game beta teta
Anvil Bravil Bruma
alpha game beta teta
Anvil Bravil Bruma
alpha game beta teta
Anvil Bravil Bruma
generator
There are many variations of passages of Lorem Ipsum available,
Lorem Ipsum
is simply dummy text
There are many variations of passages of Lorem Ipsum available,
Lorem Ipsum
is simply dummy text""".split("\n")
for line in compact(lines): print(line)
Lorem Ipsum is simply dummy text == @3 times
======== repeated 3 times ===
alpha game beta teta
Anvil Bravil Bruma
generator == @1 times
======== repeated 2 times ===
There are many variations of passages of Lorem Ipsum available,
Lorem Ipsum
is simply dummy text
...
lines = """Lorem Ipsum
Lorem Ipsum
hamdal
Lorem Ipsum
Lorem Ipsum
hamdal""".split("\n")
for line in compact(lines): print(line)
======== repeated 2 times ===
Lorem Ipsum == @2 times
hamdal
Note: it may be a good idea to strip trailing spaces from the lines before calling the function, so that invisible differences don't prevent groupings.
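As a small sketch of that preprocessing step (the function name `normalize` is hypothetical, chosen only for illustration), trailing whitespace can be removed while splitting the text into lines:

```python
def normalize(text):
    """Split text into lines and drop trailing whitespace from each one,
    so that 'alpha  ' and 'alpha' compare equal."""
    return [line.rstrip() for line in text.splitlines()]


raw = "alpha  \nalpha\nbeta\t\nbeta\n"
print(normalize(raw))   # ['alpha', 'alpha', 'beta', 'beta']
```

The resulting list can then be passed to the compacting function in place of a plain `split("\n")`.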
Unless the notation was imposed, the compacted lines may be easier to understand if the addBlock function indicated the number of lines that are repeated. This way you wouldn't need the "@1 times" exception:
def compact(lines,maxBlock=8):
    compacted = []                  # resulting lines
    def addBlock(start,size,count): # add a block
        if size == 1:               # single line
            compacted.append(lines[start])
            if count: compacted[-1] += f" == @{count} times" # flag repeats
        else:                       # multi-line
            block = compact(lines[start:start+size],maxBlock) # recurse
            compacted.append(f"=== next {len(block)} lines repeated {count} times ===")
            compacted.extend(block)
    i = 0
    while i<len(lines):
        # largest block size that repeats at i (0 when nothing repeats)
        size = next(s for s in range(maxBlock,-1,-1)
                    if lines[i:i+s]==lines[i+s:i+2*s])
        if not size:
            addBlock(i,1,0) # No-repeat: the line is copied unflagged
            i += 1
            continue
        # number of consecutive copies of the block
        count = next( (j-i)//size for j in range(i,len(lines)+size,size)
                      if lines[i:i+size] != lines[j:j+size])
        addBlock(i,size,count)
        i += size*count
    return compacted
Alternate outputs:
Lorem Ipsum is simply dummy text == @3 times
=== next 2 lines repeated 3 times ===
alpha game beta teta
Anvil Bravil Bruma
generator
=== next 3 lines repeated 2 times ===
There are many variations of passages of Lorem Ipsum available,
Lorem Ipsum
is simply dummy text
...
=== next 2 lines repeated 2 times ===
Lorem Ipsum == @2 times
hamdal
You could do something similar for single-line repetitions, for example by showing "=== next line repeated 3 times ===" before the line.
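A minimal sketch of that variant, assuming the header wording suggested above, could look like the following. The name `compact_uniform` is hypothetical; it treats single-line and multi-line repetitions the same way, so no suffix is ever appended to a line.

```python
def compact_uniform(lines, max_block=8):
    compacted = []                                   # resulting lines
    i = 0
    while i < len(lines):
        # largest block size that is immediately repeated at i (0 if none)
        size = next(s for s in range(max_block, -1, -1)
                    if lines[i:i + s] == lines[i + s:i + 2 * s])
        if not size:                                 # no repetition: copy the line
            compacted.append(lines[i])
            i += 1
            continue
        block = lines[i:i + size]
        count = 2                                    # the block plus its first repeat
        while lines[i + count * size:i + (count + 1) * size] == block:
            count += 1
        inner = compact_uniform(block, max_block)    # compact the block itself
        label = "line" if len(inner) == 1 else f"{len(inner)} lines"
        compacted.append(f"=== next {label} repeated {count} times ===")
        compacted.extend(inner)
        i += size * count
    return compacted


lines = ["Lorem Ipsum", "Lorem Ipsum", "hamdal"] * 2
for line in compact_uniform(lines):
    print(line)
# === next 3 lines repeated 2 times ===
# === next line repeated 2 times ===
# Lorem Ipsum
# hamdal
```

Note that the header counts the compacted lines that follow it, including any nested headers, which is why the outer header says 3 lines here.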
Answered By - Alain T.