Friday, February 2, 2024

[FIXED] Find unique answers from list of strings

February 02, 2024 list, python, string No comments

Issue

I have 2 lists:

texts = ["this is string 1", "this is string 2", "this is string 3"]
answers = ["this", "string 1", "string 3"]

I need to create a function that returns as many possible answers from answers (and their corresponding text) that exist in (exact string match) exactly one of the returned texts. That is, each of the returned answers must be in exactly one of the returned texts.

It should return a list of lists, where each sublist has the [answer, corresponding text].

For example, in this example, only "string 1" appear in the first text ("this is string 1"), only "string 3" appear in the last text ("this is string 3"), and the first answer ("this") appear in all of the texts, so it shouldn't be returned.

Another example:

texts = ["this is string 1 this is string 2", "this is string 1", "this is string 3"]
answers = ["this", "string 1", "string 3"]

The returned output in this case can either be

[["string 1", "this is string 1"],["string 3", "this is string 3"]]

[["string 1", "this is string 1 this is string 2"],["string 3", "this is string 3"]]

As "string 1" does not appear in the returned text "this is string 3".

I currently have the following code:

def find_unique_answers(texts, answers):
    result = []
    for answer in answers:
        text_found = None
        for text in texts:
            if answer in text:
                if text_found is None:
                    text_found = text
                else:
                    text_found = None
                    break
        if text_found is not None:
            result.append([answer, text_found])
    return result

texts = ["this is string 1", "this is string 2", "this is string 3"]
answers = ["this", "string 1", "string 3"]

output = find_unique_answers(texts, answers)
print(output)  # Output: [['string 1', 'this is string 1'], ['string 3', 'this is string 3']]

But, if I change the lists to

texts = ["this is string 1 this is string 2", "this is string 1", "this is string 3"]
answers = ["this", "string 1", "string 3"]

But the output is [['string 3', 'this is string 3']]

Not entirely sure what I'm doing wrong.

Solution

Edit: with the new explanations, I think I know understand the requirements better. It's an optimization problem, which is quite harder.

This is a brute-force approach that tries all combinations of answers. It's not completely stupid, but still takes exponential time. Given your comment on number of answers and texts it should be ok, though.

It always returns the largest amount of answers possible.

def find_all_answer_combinations(texts, answers, used_texts=()):
    if not texts or not answers:
        # Nothing to do, return an empty combination.
        yield ()
        return
    
    # Pick an answer to process.
    answer, *rest_answers = answers
    # Find solutions that *don't* use the selected answer.
    yield from find_all_answer_combinations(texts, rest_answers, used_texts)

    if any(answer in text for text in used_texts):
        # The answer already appears in a returned text, so we can't use it again.
        return

    # Find texts that have this answer.
    matched_texts = [text for text in texts if answer in text]
    # Only continue checking texts that don't contain the answer,
    # otherwise we'd have two or more matches for this answer.
    other_texts = [text for text in texts if answer not in text]
    # Try generating combinations with each of the matched texts.
    for matched_text in matched_texts:
        # Assuming we use `matched_text`, generate combinations
        # for the rest of answers/texts.
        for result in find_all_answer_combinations(other_texts, rest_answers, used_texts+(matched_text,)):
            yield (answer, matched_text), *result

def find_best_answers(texts, answers):
    # Finds the combination with most answers.
    return max(find_all_answer_combinations(texts, answers), key=len)

texts = ["this is string 1", "this is string 2", "this is string 3"]
answers = ["this", "string 1", "string 3"]
print(find_best_answers(texts, answers)) # (('string 1', 'this is string 1'), ('string 3', 'this is string 3'))

texts = ["this is string 1 this is string 2", "this is string 1", "this is string 3"]
answers = ["this", "string 1", "string 3"]
print(find_best_answers(texts, answers)) # (('string 1', 'this is string 1 this is string 2'), ('string 3', 'this is string 3'))

texts = ["this is string 1 this is string 2", "tis is string 1", "this is string 3"]
answers = ["tis", "string 1", "string 3"]
print(find_best_answers(texts, answers)) # (('string 1', 'this is string 1 this is string 2'), ('string 3', 'this is string 3'))

And this is a greedy algorithm that finds one combination of answers/texts pairs without conflicts. No guarantee on number of answers/texts returned:

def find_unique_answers(texts, answers):
    result = []
    forbidden_texts = []
    used_texts = []
    # Reversed order is a hack to get better results with the given examples.
    for answer in reversed(answers):
        # If the answer already matches a returned text, including
        # it here would result in two matching returned texts. So, we skip.
        if any(answer in text for text in used_texts): continue
        # Find all matching texts, ignoring the "forbidden ones"
        # that would conflict with used answers.
        matches = [text for text in texts if answer in text and text not in forbidden_texts]
        # No valid text, nothing to do. Skip.
        if not matches: continue
        # Arbitraryly select the first matching text.
        selected_text = matches[0]
        # We got ourselves a valid pair!
        result.append((answer, selected_text))
        # Remember the text we just used.
        used_texts.append(selected_text)
        # And add the other valid texts to "forbidden_texts",
        # so that future answers don't add extra matches to this one.
        forbidden_texts.extend(matches)
    return result

texts = ["this is string 1", "this is string 2", "this is string 3"]
answers = ["this", "string 1", "string 3"]
output = find_unique_answers(texts, answers)
print(output)  # Output: [('string 3', 'this is string 3'), ('string 1', 'this is string 1')]

texts = ["this is string 1 this is string 2", "this is string 1", "this is string 3"]
answers = ["this", "string 1", "string 3"]
output = find_unique_answers(texts, answers)
print(output) # [('string 3', 'this is string 3'), ('string 1', 'this is string 1 this is string 2')]

Answered By - BoppreH

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, February 2, 2024

[FIXED] Find unique answers from list of strings

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels