Sunday, January 21, 2024

[FIXED] Langchain / ChromaDB: Why does VectorStore return so many duplicates?

Labels: chromadb, langchain, openai-api, py-langchain, python

Issue

import os
from langchain.document_loaders import UnstructuredFileLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

os.environ["OPENAI_API_KEY"] = "KEY"

loader = UnstructuredFileLoader(
    'path_to_file'
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

retrieved_docs = retriever.get_relevant_documents(
    "What is X?"
)

This returns:

[Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
 Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
 Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
 Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
 Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
 Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932})]

All six results appear to be the same document.

When I first ran this code in Google Colab/Jupyter Notebook, it returned different documents. As I ran it more times, it started returning the same document over and over. This makes me suspect a database issue, where the same entries are inserted into the database on each run.

How do I return 6 unique documents?

Solution

The issue is here:

Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

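Every time you run this code, you insert the same documents into the database again. The collection therefore accumulates identical copies of each chunk, and a similarity search returns those copies as its top matches.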
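One way to make reruns harmless is to give every chunk a deterministic ID, so that inserting the same chunk twice targets the same record. The following is only a minimal sketch: `stable_ids` is a hypothetical helper (not part of LangChain or Chroma), and the `SimpleNamespace` objects are stand-ins for LangChain `Document` objects so that the example runs without an API key.

```python
import hashlib
from types import SimpleNamespace


def stable_ids(splits):
    """Return one deterministic ID per chunk (hypothetical helper).

    The ID is a SHA-256 hash of the chunk's source, start index and text,
    so the same chunk always maps to the same ID across runs.
    """
    ids = []
    for doc in splits:
        key = "\x1f".join([
            str(doc.metadata.get("source", "")),
            str(doc.metadata.get("start_index", "")),
            doc.page_content,
        ])
        ids.append(hashlib.sha256(key.encode("utf-8")).hexdigest())
    return ids


# Stand-ins for LangChain Document objects (assumption: only the
# page_content and metadata attributes are needed here).
splits = [
    SimpleNamespace(page_content="alpha", metadata={"source": "path_to_file", "start_index": 0}),
    SimpleNamespace(page_content="beta", metadata={"source": "path_to_file", "start_index": 800}),
]

first_run = stable_ids(splits)
second_run = stable_ids(splits)
print(first_run == second_run)  # True: IDs do not change between runs
```

With the code from the question, you would pass `ids=stable_ids(all_splits)` to `Chroma.from_documents(...)`. As far as I know the LangChain Chroma wrapper upserts by ID, so a rerun should overwrite the existing records rather than add copies, but it is worth verifying that behaviour on the version you have installed.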

You could comment out that line if you are inserting from the same file, or you could drop near-identical results with EmbeddingsRedundantFilter, which its documentation describes as:

Filter that drops redundant documents by comparing their embeddings.
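To make the idea of comparing embeddings concrete, here is a rough pure-Python sketch of that kind of filtering. It is not the library's implementation: `cosine_similarity`, `drop_redundant`, the toy vectors and the 0.95 threshold are all illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant(items, embeddings, threshold=0.95):
    """Keep an item only if its embedding is below the similarity
    threshold against every item that has already been kept."""
    kept_items, kept_vectors = [], []
    for item, vector in zip(items, embeddings):
        if all(cosine_similarity(vector, seen) < threshold for seen in kept_vectors):
            kept_items.append(item)
            kept_vectors.append(vector)
    return kept_items


# Toy data: the first two "chunks" have identical embeddings.
texts = ["chunk A", "chunk A (copy)", "chunk B"]
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

unique_texts = drop_redundant(texts, vectors)
print(unique_texts)  # ['chunk A', 'chunk B']
```

Note that filtering after retrieval can leave you with fewer than 6 results: if all 6 hits are copies of one chunk, only one survives. If you go this route, you would probably want to request a larger k and trim the filtered list, or stop the duplicates from being inserted in the first place.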

Answered By - Yilmaz

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
