Enhancing AI Accuracy with Reader-Retriever Models + Example Code

As AI continues to advance, it has the power to transform the way we live and work. One example is ChatGPT, a chatbot developed by OpenAI that can answer questions, write blog posts, create computer programs, respond to emails, and much more.

While ChatGPT and other AI systems have impressive capabilities, they are not immune to mistakes. However, there are ways to improve the accuracy and credibility of AI-generated answers.

One way to improve the truthfulness of AI is the reader-retriever model. Such a system searches large amounts of text for passages relevant to a question and bases its answer on them. It can also be configured to display the sources of its answers, which adds transparency and makes the responses easier to verify.

In this blog post, we will build a reader-retriever model and explore its capabilities in providing accurate and well-sourced answers to your questions.

What is a Reader-Retriever Model?

The reader-retriever model takes a question and relevant context (e.g., sentences, paragraphs, or text documents) from a knowledge database and uses this context to answer the question.

A reader-retriever model combines a document retriever and a language model. To use it, you must first create a knowledge database by converting all the relevant information that the model should have access to (e.g., documents, books, etc.) into vectors. These vectors are then stored in a database.

When you ask a question, the retriever model converts the question into a question vector and compares this vector to the context vectors in the database. It then returns the contexts whose vectors are most similar to the question vector.

Vector similarity is used because texts that discuss similar topics or themes tend to produce similar vectors. Comparing the question vector with the context vectors therefore identifies the texts most likely to cover the subject of the question and, hopefully, to help answer it.

The reader model then examines the retrieved contexts to find the passage most likely to answer the question.

Overview of a Reader-Retriever System. Source: https://www.pinecone.io/learn/question-answering

An analogy for a reader-retriever model is a detective searching for a suspect in a crowded room. The detective has a description of the suspect, knows the room they are in, and must use that information to narrow the search to find the right suspect.

What is the “Retriever” in a Reader-Retriever Model?

The retriever model locates documents that can help answer a given question. Documents (e.g., texts, books, research papers, etc.) are first converted into vectors and stored in a database. Each incoming question is converted into a vector in the same way, but it is not stored.

The retriever model then searches for vectors similar to the query vector and returns the documents corresponding to the most similar vectors. These documents can then be used to answer the question.
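The search described above can be sketched in a few lines of plain Python. This is a toy illustration, not how a production retriever works: the `embed` function is a hypothetical stand-in that counts words, whereas a real retriever would use a trained neural encoder. The function names and the example documents are assumptions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k documents whose vectors are most similar to the query vector."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, embed(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


documents = [
    "The field of AI research was born at a workshop at Dartmouth College in 1956.",
    "A vector database stores numerical vectors and supports similarity search.",
    "Dogs and cats are common household pets.",
]

for score, doc in retrieve("When was AI research born?", documents):
    print(f"{score:.2f}  {doc}")
```

Swapping `embed` for a sentence-embedding model changes the quality of the vectors, but the ranking logic stays the same.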

What is the “Reader” in a Reader-Retriever Model?

A reader model takes a question and some context (the documents returned by the retriever) and attempts to answer the question based on the provided information. The answer could be a part of a document (e.g., a sentence containing the answer) or text generated using a language model like GPT-3.

What is a Vector Database?

A vector database stores numerical vectors and supports fast searches for similar vectors. This is particularly useful for the retriever model, which uses vector similarity to locate documents relevant to a given question or prompt.

What is a Vector Space?

A vector space is a collection of vectors. In natural language processing, the vector spaces used to represent text are called embedding spaces. An embedding space represents data points, such as words or sentences, as points in a high-dimensional space in which distance reflects similarity. This means that if two data points are similar in meaning, they will be close together in the embedding space, and if they are dissimilar, they will be farther apart.

An example of a vector space: Since a dog and a cat have more in common than, for example, a dog and an apple, the vectors in the vector space representing a dog and a cat are closer together.
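The dog, cat, and apple example can be made concrete with a few numbers. The three-dimensional vectors below are hand-picked for illustration and do not come from any real embedding model; real embeddings typically have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only.
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def distance(a: str, b: str) -> float:
    """Euclidean distance between the embeddings of two words."""
    return math.dist(embeddings[a], embeddings[b])


print(f"dog-cat:   {distance('dog', 'cat'):.3f}")
print(f"dog-apple: {distance('dog', 'apple'):.3f}")
```

With these made-up vectors, the dog-cat distance is much smaller than the dog-apple distance, which is the property an embedding space is meant to have.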

If you want to know more about vectors and their use in data science and NLP, you are welcome to read my other blog post, where I covered vectors and vector spaces in more detail: What are Tokens, Vectors, and Embeddings & How do you create them?

How do you train a Reader-Retriever Model?

To train a reader-retriever model, you typically use supervised learning on labeled examples of questions, contexts, and answers. The goal is for the system to identify which documents contain the correct answer and to produce the correct answer from these documents.

The retriever takes care of the first part (finding the most relevant documents), and the reader is responsible for the second part (returning or generating the answer). The training process of a reader-retriever system can be broken down into six steps:

The training process of a reader-retriever system. Image by the author.

  1. Gather a large dataset of relevant documents, such as articles or news, to train the model.
  2. Pre-process the dataset by cleaning and formatting the text, such as removing special characters, converting it to lowercase, and breaking the text into individual words, sentences, or paragraphs.
  3. Divide the dataset into a training set and a validation set. The training set will be used to train the model, and the validation set will be used to evaluate the model’s performance during training.
  4. Train the reader-retriever model using the training set. This typically involves giving the model queries and the corresponding relevant documents from the training set and using an optimization algorithm to adjust the model’s parameters to minimize an error function.
  5. Evaluate the model’s performance on the validation set. This might involve measuring the model’s ability to correctly identify relevant documents for a given query or other evaluation metrics such as precision and recall.
  6. If the model’s performance is not satisfactory, adjust the model’s hyperparameters or modify the model architecture and repeat the training process. If the model’s performance is sufficient, the training process is complete, and the model is ready for deployment.
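Step 3 of the list above can be sketched as a small helper. The function name, the 80/20 ratio, the fixed seed, and the example pairs are all assumptions made for illustration.

```python
import random


def train_validation_split(examples, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split it into training and validation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical (question, relevant_document_id) pairs.
examples = [(f"question {i}", f"doc-{i}") for i in range(10)]

train_set, validation_set = train_validation_split(examples)
print(len(train_set), "training examples,", len(validation_set), "validation examples")
```

Keeping the validation examples out of training is what makes the evaluation in step 5 meaningful.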

A typical dataset for this task is the Stanford Question Answering Dataset. Each element of this dataset contains a context, multiple questions, and ground truth answers.
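To make that structure concrete, here is a hand-written element in the shape used by the SQuAD v2.0 JSON files (`context`, `qas`, `answers`, `answer_start`, `is_impossible`). The context sentence, the questions, and the ids are invented for illustration and are not taken from the dataset.

```python
context = (
    "The field of AI research was born at a workshop at "
    "Dartmouth College in 1956."
)


def make_answer(text: str) -> dict:
    """Build an answer entry; answer_start is the character offset in the context."""
    return {"text": text, "answer_start": context.index(text)}


paragraph = {
    "context": context,
    "qas": [
        {
            "id": "example-0001",  # hypothetical id
            "question": "When was the field of AI research born?",
            "answers": [make_answer("1956")],
            "is_impossible": False,
        },
        {
            "id": "example-0002",  # hypothetical id
            "question": "Where was the workshop held?",
            "answers": [make_answer("Dartmouth College")],
            "is_impossible": False,
        },
    ],
}

for qa in paragraph["qas"]:
    answer = qa["answers"][0]
    start = answer["answer_start"]
    # The offset must point at the answer text inside the context.
    assert context[start:start + len(answer["text"])] == answer["text"]
    print(qa["question"], "->", answer["text"])
```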

To create your system, you can train the retriever to find the most relevant sentences in the provided context, and the reader should return the correct answer from these sentences.

An example element of the Stanford Question Answering Dataset. Source: https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/Black_Death.html?model=nlnet (single model) (Microsoft Research Asia)&version=v2.0

Training the Retriever

To train a retriever model, you need to consider the complexity of the questions you want to ask. Questions can range from easy (e.g., “What is the name of the first planet in our Solar System?”) to complex (e.g., “What is dilated cardiomyopathy?”).

If you plan on asking more complex questions, using a retriever model that has been specifically trained on your data may be necessary to achieve accurate results.

Remember, the goal of the retriever model is to understand your question and find the most relevant documents in a vector database. It encodes your queries and the documents into the same vector space and returns the most similar vectors.

For instance, a custom retriever model trained on medical documents will likely be more proficient at understanding and interpreting these documents.

A model trained on a standard dataset may be less effective at understanding and interpreting them. Because it lacks knowledge of medical vocabulary, it may return documents that are close to the question in the vector space but do not contain the correct answer.

If you’re interested in learning more about how to train or fine-tune your retriever model, here is an excellent tutorial on this topic: https://huggingface.co/blog/how-to-train-sentence-transformers.

Training the Reader Model

The reader model is trained with supervised learning on labeled examples, each consisting of a question, its context, and the correct answer. When given a question and supporting documents, the model uses this context to produce the answer, much like a student in an open-book exam who looks up the answers in a set of resources.

How do you evaluate a Reader-Retriever Model?

To evaluate a reader-retriever model, you should check the reader’s accuracy in answering questions and examine the retriever’s ability to find relevant articles or passages to support the reader.

The evaluation process can be divided into the following steps:

  1. Define the evaluation criteria: Establish the metrics that will be used to evaluate the system’s performance, such as accuracy, precision, and recall.
  2. Prepare a test dataset: Assemble a diverse set of test questions and the corresponding documents that contain the answers.
  3. Run the system on the test dataset and analyze the results: Use the established evaluation criteria to assess the system’s performance and identify areas for improvement.

An illustration of the evaluation process of a reader-retriever system.

To evaluate the performance of a retriever, you can use several metrics, including precision, recall, and the F1 score.

Precision is the fraction of retrieved documents that are relevant. Recall is the fraction of all relevant documents in the collection that the retriever actually returned. The F1 score is the harmonic mean of precision and recall, so it reflects both aspects of the performance.
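These three metrics can be computed directly from the set of retrieved document ids and the set of relevant document ids. The function name and the document ids below are made up for illustration.

```python
def precision_recall_f1(retrieved: list[str], relevant: set[str]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for one query."""
    retrieved_set = set(retrieved)
    true_positives = len(retrieved_set & relevant)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical result for a single query: 4 documents retrieved, 3 relevant overall.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = {"doc1", "doc3", "doc5"}

precision, recall, f1 = precision_recall_f1(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example two of the four retrieved documents are relevant, and two of the three relevant documents were found. In practice these values are averaged over all queries in the test dataset.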

Other metrics for evaluating the reader include the model’s speed, the quality and coherence of its answers, and its ability to handle complex or poorly-formatted questions.
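For the reader's answer accuracy, results on SQuAD-style data are commonly reported as exact match and token-level F1. The sketch below is a simplified version of that idea; the normalization rules shown here are an assumption and may differ in detail from the official evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> bool:
    """True if both answers are identical after normalization."""
    return normalize(prediction) == normalize(truth)


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall between two answers."""
    pred_tokens, truth_tokens = normalize(prediction), normalize(truth)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("1956.", "1956"))
print(round(token_f1("in 1956", "1956"), 2))
```

Exact match is strict, while token-level F1 gives partial credit when the predicted span is slightly longer or shorter than the ground truth.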

If you want an in-depth guide for evaluating reader-retriever systems, this tutorial will show you everything you need step-by-step: https://haystack.deepset.ai/tutorials/05_evaluation.

Example: Open-domain question answering

An open-domain question-answering system can answer questions about a broad range of topics. It relies on a large amount of data and knowledge, such as the documents in a vector database, to locate the answer to a given question.

For example, if the system is asked a question about a particular article, such as “What did Albert Einstein win the Nobel Prize for?” it will search through its data to find the correct answer.

Reader-retriever models are often used for these systems. The retriever narrows down the search by identifying related documents that may contain the answer, and the reader uses those documents to find the answer.

The following code snippets show the main steps in building a reader-retriever system. The complete source code for this tutorial is available on this Google Colab.

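# We use the haystack framework in this tutorial.
# Haystack is a framework that allows you to quickly build NLP systems.
# The imports below assume Haystack 1.x (pip install farm-haystack) and the wikipedia package.
import wikipedia
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever, FARMReader, PreProcessor
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import print_answers

# First we need to create a vector database. In haystack these are called document stores.
document_store = InMemoryDocumentStore()

# Next we need some data.
# Fetch 10 wikipedia pages with the search term "Artificial Intelligence".
wiki_pages = []
for result in wikipedia.search("Artificial Intelligence", results=10):
    try:
        wiki_pages.append(wikipedia.page(result))
    except wikipedia.exceptions.WikipediaException:
        # Skip ambiguous or missing pages instead of aborting the whole download.
        print("Error: Could not get page for {}".format(result))

# Convert the pages into haystack documents and keep the title and URL as metadata.
documents = [
    Document(content=page.content, meta={"name": page.title, "url": page.url})
    for page in wiki_pages
]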

# Create the retriever based on the all-mpnet-base-v2 sentence transformer model.
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    model_format="sentence_transformers",
    progress_bar=False,
)

# Create the reader - here we use a RoBERTa transformer which was fine-tuned
# on the Stanford Question Answering Dataset.
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

# The pre-processor function is used to clean our datasets and to split each
# document into paragraphs that are max. four sentences long with one sentence overlap.
processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="sentence",
    split_length=4,
    split_respect_sentence_boundary=False,
    split_overlap=1
)

# Preprocess the documents
processed_documents = processor.process(documents)

# Write documents to the document store
document_store.write_documents(processed_documents)

# Create embeddings from our documents in the document store
document_store.update_embeddings(retriever)


# Now the fun part. Here we create the QA pipeline.
pipe = ExtractiveQAPipeline(reader, retriever)

query = "When was Artificial intelligence founded?"

prediction = pipe.run(
    query=query,
    params={"Retriever": {"top_k": 25}, "Reader": {"top_k": 3}}
)

print_answers(prediction, details="minimum")

""" Output:
Query: When was Artificial intelligence founded?
Answers:
[   {   'answer': '1956',
        'context': '.The field of AI research was born at a workshop at '
                   'Dartmouth College in 1956. The attendees became the '
                   'founders and leaders of AI research. They and '},
    {   'answer': '1956',
        'context': 'rtificial intelligence research was founded as an academic '
                   'discipline in 1956. === Cybernetics and early neural '
                   'networks === The earliest research int'},
    {   'answer': '1956',
        'context': 'rkshop held on the campus of Dartmouth College, USA during '
                   'the summer of 1956. Those who attended would become the '
                   'leaders of AI research for decades.'}]
"""

The output of the print_answers function provides three answers to the query, along with the context from which each answer was derived. If you want to view the sources for each answer, you can examine the predictions.

"""
 'answer': '1956',
 'type': 'extractive',
 'score': 0.9879469871520996,
 'context': '.The field of AI research was born at a workshop at Dartmouth College in 1956. The attendees became the founders and leaders of AI research. They and ',
 'offsets_in_document': [{'start': 227, 'end': 231}],
 'offsets_in_context': [{'start': 73, 'end': 77}],
 'document_id': 'db69c006cfe7878004adcef161c14ae6',
 'meta': {
 'name': 'Artificial intelligence',
 'url': 'https://en.wikipedia.org/wiki/Artificial_intelligence',
 '_split_id': 10}
"""

The answer object includes all the necessary information about the answers and their sources. Within this object, you can access the meta field to view the name of the Wikipedia article and the URL from which the context was derived. This information can help verify the accuracy and credibility of the answers.
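A small helper can collect the sources for all answers at once. This sketch assumes that `prediction["answers"]` is a list of objects exposing `answer`, `score`, and `meta` attributes, as Haystack 1.x answer objects do. The helper name `list_sources` is hypothetical, and the stand-in prediction below is built from the values shown in the output above so that the snippet runs without Haystack installed.

```python
from types import SimpleNamespace


def list_sources(prediction: dict) -> list[dict]:
    """Collect answer text, score, and source metadata from a prediction."""
    sources = []
    for answer in prediction["answers"]:
        meta = answer.meta or {}
        sources.append(
            {
                "answer": answer.answer,
                "score": answer.score,
                "name": meta.get("name"),
                "url": meta.get("url"),
            }
        )
    return sources


# Stand-in for the real pipeline output (assumption: same attribute names).
example_prediction = {
    "answers": [
        SimpleNamespace(
            answer="1956",
            score=0.9879469871520996,
            meta={
                "name": "Artificial intelligence",
                "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
                "_split_id": 10,
            },
        )
    ]
}

for source in list_sources(example_prediction):
    print(f"{source['answer']} ({source['score']:.2f}) from {source['name']}: {source['url']}")
```

With the real pipeline, you would pass the `prediction` returned by `pipe.run` instead of `example_prediction`.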

Conclusion

The reader-retriever model is a valuable resource for researchers and individuals who need to find accurate and reliable answers to questions on a particular topic. It allows users to quickly locate and verify the sources of its response, which can help them fact-check and learn about new subjects.

As AI technology advances, the reader-retriever model will likely become a widely used tool in artificial intelligence. It is already used in many applications, and its potential for future use is still being explored as we work toward more complex projects.

