Évaluation des systèmes RAG

Ismail Mebsout
23 octobre 2024
21 min
Sommaire

Le RAG est une application LLM très populaire permettant le Q&A sur vos données en tirant parti des capacités de raisonnement du modèle. La question est généralement soit :

  • Objective : vrai-faux, choix multiple, mots-nombres exacts, etc., principalement utilisée dans les pipelines d'extraction de points clés.
  • Subjective : court paragraphe élaboré utilisé dans les use-cases généraux de Q&A

De nombreux développeurs ont cherché des outils pour évaluer les pipelines RAG via une approche TDD habituelle, étant donné la grande variabilité des entrées (qui peuvent également changer de version au fil du temps) et la volatilité de la sortie générée par les LLMs.

Dans cet article, nous passons en revue les différents frameworks d'évaluation qui peuvent être utilisés lors du travail sur des systèmes RAG.

Le sommaire est le suivant :

  1. Concept d'évaluation
  2. Outlines
  3. RAGAS
  4. Custom Framework

Concept d'évaluation

L'évaluation d'un pipeline RAG est un processus itératif qui consiste à faire correspondre sa sortie à des cas de test prédéfinis. La database de test contient généralement le tuple (question, context, ground_truth, predicted_answer) et peut être soit :

  • Réelle : générée manuellement ou collectée à partir des feedbacks utilisateurs
  • Synthétique : générée à l'aide d'un LLM qui produit un ensemble de questions et de réponses à partir d'un paragraphe de texte.

Les données d'évaluation doivent naturellement contenir des questions uniques et diverses représentatives à la fois de la tâche et de la connaissance interrogée. Il est également recommandé d'inclure des cas de test où les questions subjectives vont droit au but, ce qui rend les réponses plus courtes et factuellement uniques.

MDD, Metric-Driven Development

Étant donné la nature des questions (objectives et subjectives), deux options sont envisagées :

  • TDD, Test Driven Development : propulsé par Outlines, une librairie Python qui exploite les LLMs pour renvoyer un format de sortie spécifique (basé sur des Regex par exemple). Il peut être combiné avec unit testing pour évaluer les questions objectives.
  • MDD, Metric Driven Development : une approche produit propulsée par un framework d'évaluation (Package ou custom). Elle repose sur l'évaluation de KPIs spécifiques pour surveiller en continu la performance du pipeline RAG. Les KPIs sont généralement des scores entre 0 et 1 et générés par un LLM étant donné la variété grammaticale des manières de répondre aux questions subjectives.
Processus d'évaluation

Outlines

Outlines est une librairie python utilisée pour forcer le format de sortie de l'étape de completion en s'appuyant sur un autre Large Language Model (OpenAI, transformers, llama.cpp, …). Cela garantit la fiabilité du pipeline RAG en permettant une sortie plus contrôlable et prévisible du LLM. Le format peut être : type, regex, modèle Pydantic ou basé sur JSON.

La sortie générée peut être évaluée à l'aide de l'unit testing. Vous pouvez consulter mon article précédent qui aborde leur implémentation à l'aide du framework Pytest.

Framework Outlines

Script Python

À titre d'illustration, nous considérerons un UC RAG où nous essayons de récupérer des informations relatives à une entreprise à partir d'une knowledge database.

from pydantic import BaseModel, constr
import outlines
import os

COMLPETION_ENGINE='gpt-4-turbo'
API_VERSION=OPENAI_CRED[COMLPETION_ENGINE]['API_VERSION']
API_BASE=OPENAI_CRED[COMLPETION_ENGINE]['AZURE_ENDPOINT']
API_KEY=OPENAI_CRED[COMLPETION_ENGINE]['API_KEY']

os.environ["AZURE_OPENAI_BASE"] = API_BASE
os.environ["AZURE_OPENAI_API_VERSION"]=API_VERSION
os.environ["AZURE_OPENAI_API_KEY"]=API_KEY

class Field(str, Enum):
    technology= "technology"
    healthcare="healthcare"
    finance="finance"
    consumer_good="consumer_good"
    energy="energy"


class Company(BaseModel):
    name: constr(max_length=24)
    street_number: int
    street_name: str
    office_number: str
    zip_code: str
    city: constr(max_length=24)
    country_code: constr(max_length=5)
    cin:int #corportate identification number
    field:Field

model=outlines.models.openai(COMLPETION_ENGINE)

# Construct structured sequence generator
generator = outlines.generate.json(model, Company)

#Prompt
prompt_template="""
Given the following context, extract the information related to the company:
context:{{context}}
"""

#context if the output of the retriever
result = generator(prompt_template.replace("{{context}}",context), rng=rng)

RAGAS

RAGAS est un framework utilisé pour l'évaluation du pipeline RAG par composant (Retriever et Generator) ainsi que de bout en bout pour surveiller et améliorer sa qualité.

Globalement, il nécessite 4 entrées : Question, Ground truths, Contexts, Answer.

Framework RAGAS

Context Precision

Definition : Context Precision est une métrique qui évalue si tous les éléments pertinents ground-truth présents dans les contexts sont classés plus haut ou non. Idéalement, tous les chunks pertinents doivent apparaître aux premiers rangs. Cette métrique est calculée à l'aide de la question et du context, avec des valeurs comprises entre 0 et 1, où des scores plus élevés indiquent une meilleure précision.

Où k est le nombre total de chunks dans les contexts

Input : (context, ground truth)

Exemple :

  • Question : Where is France and what is its capital?
  • Ground truth : France is in Western Europe and its capital is Paris.
  • High score : [“France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”, “The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.”]
  • Low score : [“The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and”, “France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”,]

Context Recall

Definition : Context recall mesure dans quelle mesure le context récupéré s'aligne avec la réponse annotée, traitée comme la ground truth. Elle est calculée sur la base de la ground truth et du context récupéré, et les valeurs sont comprises entre 0 et 1, des valeurs plus élevées indiquant de meilleures performances.

Input : (context, ground truth)

Exemple :

  • Question : Where is France and what is its capital?
  • Ground truth : France is in Western Europe and its capital is Paris.
  • High score : France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the EiffelTower.
  • Low score : France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.

Context Relevancy

Definition : Cette métrique évalue la pertinence du context récupéré, calculée à partir à la fois de la question et du context. Les valeurs se situent dans la plage (0, 1), des valeurs plus élevées indiquant une meilleure pertinence. Idéalement, le context récupéré devrait contenir exclusivement les informations essentielles pour répondre à la requête fournie.

Input : (question, contexts)

Exemple :

  • Question : What is the capital of France?
  • High score : France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.
  • Low score : France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.

Context Entity Recall

Definition : Cette métrique donne la mesure du recall du context récupéré, basée sur le nombre d'entités présentes à la fois dans les ground truths et le context par rapport au nombre d'entités présentes dans les ground truths seules. En termes simples, c'est une mesure de la fraction d'entités rappelées à partir des ground truths. Cette métrique est utile dans des use cases factuels comme un helpdesk touristique, du QA historique, etc. Cette métrique peut aider à évaluer le mécanisme de récupération des entités, basé sur la comparaison avec les entités présentes dans les ground truths, car dans les cas où les entités importent, nous avons besoin des contexts qui les couvrent.

Où CE : Context Entities, GE : Ground truth Entities

Input : (ground truths, context)

Exemple :

  • Ground Truth : The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in the Indian city of Agra. It was commissioned in 1631 by the Mughal emperor Shah Jahan to house the tomb of his favorite wife, Mumtaz Mahal.
  • High score : The Taj Mahal is a symbol of love and an architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.
  • Low score : The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.

Faithfulness

Definition : Cela mesure la cohérence factuelle de la réponse générée par rapport au context donné. Elle est calculée à partir de la réponse et du context récupéré. La réponse est mise à l'échelle dans la plage (0,1). Plus c'est élevé, mieux c'est. La réponse générée est considérée comme faithful si toutes les affirmations qui sont faites dans la réponse peuvent être inférées du context donné

Input : (contexts, answer)

Exemple :

  • Question : Where and when was Einstein born?
  • Context : Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time
  • High score : Einstein was born in Germany on 14th March 1879.
  • Low score : Einstein was born in Germany on 20th March 1879.

Answer Relevancy

Definition : La métrique d'évaluation, Answer Relevancy, se concentre sur l'évaluation de la pertinence de la réponse générée par rapport au prompt donné. Un score plus bas est attribué aux réponses incomplètes ou contenant des informations redondantes. Cette métrique est calculée à l'aide de la question et de la réponse, avec des valeurs comprises entre 0 et 1, où des scores plus élevés indiquent une meilleure pertinence.

Où generatedQuestion est générée par le LLM à partir de la réponse

Input : (question, answer)

Exemple :

  • Question : Where is France and what is its capital?
  • High score : France is in western Europe and Paris is its capital.
  • Low score : France is in Western Europe.

Answer Semantic Similarity

Definition : Le concept d'Answer Semantic Similarity se rapporte à l'évaluation de la ressemblance sémantique entre la réponse générée et la ground truth. Cette évaluation est basée sur la ground truth et la réponse, avec des valeurs comprises entre 0 et 1. Un score plus élevé signifie un meilleur alignement entre la réponse générée et la ground truth

Input : (Ground truth, Answer)

Exemple :

  • Ground truth : Albert Einstein’s theory of relativity revolutionized our understanding of the universe.”
  • High score : Einstein’s groundbreaking theory of relativity transformed our comprehension of the cosmos.
  • Low score : Isaac Newton’s laws of motion greatly influenced classical physics.

Answer correctness

Definition : l'évaluation de l'Answer Correctness consiste à mesurer la précision de la réponse générée par rapport à la ground truth. Cette évaluation repose sur la ground truth et la réponse, avec des scores allant de 0 à 1. Un score plus élevé indique un meilleur alignement entre la réponse générée et la ground truth, signifiant une meilleure correctness

Input : (Answer, Ground Truth)

Exemple :

  • Ground truth : Einstein was born in 1879 in Germany.
  • High score : High answer correctness: In 1879, Einstein was born in Germany.
  • Low score : Low answer cor4rectness: Einstein was born in Spain in 1879.

NB : les définitions ci-dessus sont inspirées de la documentation officielle du framework RAGAS

Script Python

Ci-dessous le script pour exécuter le framework d'évaluation RAGAS avec Azure OpenAI. Vous pouvez également utiliser un LLM open-source.

## Imports
from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai.chat_models import AzureChatOpenAI
from ragas.llms import LangchainLLM
from ragas import evaluate
from datasets import Dataset
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    answer_similarity,
    context_precision,
    context_recall,
    context_relevancy,
    faithfulness,
)import pandas as pd
import os
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

## Set openai env vars
COMLPETION_ENGINE='gpt-4-turbo'
API_VERSION=OPENAI_CRED[COMLPETION_ENGINE]['API_VERSION']
API_BASE=OPENAI_CRED[COMLPETION_ENGINE]['AZURE_ENDPOINT']
API_KEY=OPENAI_CRED[COMLPETION_ENGINE]['API_KEY']

os.environ["AZURE_OPENAI_BASE"] = API_BASE
os.environ["AZURE_OPENAI_API_VERSION"]=API_VERSION
os.environ["AZURE_OPENAI_API_KEY"]=API_KEY

## Embedding and Completion models
def get_azure_openai_embed_gener():
    embeddings = AzureOpenAIEmbeddings(
        openai_api_version=API_VERSION,
        azure_endpoint=API_BASE,
        azure_deployment=EMBDEDING_ENGINE,
        model=EMBDEDING_ENGINE,
    )
    llm = AzureChatOpenAI(
        openai_api_version=API_VERSION,
        azure_endpoint=API_BASE,
        azure_deployment=COMLPETION_ENGINE,
        model=COMLPETION_ENGINE,
    )
    return embeddings, llm

## Ragas metrics
def get_metrics(
  embeddings,
  llm,
):
    metrics = [
        answer_correctness,
        answer_relevancy,
        answer_similarity,
        context_precision,
        context_recall,
        context_relevancy,
        faithfulness,
    ]
    for m in metrics:
        m.__setattr__("llm", llm)
        if hasattr(m, "embeddings"):
            m.__setattr__("embeddings", embeddings)
    answer_correctness.faithfulness = faithfulness
    answer_correctness.answer_similarity = answer_similarity
    return metrics

## Main script
#Load embedding and LLM models
embeddings, llm = get_azure_openai_embed_gener()
llm = LangchainLLM(llm=llm)
metrics = get_metrics(embeddings, llm)
#df has columns:
  #question:str
  #ground_truths:List[str]
  #answer:List[str]
  #contexts:List[str]
#Launch evaluation
ragas_eval= evaluate(
    Dataset.from_pandas(df),
    metrics=metrics,
    raise_exceptions=False,
).to_pandas()

Custom Framework

Les custom frameworks sont une autre façon d'automatiser votre évaluation en fonction de tâches spécifiques dans votre pipeline RAG. Cela s'avère pratique lorsqu'il est combiné à du few-shots prompting lié à votre use case. Ceux-ci spécifieront des exemples d'entrées et leurs scores associés.

Cette approche MDD consiste à définir pour chaque métrique custom :

  • Definition : comparaison explicite qui doit être inférée par le LLM
  • Inputs : qui sont comparés
  • Model : embedding ou completion
  • Few shots : cas de référence pour l'étape de scoring
  • Output : généralement un score entre 0 et 1 ou 1 et 5
Custom Evaluation framework

Script Python

Considérons cinq principales métriques similaires à celles de RAGAS comme exemple :

  • Answer similarity : À quel point les faits des réponses sont proches ?
  • Answer fluency : Quelle est la qualité de la réponse générée ?
  • Answer groundedness : À quel point la réponse est basée sur le context fourni ?
  • Answer relevancy : à quel point la réponse est pertinente par rapport à la question ?
  • Answer coherence : À quel point les différentes phrases de la réponse sont cohérentes les unes avec les autres ?

Les métriques ci-dessus seront calculées en utilisant gpt-4-turbo comme LLM judge.

NB : N'oubliez pas d'adapter les few shots avec des exemples de votre use-case.

#Few shots inspired from the prompt flow evaluation of Microsoft
from dataclasses import dataclass

@dataclass
class answer_similarity:
    question: str
    ground_truth: str
    answer: str
    name: str = "answer_similarity"
    prompt_template: str = """
    System:
    You are an AI assistant. You will be given a definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your task is to compute an accurate evaluation score using the provided metric. Provide only the metric value without any additional text or explanation.

    User:
    Equivalence measures the similarity between the predicted answer and the correct answer. If the information and facts in the predicted answer are similar or equivalent to the correct answer, the Equivalence metric value should be high; otherwise, it should be low. Given the question, correct answer, and predicted answer, determine the Equivalence metric value using the following rating scale:

    One star: the predicted answer is not at all similar to the correct answer
    Two stars: the predicted answer is mostly not similar to the correct answer
    Three stars: the predicted answer is somewhat similar to the correct answer
    Four stars: the predicted answer is mostly similar to the correct answer
    Five stars: the predicted answer is completely similar to the correct answer
    The rating value should always be an integer between 1 and 5. The rating should be 1, 2, 3, 4, or 5.

    {{few_shots}}

    question: {{question}}
    correct answer: {{ground_truth}}
    predicted answer: {{answer}}
    stars:
    """
    few_shots: str = """
    The examples below show the Equivalence score for a question, a correct answer, and a predicted answer.

    question: What is the role of ribosomes?
    correct answer: Ribosomes are cellular structures responsible for protein synthesis. They interpret the genetic information carried by messenger RNA (mRNA) and use it to assemble amino acids into proteins.
    predicted answer: Ribosomes participate in carbohydrate breakdown by removing nutrients from complex sugar molecules.
    stars: 1

    question: Why did the Titanic sink?
    correct answer: The Titanic sank after it struck an iceberg during its maiden voyage in 1912. The impact caused the ship's hull to breach, allowing water to flood into the vessel. The ship's design, lifeboat shortage, and lack of timely rescue efforts contributed to the tragic loss of life.
    predicted answer: The sinking of the Titanic was a result of a large iceberg collision. This caused the ship to take on water and eventually sink, leading to the death of many passengers due to a shortage of lifeboats and insufficient rescue attempts.
    stars: 2

    question: What causes seasons on Earth?
    correct answer: Seasons on Earth are caused by the tilt of the Earth's axis and its revolution around the Sun. As the Earth orbits the Sun, the tilt causes different parts of the planet to receive varying amounts of sunlight, resulting in changes in temperature and weather patterns.
    predicted answer: Seasons occur because of the Earth's rotation and its elliptical orbit around the Sun. The tilt of the Earth's axis causes regions to be subjected to different sunlight intensities, which leads to temperature fluctuations and alternating weather conditions.
    stars: 3

    question: How does photosynthesis work?
    correct answer: Photosynthesis is a process by which green plants and some other organisms convert light energy into chemical energy. This occurs as light is absorbed by chlorophyll molecules, and then carbon dioxide and water are converted into glucose and oxygen through a series of reactions.
    predicted answer: In photosynthesis, sunlight is transformed into nutrients by plants and certain microorganisms. Light is captured by chlorophyll molecules, followed by the conversion of carbon dioxide and water into sugar and oxygen through multiple reactions.
    stars: 4

    question: What are the health benefits of regular exercise?
    correct answer: Regular exercise can help maintain a healthy weight, increase muscle and bone strength, and reduce the risk of chronic diseases. It also promotes mental well-being by reducing stress and improving overall mood.
    predicted answer: Routine physical activity can contribute to maintaining ideal body weight, enhancing muscle and bone strength, and preventing chronic illnesses. In addition, it supports mental health by alleviating stress and augmenting general mood.
    stars: 5
    """

    def prompt(self, use_few_shots: bool = False):
        few_shots = ""
        if use_few_shots:
            few_shots = self.few_shots
        res = (
            self.prompt_template.replace("{{question}}", self.question)
            .replace("{{ground_truth}}", self.ground_truth)
            .replace("{{answer}}", self.answer)
            .replace("{{few_shots}}", few_shots)
        )
        return res


@dataclass
class answer_fluency:
    question: str
    answer: str
    name: str = "answer_fluency"
    prompt_template: str = """
    System:
    You are an AI assistant. You will be given a definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. Provide only the metric value without any additional text or explanation.

    User:
    Fluency measures the quality of individual sentences in the answer and whether they are well-written and grammatically correct. Given the question and answer, score the fluency of the answer between one to five stars using the following rating scale:

    One star: the answer completely lacks fluency
    Two stars: the answer mostly lacks fluency
    Three stars: the answer is partially fluent
    Four stars: the answer is mostly fluent
    Five stars: the answer has perfect fluency
    The rating value should always be an integer between 1 and 5. The rating should be 1, 2, 3, 4, or 5.

    {{few_shots}}

    question: {{question}}
    answer: {{answer}}
    stars:
    """
    few_shots: str = """
    question: What did you have for breakfast today?
    answer: Breakfast today, me eating cereal and orange juice very good.
    stars: 1

    question: How do you feel when you travel alone?
    answer: Alone travel, nervous, but excited also. I feel adventure and like its time.
    stars: 2

    question: When was the last time you went on a family vacation?
    answer: Last family vacation, it took place in last summer. We traveled to a beach destination, very fun.
    stars: 3

    question: What is your favorite thing about your job?
    answer: My favorite aspect of my job is the chance to interact with diverse people. I am constantly learning from their experiences and stories.
    stars: 4

    question: Can you describe your morning routine?
    answer: Every morning, I wake up at 6 am, drink a glass of water, and do some light stretching. After that, I take a shower and get dressed for work. Then, I have a healthy breakfast, usually consisting of oatmeal and fruits, before leaving the house around 7:30 am.
    stars: 5
    """

    def prompt(self, use_few_shots: bool = False):
        few_shots = ""
        if use_few_shots:
            few_shots = self.few_shots
        res = (
            self.prompt_template.replace("{{question}}", self.question)
            .replace("{{answer}}", self.answer)
            .replace("{{few_shots}}", few_shots)
        )
        return res


@dataclass
class answer_groundedness:
    context: str
    answer: str
    name: str = "answer_groundedness"
    prompt_template: str = """
    System:
    You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. Provide only the metric value without any additional text or explanation.

    User:
    You will be presented with a CONTEXT and an ANSWER about that CONTEXT. You need to decide whether the ANSWER is entailed by the CONTEXT by choosing one of the following ratings:

    5: The ANSWER follows logically from the information contained in the CONTEXT.
    1: The ANSWER is logically false from the information contained in the CONTEXT.
    An integer score between 1 and 5 if such a score exists; otherwise, use 1: It is not possible to determine whether the ANSWER is true or false without further information.
    Read the passage of information thoroughly and select the correct answer from the three answer labels. Read the CONTEXT thoroughly to ensure you know what the CONTEXT entails.

    Note the ANSWER is generated by a computer system and may contain certain symbols, which should not negatively affect the evaluation.

    Reminder: The return values for each task should be correctly formatted as an integer between 1 and 5. Do not repeat the context.

    {{few_shots}}

    ## Actual Task Input:
    {"CONTEXT": {{context}}, "ANSWER": {{answer}}}

    Actual Task Output:
    """
    few_shots: str = """
    Independent Examples:
    ## Example Task #1 Input:
    {"CONTEXT": "The Academy Awards, also known as the Oscars are awards for artistic and technical merit for the film industry. They are presented annually by the Academy of Motion Picture Arts and Sciences, in recognition of excellence in cinematic achievements as assessed by the Academy's voting membership. The Academy Awards are regarded by many as the most prestigious, significant awards in the entertainment industry in the United States and worldwide.", "ANSWER": "Oscar is presented every other two years"}
    ## Example Task #1 Output:
    1

    ## Example Task #2 Input:
    {"CONTEXT": "The Academy Awards, also known as the Oscars are awards for artistic and technical merit for the film industry. They are presented annually by the Academy of Motion Picture Arts and Sciences, in recognition of excellence in cinematic achievements as assessed by the Academy's voting membership. The Academy Awards are regarded by many as the most prestigious, significant awards in the entertainment industry in the United States and worldwide.", "ANSWER": "Oscar is very important awards in the entertainment industry in the United States. And it's also significant worldwide"}
    ## Example Task #2 Output:
    5

    ## Example Task #3 Input:
    {"CONTEXT": "In Quebec, an allophone is a resident, usually an immigrant, whose mother tongue or home language is neither French nor English.", "ANSWER": "In Quebec, an allophone is a resident, usually an immigrant, whose mother tongue or home language is not French."}
    ## Example Task #3 Output:
    5

    ## Example Task #4 Input:
    {"CONTEXT": "Some are reported as not having been wanted at all.", "ANSWER": "All are reported as being completely and fully wanted."}
    ## Example Task #4 Output:
    1
    """

    def prompt(self, use_few_shots: bool = False):
        few_shots = ""
        if use_few_shots:
            few_shots = self.few_shots
        res = (
            self.prompt_template.replace("{{context}}", self.context)
            .replace("{{answer}}", self.answer)
            .replace("{{few_shots}}", few_shots)
        )
        return res


@dataclass
class answer_relevancy:
    question: str
    context: str
    answer: str
    name: str = "answer_relevancy"
    prompt_template: str = """
    System:
    You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. Provide only the metric value without any additional text or explanation.

    User:
    Relevance measures how well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question, score the relevance of the answer between one to five stars using the following rating scale:
    - One star: the answer completely lacks relevance
    - Two stars: the answer mostly lacks relevance
    - Three stars: the answer is partially relevant
    - Four stars: the answer is mostly relevant
    - Five stars: the answer has perfect relevance

The rating value should always be an integer between 1 and 5. The rating should be 1, 2, 3, 4, or 5.

    {{few_shots}}

    context: {{context}}
    question: {{question}}
    answer: {{answer}}
    stars:
    """
    few_shots: str = """
    context: Marie Curie was a Polish-born physicist and chemist who pioneered research on radioactivity and was the first woman to win a Nobel Prize.
    question: What field did Marie Curie excel in?
    answer: Marie Curie was a renowned painter who focused mainly on impressionist styles and techniques.
    stars: 1

    context: The Beatles were an English rock band formed in Liverpool in 1960, and they are widely regarded as the most influential music band in history.
    question: Where were The Beatles formed?
    answer: The band The Beatles began their journey in London, England, and they changed the history of music.
    stars: 2

    context: The recent Mars rover, Perseverance, was launched in 2020 with the main goal of searching for signs of ancient life on Mars. The rover also carries an experiment called MOXIE, which aims to generate oxygen from the Martian atmosphere.
    question: What are the main goals of Perseverance Mars rover mission?
    answer: The Perseverance Mars rover mission focuses on searching for signs of ancient life on Mars.
    stars: 3

    context: The Mediterranean diet is a commonly recommended dietary plan that emphasizes fruits, vegetables, whole grains, legumes, lean proteins, and healthy fats. Studies have shown that it offers numerous health benefits, including a reduced risk of heart disease and improved cognitive health.
    question: What are the main components of the Mediterranean diet?
    answer: The Mediterranean diet primarily consists of fruits, vegetables, whole grains, and legumes.
    stars: 4

    context: The Queen's Royal Castle is a well-known tourist attraction in the United Kingdom. It spans over 500 acres and contains extensive gardens and parks. The castle was built in the 15th century and has been home to generations of royalty.
    question: What are the main attractions of the Queen's Royal Castle?
    answer: The main attractions of the Queen's Royal Castle are its expansive 500-acre grounds, extensive gardens, parks, and the historical castle itself, which dates back to the 15th century and has housed generations of royalty.
    stars: 5
    """

    def prompt(self, use_few_shots: bool = False):
        few_shots = ""
        if use_few_shots:
            few_shots = self.few_shots
        res = (
            self.prompt_template.replace("{{question}}", self.question)
            .replace("{{context}}", self.context)
            .replace("{{answer}}", self.answer)
            .replace("{{few_shots}}", few_shots)
        )
        return res


@dataclass
class answer_coherence:
    question: str
    answer: str
    name: str = "answer_coherence"
    prompt_template: str = """
    System:
    You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. Provide only the metric value without any additional text or explanation.

    User:
    Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of the answer between one to five stars using the following rating scale:
    - One star: the answer completely lacks coherence
    - Two stars: the answer mostly lacks coherence
    - Three stars: the answer is partially coherent
    - Four stars: the answer is mostly coherent
    - Five stars: the answer has perfect coherency

    The rating value should always be an integer between 1 and 5. The rating should be 1, 2, 3, 4, or 5.

    {{few_shots}}

    question: {{question}}
    answer: {{answer}}
    stars:
    """
    few_shots: str = """
    question: What is your favorite indoor activity and why do you enjoy it?
    answer: I like pizza. The sun is shining.
    stars: 1

    question: Can you describe your favorite movie without giving away any spoilers?
    answer: It is a science fiction movie. There are dinosaurs. The actors eat cake. People must stop the villain.
    stars: 2

    question: What are some benefits of regular exercise?
    answer: Regular exercise improves your mood. A good workout also helps you sleep better. Trees are green.
    stars: 3

    question: How do you cope with stress in your daily life?
    answer: I usually go for a walk to clear my head. Listening to music helps me relax as well. Stress is a part of life, but we can manage it through some activities.
    stars: 4

    question: What can you tell me about climate change and its effects on the environment?
    answer: Climate change has far-reaching effects on the environment. Rising temperatures result in the melting of polar ice caps, contributing to sea-level rise. Additionally, more frequent and severe weather events, such as hurricanes and heatwaves, can cause disruption to ecosystems and human societies alike.
    stars: 5
    """

    def prompt(self, use_few_shots: bool = False):
        few_shots = ""
        if use_few_shots:
            few_shots = self.few_shots
        res = (
            self.prompt_template.replace("{{question}}", self.question)
            .replace("{{answer}}", self.answer)
            .replace("{{few_shots}}", few_shots)
        )
        return res
import pandas as pd
from tqdm import tqdm
import numpy as np
from openai import AzureOpenAI
import time

COMPLETION_ENGINE="gpt-4-turbo"

class RAGCUSTOMEVAL:
    def __init__(self, openai_cred):
        self.available_metrics = {
            "answer_similarity": {
                "metric": answer_similarity,
                "params": ["question", "ground_truth", "answer"],
            },
            "answer_fluency": {
                "metric": answer_fluency,
                "params": ["question", "answer"],
            },
            "answer_groundedness": {
                "metric": answer_groundedness,
                "params": ["context", "answer"],
            },
            "answer_relevancy": {
                "metric": answer_relevancy,
                "params": ["question", "context", "answer"],
            },
            "answer_coherence": {
                "metric": answer_coherence,
                "params": ["question", "answer"],
            },
        }
        self.openai = AzureOpenAI(
            api_key=openai_cred["API_KEY"],
            api_version=openai_cred["API_VERSION"],
            azure_endpoint=openai_cred["AZURE_ENDPOINT"],
        )
        self.model = openai_cred["ENGINE"]

    def completion(
        self,
        prompt: str,
        temperature: float = 0.2,
        max_tokens: int = 2,
    ) -> str:
        messages = [{"role": "user", "content": prompt}]
        runtime_output = self.openai.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
        )

        res_text = runtime_output.choices[0].message.content

        return res_text

    def evaluate(self, df: pd.DataFrame, metrics: list, use_few_shots: bool = False):
        dict_list = []
        for i in tqdm(range(len(df))):
            row = df.iloc[i].to_dict()
            for mx in metrics:
                print("Evaluation {} ...".format(mx))
                data_tmp = {x: row[x] for x in self.available_metrics[mx]["params"]}
                mx_tmp = self.available_metrics[mx]["metric"](**data_tmp)
                prompt_tmp = mx_tmp.prompt(use_few_shots)
                answer_tmp = self.completion(prompt_tmp)
                row[mx] = answer_tmp
            dict_list.append(row)
            time.sleep(1) #to avoid hitting api calls limits

        result = pd.DataFrame(dict_list)

        return result

RAG_EVAL_PIPE=RAGCUSTOMEVAL(OPENAI_CRED[COMPLETION_ENGINE])

results=RAG_EVAL_PIPE.evaluate(
    df=eval_df,
    metrics=[
    'answer_similarity',
     'answer_fluency',
     'answer_groundedness',
     'answer_relevancy',
     'answer_coherence'
    ]
)

Conclusion

L'évaluation des RAG est un exercice très complexe étant donné la fluctuation importante à la fois de l'entrée et de la sortie, en particulier lorsqu'on traite des requêtes subjectives. C'est un domaine de recherche très actif puisque de nombreuses personnes peinent à quantifier de manière holistique l'impact des changements qu'elles ont apportés au pipeline.

L'évaluation humaine peut également s'avérer délicate, car deux réponses à la même requête peuvent se voir attribuer deux scores différents par deux êtres humains différents. Elle reste néanmoins la méthodologie la plus fiable et peut également fournir des few shots dignes de confiance pour l'évaluation humaine.

Le RAG est une application LLM très populaire permettant le Q&A sur vos données en tirant parti des capacités de raisonnement du modèle. La question est généralement soit :

Restons en contact

Vous avez une question ? Nous serions ravis d'échanger avec vous.