Evaluating RAG Systems

Ismail Mebsout
October 23, 2024
21 min read

RAG is a very popular LLM application that enables Q&A over your own data by leveraging the model's reasoning capabilities. The question is typically either:

  • Objective: true/false, multiple choice, specific words or numbers..., used mainly in key-point extraction pipelines.
  • Subjective: a short or elaborate paragraph, used in general Q&A use cases.

Many developers look for tools to evaluate their RAG pipelines through the traditional TDD approach, which is hard to apply as-is given the high variability of the inputs (whose versions may also change over time) and the fluctuation of the output generated by LLMs.

In this article, we go through the different evaluation frameworks that can be used when working on RAG systems.

The summary is as follows:

  1. Evaluation concept
  2. Outlines
  3. RAGAS
  4. Custom framework

Evaluation concept

Evaluating a RAG pipeline is an iterative process that consists of matching its output against predefined test cases. The test database usually contains the quadruplet (question, context, ground_truth, predicted_answer) and can be either:

  • Real: generated manually or collected from user feedback
  • Synthetic: generated using an LLM that outputs a set of questions and answers from a paragraph of text.

The evaluation data should naturally have unique and diverse questions that represent both the task and the knowledge being queried. It is also recommended to include test cases where subjective questions are straight to the point, which makes the answers shorter and limited to a single fact.
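One way to hold such test cases in code is a small record type. The sketch below is only illustrative: the `RagTestCase` class, its `source` field and the `has_unique_questions` helper are names assumed here and do not come from any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RagTestCase:
    """One evaluation record: the quadruplet described above."""
    question: str
    context: List[str]          # chunks returned by the retriever
    ground_truth: str           # reference answer
    predicted_answer: str = ""  # filled in after running the pipeline
    source: str = "real"        # "real" or "synthetic" (hypothetical field)


def has_unique_questions(cases: List[RagTestCase]) -> bool:
    """Return True when no question appears twice, ignoring case and spacing."""
    normalized = [" ".join(c.question.lower().split()) for c in cases]
    return len(normalized) == len(set(normalized))


cases = [
    RagTestCase("What is the capital of France?", ["Paris, its capital, ..."], "Paris"),
    RagTestCase("Where is France located?", ["France, in Western Europe, ..."], "Western Europe"),
]
```

A uniqueness check like this can run before each evaluation round so that duplicated questions do not bias the aggregated scores.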

MDD, Metric-Driven Development

Given the nature of the questions (objective and subjective), two options are considered:

  • TDD, Test Driven Development: powered by Outlines, a Python library that leverages LLMs to reformat output into a specified format (based on Regex, for example). It can be combined with unit testing to evaluate objective questions.
  • MDD, Metric Driven Development: a product approach powered by an evaluation framework (package or custom). It relies on evaluating specific KPIs to continuously monitor the performance of the RAG pipeline. The KPIs are generally scores between 0 and 1 generated by an LLM, given the syntactic variety of ways to answer subjective questions.
Evaluation process

Outlines

Outlines is a Python library used to enforce the output format of the completion step by leveraging another Large Language Model (OpenAI, transformers, llama.cpp, …). This ensures reliability in the RAG pipeline, allowing more controlled and predictable output from the LLM. The format can be a type, a regex, a Pydantic model or JSON-based.

The generated output can be evaluated using unit testing. You can check my previous article, which discusses its implementation with the Pytest framework.
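Once the output format is constrained, objective answers can be checked with plain assertions. The following is a minimal sketch in the Pytest style, not the implementation from the previous article: the `normalize_answer` helper, the five-digit zip code pattern and the hard-coded predictions (standing in for real pipeline output) are all assumptions made for illustration.

```python
import re

ZIP_CODE_PATTERN = re.compile(r"^\d{5}$")  # assumed format: exactly five digits


def normalize_answer(text: str) -> str:
    """Lower-case the answer and strip whitespace and trailing punctuation."""
    return text.strip().strip(".!").lower()


def test_true_false_answer():
    predicted = " True. "  # stand-in for the constrained LLM output
    assert normalize_answer(predicted) in {"true", "false"}
    assert normalize_answer(predicted) == "true"


def test_zip_code_format():
    predicted = "75008"  # stand-in for an extracted field
    assert ZIP_CODE_PATTERN.match(predicted)
```

Saved in a file whose name starts with `test_`, these functions would be collected by Pytest without further configuration.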

Outlines framework

Python script

To illustrate, we will consider a RAG use case where we try to retrieve information about a company from a knowledge database.

from enum import Enum
import os

import outlines
from pydantic import BaseModel, constr

# OPENAI_CRED is assumed to be a dict of Azure OpenAI credentials defined elsewhere
COMPLETION_ENGINE='gpt-4-turbo'
API_VERSION=OPENAI_CRED[COMPLETION_ENGINE]['API_VERSION']
API_BASE=OPENAI_CRED[COMPLETION_ENGINE]['AZURE_ENDPOINT']
API_KEY=OPENAI_CRED[COMPLETION_ENGINE]['API_KEY']

os.environ["AZURE_OPENAI_BASE"] = API_BASE
os.environ["AZURE_OPENAI_API_VERSION"]=API_VERSION
os.environ["AZURE_OPENAI_API_KEY"]=API_KEY

# Allowed business sectors for a company
class Field(str, Enum):
    technology= "technology"
    healthcare="healthcare"
    finance="finance"
    consumer_good="consumer_good"
    energy="energy"


# Target schema that the generated output must follow
class Company(BaseModel):
    name: constr(max_length=24)
    street_number: int
    street_name: str
    office_number: str
    zip_code: str
    city: constr(max_length=24)
    country_code: constr(max_length=5)
    cin: int  # corporate identification number
    field: Field

model=outlines.models.openai(COMPLETION_ENGINE)

# Construct structured sequence generator
generator = outlines.generate.json(model, Company)

# Prompt
prompt_template="""
Given the following context, extract the information related to the company:
context:{{context}}
"""

# context is the output of the retriever (assumed to be a string defined upstream)
result = generator(prompt_template.replace("{{context}}", context))

RAGAS

RAGAS is a framework used to evaluate a RAG pipeline component by component (Retriever and Generator) as well as end to end, in order to monitor and improve its quality.

In general, it requires 4 inputs: Question, Ground truths, Contexts, Answer.

RAGAS framework

Context Precision

Definition: Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Ideally, all the relevant chunks should appear at the top ranks. This metric is computed using the question and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.

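Context Precision@K = Σ_{k=1}^{K} (Precision@k × v_k) / (total number of relevant items in the top K results), with Precision@k = TP@k / (TP@k + FP@k)

where K is the total number of chunks in contexts and v_k ∈ {0, 1} is the relevance indicator at rank k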

Input: (context, ground truth)

Example:

  • Question: Where is France and what is its capital?
  • Ground truth: France is in Western Europe and its capital is Paris.
  • High score: [“France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”, “The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.”]
  • Low score: [“The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and”, “France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”]
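To make the ranking effect concrete, here is a minimal sketch of a rank-weighted precision in the spirit of Context Precision. It assumes the per-chunk relevance verdicts are already available as 0/1 flags (in RAGAS they come from an LLM judgment); the function name `context_precision` is chosen for illustration and is not the library's implementation.

```python
from typing import List


def context_precision(relevance: List[int]) -> float:
    """Rank-weighted precision over the retrieved chunks.

    relevance[i] is 1 if the chunk at rank i + 1 is relevant to the
    ground truth, else 0.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        precision_at_rank = hits / rank
        # Only ranks holding a relevant chunk contribute to the sum
        score += precision_at_rank * is_relevant
    return score / total_relevant


high = context_precision([1, 0])  # relevant chunk ranked first
low = context_precision([0, 1])   # relevant chunk ranked second
```

With the two-chunk example above, placing the relevant chunk first gives the maximum score, while placing it second halves it.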

Context Recall

Definition: Context Recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed from the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.

Input: (context, ground truth)

Example:

  • Question: Where is France and what is its capital?
  • Ground truth: France is in Western Europe and its capital is Paris.
  • High score: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.
  • Low score: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.

Context Relevancy

Definition: This metric gauges the relevancy of the retrieved context, computed from both the question and the contexts. Values fall in the range (0, 1), with higher values indicating better relevancy. Ideally, the retrieved context should exclusively contain the essential information needed to address the provided query.

Input: (question, contexts)

Example:

  • Question: What is the capital of France?
  • High score: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.
  • Low score: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.

Context Entity Recall

Definition: This metric gives a measure of the recall of the retrieved context, based on the number of entities present in both the ground truths and the contexts relative to the number of entities present in the ground truths alone. Simply put, it measures what fraction of the entities in the ground truths is recalled. This metric is useful in fact-based use cases such as a tourism help desk or historical QA. It can help evaluate the retrieval mechanism for entities, based on a comparison with the entities present in the ground truths, because in cases where entities matter we need contexts that cover them.

Context entity recall = |CE ∩ GE| / |GE|, where CE: Context Entities, GE: Ground truth Entities

Input: (ground truths, context)

Example:

  • Ground Truth: The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in the Indian city of Agra. It was commissioned in 1631 by the Mughal emperor Shah Jahan to house the tomb of his favorite wife, Mumtaz Mahal.
  • High score: The Taj Mahal is a symbol of love and an architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.
  • Low score: The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.
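The ratio described above reduces to a set intersection once the entities are known. The sketch below assumes the entities have already been extracted (RAGAS relies on an LLM for that step); the entity sets are listed by hand from the Taj Mahal example and the function name is illustrative.

```python
from typing import Set


def context_entity_recall(context_entities: Set[str], ground_truth_entities: Set[str]) -> float:
    """Fraction of ground-truth entities that also appear in the context."""
    if not ground_truth_entities:
        return 0.0
    shared = context_entities & ground_truth_entities
    return len(shared) / len(ground_truth_entities)


# Hand-picked entities (an assumption, not the output of an extractor)
ge = {"Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"}
ce_high = {"Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"}
ce_low = {"Taj Mahal", "UNESCO", "India"}

high = context_entity_recall(ce_high, ge)  # 4 of the 6 entities are covered
low = context_entity_recall(ce_low, ge)    # 1 of the 6 entities is covered
```

Entities that appear only in the context, such as "India" here, do not affect the score because the denominator counts ground-truth entities only.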

Faithfulness

Definition: This measures the factual consistency of the generated answer against the given context. It is computed from the answer and the retrieved context. The answer is scored on a (0, 1) range; the higher, the better. The generated answer is regarded as faithful if all the claims made in it can be inferred from the given context.

Input: (contexts, answer)

Example:

  • Question: Where and when was Einstein born?
  • Context: Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time
  • High score: Einstein was born in Germany on 14th March 1879.
  • Low score: Einstein was born in Germany on 20th March 1879.
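Read this way, faithfulness is the share of claims in the answer that the context supports. The sketch below assumes the answer has already been split into claims and each claim judged against the context (both steps are done by an LLM in practice); `faithfulness_score` is a hypothetical helper, not the RAGAS API.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Share of answer claims that can be inferred from the context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Claims: "born in Germany", "born on 14 March 1879" (both supported)
high = faithfulness_score([True, True])
# Claims: "born in Germany", "born on 20 March 1879" (date not supported)
low = faithfulness_score([True, False])
```

In the Einstein example, the wrong date invalidates one of the two claims, which is why the second answer scores lower.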

Answer Relevancy

Definition: The Answer Relevancy metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information. This metric is computed using the question and the answer, with values ranging between 0 and 1, where higher scores indicate better relevancy.

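The score is the mean cosine similarity between the embedding of the original question and the embedding of each generatedQuestion, where generatedQuestion is generated by an LLM from the answer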

Input: (question, answer)

Example:

  • Question: Where is France and what is its capital?
  • High score: France is in western Europe and Paris is its capital.
  • Low score: France is in Western Europe.
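The scoring step can be sketched as an average of cosine similarities between the original question and the questions regenerated from the answer. The snippet below is an illustration under assumptions: it works on toy two-dimensional vectors instead of real embeddings, and `answer_relevancy_score` is a name chosen here rather than the library's function.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevancy_score(
    question_embedding: Sequence[float],
    generated_question_embeddings: List[Sequence[float]],
) -> float:
    """Mean similarity between the question and the regenerated questions."""
    if not generated_question_embeddings:
        return 0.0
    similarities = [
        cosine_similarity(question_embedding, g) for g in generated_question_embeddings
    ]
    return sum(similarities) / len(similarities)


question_vec = [1.0, 0.0]  # toy embedding of the original question
score_same = answer_relevancy_score(question_vec, [[2.0, 0.0], [1.0, 0.0]])
score_orthogonal = answer_relevancy_score(question_vec, [[0.0, 3.0]])
```

An incomplete answer tends to regenerate a narrower question (for instance only "Where is France?"), whose embedding sits further from the original and lowers the average.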

Answer Semantic Similarity

Definition: Answer Semantic Similarity pertains to assessing the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values in the range 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.

Input: (Ground truth, Answer)

Example:

  • Ground truth: Albert Einstein’s theory of relativity revolutionized our understanding of the universe.
  • High score: Einstein’s groundbreaking theory of relativity transformed our comprehension of the cosmos.
  • Low score: Isaac Newton’s laws of motion greatly influenced classical physics.

Answer Correctness

Definition: Answer Correctness gauges the accuracy of the generated answer compared with the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

Input: (Answer, Ground Truth)

Example:

  • Ground truth: Einstein was born in 1879 in Germany.
  • High score: In 1879, Einstein was born in Germany.
  • Low score: Einstein was born in Spain in 1879.

Note: the definitions above are inspired by the official documentation of the RAGAS framework.

Python script

Below is the script to run the RAGAS evaluation framework with Azure OpenAI. You can also use an open-source LLM.

## Imports
import os
import warnings

import pandas as pd
from datasets import Dataset
from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai.chat_models import AzureChatOpenAI
from ragas import evaluate
from ragas.llms import LangchainLLM
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    answer_similarity,
    context_precision,
    context_recall,
    context_relevancy,
    faithfulness,
)
from tqdm import tqdm

warnings.filterwarnings("ignore")

## Set openai env vars
# OPENAI_CRED (credentials dict) and EMBEDDING_ENGINE (name of the Azure
# embedding deployment) are assumed to be defined elsewhere
COMPLETION_ENGINE='gpt-4-turbo'
API_VERSION=OPENAI_CRED[COMPLETION_ENGINE]['API_VERSION']
API_BASE=OPENAI_CRED[COMPLETION_ENGINE]['AZURE_ENDPOINT']
API_KEY=OPENAI_CRED[COMPLETION_ENGINE]['API_KEY']

os.environ["AZURE_OPENAI_BASE"] = API_BASE
os.environ["AZURE_OPENAI_API_VERSION"]=API_VERSION
os.environ["AZURE_OPENAI_API_KEY"]=API_KEY

## Embedding and Completion models
def get_azure_openai_embed_gener():
    embeddings = AzureOpenAIEmbeddings(
        openai_api_version=API_VERSION,
        azure_endpoint=API_BASE,
        azure_deployment=EMBEDDING_ENGINE,
        model=EMBEDDING_ENGINE,
    )
    llm = AzureChatOpenAI(
        openai_api_version=API_VERSION,
        azure_endpoint=API_BASE,
        azure_deployment=COMPLETION_ENGINE,
        model=COMPLETION_ENGINE,
    )
    return embeddings, llm

## Ragas metrics
def get_metrics(
  embeddings,
  llm,
):
    metrics = [
        answer_correctness,
        answer_relevancy,
        answer_similarity,
        context_precision,
        context_recall,
        context_relevancy,
        faithfulness,
    ]
    # Point every metric at the Azure models instead of the defaults
    for m in metrics:
        setattr(m, "llm", llm)
        if hasattr(m, "embeddings"):
            setattr(m, "embeddings", embeddings)
    # answer_correctness reuses the two metrics configured above
    answer_correctness.faithfulness = faithfulness
    answer_correctness.answer_similarity = answer_similarity
    return metrics

## Main script
# Load embedding and LLM models
embeddings, llm = get_azure_openai_embed_gener()
llm = LangchainLLM(llm=llm)
metrics = get_metrics(embeddings, llm)
# df is assumed to be a pandas DataFrame with columns:
  # question: str
  # ground_truths: List[str]
  # answer: str
  # contexts: List[str]
# Launch evaluation
ragas_eval= evaluate(
    Dataset.from_pandas(df),
    metrics=metrics,
    raise_exceptions=False,
).to_pandas()
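The script above expects a DataFrame `df` that it does not build. As a rough sketch, the rows could be prepared and checked as follows before being turned into a DataFrame; the `validate_row` helper and the sample row are assumptions for illustration, and the column names simply follow the comments in the script.

```python
from typing import Any, Dict, List

# Expected column types, following the comments in the script above
EXPECTED_COLUMNS = {
    "question": str,
    "ground_truths": list,
    "answer": str,
    "contexts": list,
}


def validate_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems found in one evaluation row."""
    problems = []
    for column, expected_type in EXPECTED_COLUMNS.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            problems.append(f"{column} should be {expected_type.__name__}")
    return problems


rows = [
    {
        "question": "Where is France and what is its capital?",
        "ground_truths": ["France is in Western Europe and its capital is Paris."],
        "answer": "France is in western Europe and Paris is its capital.",
        "contexts": ["France, in Western Europe, encompasses medieval cities ..."],
    }
]
all_problems = [p for row in rows for p in validate_row(row)]
# pandas.DataFrame(rows) would then produce the frame passed to Dataset.from_pandas
```

Checking the schema up front avoids discovering a malformed row only after the LLM calls have started.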

Custom framework

Custom frameworks are another way to automate your evaluation based on specific tasks in your RAG pipeline. This is useful when combined with few-shot prompting tied to your use case; the few shots define example inputs and their associated scores.

This MDD approach consists of assigning to each custom metric:

  • Definition: an explicit comparison to be inferred by the LLM
  • Inputs: the items being compared
  • Model: embedding or completion
  • Few shots: reference cases for the scoring step
  • Output: usually a score between 0 and 1 or between 1 and 5
Custom evaluation framework
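Since the judge returns its rating as text, the output step usually needs a small parser. The sketch below shows one possible way to read a 1 to 5 rating and map it onto the 0 to 1 range; `parse_stars` and `normalize_stars` are hypothetical helpers, and the linear rescaling is an assumption rather than a prescribed convention.

```python
import re
from typing import Optional


def parse_stars(raw_output: str) -> Optional[int]:
    """Extract a standalone rating between 1 and 5 from the judge's reply."""
    match = re.search(r"\b[1-5]\b", raw_output)
    return int(match.group()) if match else None


def normalize_stars(stars: int) -> float:
    """Map a 1-5 rating linearly onto the 0-1 range."""
    return (stars - 1) / 4


rating = parse_stars(" 4")
score = normalize_stars(rating) if rating is not None else None
```

Returning `None` for an unparsable reply keeps failed judgments visible instead of silently counting them as a low score.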

Python script

Let's consider five main metrics, similar to those of RAGAS, as an example:

  • Answer similarity: how close are the facts in the answers?
  • Answer fluency: how good is the quality of the generated answer?
  • Answer groundedness: to what extent is the answer grounded in the provided context?
  • Answer relevancy: how relevant is the answer to the question?
  • Answer coherence: how well do the different sentences of the answer fit together?

The metrics above will be computed using gpt-4-turbo as the LLM judge.

Note: do not forget to adapt the few shots with examples from your own use case.

#Few shots inspired from the prompt flow evaluation of Microsoft
from dataclasses import dataclass

@dataclass
class answer_similarity:
    question: str
    ground_truth: str
    answer: str
    name: str = "answer_similarity"
    prompt_template: str = """
    System:
    You are an AI assistant. You will be given a definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your task is to compute an accurate evaluation score using the provided metric. Provide only the metric value without any additional text or explanation.

    User:
    Equivalence measures the similarity between the predicted answer and the correct answer. If the information and facts in the predicted answer are similar or equivalent to the correct answer, the Equivalence metric value should be high; otherwise, it should be low. Given the question, correct answer, and predicted answer, determine the Equivalence metric value using the following rating scale:

    One star: the predicted answer is not at all similar to the correct answer
    Two stars: the predicted answer is mostly not similar to the correct answer
    Three stars: the predicted answer is somewhat similar to the correct answer
    Four stars: the predicted answer is mostly similar to the correct answer
    Five stars: the predicted answer is completely similar to the correct answer
    The rating value should always be an integer between 1 and 5. The rating should be 1, 2, 3, 4, or 5.

    {{few_shots}}

    question: {{question}}
    correct answer: {{ground_truth}}
    predicted answer: {{answer}}
    stars:
    """
    few_shots: str = """
    The examples below show the Equivalence score for a question, a correct answer, and a predicted answer.

    question: What is the role of ribosomes?
    correct answer: Ribosomes are cellular structures responsible for protein synthesis. They interpret the genetic information carried by messenger RNA (mRNA) and use it to assemble amino acids into proteins.
    predicted answer: Ribosomes participate in carbohydrate breakdown by removing nutrients from complex sugar molecules.
    stars: 1

    question: Why did the Titanic sink?
    correct answer: The Titanic sank after it struck an iceberg during its maiden voyage in 1912. The impact caused the ship's hull to breach, allowing water to flood into the vessel. The ship's design, lifeboat shortage, and lack of timely rescue efforts contributed to the tragic loss of life.
    predicted answer: The sinking of the Titanic was a result of a large iceberg collision. This caused the ship to take on water and eventually sink, leading to the death of many passengers due to a shortage of lifeboats and insufficient rescue attempts.
    stars: 2

    question: What causes seasons on Earth?
    correct answer: Seasons on Earth are caused by the tilt of the Earth's axis and its revolution around the Sun. As the Earth orbits the Sun, the tilt causes different parts of the planet to receive varying amounts of sunlight, resulting in changes in temperature and weather patterns.
    predicted answer: Seasons occur because of the Earth's rotation and its elliptical orbit around the Sun. The tilt of the Earth's axis causes regions to be subjected to different sunlight intensities, which leads to temperature fluctuations and alternating weather conditions.
    stars: 3

    question: How does photosynthesis work?
    correct answer: Photosynthesis is a process by which green plants and some other organisms convert light energy into chemical energy. This occurs as light is absorbed by chlorophyll molecules, and then carbon dioxide and water are converted into glucose and oxygen through a series of reactions.
    predicted answer: In photosynthesis, sunlight is transformed into nutrients by plants and certain microorganisms. Light is captured by chlorophyll molecules, followed by the conversion of carbon dioxide and water into sugar and oxygen through multiple reactions.
    stars: 4

    question: What are the health benefits of regular exercise?
    correct answer: Regular exercise can help maintain a healthy weight, increase muscle and bone strength, and reduce the risk of chronic diseases. It also promotes mental well-being by reducing stress and improving overall mood.
    predicted answer: Routine physical activity can contribute to maintaining ideal body weight, enhancing muscle and bone strength, and preventing chronic illnesses. In addition, it supports mental health by alleviating stress and augmenting general mood.
    stars: 5
    """

    def prompt(self, use_few_shots: bool = False):
        few_shots = ""
        if use_few_shots:
            few_shots = self.few_shots
        res = (
            self.prompt_template.replace("{{question}}", self.question)
            .replace("{{ground_truth}}", self.ground_truth)
            .replace("{{answer}}", self.answer)
            .replace("{{few_shots}}", few_shots)
        )
        return res


@dataclass
class answer_fluency:
    question: str
    answer: str
    name: str = "answer_fluency"
    prompt_template: str = """
    System:
    You are an AI assistant. You will be given a definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. Provide only the metric value without any additional text or explanation.

    User:
    Fluency measures the quality of individual sentences in the answer and whether they are well-written and grammatically correct. Given the question and answer, score the fluency of the answer between one to five stars using the following rating scale:

    One star: the answer completely lacks fluency
    Two stars: the answer mostly lacks fluency
    Three stars: the answer is partially fluent
    Four stars: the answer is mostly fluent
    Five stars: the answer has perfect fluency
    The rating value should always be an integer between 1 and 5. The rating should be 1, 2, 3, 4, or 5.

    {{few_shots}}

    question: {{question}}
    answer: {{answer}}
    stars:
    """
    few_shots: str = """
    question: What did you have for breakfast today?
    answer: Breakfast today, me eating cereal and orange juice very good.
    stars: 1

    question: How do you feel when you travel alone?
    answer: Alone travel, nervous, but excited also. I feel adventure and like its time.
    stars: 2

    question: When was the last time you went on a family vacation?
    answer: Last family vacation, it took place in last summer. We traveled to a beach destination, very fun.
    stars: 3

    question: What is your favorite thing about your job?
    answer: My favorite aspect of my job is the chance to interact with diverse people. I am constantly learning from their experiences and stories.
    stars: 4

    question: Can you describe your morning routine?
    answer: Every morning, I wake up at 6 am, drink a glass of water, and do some light stretching. After that, I take a shower and get dressed for work. Then, I have a healthy breakfast, usually consisting of oatmeal and fruits, before leaving the house around 7:30 am.
    stars: 5
    """

    def prompt(self, use_few_shots: bool = False):
        few_shots = ""
        if use_few_shots:
            few_shots = self.few_shots
        res = (
            self.prompt_template.replace("{{question}}", self.question)
            .replace("{{answer}}", self.answer)
            .replace("{{few_shots}}", few_shots)
        )
        return res


@dataclass
class answer_groundedness:
    context: str
    answer: str
    name: str = "answer_groundedness"
    prompt_template: str = """
    System:
    You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. Provide only the metric value without any additional text or explanation.

    User:
    You will be presented with a CONTEXT and an ANSWER about that CONTEXT. You need to decide whether the ANSWER is entailed by the CONTEXT by choosing one of the following ratings:

    5: The ANSWER follows logically from the information contained in the CONTEXT.
    1: The ANSWER is logically false from the information contained in the CONTEXT.
    An integer score between 1 and 5 if such a score exists; otherwise, use 1: It is not possible to determine whether the ANSWER is true or false without further information.
    Read the passage of information thoroughly and select the correct answer from the three answer labels. Read the CONTEXT thoroughly to ensure you know what the CONTEXT entails.

    Note the ANSWER is generated by a computer system and may contain certain symbols, which should not negatively affect the evaluation.

    Reminder: The return values for each task should be correctly formatted as an integer between 1 and 5. Do not repeat the context.

    {{few_shots}}

    ## Actual Task Input:
    {"CONTEXT": {{context}}, "ANSWER": {{answer}}}

    Actual Task Output:
    """
    few_shots: str = """
    Independent Examples:
    ## Example Task #1 Input:
    {"CONTEXT": "The Academy Awards, also known as the Oscars are awards for artistic and technical merit for the film industry. They are presented annually by the Academy of Motion Picture Arts and Sciences, in recognition of excellence in cinematic achievements as assessed by the Academy's voting membership. The Academy Awards are regarded by many as the most prestigious, significant awards in the entertainment industry in the United States and worldwide.", "ANSWER": "Oscar is presented every other two years"}
    ## Example Task #1 Output:
    1

    ## Example Task #2 Input:
    {"CONTEXT": "The Academy Awards, also known as the Oscars are awards for artistic and technical merit for the film industry. They are presented annually by the Academy of Motion Picture Arts and Sciences, in recognition of excellence in cinematic achievements as assessed by the Academy's voting membership. The Academy Awards are regarded by many as the most prestigious, significant awards in the entertainment industry in the United States and worldwide.", "ANSWER": "Oscar is very important awards in the entertainment industry in the United States. And it's also significant worldwide"}
    ## Example Task #2 Output:
    5

    ## Example Task #3 Input:
    {"CONTEXT": "In Quebec, an allophone is a resident, usually an immigrant, whose mother tongue or home language is neither French nor English.", "ANSWER": "In Quebec, an allophone is a resident, usually an immigrant, whose mother tongue or home language is not French."}
    ## Example Task #3 Output:
    5

    ## Example Task #4 Input:
    {"CONTEXT": "Some are reported as not having been wanted at all.", "ANSWER": "All are reported as being completely and fully wanted."}
    ## Example Task #4 Output:
    1
    """

    def prompt(self, use_few_shots: bool = False):
        few_shots = ""
        if use_few_shots:
            few_shots = self.few_shots
        res = (
            self.prompt_template.replace("{{context}}", self.context)
            .replace("{{answer}}", self.answer)
            .replace("{{few_shots}}", few_shots)
        )
        return res


@dataclass
class answer_relevancy:
    question: str
    context: str
    answer: str
    name: str = "answer_relevancy"
    prompt_template: str = """
    System:
    You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. Provide only the metric value without any additional text or explanation.

    User:
    Relevance measures how well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question, score the relevance of the answer between one to five stars using the following rating scale:
    - One star: the answer completely lacks relevance
    - Two stars: the answer mostly lacks relevance
    - Three stars: the answer is partially relevant
    - Four stars: the answer is mostly relevant
    - Five stars: the answer has perfect relevance

    The rating value should always be an integer between 1 and 5. The rating should be 1, 2, 3, 4, or 5.

    {{few_shots}}

    context: {{context}}
    question: {{question}}
    answer: {{answer}}
    stars:
    """
    few_shots: str = """
    context: Marie Curie was a Polish-born physicist and chemist who pioneered research on radioactivity and was the first woman to win a Nobel Prize.
    question: What field did Marie Curie excel in?
    answer: Marie Curie was a renowned painter who focused mainly on impressionist styles and techniques.
    stars: 1

    context: The Beatles were an English rock band formed in Liverpool in 1960, and they are widely regarded as the most influential music band in history.
    question: Where were The Beatles formed?
    answer: The band The Beatles began their journey in London, England, and they changed the history of music.
    stars: 2

    context: The recent Mars rover, Perseverance, was launched in 2020 with the main goal of searching for signs of ancient life on Mars. The rover also carries an experiment called MOXIE, which aims to generate oxygen from the Martian atmosphere.
    question: What are the main goals of Perseverance Mars rover mission?
    answer: The Perseverance Mars rover mission focuses on searching for signs of ancient life on Mars.
    stars: 3

    context: The Mediterranean diet is a commonly recommended dietary plan that emphasizes fruits, vegetables, whole grains, legumes, lean proteins, and healthy fats. Studies have shown that it offers numerous health benefits, including a reduced risk of heart disease and improved cognitive health.
    question: What are the main components of the Mediterranean diet?
    answer: The Mediterranean diet primarily consists of fruits, vegetables, whole grains, and legumes.
    stars: 4

    context: The Queen's Royal Castle is a well-known tourist attraction in the United Kingdom. It spans over 500 acres and contains extensive gardens and parks. The castle was built in the 15th century and has been home to generations of royalty.
    question: What are the main attractions of the Queen's Royal Castle?
    answer: The main attractions of the Queen's Royal Castle are its expansive 500-acre grounds, extensive gardens, parks, and the historical castle itself, which dates back to the 15th century and has housed generations of royalty.
    stars: 5
    """

    def prompt(self, use_few_shots: bool = False):
        few_shots = ""
        if use_few_shots:
            few_shots = self.few_shots
        res = (
            self.prompt_template.replace("{{question}}", self.question)
            .replace("{{context}}", self.context)
            .replace("{{answer}}", self.answer)
            .replace("{{few_shots}}", few_shots)
        )
        return res


@dataclass
class answer_coherence:
    question: str
    answer: str
    name: str = "answer_coherence"
    prompt_template: str = """
    System:
    You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. Provide only the metric value without any additional text or explanation.

    User:
    Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of the answer between one to five stars using the following rating scale:
    - One star: the answer completely lacks coherence
    - Two stars: the answer mostly lacks coherence
    - Three stars: the answer is partially coherent
    - Four stars: the answer is mostly coherent
    - Five stars: the answer has perfect coherency

    The rating value should always be an integer between 1 and 5. The rating should be 1, 2, 3, 4, or 5.

    {{few_shots}}

    question: {{question}}
    answer: {{answer}}
    stars:
    """
    few_shots: str = """
    question: What is your favorite indoor activity and why do you enjoy it?
    answer: I like pizza. The sun is shining.
    stars: 1

    question: Can you describe your favorite movie without giving away any spoilers?
    answer: It is a science fiction movie. There are dinosaurs. The actors eat cake. People must stop the villain.
    stars: 2

    question: What are some benefits of regular exercise?
    answer: Regular exercise improves your mood. A good workout also helps you sleep better. Trees are green.
    stars: 3

    question: How do you cope with stress in your daily life?
    answer: I usually go for a walk to clear my head. Listening to music helps me relax as well. Stress is a part of life, but we can manage it through some activities.
    stars: 4

    question: What can you tell me about climate change and its effects on the environment?
    answer: Climate change has far-reaching effects on the environment. Rising temperatures result in the melting of polar ice caps, contributing to sea-level rise. Additionally, more frequent and severe weather events, such as hurricanes and heatwaves, can cause disruption to ecosystems and human societies alike.
    stars: 5
    """

    def prompt(self, use_few_shots: bool = False):
        few_shots = ""
        if use_few_shots:
            few_shots = self.few_shots
        res = (
            self.prompt_template.replace("{{question}}", self.question)
            .replace("{{answer}}", self.answer)
            .replace("{{few_shots}}", few_shots)
        )
        return res
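
The `prompt` method above fills the `{{...}}` placeholders with plain string replacement. Below is a minimal, self-contained sketch of that pattern. `CoherencePrompt`, its shortened template, and the sample values are hypothetical stand-ins for illustration, not the article's actual metric class.

```python
from dataclasses import dataclass


@dataclass
class CoherencePrompt:
    """Hypothetical, shortened stand-in for a metric prompt class."""

    question: str
    answer: str
    prompt_template: str = (
        "Score the coherence of the answer from 1 to 5 stars.\n"
        "{{few_shots}}\n"
        "question: {{question}}\n"
        "answer: {{answer}}\n"
        "stars:"
    )
    few_shots: str = "question: What is 2 + 2?\nanswer: Four.\nstars: 5\n"

    def prompt(self, use_few_shots: bool = False) -> str:
        # Leave the few-shot slot empty unless examples are requested.
        few_shots = self.few_shots if use_few_shots else ""
        return (
            self.prompt_template.replace("{{question}}", self.question)
            .replace("{{answer}}", self.answer)
            .replace("{{few_shots}}", few_shots)
        )


metric = CoherencePrompt(question="What is RAG?", answer="Retrieval plus generation.")
zero_shot = metric.prompt()
few_shot = metric.prompt(use_few_shots=True)
print(few_shot)
```

Because the replacements run in sequence, a question that itself contains the literal text `{{answer}}` would also be substituted. This sketch ignores that edge case.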
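
# Evaluation pipeline: send each metric prompt to Azure OpenAI and collect the scores.
import time

import numpy as np
import pandas as pd
from openai import AzureOpenAI
from tqdm import tqdm

# Key used to look up credentials in OPENAI_CRED (assumed to be defined elsewhere).
COMPLETION_ENGINE = "gpt-4-turbo"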

class RAGCUSTOMEVAL:
    def __init__(self, openai_cred):
        self.available_metrics = {
            "answer_similarity": {
                "metric": answer_similarity,
                "params": ["question", "ground_truth", "answer"],
            },
            "answer_fluency": {
                "metric": answer_fluency,
                "params": ["question", "answer"],
            },
            "answer_groundedness": {
                "metric": answer_groundedness,
                "params": ["context", "answer"],
            },
            "answer_relevancy": {
                "metric": answer_relevancy,
                "params": ["question", "context", "answer"],
            },
            "answer_coherence": {
                "metric": answer_coherence,
                "params": ["question", "answer"],
            },
        }
        self.openai = AzureOpenAI(
            api_key=openai_cred["API_KEY"],
            api_version=openai_cred["API_VERSION"],
            azure_endpoint=openai_cred["AZURE_ENDPOINT"],
        )
        self.model = openai_cred["ENGINE"]

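    def completion(
        self,
        prompt: str,
        temperature: float = 0.2,
        max_tokens: int = 2,
    ) -> str:
        """Send one metric prompt to the model and return its raw reply.

        max_tokens is kept very small because only the score is expected.
        """
        messages = [{"role": "user", "content": prompt}]
        runtime_output = self.openai.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
        )

        # The reply is text (e.g. "4"); callers convert it to a number if needed.
        res_text = runtime_output.choices[0].message.content

        return res_text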

    def evaluate(self, df: pd.DataFrame, metrics: list, use_few_shots: bool = False):
        """Score every row of df on each requested metric.

        Returns a new DataFrame with one extra column per metric, holding
        the raw model reply for that metric.
        """
        dict_list = []
        for i in tqdm(range(len(df))):
            row = df.iloc[i].to_dict()
            for mx in metrics:
                print("Evaluating {} ...".format(mx))
                # Pass only the fields that this metric's prompt needs.
                data_tmp = {x: row[x] for x in self.available_metrics[mx]["params"]}
                mx_tmp = self.available_metrics[mx]["metric"](**data_tmp)
                prompt_tmp = mx_tmp.prompt(use_few_shots)
                answer_tmp = self.completion(prompt_tmp)
                row[mx] = answer_tmp
            dict_list.append(row)
            time.sleep(1)  # pause between rows to stay under API rate limits

        result = pd.DataFrame(dict_list)

        return result

RAG_EVAL_PIPE = RAGCUSTOMEVAL(OPENAI_CRED[COMPLETION_ENGINE])

results = RAG_EVAL_PIPE.evaluate(
    df=eval_df,
    metrics=[
        "answer_similarity",
        "answer_fluency",
        "answer_groundedness",
        "answer_relevancy",
        "answer_coherence",
    ],
)
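
The metric columns returned by `evaluate` hold the model's raw text replies, so they need to be parsed before they can be tracked as KPIs. The sketch below shows one possible way to put them on a 0 to 1 scale. It assumes every metric uses the 1 to 5 star scale of the coherence prompt. `stars_to_score`, `mean_scores`, and the sample rows are hypothetical names and data, not part of the original framework.

```python
import math
import re


def stars_to_score(raw, low=1, high=5):
    """Map a raw reply such as "4" or "4 stars" to [0, 1]; NaN if unusable."""
    match = re.search(r"\d+", str(raw))
    if match is None:
        return math.nan
    stars = int(match.group())
    if not low <= stars <= high:
        return math.nan
    return (stars - low) / (high - low)


def mean_scores(records, metrics):
    """Average the normalized score of each metric, skipping unusable replies."""
    averages = {}
    for metric in metrics:
        scores = [stars_to_score(rec.get(metric)) for rec in records]
        valid = [s for s in scores if not math.isnan(s)]
        averages[metric] = sum(valid) / len(valid) if valid else math.nan
    return averages


# Hypothetical rows, shaped like results.to_dict("records").
records = [
    {"answer_fluency": "5", "answer_coherence": "4"},
    {"answer_fluency": "3", "answer_coherence": "n/a"},
]
kpis = mean_scores(records, ["answer_fluency", "answer_coherence"])
print(kpis)
```

With a real run, `results.to_dict("records")` would supply the rows. Replies that contain no valid rating are skipped rather than counted as zero, so a malformed reply does not drag the average down.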

Conclusion

Evaluating RAG is a highly complex exercise given the wide variability of both inputs and outputs, especially when handling subjective queries. It is also a very active research area, because many practitioners struggle to measure comprehensively the impact of the changes they make to their pipeline.

Human evaluation can be difficult as well, because two reviewers can assign different scores to answers to the same query. It nevertheless remains the most reliable methodology, and it can also supply trustworthy few shots for LLM-based evaluation.
