I was looking at the Self-RAG GitHub and paper (links), the RAGAS GitHub (link), documentation (link), and paper (link),
as well as TeddyNote's Self-RAG, RAGAS, and basic LangGraph walkthroughs, got curious, and decided to organize the prompts here.
Both Self-RAG and RAGAS lean heavily on the LLM-as-a-Judge paper when evaluating RAG,
so the prompts that the LLM bases its judgments on matter a great deal.
Hugging Face also introduces an LLM-as-a-Judge approach at this link.
The prompt it uses for RAG evaluation is shown at this link.
I also thought about the pros and cons of the simple yes/no format versus producing a concrete score.
Output by evaluation metric
The Self-RAG paper
First, the outputs for each metric in the original Self-RAG paper are summarized in its Table 1.
In the paper, $x$ is the query (the question), $d$ is a retrieved document, and $y$ is the answer generated by the LLM.
Retrieve:
Decides whether retrieval should be performed for the query $x$.
Outputs: yes, no, continue.
yes and no decide whether or not to retrieve.
continue means previously retrieved documents already exist and are reused as-is.
IsRel:
Evaluates the relevance of a retrieved document.
A binary score: Relevant or Irrelevant.
IsSup:
Evaluates whether the statements in the answer $y$ are sufficiently supported by the retrieved document $d$.
A three-way score: Fully supported, Partially supported, or No support / Contradictory.
IsUse:
Evaluates whether the answer $y$ is a useful answer to the question $x$.
The paper expresses this as an integer from 1 to 5.
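As a purely illustrative example (the query, document, answer, and token values below are made up, not taken from the paper), the four reflection tokens for one $(x, d, y)$ triple might look like this:
# Hypothetical example of Self-RAG reflection-token outputs for one (x, d, y) triple.
x = "When was the Eiffel Tower completed?"
d = "The Eiffel Tower was completed in 1889 for the Exposition Universelle in Paris."
y = "The Eiffel Tower was completed in 1889."

reflection_tokens = {
    "Retrieve": "yes",           # retrieval is worth performing for this query
    "IsRel": "Relevant",         # d contains information related to x
    "IsSup": "Fully supported",  # every statement in y is backed by d
    "IsUse": 5,                  # y is a fully useful answer to x (1-5 scale)
}
print(reflection_tokens)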
The full original wording is in Appendix A.1, Reflection Tokens, of the paper.
Self-RAG by TeddyNote
Retrieval Grader:
Evaluates the relevance of a retrieved document; a binary yes/no score
Hallucination (Groundedness) Grader:
Checks whether the generated answer is grounded in the facts (i.e. not hallucinated); a binary yes/no score
Answer Grader:
Evaluates whether the answer is relevant to the question; a binary yes/no score
Generate:
Generates the answer to the question
Question Re-writer:
Rewrites the query
RAGAS
The RAGAS evaluation metrics that map onto these roles are as follows (a usage sketch follows this list).
Retrieval Grader:
ContextRelevance, a discrete value in {0, 1, 2}
Hallucination (Groundedness) Grader:
ResponseGroundedness, a discrete value in {0, 1, 2}
Answer Grader:
ResponseRelevancy, a cosine similarity in [-1, 1]
A metric that considers the question, context, and answer together:
Faithfulness, a value in [0, 1]
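All four are ordinary ragas metrics, so they can be run together on a sample. Below is a minimal sketch, assuming a fairly recent ragas version in which ContextRelevance and ResponseGroundedness (defined in _nv_metrics.py) are re-exported from ragas.metrics; the model names and the sample data are placeholders, not part of any of the sources above.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    ContextRelevance,      # retrieval grader, discrete value in {0, 1, 2}
    Faithfulness,          # uses question, context, and answer together
    ResponseGroundedness,  # hallucination grader, discrete value in {0, 1, 2}
    ResponseRelevancy,     # answer grader, based on embedding cosine similarity
)

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

sample = SingleTurnSample(
    user_input="Where was Albert Einstein born?",
    retrieved_contexts=["Albert Einstein was born in Ulm, Germany, in 1879."],
    response="Albert Einstein was born in Germany.",
)

result = evaluate(
    dataset=EvaluationDataset(samples=[sample]),
    metrics=[ContextRelevance(), ResponseGroundedness(), ResponseRelevancy(), Faithfulness()],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)
print(result)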
Prompt comparison
Generate and the Question Re-writer are used as-is; only the retrieval, hallucination, and answer grading prompts are compared.
Retrieval Grader - ContextRelevance
Self-RAG by TeddyNote
# Define the system prompt: the grader judges whether a retrieved document is relevant to the user question
system = """You are a grader assessing relevance of a retrieved document to a user question. \n
It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. \n
Give a binary score 'yes' or 'no' score to indicate whether the document is relevant to the question."""
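In the TeddyNote / LangGraph tutorial this system prompt feeds a structured-output grader chain. The sketch below follows that pattern, but the human-message wording and the model name are my assumptions rather than a verbatim quote; the hallucination and answer graders further down are wired the same way with their own system prompts and output schemas.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

class GradeDocuments(BaseModel):
    """Binary relevance judgment for a retrieved document."""
    binary_score: str = Field(description="Document is relevant to the question, 'yes' or 'no'")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm_grader = llm.with_structured_output(GradeDocuments)

grade_prompt = ChatPromptTemplate.from_messages([
    ("system", system),  # the system prompt defined above
    ("human", "Retrieved document: \n\n {document} \n\n User question: {question}"),
])

retrieval_grader = grade_prompt | structured_llm_grader
# retrieval_grader.invoke({"document": ..., "question": ...}).binary_score -> 'yes' or 'no'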
RAGAS context relevance (source code)
It is loaded from _nv_metrics.py.
RAGAS sets the retry count to 5.
template_relevance1 = (
    "### Instructions\n\n"
    "You are a world class expert designed to evaluate the relevance score of a Context"
    " in order to answer the Question.\n"
    "Your task is to determine if the Context contains proper information to answer the Question.\n"
    "Do not rely on your previous knowledge about the Question.\n"
    "Use only what is written in the Context and in the Question.\n"
    "Follow the instructions below:\n"
    "0. If the context does not contains any relevant information to answer the question, say 0.\n"
    "1. If the context partially contains relevant information to answer the question, say 1.\n"
    "2. If the context contains any relevant information to answer the question, say 2.\n"
    "You must provide the relevance score of 0, 1, or 2, nothing else.\nDo not explain.\n"
    "### Question: {query}\n\n"
    "### Context: {context}\n\n"
    "Do not try to explain.\n"
    "Analyzing Context and Question, the Relevance score is "
)
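A minimal sketch of using this template directly, outside ragas's own metric class: fill in the query and context, call a chat model, and pull the 0/1/2 rating out of the completion. The model name and the parsing are my own simplification, not the library's internal logic.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def score_context_relevance(query: str, context: str) -> int | None:
    """Fill template_relevance1 (quoted above), call the LLM, and parse the 0/1/2 rating."""
    completion = llm.invoke(template_relevance1.format(query=query, context=context)).content
    for token in completion.split():
        if token.strip(".:") in {"0", "1", "2"}:
            return int(token.strip(".:"))
    return None  # no parseable rating in the completion

print(score_context_relevance(
    query="Where was Albert Einstein born?",
    context="Albert Einstein was born in Ulm, Germany, in 1879.",
))  # expected: 2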
Hallucination (Groundedness) Grader - ResponseGroundedness
Self-RAG by TeddyNote
# Define the system prompt
system = """You are a grader assessing whether an LLM generation is grounded in / supported by a set of retrieved facts. \n
Give a binary score 'yes' or 'no'. 'Yes' means that the answer is grounded in / supported by the set of facts."""
RAGAS response groundedness (source code)
It is loaded from _nv_metrics.py.
There are two templates, template_groundedness1 and template_groundedness2.
Both are used: each produces a rating of 0, 1, or 2, and the average of the two values is returned as the final score (see the sketch after the two templates below).
RAGAS sets the retry count to 5.
Unlike groundedness, context relevance uses only a single template;
I am not sure whether that is intentional or an oversight.
template_groundedness1 = (
    "### Instruction\n\n"
    "You are a world class expert designed to evaluate the groundedness of an assertion.\n"
    "You will be provided with an assertion and a context.\n"
    "Your task is to determine if the assertion is supported by the context.\n"
    "Follow the instructions below:\n"
    "A. If there is no context or no assertion or context is empty or assertion is empty, say 0.\n"
    "B. If the assertion is not supported by the context, say 0.\n"
    "C. If the assertion is partially supported by the context, say 1.\n"
    "D. If the assertion is fully supported by the context, say 2.\n"
    "You must provide a rating of 0, 1, or 2, nothing else.\n\n"
    "### Context:\n"
    "<{context}>\n\n"
    "### Assertion:\n"
    "<{response}>\n\n"
    "Analyzing Context and Response, the Groundedness score is "
)

template_groundedness2 = (
    "As a specialist in assessing the strength of connections between statements and their given contexts, "
    "I will evaluate the level of support an assertion receives from the provided context. Follow these guidelines:\n\n"
    "* If the assertion is not supported or context is empty or assertion is empty, assign a score of 0.\n"
    "* If the assertion is partially supported, assign a score of 1.\n"
    "* If the assertion is fully supported, assign a score of 2.\n\n"
    "I will provide a rating of 0, 1, or 2, without any additional information.\n\n"
    "---\n**Context:**\n[{context}]\n\n"
    "**Assertion:**\n[{response}]\n\n"
    "Do not explain."
    "Based on the provided context and response, the Groundedness score is:"
)
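A simplified sketch of the dual-template averaging described above, reusing the two quoted templates; this is my restatement of the flow rather than the exact _nv_metrics.py implementation, and the model name is a placeholder.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def score_with_template(template: str, **kwargs) -> int | None:
    """Fill one groundedness template, call the LLM, and parse out the 0/1/2 rating."""
    completion = llm.invoke(template.format(**kwargs)).content
    for token in completion.split():
        if token.strip(".:") in {"0", "1", "2"}:
            return int(token.strip(".:"))
    return None  # no parseable rating in the completion

def response_groundedness(context: str, response: str) -> float | None:
    """Rate with both templates and return the average, as described above."""
    s1 = score_with_template(template_groundedness1, context=context, response=response)
    s2 = score_with_template(template_groundedness2, context=context, response=response)
    if s1 is None or s2 is None:
        return None
    return (s1 + s2) / 2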
Answer Grader - ResponseRelevancy
Self-RAG by TeddyNote
# Define the system prompt
system = """You are a grader assessing whether an answer addresses / resolves a question \n
Give a binary score 'yes' or 'no'. 'Yes' means that the answer resolves the question."""
RAGAS response relevancy (source code)
It is loaded from _answer_relevance.py.
PydanticPrompt (source code)
Combining the ResponseRelevancePrompt and PydanticPrompt classes below produces the following prompt.
from pydantic import BaseModel
from ragas.prompt import PydanticPrompt
import pprint


class ResponseRelevanceOutput(BaseModel):
    question: str
    noncommittal: int


class ResponseRelevanceInput(BaseModel):
    response: str


class ResponseRelevancePrompt(
    PydanticPrompt[ResponseRelevanceInput, ResponseRelevanceOutput]
):
    instruction = """Generate a question for the given answer and Identify if answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers"""
    input_model = ResponseRelevanceInput
    output_model = ResponseRelevanceOutput
    examples = [
        (
            ResponseRelevanceInput(
                response="""Albert Einstein was born in Germany.""",
            ),
            ResponseRelevanceOutput(
                question="Where was Albert Einstein born?",
                noncommittal=0,
            ),
        ),
        (
            ResponseRelevanceInput(
                response="""I don't know about the groundbreaking feature of the smartphone invented in 2023 as am unaware of information beyond 2022. """,
            ),
            ResponseRelevanceOutput(
                question="What was the groundbreaking feature of the smartphone invented in 2023?",
                noncommittal=1,
            ),
        ),
    ]


pydantic_prompt = ResponseRelevancePrompt()
pprint.pprint(pydantic_prompt.to_string())
('Generate a question for the given answer and Identify if answer is '
'noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if '
'the answer is committal. A noncommittal answer is one that is evasive, '
'vague, or ambiguous. For example, "I don\'t know" or "I\'m not sure" are '
'noncommittal answers\n'
'Please return the output in a JSON format that complies with the following '
'schema as specified in JSON Schema:\n'
'{"properties": {"question": {"title": "Question", "type": "string"}, '
'"noncommittal": {"title": "Noncommittal", "type": "integer"}}, "required": '
'["question", "noncommittal"], "title": "ResponseRelevanceOutput", "type": '
'"object"}Do not use single quotes in your response but double '
'quotes,properly escaped with a backslash.\n'
'\n'
'--------EXAMPLES-----------\n'
'Example 1\n'
'Input: {\n'
' "response": "Albert Einstein was born in Germany."\n'
'}\n'
'Output: {\n'
' "question": "Where was Albert Einstein born?",\n'
' "noncommittal": 0\n'
'}\n'
'\n'
'Example 2\n'
'Input: {\n'
' "response": "I don\'t know about the groundbreaking feature of the '
'smartphone invented in 2023 as am unaware of information beyond 2022. "\n'
'}\n'
'Output: {\n'
' "question": "What was the groundbreaking feature of the smartphone '
'invented in 2023?",\n'
' "noncommittal": 1\n'
'}\n'
'-----------------------------\n'
'\n'
'Now perform the same with the following input\n'
'Input: (None)\n'
'Output: ')
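The generated question is then turned into the final ResponseRelevancy value by embedding it alongside the original user question and taking the cosine similarity (averaged when several questions are generated), with noncommittal answers scored 0. The sketch below is a restatement of that idea with a placeholder embedding model, not the library's code.
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # placeholder embedding model

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def response_relevancy(user_question: str, generated: list[dict]) -> float:
    """generated: [{"question": ..., "noncommittal": 0 or 1}, ...] parsed from the prompt output above."""
    if any(g["noncommittal"] for g in generated):
        return 0.0  # evasive answers are scored 0 regardless of similarity
    q_emb = np.array(embeddings.embed_query(user_question))
    sims = [
        cosine_similarity(q_emb, np.array(embeddings.embed_query(g["question"])))
        for g in generated
    ]
    return float(np.mean(sims))

print(response_relevancy(
    "Where was Albert Einstein born?",
    [{"question": "Where was Albert Einstein born?", "noncommittal": 0}],
))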
The ResponseRelevancePrompt class
import logging
import typing as t

from pydantic import BaseModel

from ragas.prompt import PydanticPrompt

logger = logging.getLogger(__name__)

if t.TYPE_CHECKING:
    from langchain_core.callbacks import Callbacks


class ResponseRelevanceOutput(BaseModel):
    question: str
    noncommittal: int


class ResponseRelevanceInput(BaseModel):
    response: str


class ResponseRelevancePrompt(
    PydanticPrompt[ResponseRelevanceInput, ResponseRelevanceOutput]
):
    instruction = """Generate a question for the given answer and Identify if answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers"""
    input_model = ResponseRelevanceInput
    output_model = ResponseRelevanceOutput
    examples = [
        (
            ResponseRelevanceInput(
                response="""Albert Einstein was born in Germany.""",
            ),
            ResponseRelevanceOutput(
                question="Where was Albert Einstein born?",
                noncommittal=0,
            ),
        ),
        (
            ResponseRelevanceInput(
                response="""I don't know about the groundbreaking feature of the smartphone invented in 2023 as am unaware of information beyond 2022. """,
            ),
            ResponseRelevanceOutput(
                question="What was the groundbreaking feature of the smartphone invented in 2023?",
                noncommittal=1,
            ),
        ),
    ]
The PydanticPrompt class
class PydanticPrompt(BasePrompt, t.Generic[InputModel, OutputModel]):
    # these are class attributes
    input_model: t.Type[InputModel]
    output_model: t.Type[OutputModel]
    instruction: str
    examples: t.List[t.Tuple[InputModel, OutputModel]] = []

    def _generate_instruction(self) -> str:
        return self.instruction

    def _generate_output_signature(self, indent: int = 4) -> str:
        return (
            f"Please return the output in a JSON format that complies with the "
            f"following schema as specified in JSON Schema:\n"
            f"{json.dumps(self.output_model.model_json_schema())}"
            "Do not use single quotes in your response but double quotes,"
            "properly escaped with a backslash."
        )

    def _generate_examples(self):
        if self.examples:
            example_strings = []
            for idx, e in enumerate(self.examples):
                input_data, output_data = e
                example_strings.append(
                    f"Example {idx + 1}\n"
                    + "Input: "
                    + input_data.model_dump_json(indent=4)
                    + "\n"
                    + "Output: "
                    + output_data.model_dump_json(indent=4)
                )
            return "\n--------EXAMPLES-----------\n" + "\n\n".join(example_strings)
        # if no examples are provided
        else:
            return ""

    def to_string(self, data: t.Optional[InputModel] = None) -> str:
        return (
            f"{self.instruction}\n"
            + self._generate_output_signature()
            + "\n"
            + self._generate_examples()
            + "\n-----------------------------\n"
            + "\nNow perform the same with the following input\n"
            + (
                "input: " + data.model_dump_json(indent=4, exclude_none=True) + "\n"
                if data is not None
                else "Input: (None)\n"
            )
            + "Output: "
        )
Put simply, it can be summarized as follows.
response_relevancy_template = (
    'Generate a question for the given answer and Identify if answer is '
    'noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if '
    'the answer is committal. A noncommittal answer is one that is evasive, '
    'vague, or ambiguous. For example, "I don\'t know" or "I\'m not sure" are '
    'noncommittal answers\n'
    '--------EXAMPLES-----------\n'
    'Example 1\n'
    'response: Albert Einstein was born in Germany.\n'
    'question: Where was Albert Einstein born?\n'
    'noncommittal: 0\n'
    '\n'
    '\n'
    'Example 2\n'
    'response: I don\'t know about the groundbreaking feature of the '
    'smartphone invented in 2023 as am unaware of information beyond 2022.\n'
    'question: What was the groundbreaking feature of the smartphone invented in 2023?\n'
    'noncommittal: 1\n'
    '\n'
    '\n'
    '-----------------------------\n'
    '\n'
    'Now perform the same with the following response\n'
    'response: {response}\n'
)
Choosing an evaluation prompt
The yes/no style prompts are, first of all, simple.
But with them you cannot compare, say, different LLMs used for answer generation, performance over a specific period, or the context relevancy of documents from a specific domain.
So in cases like the following, a prompt that returns a concrete score looks like the better choice (a small aggregation sketch follows the list).
1. When the LLM model needs to change.
- When the API cost of OpenAI, Gemini, Claude, etc. has to be weighed against performance, metrics such as context relevancy and answer relevancy can be tracked to make the choice.
- When a separate SLM or LLM (on-premises or cloud) is being considered for security reasons, the scores become the selection criterion.
2. When the embeddings model needs to change.
- Compare embedding models with the retriever held fixed.
3. When the retriever needs to change.
- When a performance comparison between retrievers is needed.
4. When document preprocessing needs to be compared.
- Compare performance before and after the preprocessing change.
5. When comparing performance after adding features such as search.
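For instance, with a numeric metric the same test set can be scored under each configuration and the averages compared; the helper below is hypothetical and only illustrates that pattern (evaluate_sample stands in for whichever scoring function is used, such as the context-relevance sketch earlier).
from statistics import mean

def average_score(evaluate_sample, samples: list) -> float:
    """evaluate_sample(sample) -> numeric score, e.g. context relevance in {0, 1, 2}."""
    return mean(evaluate_sample(s) for s in samples)

# Hypothetical comparison, e.g. two embedding models with the retriever held fixed:
# score_a = average_score(scorer_for_embedding_model_a, test_samples)
# score_b = average_score(scorer_for_embedding_model_b, test_samples)
# print(f"embedding model A: {score_a:.2f}, embedding model B: {score_b:.2f}")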