EvaluateRagAnswerCorrectness 2025.10.9.21

Bundle

com.snowflake.openflow.runtime | runtime-rag-evaluation-processors-nar

Description

Evaluates the correctness of generated answers in a Retrieval-Augmented Generation (RAG) context by computing metrics such as F1 score, cosine similarity, and answer correctness. The processor uses an LLM (e.g., OpenAI’s GPT) to assess the generated answer against the ground truth.

Tags

ai, answer correctness, evaluation, llm, nlp, openai, openflow, rag

Input Requirement

REQUIRED

Supports Sensitive Dynamic Properties

false

Properties

Cosine Similarity Weight: The weight to apply to the cosine similarity when calculating answer correctness (between 0.0 and 1.0).
Evaluation Results Record Path: The RecordPath to write the results of the evaluation to.
F1 Score Weight: The weight to apply to the F1 score when calculating answer correctness (between 0.0 and 1.0).
Generated Answer Record Path: The RecordPath to the answer field in the record.
Generated Answer Vector Record Path: The RecordPath to the answer vector field in the record.
Ground Truth Record Path: The RecordPath to the ground truth field in the record.
Ground Truth Vector Record Path: The RecordPath to the ground truth vector field in the record.
LLM Provider Service: The provider service for sending evaluation prompts to the LLM.
Question Record Path: The RecordPath to the question field in the record.
Record Reader: The Record Reader to use for reading the FlowFile.
Record Writer: The Record Writer to use for writing the results.
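To illustrate how the F1 Score Weight and Cosine Similarity Weight properties combine into a single correctness value, here is a minimal sketch. It is not the processor's actual implementation; the helper functions and default weights are assumptions for illustration only.

```python
import math

def token_f1(answer: str, ground_truth: str) -> float:
    """Token-overlap F1 between a generated answer and the ground truth (illustrative)."""
    a_tokens = answer.lower().split()
    g_tokens = ground_truth.lower().split()
    remaining = list(g_tokens)
    common = 0
    for t in a_tokens:
        if t in remaining:
            common += 1
            remaining.remove(t)
    if common == 0:
        return 0.0
    precision = common / len(a_tokens)
    recall = common / len(g_tokens)
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(v1, v2) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def answer_correctness(f1: float, cos_sim: float,
                       f1_weight: float = 0.5, cos_weight: float = 0.5) -> float:
    """Weighted sum mirroring the two weight properties (weights in [0.0, 1.0])."""
    return f1_weight * f1 + cos_weight * cos_sim
```

For example, an answer with an F1 of 0.8 and a cosine similarity of 0.6, weighted equally, yields a correctness score of 0.7.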

Relationships

failure: FlowFiles that cannot be processed are routed to this relationship.
success: FlowFiles that are successfully processed are routed to this relationship.

Writes attributes

average.f1Score: The average F1 score computed over all records.
average.cosineSim: The average cosine similarity between the ground truth and answer embeddings.
average.answerCorrectness: The average answer correctness score computed over all records.
json.parse.failures: The number of JSON parse failures encountered.
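A downstream step can read these FlowFile attributes to gate on evaluation quality. The sketch below is hypothetical: the `check_rag_quality` helper and the threshold are not part of the processor, and the attribute dictionary only stands in for the FlowFile attributes written above.

```python
def check_rag_quality(attributes: dict, min_correctness: float = 0.7) -> bool:
    """Return True if the evaluated batch meets a minimum average correctness
    and had no JSON parse failures. Keys match the documented attribute names."""
    correctness = float(attributes.get("average.answerCorrectness", 0.0))
    parse_failures = int(attributes.get("json.parse.failures", 0))
    return correctness >= min_correctness and parse_failures == 0

# Example attribute values as they might appear on an evaluated FlowFile
attrs = {
    "average.f1Score": "0.82",
    "average.cosineSim": "0.91",
    "average.answerCorrectness": "0.86",
    "json.parse.failures": "0",
}
```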

Use cases

Use this processor to assess the quality of answers generated by an LLM in comparison to ground truth answers, providing metrics that can be used for monitoring and improving the performance of RAG systems.