EvaluateRagAnswerCorrectness 2025.3.28.13-SNAPSHOT

BUNDLE

com.snowflake.openflow.runtime | runtime-rag-evaluation-processors-nar

DESCRIPTION

Evaluates the correctness of generated answers in a Retrieval-Augmented Generation (RAG) context by computing metrics such as F1 score, cosine similarity, and answer correctness. The processor uses an LLM (e.g., OpenAI’s GPT) to assess the generated answer against the ground truth.
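The overall answer correctness score is a weighted combination of the two underlying metrics, with the weights supplied by the F1 Score Weight and Cosine Similarity Weight properties described below. The following is a minimal sketch of that shape of computation, assuming a simple weighted sum, whitespace tokenization, and a token-overlap F1 definition; the function names and details are illustrative, not the processor's exact implementation.

```python
import math
from collections import Counter

def token_f1(answer: str, ground_truth: str) -> float:
    """Token-overlap F1 between a generated answer and the ground truth.
    Whitespace tokenization is an assumption for illustration."""
    ans_tokens = answer.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(ans_tokens) & Counter(gt_tokens)  # min count per token
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ans_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer_correctness(f1: float, cos_sim: float,
                       f1_weight: float, cos_weight: float) -> float:
    """Weighted combination of the two metrics, mirroring the
    F1 Score Weight and Cosine Similarity Weight properties."""
    return f1_weight * f1 + cos_weight * cos_sim

# Example: equal weights of 0.5 give 0.5 * 0.8 + 0.5 * 0.9 = 0.85
print(answer_correctness(0.8, 0.9, 0.5, 0.5))
```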

TAGS

ai, answer correctness, evaluation, llm, nlp, openai, openflow, rag

INPUT REQUIREMENT

REQUIRED

Supports Sensitive Dynamic Properties

false

PROPERTIES

| Property | Description |
| --- | --- |
| Cosine Similarity Weight | The weight to apply to the cosine similarity when calculating answer correctness (between 0.0 and 1.0). |
| Evaluation Results Record Path | The RecordPath to write the results of the evaluation to. |
| F1 Score Weight | The weight to apply to the F1 score when calculating answer correctness (between 0.0 and 1.0). |
| Generated Answer Record Path | The RecordPath to the generated answer field in the record. |
| Generated Answer Vector Record Path | The RecordPath to the generated answer's embedding vector field in the record. |
| Ground Truth Record Path | The RecordPath to the ground truth field in the record. |
| Ground Truth Vector Record Path | The RecordPath to the ground truth's embedding vector field in the record. |
| LLM Provider Service | The provider service used to send evaluation prompts to the LLM. |
| Question Record Path | The RecordPath to the question field in the record. |
| Record Reader | The Record Reader to use for reading the incoming FlowFile. |
| Record Writer | The Record Writer to use for writing the evaluation results. |
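For orientation, a hypothetical input record and matching RecordPath settings might look like the sketch below; the field names (`/question`, `/answer`, `/answerVector`, `/groundTruth`, `/groundTruthVector`, `/evaluation`) are illustrative assumptions, not names required by the processor.

```python
# A hypothetical input record; field names are illustrative only.
record = {
    "question": "What is the boiling point of water at sea level?",
    "answer": "Water boils at 100 degrees Celsius at sea level.",
    "answerVector": [0.12, -0.03, 0.88],       # embedding of the generated answer
    "groundTruth": "100 degrees Celsius at sea level.",
    "groundTruthVector": [0.10, -0.05, 0.85],  # embedding of the ground truth
}

# Matching property values (assumed for this record layout):
#   Question Record Path                -> /question
#   Generated Answer Record Path        -> /answer
#   Generated Answer Vector Record Path -> /answerVector
#   Ground Truth Record Path            -> /groundTruth
#   Ground Truth Vector Record Path     -> /groundTruthVector
#   Evaluation Results Record Path      -> /evaluation
```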

RELATIONSHIPS

| Name | Description |
| --- | --- |
| failure | FlowFiles that cannot be processed are routed to this relationship. |
| success | FlowFiles that are successfully processed are routed to this relationship. |

WRITES ATTRIBUTES

| Name | Description |
| --- | --- |
| average.f1Score | The average F1 score computed over all records. |
| average.cosineSim | The average cosine similarity between the ground truth and generated answer embeddings. |
| average.answerCorrectness | The average answer correctness score computed over all records. |
| json.parse.failures | The number of JSON parse failures encountered. |

USE CASES

Use this processor to assess the quality of LLM-generated answers against ground truth answers, producing metrics that can be used to monitor and improve the performance of RAG systems.
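Because the averages are written as FlowFile attributes, downstream processors can act on them; for example, a RouteOnAttribute processor could flag low-scoring batches with an expression such as `${average.answerCorrectness:lt(0.7)}` (assuming the attribute holds a decimal value and 0.7 is your chosen quality threshold).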