EvaluateRagAnswerCorrectness 2025.10.9.21

Bundle

com.snowflake.openflow.runtime | runtime-rag-evaluation-processors-nar

Description

Evaluates the correctness of generated answers in a Retrieval-Augmented Generation (RAG) context by computing metrics such as F1 score, cosine similarity, and answer correctness. The processor uses an LLM (e.g., OpenAI’s GPT) to assess the generated answer against the ground truth.

Tags

ai, answer correctness, evaluation, llm, nlp, openai, openflow, rag

Input Requirement

REQUIRED

Supports Sensitive Dynamic Properties

false

Properties

Cosine Similarity Weight: The weight to apply to the cosine similarity when calculating answer correctness (between 0.0 and 1.0).
Evaluation Results Record Path: The RecordPath to write the results of the evaluation to.
F1 Score Weight: The weight to apply to the F1 score when calculating answer correctness (between 0.0 and 1.0).
Generated Answer Record Path: The RecordPath to the answer field in the record.
Generated Answer Vector Record Path: The RecordPath to the answer vector field in the record.
Ground Truth Record Path: The RecordPath to the ground truth field in the record.
Ground Truth Vector Record Path: The RecordPath to the ground truth vector field in the record.
LLM Provider Service: The provider service for sending evaluation prompts to the LLM.
Question Record Path: The RecordPath to the question field in the record.
Record Reader: The Record Reader to use for reading the FlowFile.
Record Writer: The Record Writer to use for writing the results.
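To illustrate how the F1 Score Weight and Cosine Similarity Weight properties combine into a single correctness value, here is a minimal sketch. It is not the processor's actual implementation; the helper functions and default weights are assumptions for illustration only.

```python
import math

def token_f1(answer: str, ground_truth: str) -> float:
    """Token-overlap F1 between a generated answer and the ground truth (illustrative)."""
    a_tokens = answer.lower().split()
    g_tokens = ground_truth.lower().split()
    remaining = list(g_tokens)
    common = 0
    for t in a_tokens:
        if t in remaining:
            common += 1
            remaining.remove(t)
    if common == 0:
        return 0.0
    precision = common / len(a_tokens)
    recall = common / len(g_tokens)
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(v1, v2) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def answer_correctness(f1: float, cos_sim: float,
                       f1_weight: float = 0.5, cos_weight: float = 0.5) -> float:
    """Weighted sum mirroring the two weight properties (weights in [0.0, 1.0])."""
    return f1_weight * f1 + cos_weight * cos_sim
```

For example, an answer with an F1 of 0.8 and a cosine similarity of 0.6, weighted equally, yields a correctness score of 0.7.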

Relationships

failure: FlowFiles that cannot be processed are routed to this relationship.
success: FlowFiles that are successfully processed are routed to this relationship.

Writes attributes

average.f1Score: The average F1 score computed over all records.
average.cosineSim: The average cosine similarity between the ground truth and answer embeddings.
average.answerCorrectness: The average answer correctness score computed over all records.
json.parse.failures: The number of JSON parse failures encountered.
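A downstream step can read these FlowFile attributes to gate on evaluation quality. The sketch below is hypothetical: the `check_rag_quality` helper and the threshold are not part of the processor, and the attribute dictionary only stands in for the FlowFile attributes written above.

```python
def check_rag_quality(attributes: dict, min_correctness: float = 0.7) -> bool:
    """Return True if the evaluated batch meets a minimum average correctness
    and had no JSON parse failures. Keys match the documented attribute names."""
    correctness = float(attributes.get("average.answerCorrectness", 0.0))
    parse_failures = int(attributes.get("json.parse.failures", 0))
    return correctness >= min_correctness and parse_failures == 0

# Example attribute values as they might appear on an evaluated FlowFile
attrs = {
    "average.f1Score": "0.82",
    "average.cosineSim": "0.91",
    "average.answerCorrectness": "0.86",
    "json.parse.failures": "0",
}
```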

Use cases

Use this processor to assess the quality of answers generated by an LLM in comparison to ground truth answers, providing metrics that can be used for monitoring and improving the performance of RAG systems.