
Integrate Cleanlab Evaluations with Langflow

Unlock trustworthy Agentic, RAG, and LLM pipelines with Cleanlab's evaluation and remediation suite.

Cleanlab adds automation and trust to every data point going in and every prediction coming out of AI and RAG solutions.

This integration provides three Langflow components that assess and improve the trustworthiness of any LLM or RAG pipeline output.

Use the components in this bundle to quantify the trustworthiness of any LLM response with a score between 0 and 1, and to explain why a response may be good or bad. For RAG and Agentic pipelines that use retrieved context, you can also score context sufficiency, response groundedness, helpfulness, and query clarity. Finally, you can remediate low-trust responses with warnings or fallback answers.

Prerequisites

- A Cleanlab API key

CleanlabEvaluator

This component evaluates and explains the trustworthiness of a prompt and response pair using Cleanlab. For more information on how the score works, see the Cleanlab documentation.

Parameters

Inputs

| Name | Type | Description |
|------|------|-------------|
| system_prompt | Message | The system message prepended to the prompt. Optional. |
| prompt | Message | The user-facing input to the LLM. |
| response | Message | The model's response to evaluate. |
| cleanlab_api_key | Secret | Your Cleanlab API key. |
| cleanlab_evaluation_model | Dropdown | The evaluation model used by Cleanlab, such as GPT-4 or Claude. This does not need to be the same model that generated the response. |
| quality_preset | Dropdown | The tradeoff between evaluation speed and accuracy. |

Outputs

| Name | Type | Description |
|------|------|-------------|
| score | number | The trust score between 0 and 1. |
| explanation | Message | The explanation of the trust score. |
| response | Message | The original response, passed through for easy chaining to the CleanlabRemediator component. |
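
Conceptually, this component sends the prompt and response pair to Cleanlab's Trustworthy Language Model (TLM) and surfaces the resulting score and explanation. The following standalone sketch shows roughly equivalent logic using Cleanlab's `cleanlab-tlm` Python package; the exact parameter and field names (`quality_preset`, the `log` option, `trustworthiness_score`) are assumptions based on Cleanlab's documentation, so verify them against the package version you install.

```python
# pip install cleanlab-tlm
import os

from cleanlab_tlm import TLM

# Corresponds to the component's cleanlab_api_key input.
os.environ["CLEANLAB_TLM_API_KEY"] = "<your-cleanlab-api-key>"

# cleanlab_evaluation_model and quality_preset map to these settings (assumed names).
tlm = TLM(
    quality_preset="medium",
    options={"log": ["explanation"]},
)

prompt = "What year did the French Revolution begin?"
response = "The French Revolution began in 1789."

# Score an existing prompt/response pair, as the CleanlabEvaluator component does.
result = tlm.get_trustworthiness_score(prompt, response)

print(result["trustworthiness_score"])  # score output: a float between 0 and 1
print(result["log"]["explanation"])     # explanation output
```

In the component, the score output feeds directly into the CleanlabRemediator component's score input.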

CleanlabRemediator

This component uses the trust score from the CleanlabEvaluator component to determine whether to show the original LLM response, show it with a warning, or replace it with a fallback message. The score threshold, warning text, and fallback message are all configurable.

Parameters

Inputs

| Name | Type | Description |
|------|------|-------------|
| response | Message | The response to potentially remediate. |
| score | number | The trust score from the CleanlabEvaluator component. |
| explanation | Message | The explanation to append if a warning is shown. Optional. |
| threshold | float | The minimum trust score required to pass a response through unchanged. |
| show_untrustworthy_response | bool | Whether to show the original response with a warning, or hide it, when the response is deemed untrustworthy. |
| untrustworthy_warning_text | Prompt | The warning text shown for untrustworthy responses. |
| fallback_text | Prompt | The fallback message shown if the response is hidden. |

Outputs

| Name | Type | Description |
|------|------|-------------|
| remediated_response | Message | The final message shown to the user after the remediation logic is applied. |
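
The remediation rule itself is straightforward. The following plain-Python sketch illustrates the behavior described above using the input names from the tables; it is an illustration of the logic, not the component's actual source code, and the default values shown are placeholders.

```python
def remediate(
    response: str,
    score: float,
    explanation: str | None = None,
    threshold: float = 0.7,
    show_untrustworthy_response: bool = True,
    untrustworthy_warning_text: str = "Caution: this response may be untrustworthy.",
    fallback_text: str = "I'm not confident in this answer. Please verify it independently.",
) -> str:
    """Return the remediated_response for a scored LLM response."""
    # Score meets the threshold: pass the original response through unchanged.
    if score >= threshold:
        return response

    # Low score, but configured to show the response: append the warning
    # and, if provided, the explanation from CleanlabEvaluator.
    if show_untrustworthy_response:
        remediated = f"{response}\n\n{untrustworthy_warning_text}"
        if explanation:
            remediated += f"\n\nExplanation: {explanation}"
        return remediated

    # Low score and the response is hidden: return the fallback message instead.
    return fallback_text


# For example, a response scored at 0.09 falls below the threshold and is
# either flagged with the warning or replaced by the fallback message.
print(remediate("The Eiffel Tower was built in 1950.", score=0.09))
```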

CleanlabRAGEvaluator

This component evaluates RAG and LLM pipeline outputs for trustworthiness, context sufficiency, response groundedness, helpfulness, and query ease. For more information about Cleanlab's evaluation metrics, see the Cleanlab documentation.

Additionally, use the CleanlabRemediator component with this component to remediate low-trust responses coming from the RAG pipeline.

Parameters

Inputs

| Name | Type | Description |
|------|------|-------------|
| cleanlab_api_key | Secret | Your Cleanlab API key. |
| cleanlab_evaluation_model | Dropdown | The evaluation model used by Cleanlab, such as GPT-4 or Claude. This does not need to be the same model that generated the response. |
| quality_preset | Dropdown | The tradeoff between evaluation speed and accuracy. |
| context | Message | The retrieved context from your RAG system. |
| query | Message | The original user query. |
| response | Message | The model's response based on the context and query. |
| run_context_sufficiency | bool | Whether to evaluate if the context supports answering the query. |
| run_response_groundedness | bool | Whether to evaluate if the response is grounded in the context. |
| run_response_helpfulness | bool | Whether to evaluate how helpful the response is. |
| run_query_ease | bool | Whether to evaluate if the query is vague, complex, or adversarial. |

Outputs

| Name | Type | Description |
|------|------|-------------|
| trust_score | number | The overall trust score. |
| trust_explanation | Message | The explanation for the trust score. |
| other_scores | dict | A dictionary of the optional RAG evaluation metrics that were enabled. |
| evaluation_summary | Message | A Markdown summary of the query, context, response, and evaluation results. |
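
As a rough sketch of what this component evaluates, the example below uses the `TrustworthyRAG` interface from Cleanlab's `cleanlab-tlm` package to score a query, context, and response triple. The class name, `score()` signature, and eval names (`context_sufficiency`, `response_groundedness`, `response_helpfulness`, `query_ease`) are assumptions based on Cleanlab's documentation; confirm them against the Cleanlab docs before relying on this.

```python
# pip install cleanlab-tlm
import os

from cleanlab_tlm import TrustworthyRAG

os.environ["CLEANLAB_TLM_API_KEY"] = "<your-cleanlab-api-key>"

# Uses Cleanlab's default RAG evals, which are assumed to include
# context_sufficiency, response_groundedness, response_helpfulness, and query_ease.
rag_evaluator = TrustworthyRAG()

query = "What is the annual deductible on this policy?"
context = "Section 4: The annual deductible is $500 per member."
response = "The annual deductible is $500 per member."

# Score the query/context/response triple, as the CleanlabRAGEvaluator component does.
results = rag_evaluator.score(query=query, context=context, response=response)

# Each eval returns a score between 0 and 1 (trustworthiness plus the enabled RAG metrics).
for name, result in results.items():
    print(f"{name}: {result['score']:.3f}")
```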

Cleanlab component example flows

The following example flows show how to use the CleanlabEvaluator and CleanlabRemediator components to evaluate and remediate responses from any LLM, and how to use the CleanlabRAGEvaluator component to evaluate RAG pipeline outputs.

Evaluate and remediate responses from an LLM

tip

Optionally, download the Evaluate and Remediate flow and follow along.

This flow evaluates and remediates the trustworthiness of a response from any LLM using the CleanlabEvaluator and CleanlabRemediator components.

Evaluate response trustworthiness

Connect the Message output from any LLM component to the response input of the CleanlabEvaluator component, and then connect the Prompt component to its prompt input.

The CleanlabEvaluator component returns a trust score and an explanation for the response.

The CleanlabRemediator component uses this trust score to determine whether to output the original response, warn about it, or replace it with a fallback answer.

This example shows a response that was determined to be untrustworthy (a score of 0.09) and flagged with a warning by the CleanlabRemediator component.

CleanlabRemediator Example

To hide untrustworthy responses, configure the CleanlabRemediator component to replace the response with a fallback message.

CleanlabRemediator Example

Evaluate RAG pipeline

This example flow includes the Vector Store RAG template with the CleanlabRAGEvaluator component added to evaluate the flow's context, query, and response.

To use the CleanlabRAGEvaluator component in a flow, connect the context, query, and response outputs from any RAG pipeline to the CleanlabRAGEvaluator component.

Evaluate RAG pipeline

Here is an example of the Evaluation Summary output from the CleanlabRAGEvaluator component.

Evaluate RAG pipeline

The Evaluation Summary includes the query, context, response, and all evaluation results. In this example, the Context Sufficiency and Response Groundedness scores are low (a score of 0.002) because the context doesn't contain information about the query, and the response is not grounded in the context.
