
Integrate Cleanlab Evaluations with Langflow

Unlock trustworthy Agentic, RAG, and LLM pipelines with Cleanlab's evaluation and remediation suite.

Cleanlab adds automation and trust to every data point going in and every prediction coming out of AI and RAG solutions.

This Langflow integration provides 3 modular components that assess and improve the trustworthiness of any LLM or RAG pipeline output, enabling critical oversight for safety-sensitive, enterprise, and production GenAI applications.

Use this bundle to:

  • Quantify trustworthiness of ANY LLM response with a 0-1 score
  • Explain why a response may be good or bad
  • Evaluate context sufficiency, groundedness, helpfulness, and query clarity with quantitative scores (for RAG/Agentic pipelines with context)
  • Remediate low-trust responses with warnings or fallback answers

Prerequisites

Before using these components, you'll need:

  • A Cleanlab API key

Components

CleanlabEvaluator

Purpose: Evaluate and explain the trustworthiness of a prompt + response pair using Cleanlab. For details on how the trust score works, see Cleanlab's documentation.

Inputs

| Name | Type | Description |
|------|------|-------------|
| system_prompt | Message | (Optional) System message prepended to the prompt |
| prompt | Message | The user-facing input to the LLM |
| response | Message | The response to evaluate, from any model (OpenAI, Claude, etc.) |
| cleanlab_api_key | Secret | Your Cleanlab API key |
| cleanlab_evaluation_model | Dropdown | Evaluation model used by Cleanlab (GPT-4, Claude, etc.). This does not need to be the same model that generated the response. |
| quality_preset | Dropdown | Tradeoff between evaluation speed and accuracy |

Outputs

| Name | Type | Description |
|------|------|-------------|
| score | number | Trust score between 0–1 |
| explanation | Message | Explanation of the trust score |
| response | Message | Returns the original response for easy chaining to the CleanlabRemediator component |
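If you want to reproduce this scoring outside Langflow, a minimal sketch using Cleanlab's Python client is shown below. The package and method names (`cleanlab_tlm`, `TLM`, `get_trustworthiness_score`) come from Cleanlab's public client and are assumptions about what the component wraps internally; check Cleanlab's documentation for the current API.

```python
# Rough sketch of the scoring the CleanlabEvaluator performs, using Cleanlab's
# Python client (API names are assumptions -- check Cleanlab's docs).
from cleanlab_tlm import TLM  # pip install cleanlab-tlm

# Assumes your Cleanlab API key is available to the client (e.g. via environment variable).
tlm = TLM(
    quality_preset="medium",           # mirrors the component's quality_preset input
    options={"log": ["explanation"]},  # also request an explanation of the score
)

prompt = "What year did the first moon landing happen?"
response = "The first moon landing happened in 1969."

result = tlm.get_trustworthiness_score(prompt, response)
print(result["trustworthiness_score"])  # float in [0, 1], like the component's score output
print(result["log"]["explanation"])     # why the score is high or low
```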

CleanlabRemediator

Purpose: Use the trust score from the CleanlabEvaluator component to determine whether to show, warn about, or replace an LLM response. The score threshold, warning text, and fallback message are configurable, so you can customize them as needed.

Inputs

| Name | Type | Description |
|------|------|-------------|
| response | Message | The response to potentially remediate |
| score | number | Trust score from CleanlabEvaluator |
| explanation | Message | (Optional) Explanation to append if warning is shown |
| threshold | float | Minimum trust score to pass response unchanged |
| show_untrustworthy_response | bool | Show original response with warning if untrustworthy |
| untrustworthy_warning_text | Prompt | Warning text for untrustworthy responses |
| fallback_text | Prompt | Fallback message if response is hidden |

Output

| Name | Type | Description |
|------|------|-------------|
| remediated_response | Message | Final message shown to user after remediation logic |
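The remediation behavior reduces to simple threshold gating over the trust score. Below is a minimal standalone sketch of that logic; the function name and default texts are hypothetical and only illustrate how the inputs above interact, not the component's actual source.

```python
# Illustrative re-implementation of the CleanlabRemediator decision logic.
# Names and default texts are hypothetical; the real component is configured
# through its Langflow inputs (threshold, show_untrustworthy_response, etc.).

def remediate(
    response: str,
    score: float,
    explanation: str | None = None,
    threshold: float = 0.7,
    show_untrustworthy_response: bool = True,
    untrustworthy_warning_text: str = "WARNING: this response may be untrustworthy.",
    fallback_text: str = "I'm not confident in my answer. Please rephrase or consult another source.",
) -> str:
    if score >= threshold:
        return response  # trusted: pass through unchanged
    if show_untrustworthy_response:
        remediated = f"{response}\n\n{untrustworthy_warning_text}"
        if explanation:
            remediated += f"\nExplanation: {explanation}"
        return remediated  # untrusted: show with a warning appended
    return fallback_text  # untrusted: hide the response and return the fallback

print(remediate("The moon is made of cheese.", score=0.09))
```

With the default settings in this sketch, a score of 0.09 falls below the threshold and triggers the warning path, matching the example screenshot later in this guide.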

See example outputs below!


CleanlabRAGEvaluator

Purpose: Comprehensively evaluate RAG and LLM pipeline outputs by analyzing the context, query, and response quality using Cleanlab. This component assesses trustworthiness, context sufficiency, response groundedness, helpfulness, and query ease. Learn more about these evaluation metrics in Cleanlab's documentation. You can also pair the CleanlabRemediator component with this one to remediate low-trust responses coming from the RAG pipeline.

Inputs

| Name | Type | Description |
|------|------|-------------|
| cleanlab_api_key | Secret | Your Cleanlab API key |
| cleanlab_evaluation_model | Dropdown | Evaluation model used by Cleanlab (GPT-4, Claude, etc.). This does not need to be the same model that generated the response. |
| quality_preset | Dropdown | Tradeoff between evaluation speed and accuracy |
| context | Message | Retrieved context from your RAG system |
| query | Message | The original user query |
| response | Message | The model's response to the query, based on the context (from OpenAI, Claude, or any other LLM) |
| run_context_sufficiency | bool | Evaluate whether context supports answering the query |
| run_response_groundedness | bool | Evaluate whether the response is grounded in the context |
| run_response_helpfulness | bool | Evaluate how helpful the response is |
| run_query_ease | bool | Evaluate if the query is vague, complex, or adversarial |

Outputs

| Name | Type | Description |
|------|------|-------------|
| trust_score | number | Overall trust score |
| trust_explanation | Message | Explanation for trust score |
| other_scores | dict | Dictionary of optional enabled RAG evaluation metrics |
| evaluation_summary | Message | Markdown summary of query, context, response, and evaluation results |
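For reference, here is a hedged sketch of running similar RAG evaluations directly with Cleanlab's Python client. The `TrustworthyRAG` class and its `score` method are taken from Cleanlab's public client and may not match this component's internals exactly; treat the names and the returned structure as assumptions.

```python
# Hedged sketch of a RAG evaluation similar to what CleanlabRAGEvaluator runs.
# Class and method names (TrustworthyRAG, .score) are based on Cleanlab's public
# client and may differ from the component's internals -- treat as illustrative.
from cleanlab_tlm import TrustworthyRAG  # pip install cleanlab-tlm

# Assumes your Cleanlab API key is available to the client (e.g. via environment variable).
rag_evaluator = TrustworthyRAG()

query = "How many vacation days do employees get?"
context = "Employees receive 20 vacation days per year and health insurance."
response = "Employees get 20 vacation days per year."

scores = rag_evaluator.score(query=query, context=context, response=response)

# `scores` maps metric names (e.g. trustworthiness, context_sufficiency,
# response_groundedness, response_helpfulness, query_ease) to result dicts.
for metric, result in scores.items():
    print(metric, result["score"])
```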

Example Flows

The following example flows show how to use the CleanlabEvaluator and CleanlabRemediator components to evaluate and remediate responses from any LLM, and how to use the CleanlabRAGEvaluator component to evaluate RAG pipeline outputs.

Evaluate and remediate responses from any LLM

Download the flow to follow along!

This flow evaluates and remediates the trustworthiness of a response from any LLM using the CleanlabEvaluator and CleanlabRemediator components.

Evaluate response trustworthiness

Connect the Message output from any LLM component (such as OpenAI, Anthropic, or Google) to the response input of the CleanlabEvaluator component, and connect your prompt to its prompt input.

That's it! The CleanlabEvaluator component will return a trust score and explanation which you can use however you'd like.

The CleanlabRemediator component uses this trust score and your user-configurable settings to determine whether to output the original response, warn about it, or replace it with a fallback answer.

The example below shows a response that was determined to be untrustworthy (score of 0.09) and flagged with a warning by the CleanlabRemediator component.

CleanlabRemediator Example

If you don't want to show untrustworthy responses, you can also configure the CleanlabRemediator to replace the response with a fallback message.

CleanlabRemediator Example

Evaluate RAG pipeline

The flow below is the Vector Store RAG example template, with the CleanlabRAGEvaluator component added to evaluate the context, query, and response. You can use the CleanlabRAGEvaluator with any flow that has a context, query, and response: simply connect those three outputs from your RAG pipeline to the CleanlabRAGEvaluator component.

Evaluate RAG pipeline

Here is an example of the Evaluation Summary output from the CleanlabRAGEvaluator component.

Evaluate RAG pipeline

Notice how the Evaluation Summary includes the query, context, response, and all the evaluation results! In this example, the Context Sufficiency and Response Groundedness scores are low (0.002) because the context doesn't contain information about the query and the response is not grounded in the context.
