
Guardrails

The Guardrails component validates input text against security and safety guardrails by issuing prompts to a language model (LLM) to check for violations.

The following guardrails can be validated against:

  • PII: Detects personally identifiable information such as names, addresses, phone numbers, email addresses, Social Security numbers, credit card numbers, or other personal data.
  • Tokens/Passwords: Detects API tokens, passwords, API keys, access keys, secret keys, authentication credentials, or other sensitive credentials.
  • Jailbreak: Detects attempts to bypass AI safety guidelines, manipulate the model's behavior, or make it ignore its instructions.
  • Offensive Content: Detects offensive, hateful, discriminatory, violent, or inappropriate content.
  • Malicious Code: Detects potentially malicious code, scripts, exploits, or harmful commands.
  • Prompt Injection: Detects attempts to inject malicious prompts, override system instructions, or manipulate the AI's behavior through embedded instructions.

When validation passes, the input continues through the Pass output. When validation fails, the input is blocked and sent through the Fail output with a justification explaining why it failed.

The Jailbreak and Prompt Injection guardrails run an additional heuristic detection stage first, then fall back to LLM validation if needed. This stage identifies obvious patterns quickly and reduces API costs by avoiding unnecessary LLM calls for clear violations.
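The two-stage check can be sketched as follows. This is a minimal illustration of the weighted-pattern approach described here and in the Heuristic Detection Threshold parameter; the specific patterns, weights, and function names are assumptions, not the component's actual implementation:

```python
# Illustrative sketch of a two-stage jailbreak/prompt injection check.
# Patterns, weights, and helper names are assumptions for demonstration only.

HEURISTIC_PATTERNS = {
    "ignore instructions": 0.7,  # strong pattern: high weight
    "jailbreak": 0.7,
    "bypass": 0.2,               # weak pattern: low weight
    "act as": 0.2,
}

def heuristic_score(text: str) -> float:
    """Sum the weights of every known pattern found in the input."""
    lowered = text.lower()
    return sum(w for pat, w in HEURISTIC_PATTERNS.items() if pat in lowered)

def check_input(text: str, threshold: float = 0.7) -> str:
    """Fail fast on obvious violations; otherwise defer to the LLM."""
    if heuristic_score(text) >= threshold:
        return "fail"        # clear violation, no LLM call needed
    return ask_llm(text)     # ambiguous input: fall back to LLM validation

def ask_llm(text: str) -> str:
    # Placeholder for the LLM validation call.
    return "pass"
```

With the default threshold of 0.7, a single strong pattern is enough to fail immediately, while weak patterns alone defer the decision to the LLM.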

The Guardrails component uses a language model to analyze input and can produce false positives or miss some violations. Use this component in addition to other data-sanitization best practices, such as personnel training and scripts that check for literal values or regex patterns, rather than as a sole safeguard.
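A literal/regex pre-check of the kind mentioned above can be sketched as follows. The patterns here are simplified examples of a complementary safeguard, not an exhaustive credential scanner:

```python
import re

# Illustrative regex checks to run alongside the LLM-based guardrails.
# These patterns are simplified examples, not a complete scanner.
LITERAL_CHECKS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),       # US SSN-like number
    re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),     # API-key-like token
    re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b"),  # credit-card-like number
]

def literal_violation(text: str) -> bool:
    """Return True if any literal/regex check matches the input."""
    return any(pattern.search(text) for pattern in LITERAL_CHECKS)
```

Deterministic checks like this catch known formats reliably, while the LLM-based guardrails cover the cases that regex cannot express.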

Use the Guardrails component in a flow

  1. Connect a Chat Input or other text source to the Guardrails component's Input Text port.
  2. Select a Language Model to use for validation. The component uses the connected LLM to analyze the input text against the enabled guardrails.
  3. From the Guardrails dropdown, select one or more guardrails to enable. For example, select Tokens/Passwords to block API keys and credentials.
  4. Connect the Pass output to components to receive validated input.
  5. Optionally, connect the Fail output to handle blocked inputs, such as a Chat Output component or Write File component.

Create custom guardrails

Use the Enable Custom Guardrail parameter to define your own guardrail validations. In the Custom Guardrail Description field, enter a natural-language description of the disallowed data that you want to detect.

Custom guardrails run alongside the built-in guardrails and follow the same validation process.

For example, to block inputs that mention competitor names or products, enter the following in the Custom Guardrail Description field:


competitor company names, competitor product names, or references to competing services

When this custom guardrail is enabled, the LLM analyzes the input text against your criteria. If it detects content matching your description, such as mentions of competitors, validation fails and the input is blocked. Otherwise, validation passes and the input continues through the Pass output.
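The custom-guardrail flow can be sketched as an LLM prompt built from your description. The prompt wording, the `call_llm` callable, and the result shape below are assumptions for illustration, not the component's actual API:

```python
# Illustrative sketch of custom-guardrail validation via an LLM prompt.
# `call_llm`, the prompt wording, and the result dict are assumptions.

def validate_custom_guardrail(input_text: str, description: str, call_llm) -> dict:
    prompt = (
        "You are a security validator. Fail the input if it contains: "
        f"{description}\n\n"
        f"Input:\n{input_text}\n\n"
        'Answer with exactly "PASS" or "FAIL: <justification>".'
    )
    verdict = call_llm(prompt)
    if verdict.startswith("FAIL"):
        # A blocked input is routed to the Fail output with a justification.
        return {"passed": False,
                "justification": verdict.partition(":")[2].strip()}
    return {"passed": True, "justification": ""}
```

On failure, the justification string is what downstream components connected to the Fail output would receive.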

Guardrails parameters

Some parameters are hidden by default in the visual editor. You can modify all component parameters through the component inspection panel that appears when you select a component.

| Name | Type | Description |
| --- | --- | --- |
| Language Model (model) | LanguageModel | Input parameter. Connect a Language Model component to use as the driver for this component. The model reviews the data, compares it against the enabled guardrails, and determines whether any data violates them. |
| API Key (api_key) | Secret String | Input parameter. Model provider API key. Required if the model provider needs authentication. |
| Guardrails (enabled_guardrails) | Multiselect | Input parameter. Select one or more security guardrails to validate the input against. Options: PII, Tokens/Passwords, Jailbreak, Offensive Content, Malicious Code, Prompt Injection. Default: ["PII", "Tokens/Passwords", "Jailbreak"]. |
| Input Text (input_text) | Multiline String | Input parameter. The text to validate against guardrails. Accepts Message input types. |
| Enable Custom Guardrail (enable_custom_guardrail) | Boolean | Input parameter. Enable a custom guardrail with your own validation criteria. Default: false. |
| Custom Guardrail Description (custom_guardrail_explanation) | Multiline String | Input parameter. Describe what the custom guardrail should check for. The LLM uses this description to validate the input. Be specific and clear about what you want to detect. Used only when enable_custom_guardrail is true. |
| Heuristic Detection Threshold (heuristic_threshold) | Slider | Input parameter. Score threshold (0.0-1.0) for heuristic jailbreak/prompt injection detection. Strong patterns such as "ignore instructions" and "jailbreak" have high weights, while weak patterns such as "bypass" and "act as" have low weights. If the cumulative score meets or exceeds this threshold, the input fails immediately. Lower values are stricter; higher values defer more cases to LLM validation. Default: 0.7. |