Guardrails
Guardrails provide a mechanism to block an AI agent from responding, either at the input phase (before the agent processes a request) or at the output phase (before the response is returned).
Example
Your company is developing product STARSHIP. The product is highly classified, and it is of utmost importance that no information about it leaves your company's servers. Still, you know that your employees will want to use state-of-the-art large language models to advance the design and implementation of STARSHIP. Beyond putting a policy in place that bans mentioning STARSHIP or any of its implementation details on external servers, you want to enforce this on a technical level and decide to use guardrails to do so.
Set up a guardrail
Go to the Guardrails tool in the Cockpit to set up AI guardrails. You must choose among three different types of guardrails.
Every guardrail provides a binary decision on whether the given content should PASS
or FAIL. If it fails, a content filter is triggered and any further action is prohibited.
Global settings
You can define whether a guardrail should deny on error. If this option is selected,
any error thrown during the execution of the guardrail leads to a FAIL.
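The PASS/FAIL semantics and the deny-on-error option can be summed up in a minimal sketch. This is illustrative only; names such as `run_guardrail` are not part of the product API.

```python
# Minimal sketch of guardrail evaluation semantics (illustrative only;
# `run_guardrail` is a hypothetical name, not the product API).

def run_guardrail(check, content, deny_on_error=False):
    """Return True (PASS) or False (FAIL) for a single guardrail check."""
    try:
        return bool(check(content))
    except Exception:
        # With "deny on error" enabled, any exception counts as FAIL.
        if deny_on_error:
            return False
        raise

# A check that raises on purpose, to demonstrate deny-on-error:
broken_check = lambda content: 1 / 0
print(run_guardrail(broken_check, "hello", deny_on_error=True))  # False
```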
1. Guardrail type: Rule
You can use existing Rules Engine rules to determine whether content should pass. For this, the rule type regex is best suited, because it allows you to define simple matching rules on the content.
Rules engine interface
When defining a guardrail of type Rule, make sure to create a rules engine
with an interface property content. You can also add an interface property
username that receives the name of the user who called the agent.
Rules engine rules
Define any number of rules on your rules engine object. Make sure to return deny
if the collective result of the matching rules should turn the guardrail
result to FAIL, and allow if you want the execution to proceed.
Example
You want to flatly deny all requests that include the product name STARSHIP. As a
simple blocking rule, you create a regex that matches on the property content.
Tip
Use a large language model to help you configure your regex.
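The blocking rule from the example above can be sketched in ordinary Python. This is not the Rules Engine syntax itself, only an illustration of the deny/allow logic such a regex rule expresses.

```python
import re

# Illustrative sketch of a regex-based blocking rule: deny any content
# that mentions STARSHIP, case-insensitively.
STARSHIP_PATTERN = re.compile(r"\bstarship\b", re.IGNORECASE)

def rule_decision(content: str) -> str:
    """Return "deny" if the pattern matches, "allow" otherwise."""
    return "deny" if STARSHIP_PATTERN.search(content) else "allow"

print(rule_decision("Let's discuss the STARSHIP thrusters."))  # deny
print(rule_decision("What's the weather today?"))              # allow
```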
2. Guardrail type: AI Model
An AI Model-based guardrail employs a large language model to decide whether a given text should pass or not.
To configure it, you select the underlying AI model and provide the instructions that determine whether to allow or block the content.
Example
You provide an instruction along the lines of: "Block any text that mentions the product STARSHIP or any of its implementation details; allow everything else."
3. Guardrail type: Script
The guardrail type Script is the most flexible type of guardrail, because it allows the developer to set up any complex processing logic they need. This could involve calling multiple agents, calling an external API that runs a machine learning model, or running a diverse set of heuristics to determine whether a given input should pass.
The selected script receives a variable under the name payload of the following
structure at runtime:
{
content: "This is the content",
username: "Admin"
}
It should set a variable result either to a boolean, where false means that
the guardrail FAILS, or to an object of the structure
{
allow: bool,
reason: string
}
Example
The following guardrail script checks whether "starship" is part of the content.
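A minimal sketch of such a script, assuming a Python-like scripting runtime in which `payload` is injected and the script communicates its verdict by setting `result` (here using the object form with `allow` and `reason`):

```python
# Sketch of a Script guardrail body. In the real runtime, `payload` is
# provided by the platform; it is hard-coded here for illustration.
payload = {"content": "Status report on STARSHIP propulsion", "username": "Admin"}

if "starship" in payload["content"].lower():
    result = {"allow": False, "reason": "Content mentions classified product STARSHIP."}
else:
    result = {"allow": True, "reason": ""}

print(result["allow"])  # False
```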
Connect a guardrail to an agent
To set up guardrails on an agent, edit the agent artifact and go to the Guardrails tab. There, you can add guardrails both on input and on output. At runtime, the guardrails are executed in the order they have been set.
If input guardrails are set and their execution leads to a FAIL, an error of type content filter is thrown. This can be reviewed in Agent Trace.
If output guardrails are set, the guardrails are executed on the final output of
the agent and throw an error if they fail. However, note that if the agent is set
to stream its results, a failing guardrail does not retract the stream. Instead,
the consuming side has to handle the case of a content filter error being thrown
and react accordingly.
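One way the consuming side could react is sketched below. `stream_agent` and `ContentFilterError` are hypothetical names chosen for illustration, not the product API; the point is that chunks already received cannot be retracted, so the consumer must decide what to do with them.

```python
# Sketch of a consumer handling a content-filter error raised mid-stream.
# `stream_agent` and `ContentFilterError` are hypothetical names.

class ContentFilterError(Exception):
    pass

def stream_agent(prompt):
    yield "Partial "
    yield "answer... "
    raise ContentFilterError("output guardrail failed")

chunks = []
try:
    for chunk in stream_agent("tell me about the project"):
        chunks.append(chunk)
except ContentFilterError:
    # The stream cannot be retracted; discard or redact what was received.
    chunks = ["[response withheld by content filter]"]

print("".join(chunks))  # [response withheld by content filter]
```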
Tip
Order your guardrails in increasing complexity. Rules engine guardrails should come first; AI Model and script-based guardrails should follow according to how complex they are. The goal is to "fail early" in order to avoid heavier computation when it is not absolutely necessary.
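The fail-early ordering can be sketched as a short-circuiting chain: the cheap regex check runs first and FAILS, so the (simulated) expensive model check is never invoked. Function names here are illustrative, not the product API.

```python
import re

# Illustrative fail-early guardrail chain: the cheap check short-circuits
# the chain before the expensive check ever runs.
calls = []

def cheap_regex_check(content):
    calls.append("regex")
    return re.search(r"starship", content, re.IGNORECASE) is None

def expensive_model_check(content):
    calls.append("model")  # pretend this invokes a large language model
    return True

def run_chain(content, checks):
    """Return True only if every guardrail passes; stop at the first FAIL."""
    return all(check(content) for check in checks)

passed = run_chain("STARSHIP blueprints", [cheap_regex_check, expensive_model_check])
print(passed, calls)  # False ['regex']
```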