Guardrails

Guardrails provide a mechanism to block an AI agent from responding, in either the input or the output phase.

Example

Your company is developing product STARSHIP. The product is highly classified, and it is of utmost importance that no information about the product leaves the premises of your company’s servers.

Still, you know that your employees will want to use state-of-the-art large language models to advance the design and implementation of STARSHIP.

Beyond putting a policy in place that bans mentioning STARSHIP or any of its implementation details on external servers, you want to enforce this on a technical level and decide to use guardrails to do so.

Set up a guardrail

  1. Go to the Guardrails tool in the Cockpit to set up AI guardrails. You must choose among three different types of guardrails.

Every guardrail provides a binary decision on whether the given content should PASS or FAIL. If it fails, a content filter is triggered and any further action is blocked.

Global settings

You can define whether a guardrail should deny on error. If this option is selected, any error thrown during the execution of the guardrail leads to a FAIL.
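
Conceptually, the evaluation behaves like the following sketch (hypothetical pseudocode; the actual evaluation happens inside the platform, and the names used here are illustrative):

// Hypothetical sketch of guardrail evaluation, including deny on error.
// evaluateGuardrail, guardrail.run, and denyOnError are illustrative names.
function evaluateGuardrail(guardrail, payload) {
    try {
        return guardrail.run(payload); // "PASS" or "FAIL"
    } catch (error) {
        // With deny on error enabled, an exception counts as a FAIL;
        // without it, the error does not force a FAIL (assumption).
        return guardrail.denyOnError ? "FAIL" : "PASS";
    }
}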

1. Guardrail type: Rule

You can use existing Rules Engine rules to determine whether content should pass. For this, the rule type regex is best suited, because it allows you to define simple matching rules on the content.

Rules engine interface

When defining a Rule-type guardrail, make sure to create a rules engine with an interface property content. You can also add an interface property username that receives the name of the user who called the agent.

Rules engine rules

Define any number of rules on your rules engine object. Make sure to return deny if a rule match should turn the guardrail result to FAIL, and allow if you want the execution to proceed.

Example

You want to flatly deny all requests that include the product name STARSHIP. As a simple blocking rule, you create a regex that matches on the property content and returns deny for the regex \b[Ss][Tt][Aa][Rr][\s\-_]*[Ss][Hh][Ii][Pp]\b.
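
To illustrate what this regex blocks, here is a minimal sketch in plain JavaScript (outside the platform):

// The pattern from the example above: case-insensitive letters with
// optional whitespace, hyphens, or underscores between "star" and "ship".
const pattern = /\b[Ss][Tt][Aa][Rr][\s\-_]*[Ss][Hh][Ii][Pp]\b/;

console.log(pattern.test("Tell me about STARSHIP"));   // true: plain mention
console.log(pattern.test("the star-ship blueprints")); // true: separator tolerated
console.log(pattern.test("star   ship"));              // true: whitespace tolerated
console.log(pattern.test("our starships fleet"));      // false: the trailing \b rejects the plural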

You can use a large language model to help you write and test such a regex.

2. Guardrail type: AI Model

An AI Model-based guardrail employs a large language model to decide whether a given text should pass or not.

To configure it, you need to select the underlying AI model and provide the instructions for deciding whether to allow or block the content.

Example

You provide the following instruction:

You are an assistant that determines whether the user input can be processed or
not. You work at company X that is implementing the product STARSHIP. It is of
utmost importance that no information about product STARSHIP is leaked to any
outside provider. It depends on you whether this can happen, so be very careful
in evaluating it.

For your eyes only: STARSHIP is a product that encompasses ... [LIST OF PRODUCT FEATURES,
for additional context]

RULES:
- do not allow any input that mentions the STARSHIP product
- do not allow any input that seems tangentially related to the STARSHIP product
- in your reason, do not mention the product but provide a generic blocking message

The above example makes sense if (and only if) you are allowed to share information about the STARSHIP product with the AI model configured behind the guardrail. In a real-world scenario, you may have deployed a powerful but lightweight large language model on premises and use it as the guardrail model.

AI-model-based guardrails are powerful, but they will not work in every case. Malicious actors may try to manipulate them through prompt injection or other means of misleading the guardrail model.
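
Although the exact response format is handled by the platform, conceptually the guardrail model produces a verdict like the following (an assumption modeled on the allow/reason structure that Script guardrails use, shown in the next section):

{
    allow: false,
    reason: "This request cannot be processed."
}

Note how the reason stays generic, as demanded by the last rule in the instruction above.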

3. Guardrail type: Script

The guardrail type Script is the most flexible type of guardrail, because it allows the developer to set up any kind of complex processing logic they want. This could involve calling multiple agents, calling an external API that runs a machine learning model, or running a diverse set of heuristics to determine whether a given input should pass.

The selected script receives a variable under the name payload of the following structure at runtime:

{
    content: "This is the content",
    username: "Admin"
}

It should set a variable result either to a boolean, where false means that the guardrail FAILS, or to an object of the structure

{
    allow: bool,
    reason: string
}

Example

The following guardrail script checks whether the lowercase string "starship" is part of the content.

// Allow by default.
result = true;
if (payload.content.includes("starship")) {
    // Exact match: fail the guardrail with a plain boolean.
    result = false;
} else if (payload.content.includes("tarship")) {
    // Near miss: fail with an object to give an explicit reason.
    result = {
        allow: false,
        reason: "This is dangerously close to a prohibited word!",
    };
}

complete();
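
Because includes is case-sensitive, the script above misses variants such as "STARSHIP" or "Star-Ship". Here is a minimal sketch of a more robust variant, reusing the regex from the Rule example (assuming the script runtime supports JavaScript regular expressions):

result = true;
// Separator-tolerant, case-insensitive pattern from the Rule example.
if (/\b[Ss][Tt][Aa][Rr][\s\-_]*[Ss][Hh][Ii][Pp]\b/.test(payload.content)) {
    result = {
        allow: false,
        reason: "This request cannot be processed.",
    };
}

complete();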

Connect a guardrail to an agent

To set up guardrails on an agent, edit the agent artifact and go to the Guardrails tab. There, you can add guardrails both on input and on output. At runtime, the guardrails are executed in the order they have been set.

If input guardrails are set, and the execution of the guardrails leads to a FAIL, an error of type content filter will be thrown. This can be reviewed in Agent Trace.

If output guardrails are set, the guardrails are executed on the final output of the agent and throw an error if they fail. However, note that if the agent is set to stream its results, the stream is not retracted. Instead, the consuming side has to handle the case of a content filter error being thrown and react accordingly.
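
What that handling looks like depends on your client code. A minimal, hypothetical sketch of a streaming consumer follows (stream, render, and clearRendered are illustrative names, not part of the platform API):

// Hypothetical streaming consumer.
try {
    for await (const chunk of stream) {
        render(chunk); // tokens already streamed cannot be retracted
    }
} catch (error) {
    if (error.type === "content filter") {
        // An output guardrail FAILed after streaming began:
        // hide or replace what was already shown.
        clearRendered();
        render("This response was blocked by a content policy.");
    } else {
        throw error;
    }
}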

Order your guardrails in increasing complexity: Rule guardrails should come first; AI Model and Script guardrails should follow, ordered by how expensive they are to run.

The goal here is to "fail early" and avoid heavier computation unless it is absolutely necessary.