Guardrails

Simple Definition

Guardrails are safety controls that limit what an AI system can produce or do. They’re barriers that prevent the model from generating harmful content, sharing dangerous information, going off-topic, or taking unintended actions.

Think of them like lane markers on a highway — they don’t prevent the car from moving, but they keep it within safe boundaries.

Types of Guardrails

Input guardrails — filter or block certain types of user inputs before they reach the model

Output guardrails — check and filter the model’s responses before they’re shown to users

System-level guardrails — built into the model during training (RLHF, constitutional AI)

Application-level guardrails — added by developers on top of the base model for specific use cases

What Guardrails Typically Block

  • Harmful or dangerous content (instructions for weapons, self-harm)
  • Inappropriate content (explicit material)
  • Off-topic responses (customer support bot staying on-topic)
  • Personally identifiable information (PII) in outputs
  • Prompt injection attempts

Implementing Guardrails

System prompts — instructions that tell the model what it should and shouldn’t do

Input/output classifiers — secondary AI models that check if content violates policies

Keyword and pattern filtering — rule-based filters for obvious violations

Third-party tools — platforms like Guardrails AI, Nemo Guardrails, or LLaMA Guard

The Balance

Too few guardrails → harmful outputs. Too many → a model that refuses reasonable requests and frustrates users. Finding the right balance is an ongoing challenge in AI development.

  • AI Safety — guardrails are a practical implementation of AI safety
  • Alignment — guardrails help enforce aligned behavior
  • System Prompt — often used to implement application-level guardrails
  • Prompt Injection — attacks that try to bypass guardrails

See AI terms in action

Browse practical AI workflows that use the concepts in this glossary.

Last updated: