Guardrails
Simple Definition
Guardrails are safety controls that limit what an AI system can produce or do. They’re barriers that prevent the model from generating harmful content, sharing dangerous information, going off-topic, or taking unintended actions.
Think of them like lane markers on a highway — they don’t prevent the car from moving, but they keep it within safe boundaries.
Types of Guardrails
Input guardrails — filter or block certain types of user inputs before they reach the model
Output guardrails — check and filter the model’s responses before they’re shown to users
System-level guardrails — built into the model during training (RLHF, constitutional AI)
Application-level guardrails — added by developers on top of the base model for specific use cases
What Guardrails Typically Block
- Harmful or dangerous content (instructions for weapons, self-harm)
- Inappropriate content (explicit material)
- Off-topic responses (customer support bot staying on-topic)
- Personally identifiable information (PII) in outputs
- Prompt injection attempts
Implementing Guardrails
System prompts — instructions that tell the model what it should and shouldn’t do
Input/output classifiers — secondary AI models that check if content violates policies
Keyword and pattern filtering — rule-based filters for obvious violations
Third-party tools — platforms like Guardrails AI, Nemo Guardrails, or LLaMA Guard
The Balance
Too few guardrails → harmful outputs. Too many → a model that refuses reasonable requests and frustrates users. Finding the right balance is an ongoing challenge in AI development.
Related Terms
- AI Safety — guardrails are a practical implementation of AI safety
- Alignment — guardrails help enforce aligned behavior
- System Prompt — often used to implement application-level guardrails
- Prompt Injection — attacks that try to bypass guardrails
See AI terms in action
Browse practical AI workflows that use the concepts in this glossary.
Last updated: