Computer Vision

Simple Definition

Computer vision is the field of AI that enables computers to “see” and interpret visual information — recognizing what’s in images and video, understanding spatial relationships, and making sense of the visual world.

It’s what allows your phone to recognize your face, a Tesla to identify pedestrians, and Google Photos to let you search your pictures by content.

What Computer Vision Can Do

Object detection — identify and locate objects in images (“there’s a cat in the top-left corner”)

Image classification — categorize an entire image (“this is a photo of a beach”)

Facial recognition — identify specific individuals from photos or video

Scene understanding — understand the full context of an image

Optical character recognition (OCR) — extract text from images and documents

Medical imaging — detect tumors and other anomalies in scans to support diagnosis

Video analysis — track objects, detect events, analyze motion over time
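To make the difference between the first two tasks concrete, here is a minimal sketch of what each kind of output might look like as data. The `Detection` class, the `describe_position` helper, the box coordinates, and the image size are all hypothetical and chosen only for illustration. Real detection libraries define their own output formats.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical structure, for illustration only)."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels, origin at top-left


def describe_position(det, image_width, image_height):
    """Turn a bounding box into a rough description of where the object is."""
    x_min, y_min, x_max, y_max = det.box
    center_x = (x_min + x_max) / 2
    center_y = (y_min + y_max) / 2
    horizontal = "left" if center_x < image_width / 2 else "right"
    vertical = "top" if center_y < image_height / 2 else "bottom"
    return f"there's a {det.label} in the {vertical}-{horizontal} corner"


# Image classification: a single label for the whole image.
classification = "beach"

# Object detection: a label plus a location for each object found.
detection = Detection(label="cat", box=(20, 10, 120, 90))
summary = describe_position(detection, image_width=640, image_height=480)
print(summary)
```

Classification answers "what is this image?", while detection answers "what is in it, and where?", which is why its output carries coordinates as well as a label.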

How It Works

Modern computer vision relies on deep learning, particularly convolutional neural networks (CNNs) and, increasingly, vision transformers (ViTs). These models are typically trained on large sets of labeled images, often millions, and learn to extract visual features at progressively higher levels of abstraction: edges and textures in early layers, then object parts and whole objects in later ones.
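As a rough illustration of the lowest level of feature extraction, the sketch below applies a single 3x3 filter to a tiny grayscale image, which is the core operation inside a CNN layer. The image values and the hand-written vertical-edge kernel are assumptions made for this example. In a real network the kernel weights are learned during training, and there are many filters per layer.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    This is the cross-correlation that CNN layers compute in practice.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0, value) for value in row] for row in feature_map]


# Illustrative 5x5 image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-written kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

edges = relu(convolve2d(image, kernel))
for row in edges:
    print(row)
```

The output is large where the window straddles the dark-to-bright boundary and zero where the window covers a uniform region. Stacking many such filters, layer after layer, is how a CNN builds from edges up to more abstract features.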

Applications in Everyday Life

  • Smartphone face unlock
  • Google Lens — identify objects by pointing your camera
  • Self-driving car perception
  • Industrial quality control
  • Security and surveillance cameras
  • Augmented reality filters

Computer Vision in Multimodal AI

Modern AI assistants like GPT-4o and Claude can “see” — you can send them images and they’ll describe, analyze, or answer questions about what they see. This is computer vision integrated into conversational AI.
