Engineering · 9 min read · Feb 3, 2026

Building an AI Agent That Actually Sees

Most AI support tools are text-to-text: customer input → LLM → text response. They're fast and cheap, but they're blind. Orasi is different. We built an agent that processes pixels, not just words.

The Multimodal Challenge

Traditional AI agents decompose problems into language. 'The WiFi light is blinking' becomes a query sent to a large language model. But that translation already loses information. An actual image of the blinking light, with its color, intensity, and surrounding context, is far richer than any text description of it.

We built Orasi as a multimodal agent: it processes live video streams, extracts visual features in real-time, reasons about what it sees, and issues commands or guidance based on visual context. This required rethinking everything from inference latency to memory management.
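
To make the loop concrete, here is a minimal sketch of a perceive-reason-act step. All of the names (FrameEncoder, PolicyLLM, agent_step) are illustrative placeholders, not Orasi's actual components; the point is the shape of the loop: encode the current frame, combine it with the customer's words and recent history, and produce the next instruction.

```python
# Hedged sketch of one tick of a multimodal agent loop. Names are hypothetical.
from dataclasses import dataclass, field
from typing import Any, Protocol

class FrameEncoder(Protocol):
    def embed(self, frame: Any) -> list[float]: ...   # pixels -> feature vector

class PolicyLLM(Protocol):
    def generate(self, prompt: dict) -> str: ...      # multimodal prompt -> guidance

@dataclass
class AgentState:
    history: list[dict] = field(default_factory=list)  # prior turns kept for reasoning

def agent_step(frame: Any, transcript: str,
               encoder: FrameEncoder, llm: PolicyLLM,
               state: AgentState) -> str:
    """Encode what the agent sees, fuse it with what the customer said,
    and return the next piece of guidance."""
    prompt = {
        "visual": encoder.embed(frame),      # what the agent sees right now
        "speech": transcript,                # what the customer just said
        "history": state.history[-20:],      # bounded rolling context
    }
    guidance = llm.generate(prompt)
    state.history.append({"transcript": transcript, "guidance": guidance})
    return guidance
```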

Real-Time Inference at the Edge

Processing video at scale is computationally expensive. Sending every frame to a centralized server creates latency and bandwidth costs. Our solution: edge inference. The agent analyzes video locally on the customer's device or a nearby edge node, minimizing round-trip latency to under 200ms.

This lets the agent maintain a real-time conversation with the customer—watching what they do, responding immediately, guiding step-by-step actions without the feel of talking to a slow API.
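
The sketch below shows one way such a frame loop can stay inside a latency budget, under the assumption that a lightweight model runs on the device or edge node and only compact, structured events travel upstream. The helper names (local_model, send_event) are hypothetical, not Orasi's real API.

```python
# Hedged sketch of on-device frame processing with a fixed latency budget.
import time

LATENCY_BUDGET_S = 0.200   # target end-to-end budget from the post (<200 ms)

def process_stream(frames, local_model, send_event, sample_every=3):
    """Analyze every Nth frame locally; escalate only structured events."""
    for i, frame in enumerate(frames):
        if i % sample_every:                   # frame sampling keeps compute in budget
            continue
        start = time.monotonic()
        detection = local_model.infer(frame)   # runs on the device or edge node
        elapsed = time.monotonic() - start
        if elapsed > LATENCY_BUDGET_S:
            sample_every += 1                  # back off if inference falls behind
        if detection.confidence > 0.5:
            # Ship a small JSON event, not raw pixels, to the reasoning layer.
            send_event({"label": detection.label,
                        "confidence": detection.confidence,
                        "latency_s": round(elapsed, 3)})
```

Shipping events instead of frames is what keeps the bandwidth cost manageable: the raw video never leaves the edge, only the conclusions drawn from it.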

The Memory Problem

Long-context understanding is critical. An agent needs to remember what the customer tried 5 minutes ago, what error occurred, and what state the device is in now. Managing that memory across multiple video frames and interactions is non-trivial. We use a hybrid approach: key frames are stored in a compact vector representation, and full conversation history is retained for reasoning.
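
As a rough illustration of that hybrid approach, the sketch below keeps keyframe embeddings for visual recall alongside the full text history for reasoning. The embedding handling and retrieval logic are simplified stand-ins, not our production memory layer.

```python
# Simplified sketch of a hybrid memory: compact keyframe vectors + full transcript.
import numpy as np

class HybridMemory:
    def __init__(self):
        self.keyframe_vectors: list[np.ndarray] = []  # compact visual memory
        self.keyframe_meta: list[dict] = []           # timestamps, labels, etc.
        self.transcript: list[dict] = []              # full conversation history

    def remember_frame(self, embedding: np.ndarray, meta: dict) -> None:
        self.keyframe_vectors.append(embedding / np.linalg.norm(embedding))
        self.keyframe_meta.append(meta)

    def remember_turn(self, role: str, text: str) -> None:
        self.transcript.append({"role": role, "text": text})

    def recall_frames(self, query: np.ndarray, k: int = 3) -> list[dict]:
        """Return metadata for the k keyframes most similar to the query,
        e.g. 'what did the device look like five minutes ago?'."""
        if not self.keyframe_vectors:
            return []
        q = query / np.linalg.norm(query)
        sims = np.array([v @ q for v in self.keyframe_vectors])
        top = sims.argsort()[::-1][:k]
        return [self.keyframe_meta[i] for i in top]
```

Storing only keyframe embeddings keeps the visual memory small enough to search quickly, while the text history is cheap enough to retain in full for the reasoning step.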
