Real-Time Visual Context: Our Architecture
Achieving sub-200ms latency for live video analysis requires rethinking traditional cloud architecture. Here's how we built Orasi's real-time pipeline.
The Latency Budget
Video capture: 16ms (60 fps)
WebRTC transmission: 30ms (peer-to-peer optimization)
Inference: 80ms (edge inference + quantization)
Decision/response generation: 40ms
Transmission back to client: 20ms
Total: 186ms. Below the 200ms threshold for 'real-time' conversation.
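For concreteness, here's the budget as a quick sanity check in TypeScript. The stage names are just labels for the numbers above; the structure is purely illustrative:

```typescript
// Latency budget sanity check. Values mirror the numbers in the post.
const budgetMs: Record<string, number> = {
  capture: 16,      // one frame at 60 fps (~16.7 ms)
  uplink: 30,       // WebRTC peer-to-peer transmission
  inference: 80,    // edge inference with a quantized model
  decision: 40,     // decision / response generation
  downlink: 20,     // transmission back to the client
};

const total = Object.values(budgetMs).reduce((a, b) => a + b, 0);
console.assert(total <= 200, `budget blown: ${total} ms`);
console.log(`end-to-end budget: ${total} ms`); // 186 ms
```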
Edge Inference Strategy
We don't send every frame to a central server. Instead, we use a two-tier approach:
1. Edge tier: Low-latency model runs on the customer's device or nearest edge node. Handles pattern recognition, change detection, and local reasoning.
2. Cloud tier: High-accuracy model runs asynchronously for complex reasoning, historical context, and decision-making.
This hybrid approach keeps latency low while maintaining accuracy.
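A simplified sketch of the routing logic follows. The interfaces and names here are illustrative placeholders, not our production API; the point is that the edge result answers synchronously while the cloud call fires asynchronously and never blocks the interactive path:

```typescript
// Placeholder types -- the real frame and result shapes are more involved.
interface Frame { timestamp: number; pixels: Uint8Array; }
interface EdgeResult { changed: boolean; objects: string[]; }

interface EdgeModel  { infer(frame: Frame): Promise<EdgeResult>; }                  // low-latency, on-device or edge node
interface CloudModel { reason(frame: Frame, context: string[]): Promise<string>; }  // high-accuracy, asynchronous

class TwoTierPipeline {
  constructor(private edge: EdgeModel, private cloud: CloudModel) {}

  // Interactive path: always answered by the edge tier within the latency budget.
  async onFrame(frame: Frame, history: string[]): Promise<EdgeResult> {
    const local = await this.edge.infer(frame);

    // Only escalate interesting frames. The cloud call is fire-and-forget,
    // so it never adds latency to the real-time response.
    if (local.changed) {
      void this.cloud.reason(frame, history).catch(err =>
        console.warn("cloud tier failed, edge result stands", err));
    }
    return local;
  }
}
```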
WebRTC + Custom Codec
We stream video over WebRTC (peer-to-peer, so lower latency) using a custom codec optimized for visual AI: it prioritizes object edges and spatial features over photorealism, which cuts bandwidth by 60% without sacrificing accuracy.
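A stripped-down sketch of the client-side transport setup, using the standard browser WebRTC API. The codec itself isn't shown here, and the unreliable data channel is one illustrative way to carry its compact feature payloads rather than a description of the production wiring:

```typescript
// Client-side transport setup (simplified). Signaling exchange is omitted.
async function startStream(signalingSend: (sdp: RTCSessionDescriptionInit) => void) {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });

  // 60 fps capture to match the 16 ms frame slot in the latency budget.
  const media = await navigator.mediaDevices.getUserMedia({
    video: { frameRate: { ideal: 60 }, width: 1280, height: 720 },
  });
  const track = media.getVideoTracks()[0];
  track.contentHint = "motion"; // bias the encoder toward temporal smoothness over fine detail
  pc.addTrack(track, media);

  // Unordered, no-retransmit channel: stale feature packets are dropped, never resent.
  const features = pc.createDataChannel("visual-features", {
    ordered: false,
    maxRetransmits: 0,
  });

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signalingSend(offer); // answer and ICE candidate exchange omitted for brevity

  return { pc, features };
}
```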
Memory Management
Long conversations accumulate a lot of visual context. We keep recent frames in a ring buffer and compress older history into vector embeddings, which keeps the memory footprint under 500MB even for hour-long conversations.
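A simplified sketch of the two structures. Capacities, embedding sizes, and the similarity search here are illustrative, not production values:

```typescript
// Fixed-size ring buffer: recent frames only, oldest slot overwritten first.
class FrameRingBuffer {
  private slots: (Uint8Array | null)[];
  private head = 0;
  private count = 0;

  constructor(private capacity: number) {
    this.slots = new Array(capacity).fill(null);
  }

  push(frame: Uint8Array): void {
    this.slots[this.head] = frame;                 // overwrite the oldest slot
    this.head = (this.head + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  recent(n: number): Uint8Array[] {
    const out: Uint8Array[] = [];
    for (let i = 0; i < Math.min(n, this.count); i++) {
      const idx = (this.head - 1 - i + this.capacity) % this.capacity;
      out.push(this.slots[idx]!);
    }
    return out; // newest first
  }
}

// Older frames survive only as compact embeddings (e.g. 512 floats ≈ 2 KB each),
// so an hour of history stays far below the 500MB ceiling.
class EmbeddingStore {
  private entries: { t: number; vec: Float32Array }[] = [];

  add(t: number, vec: Float32Array) { this.entries.push({ t, vec }); }

  // Dot-product similarity, assuming normalized embedding vectors.
  nearest(query: Float32Array, k: number) {
    const dot = (a: Float32Array, b: Float32Array) =>
      a.reduce((s, v, i) => s + v * b[i], 0);
    return [...this.entries]
      .sort((x, y) => dot(query, y.vec) - dot(query, x.vec))
      .slice(0, k);
  }
}
```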