Engineering · 11 min · Dec 18, 2025

Real-Time Visual Context: Our Architecture

Achieving sub-200ms latency for live video analysis requires rethinking traditional cloud architecture. Here's how we built Orasi's real-time pipeline.

The Latency Budget

- Video capture: 16ms (one frame at 60 fps)
- WebRTC transmission: 30ms (peer-to-peer optimization)
- Inference: 80ms (edge inference + quantization)
- Decision/response generation: 40ms
- Transmission back to client: 20ms

Total: 186ms. Below the 200ms threshold for 'real-time' conversation.
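
To make the arithmetic explicit, here's a minimal sketch (purely illustrative, not production code) that encodes the stages above and checks the total against the 200ms target:

```typescript
// Illustrative sketch: the latency budget above, expressed as named stage
// constants and checked against the 200ms real-time target.
const LATENCY_BUDGET_MS = {
  capture: 16,        // one frame at 60 fps
  webrtcUplink: 30,   // peer-to-peer transmission
  inference: 80,      // edge inference + quantization
  decision: 40,       // decision/response generation
  downlink: 20,       // transmission back to the client
} as const;

const REAL_TIME_THRESHOLD_MS = 200;

const total = Object.values(LATENCY_BUDGET_MS).reduce((sum, ms) => sum + ms, 0);
console.log(`Total pipeline latency: ${total}ms`); // 186ms

if (total > REAL_TIME_THRESHOLD_MS) {
  throw new Error(`Latency budget exceeded: ${total}ms > ${REAL_TIME_THRESHOLD_MS}ms`);
}
```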

Edge Inference Strategy

We don't send every frame to a central server. Instead, we use a two-tier approach:

1. Edge tier: Low-latency model runs on the customer's device or nearest edge node. Handles pattern recognition, change detection, and local reasoning.

2. Cloud tier: High-accuracy model runs asynchronously for complex reasoning, historical context, and decision-making.

This hybrid approach keeps latency low while maintaining accuracy.
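
Here's a minimal sketch of that routing pattern. `runEdgeModel` and `runCloudModel` are illustrative stand-ins, not the production models:

```typescript
// Sketch of the two-tier routing pattern. runEdgeModel and runCloudModel are
// illustrative stand-ins for the edge and cloud models.
interface FrameAnalysis {
  summary: string;
  confidence: number;
}

// Stand-in for the low-latency edge model (pattern recognition, change detection).
async function runEdgeModel(frame: Uint8Array): Promise<FrameAnalysis> {
  return { summary: `edge summary of ${frame.byteLength}-byte frame`, confidence: 0.7 };
}

// Stand-in for the high-accuracy cloud model (complex reasoning, history).
async function runCloudModel(frame: Uint8Array): Promise<FrameAnalysis> {
  return { summary: `cloud summary of ${frame.byteLength}-byte frame`, confidence: 0.95 };
}

async function analyzeFrame(
  frame: Uint8Array,
  onRefinement: (refined: FrameAnalysis) => void,
): Promise<FrameAnalysis> {
  // Edge tier answers first, so the response stays inside the latency budget.
  const edgeResult = await runEdgeModel(frame);

  // Cloud tier runs in the background; a higher-confidence result refines the
  // conversation state later without blocking the immediate reply.
  void runCloudModel(frame).then((cloudResult) => {
    if (cloudResult.confidence > edgeResult.confidence) onRefinement(cloudResult);
  });

  return edgeResult;
}
```

Returning the edge result immediately keeps each conversational turn inside the latency budget; the cloud refinement only matters when it actually changes the answer.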

WebRTC + Custom Codec

We stream video via WebRTC (peer-to-peer, lower latency) with a custom codec optimized for visual AI—focusing on object edges and spatial features rather than photorealism. This cuts bandwidth by 60% without sacrificing accuracy.
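
For a sense of the browser-side setup, here's a simplified sketch using only standard WebRTC APIs. `startVideoUplink` and `signalOffer` are illustrative names, and a plain bitrate cap stands in for the codec-level savings:

```typescript
// Simplified browser-side capture and send path using standard WebRTC APIs.
// The bitrate cap below is a stand-in for the codec-level savings; the
// feature-oriented encoding itself is not shown here.
async function startVideoUplink(
  signalOffer: (offerSdp: string) => Promise<string>, // app-provided signaling channel
): Promise<RTCPeerConnection> {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { frameRate: { ideal: 60 }, width: 1280, height: 720 },
  });

  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });
  const [track] = stream.getVideoTracks();
  const sender = pc.addTrack(track, stream);

  // Cap the encoder bitrate as a rough proxy for the bandwidth reduction.
  const params = sender.getParameters();
  if (params.encodings.length > 0) {
    params.encodings[0].maxBitrate = 800_000; // ~0.8 Mbps, illustrative
    await sender.setParameters(params);
  }

  // Standard offer/answer exchange over the application's signaling channel.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const answerSdp = await signalOffer(offer.sdp!);
  await pc.setRemoteDescription({ type: "answer", sdp: answerSdp });

  return pc;
}
```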

Memory Management

Long conversations generate large context. We use ring buffers for recent frames and vector embeddings for historical context. This keeps memory footprint under 500MB even for hour-long conversations.
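
A sketch of that layout: a fixed-capacity ring buffer holds raw recent frames, while older context is reduced to compact embeddings. Names and sizes here are illustrative, not our actual configuration:

```typescript
// Illustrative sketch of the memory layout: a bounded ring buffer for recent
// frames plus a lightweight embedding store for older context.
class FrameRingBuffer {
  private frames: (Uint8Array | null)[];
  private next = 0;

  constructor(private capacity: number) {
    this.frames = new Array(capacity).fill(null);
  }

  // Overwrites the oldest frame once full, so memory stays bounded.
  push(frame: Uint8Array): void {
    this.frames[this.next] = frame;
    this.next = (this.next + 1) % this.capacity;
  }

  // Returns up to `count` frames, newest first.
  recent(count: number): Uint8Array[] {
    const out: Uint8Array[] = [];
    for (let i = 1; i <= Math.min(count, this.capacity); i++) {
      const idx = (this.next - i + this.capacity) % this.capacity;
      const frame = this.frames[idx];
      if (frame) out.push(frame);
    }
    return out;
  }
}

// Older context is kept only as compact vector embeddings, not raw frames.
interface ContextEmbedding {
  timestampMs: number;
  vector: Float32Array; // e.g. 512 dims ≈ 2KB vs. hundreds of KB per raw frame
  summary: string;
}

const recentFrames = new FrameRingBuffer(120); // ~2 seconds of video at 60 fps
const history: ContextEmbedding[] = [];        // grows slowly, even over an hour
```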
