Real-Time Visual Context: Our Architecture
Achieving sub-200ms latency for live video analysis requires rethinking traditional cloud architecture. Here's how we built Orasi's real-time pipeline.
The Latency Budget
Video capture: 16ms (60 fps)
WebRTC transmission: 30ms (peer-to-peer optimization)
Inference: 80ms (edge inference + quantization)
Decision/response generation: 40ms
Transmission back to client: 20ms
Total: 186ms. Below the 200ms threshold for 'real-time' conversation.
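For concreteness, here's the budget as a quick sanity check in TypeScript. The stage names are just labels for the numbers above; the structure is purely illustrative:

```typescript
// Latency budget sanity check. Values mirror the numbers in the post.
const budgetMs: Record<string, number> = {
  capture: 16,      // one frame at 60 fps (~16.7 ms)
  uplink: 30,       // WebRTC peer-to-peer transmission
  inference: 80,    // edge inference with a quantized model
  decision: 40,     // decision / response generation
  downlink: 20,     // transmission back to the client
};

const total = Object.values(budgetMs).reduce((a, b) => a + b, 0);
console.assert(total <= 200, `budget blown: ${total} ms`);
console.log(`end-to-end budget: ${total} ms`); // 186 ms
```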
Edge Inference Strategy
We don't send every frame to a central server. Instead, we use a two-tier approach:
1. Edge tier: Low-latency model runs on the customer's device or nearest edge node. Handles pattern recognition, change detection, and local reasoning.
2. Cloud tier: High-accuracy model runs asynchronously for complex reasoning, historical context, and decision-making.
This hybrid approach keeps latency low while maintaining accuracy.
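A simplified sketch of the routing logic follows. The interfaces and names here are illustrative placeholders, not our production API; the point is that the edge result answers synchronously while the cloud call fires asynchronously and never blocks the interactive path:

```typescript
// Placeholder types -- the real frame and result shapes are more involved.
interface Frame { timestamp: number; pixels: Uint8Array; }
interface EdgeResult { changed: boolean; objects: string[]; }

interface EdgeModel  { infer(frame: Frame): Promise<EdgeResult>; }                  // low-latency, on-device or edge node
interface CloudModel { reason(frame: Frame, context: string[]): Promise<string>; }  // high-accuracy, asynchronous

class TwoTierPipeline {
  constructor(private edge: EdgeModel, private cloud: CloudModel) {}

  // Interactive path: always answered by the edge tier within the latency budget.
  async onFrame(frame: Frame, history: string[]): Promise<EdgeResult> {
    const local = await this.edge.infer(frame);

    // Only escalate interesting frames. The cloud call is fire-and-forget,
    // so it never adds latency to the real-time response.
    if (local.changed) {
      void this.cloud.reason(frame, history).catch(err =>
        console.warn("cloud tier failed, edge result stands", err));
    }
    return local;
  }
}
```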
WebRTC + Custom Codec
We stream video over WebRTC (peer-to-peer, so lower latency) using a custom codec optimized for visual AI: it prioritizes object edges and spatial features over photorealism, which cuts bandwidth by 60% without sacrificing accuracy.
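A stripped-down sketch of the client-side transport setup, using the standard browser WebRTC API. The codec itself isn't shown here, and the unreliable data channel is one illustrative way to carry its compact feature payloads rather than a description of the production wiring:

```typescript
// Client-side transport setup (simplified). Signaling exchange is omitted.
async function startStream(signalingSend: (sdp: RTCSessionDescriptionInit) => void) {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });

  // 60 fps capture to match the 16 ms frame slot in the latency budget.
  const media = await navigator.mediaDevices.getUserMedia({
    video: { frameRate: { ideal: 60 }, width: 1280, height: 720 },
  });
  const track = media.getVideoTracks()[0];
  track.contentHint = "motion"; // bias the encoder toward temporal smoothness over fine detail
  pc.addTrack(track, media);

  // Unordered, no-retransmit channel: stale feature packets are dropped, never resent.
  const features = pc.createDataChannel("visual-features", {
    ordered: false,
    maxRetransmits: 0,
  });

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signalingSend(offer); // answer and ICE candidate exchange omitted for brevity

  return { pc, features };
}
```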
Memory Management
Long conversations accumulate a lot of visual context. We keep recent frames in a ring buffer and compress older history into vector embeddings, which keeps the memory footprint under 500MB even for hour-long conversations.
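A simplified sketch of the two structures. Capacities, embedding sizes, and the similarity search here are illustrative, not production values:

```typescript
// Fixed-size ring buffer: recent frames only, oldest slot overwritten first.
class FrameRingBuffer {
  private slots: (Uint8Array | null)[];
  private head = 0;
  private count = 0;

  constructor(private capacity: number) {
    this.slots = new Array(capacity).fill(null);
  }

  push(frame: Uint8Array): void {
    this.slots[this.head] = frame;                 // overwrite the oldest slot
    this.head = (this.head + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  recent(n: number): Uint8Array[] {
    const out: Uint8Array[] = [];
    for (let i = 0; i < Math.min(n, this.count); i++) {
      const idx = (this.head - 1 - i + this.capacity) % this.capacity;
      out.push(this.slots[idx]!);
    }
    return out; // newest first
  }
}

// Older frames survive only as compact embeddings (e.g. 512 floats ≈ 2 KB each),
// so an hour of history stays far below the 500MB ceiling.
class EmbeddingStore {
  private entries: { t: number; vec: Float32Array }[] = [];

  add(t: number, vec: Float32Array) { this.entries.push({ t, vec }); }

  // Dot-product similarity, assuming normalized embedding vectors.
  nearest(query: Float32Array, k: number) {
    const dot = (a: Float32Array, b: Float32Array) =>
      a.reduce((s, v, i) => s + v * b[i], 0);
    return [...this.entries]
      .sort((x, y) => dot(query, y.vec) - dot(query, x.vec))
      .slice(0, k);
  }
}
```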