E119
Element 119
AI Infrastructure Architecture
Infrastructure Design · Two-Phase Plan

Current Demo Stack + On-Prem Migration Path

Phase 1 is the live demo: OpenRouter provides inference so the system can run without dedicated local GPUs, while self-hosted Appwrite handles functions, persistence, analytics, and secret management. Phase 2 moves inference on-prem to Ollama once the right hardware is available.

Phase 1 · Live Demo
Phase 2 · On-Prem Ollama
Self-Hosted Appwrite
Portable Agent Layer
Zero Cloud Lock-In
Live Demo — Fast Validation Without Dedicated GPUs

Currently deployed at demo.dustin.ninja
Phase 01
Deployment
Public demo served with Docker + Caddy
Advisor and router apps run as isolated services
No dedicated local GPU required for the live demo
Inference Stack
OpenRouter powers live inference
Default demo model: meta-llama/llama-3.1-8b-instruct
Fast way to validate UX, prompts, and routing before hardware spend
Functions & Data
Self-hosted Appwrite Functions execute server-side workflows
Self-hosted Appwrite stores sessions and analytics
Provider API keys stay on the server and never reach the browser
Phase 1 Goals
Validate agent routing logic
Benchmark model behavior per department
Profile latency and token usage
Prove secure server-side key handling
Migration Trigger: agent behavior validated and request volume justifies the hardware investment
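The Phase 1 key-handling claim above can be sketched as a small server-side proxy in the style of an Appwrite Function: the OpenRouter key is read from the server's environment and the browser only ever sees the model's reply. Function names and the `OPENROUTER_API_KEY` variable are illustrative assumptions; the endpoint and default model come from the doc.

```python
# Hypothetical server-side inference proxy (Appwrite-Function style).
# The provider key lives in the server environment and never reaches the browser.
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
DEMO_MODEL = "meta-llama/llama-3.1-8b-instruct"  # default demo model

def build_request(messages, model=DEMO_MODEL):
    """Assemble URL, headers, and payload; the API key stays server-side."""
    payload = {"model": model, "messages": messages}
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    return OPENROUTER_URL, headers, payload

def run_inference(messages):
    """Execute the chat completion call and return only the reply text."""
    url, headers, payload = build_request(messages)
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because only `run_inference`'s return value is sent to the client, the browser never handles credentials, which is the property Phase 1 sets out to prove.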
On-Premise Migration — Self-Hosted Ollama

When dedicated GPU hardware is available
Phase 02
Compute Hardware
Dedicated GPU host(s) sized to expected traffic
Enough VRAM for local Llama inference at target latency
Local storage for model weights and artifacts
Inference Stack
Self-hosted Ollama serving advisor and router requests
Local Llama model variants tuned for cost and latency targets
Optional quantization and batching once usage patterns are known
Functions & Data
Self-hosted Appwrite remains the server-side control plane
Functions continue brokering secrets and workflow execution
Sessions, analytics, and admin tooling stay on-prem
Phase 2 Benefits
No external inference dependency
Private model hosting on your own hardware
More predictable marginal cost at scale
Full runtime and model control
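The on-prem serving path above can be sketched against Ollama's native chat API on its default port 11434. The model tag `llama3.1:8b` is an assumed local Llama variant; setting `stream` to false returns one JSON object rather than streamed chunks.

```python
# Sketch of the same advisor call against a self-hosted Ollama instance.
# Assumes Ollama is running locally on its default port (11434).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_ollama_request(messages, model="llama3.1:8b"):
    """Build the payload for Ollama's native chat endpoint."""
    # stream=False asks for a single JSON response instead of NDJSON chunks
    return {"model": model, "messages": messages, "stream": False}

def chat(messages):
    """Send a chat request to the local Ollama server and return the reply."""
    payload = build_ollama_request(messages)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```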

Portable Agent Layer — This Demo

Works across both phases without changes

The agent orchestration layer is portable. In the live demo, inference runs through OpenRouter while self-hosted Appwrite Functions keep provider credentials server-side and out of the browser. With dedicated GPU hardware, the same agent behavior can move to self-hosted Ollama on-prem without changing the user-facing experience.
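The portability claim can be made concrete with a config switch: both OpenRouter and Ollama expose OpenAI-compatible chat endpoints, so the agent layer can target one interface and swap only the base URL and model between phases. The base URLs and the Phase 2 model tag below are assumptions for illustration.

```python
# Minimal sketch of the portable agent layer: one OpenAI-compatible
# chat endpoint, selected by phase. Only configuration changes.
PROVIDERS = {
    "phase1": {  # live demo via OpenRouter
        "base_url": "https://openrouter.ai/api/v1",
        "model": "meta-llama/llama-3.1-8b-instruct",
    },
    "phase2": {  # on-prem Ollama (OpenAI-compatible endpoint, assumed local)
        "base_url": "http://localhost:11434/v1",
        "model": "llama3.1:8b",
    },
}

def endpoint_for(phase):
    """Resolve the chat-completions URL and model for a given phase."""
    cfg = PROVIDERS[phase]
    return f"{cfg['base_url']}/chat/completions", cfg["model"]
```

Because the advisor, router, and intake agents only ever call `endpoint_for`'s result, cutting over from Phase 1 to Phase 2 is a configuration change, not a code change.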

SALES · RAG product advisor
MILITARY · DoD compliance + procurement
INSTALLER · Application support
CUSTOM · Structured intake pipeline
Migration Execution Checklist
Provision GPU hardware sized for advisor + router concurrency (Phase 2)
Install Linux, NVIDIA drivers, CUDA, and Docker runtime (Phase 2)
Deploy Ollama locally and load the target Llama models (Phase 2)
Point server-side inference calls from OpenRouter to Ollama (Phase 2)
Validate advisor and router outputs against the demo baseline (Phase 2)
Run parallel OpenRouter + Ollama latency and quality checks (Migration)
Cut production inference over to local Ollama endpoints (Migration)
Keep Appwrite self-hosted for functions, sessions, and analytics (Migration)
Restrict external inference access once local serving is stable (Migration)
Add monitoring for GPU usage, model latency, and Appwrite health (Phase 2)
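The parallel OpenRouter + Ollama check in the list above can be sketched as a small harness that sends the same prompts to both backends and records latency and agreement. The two client callables are injected, so this works with whatever inference functions each phase actually uses; the field names are illustrative.

```python
# Hedged sketch of the parallel cutover check: run identical prompts
# against both backends, compare wall-clock latency and reply agreement.
import time

def timed(call, prompt):
    """Invoke a client callable and measure its wall-clock latency."""
    start = time.perf_counter()
    reply = call(prompt)
    return reply, time.perf_counter() - start

def compare(prompts, openrouter_call, ollama_call):
    """Run each prompt through both injected clients and collect results."""
    rows = []
    for prompt in prompts:
        a, t_a = timed(openrouter_call, prompt)
        b, t_b = timed(ollama_call, prompt)
        rows.append({
            "prompt": prompt,
            "openrouter_s": t_a,
            "ollama_s": t_b,
            "same_reply": a.strip() == b.strip(),
        })
    return rows
```

Running this over the demo's benchmark prompts before cutover gives the latency and quality baseline the checklist calls for.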