E119
Element 119
AI Infrastructure Architecture
Infrastructure Design · Two-Phase Plan

Current Demo Stack + On-Prem Migration Path

Phase 1 is the live demo: OpenRouter provides inference so the system can run without dedicated local GPUs, while self-hosted Appwrite handles functions, persistence, analytics, and secret management. Phase 2 moves inference on-prem to Ollama once the right hardware is available.

Phase 1 · Live Demo
Phase 2 · On-Prem Ollama
Self-Hosted Appwrite
Portable Agent Layer
Zero Cloud Lock-In
Live Demo — Fast Validation Without Dedicated GPUs

Currently deployed at demo.dustin.ninja
Phase 01
Deployment
Public demo served with Docker + Caddy
Advisor and router apps run as isolated services
No dedicated local GPU required for the live demo
Inference Stack
OpenRouter powers live inference
Default demo model: meta-llama/llama-3.1-8b-instruct
Fast way to validate UX, prompts, and routing before hardware spend
Functions & Data
Self-hosted Appwrite Functions execute server-side workflows
Self-hosted Appwrite stores sessions and analytics
Provider API keys stay on the server and never reach the browser
Phase 1 Goals
Validate agent routing logic
Benchmark model behavior per department
Profile latency and token usage
Prove secure server-side key handling
Migration Trigger: agent behavior validated and request volume justifies the hardware investment
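The Phase 1 key-handling claim above can be sketched as a small server-side proxy in the style of an Appwrite Function: the OpenRouter key is read from the server's environment and the browser only ever sees the model's reply. Function names and the `OPENROUTER_API_KEY` variable are illustrative assumptions; the endpoint and default model come from the doc.

```python
# Hypothetical server-side inference proxy (Appwrite-Function style).
# The provider key lives in the server environment and never reaches the browser.
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
DEMO_MODEL = "meta-llama/llama-3.1-8b-instruct"  # default demo model

def build_request(messages, model=DEMO_MODEL):
    """Assemble URL, headers, and payload; the API key stays server-side."""
    payload = {"model": model, "messages": messages}
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    return OPENROUTER_URL, headers, payload

def run_inference(messages):
    """Execute the chat completion call and return only the reply text."""
    url, headers, payload = build_request(messages)
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because only `run_inference`'s return value is sent to the client, the browser never handles credentials, which is the property Phase 1 sets out to prove.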
On-Premise Migration — Self-Hosted Ollama

When dedicated GPU hardware is available
Phase 02
Compute Hardware
Dedicated GPU host(s) sized to expected traffic
Enough VRAM for local Llama inference at target latency
Local storage for model weights and artifacts
Inference Stack
Self-hosted Ollama serving advisor and router requests
Local Llama model variants tuned for cost and latency targets
Optional quantization and batching once usage patterns are known
Functions & Data
Self-hosted Appwrite remains the server-side control plane
Functions continue brokering secrets and workflow execution
Sessions, analytics, and admin tooling stay on-prem
Phase 2 Benefits
No external inference dependency
Private model hosting on your own hardware
More predictable marginal cost at scale
Full runtime and model control
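The on-prem serving path above can be sketched against Ollama's native chat API on its default port 11434. The model tag `llama3.1:8b` is an assumed local Llama variant; setting `stream` to false returns one JSON object rather than streamed chunks.

```python
# Sketch of the same advisor call against a self-hosted Ollama instance.
# Assumes Ollama is running locally on its default port (11434).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_ollama_request(messages, model="llama3.1:8b"):
    """Build the payload for Ollama's native chat endpoint."""
    # stream=False asks for a single JSON response instead of NDJSON chunks
    return {"model": model, "messages": messages, "stream": False}

def chat(messages):
    """Send a chat request to the local Ollama server and return the reply."""
    payload = build_ollama_request(messages)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```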

Portable Agent Layer — This Demo

Works across both phases without changes

The agent orchestration layer is portable. In the live demo, inference runs through OpenRouter while self-hosted Appwrite Functions keep provider credentials server-side and out of the browser. With dedicated GPU hardware, the same agent behavior can move to self-hosted Ollama on-prem without changing the user-facing experience.
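The portability claim can be made concrete with a config switch: both OpenRouter and Ollama expose OpenAI-compatible chat endpoints, so the agent layer can target one interface and swap only the base URL and model between phases. The base URLs and the Phase 2 model tag below are assumptions for illustration.

```python
# Minimal sketch of the portable agent layer: one OpenAI-compatible
# chat endpoint, selected by phase. Only configuration changes.
PROVIDERS = {
    "phase1": {  # live demo via OpenRouter
        "base_url": "https://openrouter.ai/api/v1",
        "model": "meta-llama/llama-3.1-8b-instruct",
    },
    "phase2": {  # on-prem Ollama (OpenAI-compatible endpoint, assumed local)
        "base_url": "http://localhost:11434/v1",
        "model": "llama3.1:8b",
    },
}

def endpoint_for(phase):
    """Resolve the chat-completions URL and model for a given phase."""
    cfg = PROVIDERS[phase]
    return f"{cfg['base_url']}/chat/completions", cfg["model"]
```

Because the advisor, router, and intake agents only ever call `endpoint_for`'s result, cutting over from Phase 1 to Phase 2 is a configuration change, not a code change.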

SALES · RAG product advisor
MILITARY · DoD compliance + procurement
INSTALLER · Application support
CUSTOM · Structured intake pipeline
Migration Execution Checklist
Provision GPU hardware sized for advisor + router concurrency (Phase 2)
Install Linux, NVIDIA drivers, CUDA, and Docker runtime (Phase 2)
Deploy Ollama locally and load the target Llama models (Phase 2)
Point server-side inference calls from OpenRouter to Ollama (Phase 2)
Validate advisor and router outputs against the demo baseline (Phase 2)
Run parallel OpenRouter + Ollama latency and quality checks (Migration)
Cut production inference over to local Ollama endpoints (Migration)
Keep Appwrite self-hosted for functions, sessions, and analytics (Migration)
Restrict external inference access once local serving is stable (Migration)
Add monitoring for GPU usage, model latency, and Appwrite health (Phase 2)
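The parallel OpenRouter + Ollama check in the list above can be sketched as a small harness that sends the same prompts to both backends and records latency and agreement. The two client callables are injected, so this works with whatever inference functions each phase actually uses; the field names are illustrative.

```python
# Hedged sketch of the parallel cutover check: run identical prompts
# against both backends, compare wall-clock latency and reply agreement.
import time

def timed(call, prompt):
    """Invoke a client callable and measure its wall-clock latency."""
    start = time.perf_counter()
    reply = call(prompt)
    return reply, time.perf_counter() - start

def compare(prompts, openrouter_call, ollama_call):
    """Run each prompt through both injected clients and collect results."""
    rows = []
    for prompt in prompts:
        a, t_a = timed(openrouter_call, prompt)
        b, t_b = timed(ollama_call, prompt)
        rows.append({
            "prompt": prompt,
            "openrouter_s": t_a,
            "ollama_s": t_b,
            "same_reply": a.strip() == b.strip(),
        })
    return rows
```

Running this over the demo's benchmark prompts before cutover gives the latency and quality baseline the checklist calls for.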