Complete

Autonomous Local AI Agent

A fully private, 13-layer AI assistant running 100% on local hardware — voice, memory, vision, tools, and zero monthly cost.

13Phases Built

5Local LLMs

8Memory Layers

$0/ Month

The Problem

Cloud AI subscriptions cost money, send your data to third parties, and stop working without internet. I wanted a fully autonomous AI assistant that runs entirely on my own hardware — private, free, and always available.

What I Built — Layer by Layer

✓

Phase 0-1 — Foundation

Fresh Ubuntu 24.04 LTS install. NVIDIA driver 570 for RTX 5060 Ti. Docker with nvidia-container-toolkit for GPU access. Python 3, Node.js 20, VS Code, Git.

✓

Phase 2 — AI Engine

Installed Ollama as the local model server. Downloaded 5 models (~35GB total, all running on GPU):

→ Qwen3 14B — general assistant (9GB)
→ Qwen2.5-Coder 14B — coding specialist (9GB)
→ DeepSeek-R1 14B — reasoning specialist (9GB)
→ Qwen2.5VL 7B — vision model, sees screenshots (5GB)
→ Nomic-embed-text — embeddings for memory/RAG (274MB)

✓

Phase 3 — Coding Agent

Roo Code + Continue.dev + Aider — a local Claude Code-style experience using my own models. Tab autocomplete, terminal coding, and VS Code integration all pointing to local Ollama.

✓

Phase 4 — Voice Layer

Faster-Whisper (speech→text on GPU) + Kokoro TTS (54 voices). Wake word detection ('hey assistant'), silence detection, runs as a systemd service on login. Full voice loop: microphone → transcription → Ollama → spoken response.

✓

Phase 5 — Memory System (8 Layers)

The most critical phase:

→ Layer 1: identity.md — who I am, stack, preferences
→ Layer 3: Mem0 persistent memory — stores/retrieves facts
→ Layer 4: Client database — per-client markdown files
→ Layer 5: Project context — CONTEXT.md in every folder
→ Layer 7: Behavioral memory — productivity patterns
→ Layer 8: ChromaDB (Docker, port 8000) for vector semantics

Built a memory manager that auto-classifies and retrieves the right memories per query.

✓

Phase 6 — Tools (MCP Servers)

Gave the agent hands using Model Context Protocol:

→ Filesystem MCP — read/write specific folders
→ GitHub MCP — repos, PRs, issues
→ Brave Search MCP — live web search
→ Shell MCP — terminal commands (confirm-mode)
→ Google Calendar MCP — OAuth calendar access
→ Playwright MCP — full browser automation
→ Notes MCP — connected to Obsidian/markdown

✓

Phase 7 — Computer Use (Vision + Screen Control)

Screenshot → Qwen2.5VL describes screen and coordinates → action model decides → xdotool/PyAutoGUI executes → screenshot again to verify. Full autonomous screen control loop. Also integrated OpenHands for long-horizon coding tasks.

✓

Phase 8 — Client Work Layer

Lighthouse CLI for auto performance scoring. Playwright mobile screenshot testing via vision model. Netlify + Vercel CLI for auto-deployment. Client handover report generator.

✓

Phase 9 — Health Monitoring

Background systemd services:

→ Sit-time monitor — speaks after 90 min of continuous work
→ Hydration reminders — every 2 hours
→ Late-night detector — suggests winding down after 11pm
→ Morning briefing — weather, calendar, top 3 tasks
→ Focus mode — blocks distracting sites, tracks deep work time

✓

Phase 10 — Cloud Fallback

LiteLLM proxy routes to local Ollama first, auto-falls back to free APIs (Gemini, Groq, Cerebras, OpenRouter, Mistral, Cohere, Cloudflare Workers AI, GitHub Models). One endpoint: localhost:4000/v1.

✓

Phase 11 — System Prompt & Personality

Master prompt with 7 sections: identity, user profile, communication style, behavioral rules, tool guidelines, memory rules, permission levels. Updated weekly based on real usage friction.

✓

Phase 12-13 — Daily Use & Advanced

Daily workflow: morning briefing → VS Code with auto-loaded context → code via Roo Code → voice queries → evening memory save. Advanced: multi-agent system via CrewAI, LoRA fine-tuning pipeline on my own repos.

The Hard Part

Orchestrating 13 different layers to work together seamlessly — making voice, memory, tools, vision, and LLMs all communicate through a single coherent system. Getting Qwen2.5VL to reliably identify screen coordinates for click actions. Building the memory manager that auto-classifies queries across 8 memory layers without latency.

Full Stack Summary

Model serverOllama

General LLMQwen3 14B

Coding LLMQwen2.5-Coder 14B

Reasoning LLMDeepSeek-R1 14B

VisionQwen2.5VL 7B

EmbeddingsNomic-embed-text

Vector DBChromaDB

MemoryMem0

Speech→TextFaster-Whisper

Text→SpeechKokoro TTS

Coding agentRoo Code + Continue + Aider

Tools7 MCP servers

Screenxdotool + PyAutoGUI

AutonomousOpenHands

Cloud fallbackLiteLLM

DeployNetlify + Vercel CLI

BUILT WITH: RTX 5060 Ti 16GB · AMD Ryzen 9 7900X · 32GB RAM · Ubuntu 24.04 LTS