Autonomous Local AI Agent
A fully private, 13-layer AI assistant running 100% on local hardware — voice, memory, vision, tools, and zero monthly cost.
The Problem
Cloud AI subscriptions cost money, send your data to third parties, and stop working without internet. I wanted a fully autonomous AI assistant that runs entirely on my own hardware — private, free, and always available.
What I Built — Layer by Layer
Phase 0-1 — Foundation
Fresh Ubuntu 24.04 LTS install. NVIDIA driver 570 for RTX 5060 Ti. Docker with nvidia-container-toolkit for GPU access. Python 3, Node.js 20, VS Code, Git.
Phase 2 — AI Engine
Installed Ollama as the local model server. Downloaded 5 models (~35GB total, all running on GPU):
- → Qwen3 14B — general assistant (9GB)
- → Qwen2.5-Coder 14B — coding specialist (9GB)
- → DeepSeek-R1 14B — reasoning specialist (9GB)
- → Qwen2.5VL 7B — vision model, sees screenshots (5GB)
- → Nomic-embed-text — embeddings for memory/RAG (274MB)
Phase 3 — Coding Agent
Roo Code + Continue.dev + Aider — a local Claude Code-style experience using my own models. Tab autocomplete, terminal coding, and VS Code integration all pointing to local Ollama.
Phase 4 — Voice Layer
Faster-Whisper (speech→text on GPU) + Kokoro TTS (54 voices). Wake word detection ('hey assistant'), silence detection, runs as a systemd service on login. Full voice loop: microphone → transcription → Ollama → spoken response.
Phase 5 — Memory System (8 Layers)
The most critical phase:
- → Layer 1: identity.md — who I am, stack, preferences
- → Layer 3: Mem0 persistent memory — stores/retrieves facts
- → Layer 4: Client database — per-client markdown files
- → Layer 5: Project context — CONTEXT.md in every folder
- → Layer 7: Behavioral memory — productivity patterns
- → Layer 8: ChromaDB (Docker, port 8000) for vector semantics
Built a memory manager that auto-classifies and retrieves the right memories per query.
Phase 6 — Tools (MCP Servers)
Gave the agent hands using Model Context Protocol:
- → Filesystem MCP — read/write specific folders
- → GitHub MCP — repos, PRs, issues
- → Brave Search MCP — live web search
- → Shell MCP — terminal commands (confirm-mode)
- → Google Calendar MCP — OAuth calendar access
- → Playwright MCP — full browser automation
- → Notes MCP — connected to Obsidian/markdown
Phase 7 — Computer Use (Vision + Screen Control)
Screenshot → Qwen2.5VL describes screen and coordinates → action model decides → xdotool/PyAutoGUI executes → screenshot again to verify. Full autonomous screen control loop. Also integrated OpenHands for long-horizon coding tasks.
Phase 8 — Client Work Layer
Lighthouse CLI for auto performance scoring. Playwright mobile screenshot testing via vision model. Netlify + Vercel CLI for auto-deployment. Client handover report generator.
Phase 9 — Health Monitoring
Background systemd services:
- → Sit-time monitor — speaks after 90 min of continuous work
- → Hydration reminders — every 2 hours
- → Late-night detector — suggests winding down after 11pm
- → Morning briefing — weather, calendar, top 3 tasks
- → Focus mode — blocks distracting sites, tracks deep work time
Phase 10 — Cloud Fallback
LiteLLM proxy routes to local Ollama first, auto-falls back to free APIs (Gemini, Groq, Cerebras, OpenRouter, Mistral, Cohere, Cloudflare Workers AI, GitHub Models). One endpoint: localhost:4000/v1.
Phase 11 — System Prompt & Personality
Master prompt with 7 sections: identity, user profile, communication style, behavioral rules, tool guidelines, memory rules, permission levels. Updated weekly based on real usage friction.
Phase 12-13 — Daily Use & Advanced
Daily workflow: morning briefing → VS Code with auto-loaded context → code via Roo Code → voice queries → evening memory save. Advanced: multi-agent system via CrewAI, LoRA fine-tuning pipeline on my own repos.
The Hard Part
Orchestrating 13 different layers to work together seamlessly — making voice, memory, tools, vision, and LLMs all communicate through a single coherent system. Getting Qwen2.5VL to reliably identify screen coordinates for click actions. Building the memory manager that auto-classifies queries across 8 memory layers without latency.
Full Stack Summary
BUILT WITH: RTX 5060 Ti 16GB · AMD Ryzen 9 7900X · 32GB RAM · Ubuntu 24.04 LTS