Complete

Autonomous Local AI Agent

A fully private, 13-layer AI assistant running 100% on local hardware — voice, memory, vision, tools, and zero monthly cost.

13 Phases Built
5 Local LLMs
8 Memory Layers
$0 / Month

The Problem

Cloud AI subscriptions cost money, send your data to third parties, and stop working without internet. I wanted a fully autonomous AI assistant that runs entirely on my own hardware — private, free, and always available.

What I Built — Layer by Layer

Phase 0-1 — Foundation

Fresh Ubuntu 24.04 LTS install. NVIDIA driver 570 for RTX 5060 Ti. Docker with nvidia-container-toolkit for GPU access. Python 3, Node.js 20, VS Code, Git.


Phase 2 — AI Engine

Installed Ollama as the local model server. Downloaded 5 models (~35GB total, all running on GPU):

  • → Qwen3 14B — general assistant (9GB)
  • → Qwen2.5-Coder 14B — coding specialist (9GB)
  • → DeepSeek-R1 14B — reasoning specialist (9GB)
  • → Qwen2.5VL 7B — vision model, sees screenshots (5GB)
  • → Nomic-embed-text — embeddings for memory/RAG (274MB)
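
As a rough illustration of how a request could be routed to one of these models, here is a minimal sketch against Ollama's HTTP API on its default port (11434). The model tags and the `pick_model` and `build_request` helpers are assumptions made for this example, not the project's actual code; the tags should be checked against what `ollama list` reports.

```python
import json
import urllib.request

# Hypothetical task-to-model mapping; tags are assumed, verify with `ollama list`.
MODELS = {
    "general": "qwen3:14b",
    "code": "qwen2.5-coder:14b",
    "reasoning": "deepseek-r1:14b",
    "vision": "qwen2.5vl:7b",
}

def pick_model(task: str) -> str:
    """Return the model tag for a task, defaulting to the general assistant."""
    return MODELS.get(task, MODELS["general"])

def build_request(task: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": pick_model(task), "prompt": prompt, "stream": False}

def generate(task: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Send the prompt to a running Ollama server and return the response text."""
    body = json.dumps(build_request(task, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("code", "Write a binary search in Python")` would send the prompt to the coding model, provided the server is running and the model has been pulled.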

Phase 3 — Coding Agent

Roo Code + Continue.dev + Aider — a local Claude Code-style experience using my own models. Tab autocomplete, terminal coding, and VS Code integration all pointing to local Ollama.


Phase 4 — Voice Layer

Faster-Whisper (speech→text on GPU) + Kokoro TTS (54 voices). Wake word detection ('hey assistant') and silence detection decide when to start and stop listening, and the whole layer runs as a systemd service started on login. Full voice loop: microphone → transcription → Ollama → spoken response.
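
The silence-detection step can be sketched as a simple energy threshold over audio frames. This is an assumed approach for illustration (the write-up does not say which method is used); the `threshold` and `max_silent_frames` values are placeholders that would need tuning for a real microphone.

```python
import math
from typing import Iterable, List, Sequence

def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def record_until_silence(
    frames: Iterable[Sequence[float]],
    threshold: float = 0.01,      # assumed value, tune per microphone
    max_silent_frames: int = 3,   # assumed value
) -> List[Sequence[float]]:
    """Collect frames until `max_silent_frames` consecutive quiet frames are seen."""
    captured: List[Sequence[float]] = []
    silent_run = 0
    for frame in frames:
        captured.append(frame)
        silent_run = silent_run + 1 if rms(frame) < threshold else 0
        if silent_run >= max_silent_frames:
            # Drop the trailing silence so only speech goes to transcription.
            return captured[: len(captured) - silent_run]
    return captured
```

The returned frames are what would be handed to the speech-to-text step.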


Phase 5 — Memory System (8 Layers)

The most critical phase:

  • → Layer 1: identity.md — who I am, stack, preferences
  • → Layer 3: Mem0 persistent memory — stores/retrieves facts
  • → Layer 4: Client database — per-client markdown files
  • → Layer 5: Project context — CONTEXT.md in every folder
  • → Layer 7: Behavioral memory — productivity patterns
  • → Layer 8: ChromaDB (Docker, port 8000) for vector semantics

Built a memory manager that auto-classifies and retrieves the right memories per query.
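
One way such a classifier could work is plain keyword routing, sketched below. The write-up does not describe the real classification logic, so the hint lists, the layer identifiers, and the `classify_query` function are all hypothetical; only the layer names come from the list above.

```python
from typing import Dict, List

# Hypothetical keyword hints per memory layer.
LAYER_HINTS: Dict[str, List[str]] = {
    "identity": ["my stack", "my preferences", "who am i"],
    "client_db": ["client", "invoice", "handover"],
    "project_context": ["this project", "this repo", "context.md"],
    "behavioral": ["productivity", "habit", "focus"],
}

# Always consulted last: persistent facts (Mem0) and semantic search (ChromaDB).
FALLBACK_LAYERS = ["mem0", "chromadb"]

def classify_query(query: str) -> List[str]:
    """Return the memory layers to query: keyword matches first, then fallbacks."""
    text = query.lower()
    matched = [
        layer
        for layer, hints in LAYER_HINTS.items()
        if any(hint in text for hint in hints)
    ]
    return matched + FALLBACK_LAYERS
```

Keyword matching keeps classification cheap, since it avoids an extra model call per query; an embedding-based classifier would be more flexible at the cost of added latency.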


Phase 6 — Tools (MCP Servers)

Gave the agent hands using Model Context Protocol:

  • → Filesystem MCP — read/write specific folders
  • → GitHub MCP — repos, PRs, issues
  • → Brave Search MCP — live web search
  • → Shell MCP — terminal commands (confirm-mode)
  • → Google Calendar MCP — OAuth calendar access
  • → Playwright MCP — full browser automation
  • → Notes MCP — connected to Obsidian/markdown
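
The confirm-mode idea behind the Shell tool can be illustrated with a small gate that refuses to run anything until a callback approves it. This is a standalone sketch, not the MCP server's actual implementation; `run_confirmed` and `ask_on_terminal` are hypothetical names.

```python
import shlex
import subprocess
from typing import Callable, Optional

def run_confirmed(command: str, confirm: Callable[[str], bool]) -> Optional[str]:
    """Run `command` only if `confirm` approves it.

    Returns the command's stdout, or None if the user declined.
    """
    if not confirm(command):
        return None
    result = subprocess.run(
        shlex.split(command),  # no shell=True, so the string is not shell-expanded
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout

def ask_on_terminal(command: str) -> bool:
    """Default confirmation: prompt on the terminal, anything but 'y' declines."""
    return input(f"Run `{command}`? [y/N] ").strip().lower() == "y"
```

Passing the confirmation as a callback means the same gate could be wired to a voice prompt or a desktop dialog instead of the terminal.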

Phase 7 — Computer Use (Vision + Screen Control)

Screenshot → Qwen2.5VL describes screen and coordinates → action model decides → xdotool/PyAutoGUI executes → screenshot again to verify. Full autonomous screen control loop. Also integrated OpenHands for long-horizon coding tasks.
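
The loop above can be sketched with each stage passed in as a function, so the control flow stays independent of the vision model and the input tool. All names here (`Action`, `control_loop`, and the four callables) are hypothetical; in practice the callables would wrap a screenshot utility, the vision model, the action model, and xdotool or PyAutoGUI.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    kind: str      # e.g. "click" or "type"
    x: int = 0
    y: int = 0
    text: str = ""

def control_loop(
    goal: str,
    take_screenshot: Callable[[], bytes],
    describe_screen: Callable[[bytes], str],
    decide: Callable[[str, str], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> int:
    """Repeat observe -> decide -> act until `decide` returns None or steps run out.

    Returns the number of actions executed.
    """
    steps = 0
    while steps < max_steps:
        # A fresh screenshot each iteration doubles as verification of the last action.
        description = describe_screen(take_screenshot())
        action = decide(goal, description)
        if action is None:  # the action model considers the goal reached
            break
        execute(action)
        steps += 1
    return steps
```

The `max_steps` cap is a safety choice: if the vision model keeps misreading coordinates, the loop stops instead of clicking indefinitely.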


Phase 8 — Client Work Layer

Lighthouse CLI for auto performance scoring. Playwright mobile screenshot testing via vision model. Netlify + Vercel CLI for auto-deployment. Client handover report generator.
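
For the performance-scoring step, a Lighthouse run with JSON output produces a report whose `categories` entries each carry a `score` between 0 and 1. A minimal sketch of turning that into the familiar 0-100 numbers follows; `category_scores` is a hypothetical helper, and how the project actually consumes the report is not stated.

```python
import json
from typing import Dict

def category_scores(report_json: str) -> Dict[str, int]:
    """Convert Lighthouse's 0-1 category scores to 0-100 integers."""
    report = json.loads(report_json)
    return {
        name: round(cat["score"] * 100)
        for name, cat in report.get("categories", {}).items()
        # Lighthouse reports null when a category could not be scored.
        if cat.get("score") is not None
    }
```

The resulting dictionary is small enough to drop straight into a handover report.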


Phase 9 — Health Monitoring

Background systemd services:

  • → Sit-time monitor — speaks after 90 min of continuous work
  • → Hydration reminders — every 2 hours
  • → Late-night detector — suggests winding down after 11pm
  • → Morning briefing — weather, calendar, top 3 tasks
  • → Focus mode — blocks distracting sites, tracks deep work time
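
The decision logic behind two of these monitors is simple enough to sketch as a pure function that a service could call once a minute. The function name and message wording are assumptions; the 90-minute and 11pm thresholds come from the list above.

```python
from datetime import datetime, timedelta
from typing import Optional

SIT_LIMIT = timedelta(minutes=90)  # sit-time rule
WIND_DOWN_HOUR = 23                # 11pm

def due_reminder(work_started: datetime, now: datetime) -> Optional[str]:
    """Return the reminder to speak, or None if nothing is due.

    Hours after midnight are not treated as late-night in this sketch.
    """
    if now.hour >= WIND_DOWN_HOUR:
        return "It's late. Consider winding down."
    if now - work_started >= SIT_LIMIT:
        return "You've been sitting for 90 minutes. Time to stand up."
    return None
```

Keeping the rule free of I/O makes it easy to test without waiting 90 minutes; the systemd service only has to supply the timestamps and pass the message to the TTS layer.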

Phase 10 — Cloud Fallback

LiteLLM proxy routes to local Ollama first, auto-falls back to free APIs (Gemini, Groq, Cerebras, OpenRouter, Mistral, Cohere, Cloudflare Workers AI, GitHub Models). One endpoint: localhost:4000/v1.
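
Conceptually, local-first routing with fallback amounts to trying providers in order and returning the first answer that succeeds. The sketch below illustrates that idea only; it is not LiteLLM's implementation or configuration, and `complete_with_fallback` and the `Provider` type are hypothetical.

```python
from typing import Callable, List, Tuple

# (provider name, function that answers a prompt)
Provider = Tuple[str, Callable[[str], str]]

def complete_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, answer) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Listing the local Ollama provider first means the cloud APIs are only contacted when the local call raises, which keeps data on the machine in the normal case.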


Phase 11 — System Prompt & Personality

Master prompt with 7 sections: identity, user profile, communication style, behavioral rules, tool guidelines, memory rules, permission levels. Updated weekly based on real usage friction.
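
Since the prompt is revised weekly, it helps to assemble it from separate sections and fail loudly if one goes missing. A minimal sketch, assuming each section is kept as its own piece of text; `build_master_prompt` and the `## ` heading format are illustrative choices, not the project's actual layout.

```python
from typing import Dict

# Section names taken from the description above, in the same order.
SECTION_ORDER = [
    "identity", "user profile", "communication style", "behavioral rules",
    "tool guidelines", "memory rules", "permission levels",
]

def build_master_prompt(sections: Dict[str, str]) -> str:
    """Join the seven sections in a fixed order, raising if any is missing or empty."""
    missing = [name for name in SECTION_ORDER if not sections.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing prompt sections: {', '.join(missing)}")
    return "\n\n".join(
        f"## {name.title()}\n{sections[name].strip()}" for name in SECTION_ORDER
    )
```

Editing one section then never risks silently dropping another.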


Phase 12-13 — Daily Use & Advanced

Daily workflow: morning briefing → VS Code with auto-loaded context → code via Roo Code → voice queries → evening memory save. Advanced: multi-agent system via CrewAI, LoRA fine-tuning pipeline on my own repos.

The Hard Part

Three things were hardest. First, orchestrating 13 layers so that voice, memory, tools, vision, and LLMs all communicate through a single coherent system. Second, getting Qwen2.5VL to reliably identify screen coordinates for click actions. Third, building the memory manager so that it classifies queries across 8 memory layers without adding noticeable latency.

Full Stack Summary

Model server: Ollama
General LLM: Qwen3 14B
Coding LLM: Qwen2.5-Coder 14B
Reasoning LLM: DeepSeek-R1 14B
Vision: Qwen2.5VL 7B
Embeddings: Nomic-embed-text
Vector DB: ChromaDB
Memory: Mem0
Speech→Text: Faster-Whisper
Text→Speech: Kokoro TTS
Coding agent: Roo Code + Continue + Aider
Tools: 7 MCP servers
Screen: xdotool + PyAutoGUI
Autonomous: OpenHands
Cloud fallback: LiteLLM
Deploy: Netlify + Vercel CLI

BUILT WITH: RTX 5060 Ti 16GB · AMD Ryzen 9 7900X · 32GB RAM · Ubuntu 24.04 LTS