Building an Agentic AutoML Platform on a VPS Budget: GAIA Engineering Teardown

How we built a Dockerised AutoML platform with parallel Bayesian hyperparameter search, transfer-learning speedups, and multi-format model export — without cloud GPUs

Maitreya Kulkarni, Founder, Nexolve Technologies
9 min read
AutoML Platform, Bayesian Hyperparameter Optimization, Celery Distributed Training, FastAPI ML Backend, ONNX Model Export, Self-hosted AI Infrastructure

Most AutoML platforms (Google Vertex, AWS SageMaker, DataRobot) are enterprise-priced and assume you've already bought into one cloud's stack. For Flare Studio, we needed to build something different: a self-hosted, agentic AutoML platform that runs on a single VPS, supports classification, regression, NLP, and computer vision from one interface, and gives builders full ownership of their trained models.

This is the engineering teardown of GAIA — what we built, how we made it fast on commodity infrastructure, and the architectural decisions that mattered most.

The Problem

Every product team we'd worked with had the same shape of pain: they had data, they had a defined business problem, but going from "CSV file" to "deployed REST API serving predictions" required a data scientist plus an MLOps engineer plus weeks of infrastructure work. Hosted AutoML services cut the engineering effort but locked teams into one cloud's pricing, formats, and access controls.

The GAIA mandate: a single platform where a non-ML builder uploads data, describes the task in natural language, and gets a deployed model and exported artifact within hours — without needing cloud expertise or budget.

The Architecture

GAIA is a Dockerised four-service stack:

  • Next.js builder studio — TypeScript front-end where users upload datasets, describe tasks, and inspect runs
  • FastAPI backend — orchestrator API exposing endpoints for data profiling, training, deployment, and inference
  • Celery worker fleet — distributed training workers, scaled horizontally based on load
  • Redis broker — task queue + result backend for Celery

PostgreSQL stores users, datasets, runs, and full training lineage. Google OAuth handles authentication and Razorpay handles billing. The entire stack runs on a single Hostinger VPS, with Celery workers scaled out as load requires.

Key Technical Decisions

Bayesian Hyperparameter Search Parallelised Across Workers

A single sequential hyperparameter search on a non-trivial dataset can take hours. We needed it to run in parallel. Most frameworks (Optuna, Hyperopt) support parallel evaluation, but coordinating parallel search across an autoscaling worker fleet without race conditions is non-trivial.

We use Optuna's RDB storage backend pointed at PostgreSQL, with Celery workers each pulling the next trial atomically. Because every worker writes its results to the same database, Optuna's TPE sampler draws on the combined running history when it suggests the next trial, so adding workers raises throughput while each suggestion still benefits from results the other workers have already reported. With 4 workers, we hit ~3.5x effective throughput (about 88% parallel efficiency); with 8 workers, ~6x (75%). We attribute the sub-linear scaling to Bayesian sampler concurrency overhead, not infrastructure.

Transfer Learning by Default, Not by Configuration

For NLP and CV tasks, training from scratch is wasteful — pretrained models do 95% of the work. GAIA picks a base model based on the task (Hugging Face's distilbert-base-uncased for text classification under 10K rows, t5-small for text-to-text, mobilenet-v3 for image classification under 5K images) and fine-tunes from there.
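The selection rule above can be sketched as a small lookup function. The model names and size thresholds come from the paragraph above; the task labels, the function name, and the `None` fallback are assumptions made for illustration, since the article does not say what is chosen for larger datasets.

```python
from typing import Optional


def pick_base_model(task: str, n_samples: int) -> Optional[str]:
    """Return a pretrained base model for the task, or None if no rule applies."""
    if task == "text-classification" and n_samples < 10_000:
        return "distilbert-base-uncased"
    if task == "text-to-text":
        return "t5-small"
    if task == "image-classification" and n_samples < 5_000:
        return "mobilenet-v3"
    # Not covered by the rules stated in the article.
    return None


print(pick_base_model("text-classification", 5_000))
```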

This is where the "10x training speedup" headline comes from. Training a BERT model from scratch on 5K rows takes ~6 hours; fine-tuning distilbert on the same data takes 35 minutes, about a 10x reduction (360 / 35 ≈ 10.3). Accuracy and memory footprint stay the same. The trick is making the base-model selection automatic and invisible to the user.

Multi-Format Export with Lineage

Builders use models in different runtimes. A Node.js backend wants ONNX. A Python notebook wants pickle. A mobile app wants TFLite. We built export adapters for ONNX, H5, PKL, and TFLite, and the platform automatically picks the format based on the user's stated target runtime.
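One way to picture the runtime-to-format choice is a plain mapping. The three pairings below are the ones named above; the runtime keys and the function name are hypothetical, and H5 is left unmapped because the article does not say which runtime selects it.

```python
# Formats the article lists as having export adapters.
EXPORT_FORMATS = {"onnx", "h5", "pkl", "tflite"}

# Hypothetical runtime labels mapped to the formats named in the text.
RUNTIME_TO_FORMAT = {
    "nodejs": "onnx",
    "python-notebook": "pkl",
    "mobile": "tflite",
}


def pick_export_format(target_runtime: str) -> str:
    """Return the export format for a stated target runtime."""
    try:
        return RUNTIME_TO_FORMAT[target_runtime]
    except KeyError:
        raise ValueError(f"no export rule for runtime {target_runtime!r}") from None


print(pick_export_format("nodejs"))
```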

Each export bundles the model artifact AND a lineage manifest: dataset hash, hyperparameters used, training metrics, base model (if transfer-learned), and library versions. This is what separates "model" from "production-ready model" — anyone can re-derive how this artifact was produced.
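A lineage manifest along these lines could be assembled with the standard library alone. This is a sketch under assumptions: the field names, the helper functions, and the sample values are illustrative placeholders, not GAIA's actual schema or results.

```python
import hashlib
import json
import platform
from importlib import metadata


def library_versions(packages):
    """Record the Python version plus the installed version of each package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return versions


def build_manifest(dataset: bytes, hyperparameters, metrics, base_model, packages):
    """Bundle the lineage fields listed in the text into one dict."""
    return {
        "dataset_sha256": hashlib.sha256(dataset).hexdigest(),
        "hyperparameters": hyperparameters,
        "training_metrics": metrics,
        "base_model": base_model,  # None when the model was not transfer-learned
        "library_versions": library_versions(packages),
    }


# Placeholder inputs for illustration only.
manifest = build_manifest(
    dataset=b"feature,label\n1.0,0\n2.0,1\n",
    hyperparameters={"learning_rate": 3e-5, "epochs": 3},
    metrics={"accuracy": 0.0},
    base_model="distilbert-base-uncased",
    packages=["onnx", "scikit-learn"],
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Hashing the raw dataset bytes means any change to the training data produces a different manifest, which is what makes the artifact traceable.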

Why Celery + Redis, Not Kubernetes

Most modern ML platforms reach for Kubernetes. We deliberately didn't. Kubernetes adds an entire control-plane ops burden that doesn't pay back at single-VPS scale. Celery + Redis on a single host gives us process isolation, retry semantics, scheduled tasks, result backends, and a dashboard via Flower. Redis memory consumption stays bounded because task results are short-lived.

When GAIA needs to scale beyond one VPS, the Celery layer is portable — same code, swap the broker to managed Redis (Upstash, Redis Cloud), point workers at it, done. Kubernetes was the wrong abstraction for the scale we were at.

Why This Matters for Builders

GAIA's design point — agentic, self-hosted, no cloud GPU dependency — reflects what we believe most early-stage AI products actually need. Cloud AutoML services solve a real problem for enterprises with seven-figure ML budgets and existing cloud commitments. They overcomplicate the workflow for everyone else.

For builders evaluating the build-vs-buy decision around AI infrastructure, our build vs buy framework for software covers the broader principles. For the specifics of designing AI agents, see "How to build an AI agent for your business" and "LLM Architecture Deep Dive".

What We'd Do Differently

If we were starting GAIA today:

  • Move feature store to Postgres + pgvector earlier. We added it later for similarity-based dataset recommendations; it should have been there from week one.
  • Use FastAPI background tasks for cheap operations instead of Celery for everything. We over-Celery'd lightweight ops; running them through FastAPI's async background tasks would have been simpler.
  • Standardise on ONNX as the canonical format internally. Multi-format export is good for users, but having three internal serialisations made the testing matrix painful.

Where Nexolve Fits

GAIA was a long, intense engineering build. If you're working on an AI product that needs the same kind of from-scratch infrastructure thinking, our SaaS & Web Apps service handles deep platform engagements like this. For the broader portfolio context, see the full GAIA case study and our other AI infrastructure work.
