ML Ensemble for Supply Chain on 9,573 Rows: Jersey Engineering Teardown

Most ML guides assume you have millions of rows. Most government and SMB ML problems have thousands. Jersey Island's supply chain risk platform sat at the extreme end: 9,573 rows of historical data, a hard requirement that outputs be interpretable by non-technical policy analysts, and a stress-test simulator that needed real-time scenarios without retraining.

This is the engineering teardown of Jersey Supply Chain AI — the model design, why we chose an ensemble over a single deep learning model, and the trade-offs of building ML for small-data government contexts.

The Problem

Jersey Island is a small economy with concentrated import dependence. Food, fuel, medicine — almost everything arrives via a handful of shipping routes. A disruption in any of those routes (port strike, weather event, geopolitical shock) ripples through the island within days.

The existing approach was reactive: weekly spreadsheet updates, no forecasting, no scenario simulation. The Jersey government wanted:

Real-time risk scoring across all major supply categories
30-day forecasting of import volumes
Anomaly detection to flag unusual patterns
Scenario simulation — analysts could ask "What if fuel imports drop 30% for two weeks?" and see cascading effects
Interpretability — every prediction had to be explainable to non-technical policy staff

And: government data couldn't leave Jersey infrastructure.

The Architecture

A three-model ensemble, each handling a different question:

LSTM neural network — Captures temporal patterns in import volumes (seasonality, week-of-year effects, autocorrelation). Forecasts 30 days ahead per category.
XGBoost gradient-boosted trees — Predicts risk scores with explicit feature-importance rankings per category. We picked XGBoost specifically because of its native feature-importance output, which we surface directly in the UI.
Isolation Forest — Detects anomalies in the recent data and assigns severity scores from 1 to 5.
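As a rough illustration of the anomaly component, the sketch below fits scikit-learn's IsolationForest on synthetic data and bins its scores into the 1 to 5 severity scale. The quantile cut points, the function name fit_severity_scorer, and the data are all assumptions made for this example, not the platform's actual thresholds.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def fit_severity_scorer(history: np.ndarray, random_state: int = 0):
    """Fit an Isolation Forest and return a function mapping rows to severity 1-5."""
    forest = IsolationForest(n_estimators=200, random_state=random_state).fit(history)

    # score_samples returns lower values for more anomalous rows,
    # so negate it: higher now means more anomalous.
    train_scores = -forest.score_samples(history)

    # Assumed cut points: quantiles of the historical scores.
    # Roughly 80% of history lands in severity 1, the top 1% in severity 5.
    edges = np.quantile(train_scores, [0.80, 0.90, 0.95, 0.99])

    def severity(rows: np.ndarray) -> np.ndarray:
        scores = -forest.score_samples(rows)
        # searchsorted yields 0..4 (number of edges passed); shift to 1..5.
        return np.searchsorted(edges, scores, side="right") + 1

    return severity


# Synthetic two-feature history, for illustration only.
rng = np.random.default_rng(1)
history = rng.normal(0.0, 1.0, size=(500, 2))

severity = fit_severity_scorer(history)
train_levels = severity(history)

# One typical row and one far outside the historical range.
levels = severity(np.array([[0.0, 0.0], [8.0, 8.0]]))
```

Binning against historical quantiles keeps the scale stable between runs, at the price of having to pick the cut points by hand.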

A FastAPI inference backend serves all three, with median response times under 200ms. A Next.js frontend renders the dashboards, scenario simulator, and category drill-downs.

Key Technical Decisions

Why an Ensemble, Not a Single Deep Model

The temptation with ML problems is to reach for a single sophisticated deep model. With 9,573 rows, that's how you overfit and end up with an unexplainable system that performs worse than three simple specialised models.

Each model in our ensemble has a clear job:

LSTM is good at temporal patterns and bad at explainability
XGBoost is good at structured tabular features and excellent at feature importance
Isolation Forest is good at anomaly detection and, being unsupervised, needs no labelled anomalies, so class imbalance is not a problem

By splitting the problem, we got better calibrated outputs AND interpretability AND lower model complexity. The cost: three models to maintain instead of one. Worth it.

Heavy Feature Engineering Over Model Sophistication

With small data, feature engineering pays back more than model architecture. We invested heavily in:

Lag features (1d, 7d, 14d, 30d)
Rolling statistics (mean, std, min, max over multiple windows)
Calendar features (day of week, month, holidays, fiscal-year markers)
Cross-category features (e.g., fuel imports lagged 7 days as a feature for medicine availability)
Domain-specific engineered features (port congestion proxies, weather signals from external feeds)

A well-engineered feature matrix made an ordinary XGBoost outperform a fancier deep model trying to learn features end-to-end.
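The feature families listed above can be sketched in a few lines of pandas. This is a minimal example under assumptions: the column names, the two rolling windows, and the synthetic data are hypothetical, and holiday, fiscal-year, port-congestion and weather features are left out because they depend on external data.

```python
import numpy as np
import pandas as pd


def build_features(daily: pd.DataFrame, category: str, driver: str) -> pd.DataFrame:
    """Build lag, rolling, calendar and cross-category features for one category.

    `daily` is assumed to have a daily DatetimeIndex and one column of
    import volumes per supply category (a hypothetical layout).
    """
    y = daily[category]
    feats = pd.DataFrame(index=daily.index)

    # Lag features: the value observed k days earlier.
    for k in (1, 7, 14, 30):
        feats[f"lag_{k}d"] = y.shift(k)

    # Rolling statistics over past values only. shift(1) keeps each day's
    # own value out of its features, which avoids target leakage.
    past = y.shift(1)
    for w in (7, 30):
        roll = past.rolling(w)
        feats[f"roll_mean_{w}d"] = roll.mean()
        feats[f"roll_std_{w}d"] = roll.std()
        feats[f"roll_min_{w}d"] = roll.min()
        feats[f"roll_max_{w}d"] = roll.max()

    # Calendar features.
    feats["day_of_week"] = daily.index.dayofweek
    feats["month"] = daily.index.month

    # Cross-category feature: another category's volume lagged 7 days.
    feats[f"{driver}_lag_7d"] = daily[driver].shift(7)

    # Rows without a full 30-day history contain NaNs; drop them.
    return feats.dropna()


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=120, freq="D")
daily = pd.DataFrame(
    {"medicine": rng.normal(100, 10, 120), "fuel": rng.normal(500, 40, 120)},
    index=idx,
)
features = build_features(daily, category="medicine", driver="fuel")
```

The first 30 rows are dropped because the longest lag and rolling window need 30 days of history. On a dataset of a few thousand rows that loss is noticeable, so the longest window is worth choosing deliberately.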

Scenario Simulation Without Retraining

The stress-test simulator was the most novel requirement. Analysts type something like "fuel imports drop 30% for 2 weeks starting next Monday" and see the cascading effects.

Our approach:

Take the trained models as fixed
Modify the input feature matrix to reflect the hypothetical scenario (the lag features for the next 14 days for the affected category get adjusted)
Run inference with the modified inputs
Render the forecast deltas vs the baseline

This is much faster than retraining: each scenario is a single inference pass, so simulations return in under 200ms. The trade-off: the fixed models can't capture out-of-distribution behaviour they have never seen. We surface this caveat in the UI ("Confidence: lower in scenarios outside historical patterns").
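The four steps above can be sketched as follows. Everything named here is an assumption for illustration: LinearStandIn stands in for a trained model (anything exposing predict would do), and the feature names and numbers are invented. A fuller version would also propagate the shock through each lag offset and the rolling statistics rather than scaling the lag columns on the same dates.

```python
import numpy as np
import pandas as pd


class LinearStandIn:
    """Stand-in for a fixed, already-trained model with a predict(X) method."""

    def __init__(self, weights: dict[str, float]):
        self.weights = pd.Series(weights)

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        return X[self.weights.index].to_numpy() @ self.weights.to_numpy()


def simulate(model, baseline: pd.DataFrame, shocked_cols: list[str],
             start: str, days: int, pct_change: float) -> pd.DataFrame:
    """Apply a percentage shock to selected features and compare forecasts."""
    scenario = baseline.copy()  # the baseline inputs stay untouched
    window = pd.date_range(start, periods=days, freq="D")
    rows = scenario.index.intersection(window)

    # Modify the inputs only; the model itself is never retrained.
    scenario.loc[rows, shocked_cols] *= 1 + pct_change

    out = pd.DataFrame(
        {"baseline": model.predict(baseline), "scenario": model.predict(scenario)},
        index=baseline.index,
    )
    out["delta"] = out["scenario"] - out["baseline"]
    return out


# Hypothetical three-week feature matrix starting on a Monday.
dates = pd.date_range("2025-01-06", periods=21, freq="D")
baseline = pd.DataFrame({"fuel_lag_1d": 500.0, "fuel_lag_7d": 500.0}, index=dates)
model = LinearStandIn({"fuel_lag_1d": 0.6, "fuel_lag_7d": 0.4})

# "Fuel imports drop 30% for 2 weeks starting next Monday."
result = simulate(model, baseline, ["fuel_lag_1d", "fuel_lag_7d"],
                  start="2025-01-13", days=14, pct_change=-0.30)
```

Because the scenario only rewrites inputs, the cost is two predict calls. The same property is why results outside the historical range deserve the lower-confidence label.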

Interpretability Hard-Coded into the Pipeline

Every prediction the system surfaces comes with three things:

The XGBoost feature-importance ranking for the category being predicted
A 1-5 anomaly severity score from the Isolation Forest
A natural-language explanation generated by an LLM that takes the model outputs and produces a one-sentence "why" — e.g., "Risk increased due to lower-than-expected fuel imports last week and unusually high medicine demand."

The natural-language layer is critical: policy staff don't read XGBoost feature importance plots; they read the sentence below the chart and act on it.
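One way to feed the model outputs to the LLM is to assemble them into a constrained prompt. The sketch below shows only that assembly step. The PredictionContext fields, the prompt wording, and the feature names are assumptions for illustration, and the LLM call itself is deliberately omitted.

```python
from dataclasses import dataclass


@dataclass
class PredictionContext:
    """Model outputs gathered for one prediction (hypothetical structure)."""
    category: str
    risk_score: float                       # 0.0 to 1.0
    top_features: list[tuple[str, float]]   # (name, importance), highest first
    anomaly_severity: int                   # 1 to 5


def build_explanation_prompt(ctx: PredictionContext, max_features: int = 3) -> str:
    """Turn model outputs into a prompt asking for a one-sentence 'why'."""
    if not 1 <= ctx.anomaly_severity <= 5:
        raise ValueError("anomaly_severity must be between 1 and 5")

    # Only the strongest drivers are passed on, to keep the sentence focused.
    drivers = "\n".join(
        f"- {name} (importance {weight:.2f})"
        for name, weight in ctx.top_features[:max_features]
    )
    return (
        "You are explaining a supply chain risk score to a policy analyst.\n"
        f"Category: {ctx.category}\n"
        f"Risk score: {ctx.risk_score:.2f}\n"
        f"Anomaly severity: {ctx.anomaly_severity} of 5\n"
        f"Main drivers:\n{drivers}\n"
        "Write one plain-English sentence explaining why the risk is at this "
        "level. Mention only the drivers listed above."
    )


ctx = PredictionContext(
    category="fuel",
    risk_score=0.72,
    top_features=[
        ("fuel_lag_7d", 0.41),
        ("roll_mean_7d", 0.22),
        ("day_of_week", 0.05),
        ("holiday_flag", 0.01),
    ],
    anomaly_severity=4,
)
prompt = build_explanation_prompt(ctx)
```

Restricting the prompt to the listed drivers is a design choice: it keeps the generated sentence tied to what the models actually reported.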

For deeper context on the LLM layer that powers explanations, see our LLM Architecture Deep Dive. For agent-style explanation generation, see How to build an AI agent.

Inference Latency Under 200ms

Running three models in series wouldn't have hit the 200ms target. Running them in parallel does: FastAPI's async support plus a ProcessPoolExecutor for the CPU-bound XGBoost and Isolation Forest, with the LSTM running in a dedicated worker process, gets us p50 latency around 150ms. The tail is slower, with p99 around 280ms.

For the LSTM specifically, we use ONNX Runtime instead of PyTorch in production: about 3x faster inference at no accuracy cost.
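The parallel fan-out can be sketched with asyncio and executors from the standard library. The three model functions below are trivial stand-ins invented for this example, and score_all is a hypothetical name. In a FastAPI app the same coroutine body would sit inside an async route handler.

```python
import asyncio
from concurrent.futures import Executor


# Stand-ins for the real models. With a process pool they must be
# module-level functions so that they can be pickled.
def lstm_forecast(recent: list[float]) -> list[float]:
    """Stand-in forecaster: repeats the last observed value for 30 days."""
    return [recent[-1]] * 30


def xgb_risk(recent: list[float]) -> float:
    """Stand-in risk score: relative drop of the last value below the mean."""
    mean = sum(recent) / len(recent)
    return min(1.0, max(0.0, (mean - recent[-1]) / mean))


def iforest_severity(recent: list[float]) -> int:
    """Stand-in anomaly severity: always the lowest level."""
    return 1


async def score_all(cpu_pool: Executor, lstm_worker: Executor,
                    recent: list[float]) -> dict:
    """Run the three models concurrently and collect their outputs."""
    loop = asyncio.get_running_loop()
    # run_in_executor hands blocking work to the pools, so the event loop
    # stays free to serve other requests while the models run.
    forecast, risk, severity = await asyncio.gather(
        loop.run_in_executor(lstm_worker, lstm_forecast, recent),
        loop.run_in_executor(cpu_pool, xgb_risk, recent),
        loop.run_in_executor(cpu_pool, iforest_severity, recent),
    )
    return {"forecast": forecast, "risk": risk, "severity": severity}
```

To mirror the setup described above, cpu_pool would be a ProcessPoolExecutor and lstm_worker a ProcessPoolExecutor with max_workers=1, both created once at application startup rather than per request. With this layout, total latency tracks the slowest model instead of the sum of all three.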

Why This Matters for ML Product Builders

The Jersey Supply Chain platform is a template for a class of ML products that's underserved: small-data domain problems where interpretability matters more than raw accuracy. Government, SMB operations, healthcare ops, and many B2B SaaS problems sit here. The approach:

Choose ensemble over single deep model
Engineer features heavily; let simple models do the work
Make interpretability a first-class output, not an afterthought
Use LLM-generated natural-language explanations as the human-facing layer

This pattern is underused. Many ML projects pursue raw accuracy at the expense of deployability.

For the broader business context on choosing AI use cases that pay back, see AI Automation for Indian SMBs — many of those use cases follow this same small-data ensemble pattern.

What We'd Do Differently

Use Polars instead of Pandas for the feature engineering pipeline. We estimate the feature engineering step would have run about 5x faster.
Add SHAP explanations alongside XGBoost feature importance. SHAP attributes each individual prediction to its features, whereas built-in feature importance is a model-wide summary; we should have included both.
Containerise the inference layer with explicit GPU/CPU profiles from the start. We added this later for portability.

Where Nexolve Fits

We build domain ML and AI products via our AI-Powered Automation service. Jersey is one of several government and SMB ML platforms we've shipped. For the full project context, see the Jersey Supply Chain case study.
