Our fully managed Slurm-on-Kubernetes operator streamlines AI training on NVIDIA GPU clusters — cutting infrastructure complexity so you can focus on model development.
Launch your training environment in minutes, not days. Our solution provisions nodes, pre-installs dependencies, and completes the full infrastructure setup — so you can schedule jobs immediately, with no configuration required.
Train models stress-free: automatic health checks and recovery keep jobs running through hardware and node failures with minimal interruption. Integrated monitoring dashboards and logging give you full cluster visibility and control.
Maximize your AI hardware ROI: smart scheduling and topology-aware job placement boost efficiency for large-scale training. Optimized dependencies ensure fast execution of your model training frameworks.
Our Managed Operator is powered by Soperator — our custom-built Slurm Kubernetes Operator. This lets us deliver Slurm’s advanced job scheduling capabilities and Kubernetes’ cloud-native flexibility in a single, unified AI training environment.
A shared root filesystem provides a unified file environment across all cluster nodes — streamlining package management and boosting cluster scalability.
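Because the environment ships with Slurm pre-configured, a standard batch script works as-is. The sketch below is a generic illustration rather than an excerpt from the product; the job name, node and GPU counts, and the `train.py` entry point are all hypothetical.

```shell
# Write a minimal Slurm batch script (all values here are illustrative).
cat > train.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=llm-train      # hypothetical job name
#SBATCH --nodes=2                 # two GPU nodes
#SBATCH --gpus-per-node=8         # e.g. one full 8-GPU server per node
#SBATCH --output=train-%j.log

# Dependencies live on the shared root filesystem, so every node
# sees the same environment with no per-node setup.
srun python train.py
EOF
```

On a running cluster this script would be submitted with `sbatch train.sbatch`; Slurm then handles placement across the GPU nodes.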
| | Managed Operator | Professional Operator | Soperator |
|---|---|---|---|
| Solution | Slurm-based clusters | Slurm-based clusters | Kubernetes operator for Slurm |
| Delivery model | Self-service app | Professional service | Open-source software |
| Cloud environment | Kobayashi | Kobayashi | Cloud agnostic |
| Pre-installed AI/ML-drivers and libraries | Yes | Yes | Yes |
| All container types supported | Yes | Yes | Yes |
| Passive health checks | Yes | Yes | No |
| Active health checks | Yes | Yes | No |
| Topology-aware job scheduling | Yes | Yes | No |
| Auto-healing mechanism | Yes | Yes | On Kobayashi cloud only |
| Free software, consumption-based pricing | Yes | Yes | Yes |