Managed Soperator

Our fully managed Slurm-on-Kubernetes operator streamlines AI training on NVIDIA GPU clusters — cutting infrastructure complexity so you can focus on model development.

One-click cluster setup

Launch your training environment in minutes, not days. Our solution handles node provisioning, dependency pre-installation, and full infrastructure setup — so you can schedule jobs immediately, with no configuration required.
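Once the cluster is up, jobs are submitted with standard Slurm tooling. The script below is a hypothetical sketch — the job name, node counts, GPU counts, and training command are illustrative, not part of the product itself:

```shell
#!/bin/bash
# Hypothetical multi-node training job. Resource values below are
# illustrative and depend on your cluster layout.
#SBATCH --job-name=llm-pretrain
#SBATCH --nodes=4                # four GPU nodes
#SBATCH --gpus-per-node=8        # eight GPUs on each node
#SBATCH --ntasks-per-node=8      # one task per GPU
#SBATCH --time=72:00:00          # wall-clock limit

# Launch the training script on every allocated task.
srun python train.py --config configs/pretrain.yaml
```

Submitting this with `sbatch train.sbatch` is all that is needed — Slurm queues the job and places it on available GPU nodes.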

Fault-tolerant training

Train models stress-free: automatic health checks and recovery keep jobs running through hardware and node failures with minimal downtime. Integrated monitoring dashboards and logging give you full cluster visibility and control.

Maximum GPU utilization

Maximize your AI hardware ROI: smart scheduling & topology-aware job placement boost efficiency for large-scale training. Optimized dependencies ensure fast execution of your model training frameworks.
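Topology awareness can also be expressed directly in a job script using Slurm's standard `--switches` directive; the fragment below is a hypothetical sketch (the count and wait time are illustrative):

```shell
# Ask Slurm to place all allocated nodes under at most one network
# switch, waiting up to 10 minutes for such an allocation to free up.
# This keeps inter-node traffic on the fastest possible paths.
#SBATCH --switches=1@10:00
```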

Running Slurm on Kubernetes

Our Managed Operator is powered by Soperator — our custom-built Slurm Kubernetes Operator. This lets us deliver Slurm’s advanced job scheduling capabilities and Kubernetes’ cloud-native flexibility in a single, unified AI training environment.

How it works

A shared root filesystem provides a unified file environment across all cluster nodes — streamlining package management and boosting cluster scalability.
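As a concrete illustration of the shared root, installing a library once makes it available on every node. The session below is a hypothetical sketch — the package name is illustrative:

```shell
# On the login node: install a Python package into the shared root.
pip install transformers

# From any worker node (e.g. inside a job step), the same install is
# already visible, because every node mounts the same root filesystem.
srun --nodes=1 python -c "import transformers; print(transformers.__version__)"
```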

How it works diagram

Slurm-on-Kubernetes solutions by Kobayashi

|                                           | Managed Operator     | Professional Operator | Soperator                     |
|-------------------------------------------|----------------------|-----------------------|-------------------------------|
| Solution                                  | Slurm-based clusters | Slurm-based clusters  | Kubernetes operator for Slurm |
| Delivery model                            | Self-service app     | Professional service  | Open-source software          |
| Cloud environment                         | Kobayashi            | Kobayashi             | Cloud agnostic                |
| Pre-installed AI/ML drivers and libraries | Yes                  | Yes                   | Yes                           |
| All types of containers supported         | Yes                  | Yes                   | Yes                           |
| Passive health checks                     | Yes                  | Yes                   | No                            |
| Active health checks                      | Yes                  | Yes                   | No                            |
| Topology-aware job scheduling             | Yes                  | Yes                   | No                            |
| Auto-healing mechanism                    | Yes                  | Yes                   | On Kobayashi cloud only       |
| Free software, consumption-based pricing  | Yes                  | Yes                   | Yes                           |