Two complementary approaches to AI-driven liquid cooling for GPU data centers — Phaidra’s RL agent masters CDU setpoint optimization while Federator.ai extends the control boundary to GPU workloads, scheduling, and platform-wide thermal management. Together, they cover the full stack.
Executive Summary
Better together than apart
Both solutions address the same root cause: PID controllers are reactive, not predictive, leading to thermal overshoots during power transients and wasted energy from chronic sub-cooling. Rather than competing, they operate at different layers of the control stack — Phaidra masters CDU setpoint optimization via RL, while Federator.ai extends control upward into GPU workloads, scheduling, and platform-wide orchestration. Combined, they deliver what neither achieves alone: full-stack thermal intelligence from the pump to the job scheduler.
Phaidra
Single-variable RL agent for CDU setpoint. Supervisory layer on existing PID. Uses rack power as leading indicator (~10-60s). Self-learning via digital twin pre-training. Co-authored with NVIDIA; validated on DGX SuperPOD and CoreWeave NVL72.
Federator.ai SLC
Three-layer control hierarchy bridging the fundamental timing gap between GPU heating (milliseconds) and liquid cooling response (180+ seconds). By treating IT and OT as one integrated domain, SLC uses workload-aware predictive control and admission gating to prevent thermal throttling, save 25-30% cooling energy, and dynamically adjust flow rates to meet target exit temperatures — all without additional OT integration effort.
Phaidra excels at CDU setpoint optimization — learning nonlinear dynamics no physics model captures. Federator.ai extends control upward into workload admission, GPU execution, and platform orchestration. The combined architecture covers every layer from the coolant pump to the job scheduler.
Control Approach
RL + MPC: different layers, one integrated stack
| Dimension | Phaidra | Federator.ai SLC |
|---|---|---|
| Paradigm | Reinforcement Learning (model-free, feed-forward) | Model Predictive Control (physics-based) + PID + Scheduler |
| Manipulated variable | CDU secondary supply temp setpoint | Pump flow + GPU power limits + launch rate + job admission |
| Leading indicator | Rack power (electrical → thermal delay) | Scheduler queue + power prediction (3 confidence-weighted sources) |
| Horizon | Implicit in RL policy (~10-60s via transport delay) | Explicit: 6×5s = 30s MPC + 5-min workload pre-cooling |
| Explainability | Black-box — validated by results | White-box: J = Σ[wT·(T−T*)² + wE·Ppump + wΔU·Δu²] |
| Solver | Neural network (PPO/SAC) | scipy SLSQP; PID fallback on solver failure |
| Adaptability | Self-learning: digital twin → live post-training (hours) | Online parameter estimation: thermal mass, time constant, HTC |
| Timing gap | Responds to observed thermal lag | Bridges GPU heat (ms) vs coolant (180s+) — predictive + admission |
| IT / OT boundary | OT only (CDU setpoint) | IT = OT unified — workload awareness makes cooling effective |
| Flow control | Indirect via setpoint | Target exit temp → dynamic flow rate auto-adjustment |
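The MPC column above can be made concrete. The sketch below sets up the receding-horizon problem from the table — a 6×5 s horizon, the cost J = Σ[wT·(T−T*)² + wE·Ppump + wΔU·Δu²], scipy's SLSQP as the solver, and a fallback to the previous command on solver failure. The thermal model, weights, and bounds are illustrative assumptions, not Federator.ai's actual parameters.

```python
import numpy as np
from scipy.optimize import minimize

HORIZON, DT = 6, 5.0             # 6 x 5 s = 30 s horizon, as in the table
TAU = 60.0                       # coolant-loop time constant, s (assumed)
W_T, W_E, W_DU = 1.0, 1e-3, 0.1  # illustrative weights wT, wE, wdU
T_TARGET = 45.0                  # target exit temperature, degC (assumed)

def predict_exit_temp(flow, t0):
    """Toy first-order model: higher flow pulls the exit temp down."""
    temps, t = [], t0
    for u in flow:
        t_eq = 25.0 + 300.0 / u            # hypothetical equilibrium temp
        t = t + (DT / TAU) * (t_eq - t)
        temps.append(t)
    return np.array(temps)

def mpc_cost(u, t0, u_prev):
    T = predict_exit_temp(u, t0)
    track = W_T * np.sum((T - T_TARGET) ** 2)   # wT*(T - T*)^2
    energy = W_E * np.sum(u ** 3)               # pump power scales ~ flow^3
    du = np.diff(np.concatenate(([u_prev], u)))
    smooth = W_DU * np.sum(du ** 2)             # wdU*du^2 move-suppression term
    return track + energy + smooth

u_prev = 10.0                                   # last applied flow command
res = minimize(mpc_cost, np.full(HORIZON, u_prev), args=(48.0, u_prev),
               method="SLSQP", bounds=[(2.0, 30.0)] * HORIZON)
# Apply only the first move; on solver failure, hold the last command
# (the real system would hand off to the PID fallback instead).
flow_cmd = res.x[0] if res.success else u_prev
```

Each control cycle re-solves this problem and applies only the first move, which is the standard receding-horizon pattern.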
Architecture Depth
Each solution owns different layers — together they span all four
Phaidra excels at Layer 2 — a supervisory RL agent that learns optimal CDU setpoints without the manual tuning a physics model would require. It works with the existing CDU PID (Layers 0-1). Federator.ai contributes Layers 0-1 and Layer 3: direct pump flow control, PID with anti-windup and bumpless transfer, and, critically, workload-aware admission with pre-cooling at Layer 3. Combined, Phaidra's RL handles CDU optimization while Federator.ai controls the heat source itself through workload scheduling — a capability no single solution provides alone.
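The PID fallback properties named above can be sketched in a few lines. This is a minimal PI illustration under assumed gains, not Federator.ai's implementation: conditional integration provides the anti-windup, and bumpless transfer back-computes the integrator at handover so the first PI output matches the last MPC command.

```python
class FallbackPI:
    """Minimal PI controller with conditional-integration anti-windup and
    bumpless transfer. Gains and limits are illustrative assumptions."""

    def __init__(self, kp, ki, u_min, u_max):
        self.kp, self.ki = kp, ki
        self.u_min, self.u_max = u_min, u_max
        self.integral = 0.0

    def bumpless_init(self, u_current, error):
        # Back-compute the integrator so the first PI output equals whatever
        # the failing controller (e.g. the MPC) was last commanding.
        self.integral = (u_current - self.kp * error) / self.ki

    def step(self, error, dt):
        u = self.kp * error + self.ki * self.integral
        if self.u_min < u < self.u_max:
            self.integral += error * dt    # integrate only while unsaturated
        return min(max(u, self.u_min), self.u_max)

# MPC solver failed while commanding 12.0 (flow units); hand over bumplessly.
pid = FallbackPI(kp=0.8, ki=0.05, u_min=2.0, u_max=30.0)
pid.bumpless_init(u_current=12.0, error=3.0)
u = pid.step(error=3.0, dt=5.0)            # first output matches 12.0
```

Freezing the integrator at the actuator limits is what prevents windup during saturation; the bumpless handover is what keeps the pump from jumping when the MPC drops out.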
Safety Architecture
Layered defense — CDU safety + platform-wide interlocks
| Layer | Phaidra | Federator.ai |
|---|---|---|
| Guardrails | Hard-coded TCS envelope | 83°C max, 90°C shutdown, ramp limits |
| Failover | Agent fail → local PID | MPC fail → PID + anti-windup + bumpless |
| Interlocks | Existing CDU retained | 4 interlocks: GPU ≥90°C, supply ≥55°C, return ≥70°C, flow <50 |
| Actuation | Temp setpoint only | Pump + GPU power + launch + admission |
| Blast radius | CDU thermal only (safe) | Wider — requires Proof of Trust |
| Regulatory | Easy to certify as advisory | Full ICS, 4-phase trust progression |
Workload Integration
Phaidra reacts in seconds; Federator.ai plans minutes ahead — both needed
Power as proxy
Rack power as leading indicator. ~10-60s window bounded by physical transport delay. Does not integrate with Slurm/K8s. Cannot see queued jobs before they start.
Schedule-aware pre-cooling
Scheduler integration (conf 0.9), trend extrapolation (0.6), current baseline (0.3). 5-minute pre-cooling window. Can also shape the thermal load via admission control.
Phaidra reacts to power transients in 10–60 seconds with unmatched CDU precision. Federator.ai looks 5+ minutes ahead via scheduler integration and can shape the thermal load itself. Combined: fast CDU response for spikes AND proactive workload shaping for sustained transitions.
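The three confidence weights above suggest a simple fusion rule. The sketch below assumes a confidence-weighted average; the weights (0.9 / 0.6 / 0.3) come from the text, but the combination method itself is an assumption for illustration.

```python
def fuse_power_forecast(estimates):
    """Confidence-weighted average of (power_kw, confidence) pairs.
    The weighting rule is an assumed fusion method, not a vendor spec."""
    total_conf = sum(conf for _, conf in estimates)
    return sum(p * conf for p, conf in estimates) / total_conf

forecast_kw = fuse_power_forecast([
    (120.0, 0.9),   # scheduler queue: job manifest predicts 120 kW incoming
    (95.0, 0.6),    # trend extrapolation of recent rack power
    (80.0, 0.3),    # current measured baseline
])
# The fused forecast leans toward the high-confidence scheduler estimate,
# which is what enables pre-cooling minutes before the load arrives.
```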
Thermal Admission & GPU Execution Control
Federator.ai’s contribution above the CDU layer — what Phaidra was never designed to do
| Level | Mechanism | Measured Impact |
|---|---|---|
| NONE | Baseline operation | 36.76W avg, 96.67% util, 64.16°C |
| POWER_CAP | nvidia-smi -pl {watts} | Immediate, sub-second, no app changes |
| LAUNCH_THROTTLE | LD_PRELOAD=libnvscope.so token bucket | Moderate −31.6%, Heavy −63.4%, Extreme −85.0% |
| DEFER | K8s/Slurm queue hold | Zero GPU impact; job starts with full thermal budget |
| REJECT | Admission denied | Prevents thermal emergency entirely |
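The LAUNCH_THROTTLE row names a token bucket behind an LD_PRELOAD shim (libnvscope.so) that intercepts kernel launches. The shim itself is native code; this Python sketch only illustrates the bucket logic, with an assumed refill rate and capacity.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter illustrating the LAUNCH_THROTTLE idea.
    Rate and capacity here are illustrative, not libnvscope's settings."""

    def __init__(self, rate, capacity):
        self.rate = rate                  # launches refilled per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_launch(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True                   # launch proceeds
        return False                      # launch deferred until refill

bucket = TokenBucket(rate=100.0, capacity=10)       # assumed ~100 launches/s
allowed = sum(bucket.try_launch() for _ in range(50))  # burst of 50 attempts
```

Tightening the rate produces the graded back-pressure in the table: moderate, heavy, and extreme throttle levels are just progressively lower refill rates.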
Performance Claims
Different metrics, additive benefits
Phaidra — March 2026 Whitepaper
Federator.ai SLC — Core Value Propositions
The fundamental insight: cooling can only be effective and efficient when you understand the workload. SLC sets the target exit temperature and dynamically adjusts flow rate to meet design specifications — no over-cooling, no under-cooling, no performance capping from thermal events.
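The dynamic flow adjustment described here follows from the steady-state heat balance P = ṁ·cp·ΔT: given rack power and a target exit temperature, the required coolant flow is determined. A minimal sketch, assuming pure water and illustrative temperatures:

```python
CP_WATER = 4186.0   # J/(kg*K) for water; glycol mixtures differ (assumption)

def required_flow_lpm(power_kw, t_exit_target_c, t_supply_c):
    """Flow needed so coolant exits at the target temperature:
    P = m_dot * cp * (T_exit - T_supply), solved for m_dot."""
    delta_t = t_exit_target_c - t_supply_c
    m_dot = power_kw * 1e3 / (CP_WATER * delta_t)   # mass flow, kg/s
    return m_dot * 60.0                             # ~1 kg water = 1 L -> L/min

# Illustrative: a 120 kW rack, 30 degC supply, 45 degC target exit.
flow_lpm = required_flow_lpm(120.0, t_exit_target_c=45.0, t_supply_c=30.0)
```

Running more flow than this balance requires is over-cooling (wasted pump energy); running less lets the exit temperature drift above target.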
Integration Scope
CDU agent + full-stack platform = complete coverage
Phaidra is a best-in-class CDU optimization agent, deep where it matters most. Federator.ai SLC is one module within a 12-domain AI data center operating system, providing the platform fabric that connects cooling to workload scheduling, GPU execution control, failure prediction, auto-remediation (Martin-SRE), observability, and billing. Phaidra plugs into Federator.ai’s L2 slot, contributing superior CDU setpoint intelligence while Federator.ai handles everything above and around it.
Deployment & Learning Model
Phaidra self-learns the CDU; Federator.ai manages the trust boundary above it
Phaidra: RL self-learning
- Pre-train on digital twin (per CDU model)
- Shadow mode (observe only)
- Live post-training (converges in hours)
- Active — adjusts setpoint in real-time
Advantage: adapts automatically, no manual parameter tuning.
Federator.ai: Physics model + Proof of Trust
- Configure physics model parameters
- SHADOW — read-only telemetry (30 days)
- ADVISORY — dual-key approval (60 days)
- BOUNDED AUTONOMY — auto within blast radius
- FULL AUTONOMY — closed-loop control
Advantage: explainable at every step, formal audit trail for ICS certification.
Revenue & TCO Impact
Capacity unlock + operational savings = stacked ROI
Phaidra unlocks stranded cooling capacity: at 1GW scale, raising TCS by 10°C frees 67.4 MW for an additional $3.8B/year in IT revenue. Federator.ai delivers 25-30% cooling energy savings, eliminates GPU thermal throttling (protecting compute revenue), and extends GPU lifespan by keeping junction temperatures within design targets. These value streams are entirely additive — deploying both captures revenue that neither achieves alone.
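As a sanity check on these figures (a simple division of the stated numbers, not a vendor metric): $3.8B/yr ÷ 67.4 MW ≈ $56 of IT revenue per watt-year of unlocked capacity.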
Strategic Assessment
What each brings to the partnership
What Phaidra contributes
- CDU mastery — Self-learning RL adapts to any CDU, hours to converge
- Transient suppression — 75-80% overshoot reduction (3-4°C → 0.5-1°C residual)
- NVIDIA ecosystem — Co-authored, DGX SuperPOD validated
- Capacity unlock — $2.2-6.5B/year revenue at GW scale
- Zero-config deployment — Digital twin pre-training, no parameter tuning
What Federator.ai contributes
- 25-30% cooling energy savings — Dynamic flow rate to target exit temperature
- Zero overshoot — Admission control prevents the thermal spike entirely rather than merely reducing it
- IT = OT unified — Already managing IT workloads, no extra OT integration needed
- Workload-aware cooling — Only when you understand workloads can cooling be effective
- GPU execution control — 5 levels: none, power cap, launch throttle, defer, reject
- Platform fabric — Cortex ADDC connects 12+ domains
Combined Capability Map
| Capability | Phaidra contributes | Federator.ai contributes |
|---|---|---|
| CDU optimization | Primary: RL learns CDU dynamics | MPC supplements; PID safety fallback |
| Transient response | Primary: 75-80% overshoot reduction | Primary: zero overshoot via admission control |
| Prediction horizon | ~10-60s (transport delay) | Primary: 5+ min scheduler integration |
| GPU execution control | — | Primary: 5 levels (none → reject) |
| Workload integration | — | Primary: Slurm/K8s native |
| Deployment speed | Primary: self-learning, hours | Trust progression for actuation layers |
| Explainability | Black-box RL, validated by results | Primary: auditable MPC cost function |
| Safety architecture | CDU guardrails + failover | Primary: 3-layer interlocks + PoT |
| Platform integration | CDU-focused agent | Primary: full Cortex ADDC (12 domains) |
| NVIDIA ecosystem | Primary: co-authored, DGX validated | NVIDIA-native stack (DCGM, NIM) |
| Value impact | Primary: $B capacity unlock | Additive: 25-30% cooling savings + zero throttling |
Phaidra makes your CDU the smartest it can be. Federator.ai bridges the timing gap between GPU heating and coolant response, saves 25-30% cooling energy, and prevents performance capping — because only when you understand workloads can cooling be truly effective. Together, they make the entire AI factory thermally intelligent.
Phaidra’s roadmap (NVIDIA DSX Max-Q) envisions unifying IT, OT, and cooling into a single optimization layer. Federator.ai Cortex already treats IT and OT as one domain — no extra integration work needed because SLC already manages the workloads that generate the heat. The partnership is natural: Phaidra brings CDU intelligence, Federator.ai brings the workload awareness, admission control, and dynamic flow adjustment that make the entire system effective and efficient.