Complementary Architecture

Two complementary approaches to AI-driven liquid cooling for GPU data centers — Phaidra’s RL agent masters CDU setpoint optimization while Federator.ai extends the control boundary to GPU workloads, scheduling, and platform-wide thermal management. Together, they cover the full stack.

01

Executive Summary

Better together than apart

Both solutions address the same root cause: PID controllers are reactive, not predictive, leading to thermal overshoots during power transients and wasted energy from chronic sub-cooling. Rather than competing, they operate at different layers of the control stack — Phaidra masters CDU setpoint optimization via RL, while Federator.ai extends control upward into GPU workloads, scheduling, and platform-wide orchestration. Combined, they deliver what neither achieves alone: full-stack thermal intelligence from the pump to the job scheduler.

Phaidra

Single-variable RL agent for CDU setpoint. Supervisory layer on existing PID. Uses rack power as leading indicator (~10-60s). Self-learning via digital twin pre-training. Co-authored with NVIDIA; validated on DGX SuperPOD and CoreWeave NVL72.

Federator.ai SLC

Three-layer control hierarchy bridging the fundamental timing gap between GPU heating (milliseconds) and liquid cooling response (180+ seconds). By treating IT and OT as one integrated domain, SLC uses workload-aware predictive control and admission gating to prevent thermal throttling, save 25-30% cooling energy, and dynamically adjust flow rates to meet target exit temperatures — all without additional OT integration effort.

Phaidra excels at CDU setpoint optimization — learning nonlinear dynamics no physics model captures. Federator.ai extends control upward into workload admission, GPU execution, and platform orchestration. The combined architecture covers every layer from the coolant pump to the job scheduler.

Complementary value proposition
02

Control Approach

RL + MPC: different layers, one integrated stack

| Dimension | Phaidra | Federator.ai SLC |
|---|---|---|
| Paradigm | Reinforcement Learning (model-free, feed-forward) | Model Predictive Control (physics-based) + PID + scheduler |
| Manipulated variable | CDU secondary supply temp setpoint | Pump flow + GPU power limits + launch rate + job admission |
| Leading indicator | Rack power (electrical → thermal delay) | Scheduler queue + power prediction (3 confidence-weighted sources) |
| Horizon | Implicit in RL policy (~10-60s via transport delay) | Explicit: 6×5s = 30s MPC + 5-min workload pre-cooling |
| Explainability | Black box, validated by results | White box: J = Σ[wT·(T−T*)² + wE·Ppump + wΔU·Δu²] |
| Solver | Neural network (PPO/SAC) | scipy SLSQP; PID fallback on solver failure |
| Adaptability | Self-learning: digital twin → live post-training (hours) | Online parameter estimation: thermal mass, time constant, HTC |
| Timing gap | Responds to observed thermal lag | Bridges GPU heat (ms) vs. coolant (180s+): predictive control + admission |
| IT/OT boundary | OT only (CDU setpoint) | IT and OT unified; workload awareness makes cooling effective |
| Flow control | Indirect, via setpoint | Target exit temp → dynamic flow-rate auto-adjustment |
Phaidra captures nonlinear CDU dynamics via learned policy. Federator.ai SLC adds auditable multi-variable control above it. Together: RL precision + MPC breadth.
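The white-box MPC objective in the table above can be sketched with scipy's SLSQP solver over the stated 6×5s horizon. The first-order thermal model, weights, and all numeric values below are illustrative assumptions, not Federator.ai's actual implementation.

```python
# Minimal MPC sketch: minimize J = sum[ wT*(T - T*)^2 + wE*Ppump + wdU*du^2 ]
# over a 6 x 5 s horizon with scipy SLSQP. Thermal model and constants
# are illustrative assumptions, not ProphetStor's actual controller.
import numpy as np
from scipy.optimize import minimize

DT, N = 5.0, 6                     # 5 s steps, 30 s horizon (6 x 5 s)
T_TARGET = 45.0                    # assumed target exit temperature (deg C)
W_T, W_E, W_DU = 1.0, 0.05, 0.1    # tracking / energy / smoothness weights

def simulate(u, t0, p_gpu_w):
    """Roll an assumed first-order thermal model over the horizon (Euler)."""
    t, temps = t0, []
    for uk in u:                   # uk = pump flow fraction in [0, 1]
        t = t + DT * (0.001 * p_gpu_w - 0.1 * uk * (t - 30.0))
        temps.append(t)
    return np.array(temps)

def cost(u, t0, p_gpu_w, u_prev):
    temps = simulate(u, t0, p_gpu_w)
    du = np.diff(np.concatenate([[u_prev], u]))
    return (W_T * np.sum((temps - T_TARGET) ** 2)
            + W_E * np.sum(u)              # pump power proxy
            + W_DU * np.sum(du ** 2))      # actuator smoothness

res = minimize(cost, x0=np.full(N, 0.5), args=(50.0, 1000.0, 0.5),
               method="SLSQP", bounds=[(0.0, 1.0)] * N)
u_plan = res.x   # apply only u_plan[0], then re-solve (receding horizon)
```

A real controller would re-solve every 5 s with fresh telemetry; per the table, SLC falls back to plain PID when the solver fails.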
03

Architecture Depth

Each solution owns different layers — together they span all four

Phaidra excels at Layer 2: a supervisory RL agent that learns optimal CDU setpoints faster than any physics model could be manually tuned, working with the existing CDU PID (Layers 0-1). Federator.ai contributes Layers 0-1 and Layer 3: direct pump flow control, PID with anti-windup and bumpless transfer, and, critically, workload-aware admission with pre-cooling. Combined, Phaidra's RL handles CDU optimization while Federator.ai controls the heat source itself through workload scheduling, a capability no single solution provides alone.

04

Safety Architecture

Layered defense — CDU safety + platform-wide interlocks

| Layer | Phaidra | Federator.ai |
|---|---|---|
| Guardrails | Hard-coded TCS envelope | 83°C max, 90°C shutdown, ramp limits |
| Failover | Agent fail → local PID | MPC fail → PID + anti-windup + bumpless transfer |
| Interlocks | Existing CDU interlocks retained | 4: GPU ≥ 90, supply ≥ 55, return ≥ 70, flow < 50 |
| Actuation | Temp setpoint only | Pump + GPU power + launch + admission |
| Blast radius | CDU thermal only (safe) | Wider; requires Proof of Trust |
| Regulatory | Easy to certify as advisory | Full ICS, 4-phase trust progression |
Complementary: Phaidra's CDU-focused safety is simple to deploy and certify. Federator.ai adds platform-wide interlocks (GPU temp, flow rate, admission control). Together: defense in depth from the CDU to the workload layer.
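The four Federator.ai interlocks above map naturally onto a trip check. A minimal sketch follows; the thresholds come from the table, while the function shape, trip names, and flow units are assumptions.

```python
# Illustrative trip check for the four platform-wide interlocks listed
# above (GPU >= 90, supply >= 55, return >= 70, flow < 50). Thresholds
# are from the table; names and flow units are assumptions.
def interlock_tripped(gpu_c, supply_c, return_c, flow):
    """Return the first tripped interlock's name, or None if all clear."""
    if gpu_c >= 90.0:
        return "GPU_OVERTEMP"
    if supply_c >= 55.0:
        return "SUPPLY_OVERTEMP"
    if return_c >= 70.0:
        return "RETURN_OVERTEMP"
    if flow < 50.0:
        return "LOW_FLOW"
    return None

print(interlock_tripped(75.0, 40.0, 60.0, 80.0))  # None (all clear)
print(interlock_tripped(92.0, 40.0, 60.0, 80.0))  # GPU_OVERTEMP
```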
05

Workload Integration

Phaidra reacts in seconds; Federator.ai plans minutes ahead — both needed

Power as proxy

Rack power as leading indicator. ~10-60s window bounded by physical transport delay. Does not integrate with Slurm/K8s. Cannot see queued jobs before they start.

Schedule-aware pre-cooling

Three confidence-weighted prediction sources: scheduler integration (0.9), trend extrapolation (0.6), current baseline (0.3). 5-minute pre-cooling window. Can also shape the thermal load via admission control.
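As a sketch, the three confidence values above can serve as fusion weights in a normalized weighted average. The exact fusion rule inside SLC is not documented here, and the wattage figures are invented for illustration.

```python
# Hedged sketch: confidence-weighted fusion of the three power
# prediction sources. The fusion rule (normalized weighted mean) and
# the wattage values are illustrative assumptions.
def fuse_power_forecast(sources):
    """sources: iterable of (predicted_watts, confidence) pairs."""
    total_conf = sum(conf for _, conf in sources)
    return sum(watts * conf for watts, conf in sources) / total_conf

forecast_w = fuse_power_forecast([
    (1200.0, 0.9),  # scheduler queue: jobs launching within 5 min
    (1050.0, 0.6),  # trend extrapolation of recent rack power
    (900.0,  0.3),  # current power baseline
])
print(round(forecast_w, 1))  # 1100.0 -> drives the 5-min pre-cooling target
```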

Phaidra reacts to power transients in 10–60 seconds with unmatched CDU precision. Federator.ai looks 5+ minutes ahead via scheduler integration and can shape the thermal load itself. Combined: fast CDU response for spikes AND proactive workload shaping for sustained transitions.

Combined advantage
06

Thermal Admission & GPU Execution Control

Federator.ai’s contribution above the CDU layer — what Phaidra was never designed to do

Unique to Federator.ai — no other solution controls GPU execution for thermal management
| Level | Mechanism | Measured Impact |
|---|---|---|
| NONE | Baseline operation | 36.76W avg, 96.67% util, 64.16°C |
| POWER_CAP | nvidia-smi -pl {watts} | Immediate, sub-second, no app changes |
| LAUNCH_THROTTLE | LD_PRELOAD=libnvscope.so token bucket | Moderate −31.6%, Heavy −63.4%, Extreme −85.0% |
| DEFER | K8s/Slurm queue hold | Zero GPU impact; job starts with full thermal budget |
| REJECT | Admission denied | Prevents thermal emergency entirely |
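One way to read this ladder is as an escalation policy driven by projected thermal headroom. A minimal sketch follows; the headroom thresholds are illustrative assumptions, not Federator.ai's actual policy.

```python
# Hedged sketch: choosing an execution-control level from projected
# peak GPU temperature. The 83 C limit echoes the guardrail cited
# earlier; the headroom thresholds are illustrative assumptions.
def admission_action(gpu_temp_c, predicted_rise_c, limit_c=83.0):
    """Pick an escalation level from the thermal headroom left at peak."""
    headroom = limit_c - (gpu_temp_c + predicted_rise_c)
    if headroom > 10.0:
        return "NONE"             # ample margin: run untouched
    if headroom > 5.0:
        return "POWER_CAP"        # e.g. nvidia-smi -pl <watts>
    if headroom > 2.0:
        return "LAUNCH_THROTTLE"  # token-bucket kernel launch pacing
    if headroom > 0.0:
        return "DEFER"            # hold in the K8s/Slurm queue
    return "REJECT"               # admission denied: spike never happens

print(admission_action(64.0, 5.0))   # headroom 14 -> NONE
print(admission_action(70.0, 12.0))  # headroom 1  -> DEFER
```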
07

Performance Claims

Different metrics, additive benefits

Phaidra — March 2026 Whitepaper

  • ~75% overshoot reduction at 60kW (3-4°C → 0.5-1°C)
  • ~80% overshoot reduction at 100kW (5-6°C → ~1°C)
  • Live training converges in hours, after digital twin pre-training
  • Validated on DGX GB200 SuperPOD and CoreWeave NVL72

Federator.ai SLC — Core Value Propositions

  • 25-30% cooling energy savings (dynamic flow to target exit temperature)
  • Zero thermal overshoot: admission control eliminates the spike entirely
  • Bridges the timing gap: GPU heat (ms) vs. coolant response (180s+)
  • IT = OT unified domain: no extra OT integration work
  • <100ms L1 safety response (PID emergency override)

The fundamental insight: cooling can only be effective and efficient when you understand the workload. SLC sets the target exit temperature and dynamically adjusts flow rate to meet design specifications — no over-cooling, no under-cooling, no performance capping from thermal events.
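The dynamic flow adjustment described above follows from the steady-state heat balance Q = ṁ·cp·ΔT: given a heat load and a target exit temperature, the required coolant mass flow falls out directly. The sketch below assumes water properties and invented example numbers.

```python
# Sketch of the flow-rate calculation implied above: for a target exit
# temperature, steady-state heat balance Q = m_dot * cp * dT gives the
# required coolant flow. Water properties and example values assumed.
CP_WATER = 4186.0  # J/(kg*K), specific heat of liquid water

def required_flow_kg_s(heat_load_w, supply_c, target_exit_c):
    """Mass flow so the coolant exits exactly at the design temperature."""
    delta_t = target_exit_c - supply_c
    if delta_t <= 0:
        raise ValueError("exit target must exceed supply temperature")
    return heat_load_w / (CP_WATER * delta_t)

# Assumed example: 100 kW rack, 30 C supply, 45 C target exit
print(round(required_flow_kg_s(100_000.0, 30.0, 45.0), 2))  # 1.59 kg/s
```

Raising flow above this value over-cools (wasted pump energy); dropping below it lets the exit temperature drift past the design spec, which is exactly the trade-off SLC's controller manages.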

Additive: Phaidra reduces overshoot by 75-80% (3-4°C down to 0.5-1°C). Federator.ai eliminates overshoot entirely via admission control: the spike never happens. Deploying both adds 25-30% cooling energy savings on top.
08

Integration Scope

CDU agent + full-stack platform = complete coverage

Phaidra is a best-in-class CDU optimization agent, deep where it matters most. Federator.ai SLC is one module within a 12-domain AI data center operating system, providing the platform fabric that connects cooling to workload scheduling, GPU execution control, failure prediction, auto-remediation (Martin-SRE), observability, and billing. Phaidra plugs into Federator.ai’s L2 slot, contributing superior CDU setpoint intelligence while Federator.ai handles everything above and around it.

09

Deployment & Learning Model

Phaidra self-learns the CDU; Federator.ai manages the trust boundary above it

Phaidra: RL self-learning

  1. Pre-train on digital twin (per CDU model)
  2. Shadow mode (observe only)
  3. Live post-training (converges in hours)
  4. Active — adjusts setpoint in real-time

Advantage: adapts automatically, no manual parameter tuning.

Federator.ai: Physics model + Proof of Trust

  1. Configure physics model parameters
  2. SHADOW — read-only telemetry (30 days)
  3. ADVISORY — dual-key approval (60 days)
  4. BOUNDED AUTONOMY — auto within blast radius
  5. FULL AUTONOMY — closed-loop control

Advantage: explainable at every step, formal audit trail for ICS certification.

10

Revenue & TCO Impact

Capacity unlock + operational savings = stacked ROI

Phaidra unlocks stranded cooling capacity: at 1GW scale, raising TCS by 10°C frees 67.4 MW for an additional $3.8B/year in IT revenue. Federator.ai delivers 25-30% cooling energy savings, eliminates GPU thermal throttling (protecting compute revenue), and extends GPU lifespan by keeping junction temperatures within design targets. These value streams are entirely additive — deploying both captures revenue that neither achieves alone.

Combined financial impact
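The capacity-unlock figures above imply a revenue density of roughly $56 per IT watt per year; a back-of-envelope check is below. The density is derived from the stated numbers themselves, and real $/W pricing varies by contract and hardware generation.

```python
# Back-of-envelope check of the capacity-unlock claim: $3.8B/yr from
# 67.4 MW of freed IT capacity implies ~$56/W/yr of revenue density.
# Actual pricing varies; this only verifies internal consistency.
freed_watts = 67.4e6                       # cooling capacity unlocked at 1 GW
density = 3.8e9 / freed_watts              # implied $/W/yr of IT revenue
print(round(density, 1))                   # ~56.4

annual = freed_watts * density
print(f"${annual / 1e9:.1f}B/yr")          # matches the stated $3.8B/yr
```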
11

Strategic Assessment

What each brings to the partnership

What Phaidra contributes

  • CDU mastery — Self-learning RL adapts to any CDU, hours to converge
  • Transient suppression — 75-80% overshoot reduction (3-4°C → 0.5-1°C residual)
  • NVIDIA ecosystem — Co-authored, DGX SuperPOD validated
  • Capacity unlock — $2.2-6.5B/year revenue at GW scale
  • Zero-config deployment — Digital twin pre-training, no parameter tuning

What Federator.ai contributes

  • 25-30% cooling energy savings — Dynamic flow rate to target exit temperature
  • Zero overshoot — Admission control eliminates the thermal spike entirely, not just reduces it
  • IT = OT unified — Already managing IT workloads, no extra OT integration needed
  • Workload-aware cooling — Only when you understand workloads can cooling be effective
  • GPU execution control — 5 levels: NONE (baseline), power cap, launch throttle, defer, reject
  • Platform fabric — Cortex ADDC connects 12+ domains

Combined Capability Map

| Capability | Phaidra contributes | Federator.ai contributes |
|---|---|---|
| CDU optimization | Primary: RL learns CDU dynamics | MPC supplements; PID safety fallback |
| Transient response | Primary: 75-80% overshoot reduction | Primary: zero overshoot via admission control |
| Prediction horizon | ~10-60s (transport delay) | Primary: 5+ min scheduler integration |
| GPU execution control | | Primary: 5 levels (power cap → reject) |
| Workload integration | | Primary: Slurm/K8s native |
| Deployment speed | Primary: self-learning, hours | Trust progression for actuation layers |
| Explainability | RL learns; results validate | Primary: auditable MPC cost function |
| Safety architecture | CDU guardrails + failover | Primary: 3-layer interlocks + PoT |
| Platform integration | CDU-focused agent | Primary: full Cortex ADDC (12 domains) |
| NVIDIA ecosystem | Primary: co-authored, DGX validated | NVIDIA-native stack (DCGM, NIM) |
| Value impact | Primary: $B capacity unlock | Additive: 25-30% cooling savings + zero throttling |

Phaidra makes your CDU the smartest it can be. Federator.ai bridges the timing gap between GPU heating and coolant response, saves 25-30% cooling energy, and prevents performance capping — because only when you understand workloads can cooling be truly effective. Together, they make the entire AI factory thermally intelligent.

Combined positioning

Phaidra’s roadmap (NVIDIA DSX Max-Q) envisions unifying IT, OT, and cooling into a single optimization layer. Federator.ai Cortex already treats IT and OT as one domain: no extra integration work is needed because SLC already manages the workloads that generate the heat. The partnership is natural: Phaidra brings CDU intelligence; Federator.ai brings the workload awareness, admission control, and dynamic flow adjustment that make the entire system effective and efficient.

Federator.ai SLC + Phaidra  ·  Complementary Architecture  ·  ProphetStor Data Services  ·  March 2026

Sources: Phaidra “AI Agents for Liquid-Cooled AI Factories” (Mar 2026) · Federator.ai Cortex codebase