Designing Colocation for AI R&D: How Developers Should Specify Power, Cooling and Network SLAs

Marcus Elwood
2026-05-03
23 min read

A practical colocation SLA checklist for AI racks: power, liquid cooling, low-latency networking, and contract language that engineers can use.

AI R&D teams are no longer buying “server space.” They are buying the ability to run dense GPU racks, keep models training without thermal throttling, and move data fast enough to avoid wasting expensive accelerator time. That means the contract matters as much as the hardware list: if the AI compute plan is vague, the data center will fill in the blanks with its own assumptions. The result is often a mismatch between what developers need and what facilities teams think they asked for.

This guide turns colocation jargon into practical procurement language. You’ll get a checklist for kW per rack, liquid cooling, ready-now power, carrier-neutral connectivity, and low-latency network SLAs. It is designed for engineering teams negotiating with colo providers, cloud adjacencies, or hybrid operators who want to support high-density AI racks without discovering the hard limits after deployment. If you are building an internal procurement standard, this is the template to start from.

1) Why AI R&D Changes the Colocation Conversation

GPU density rewrites the physical assumptions

Traditional enterprise colocation was built for moderate rack densities where airflow management, not raw heat flux, was the main constraint. AI R&D flips that model by concentrating very large thermal loads into a few cabinets, often with GPU racks that draw tens of kilowatts each and, in the densest designs, push past 100 kW. A spec that worked for a general-purpose application stack can become unsafe or underpowered the moment you install a modern accelerator cluster. That is why developers need to ask for capacity in explicit terms, not just “room to grow.”

In practical terms, specifying an environment for AI means documenting rack density, redundancy expectations, and deployment ramp. Teams often start by comparing a proposed buildout with a capacity plan mindset: what is live on day one, what is reserved, and what must be deliverable within the contract term. This is the same discipline used in software reliability, where assumptions are written down and measured instead of implied. For infrastructure buyers, the difference is that the failure mode is not a dropped packet; it is a multi-million-dollar cluster that cannot be turned on at full speed.

AI contracts should map to developer outcomes

Engineers care about training time, experiment throughput, and cluster availability. Facilities teams care about amps, chilled water, and cross-connects. A good SLA bridges those worlds by tying infrastructure commitments to outcomes developers can verify: power delivered at rack, temperatures sustained under load, network latency to the nearest cloud region, and time-to-provision for expansion. If those outcomes are not in the contract, you are negotiating on faith.

That is why procurement should borrow from the discipline of service bundling and risk transfer: define exactly what is included, what is metered, and what triggers an escalation path. AI teams do not need generic “best effort” language. They need enforceable thresholds, like maximum power derating during summer peaks, guaranteed on-site support windows, and carrier diversity requirements. Those details are what convert a data center from a marketing brochure into a dependable production platform.

Immediate capacity beats future promises

Many providers sell a roadmap instead of an operating environment. For AI R&D, future megawatts are not useful if the model cluster needs to go live this quarter. That makes “ready now” power a critical contract term, not a sales slogan. Teams should insist on evidence: current substations, completed electrical rooms, live metering data, and the exact number of racks that can be energized today.

Pro Tip: If the provider cannot show live capacity in the form of energized kW per rack, ask for a stamped one-line diagram, recent load test evidence, and an expansion schedule with milestone dates. If they cannot document it, do not assume it exists.

2) Translate Data-Center Jargon Into Developer Requirements

kW per rack: ask for usable, sustained density

“10 kW per rack” sounds precise, but it is often meaningless without context. Is that continuous load or peak? Does it assume blanking panels, specific airflow patterns, or a particular aisle containment design? For AI racks, the more relevant question is whether the facility can sustain the rack’s real operating profile under normal ambient conditions and during maintenance events. Developers should insist on continuous numbers, not marketing numbers.

Write requirements in plain language: “The provider must support 30 kW sustained per rack, with the option to scale to 60 kW within six months without moving cabinets.” Then ask for the supporting design details. This is similar to how teams validate app infrastructure in the reliability stack: the headline metric matters less than the system behaviors that maintain it. If the site can only hit density under special operating conditions, the contract should say so explicitly.

Liquid cooling: define the method, not just the label

“Liquid cooling supported” is too vague to sign. You need to know whether the site offers direct-to-chip cooling, rear-door heat exchangers, or a hybrid design that still depends on a major air-handling component. Each option has different installation requirements, maintenance workflows, and failure modes. For AI deployments, a vague cooling promise can become a deployment blocker when the vendor’s interpretation does not match the hardware vendor’s requirements.

Use terminology carefully. Liquid cooling is the umbrella term; direct-to-chip is a specific implementation that removes heat closer to the source; and RDHx usually means a rear-door heat exchanger that assists with high heat loads but may not be enough alone for the densest clusters. Your contract should name the exact architecture, who supplies the cold plates or manifolds, who owns leak response, and what service window applies if a loop has to be isolated for repair.

Carrier-neutral and low-latency: define both reach and performance

Carrier-neutral sounds good because it implies choice, but choice is only valuable if there are multiple usable carriers and the on-net ecosystem is actually diverse. Ask for the list of live carriers, recent activation lead times, and whether diverse entrances and meet-me rooms are physically independent. If you need to support cloud adjacency, you should also specify low-latency access to the relevant interconnect points and cloud regions. The more explicit you are, the less likely you are to be surprised by a facility that is “neutral” in name only.

Latency requirements should be tied to actual application behavior. For inference, developer collaboration, and data exchange, a small delay may be acceptable; for distributed training, storage replication, or synchronous workflows, the tolerance is much lower. Pair this with operational language from performance engineering: define p95 and p99 network latency targets, not just a vague “fast network” promise. If you can’t measure the obligation, you can’t enforce it.
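As a concrete illustration, here is a minimal sketch of how an ops team might verify those percentile targets independently, assuming round-trip samples are collected with whatever probe tooling you already run. The sample values and thresholds are placeholders, not recommended figures.

```python
# Minimal sketch: compute p95/p99 round-trip latency from probe samples and
# compare them against contracted targets. Sample values and thresholds are
# placeholders; collect real samples toward your cloud on-ramp or peer.
import math
import statistics


def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile of latency samples (in milliseconds)."""
    ordered = sorted(samples_ms)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]


def check_latency_sla(samples_ms: list[float],
                      p95_target_ms: float,
                      p99_target_ms: float) -> dict:
    """Return measured percentiles and pass/fail against the given targets."""
    p95 = percentile(samples_ms, 95)
    p99 = percentile(samples_ms, 99)
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": p95,
        "p99_ms": p99,
        "p95_ok": p95 <= p95_target_ms,
        "p99_ok": p99 <= p99_target_ms,
    }


if __name__ == "__main__":
    samples = [1.1, 1.2, 1.3, 1.2, 1.8, 2.4, 1.1, 1.0, 3.9, 1.2]  # fake data
    print(check_latency_sla(samples, p95_target_ms=2.0, p99_target_ms=5.0))
```

Running a check like this during acceptance testing, and again on a schedule, turns the latency clause into something your team can dispute with data rather than anecdotes.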

3) The Power SLA Checklist for High-Density AI Racks

Specify the electrical envelope clearly

The first line item should be continuous power per rack, followed by peak power, feed type, and redundancy architecture. If the design depends on A/B feeds, state the required availability of each feed and whether the rack can continue running on a single feed during maintenance. Include circuit breaker ratings, maximum inrush tolerance, and whether remote power cycling is available at the cabinet level. These details prevent vendor ambiguity when an accelerator refresh changes electrical behavior.

A practical clause might read: “Provider shall deliver 40 kW sustained per rack, 48 kW peak for 15 minutes, with dual 208V feeds and N+1 facility redundancy, measured at rack PDU output.” That language is much more useful than “high-density capable.” It tells your platform engineers what design headroom exists and tells procurement what to validate during acceptance testing. When comparing options, consider using a formal checklist similar to a vendor diligence playbook so nothing is left to interpretation.
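To see why the precise wording matters, it helps to sanity-check what a clause like that implies electrically. The sketch below assumes three-phase 208V feeds, a near-unity power factor, and the common 80% continuous-load rule of thumb; these figures are illustrative arithmetic, not design guidance for your facility.

```python
# Back-of-the-envelope sanity check: what current does "40 kW sustained per
# rack on dual 208V three-phase feeds" imply per feed? All constants below
# are illustrative assumptions, not design guidance.
import math

RACK_KW = 40.0          # sustained load from the sample clause
FEED_VOLTAGE = 208.0    # line-to-line, three-phase (assumed)
POWER_FACTOR = 0.98     # modern server PSUs run close to unity (assumed)
FEEDS_SHARING_LOAD = 2  # A/B feeds, load split evenly in normal operation

amps_total = RACK_KW * 1000 / (math.sqrt(3) * FEED_VOLTAGE * POWER_FACTOR)
amps_per_feed_normal = amps_total / FEEDS_SHARING_LOAD
amps_single_feed_failover = amps_total  # one feed carries everything

# Common (assumed) rule of thumb: keep continuous load under 80% of the
# breaker rating.
breaker_amps_needed = amps_single_feed_failover / 0.8

print(f"Total draw:            {amps_total:.0f} A")
print(f"Per feed (A/B shared): {amps_per_feed_normal:.0f} A")
print(f"Single-feed failover:  {amps_single_feed_failover:.0f} A")
print(f"Breaker rating needed: {breaker_amps_needed:.0f} A (80% rule)")
```

Even a rough calculation like this exposes questions worth asking in the contract review: can a single feed really carry the full load during maintenance, and is the breaker sizing consistent with the sustained figure the provider quoted?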

Ask how power is measured, billed, and curtailed

For AI workloads, billing and electrical governance are inseparable. You need to know whether you are charged on connected load, metered consumption, or a committed-capacity model. You also need to know what happens if the provider hits a local utility constraint or performs load shedding. If the contract says nothing, the provider may reserve broad rights to limit usage during stress events.

Negotiate for transparent metering, exportable historical data, and advance notice on any planned curtailment. Ask whether the facility supports automated alerts, and whether your ops team can integrate meter feeds into internal observability. This is where lessons from automating financial reporting are surprisingly relevant: manual reconciliation scales badly, and infrastructure billing is no exception. If power usage can’t be audited with the same rigor as cloud spend, your AI unit economics will drift.
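If the provider does offer exportable metering, even a simple audit script can keep billing honest. The sketch below assumes a hypothetical hourly CSV export with timestamp, rack_id, and kwh columns; adapt the format and the committed figure to whatever your contract actually specifies.

```python
# Minimal sketch: audit a provider's exported meter data against committed
# capacity. The CSV format (timestamp, rack_id, kwh per hourly interval) and
# the committed-kW figure are assumptions; adapt to your provider's export.
import csv
from collections import defaultdict

COMMITTED_KW = 40.0  # contracted sustained load per rack (assumed)


def flag_hours_over_commit(path: str) -> dict[str, list[str]]:
    """Return rack_id -> hourly timestamps where average load exceeded the
    committed kW (1 kWh over one hour equals 1 kW average)."""
    over = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            avg_kw = float(row["kwh"])  # hourly interval assumed
            if avg_kw > COMMITTED_KW:
                over[row["rack_id"]].append(row["timestamp"])
    return dict(over)


if __name__ == "__main__":
    print(flag_hours_over_commit("meter_export.csv"))
```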

Build in expansion rights before you need them

AI teams almost always underestimate their future density needs. A project that starts as one pilot rack can become a multi-rack cluster with liquid cooling and storage adjacency within a quarter. Your contract should include pre-negotiated expansion rights, reserved adjacent space, and clear triggers for re-pricing or move costs. The best time to negotiate future capacity is before the first rack is commissioned.

Think of it as a portfolio strategy rather than a one-off purchase, much like the discipline behind value investing decisions: the quality of the entry matters, but the long-term downside protection matters more. In a colo contract, downside protection means knowing you can add density, add power, and keep your topology stable without a forklift migration. That flexibility can save months of engineering time later.

4) Cooling SLAs: From Airflow to Direct-to-Chip

Choose the right cooling model for the hardware

Not every AI deployment needs the same thermal architecture. Some clusters can still run on optimized air cooling with containment and high-capacity CRAC systems. Others require direct-to-chip liquid cooling because the heat flux is too high for air to remove efficiently. Rear-door heat exchangers can bridge the gap, but they are not a universal answer. The right choice depends on GPU generation, rack layout, and your tolerance for operational complexity.

When comparing providers, ask for inlet and outlet temperature ranges, humidity limits, fluid specification, maintenance cadence, and leak detection methods. If the site supports multiple methods, make them list which rack classes are validated for each. You should also ask whether the cooling system is designed for sustained 24/7 AI loads or only bursty workloads. A startup-style pilot cluster may tolerate more operational improvisation than a production research platform, but the SLA should not assume that flexibility forever.

Define response times for thermal incidents

Cooling failures are not just a facilities issue; they are an uptime issue. If a pump fails, a loop trips, or a rear-door system overheats, the provider should have a documented response time and escalation path. Developers should request a thermal incident SLA that includes alerting, on-site response, containment actions, and communication updates. The more specific the response workflow, the less time your team spends chasing status during a critical training job.

Ask how the provider monitors telemetry. Are temperature and flow sensors integrated with the NOC? Are alarms visible to tenants? Can you export trend data to your own observability platform? These are the sorts of implementation details that separate a truly managed AI site from a marketing-deck facility. If your model run is worth tens of thousands of dollars per hour, then minutes matter.
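Where trend data can be exported, a small check against the contracted thermal envelope is easy to automate. The sketch below assumes minute-granularity inlet temperature readings and uses an illustrative 32 °C ceiling and 10-minute sustain window; substitute the values from your own SLA.

```python
# Minimal sketch: flag sustained inlet-temperature excursions in exported
# trend data. The reading format, the 32 °C ceiling, and the 10-minute
# sustain window are illustrative assumptions, not vendor thresholds.
from dataclasses import dataclass


@dataclass
class Reading:
    minute: int          # minutes since the start of the trend export
    inlet_temp_c: float  # rack inlet temperature


MAX_INLET_C = 32.0    # contracted ceiling (assumed)
SUSTAIN_MINUTES = 10  # how long an excursion must last to count (assumed)


def find_excursions(readings: list[Reading]) -> list[tuple[int, int]]:
    """Return (start_minute, end_minute) spans where inlet temperature stayed
    above the ceiling for at least SUSTAIN_MINUTES."""
    spans, start = [], None
    for r in readings:
        if r.inlet_temp_c > MAX_INLET_C:
            if start is None:
                start = r.minute
        else:
            if start is not None and r.minute - start >= SUSTAIN_MINUTES:
                spans.append((start, r.minute))
            start = None
    if start is not None and readings[-1].minute - start >= SUSTAIN_MINUTES:
        spans.append((start, readings[-1].minute))
    return spans


if __name__ == "__main__":
    trend = [Reading(m, 31.0 if m < 20 or m > 45 else 33.5) for m in range(60)]
    print(find_excursions(trend))  # expect one span covering roughly 20-46
```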

Plan for maintenance without losing the cluster

Maintenance windows are one of the most overlooked risks in AI colocation. If the provider cannot maintain cooling loops, pumps, or filters without taking the environment partially offline, then your operational resilience is lower than it appears. Ask for the process around maintenance notifications, isolation procedures, and temporary load balancing. If the vendor claims you will stay online through maintenance, request a practical demonstration or references from live tenants.

A useful internal comparison is the way teams evaluate service continuity in reliability engineering: not every event can be prevented, but it can be bounded. You want to know how a single-cabinet issue is contained so it does not cascade into a whole-row or whole-suite event. This is especially important where liquid cooling is deployed, because the failure response may involve both electrical and mechanical teams.

5) Network SLAs That Actually Support AI Workloads

Latency, bandwidth, and path diversity all matter

AI infrastructure is network-hungry in ways that general enterprise environments are not. Training data often moves in bulk, checkpoints need fast storage replication, and distributed jobs can be sensitive to jitter. That means your network SLA should include bandwidth per cross-connect, oversubscription assumptions, latency targets, and diversity of physical paths. If you only negotiate bandwidth, you may still end up with a topology that introduces bottlenecks during synchronized training.

For many teams, the best approach is to describe the expected pattern of traffic: inbound datasets, east-west cluster traffic, storage sync, and cloud egress. Then tie the SLA to those patterns with measurable targets. If a provider is genuinely good at AI compute planning, they should be able to explain how their network supports not just internet access, but interconnects, cloud on-ramps, and low-latency regional reach.

Carrier neutrality should be operational, not theoretical

Carrier-neutral facilities can still create practical lock-in if installation lead times are long or the meet-me room is congested. Ask how many carriers are live, which are currently selling ports, and whether the facility has documented cross-connect SLAs. Also ask whether your preferred carrier can physically enter the building without custom construction or long lead times. Neutrality is only useful when it lowers friction.

In a healthy carrier-neutral environment, you should be able to compare vendors on delivery speed, not just price. That’s similar to how enterprise audits work: the real asset is not merely having links, but being able to connect the right things at the right time. For AI, the “right thing” may be a cloud fabric, a peering partner, or a dedicated route to a remote training site.

Make observability part of the SLA

Network SLAs without telemetry are weak. Require visibility into packet loss, jitter, interface utilization, and cross-connect status where possible. If the vendor can’t provide dashboards, ask for regular reports and an incident export format. Your ops team should be able to correlate network events with model degradation, checkpoint delays, or storage issues.
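As one way to make that telemetry requirement concrete, the sketch below derives packet loss and a simple jitter figure from periodic probe results. The record format is an assumption, and the jitter calculation is a basic proxy (mean difference between consecutive round-trip times), not the RFC 3550 interarrival formula.

```python
# Minimal sketch: derive packet loss and a simple jitter figure from periodic
# probe results so SLA claims can be checked independently. The record format
# is an assumption; jitter here is the mean absolute difference between
# consecutive round-trip times, a basic proxy rather than RFC 3550 jitter.
from typing import Optional


def loss_and_jitter(results: list[Optional[float]]) -> tuple[float, float]:
    """results holds one entry per probe: round-trip ms, or None on timeout.
    Returns (loss_fraction, jitter_ms)."""
    received = [r for r in results if r is not None]
    loss = (1.0 - len(received) / len(results)) if results else 0.0
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter = sum(diffs) / len(diffs) if diffs else 0.0
    return loss, jitter


if __name__ == "__main__":
    probes = [1.2, 1.3, None, 1.1, 4.8, 1.2, 1.2, None, 1.4, 1.3]
    loss, jitter = loss_and_jitter(probes)
    print(f"loss={loss:.1%} jitter={jitter:.2f} ms")
```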

This is also where internal collaboration matters. Procurement, platform engineering, and security should agree on what “good” looks like before signing. If your team already uses data-driven operating practices, borrow from the discipline in data-driven roadmaps: define the metrics first, then choose the partner that can prove them. AI infra isn’t bought through trust alone; it is validated through measurable performance.

6) Security, Compliance, and Access Control in AI Colocation

Physical security is part of the technical stack

AI R&D environments often host valuable model weights, proprietary datasets, and sensitive research outputs. That means you need more than badge readers and cameras. Ask about mantrap design, visitor logging, escort requirements, remote hands permissions, and chain-of-custody procedures for hardware swaps. If the provider handles liquid-cooling service or high-value GPU repairs, ask who is allowed to touch which components and under what supervision.

Organizations in regulated environments often benefit from the mindset behind a trust-first deployment checklist. The principle is simple: treat compliance and access workflows as design constraints, not audit afterthoughts. If a technician can enter a cage but you cannot prove what they touched, your incident response and audit posture both suffer.

Data handling, retention, and evidence matter

Colocation providers may log alarms, access events, camera footage, and maintenance records. Your contract should specify retention periods, access rights, and how evidence is produced during an investigation. This matters when you need to reconcile a hardware incident with an experiment failure or a security review. It also matters if your team must prove that controls were followed for internal governance or external audits.

Security language should also address spares and returns. If a failed GPU node is removed from the site, what happens to local storage? Who controls sanitization? Can you witness destruction or receive signed evidence? These questions are unglamorous, but they often determine whether the deployment is operationally safe. The same rigor applied to enterprise identity and device access should be applied to physical infrastructure handling.

Plan for incident communication across teams

When AI infrastructure fails, the cost is not limited to downtime. Research cycles stall, engineering teams lose time, and executive confidence can erode. Your vendor should commit to communication timelines, named contacts, and escalation thresholds for power, cooling, and network events. You want the provider to act like part of your operations team, not a disconnected landlord.

That expectation is consistent with how mature teams manage change in other high-stakes domains. A good vendor should provide enough transparency for you to make decisions quickly and document them later. If a provider is vague about who gets notified, how soon, and through which channel, treat that as a red flag. In AI, operational ambiguity is often the prelude to financial waste.

7) A Practical Negotiation Template for Engineering Teams

Use this checklist before signing

| Area | Ask | Minimum acceptable answer | Why it matters |
| --- | --- | --- | --- |
| Power | How much continuous kW per rack is available today? | Stated sustained kW at rack with evidence | Prevents underpowered deployments |
| Power growth | How fast can we add density? | Contracted expansion timeline | Avoids migration when demand rises |
| Cooling | Is liquid cooling direct-to-chip, RDHx, or hybrid? | Specific validated method | Prevents mismatch with hardware |
| Cooling ops | What is the thermal incident response time? | Documented SLA with escalation | Reduces training interruptions |
| Network | What are the latency and path diversity guarantees? | Measured targets and route diversity | Supports distributed AI workloads |
| Carrier access | Which carriers are live and installable now? | Named carriers with realistic lead times | Ensures carrier-neutral value |
| Security | Who can access cages, racks, and cooling gear? | Role-based access controls | Protects data and equipment |
| Billing | How is power metered and billed? | Transparent, exportable metering | Supports cost control and audits |

Draft contract language the business can actually use

Here is a sample requirement format you can adapt: “Provider must deliver a minimum of 40 kW sustained continuous load per rack, with dual-path power feeds, support for liquid cooling architecture specified as direct-to-chip, and on-site response for critical cooling incidents within 30 minutes. Provider must maintain carrier-neutral access with at least three live carriers and document cross-connect provisioning times. Latency to designated cloud interconnect points must remain within agreed thresholds.”

This style of language works because it is testable. It doesn’t ask the provider to be “best in class”; it asks for measurable obligations that your ops team can verify. If your organization already uses structured operational reviews, model the negotiation like a vendor due diligence process and attach evidence requirements. The goal is to make the SLA executable, not aspirational.
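One way to keep the SLA executable is to encode the obligations as data that both procurement and the ops team reference during acceptance testing. The sketch below is illustrative only: the field names, thresholds, and measured figures are assumptions, not values from any specific contract.

```python
# Minimal sketch: encode the sample clause as testable obligations, then
# compare measured values against them during acceptance testing. Field
# names, thresholds, and measured figures are illustrative assumptions.
SLA = {
    "power_sustained_kw_per_rack": 40.0,
    "power_peak_kw_per_rack": 48.0,
    "cooling_architecture": "direct-to-chip",
    "thermal_response_minutes": 30,
    "live_carriers_min": 3,
    "cloud_onramp_p95_latency_ms": 2.0,
}

MEASURED = {  # filled in from acceptance tests and provider evidence
    "power_sustained_kw_per_rack": 41.5,
    "power_peak_kw_per_rack": 48.2,
    "cooling_architecture": "direct-to-chip",
    "thermal_response_minutes": 22,
    "live_carriers_min": 4,
    "cloud_onramp_p95_latency_ms": 1.7,
}


def acceptance_report(sla: dict, measured: dict) -> dict[str, bool]:
    """Pass/fail per obligation: delivered capacity must meet or exceed the
    contracted floor; response times and latency must stay at or below it."""
    return {
        "power_sustained": measured["power_sustained_kw_per_rack"] >= sla["power_sustained_kw_per_rack"],
        "power_peak": measured["power_peak_kw_per_rack"] >= sla["power_peak_kw_per_rack"],
        "cooling_architecture": measured["cooling_architecture"] == sla["cooling_architecture"],
        "thermal_response": measured["thermal_response_minutes"] <= sla["thermal_response_minutes"],
        "carriers": measured["live_carriers_min"] >= sla["live_carriers_min"],
        "latency": measured["cloud_onramp_p95_latency_ms"] <= sla["cloud_onramp_p95_latency_ms"],
    }


if __name__ == "__main__":
    for name, passed in acceptance_report(SLA, MEASURED).items():
        print(f"{name:20s} {'PASS' if passed else 'FAIL'}")
```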

Negotiate for exit rights and portability

AI infrastructure can evolve quickly. Your team should know what happens if the site no longer fits the hardware roadmap, the carrier ecosystem changes, or power pricing shifts materially. Ask for exit assistance, data export rights for telemetry, decommissioning support, and reasonable notice periods for material changes. Exit language is not pessimism; it is resilience planning.

At scale, portability protects both engineering momentum and budget discipline. It also reduces vendor dependence when new generations of accelerators or cooling systems arrive. In practice, strong exit terms can be as valuable as a discount because they preserve your ability to re-platform with minimal disruption. That is especially important in fast-moving AI infrastructure strategies where the compute stack may need to change every 12 to 24 months.

8) Real-World Scenarios: What Good Looks Like

Scenario 1: A research lab with one dense pilot rack

A university lab or startup may begin with a single high-density rack that needs more cooling than a normal enterprise cabinet but not full-scale campus buildout. The best provider in this scenario is usually the one that can support the pilot without penalizing the lab for growth later. The contract should allow the lab to reserve expansion space and add power without re-negotiating every time the experiment program expands. The important outcome is speed from idea to first training run.

In this case, the lab can accept more flexibility around secondary connectivity, provided the low-latency path to its cloud region is dependable. It should still ask for precise terms on rack power, thermal response, and access controls. If the provider offers an upgrade path from air cooling to RDHx or direct-to-chip liquid cooling, that future option may be worth more than a slightly cheaper air-cooled deal. The project will grow faster than anyone expects.

Scenario 2: A product team training models continuously

Here, cluster uptime and network stability matter more than one-time deployment simplicity. The team should prioritize redundant power, live failover support, strong observability, and carrier diversity. It should also demand a clear process for scheduled maintenance, because a one-hour downtime event can invalidate an entire training epoch or delay a release cycle. The contract should reflect the operational cost of interruption.

This is where a mature, SRE-style operating model becomes useful. You define error budgets, tie them to the vendor’s maintenance windows, and ensure that the team can predict the impact of infrastructure work on model delivery. If the provider cannot align to that level of rigor, it may be a poor fit for continuous AI operations.

Scenario 3: A hybrid enterprise with compliance obligations

For regulated teams, the physical environment is inseparable from governance. They need documented access controls, security logs, chain-of-custody procedures, and evidence retention. Network terms should include deterministic connectivity to approved cloud regions and explicit rules for data movement. Cooling and power still matter, but compliance often determines whether the site can be used at all.

These teams often gain from the same discipline used in regulated deployment checklists. The difference is that the “application” is not a software release but an AI training environment that must satisfy multiple stakeholders. Procurement, security, and infrastructure engineering should agree on a shared acceptance checklist before the contract is signed.

9) Common Mistakes to Avoid When Buying AI Colocation

Confusing brochure language with operable capacity

One of the most common mistakes is accepting marketing terms as proof of engineering reality. “AI-ready,” “high-density enabled,” and “future-proof” do not tell you how much power is actually live or how cooling behaves under stress. Ask for measurable engineering documents, not just sales collateral. If the provider hesitates, that usually means the capability is still being built.

Ignoring network path and cloud adjacency

Many teams focus so heavily on power and cooling that they forget the network is what makes AI infrastructure useful. If the facility is technically powerful but isolated from the cloud regions, datasets, or partners you need, the total project value drops fast. You may end up paying for an excellent site with poor access to your actual workflow. Always include latency, carrier access, and interconnect options in the first round of evaluation.

Not planning for the second and third rack

AI environments rarely stay small. If you negotiate only for the first cabinet, later expansion can become expensive, slow, or impossible. The best contracts reserve adjacent space, define upgrade paths, and specify how pricing changes when density rises. That kind of foresight can save a migration that would otherwise consume months of engineering time and capital.

Pro Tip: If the vendor won’t commit to expansion language, assume future growth is a separate project with separate risks. Negotiate as if you will need three times today’s capacity.

10) Final Buying Checklist and Takeaway

What to ask before you sign

Before approving a colo or hybrid AI contract, confirm that the provider can answer five questions in writing: how much sustained kW per rack is available now, what cooling architecture is actually supported, what carrier-neutral options exist today, what latency targets are guaranteed, and what happens when you need to expand. If any answer is vague, insist on documentation or walk away. AI infrastructure is too expensive to leave to interpretation.

Your engineers should treat the site as part of the compute stack, not just a lease. That mindset is what separates a successful deployment from an expensive relocation later. The right contract gives you power headroom, thermal confidence, and network predictability in terms the team can operate against every day. The wrong contract only gives you a promise.

How to use this in procurement

Turn this guide into a scoring sheet. Weight power and cooling heavily for dense GPU clusters, weight network and carrier access heavily for distributed or cloud-adjacent workloads, and weight security and access controls heavily for regulated teams. Then compare vendors on evidence, not enthusiasm. That process is repeatable, defensible, and far more useful than a generic RFP.
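A minimal version of that scoring sheet can live in a few lines of code, which also makes the weighting explicit and auditable. The weights, criteria, and vendor scores below are placeholders; adjust them to the workload profile described above.

```python
# Minimal sketch: turn the checklist into a weighted scoring sheet. Weights,
# criteria, and vendor scores (0-5 per criterion) are placeholders; tune the
# weights to your workload profile (dense GPU, cloud-adjacent, regulated).
WEIGHTS = {
    "power": 0.25, "cooling": 0.25, "network": 0.20,
    "carrier_access": 0.10, "security": 0.10, "billing": 0.10,
}

VENDORS = {
    "Site A": {"power": 5, "cooling": 4, "network": 3,
               "carrier_access": 4, "security": 4, "billing": 3},
    "Site B": {"power": 3, "cooling": 3, "network": 5,
               "carrier_access": 5, "security": 4, "billing": 4},
}


def weighted_score(scores: dict[str, int]) -> float:
    """Weighted sum of per-criterion scores."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)


for name, scores in sorted(VENDORS.items(), key=lambda kv: -weighted_score(kv[1])):
    print(f"{name}: {weighted_score(scores):.2f} / 5.00")
```

Attach the evidence behind each score (load tests, carrier lists, incident SLAs) and the sheet doubles as the acceptance record for whichever site wins.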

For teams building their broader AI platform strategy, it also helps to align colocation decisions with your AI compute roadmap, observability standards, and incident response model. The result is not just a better facility contract; it is a better operating model for the entire stack. That is what modern AI R&D infrastructure should deliver.

FAQ

What is a good kW per rack target for AI R&D?

It depends on the GPU generation and cooling method, but many AI deployments should start by validating sustained density in the 30 kW to 60 kW range, with a path higher if the roadmap includes next-generation accelerators. The key is to ask for continuous, not just peak, capacity. Always confirm the site can support your actual power profile under normal operating conditions.

Is direct-to-chip cooling better than RDHx?

Direct-to-chip is usually more effective for very high heat densities because it removes heat closer to the source. RDHx can be useful as a bridge or supplement, especially when you still rely on some air cooling. The right answer depends on your rack design, thermal load, and maintenance model.

What does carrier-neutral actually mean?

Carrier-neutral means you are not locked to a single telecom provider, but in practice it only matters if multiple carriers are truly live and installable with reasonable lead times. Ask for the current carrier list, cross-connect process, and any physical constraints in the meet-me room. Neutrality without usable choice is not very useful.

Should network latency be in the SLA?

Yes, if your AI workloads depend on distributed training, storage sync, or cloud adjacency. The SLA should include measurable latency targets and preferably route diversity or loss/jitter thresholds. If the provider refuses to define performance, you should assume the network is not optimized for your workload.

How do I avoid getting locked into the wrong facility?

Negotiate expansion rights, exit assistance, and portable observability from the start. Make sure the contract covers not only today’s pilot cluster but also the second and third rack. Flexibility is one of the most valuable features in a fast-moving AI infrastructure program.


Marcus Elwood

Senior SEO Content Strategist

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
