The data centre industry is adapting quickly to the demands of running AI workloads, with much of the focus on improved power and cooling infrastructure. The increased demands on connectivity will be just as significant, however, as data centres handle more data than ever. This article outlines the key ways in which networking technology for the data centre industry must evolve.
Why are the connectivity needs of AI greater than for legacy workloads?
AI will place greater demands on networking equipment than any workload so far. Large language models (LLMs) process and transfer huge volumes of data, placing substantial pressure on networking infrastructure. Training these models requires more advanced networking equipment within the data centre, and once they are deployed, potentially huge data workloads must still be moved. AI inferencing varies widely in its compute demand, but for some models it is highly data-intensive, and if the model runs in a different location from the devices or sensors generating the data, this places huge demands on the networks in between.
Figure 1: AI applications require a high level of networking between training and inferencing sites and the end-customer
This article will cover exactly how data centre connectivity must adapt to serve these higher demands.
How networking must advance within the data centre
AI workloads are composed of training and inferencing phases, often involving distinct clusters. Training clusters are typically hosted within a single data centre to leverage high-bandwidth, low-latency interconnects for efficient model refinement, although in some cases retraining may occur in distributed locations. Inferencing, on the other hand, involves applying the trained model to new data, requiring input data to be sent to the inferencing sites, which are often numerous. Trained model parameters are then deployed from the training cluster to inferencing locations as needed. Both training and inferencing rely on robust connectivity to ensure smooth data movement and low-latency outputs, particularly for real-time applications.
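To make the deployment step concrete, consider the size of the artefact being moved. A back-of-envelope sketch in Python, assuming an illustrative 70-billion-parameter model stored at 16 bits per weight (both figures are assumptions, not from the text):

```python
# Rough size and transfer time for shipping trained model weights
# from a training cluster to an inferencing site.
# All figures below are illustrative assumptions, not measurements.

params = 70e9                # hypothetical 70B-parameter model
bytes_per_param = 2          # FP16/BF16 weights
size_gb = params * bytes_per_param / 1e9
print(f"Checkpoint size: ~{size_gb:.0f} GB")

for gbps in (10, 100, 800):  # link speed in gigabits per second
    seconds = size_gb * 8 / gbps
    print(f"At {gbps:>3} Gbps: ~{seconds:.0f} s to transfer")
```

Even this one-off transfer runs to minutes on a 10 Gbps link, and it repeats for every inferencing site and every model update.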
This is being delivered through a number of upgraded technologies:
• ‘Densification’ reduces the physical distance and the number of hops that data must travel within the cluster, lowering latency
• Upgraded networking equipment, such as interconnects and switches able to handle higher bandwidth at lower latency. NVSwitch, for example, is a high-bandwidth, low-latency interconnect technology that facilitates communication between multiple GPUs in a cluster
• 800G and other high-bandwidth networking technologies are important for facilitating the high-speed transfer of data within the cluster (a rough timing sketch follows this list)
• New technologies such as remote direct memory access (RDMA), which allows data to be transferred directly between the memory of two nodes without involving the CPU, reducing latency
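As referenced above, link speed directly shapes training efficiency, because distributed training exchanges gradients of roughly model size on every step. A simplified Python sketch, ignoring all-reduce algorithms and compute/communication overlap (the model size and link speeds are illustrative assumptions):

```python
# Naive estimate of per-step gradient-exchange time in distributed
# training. Real systems use ring/tree all-reduce and overlap
# communication with compute, so treat this as an upper-bound sketch.

params = 7e9                  # illustrative 7B-parameter model
grad_gb = params * 2 / 1e9    # FP16 gradients: ~14 GB per step

for gbps in (100, 400, 800):
    seconds = grad_gb * 8 / gbps
    print(f"{gbps:>3} Gbps link: ~{seconds:.2f} s of communication per step")

# Over the thousands of steps in a training run, the gap between
# 100G and 800G compounds into hours of idle accelerator time.
```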
All of this networking infrastructure needs to be more scalable and adaptable. AI models scale hugely in size and generate distinctly “peaky” network traffic patterns, requiring flexible, scalable networking designs to handle increasing computational and data transfer needs. AI also requires dynamic reconfiguration of network paths to optimise data flow; programmable networking technologies such as P4 (a programming language for packet processing) allow for this adaptability. AI deployments in critical industries such as healthcare and finance also demand a very high level of reliability and redundancy.
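P4 itself is a domain-specific language compiled onto switch dataplanes, so a faithful example would be P4 code; as a loose, language-agnostic illustration of the kind of dynamic path reconfiguration such programmability enables, here is a hypothetical controller policy in Python (all names and figures are invented for illustration):

```python
# Hypothetical controller logic for dynamic path reconfiguration:
# steer new flows onto the least-utilised candidate path. A real
# deployment would push the equivalent forwarding rules into
# programmable switches; this only sketches the policy layer.

from dataclasses import dataclass

@dataclass
class Path:
    name: str
    capacity_gbps: float
    current_load_gbps: float

    @property
    def headroom(self) -> float:
        return self.capacity_gbps - self.current_load_gbps

def pick_path(paths: list[Path], demand_gbps: float) -> Path:
    """Choose the path with the most headroom that can carry the flow."""
    viable = [p for p in paths if p.headroom >= demand_gbps]
    if not viable:
        raise RuntimeError("no path has enough headroom")
    return max(viable, key=lambda p: p.headroom)

paths = [
    Path("spine-1", 800, 620),
    Path("spine-2", 800, 310),
    Path("spine-3", 400, 150),
]
print(pick_path(paths, demand_gbps=100).name)  # -> spine-2
```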
This extra networking equipment also needs to be more energy-efficient to keep down the already high energy consumption of AI clusters, whether through better cable management or more efficient switching.
How networking between data centres and end-customers must advance
The architectures of AI applications can vary significantly, ranging from deployment in a single enterprise-owned data centre to distributed setups spanning multiple data centres across different countries, ensuring low-latency model outputs across an enterprise’s geographical footprint. Inter-data centre networking will need to be upgraded in the same way as intra-data centre networking.
Data centres have never been data islands; they have always needed to be deeply connected with their customers and with other data centres. Enterprise applications today often run in sites near the enterprise itself, reducing latency and addressing the data sovereignty and security concerns of end-customers and other stakeholders. Data recovery or backup instances are a core component of any redundancy strategy. Transferring data is expensive, so minimising egress fees and high-bandwidth transfers is another driver of distribution, and multi-cloud and hybrid cloud architectures add to it. All of this requires a high level of connectivity between distributed sites.
However, AI is likely to place more emphasis on the edge than most previous applications:
1. Data sovereignty and security are in greater focus than ever, not just for enterprises themselves but for governments and regulators. This is partly driven by recent global disruptions such as the CrowdStrike outage that affected IT systems worldwide, instability in the Middle East affecting key supply chain routes, and increased tensions between superpowers such as the US, China and Russia. The EU has announced €1.2 billion in funding for sovereign cloud initiatives in Europe through its IPCEI Cloud programme. Stakeholders want their data to be safe within their own borders, and wariness about the potential of AI in the hands of malign actors reinforces this.
2. Inferencing for many applications requires low latency, which means there will naturally be a role for edge computing, whether on-premise or at a third-party facility. The model is trained in one location but its outputs are needed in many others, so there is a need for distribution. All of these edge locations need to be connected, in many instances with very high-bandwidth connections (a rough sketch of the numbers follows this list).
3. Access to power is one of the major themes in data centre industry discussions around AI. More power-hungry than legacy workloads, AI is forcing data centres to draw more power than ever from the grid. Rack density has gone from 5–10kW to 50+kW per rack, and data centres with capacities of hundreds of MW, even 1GW+, are being planned. This comes at a time when power is harder to come by, particularly as there is pressure to use green power. As a result, AI architectures are likely to follow the availability of green power: we will likely see training workloads located near abundant renewable energy, such as in the Nordics. These sites then need to be highly connected with inferencing locations up to thousands of miles away.
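Two of the drivers above reduce to simple arithmetic: fibre propagation delay bounds how far inferencing can sit from its users, and rack density dictates facility-scale power demand. A back-of-envelope sketch in Python, using the rack-density figures from the text and the approximate speed of light in fibre (the hall size is an illustrative assumption):

```python
# Back-of-envelope numbers behind the edge-latency and power drivers.

# Latency: light in optical fibre travels at roughly two thirds of c,
# about 200 km per millisecond.
KM_PER_MS = 200

def round_trip_ms(distance_km: float) -> float:
    """Best-case propagation delay only; switching and queuing add more."""
    return 2 * distance_km / KM_PER_MS

for km in (100, 1_000, 5_000):
    print(f"{km:>5} km away -> >= {round_trip_ms(km):.0f} ms round trip")
# At ~5,000 km, physics alone costs ~50 ms round trip, which is why
# latency-sensitive inferencing pushes toward the edge.

# Power: rack density drives facility-scale demand.
racks = 1_000                       # illustrative hall size
legacy_kw, ai_kw = 8, 60            # ~5-10 kW legacy vs 50+ kW AI racks
print(f"Legacy hall: ~{racks * legacy_kw / 1000:.0f} MW")
print(f"AI hall:     ~{racks * ai_kw / 1000:.0f} MW")
```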
Data centres have always been part of a wider network, but with AI this network will become more distributed and will transfer greater amounts of data than before, requiring upgraded networking technologies and equipment.
Figure 2: AI is likely to place more emphasis on the edge than previous applications
What steps must data centre operators take to future-proof their networking capability?
There are then several steps that data centre operators should follow to future-proof their networking capability:
1. Upgrade to high-bandwidth networks
1.1. Implement 800G and beyond
1.2. Upgrade interconnects using technologies like Compute Express Link (CXL)
2. Optimise intra-data centre connectivity
2.1. Densify network architectures to reduce physical distance and minimise hops
2.2. Implement programmable network protocols for dynamic path optimisation
2.3. Upgrade networking hardware including high-performance switches, routers and RDMA-enabled NICs
3. Strengthen inter-data centre connectivity
3.1. Establish low-latency links with other key data centre and end-customer hubs
3.2. Build strong networks with edge sites for AI inferencing closer to end-customers and in compliance with data sovereignty laws
3.3. Implement federated AI approaches to minimise raw data movement (a minimal sketch follows this list)
4. Design for scalability and security
4.1. Ensure networks can scale effectively as AI workloads grow
4.2. Implement stringent network security including secure data transfer between clusters and regions
5. Prioritise energy efficiency
5.1. Invest in energy-efficient networking hardware
5.2. Locate training clusters, where possible, near renewable energy sources
6. Automate and monitor networks
6.1. Use AI itself within the network for dynamic traffic routing, congestion prediction and fault recovery
6.2. Deploy predictive maintenance tools to anticipate hardware faults
7. Prepare for emerging technologies
7.1. Explore quantum communication which may enable ultra-secure, high-speed data transfer
7.2. Evaluate photonic technologies to achieve even lower latency and higher throughput
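Step 3.3 merits a concrete illustration. Below is a minimal sketch of federated averaging in Python with NumPy, assuming each site trains on its own data and only model weights (never the raw, potentially sovereignty-restricted data) cross the network; the toy objective and all names are invented for illustration:

```python
import numpy as np

# Minimal federated-averaging sketch: each edge site computes a local
# update on its own data, and only the weights cross the network.

def local_update(weights: np.ndarray, local_data: np.ndarray,
                 lr: float = 0.5) -> np.ndarray:
    """Stand-in for one round of local training at an edge site."""
    gradient = local_data.mean(axis=0) - weights   # toy objective
    return weights + lr * gradient

def federated_average(site_weights, site_sizes):
    """Aggregate site models, weighting by local dataset size."""
    total = sum(site_sizes)
    return sum(w * (n / total) for w, n in zip(site_weights, site_sizes))

rng = np.random.default_rng(0)
global_weights = np.zeros(4)
# Three sites with differently distributed local data.
sites = [rng.normal(loc, 0.1, size=(100, 4)) for loc in (0.5, 1.0, 1.5)]

for _ in range(10):
    updates = [local_update(global_weights, data) for data in sites]
    global_weights = federated_average(updates, [len(d) for d in sites])

print(global_weights)  # converges toward the cross-site mean (~1.0)
```

The design point for networking is that the aggregation traffic scales with model size rather than dataset size, so inter-site links carry far less data than centralised training would require.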
Figure 3: 7 key steps to future-proof data centre networking