How water displaces air in Lenovo supercomputers

Today there is a steady push to extract maximum performance from every computing unit in the data center. Users want ever more local disk storage inside the nodes, and interconnects that tie cluster systems together at 200 Gbit/s no longer seem surprising.

Central processing units and graphics accelerators run hotter every year, yet fitting components with 240 W thermal packages into dense server form factors is risky because of the heat involved. Today, because heat cannot be removed efficiently, most data centers are less than half populated with equipment.

A couple of years ago, 600 W per rack unit was considered the practical limit for air cooling. Today, with the most modern technologies, it is possible to build a compact 1U computing "box" that consumes up to 1 kW, and in theory air can remove up to 1.2 kW per rack unit. If the power of each individual computing unit keeps growing at the current rate, however, at least 2 kW of cooling will be required.

Because of the increased density of computing resources, next-generation large-scale data centers will be severely constrained by power, cooling, space, and maintenance costs. In addition, the need to comply with regulatory requirements on energy efficiency and CO2 reduction will have a major impact on the industry.

Already this year, processors drawing up to 300 or even 350 W will appear on the market, requiring huge heat sinks and fans to cool them. By 2022 we can expect offerings of up to 500 W per socket. The new processors will need roughly twice as much cooling capacity; in practice that means up to a fourfold increase in fan speed and an eightfold increase in noise.

If part of the air cooling system fails, the remaining fans have to run at extreme settings. For example, in a system with five fans, if one fails, the remaining four must each deliver 25% more output; as a result, their rotation speed has to rise by about 50% and the noise roughly doubles.

Neptune to replace air

Water cooling works on the principle of heat transfer from a hotter object to a colder one: as long as the coolant temperature is below the server's operating temperature, the heat generated is carried away by the liquid. Compared with air, water can transport roughly 4,000 times more heat, so excess heat is easily removed from the server equipment. The resulting hot water can even be used to heat the building.
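As a rough sanity check of that figure, one can compare the volumetric heat capacity (density times specific heat) of the two media. The sketch below uses textbook room-temperature values, which are assumptions of this illustration rather than numbers from the article, and lands in the same ballpark: roughly 3,500 to 1 per unit volume.

```python
# Back-of-envelope comparison of how much heat a given volume of water
# vs. air can carry away per degree of temperature rise.
WATER_DENSITY = 998.0   # kg/m^3 at ~20 C
WATER_CP = 4186.0       # J/(kg*K), specific heat of water
AIR_DENSITY = 1.2       # kg/m^3 at ~20 C, sea level
AIR_CP = 1005.0         # J/(kg*K), specific heat of air

water_volumetric = WATER_DENSITY * WATER_CP   # J/(m^3*K)
air_volumetric = AIR_DENSITY * AIR_CP         # J/(m^3*K)

print(f"Water: {water_volumetric / 1e6:.2f} MJ/(m^3*K)")
print(f"Air:   {air_volumetric / 1e3:.2f} kJ/(m^3*K)")
print(f"Ratio: ~{water_volumetric / air_volumetric:.0f}x per unit volume")
```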


Figure 1. Simplified diagram of a data center using direct water cooling technology.

The chips and modules in modern computer systems are designed to operate at temperatures of 80 °C and above. This wide temperature margin of more than 50 °C lets a liquid cooling system be tuned precisely to the specifics of a particular computing center. Microchannel cold plates absorb excess heat directly at the source: the processor, memory module, hard disk, or network adapter.

Lenovo Neptune uses warm water at relatively low pressure, so a smaller volume of liquid passes through the cooling loop at any given moment. These innovations reduce thermal resistance and the overall power consumption of the data center: because the liquid runs hot, it does not need to be cooled by energy-intensive chillers.
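To give a sense of the flow rates involved, here is a minimal sketch based on the standard relation Q = ṁ·c_p·ΔT. The 1 kW node load and 5 °C coolant temperature rise are illustrative assumptions, not figures from the article.

```python
# How much water must flow through a node to carry away its heat?
# Q = m_dot * c_p * dT  ->  m_dot = Q / (c_p * dT)
WATER_CP = 4186.0        # J/(kg*K)
WATER_DENSITY = 998.0    # kg/m^3

node_heat_w = 1000.0     # assumed heat load of a dense 1U node, W
delta_t = 5.0            # assumed coolant temperature rise, K

mass_flow = node_heat_w / (WATER_CP * delta_t)               # kg/s
volume_flow_lpm = mass_flow / WATER_DENSITY * 1000.0 * 60.0  # litres per minute

print(f"Required flow: {mass_flow:.3f} kg/s = {volume_flow_lpm:.1f} L/min")
```

Under these assumptions a single warm-water loop of about 3 L/min is enough to absorb the full heat output of a 1 kW node.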


As long as the outdoor air temperature stays below the water temperature, free cooling with outside air is sufficient for this purpose. The heat removed can then be reused to warm nearby homes, swimming pools, and administrative buildings.

SuperMUC-NG: high performance and energy efficiency for the greatest scientific discoveries

The Leibniz Supercomputing Centre (LRZ) is one of the world’s largest academic data centers. LRZ provides the scientific community with world-class HPC services and resources, supporting innovative research from cosmology to medicine.


High-performance computing is the cornerstone of modern science, and more and more researchers rely on simulation and modeling in their work.

Over time, the capacity of the existing cluster became insufficient, and the Leibniz Supercomputing Centre signed a contract with Lenovo to design and build a new system for processing and visualizing big data. The project was named SuperMUC-NG (NG for Next Generation) and became the third phase of the SuperMUC series of supercomputers.

The new cluster exceeds its predecessor in performance by a factor of four. SuperMUC-NG consists of 6,480 compute nodes with Intel Xeon Scalable processors, roughly 311,000 cores in total, and has a peak performance of 26.7 petaflops. The cluster holds 700 TB of RAM, 70 PB of data storage, and more than 60 km of cabling.
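As a rough cross-check of those figures: assuming each core can retire 32 double-precision flops per cycle (two AVX-512 FMA units), the quoted core count and peak performance imply a per-core clock of about 2.7 GHz. The flops-per-cycle value is an assumption about the hardware, not a number from the article.

```python
# Rough consistency check: peak_flops = cores * flops_per_cycle * clock_hz
peak_flops = 26.7e15     # quoted peak performance, FLOP/s
cores = 311_000          # quoted core count
flops_per_cycle = 32     # assumed: 2 AVX-512 FMA units * 8 doubles * 2 ops

implied_clock_ghz = peak_flops / (cores * flops_per_cycle) / 1e9
print(f"Implied per-core clock: ~{implied_clock_ghz:.1f} GHz")
```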

Like its predecessors, SuperMUC-NG is an extremely energy-efficient machine. The cluster is built on high-density Lenovo ThinkSystem SD650 servers, whose compute nodes use Direct to Node (DTN) water cooling with inlet water temperatures of up to 50 °C.

Thanks to Lenovo Neptune liquid cooling technology, SuperMUC-NG uses 30-40% less energy than comparable systems and reuses its waste heat to warm all of the LRZ buildings. Among other things, Lenovo’s system has allowed the computing center to cut CO2 emissions by up to 85%, or about 30 tons per year in absolute terms.

MareNostrum 4: optimal calculations in real time

Every year, more than 10,000 people visit the Torre Girona chapel on the outskirts of Barcelona to see MareNostrum 4, one of the largest and most powerful supercomputers in the world. The cluster consists of 3,456 Lenovo ThinkSystem SD530 nodes with Intel Xeon Platinum processors and delivers a computing power of 11 petaflops.


Despite being ten times faster than its predecessor, MareNostrum 4 uses only about 30% more energy, drawing roughly 1.3 MW. The cluster is ranked among the top ten systems in Europe on the Green500 list of the most energy-efficient computing systems.

Power and energy have become critical constraints for HPC systems.

The performance and power consumption of parallel applications depend on a number of factors, such as:

  • Architectural parameters of the computer
  • Configuration of the computing node during code execution
  • Characteristics of application software
  • Input

Selecting the optimal parameters is a difficult task that is usually done by hand: a labor-intensive process of choosing resources and then capacities when a supercomputer is put into operation. Over time the optimal parameters can change, and they can also vary from node to node within the cluster.

Energy Aware Runtime (EAR), a joint development by Lenovo and the Barcelona Supercomputing Center, selects the optimal operating mode of the hardware automatically, based on analysis of how a particular workload has behaved.

EAR, as part of the Lenovo Neptune portfolio, supports automatic, dynamic selection of the processor frequency based on a variety of factors. It then projects the performance and power consumption of the supercomputer as a whole. The final step is to configure the thresholds that define user or system policies for choosing the processor frequency: for example, a power-saving policy lowers the frequency, while, conversely, another policy can cap the allowed loss of performance.
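The following is a minimal sketch of the kind of policy logic described above, not EAR’s actual API: it projects runtime and power at a few candidate frequencies using a deliberately simple model and picks the lowest-energy setting whose projected slowdown stays within a user-defined threshold. The frequencies, model coefficients, and threshold are illustrative assumptions.

```python
# Illustrative frequency-selection policy: minimise energy-to-solution
# subject to a maximum allowed performance degradation.

def project(freq_ghz, base_freq_ghz, base_runtime_s, base_power_w, cpu_bound=0.7):
    """Very simple projection model (assumed, not EAR's internal model):
    runtime scales with frequency only for the CPU-bound fraction of the job,
    power scales roughly linearly with frequency."""
    runtime = base_runtime_s * (cpu_bound * base_freq_ghz / freq_ghz + (1 - cpu_bound))
    power = base_power_w * (freq_ghz / base_freq_ghz)
    return runtime, power

def pick_frequency(freqs_ghz, base_freq_ghz, base_runtime_s, base_power_w,
                   max_slowdown=0.10):
    """Return the candidate frequency with the lowest projected energy
    whose projected slowdown does not exceed max_slowdown."""
    best = None
    for f in freqs_ghz:
        runtime, power = project(f, base_freq_ghz, base_runtime_s, base_power_w)
        slowdown = runtime / base_runtime_s - 1.0
        energy = runtime * power
        if slowdown <= max_slowdown and (best is None or energy < best[1]):
            best = (f, energy)
    return best[0] if best else base_freq_ghz

# Example: a job previously measured at 2.7 GHz, 1000 s, 300 W per node.
print(pick_frequency([2.1, 2.3, 2.5, 2.7], 2.7, 1000.0, 300.0))  # -> 2.5
```

With these assumed numbers the policy drops the clock one step, trading about 6% of runtime for a lower projected energy-to-solution, which mirrors the trade-off between power-saving and performance-bounded policies described above.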