Question No. 2: What about compatibility with VMware vSphere?
Since AMD EPYC's native habitat is the cloud, every cloud-oriented operating system supports these processors without complaints or limitations. But it is not enough to say that everything simply "boots and runs." Unlike Xeons, EPYC processors use a chiplet layout. In the first generation (the 7001 series), four separate dies sit on a common package, each with its own cores and memory controller, so a situation can arise where a virtual machine runs on cores belonging to one NUMA domain while its data lives in memory modules attached to another die's NUMA domain, putting extra load on the interconnect inside the package. Software vendors therefore have to optimize their code for the peculiarities of this architecture. VMware, in particular, has learned to avoid such skew when allocating resources to virtual machines, and if you are interested in the details, I recommend reading this article. Fortunately, the second-generation EPYC (Rome) is laid out differently and does not have this subtlety: each physical processor can be initialized as a single NUMA domain.
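You can see for yourself which NUMA topology a guest ends up with. A minimal sketch for a Linux guest, using the standard sysfs layout (on a VM backed by a Rome-generation EPYC you would expect all vCPUs to land in a single node):

```shell
# List each NUMA node the guest sees and the CPUs assigned to it.
# /sys/devices/system/node is the standard Linux sysfs NUMA interface.
for node in /sys/devices/system/node/node*; do
    echo "$(basename "$node"): CPUs $(cat "$node/cpulist")"
done
```

On a first-generation EPYC host without the vSphere-side optimizations, a large VM may instead show several nodes here, which is exactly the situation described above.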
Those who are just beginning to take an interest in AMD processors often ask: how will EPYC interact with competitors' products in the virtualization field? After all, in machine learning Nvidia still reigns supreme, and in network communications it is Intel and Mellanox, the latter now part of Nvidia. Here is a screenshot showing the devices available for passthrough to a virtual machine, bypassing the hypervisor. Given that AMD EPYC Rome has 128 PCI Express 4.0 lanes, you can install 8 graphics cards in a server and pass them through to 8 virtual machines to accelerate TensorFlow or other machine-learning packages.
Let's allow ourselves a small lyrical digression and set up our mini data center for machine-learning needs using Nvidia P106-090 video cards, which have no video outputs and are designed specifically for GPU computing. Evil tongues may call it a "mining stub"; to me it is a "mini Tesla" that copes perfectly well with small TensorFlow models. When assembling a small machine-learning workstation with desktop video cards, you may notice that a VM with one video card runs fine, but to make the whole setup work with two or more GPUs that were not designed for the data center, you need to change the initialization method of the PCI-E device in the VMware ESXi configuration file. Enable SSH access to the host and connect under the root account:
and in the opened file make sure the following lines are present (instead of ffff there will be the IDs of your devices):
10de ffff d3d0 false
10de ffff d3d0 false
Then reboot the host, add the video card to the guest operating system, and power it on. Install and run Jupyter for remote access "à la Google Colab", and make sure that training of the new model runs on both GPUs.
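The steps above can be sketched as follows, practiced here on a local copy of the file so nothing on a real host is touched. The path /etc/vmware/passthru.map and the stock "bridge" reset method for the NVIDIA entry are assumptions about a typical ESXi install; 10de is NVIDIA's vendor ID, and ffff stands in for your device IDs:

```shell
# Create a local sample with the assumed stock NVIDIA entry
# (on the real host you would edit /etc/vmware/passthru.map over SSH as root).
cat > passthru.map.sample <<'EOF'
# NVIDIA
10de ffff bridge false
EOF

# Switch the reset method to d3d0, as in the lines shown above:
sed -i 's/^10de \([^ ]*\) [^ ]* /10de \1 d3d0 /' passthru.map.sample

cat passthru.map.sample
```

After applying the same change on the host itself, the reboot mentioned above is what makes it take effect.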
Once I needed to quickly train three models, so I started three Ubuntu VMs, passing one GPU through to each, and thus trained three models simultaneously on one physical server, something that cannot be done with desktop graphics cards without virtualization. Just imagine: for one task you can use a single virtual machine with 8 GPUs, and for another, eight VMs with one GPU each. But do not choose gaming video cards over professional ones: after changing the PCI-E initialization method as described above, as soon as you shut down an Ubuntu guest with passed-through video cards, it will not start again until the hypervisor is rebooted. For a home or office this solution is still tolerable, but for a cloud data center with strict uptime requirements it no longer is.
But these are not all the pleasant surprises: since AMD EPYC is an SoC, it does not need a south bridge, and the manufacturer delegates to the processor such pleasant functions as passing the SATA controller through to a virtual machine. Even better, there are two of them, so you can leave one to the hypervisor and give the other to a virtual software-defined storage appliance.
Unfortunately, I cannot show SR-IOV working in a live example, and there is a reason for that; I will leave this pain "for later" and pour out my soul further down the text. This feature allows you to pass a single physical device, such as a network card, through to several virtual machines at once: for example, the Intel X520-DA2 lets you share a single network port among 16 devices, and the Intel X550 among 64. You can also pass the same adapter through to one VM several times to experiment with several VLANs. Yet somehow this feature does not see much use even in cloud environments.
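Although I cannot demonstrate it live, the mechanics are easy to sketch. On a Linux host the kernel exposes SR-IOV through sysfs (on ESXi itself you would instead enable SR-IOV per adapter in the vSphere client); the interface name eth0 is an assumption, and the commands simply do nothing useful on hardware without the capability:

```shell
# Sketch: carve virtual functions out of an SR-IOV capable NIC via the standard
# Linux sysfs interface. Requires root and a PF that actually supports SR-IOV.
pf=/sys/class/net/eth0/device
if [ -e "$pf/sriov_totalvfs" ]; then
    echo "PF supports $(cat "$pf/sriov_totalvfs") virtual functions"
    # Create 8 VFs; each then appears as a separate PCI device to hand to a VM.
    echo 8 > "$pf/sriov_numvfs" 2>/dev/null || echo "need root to create VFs"
else
    echo "no SR-IOV capability on this interface"
fi
```

Each virtual function shows up as its own PCI device, which is exactly what makes sharing one port among 16 or 64 consumers possible.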