PCIe Gen 4.0 - Impact for Media Servers

7th May 2021

PCI Express 4.0 has been available for some time now ( AMD was very quick to adopt and integrate it ). With Intel's Rocket Lake processors, as well as the latest offerings from both AMD and nVidia, PCI Express 4.0 will soon be the default on new systems of all types and specifications.

In this article, we take a quick look at the performance of PCIe 4.0 as well as some other technologies that can utilise it and may impact the next generation media server going forward.

PCI Express 4.0 - Transfer Rate

PCI Express is the main bus standard used for connecting many devices in a computer together, such as graphics, capture & network cards, as well as NVMe based storage devices. The table below shows a comparison between the old version 3.0 transfer speeds and the newer version 4.0.

Version | Line Code | Transfer Rate | Throughput x1 | Throughput x2 | Throughput x4 | Throughput x8 | Throughput x16
3.0 | 128b/130b | 8.0 GT/s | 0.985 GB/s | 1.969 GB/s | 3.938 GB/s | 7.877 GB/s | 15.754 GB/s
4.0 | 128b/130b | 16.0 GT/s | 1.969 GB/s | 3.938 GB/s | 7.877 GB/s | 15.754 GB/s | 31.508 GB/s

The new 4.0 specification offers double the bandwidth of the previous generation.
The transfer rate per lane of PCI Express is given in GT/s ( gigatransfers per second ). However, to calculate the actual throughput we have to account for the encoding of the data ( using the 128b/130b technique, in which every 130 bits sent carry 128 bits of data ).

The calculation for the throughput of a single lane is:

transfer rate in GT/s x ( 128 / 130 ) = throughput in gigabits per second

So for a single PCI Express 4.0 Lane:

16GT/s x ( 128 / 130 ) = 15.754 Gbps

To convert the gigabits per second figure into gigabytes per second, we divide by 8, giving 1.969 GB/s per PCIe 4.0 lane. To put these numbers in perspective, an x16 PCIe 4.0 connection could transfer more than a single-layer Blu-ray disc's worth of data ( 25GB ) every second.
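
As a rough check, the per-lane calculation above can be sketched in a few lines of Python. The function name and the list of link widths are our own choices for illustration; the only inputs taken from the specification are the transfer rate and the 128b/130b encoding.

```python
def pcie_throughput_gbytes(transfer_rate_gt: float, lanes: int = 1) -> float:
    """Usable throughput in GB/s for a PCIe link using 128b/130b encoding."""
    encoding_efficiency = 128 / 130                 # 128 data bits per 130 bits sent
    gbits_per_lane = transfer_rate_gt * encoding_efficiency
    return gbits_per_lane / 8 * lanes               # bits -> bytes, then scale by lanes


for version, rate in (("3.0", 8.0), ("4.0", 16.0)):
    row = [round(pcie_throughput_gbytes(rate, lanes), 3) for lanes in (1, 2, 4, 8, 16)]
    print(version, row)
```

Rounded to three decimal places, this reproduces the throughput figures in the table above.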

Having this much faster connection between various components in a server is going to have a big impact on the overall performance of the system.

NVMe Storage on PCIe 4.0

One of the standard system components that has seen a significant performance increase from the move to PCIe 4.0 is NVMe based storage. NVMe is an interface specification for accessing non-volatile storage over PCI Express, so these storage devices benefit directly from the faster bus.

Comparison of datasheet specifications for Samsung's previous and current generation PRO Series M.2 SSDs.

Edition | Capacity | Interface | Flash Memory Type | DRAM Cache | Max Sequential Read | Max Sequential Write | Max Random 4K Read | Max Random 4K Write
970 Pro | 512GB | PCIe 3.0 x4 | 2-bit MLC V-NAND | 512MB | 3500 MB/s | 2300 MB/s | 370k IOPS | 500k IOPS
980 Pro | 500GB | PCIe 4.0 x4 | 3-bit MLC V-NAND | 512MB | 6900 MB/s | 5000 MB/s | 800k IOPS | 1000k IOPS

Just from the advertised speeds on these two drives, we can see a dramatic improvement: sequential throughput has roughly doubled, as has random 4K performance. This suggests the main bottleneck in the previous generation was the interface rather than the flash memory itself. Looking at our previous table, the maximum throughput for an x4 PCIe 4.0 interface is around 7.877 GB/s ( or ~7,877 MB/s ), so there may yet be a bit more performance to be gained from these drives over the next few iterations.
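
To put a number on that remaining headroom, the short Python sketch below compares each drive's advertised sequential read speed with the theoretical throughput of its four-lane link. The function name is hypothetical, and the calculation ignores any protocol overhead beyond the 128b/130b encoding, so treat the result as an upper-bound estimate rather than a measurement.

```python
def link_utilisation(drive_mb_s: float, transfer_rate_gt: float, lanes: int) -> float:
    """Fraction of the theoretical link throughput used by a drive's advertised speed."""
    # GT/s -> Gbit/s (128b/130b) -> GB/s -> MB/s, scaled by the number of lanes
    link_mb_s = transfer_rate_gt * (128 / 130) / 8 * lanes * 1000
    return drive_mb_s / link_mb_s


# Advertised sequential read speeds and per-lane transfer rates from the tables above
drives = {
    "970 Pro (PCIe 3.0 x4)": (3500, 8.0),
    "980 Pro (PCIe 4.0 x4)": (6900, 16.0),
}
for name, (read_mb_s, rate) in drives.items():
    print(f"{name}: {link_utilisation(read_mb_s, rate, 4):.1%} of the link")
```

On these figures, both drives advertise sequential reads at a little under 90% of what their link can theoretically carry.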

These new PCIe 4.0 NVMe storage devices boast speeds previously unachievable without grouping multiple storage devices together in a RAID array. As such, they can seriously reduce the cost of a media server requiring high levels of data throughput, such as for uncompressed image sequence playback.
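
For a sense of scale, the data rate of an uncompressed image sequence can be estimated from its resolution, bit depth and frame rate. The example below is a sketch only: the UHD 10-bit RGB stream at 60 frames per second is an assumed workload, pixels are assumed to be tightly packed, and the drive speeds are the advertised figures from the table above, not sustained real-world rates.

```python
def uncompressed_mb_s(width: int, height: int, channels: int,
                      bits_per_channel: int, fps: float) -> float:
    """Data rate in MB/s of an uncompressed, tightly packed image sequence."""
    bytes_per_frame = width * height * channels * bits_per_channel / 8
    return bytes_per_frame * fps / 1_000_000


# Assumed workload: 3840x2160, RGB, 10 bits per channel, 60 frames per second
stream = uncompressed_mb_s(3840, 2160, 3, 10, 60)
print(f"One stream: {stream:.0f} MB/s")

# Advertised sequential read speeds in MB/s
for name, read_mb_s in (("970 Pro", 3500), ("980 Pro", 6900)):
    print(f"{name}: {int(read_mb_s // stream)} stream(s)")
```

Under these assumptions a single stream needs roughly 1.9 GB/s, so a single PCIe 4.0 drive could in principle feed three such streams where the previous generation could feed one.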

However, if you want to push the envelope, there are also PCIe 4.0 RAID controllers available, such as the HighPoint SSD7505. This is an x16 PCIe 4.0 RAID controller that supports up to four NVMe M.2 SSDs and up to 32TB of storage, and in RAID 0 ( striped ) it boasts speeds of up to 25,000 MB/s. This seems like a perfect solution: most media server applications do not currently come close to this bandwidth requirement, so that should be the end of it, right?

As you may know, removing one bottleneck in a system only moves it somewhere else. Fortunately, some new solutions are now coming out to either reduce or remove these next bottlenecks.

Resizable BAR

Resizable BAR ( Base Address Register ) has been in the PCIe specification since 2008; however, it is only recently that AMD and nVidia have begun to utilise it.
Previously, graphics cards have limited the amount of VRAM ( memory on the GPU ) accessible to the processor over the PCIe bus to 256MB. With Resizable BAR implemented at firmware and driver level on newer GPUs ( with compatible motherboards and CPUs ), much larger amounts of VRAM can now be accessed directly by the CPU.

In the traditional method, all uploads from system memory into the GPU have to go through that 256MB buffer. Once uploaded, the GPU can move the data elsewhere in VRAM and then fetch the next set of data in the queue. With much larger sections of the VRAM available to the CPU, this data can be uploaded whole and directly to its final location. Additionally, these transfers can happen concurrently instead of being put into the queue.
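
The difference can be illustrated with a toy model in Python that simply counts how many window-sized passes an upload needs. This is a sketch of the description above and nothing more: the function name, the 1 GiB upload and the 8 GiB resized window are all assumptions, and real drivers also use DMA engines and schedule transfers in ways this does not capture.

```python
import math

LEGACY_WINDOW_BYTES = 256 * 1024**2  # the 256MB window described above


def windowed_passes(upload_bytes: int, window_bytes: int) -> int:
    """Number of window-sized passes needed to stage an upload into VRAM."""
    return math.ceil(upload_bytes / window_bytes)


upload = 1 * 1024**3          # hypothetical: 1 GiB of frame data
resized_window = 8 * 1024**3  # hypothetical: BAR resized to expose 8 GiB of VRAM

print(windowed_passes(upload, LEGACY_WINDOW_BYTES))  # 4 passes through the small window
print(windowed_passes(upload, resized_window))       # 1 pass, written straight to its destination
```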

From a media server perspective, a compressed video frame that has been decoded by the CPU can now be copied directly into VRAM as soon as the decode completes, rather than entering the queue. If multiple streams are coming through, their data can be sent concurrently, utilising much more of the available PCIe 4.0 bandwidth.

Direct Storage

Another new development coming through is Direct Storage. This technology enables data to travel between the GPU and a local or remote storage device, such as an NVMe SSD, without needing to go through the CPU or system memory.

Figure 1: The standard path between GPU memory and NVMe drives uses a bounce buffer in system memory that hangs off of the CPU. The direct data path from storage gets higher bandwidth by skipping the CPU altogether.

Source - nVidia Developer Blog - GPUDirect Storage: A Direct Path Between Storage and GPU Memory

This technology will greatly increase the I/O bandwidth between GPUs and storage devices, allowing for higher bit-rate videos and image sequences. Uncompressed data can be copied directly into VRAM for use, while compressed data ( provided it is in a format that the GPU can decompress directly ) no longer needs to be decoded by the CPU, alleviating the load on it.

Finally

All in all, these new technologies, combined with the latest hardware, could offer a significant performance increase over older systems for video and media playback. We are excited to see what real-world improvements emerge once these techniques are integrated into the various media server software packages running on the latest DVS hardware.