Deep Learning and GPUs : A Rant
August 8, 2018
Please read the previous article, titled Deep Learning and GPUs : The History, before reading this one. It gives you all the background you need to follow most of this rant.
The rant starts with NVIDIA and goes on and on about them. NVIDIA came out with cuDNN in 2014, a library of GPU-accelerated deep learning primitives that made it extremely easy for machine learning libraries to leverage the GPU, which in turn made training machine learning models much faster. A couple of difficult-to-use GPU-capable machine learning libraries existed before it came out, but cuDNN's availability meant that a large number of easy-to-use libraries leveraging the GPU appeared. Essentially, deep learning for the masses was born.
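To make that concrete, here is a minimal sketch of what "easy to use" looks like from the user's side, using PyTorch (one library from that wave) purely as an illustration; the framework hides all the GPU plumbing behind a one-line device switch:

```python
import torch

# A toy model and a batch of inputs; the same code runs on CPU or GPU.
model = torch.nn.Linear(784, 10)
x = torch.randn(64, 784)

# Moving computation to the GPU is a single call. Underneath, the
# framework dispatches matrix multiplies and convolutions to NVIDIA's
# cuBLAS/cuDNN kernels -- none of which the user ever sees.
if torch.cuda.is_available():   # True only with an NVIDIA GPU and CUDA installed
    model, x = model.cuda(), x.cuda()

out = model(x)  # runs on whichever device the tensors live on
```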
If one remembers, cuDNN is a library for CUDA, which is a proprietary NVIDIA technology and works only on NVIDIA GPUs. Hence, these libraries worked only on NVIDIA GPUs. For them to work with AMD's GPUs, their GPU code would need to be rewritten in OpenCL, an open standard that fills a similar role to CUDA and is supported by both NVIDIA and AMD. OpenCL, however, had no cuDNN equivalent behind it, which left the developers of these libraries with a very tough task if they wanted to bring in support for it. To date, many of these frameworks have no OpenCL support, or only very rudimentary support.
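To give a feel for the rewriting burden, here is a rough sketch (using the PyCUDA and PyOpenCL wrappers, purely as an illustration) of the same trivial kernel written against both APIs. Even in this toy case, both the kernel dialect and the host-side API change; a framework would have to redo this for every one of its hand-tuned primitives, with no cuDNN equivalent to lean on:

```python
import numpy as np

# --- CUDA version (NVIDIA only), via PyCUDA ---
import pycuda.autoinit                      # creates a CUDA context
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

cuda_mod = SourceModule("""
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
""")
scale_cuda = cuda_mod.get_function("scale")
x_gpu = gpuarray.to_gpu(np.ones(1024, dtype=np.float32))
scale_cuda(x_gpu, np.float32(2.0), np.int32(1024),
           block=(256, 1, 1), grid=(4, 1))

# --- OpenCL version (NVIDIA, AMD, Intel, ...), via PyOpenCL ---
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, """
__kernel void scale(__global float *x, float a, int n) {
    int i = get_global_id(0);
    if (i < n) x[i] *= a;
}
""").build()
x = np.ones(1024, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=x)
prog.scale(queue, (1024,), None, buf, np.float32(2.0), np.int32(1024))
cl.enqueue_copy(queue, x, buf)              # read the result back
```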
The effect of the situation described in the previous paragraph was an NVIDIA monopoly over the Machine Learning/Scientific Computing market. As with any monopoly, NVIDIA had complete control: GPU pricing was entirely up to them. This resulted in NVIDIA crippling its consumer GeForce brand of GPUs, removing support for efficient execution of certain floating-point formats (FP16 and FP64), and releasing its Titan brand of GPUs: essentially the same hardware as their GeForce counterparts, but with those capabilities enabled and a hefty price tag attached. Originally, the pricing was something like $699 for the GeForce GTX 1080 Ti and $1,200 for the equivalent Titan Xp. Later on, they launched the Titan Xp's successor, the Titan V, for a whopping $3,000. A wonderful example of abuse of market power. The concept of a free market no longer applies here: even though AMD makes competitive GPUs, virtually all sales in this space go to NVIDIA. Facing no competition, NVIDIA has little incentive to innovate; there is no rival to keep up with. Their last high-end line of GPUs was released two years ago, in 2016, and an updated line has yet to appear. Monopoly, abuse of market power, stifled innovation: the customer has to face it all.
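For readers unfamiliar with the jargon: FP16 and FP64 are simply narrower and wider floating-point formats, and frameworks let you pick them per tensor; what differs between GPU tiers is how fast the hardware executes them. A minimal PyTorch sketch, purely for illustration:

```python
import torch

x32 = torch.randn(64, 784)             # FP32: the default for GPU training
x16 = x32.half()                        # FP16: half the bits; fast only on
                                        # hardware with full-rate FP16 units
x64 = x32.double()                      # FP64: needed by much scientific
                                        # computing for numerical precision
print(x32.dtype, x16.dtype, x64.dtype)  # float32 float16 float64
```

The same code runs on any card; on an FP16/FP64-crippled consumer GPU it simply runs at a fraction of the speed, which is exactly the segmentation being ranted about.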
The rant about NVIDIA now continues into the realm of software support. Linux, as one may know, is an open-source operating system kernel: the code is available for the public to view and contribute to. The Linux kernel essentially contains the code to support different pieces of hardware, i.e. the drivers. Many hardware companies have embraced the open nature of Linux and support their hardware by contributing code directly to the kernel, so the hardware becomes plug and play on any Linux system. Normally, since the code is available, it can be tweaked slightly before being built so that it runs well on any given system. NVIDIA, however, has taken a different approach. Their driver is closed source: they keep the code to themselves and release pre-built code in the form of what people call "binary blobs", which run against the kernel. Since tweaking the code before building is not possible in this case, getting the NVIDIA GPU driver to work on Linux systems is hit or miss. To add insult to injury, installing updates sometimes breaks things along the way, leaving the user with an unusable system.
Support for the Deep Learning libraries on Windows wasn't on par with Linux in the early stages, hence many people stuck with Linux to use GPUs for Machine Learning. Machine Learning researchers have thus been putting up with NVIDIA's troublesome software all these years, and the situation is not improving. Installing updates to the NVIDIA driver remains murky water to wade into, and system downtime is unacceptable. NVIDIA is not going to change its approach because it does not need to; people will put up with it because they have no other option.
What is AMD's stance on this? Well, they obviously cannot get CUDA running on their GPUs. One framework, Theano, has experimental OpenCL support, but the popular ones like TensorFlow and PyTorch still lack it. AMD has been pushing for the use of OpenCL wherever possible. Outside the realm of Machine Learning, popular video rendering software supports both CUDA and OpenCL for GPU-accelerated rendering.
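For the curious, here is a small sketch of what the Theano path looks like, assuming a libgpuarray build with OpenCL enabled (the support is experimental). The device string "opencl0:0" names the first device on the first OpenCL platform, where "cuda0" would name a CUDA device:

```python
import os

# Theano reads its device selection from THEANO_FLAGS, which must be
# set before the first import. This assumes libgpuarray was built with
# OpenCL support; otherwise device selection falls back with an error.
os.environ["THEANO_FLAGS"] = "device=opencl0:0,floatX=float32"

import theano
import theano.tensor as T

x = T.vector("x")
f = theano.function([x], 2 * x)   # compiled for the selected device
print(f([1.0, 2.0, 3.0]))         # -> [2. 4. 6.]
```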
Reading up to this point should have provided an understanding of what goes on in the minds of Machine Learning researchers and computer enthusiasts. NVIDIA is in a very powerful position in the market due to a large number of factors, some of them explained above. In the long run, this situation does not promise good times for customers, which is essentially why this rant exists in the first place. AMD really needs to step up its game. Google, with its TPUs in the cloud, may also prove to be a strong competitor in the future, but for now, NVIDIA reigns supreme.