The main issue with a GPU-accelerated TensorFlow installation is the myriad of compatibility problems. The approach most often proposed online is to use a Docker image, but the image didn't work for me and took up too much space, so I discarded that idea (mostly because of the space constraints); I will return to it later, during the production phase. The core constraint is that the TensorFlow version must be compatible with the CUDA version installed.
TensorFlow 2.3.1 needs CUDA 10 or above and an NVIDIA driver of 450 or above, preferably nvidia-455.
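The TensorFlow site publishes a "tested build configurations" table pairing each release with a specific CUDA/cuDNN version. A small lookup table makes the constraint concrete; the values below are from memory of that page, so verify them against the site before relying on them:

```python
# Tested TF release -> (CUDA, cuDNN) pairs, per the "tested build
# configurations" table on tensorflow.org (verify against the site;
# values here are from the 2.x releases current at the time of writing).
TESTED_CONFIGS = {
    "2.1": ("10.1", "7.6"),
    "2.2": ("10.1", "7.6"),
    "2.3": ("10.1", "7.6"),
    "2.4": ("11.0", "8.0"),
}

def required_cuda(tf_version: str) -> str:
    """Return the CUDA version a TensorFlow release was built against."""
    major_minor = ".".join(tf_version.split(".")[:2])
    return TESTED_CONFIGS[major_minor][0]

print(required_cuda("2.3.1"))  # 10.1
```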
These are the steps to get a working GPU-accelerated TensorFlow environment on a Debian-based system.
1. Purge nvidia drivers
sudo apt remove --purge "*nvidia*"
2. Install latest Nvidia drivers
sudo apt install nvidia-driver-455
Check your GPU and CUDA version
nvidia-smi
Or you can skip this step if you are installing the older nvidia-450 drivers in step #4 below.
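If you want to pick the driver and CUDA versions out of the nvidia-smi header programmatically, a couple of regexes do it. This sketch runs against a hard-coded sample line rather than calling the binary; the real layout varies slightly between driver versions:

```python
import re

# A sample nvidia-smi header line (hard-coded for illustration;
# actual output differs slightly between driver versions).
sample = ("| NVIDIA-SMI 455.23.04    Driver Version: 455.23.04    "
          "CUDA Version: 11.1     |")

driver = re.search(r"Driver Version:\s*([\d.]+)", sample).group(1)
cuda = re.search(r"CUDA Version:\s*([\d.]+)", sample).group(1)
print(driver, cuda)  # 455.23.04 11.1
```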
3. Create a virtual environment to contain the TensorFlow installation
pip install virtualenv
cd ~
python3 -m venv tf-env
source tf-env/bin/activate
Replace tf-env with the name of your choice. This creates a directory structure that will contain all the Python packages, so it's best to create it on a drive with plenty of free space (a venv's scripts hard-code its path, so if you ever need it elsewhere it is easier to recreate it than to move it).
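Before installing anything into the environment, it's worth confirming it is actually active. A quick check that needs no extra packages:

```python
import sys

# Inside an activated venv, sys.prefix points into the environment
# directory, while sys.base_prefix still points at the system Python.
in_venv = sys.prefix != sys.base_prefix
print("virtualenv active:", in_venv)
```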
4. Install CUDA following the recommendations from tensorflow website
Trying to install CUDA independently from the NVIDIA website will break things in every possible way. I have tried all the likely combinations: CUDA 11.1 with the TensorFlow nightly build, CUDA 10.1 with stable TensorFlow. Something always breaks. The best method is to follow the install instructions on the TensorFlow website to the letter.
https://www.tensorflow.org/install/gpu
The only exception is that I didn’t install the older nvidia-450 drivers. I kept the newer nvidia-455 driver.
5. Make sure all links are working
Make sure there’s a symlink from cuda to the actual CUDA installation in /usr/local
$ ls -l /usr/local/
lrwxrwxrwx 1 root root 9 Oct 9 17:21 cuda -> cuda-11.1
drwxr-xr-x 14 root root 4096 Oct 9 17:21 cuda-11.1
$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64
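The export above replaces any existing LD_LIBRARY_PATH. A safer variant prepends the CUDA directory to whatever is already set, and can go in ~/.bashrc so it survives new shells (the path assumes the symlinked layout shown above):

```shell
# Prepend the CUDA lib dir, keeping anything already on the path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
```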
6. Install tensorflow
Start virtualenv if not in it already
$ source tf-env/bin/activate
And then install tensorflow
(tf-env) $ pip install tensorflow
If you have already installed the nightly (unstable) build from step #4 above, it is better to uninstall it first with
(tf-env) $ pip uninstall tf-nightly
7. Test tensorflow
(tf-env) $ python
>>> import tensorflow as tf
2020-10-09 18:24:57.371340: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> tf.__version__
'2.3.1'
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
All seems to be running OK.
8. Set up the virtualenv kernel in Jupyter
While in the virtual environment, install ipykernel
(tf-env) $ pip install ipykernel
Add the current virtual environment to Jupyter
(tf-env) $ python -m ipykernel install --user --name=tf-env
tf-env will show up in the list of Jupyter kernels. The Jupyter kernel name can be anything; I kept it the same for consistency.
You can find the Jupyter kernels in ~/.local/share/jupyter/kernels
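Each kernel there is a directory containing a kernel.json spec. The one ipykernel writes looks roughly like this (the interpreter path is illustrative and will match your venv's location):

```json
{
  "argv": [
    "/home/user/tf-env/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "tf-env",
  "language": "python"
}
```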
Test TensorFlow GPU support in Jupyter
(tf-env) $ jupyter notebook
import tensorflow as tf
tf.config.experimental.list_physical_devices()
tf.config.list_physical_devices()
tf.test.gpu_device_name()
Note: TensorFlow GPU detection in Jupyter only works when Jupyter is run from within the virtual environment. Running Jupyter outside the virtualenv will not work even if the virtualenv kernel (tf-env) is chosen over the regular system Python kernel.
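A quick way to check which interpreter a notebook is really using (and therefore which tensorflow it will import) is to print the kernel's Python path in the first cell:

```python
import sys

# Should point into tf-env/bin when the venv kernel is actually in use
print(sys.executable)
```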