Fix: No GPU support in Tensorflow

I came across a problem where my TensorFlow installation did not recognize the installed GPU, despite CUDA and the NVIDIA drivers being installed properly.

The test:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

returned an empty list. Furthermore, it reported that it could not find the CUDA drivers:

2024-01-30 14:57:42.015454: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.

The output of the NVIDIA tools is correct and shows CUDA is installed:

nvidia-smi

ubuntu@ip-bla-foo:~/build-nb$  nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Which tells us it is CUDA version 12. Ahhh! 💡
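If you’d rather grab that release number programmatically than eyeball the banner, sed can pull it out. A minimal sketch: the sample line below is hardcoded from the output above, so it runs without a CUDA install; in a real run you would pipe nvcc --version into the same expression.

```shell
# Parse the CUDA release number out of nvcc's version banner.
# Sample line captured from the output above; replace with
# `nvcc --version |` in a real run.
nvcc_line='Cuda compilation tools, release 12.1, V12.1.105'
cuda_version=$(printf '%s\n' "$nvcc_line" | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
echo "$cuda_version"    # 12.1
```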

Now, CUDA 12 is a release from 2023, and my idea was that TensorFlow 2.13 might not support this version yet; see https://blog.tensorflow.org/2023/11/whats-new-in-tensorflow-2-15.html

Ok, the latest version pip offered on Python 3.8 was TF 2.13. Here is the fix:

  1. upgrade Python: sudo apt install python3.9
  2. a new venv: virtualenv --python /usr/bin/python3.9 ~/.env-python3.9
  3. source ~/.env-python3.9/bin/activate
  4. pip install --upgrade pip
  5. python3 -m pip install tensorflow[and-cuda]==2.15.0.post1

Test: python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2024-01-30 15:27:04.458720: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-30 15:27:04.458772: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-30 15:27:04.459601: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-30 15:27:04.465334: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-30 15:27:05.115551: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-30 15:27:05.560865: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-30 15:27:05.585883: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-30 15:27:05.586100: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Now we see the GPU in Tensorflow.

Git setup in Jenkins Pipeline

I’ll show how to set up Jenkins and SSH keys to clone a git repo in a build step of a Jenkins pipeline. This wasn’t straightforward at all: several obstacles were in the way and had to be removed.

I am using an Ubuntu 22.04 host, Jenkins 2.375.1, Jenkins Pipeline and docker based agents running Ubuntu as well.

SSH Setup for git clone

Pipeline is able to do the git clone itself, so we don’t need to hassle with running ‘git clone’ on the agent, which comes with its own problems: we would have to find a safe and secure way to put the SSH keys into the agent. Luckily, Pipeline can do the git checkout, and the key stays with the host.

Basically, what we need to do is generate ssh keys for the jenkins user on Ubuntu, distribute them correctly to the git server, and set up the credentials in Jenkins. Then we can add the git step in the Pipeline.

You might easily run into problems here, as it is a bit tricky to find the right settings. This is what worked for me, let’s start:

Create an SSH key under the jenkins user. The easiest way to do this is to log in as that user and create the keys in its home dir. First, give the jenkins user a password:

sudo passwd jenkins

Now you can login as this user:

su jenkins

This user’s home dir, JENKINS_HOME, is under /var/lib/jenkins/ (at least on Ubuntu 22.04; the ssh output below confirms the path). Make sure you are in this directory when you create your key.

ssh-keygen

You can leave the passphrase empty. It generates a private key and a public key file under /var/lib/jenkins/.ssh/. The .ssh folder and the files need the following permissions:

-rw-------  1 jenkins jenkins 2602 Dec  6 11:10 id_rsa
-rw-r--r--  1 jenkins jenkins  569 Dec  6 11:10 id_rsa.pub

which corresponds to 600 and 644. The .ssh folder itself has 700 (drwx------).
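To make the targets explicit, here is a scratch-directory sketch of the same keygen and permission layout. It uses throwaway paths from mktemp, not the real /var/lib/jenkins, and -N '' gives the empty passphrase non-interactively:

```shell
# Reproduce the expected .ssh layout in a scratch dir and verify the
# permission bits: 700 for .ssh, 600 private key, 644 public key.
demo=$(mktemp -d)
mkdir -m 700 "$demo/.ssh"
ssh-keygen -t rsa -N '' -f "$demo/.ssh/id_rsa" -q   # empty passphrase, no prompts
chmod 600 "$demo/.ssh/id_rsa"
chmod 644 "$demo/.ssh/id_rsa.pub"
stat -c '%a %n' "$demo/.ssh" "$demo/.ssh/id_rsa" "$demo/.ssh/id_rsa.pub"
```

ssh-keygen already creates the files with these modes; the chmod calls just make the requirement visible.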

Add the key to your git server:

ssh-copy-id git@yourgitserver

Test it:

ssh -vvv git@yourgitserver

If something goes wrong with your key, ssh will offer you password login and it looks like this:

debug3: authmethod_is_enabled publickey
debug1: Next authentication method: publickey
debug1: Offering public key: /var/lib/jenkins/.ssh/id_rsa RSA SHA256:*************************
debug3: send packet: type 50
debug2: we sent a publickey packet, wait for reply
debug3: receive packet: type 51
debug1: Authentications that can continue: publickey,password
debug1: Trying private key: /var/lib/jenkins/.ssh/id_ecdsa
debug3: no such identity: /var/lib/jenkins/.ssh/id_ecdsa: No such file or directory
debug1: Trying private key: /var/lib/jenkins/.ssh/id_ecdsa_sk
debug3: no such identity: /var/lib/jenkins/.ssh/id_ecdsa_sk: No such file or directory
debug1: Trying private key: /var/lib/jenkins/.ssh/id_ed25519
debug3: no such identity: /var/lib/jenkins/.ssh/id_ed25519: No such file or directory
debug1: Trying private key: /var/lib/jenkins/.ssh/id_ed25519_sk
debug3: no such identity: /var/lib/jenkins/.ssh/id_ed25519_sk: No such file or directory
debug1: Trying private key: /var/lib/jenkins/.ssh/id_xmss
debug3: no such identity: /var/lib/jenkins/.ssh/id_xmss: No such file or directory
debug1: Trying private key: /var/lib/jenkins/.ssh/id_dsa
debug3: no such identity: /var/lib/jenkins/.ssh/id_dsa: No such file or directory
debug2: we did not send a packet, disable method
debug3: authmethod_lookup password
debug3: remaining preferred: ,password
debug3: authmethod_is_enabled password
debug1: Next authentication method: password
git@victory's password: 

Check everything again, including the permissions of the files on the git server side, e.g. authorized_keys (spelling, and 600 / -rw-------).
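The same drill applies on the server side: with StrictModes (the sshd default), a key in an authorized_keys file that is too permissive is ignored. A scratch demo of the expected bits, with the mktemp directory standing in for the git user’s home:

```shell
# authorized_keys should be 600 (at least not group/world-writable),
# and the containing .ssh dir 700, or sshd will skip the key.
srv=$(mktemp -d)
mkdir -m 700 "$srv/.ssh"
touch "$srv/.ssh/authorized_keys"
chmod 600 "$srv/.ssh/authorized_keys"
stat -c '%a %n' "$srv/.ssh" "$srv/.ssh/authorized_keys"
```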

If things go well, it looks like this:

debug3: receive packet: type 60
debug1: Server accepts key: /var/lib/jenkins/.ssh/id_rsa RSA SHA256:*************************
debug3: sign_and_send_pubkey: using publickey-hostbound-v00@openssh.com with RSA SHA256:*************************
debug3: sign_and_send_pubkey: signing using rsa-sha2-512 SHA256:*************************
debug3: send packet: type 50
debug3: receive packet: type 52

...

debug2: shell request accepted on channel 0
Welcome to Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-56-generic x86_64)

Now you can set up an identity to use this key:

  1. Go to Dashboard – your project and click Configure.
  2. Click Pipeline Syntax (at the bottom), choose the Snippet Generator and pick git.
  3. Under Credentials there is a + button to add one.
  4. Choose SSH username with private key; leave the scope as it is. Set ID and Description according to what makes sense to you. Username: jenkins.
  5. Private key – enter directly. Here you paste the content of your ~/.ssh/id_rsa file (the private part). Caution here, this shouldn’t slip anywhere else, so sharpen your copy-and-paste skills.
  6. Click Add.

Git in the Pipeline

Now, when you also put your git repo there, you get right away the right git clone snippet for your Jenkins pipeline:

git credentialsId: 'jenkins-git', url: 'git@yourgitserver:/your-repo.git'

Embedded in the pipeline it can look like this:

pipeline {
    agent any

    stages {
        stage('Git Checkout') {
            steps {
                git credentialsId: 'jenkins-git', url: 'git@yourgitserver:/your-repo.git'
            }
        }
    }
}

Now, under Dashboard – Manage Jenkins – Manage Credentials you should see your key, and you can change it there.

Host key acceptance

In order to use git in a Jenkins Pipeline, you must also ensure the host key is accepted. I got errors in the clone step indicating that the host key is not known and cannot be accepted.

stdout: 
stderr: No ECDSA host key is known for victory and you have requested strict checking.

To fix this, you go to Dashboard – Manage Jenkins – Configure Global Security, look for Git Host Key Verification Configuration and change the strategy to Accept first connection.

Now you can build your project and it should be able to clone your git repo.

How to fix grub rescue on an AWS machine

Ubuntu on EC2 boots into grub rescue. No way to access it. Here is how I fixed it.

I recently upgraded one of my machines from Ubuntu Bionic to 20.04 Focal.
And of course, while the upgrade process went mostly smoothly, something went wrong. It kept asking whether I wanted to keep my current grub conf file, and I said yes (many times).

Which was wrong.

The machine eventually rebooted and never came back. No SSH and no other method offered by AWS allowed me to connect to it, not even the serial console.
But I was able to see a screenshot, which showed the machine had booted into the GRUB rescue system. There is no way to access it without some prior modification of the grub conf file.
If you want to set up your grub to be accessible via serial console, you can follow these steps: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/grub.html

But how can we access the file when it doesn’t want to boot?

The idea: a rescue system

I created a quick rescue system using Ubuntu on a T3.micro instance and mounted my broken system volume to /mnt/rescue. The idea was to look around and see what can be fixed.

And with chroot, it is possible to even re-install grub.

Before starting, I had to detach the volume of the bricked instance, and attach it to the newly created rescue system. Then, as root on the rescue system do this:

# see all available block devices:
lsblk

# For me, it showed nvme1n1 with a partition nvme1n1p1. Make sure you choose the correct partition! 

# we will need these 2 variables:
rescuedev=/dev/nvme1n1p1
rescuemnt=/mnt/rescue/

mkdir -p $rescuemnt

mount $rescuedev $rescuemnt

# mount all the special file systems:
for i in proc sys dev run; do mount --bind /$i $rescuemnt/$i ; done

# now jump inside:
chroot $rescuemnt

Verify we are inside the chroot

Jumping into a chroot is unspectacular. The prompt doesn’t change, and one AWS Ubuntu machine looks much like another. So I wanted a method to verify I am inside the chroot before doing anything potentially dangerous with grub.

#!/bin/bash
if [ "$(stat -c %d:%i /)" != "$(stat -c %d:%i /proc/1/root/.)" ]; then
  echo "We are chrooted!"
else
  echo "Business as usual"
fi

I found the script snippet here: https://unix.stackexchange.com/questions/14345/how-do-i-tell-im-running-in-a-chroot

The solution

Now we are virtually inside the broken machine. I changed my grub conf file as explained in the link above, but it turns out that wasn’t even necessary. The one step that did the trick was:

grub-install /dev/nvme1n1

Now, after re-attaching the volume to the bricked machine, it booted normally and I was able to ssh into it.

Work with large numbers of files in a folder

A few shell commands to list, sort or move files in a large folder.

I have a component that writes a log file every time a task is executed, and they all go to the same folder. ls or Midnight Commander take a long time on that folder. I’d like to move the old files out of the way, e.g. to a folder called, well, old. Or I want to delete them. Also, I’d like to list the most recent files, maybe the largest of today, or just the last file that was written. Here are a few commands that work on my Ubuntu bash:
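One wish from the list above, the last file that was written, is not covered by the find commands, but GNU ls can do it: sort by mtime, newest first, and take the first entry. A self-contained sketch with throwaway files, timestamps fabricated for the example:

```shell
# List only the most recently written file in a directory.
dir=$(mktemp -d)
touch -d '1 hour ago' "$dir/older.log"
touch "$dir/newest.log"
ls -t "$dir" | head -n 1    # newest.log
```

Note that ls -t still has to stat every file, so on a truly huge folder this takes as long as a plain listing.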

find /data/logs/ -maxdepth 1 -mtime +30 -type f -exec mv "{}" old/ \;

Finds the files older than 30 days and moves them to a folder old/. Note that old/ is resolved relative to the current working directory, not to /data/logs/, so run this from inside /data/logs/ or give an absolute path.

find /data/logs/ -maxdepth 1 -type f -mtime -1

Lists the files no more than 1 day old.

find /data/logs/ -maxdepth 1 -type f -mtime -1 -exec du -h "{}" \; | sort -hr | head -n 10

Finds the files no older than 1 day, sorts them by size, and shows the largest 10. (sort -h understands the human-readable sizes that du -h prints; plain sort -n would misorder e.g. 4.0K vs 2M.)
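The find invocations above can be exercised end to end in a scratch directory. This sketch fabricates one stale and one fresh log with touch -d and then applies the same age-based move, with old/ given as an absolute path so it works from any current directory:

```shell
# Demo of the age-based move: one 40-day-old file, one fresh file.
logs=$(mktemp -d)
mkdir "$logs/old"
touch -d '40 days ago' "$logs/task-old.log"
touch "$logs/task-new.log"

# Same command as above; -mtime +30 matches only the stale file.
find "$logs" -maxdepth 1 -mtime +30 -type f -exec mv "{}" "$logs/old/" \;

ls "$logs/old"                                              # task-old.log
find "$logs" -maxdepth 1 -type f -mtime -1 -printf '%f\n'   # task-new.log
```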