How to fix grub rescue on an AWS machine

Ubuntu on EC2 boots into grub rescue. No way to access it. Here is how I fixed it.

I recently upgraded one of my machines from Ubuntu 18.04 Bionic to 20.04 Focal.
And of course, while the upgrade process went mostly smoothly, something went wrong. It kept asking me whether I wanted to keep my current grub conf file, and I said yes (many times).

Which was wrong.

The machine eventually rebooted and never came back. Neither SSH nor any other method offered by AWS let me connect to it, not even the serial console.
But I was able to see an instance screenshot, and it showed that the machine had booted into the Grub rescue system. There is no way to access that without some prior modification to the grub conf file.
If you want to set up your grub to be accessible via the serial console, you can follow these steps: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/grub.html
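For reference, the changes in that guide boil down to enabling a serial terminal in the grub defaults and regenerating the config. Roughly something like the following (the exact file and values depend on your distro, so treat this as a sketch and follow the AWS doc):

# in /etc/default/grub (on Ubuntu cloud images possibly a file under /etc/default/grub.d/):
GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --speed=115200"

# then regenerate grub.cfg:
update-grub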

But how can we access that file when the machine doesn’t even want to boot?

The idea: a rescue system

I created a quick rescue system using Ubuntu on a t3.micro instance and mounted the broken system’s volume at /mnt/rescue. The idea was to look around and see what could be fixed.

And with chroot, it is even possible to re-install grub.

Before starting, I had to detach the volume from the bricked instance and attach it to the newly created rescue system.
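This can be done in the AWS console or with the AWS CLI; roughly like this (the volume and instance IDs are placeholders, and the broken instance should be stopped first):

# detach the root volume from the (stopped) broken instance:
aws ec2 detach-volume --volume-id vol-0123456789abcdef0

# attach it to the rescue instance as a secondary device:
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
  --instance-id i-0123456789abcdef0 --device /dev/sdf

Then, as root on the rescue system, do this: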

# see all available block devices:
lsblk

# For me, it showed nvme1n1 with a partition nvme1n1p1. Make sure you choose the correct partition! 

# we will need these two variables:
rescuedev=/dev/nvme1n1p1
rescuemnt=/mnt/rescue

mkdir -p $rescuemnt

mount $rescuedev $rescuemnt

# bind-mount the special file systems so the chroot behaves like a real system:
for i in proc sys dev run; do mount --bind /$i $rescuemnt/$i ; done

# now jump inside:
chroot $rescuemnt

Verify we are in the chroot

Jumping into a chroot is unspectacular. The prompt doesn’t change, and one AWS Ubuntu machine looks much like another. So I wanted a way to verify that I was inside the chroot before doing anything potentially dangerous with grub.

#!/bin/bash
if [ "$(stat -c %d:%i /)" != "$(stat -c %d:%i /proc/1/root/.)" ]; then
  echo "We are chrooted!"
else
  echo "Business as usual"
fi

I found the script snippet here: https://unix.stackexchange.com/questions/14345/how-do-i-tell-im-running-in-a-chroot
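To try it out, save the script somewhere that is visible both on the rescue system and inside the chroot; the path and filename below are just examples:

# saved as /root/check-chroot.sh on the mounted volume, i.e. $rescuemnt/root/check-chroot.sh

# outside the chroot it prints "Business as usual":
bash $rescuemnt/root/check-chroot.sh

# inside the chroot it prints "We are chrooted!":
chroot $rescuemnt bash /root/check-chroot.sh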

The solution

Now we are virtually inside the broken machine. I changed my grub conf file as explained in the link above, but it turned out that wasn’t even necessary. The one step that did the trick was:

grub-install /dev/nvme1n1
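Before detaching the volume from the rescue instance, it is worth exiting the chroot and unmounting everything cleanly. A sketch mirroring the mounts from earlier (the update-grub step is optional, since grub-install alone did the trick):

# optionally regenerate the grub config while still inside the chroot:
update-grub

# leave the chroot:
exit

# back on the rescue system, unmount the bind mounts and the volume:
for i in proc sys dev run; do umount $rescuemnt/$i ; done
umount $rescuemnt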

After I re-attached the volume to the bricked machine, it booted normally and I was able to SSH into it.