Fully Autonomous Containerized Deployment, part 2

Building upon the previous article, let’s review what we’ve built so far:

  1. Deployed a self-managing container OS based on Fedora CoreOS
    • Configured Fedora CoreOS to automatically update itself
    • Set up intelligent health checks and automated rollbacks for any failed updates

I’m going to tackle this first from an upstream approach with Fedora CoreOS; however, there’s an excellent article written on this same topic for RHEL for Edge by Brian Smith that covers it incredibly well. I’d encourage everyone to give his article a read as well!

Right, so let’s tackle this last bullet point. For this, we’re going to install a tool called greenboot onto the Fedora CoreOS image. The technical term for this is layering, as we’re building packages on top of the immutable base ostree image.
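
For context, layering a package by hand is a single rpm-ostree command; the systemd unit we’ll embed below essentially automates this same step at first boot:

$ sudo rpm-ostree install greenboot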

There’s some work upstream to make this a little friendlier, but for now we can use a systemd unit file to install greenboot and embed it into our Ignition configuration. Here’s my completed example:

variant: fcos
version: 1.4.0
storage:
  disks:
    - device: /dev/vda
      wipe_table: false
      partitions:
        - label: root
          number: 4
          size_mib: 10240
          resize: true
  files:
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: coreos.calgaryrhce.ca
    - path: /etc/NetworkManager/system-connections/enp1s0.nmconnection
      mode: 0600
      overwrite: true
      contents:
        inline: |
          [connection]
          type=ethernet
          interface-name=enp1s0

          [ipv4]
          method=manual
          addresses=192.168.100.50/24
          gateway=192.168.100.1
          dns=192.168.100.1
          dns-search=calgaryrhce.ca
    - path: /etc/zincati/config.d/51-rollout-wariness.toml
      contents: 
        inline: |
          [identity]
          rollout_wariness = 0.5
    - path: /etc/zincati/config.d/55-updates-strategy.toml
      contents: 
        inline: |
          [updates]
          strategy = "periodic"
          [[updates.periodic.window]]
          days = [ "Sat", "Sun" ]
          start_time = "22:30"
          length_minutes = 60
systemd:
  units:
    - name: rpm-ostree-install-greenboot.service
      enabled: true
      contents: |
        [Unit]
        Description=Layer greenboot with rpm-ostree
        Wants=network-online.target
        After=network-online.target
        # We run before `zincati.service` to avoid conflicting rpm-ostree
        # transactions.
        Before=zincati.service
        ConditionPathExists=!/var/lib/%N.stamp

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        # `--allow-inactive` ensures that rpm-ostree does not return an error
        # if the package is already installed. This is useful if the package is
        # added to the root image in a future Fedora CoreOS release as it will
        # prevent the service from failing.
        ExecStart=/usr/bin/rpm-ostree install --apply-live --allow-inactive greenboot greenboot-default-health-checks zsh
        ExecStart=/bin/touch /var/lib/%N.stamp

        [Install]
        WantedBy=multi-user.target
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - "[ public SSH key hash ]"
    - name: aludwar
      password_hash: "[ password hash, salted ]"
      ssh_authorized_keys:
        - "[ public SSH key hash ]"
      groups: [ sudo, docker ]

The line to pay extra attention to here is the ExecStart specifying the rpm-ostree install command. This layers greenboot and a set of default health checks we can use to make sure the system boots up into a healthy, functioning state, plus zsh, which we’ll use later on to demonstrate a rollback.
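
As a reminder, the Butane config needs to be transpiled into an Ignition file before it can be used. Assuming it’s saved as autonomous.bu (the filename here is just my choice):

$ butane --pretty --strict autonomous.bu > autonomous.ign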

After modifying our config above and creating a new VM, you will see a second deployment listed when running ‘rpm-ostree status’, showing the layered packages you’ve added. The ‘--apply-live’ flag we used in the rpm-ostree install command applies the changes to the running system right away and persists them. You can also reboot into the new, modified image with ‘systemctl reboot’. The dot next to a deployment indicates the one that is currently active and booted. Here’s my end result:

[aludwar@coreos ~]$ rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Thu 2022-07-07 21:22:43 UTC)
Deployments:
● fedora:fedora/x86_64/coreos/stable
                   Version: 36.20220618.3.1 (2022-07-05T23:09:41Z)
                BaseCommit: 474557e51b1013d4e737e0fd41f4e3d482546e3615a2480a3b34bb186a8ada94
              GPGSignature: Valid signature by 53DED2CB922D8B8D9E63FD18999F7CBF38AB71F4
           LayeredPackages: greenboot greenboot-default-health-checks zsh

  fedora:fedora/x86_64/coreos/stable
                   Version: 36.20220618.3.1 (2022-07-05T23:09:41Z)
                    Commit: 474557e51b1013d4e737e0fd41f4e3d482546e3615a2480a3b34bb186a8ada94
              GPGSignature: Valid signature by 53DED2CB922D8B8D9E63FD18999F7CBF38AB71F4
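
You can also confirm the live apply worked without rebooting by querying the layered packages directly; a quick sanity check:

$ rpm -q greenboot greenboot-default-health-checks zsh
$ zsh --version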

Now that we’ve got greenboot installed, let’s use one of the example health checks to test a successful boot condition, and a custom script to test an unsuccessful boot triggering a rollback. You can dive deeper into the greenboot documentation for how exactly this works, but I’ll provide a fast track example below.

The example health checks live in the /usr/lib/greenboot/check directory. There are two categories of checks we can do: ones that MUST NOT FAIL (required) for a successful boot, and ones that MAY FAIL (wanted). There are corresponding directories to place these health checks in (a minimal sketch of a wanted check follows the tree below):

/etc
└── greenboot
    ├── check
    │   ├── required.d
    │   └── wanted.d
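
To illustrate the MAY FAIL category before we move on, here’s a minimal, hypothetical wanted check; a non-zero exit from anything in wanted.d gets logged but won’t fail the boot (the chronyd test is just an arbitrary example of a non-critical service):

#!/bin/bash
# /etc/greenboot/check/wanted.d/50_chronyd_check.sh (hypothetical example)
# A failure here is logged, but does not mark the boot as failed.
if systemctl is-active --quiet chronyd; then
    echo "chronyd is running"
    exit 0
else
    echo "chronyd is not running (non-critical)"
    exit 1
fi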

I’ll copy the example ‘/usr/lib/greenboot/check/required.d/01_repository_dns_check.sh’ check into the required.d directory. This check essentially makes sure the networking on the host is functional and that the host can resolve the external domains it needs to update itself. This is also where I’ll enable the required greenboot services:

$ sudo cp /usr/lib/greenboot/check/required.d/01_repository_dns_check.sh /etc/greenboot/check/required.d/
$ sudo systemctl enable greenboot-task-runner greenboot-healthcheck greenboot-status greenboot-loading-message

You can try a test run of the script to see the output before attempting a reboot:

$ /etc/greenboot/check/required.d/01_repository_dns_check.sh
All domains have resolved correctly

So with that, let’s reboot and see what we get. After rebooting, you’ll need to SSH into the host to see the status message:

$ ssh 192.168.100.50
Fedora CoreOS 36.20220618.3.1
Boot Status is GREEN - Health Check SUCCESS
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/tag/coreos

[aludwar@coreos ~]$ 
[aludwar@coreos ~]$ sudo systemctl status greenboot-status
● greenboot-status.service - greenboot MotD Generator
     Loaded: loaded (/usr/lib/systemd/system/greenboot-status.service; enabled; vendor preset: disabled)
     Active: active (exited) since Thu 2022-07-14 16:27:35 UTC; 44s ago
    Process: 973 ExecStart=/usr/libexec/greenboot/greenboot-status (code=exited, status=0/SUCCESS)
   Main PID: 973 (code=exited, status=0/SUCCESS)
        CPU: 22ms

Jul 14 16:27:35 coreos.ludwar.ca systemd[1]: Starting greenboot-status.service - greenboot MotD Generator...
Jul 14 16:27:35 coreos.ludwar.ca greenboot-status[979]: Boot Status is GREEN - Health Check SUCCESS
Jul 14 16:27:35 coreos.ludwar.ca systemd[1]: Finished greenboot-status.service - greenboot MotD Generator.

Looks good! Boot status is green. Now, let’s test a boot failure by creating a script with a failing condition. I’m going to borrow Brian’s test scripts from the RHEL for Edge article, but change one to test for zsh. We’ll place this script in the /etc/greenboot/check/required.d/ directory as zsh.sh:

#!/bin/bash

if [ -x /usr/bin/zsh ]; then
    echo "zsh shell found, check passed!"
    exit 0
else
    echo "zsh shell not found, check failed!"
    exit 1
fi
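
Like the other checks, this script needs to be executable, and we can give it a quick manual run the same way we tested the DNS check:

$ sudo chmod +x /etc/greenboot/check/required.d/zsh.sh
$ sudo /etc/greenboot/check/required.d/zsh.sh
zsh shell found, check passed!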

Next is the bootfail.sh script, which will help highlight the rollback behaviour. We’ll place this one in the /etc/greenboot/red.d/ directory as bootfail.sh:

#!/bin/bash

echo "greenboot detected a boot failure" >> /var/roothome/greenboot.log
date >> /var/roothome/greenboot.log
grub2-editenv list | grep boot_counter >> /var/roothome/greenboot.log
echo "----------------" >> /var/roothome/greenboot.log
echo "" >> /var/roothome/greenboot.log

Now with that in place, let’s test a rollback by removing zsh from the host. What should happen is this: on the next boot, greenboot will run the zsh.sh check, which will fail, and the host will reboot. It will repeat this three times before marking the boot as failed and will then execute a rollback. Brian’s bootfail.sh script will help us capture that, as the whole process happens quickly during boot.
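
Incidentally, the boot_counter that bootfail.sh greps for lives in the GRUB environment, so you can inspect it yourself at any time; once greenboot has armed the counter, you should see an entry like boot_counter=2:

$ sudo grub2-editenv list | grep boot_counter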

[aludwar@coreos ~]$ sudo rpm-ostree uninstall zsh
...
Removed:
  zsh-5.8.1-1.fc36.x86_64
Changes queued for next boot. Run "systemctl reboot" to start a reboot
[aludwar@coreos ~]$ sudo systemctl reboot

After a minute or two, let’s go back into the host and see what occurred. Checking the status of greenboot-status, we see a log line showing that a fallback boot was detected and that an rpm-ostree rollback was executed as a result:

[aludwar@coreos ~]$ systemctl status greenboot-status
● greenboot-status.service - greenboot MotD Generator
   Loaded: loaded (/usr/lib/systemd/system/greenboot-status.service; enabled; vendor preset: enabled)
   Active: active (exited) since Thu 2022-07-14 11:07:18 EDT; 25s ago
  Process: 1152 ExecStart=/usr/libexec/greenboot/greenboot-status (code=exited, status=0/SUCCESS)
 Main PID: 1152 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 23432)
   Memory: 0B
   CGroup: /system.slice/greenboot-status.service

Jul 14 11:07:18 coreos.test.ludwar.ca systemd[1]: Starting greenboot MotD Generator...
Jul 14 11:07:18 coreos.test.ludwar.ca greenboot-status[1159]: Boot Status is GREEN - Health Check SUCCESS
Jul 14 11:07:18 coreos.test.ludwar.ca greenboot-status[1159]: FALLBACK BOOT DETECTED! Default rpm-ostree deployment has been rolled back.
Jul 14 11:07:18 coreos.test.ludwar.ca systemd[1]: Started greenboot MotD Generator.

Checking the bootfail.sh script’s output, we see three failed boot-and-check attempts, each decrementing the boot_counter:

[root@coreos ~]# cat /var/roothome/greenboot.log 
greenboot detected a boot failure
Thu Jul 14 11:06:40 EDT 2022
boot_counter=2
----------------

greenboot detected a boot failure
Thu Jul 14 11:06:52 EDT 2022
boot_counter=1
----------------

greenboot detected a boot failure
Thu Jul 14 11:07:05 EDT 2022
boot_counter=0
----------------

And when we check rpm-ostree status, we see a new entry for the image that had zsh removed, but the active image is the previous working image, which still has zsh installed:

[aludwar@coreos ~]$ rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Thu 2022-07-14 16:20:51 UTC)
Deployments:
● fedora:fedora/x86_64/coreos/stable
                   Version: 36.20220618.3.1 (2022-07-05T23:09:41Z)
                BaseCommit: 474557e51b1013d4e737e0fd41f4e3d482546e3615a2480a3b34bb186a8ada94
              GPGSignature: Valid signature by 53DED2CB922D8B8D9E63FD18999F7CBF38AB71F4
           LayeredPackages: greenboot greenboot-default-health-checks zsh

  fedora:fedora/x86_64/coreos/stable
                   Version: 36.20220618.3.1 (2022-07-05T23:09:41Z)
                BaseCommit: 474557e51b1013d4e737e0fd41f4e3d482546e3615a2480a3b34bb186a8ada94
              GPGSignature: Valid signature by 53DED2CB922D8B8D9E63FD18999F7CBF38AB71F4
           LayeredPackages: greenboot greenboot-default-health-checks

Perfect. So at this point we have set up and confirmed intelligent health checks and automated rollbacks for failed image updates!
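
As an optional bit of housekeeping, the zsh-less deployment is still sitting in the rollback slot; if you’re happy with the current image, rpm-ostree can discard it (the -r flag removes the rollback deployment):

$ sudo rpm-ostree cleanup -r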

Additional Resources:

Another excellent article on customizing Fedora CoreOS for a specific application is this article from developers.redhat.com.