During a demo at Hannover Messe 2023, the OpenShift cluster hosting the OAI-based 5GC showed no signs of life. With the help of my colleague Sagar, we managed to bring it back, and I am documenting the steps below. Note that the steps below do not follow the original debugging order; they are reordered to provide useful pointers for the future.

The whole problem would probably not have appeared if the cluster had had an internet connection and if all certificates had been valid. For isolation purposes, and since internet access at the conference might not be good, we did not give the cluster internet access. However, after a hard shutdown, it tried to download images from the internet because of certificate errors, and so did not manage to start.

Preceding notes

The following is what should be checked to make sure the cluster is in good health. This should be checked before making any other changes.

First, run

oc get nodes -o wide

This should show the status Ready for every node. If not, check the details using oc describe nodes. Furthermore, check the health of the cluster operators with

oc get clusteroperators.config.openshift.io

All operators should be available (AVAILABLE True) and neither progressing nor degraded (PROGRESSING False, DEGRADED False).
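With many nodes or operators, the two tables are easier to read when filtered down to the unhealthy entries. The following is a minimal sketch, not part of what we ran at the time: the function names are made up for illustration, and the column positions assume the default table layout printed by oc (an operator with an empty VERSION column would shift the fields).

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads the table printed by `oc get nodes` on stdin.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Print the names of cluster operators that are not in the healthy state
# AVAILABLE=True, PROGRESSING=False, DEGRADED=False.
# Reads the table printed by `oc get clusteroperators` on stdin and assumes
# the default column order NAME VERSION AVAILABLE PROGRESSING DEGRADED ...
unhealthy_operators() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Usage (requires a logged-in oc client):
#   oc get nodes | not_ready_nodes
#   oc get clusteroperators.config.openshift.io | unhealthy_operators
```

Empty output from both filters means that every node is Ready and every operator is available and settled.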

Assuming this is the case, make sure that all certificates are approved. The following line should show that all certificate signing requests are approved:

oc get csr

If not, proceed with the next section.

No logs using `oc logs`

While bringing back the cluster, there were errors like the following:

$ oc logs ubi
Error from server: Get "https://172.21.16.105:10250/containerLogs/oai5gcn/ubi/ubi": remote error: tls: internal error

For some reason, the cluster could not serve any logs. Checking the certificate signing requests in the cluster showed many pending ones:

$ oc get csr
NAME        AGE   SIGNERNAME                      REQUESTOR                                     REQUESTEDDURATION   CONDITION
csr-284dw   31h   kubernetes.io/kubelet-serving   system:node:xyz.some.network.eurecom.fr       <none>              Pending
...

Basically, the certificates had to be approved. This was done using the following line:

oc get csr -o name | xargs oc adm certificate approve

which approved all of them. Certificates have to be approved in order to allow the cluster to pull images from its registry. Afterwards, the CONDITION of each should be Approved,Issued.
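The line above passes every request to oc adm certificate approve, including those that are already approved. A slightly more selective variant approves only the pending ones. This is a sketch under the assumption that CONDITION is the last column of the table, as in the listing above; the function name pending_csrs is made up for illustration.

```shell
# Print the names of certificate signing requests whose CONDITION
# (the last column) is exactly "Pending".
# Reads the table printed by `oc get csr` on stdin.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage (requires a logged-in oc client; -r is the GNU xargs flag that
# skips the command when there is no input):
#   oc get csr | pending_csrs | xargs -r oc adm certificate approve
```

It is worth running oc get csr once more afterwards, since approving one request can lead to new pending ones (for example, a kubelet serving certificate after its client certificate).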

Giving OpenShift the images for the cluster

While providing internet access during Hannover Messe allowed us to bring back the cluster, the 5GC did not start:

$ oc get pods
NAME                             READY   STATUS             RESTARTS       AGE
iperf-pod                        1/1     Running            1              16h
mysql-5dd5d7db4c-fgxjm           1/1     Running            1              16h
oai-amf-7dd447b89c-kfrb9         1/2     ImagePullBackOff   3              16h
oai-ausf-75944f666b-dgmd8        1/2     ImagePullBackOff   2              16h
oai-nrf-f47894f8-6kk2k           1/2     ImagePullBackOff   2              16h
oai-smf-79bc5776df-2ndc6         1/2     ImagePullBackOff   3              16h
oai-spgwu-tiny-6d97f7778-nfsp4   1/2     CrashLoopBackOff   8 (5m8s ago)   16h
oai-udm-7bd5595c77-lcsqx         1/2     ImagePullBackOff   2              16h
oai-udr-675dcf984f-mrch7         2/2     Running            2              16h

Note that at that point, it should have been enough to approve the certificates, but we had not done this at that time. Instead, we manually enabled OpenShift to deploy the 5GC.

In particular, oc get istag listed all images, but the values.yaml of the 5GC specified imagePullPolicy: IfNotPresent. Since certificates were not approved, the cluster could no longer pull images from its registry; it kept trying to pull them again, which failed.
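To see at a glance which pods are stuck on an image pull, the pod listing can be filtered by its STATUS column. This is a small sketch with a made-up function name, assuming the default column order NAME READY STATUS of oc get pods.

```shell
# Print the names of pods whose STATUS column equals the given status.
# $1: status to look for, e.g. ImagePullBackOff
# Reads the table printed by `oc get pods` on stdin.
pods_with_status() {
  awk -v status="$1" 'NR > 1 && $3 == status { print $1 }'
}

# Usage (requires a logged-in oc client):
#   oc get pods | pods_with_status ImagePullBackOff
```

For each name printed, oc describe pod shows which image could not be pulled and the related events.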

The solution was to start the images individually and manually, using a YAML file test.yaml like the one below:

apiVersion: v1
kind: Pod
metadata:
  name: ubi
spec:
  containers:
    - name: ubi
      image: oai-ausf:v1.5.0
      command: ["/bin/sh", "-c"]
      args:
        ["sleep inf"]

This specifies a pod ubi that uses the container image that was missing. It was started using oc create -f test.yaml, resulting in

$ oc get pods
NAME                     READY   STATUS             RESTARTS   AGE
ubi                      1/1     Running            0          33s

Removing the pod with oc delete -f test.yaml and repeating this process for all other pods that failed with ImagePullBackOff allowed us to bring up the complete 5GC. I could then connect the gNB, and my demo phone had a PDU session. It worked again!

The probable reason why this manual intervention worked is that the permissions for deploying pods manually and for deploying them from a helm chart differ. OpenShift could find the images and deploy the ubi pods above; with the automatic helm chart deployment, however, images were not found unless they had been started manually once.
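Editing test.yaml by hand for every image gets tedious. The manifest can also be generated per image, as in the following sketch. The function name warmup_manifest and its optional second argument are made up for illustration; the manifest itself is the one shown above, and oai-ausf:v1.5.0 is the only image name taken from our setup.

```shell
# Print a throwaway pod manifest that runs the given image with a sleep.
# $1: image reference, e.g. oai-ausf:v1.5.0
# $2: pod name (optional, defaults to "ubi" as in the manifest above)
warmup_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${2:-ubi}
spec:
  containers:
    - name: ${2:-ubi}
      image: $1
      command: ["/bin/sh", "-c"]
      args: ["sleep inf"]
EOF
}

# Usage (requires a logged-in oc client):
#   warmup_manifest oai-ausf:v1.5.0 | oc create -f -
#   oc get pods          # wait until the pod is Running
#   oc delete pod ubi
```

Repeating the three usage lines with the image of each pod stuck in ImagePullBackOff reproduces the manual procedure described above.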

If all else fails

During Hannover Messe, the first thing we did was to identify the problem. Note that what should be done first is checking the health of the cluster by checking node and cluster operator health. However, if everything else fails, you can check the system services that run OpenShift:

systemctl status kubelet              # Kubernetes node agent
systemctl status crio                 # container runtime
systemctl status kubens               # Manages a mount namespace
systemctl status rpm-ostreed          # rpm-ostree System Management Daemon
systemctl status nodeip-configuration # Write IP address configuration for Kubernetes

While going through these at Hannover Messe, nodeip-configuration showed

avril 18 08:14:52 xyz.some.network.eurecom.fr bash[2539]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:156a06636a074e36238556f1a7e8c40ef252b168ed>
avril 18 08:15:52 xyz.some.network.eurecom.fr bash[2539]: time="2023-04-18T08:15:52Z" level=warning msg="Failed, retrying in 1s ... (1/3). Error: initializing source doc>
avril 18 08:16:53 xyz.some.network.eurecom.fr bash[2539]: time="2023-04-18T08:16:53Z" level=warning msg="Failed, retrying in 1s ... (2/3). Error: initializing source doc>
avril 18 08:17:54 xyz.some.network.eurecom.fr bash[2539]: time="2023-04-18T08:17:54Z" level=warning msg="Failed, retrying in 1s ... (3/3). Error: initializing source doc>
avril 18 08:18:55 xyz.some.network.eurecom.fr bash[2539]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:156a06636a074e3623855>

Oops, it tried to download images but did not have an internet connection.
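The service checks above can be condensed into a loop that prints one line per unit. This is a sketch, not what we ran on site: the function name is made up, and it relies on systemctl is-active, which prints the unit state and returns non-zero when the unit is not active.

```shell
# Print the state of each system service listed above, one per line.
# A unit that is not "active" is not necessarily broken (a one-shot unit
# may have finished); the output is only a pointer for where to look next.
check_services() {
  for unit in kubelet crio kubens rpm-ostreed nodeip-configuration; do
    # "|| true" keeps the loop going when a unit is not active
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    printf '%-22s %s\n' "$unit" "${state:-unknown}"
  done
}

# Usage (on the node itself):
#   check_services
```

For any unit that looks suspicious, systemctl status and journalctl -u with the unit name give the details, such as the failed image pull shown above.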

Providing Internet

The cluster was hard-coded to use a specific gateway for internet access. In order to change as little as necessary on the cluster, I reconfigured my laptop to provide that gateway address, and shared my internet connection from the laptop:

sudo sysctl net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -o usb0 -j MASQUERADE
sudo iptables -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
sudo iptables -A FORWARD -i enx0050b6e96af2 -o usb0 -j ACCEPT

where usb0 was the interface of the tethered connection from my phone, and enx0050b6e96af2 the USB-to-Ethernet adapter connected to the network with the cluster. Afterwards, the cluster was able to download images.
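Since these rules stay in place until they are removed or the laptop reboots, it can be handy to generate the matching delete commands as well. The following sketch only prints the commands so that they can be reviewed before running them; the function name nat_rules and its arguments are made up for illustration.

```shell
# Print the iptables commands that add (-A) or delete (-D) the rules above.
# $1: "add" or "del"
# $2: upstream interface (usb0 above)
# $3: interface facing the cluster (enx0050b6e96af2 above)
nat_rules() {
  case "$1" in
    add) op=-A ;;
    del) op=-D ;;
    *) echo "usage: nat_rules add|del WAN_IF LAN_IF" >&2; return 1 ;;
  esac
  echo "iptables -t nat $op POSTROUTING -o $2 -j MASQUERADE"
  echo "iptables $op FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT"
  echo "iptables $op FORWARD -i $3 -o $2 -j ACCEPT"
}

# Usage: review the output first, then pipe it into a root shell:
#   nat_rules del usb0 enx0050b6e96af2
#   nat_rules del usb0 enx0050b6e96af2 | sudo sh
```

If forwarding was off before, sudo sysctl net.ipv4.ip_forward=0 restores that as well.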

Notes From Hannover Messe

2023/05/30
