While I was presenting a demo at Hannover Messe 2023, the OpenShift cluster that hosted the OAI-based 5GC showed no sign of life. With the help of my colleague Sagar, we managed to bring it back, and I am documenting the steps below. Note that the steps below are not in the order in which we originally debugged the problem; they are reordered to provide useful pointers for the future.
The whole problem would probably not have appeared if the cluster had had an internet connection and if all certificates had been valid. For isolation purposes, and since internet access at the conference might not have been good, we did not give the cluster internet access. However, after a hard shutdown, certificate errors made it try to download images from the internet, so it did not manage to start.
Preceding notes
The following should be checked to make sure the cluster is in good health. Check this before making any other changes.
First, run
oc get nodes -o wide
This should show the status Ready for every node. If not, check for more details using
oc describe nodes. Furthermore, check the health of the cluster operators
with
oc get clusteroperators.config.openshift.io
All operators should be available (AVAILABLE True), no longer progressing
(PROGRESSING False), and not degraded (DEGRADED False).
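The two checks above can be scripted so that only the problematic entries are printed. The following is a minimal sketch, not something we used at the time: the helper names not_ready_nodes and unhealthy_operators are made up for illustration, and the filters assume the default column order of oc get nodes and oc get clusteroperators with --no-headers. A node shown as Ready,SchedulingDisabled, or an operator with an empty VERSION column, would also be reported.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads the output of `oc get nodes --no-headers` on stdin
# (assumed columns: NAME STATUS ROLES AGE VERSION ...).
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Print the names of cluster operators that are not available, are still
# progressing, or are degraded.
# Reads the output of `oc get clusteroperators --no-headers` on stdin
# (assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE ...).
unhealthy_operators() {
  awk '$3 != "True" || $4 != "False" || $5 != "False" { print $1 }'
}

# Usage (empty output means everything looks healthy):
#   oc get nodes --no-headers | not_ready_nodes
#   oc get clusteroperators --no-headers | unhealthy_operators
```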
Assuming this is the case, make sure that all certificate signing requests (CSRs) have been approved. The following line should show every CSR as approved:
oc get csr
If not, proceed with the next section.
No logs using oc logs
While bringing the cluster back, there were errors like the one below:
$ oc logs ubi
Error from server: Get "https://172.21.16.105:10250/containerLogs/oai5gcn/ubi/ubi": remote error: tls: internal error
For some reason, the cluster could not serve any logs. Checking the certificates in the cluster showed many pending certificate signing requests:
$ oc get csr
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-284dw 31h kubernetes.io/kubelet-serving system:node:xyz.some.network.eurecom.fr <none> Pending
...
In short, the certificates had to be approved. This was done using the following line:
oc get csr -o name | xargs oc adm certificate approve
which approved all of them. Certificates have to be approved in order to allow
the cluster to pull images from its registry. Afterwards, the CONDITION of each
CSR should be Approved,Issued.
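If you prefer to approve only the requests that are still pending, rather than every CSR in the list, the names can be filtered first. This is a small sketch under the assumption that CONDITION is the last column of oc get csr --no-headers, as in the output shown above; pending_csrs is a hypothetical helper name.

```shell
# Print the names of CSRs whose last column (CONDITION) is "Pending".
# Reads the output of `oc get csr --no-headers` on stdin.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}

# Usage:
#   oc get csr --no-headers | pending_csrs | xargs oc adm certificate approve
```

It may be necessary to run this more than once, since new requests can appear after the first batch has been approved.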
Giving OpenShift the images for the cluster
While providing internet access during Hannover Messe allowed us to bring the cluster back, the 5GC did not start:
$ oc get pods
NAME READY STATUS RESTARTS AGE
iperf-pod 1/1 Running 1 16h
mysql-5dd5d7db4c-fgxjm 1/1 Running 1 16h
oai-amf-7dd447b89c-kfrb9 1/2 ImagePullBackOff 3 16h
oai-ausf-75944f666b-dgmd8 1/2 ImagePullBackOff 2 16h
oai-nrf-f47894f8-6kk2k 1/2 ImagePullBackOff 2 16h
oai-smf-79bc5776df-2ndc6 1/2 ImagePullBackOff 3 16h
oai-spgwu-tiny-6d97f7778-nfsp4 1/2 CrashLoopBackOff 8 (5m8s ago) 16h
oai-udm-7bd5595c77-lcsqx 1/2 ImagePullBackOff 2 16h
oai-udr-675dcf984f-mrch7 2/2 Running 2 16h
Note that at that point, it should have been enough to approve the certificates, but we had not done this at that time. Instead, we manually enabled OpenShift to deploy the 5GC.
In particular, oc get istag listed all images, and the values.yaml of the
5GC specified imagePullPolicy: IfNotPresent. Since the certificates were not
approved, the cluster could no longer pull images from its registry; it still
tried to pull them again, which failed.
The solution was to start each image individually and manually, using a YAML
file test.yaml like the one below:
apiVersion: v1
kind: Pod
metadata:
  name: ubi
spec:
  containers:
  - name: ubi
    image: oai-ausf:v1.5.0
    command: ["/bin/sh", "-c"]
    args:
      ["sleep inf"]
This specifies a pod ubi that uses the container image that was missing. It
was started using oc create -f test.yaml, resulting in
$ oc get pods
NAME READY STATUS RESTARTS AGE
ubi 1/1 Running 0 33s
Removing the pod with oc delete -f test.yaml and repeating this process for
all other images whose pods failed with ImagePullBackOff allowed us to bring up
the complete 5GC. I could then connect the gNB, and my demo phone had a PDU
session. It worked again!
The probable reason why such manual intervention worked is that the permissions
for deploying pods manually and for deploying them from a helm chart differ.
OpenShift could find the images and deploy the ubi pod above; however, with the
automatic/helm chart way, images were not found unless started manually once.
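The create/delete cycle described above is tedious to repeat by hand for every image. The following sketch wraps it in two functions; warmup_manifest and warmup_image are hypothetical names, the 120s timeout is an arbitrary choice, and the manifest is the same as test.yaml with the pod name and image made into parameters. We did this by hand at the time, so treat it as an untested convenience rather than the procedure we followed.

```shell
# Print a Pod manifest like test.yaml for the given pod name and image.
# $1: pod name, $2: image reference
warmup_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  containers:
  - name: $1
    image: $2
    command: ["/bin/sh", "-c"]
    args: ["sleep inf"]
EOF
}

# Create the pod, wait until it reports Ready, then delete it again.
# The delete runs even if the create or the wait failed.
warmup_image() {
  warmup_manifest "$1" "$2" | oc create -f - &&
    oc wait --for=condition=Ready "pod/$1" --timeout=120s
  oc delete pod "$1" --ignore-not-found
}

# Usage, with the image from test.yaml:
#   warmup_image ubi oai-ausf:v1.5.0
```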
If all else fails
During Hannover Messe, the first thing we did was to identify the problem. Note that the first thing to do should be to check the health of the cluster, that is, node and cluster operator health. However, if everything else fails, you can check the system services that run OpenShift:
systemctl status kubelet # Kubernetes node agent
systemctl status crio # container runtime
systemctl status kubens # Manages a mount namespace
systemctl status rpm-ostreed # rpm-ostree System Management Daemon
systemctl status nodeip-configuration # Write IP address configuration for Kubernetes
While going through these at Hannover Messe, nodeip-configuration showed:
avril 18 08:14:52 xyz.some.network.eurecom.fr bash[2539]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:156a06636a074e36238556f1a7e8c40ef252b168ed>
avril 18 08:15:52 xyz.some.network.eurecom.fr bash[2539]: time="2023-04-18T08:15:52Z" level=warning msg="Failed, retrying in 1s ... (1/3). Error: initializing source doc>
avril 18 08:16:53 xyz.some.network.eurecom.fr bash[2539]: time="2023-04-18T08:16:53Z" level=warning msg="Failed, retrying in 1s ... (2/3). Error: initializing source doc>
avril 18 08:17:54 xyz.some.network.eurecom.fr bash[2539]: time="2023-04-18T08:17:54Z" level=warning msg="Failed, retrying in 1s ... (3/3). Error: initializing source doc>
avril 18 08:18:55 xyz.some.network.eurecom.fr bash[2539]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:156a06636a074e3623855>
Oops, it tried to download images but did not have an internet connection.
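To get a quick overview instead of reading the full status of each service, the state of all of them can be printed in one go. This is a minimal sketch to be run on the node itself; service_states is a hypothetical helper name. It only relies on systemctl is-active printing the unit state (for example active, inactive, activating or failed). Units that run once and exit may legitimately show inactive, so the output is a hint on where to look, not a verdict.

```shell
# Print "<service><TAB><state>" for each service name given as argument,
# where <state> is whatever `systemctl is-active` reports for that unit.
service_states() {
  for svc in "$@"; do
    printf '%s\t%s\n' "$svc" "$(systemctl is-active "$svc" 2>/dev/null)"
  done
}

# Usage, with the services listed above:
#   service_states kubelet crio kubens rpm-ostreed nodeip-configuration
# For anything suspicious, read its log, e.g.:
#   journalctl -u nodeip-configuration
```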
Providing Internet
The cluster was hard-coded to use a specific gateway for internet access. In order to change as little as possible on the cluster, I reconfigured my laptop to take that gateway address and shared my internet connection from the laptop:
sudo sysctl net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -o usb0 -j MASQUERADE
sudo iptables -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
sudo iptables -A FORWARD -i enx0050b6e96af2 -o usb0 -j ACCEPT
where usb0 was the interface of the tethered connection from my phone, and
enx0050b6e96af2 the USB-to-ethernet adapter to the network with the cluster.
Afterwards, the cluster was able to download images.
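Once the cluster is back, the laptop should probably stop acting as a router. The sketch below prints the three iptables commands from above with either -A (add) or -D (delete), so the same rules can be removed again; share_rules is a hypothetical helper name, and the function only prints the commands so that they can be reviewed before being run.

```shell
# Print the iptables commands that add (A) or delete (D) the sharing rules.
# $1: action (A or D), $2: uplink interface, $3: cluster-facing interface
share_rules() {
  printf 'iptables -t nat -%s POSTROUTING -o %s -j MASQUERADE\n' "$1" "$2"
  printf 'iptables -%s FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT\n' "$1"
  printf 'iptables -%s FORWARD -i %s -o %s -j ACCEPT\n' "$1" "$3" "$2"
}

# Usage: review the output, then pipe it to a root shell to undo the setup:
#   share_rules D usb0 enx0050b6e96af2
#   share_rules D usb0 enx0050b6e96af2 | sudo sh
#   sudo sysctl net.ipv4.ip_forward=0
```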