I tried submitting a case with IFS on this but was advised to try here first, so please let me know if anyone can help with what we are experiencing below. We are self-hosting on-prem and recently upgraded from IFS 9 to 24R1.
TL;DR
The <server> cluster is in HA mode, but it is not using the robust cluster state database (etcd) that we would expect in a production-ready environment. On top of that, each node refers to itself for the Kubernetes API instead of a load-balanced IP address, so there is no failover path if a node goes down. This is symptomatic of a non-production/non-enterprise setup. MicroK8s is best suited to small, single-node implementations and/or local development. It does have add-on features that provide HA and other production-style capabilities, but those features are not very robust. This is how IFS chose to package their product, and they don’t appear to offer any other on-prem options, so we appear to be stuck with it as is.
As a reference for the rest of the explanation, here’s a link to a post I found online from someone who ran into issues working with MicroK8s; it’s a short read. https://www.thegalah.com/which-kubernetes-distribution-should-you-choose-lessons-from-failure
Note: The purpose of this note is not to change the way it’s set up, but to understand what to expect from this implementation and why it behaves differently from a standard HA environment. If a problem is found with the setup script (meaning it didn’t do what it said on the tin), that also gives us additional information to take back to IFS support.
Initial problem statement (the thing that sparked this review)
When the primary node goes offline (i.e. rebooting after updates) the cluster goes down. This is unexpected behaviour, since the IFS instructions for setting up an HA cluster were followed and appeared to complete successfully. So why does this happen?
First main issue
Every Kubernetes cluster (MicroK8s included) needs a database to store the cluster’s state, i.e. a place to hold everything the cluster needs to know to keep running. This database must be distributed across the nodes and is one of the requirements for making the cluster highly available. Most implementations use etcd as the database of choice because of its robustness and maturity in this role. MicroK8s, however, defaults to dqlite, a distributed version of SQLite that replicates itself across the nodes. That works, but the problem comes when node 1 goes offline: if that node was the dqlite leader (or if enough voting nodes are down that quorum is lost), dqlite has to elect a new leader among the remaining voters (which is a whole process) before the datastore, and therefore the cluster, can resume. From what I’ve read online, dqlite handles this poorly: it is prone to errors and often doesn’t sort itself out in time before the cluster falls on its face. Note: MicroK8s shares some of the blame here too, as it may not be aggressive enough in forcing the re-election of a leader.
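To make that failure mode concrete, below is a minimal sketch of a quorum check against the dqlite membership file. Assumptions on my side: the default MicroK8s snap layout, where membership is recorded in /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml as entries with Address and Role fields (Role 0 = voter); adjust the path and field names if your install differs. It needs PyYAML and will likely need root to read the file.

```python
#!/usr/bin/env python3
"""Rough quorum check for the MicroK8s dqlite backend.

Assumption: default snap layout, where dqlite membership is written to
/var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml with
Address/Role entries (Role 0 = voter). Requires PyYAML (pip install pyyaml).
"""
import socket
import yaml

CLUSTER_YAML = "/var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml"
ROLE_NAMES = {0: "voter", 1: "stand-by", 2: "spare"}


def reachable(address: str, timeout: float = 2.0) -> bool:
    """Return True if the node's dqlite port accepts a TCP connection."""
    host, _, port = address.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


def main() -> None:
    with open(CLUSTER_YAML) as fh:
        members = yaml.safe_load(fh)

    voters_up, voters_total = 0, 0
    for member in members:
        role = ROLE_NAMES.get(member.get("Role"), "unknown")
        up = reachable(member["Address"])
        if role == "voter":
            voters_total += 1
            voters_up += up
        print(f"{member['Address']:<22} role={role:<8} reachable={up}")

    # dqlite (Raft-based, like etcd) needs a majority of voters to keep a leader.
    quorum = voters_total // 2 + 1
    print(f"\nvoters reachable: {voters_up}/{voters_total} (quorum needs {quorum})")
    if voters_up < quorum:
        print("WARNING: quorum would be lost -- the API will stop responding.")


if __name__ == "__main__":
    main()
```

In our three-node case the maths is simple: quorum is 2 of 3 voters, so a single node reboot shouldn’t lose quorum, which is why the leader re-election behaviour (rather than quorum itself) looks like the suspect.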
Second main issue
I have verified that each node in the cluster is running an API server, which is good and expected. However, it is set up differently than expected. Normally, in an HA setup, you have a virtual IP (VIP) configured on a load balancer (such as HAProxy). Each node is then configured to look for the API server at the VIP, which forwards requests to a working/responsive API server node. So, if a node fails, you can still talk to the API on another node seamlessly.
However, in this case, each node in the <server> cluster is configured to look to itself for the API server instead of a VIP. This was likely done to minimize the infrastructure needed to provide HA, but it reduces the robustness of the cluster. For example, if the local API server stops responding, requests won’t be automatically routed to another API server; they will just fail, taking that node offline and potentially pushing dqlite into another leader election to keep the cluster alive.
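To see the difference in practice, here is a small sketch that probes the API endpoint on each node directly. Assumptions: MicroK8s exposes kube-apiserver on port 16443 on every node, and the node names below are placeholders for our <server> hosts. In a VIP-based setup you would probe a single load-balanced address instead of three per-node ones.

```python
#!/usr/bin/env python3
"""Check which nodes currently answer on their own API server endpoint.

Assumptions: MicroK8s serves kube-apiserver on port 16443 on each node, and
the hostnames below are placeholders -- swap in real names/IPs. This only
tests TCP + TLS reachability; it does not authenticate.
"""
import socket
import ssl

NODES = ["server-node1", "server-node2", "server-node3"]  # placeholders
API_PORT = 16443  # MicroK8s default; a VIP setup would expose one shared address


def api_reachable(host: str, port: int = API_PORT, timeout: float = 2.0) -> bool:
    """True if the node completes a TLS handshake on its API port."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # we only care about reachability here
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        return False


if __name__ == "__main__":
    for node in NODES:
        status = "up" if api_reachable(node) else "DOWN"
        print(f"https://{node}:{API_PORT}  ->  {status}")
    print("\nEach node's own kubeconfig points at its local endpoint, so a node "
          "whose API server shows DOWN has nothing to fail over to.")
```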
So, what now?
I don’t want to change the way things are “wired up”, since IFS has built this as a packaged product rather than one that runs on a “bring your own Kubernetes” model. We should be aware of the potential issues, and if we do encounter performance, stability, or downtime problems, we will need to engage IFS support to correct them. If we changed the architecture and then experienced problems, they might decline to support us because of it, and being without support on a production system isn’t acceptable for us.
For now, it may just be that we have to accept this is the way it is. At least we know what can happen with this software stack, and if the worst does happen, we can completely rebuild the cluster in a reasonable amount of time using the IFS install scripts.
We would like to take OS patches without downtime, so any help is much appreciated.
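On the patching point, this is roughly the per-node procedure we had in mind (drain, patch/reboot by hand, then uncordon). It is only a sketch, assuming the standard microk8s wrapper commands are available on each node and that the node name is passed in; it keeps workloads off the node being rebooted but does not address the dqlite leader hand-off described above, which is the part we are hoping for advice on.

```python
#!/usr/bin/env python3
"""Rough per-node patching helper: drain -> (patch/reboot by hand) -> uncordon.

Assumptions: run on a box with the microk8s snap installed; the node name on
the command line is a placeholder for one of the <server> nodes. This only
moves workloads around; it does not fix dqlite leader election.
"""
import subprocess
import sys


def run(*cmd: str) -> None:
    """Run a command, echoing it first, and stop on failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def drain(node: str) -> None:
    # Evict everything except DaemonSets so the node can be rebooted safely.
    run("microk8s", "kubectl", "drain", node,
        "--ignore-daemonsets", "--delete-emptydir-data", "--timeout=300s")


def uncordon(node: str) -> None:
    # After the reboot, wait for MicroK8s to report ready, then re-enable scheduling.
    run("microk8s", "status", "--wait-ready")
    run("microk8s", "kubectl", "uncordon", node)


if __name__ == "__main__":
    if len(sys.argv) != 3 or sys.argv[1] not in ("drain", "uncordon"):
        sys.exit("usage: patch_node.py drain|uncordon <node-name>")
    {"drain": drain, "uncordon": uncordon}[sys.argv[1]](sys.argv[2])
```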