Cluster Maintenance
OS Upgrade
- Suppose you have a cluster with a few nodes and Pods serving applications, and suddenly one of the worker nodes goes down. The Pods on that node are, of course, inaccessible.
- If the node comes back online immediately, the kubelet process starts and the Pods come back online. However, Kubernetes considers the Pods dead if the node does not return within 5 minutes, which is the default pod eviction timeout (see the sketch below).
- If the Pods were part of a ReplicaSet then they are recreated on other nodes.
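- In older releases this 5-minute window was controlled by the kube-controller-manager flag --pod-eviction-timeout; newer clusters rely on taint-based eviction instead, where every Pod gets default not-ready/unreachable tolerations of 300 seconds. A minimal sketch of where to look on a kubeadm cluster (<pod-name> is just a placeholder):
# look for an explicit eviction timeout on the controller manager (if absent, the old 5m default applied)
$ grep pod-eviction-timeout /etc/kubernetes/manifests/kube-controller-manager.yaml
# on clusters using taint-based eviction, the 300s default tolerations are visible on each Pod
$ kubectl get pod <pod-name> -o yaml | grep -A3 "node.kubernetes.io/not-ready"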
- A safer way to upgrade a node without losing your Pods is to drain it, so that all of its workloads end up on other nodes in the cluster. Technically, they are not moved: when you drain a node, its Pods are gracefully terminated and recreated on other nodes, and the node itself is marked unschedulable ("cordoned").
$ kubectl drain node1
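- In practice drain usually needs a couple of extra flags; a hedged example using the same node1:
# DaemonSet Pods cannot be evicted, so drain refuses to proceed unless told to ignore them;
# Pods using emptyDir volumes lose that data, which must be acknowledged explicitly
# (on older kubectl versions the flag is --delete-local-data instead)
$ kubectl drain node1 --ignore-daemonsets --delete-emptydir-data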
- After upgrading the node, make sure to run
$ kubectl uncordon node1
to mark it schedulable again, so that new Pods can be scheduled on that node.
- Another related command is
$ kubectl cordon node1
which, unlike drain, does not terminate existing Pods; it just marks the node unschedulable so that no new Pods are scheduled on it.
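- Either way, a cordoned or drained node shows up as SchedulingDisabled (illustrative output below):
$ kubectl get nodes
# NAME    STATUS                     ROLES    AGE   VERSION
# node1   Ready,SchedulingDisabled   <none>   10d   v1.27.0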
Cluster Upgrade
- Kube API Server is the primary component in the control plane and none of the other components should ever be at a higher version than it.
- The controller manager and scheduler can be up to one minor version lower.
- If the Kube API server was at version X, the controller manager and the scheduler could be at X-1 and the Kubelet and kube-proxy components could be at X-2. Kubectl can be either at X+1, X-1, or X.
- CoreDNS and etcd are external projects, so their versions are not bound by this skew policy.
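- A quick way to see where each component stands before planning an upgrade:
# client (kubectl) and server (kube-apiserver) versions
$ kubectl version
# kubelet version reported per node
$ kubectl get nodes
# or directly on a node
$ kubelet --version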
- Three Strategies to Upgrade Worker Nodes
- The first is to upgrade all of them at once, but then the Pods are down during the upgrade and users can no longer access the applications.
- The second is to upgrade one node at a time, letting the workloads run on the remaining nodes in the meantime.
- The third strategy is to add new nodes with the newer software version to the cluster, move the workloads over, and then remove the old nodes.
- To see the current cluster version, the kubeadm tool version, and the latest stable Kubernetes version available, run the following command →
$ kubeadm upgrade plan
- To upgrade the kubeadm tool itself, use the following command →
$ sudo apt-get install -y kubeadm=1.12.0-00
- To upgrade your Kubernetes cluster →
$ kubeadm upgrade apply v1.12.0
- To upgrade the kubelet →
$ sudo apt-get install -y kubelet=1.12.0-00
and then restart it
$ systemctl restart kubelet
- Do the same on the worker nodes, using kubeadm upgrade node instead of kubeadm upgrade apply (see the sketch after the lab block below).
# upgrading the control plane
$ k drain controlplane --ignore-daemonsets
$ apt-cache madison kubeadm && apt-mark unhold kubeadm
$ apt-get update && apt-get install -y kubeadm='1.27.0-00'
$ apt-mark hold kubeadm
$ kubeadm upgrade plan
$ kubeadm upgrade apply v1.27.0
$ apt-mark unhold kubelet kubectl
$ apt-get update && apt-get install -y kubelet='1.27.0-00' kubectl='1.27.0-00'
$ apt-mark hold kubelet kubectl
$ systemctl daemon-reload && systemctl restart kubelet
$ kubectl uncordon controlplane

# upgrading the node
$ k drain
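- For each worker node the sequence is similar, except the upgrade is applied with kubeadm upgrade node instead of kubeadm upgrade apply. A sketch, assuming a node named node01 and the same 1.27.0 target:
# from a machine with kubectl access: move the workloads off the node
$ kubectl drain node01 --ignore-daemonsets
# on node01 itself: upgrade kubeadm, apply the node upgrade, then upgrade kubelet/kubectl
$ apt-mark unhold kubeadm && apt-get update && apt-get install -y kubeadm='1.27.0-00' && apt-mark hold kubeadm
$ kubeadm upgrade node
$ apt-mark unhold kubelet kubectl && apt-get install -y kubelet='1.27.0-00' kubectl='1.27.0-00' && apt-mark hold kubelet kubectl
$ systemctl daemon-reload && systemctl restart kubelet
# back on the machine with kubectl access: make the node schedulable again
$ kubectl uncordon node01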
Backup and Restore
Backup - Resource Configs
- At times we create objects imperatively by executing a command, for example creating a namespace, a Secret, or a ConfigMap, or exposing an application.
- The preferred approach is the declarative way: first create a definition file and then run kubectl apply on that file. The manifests can be saved and managed in a source code repository such as GitHub, easily reused at a later time, or shared with others. That way, even if we lose the entire cluster, we can redeploy our applications simply by applying these manifests again.
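- For example (the namespace name and file are just illustrations):
# imperative: only the live object exists, nothing is left to re-apply later
$ kubectl create namespace dev
# declarative: the manifest can be committed to Git and re-applied after a disaster
$ kubectl apply -f namespace-dev.yaml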
- What if our team members use the imperative way? At that point we use a solution like Velero to constantly query the Kube API server and back up the resource configurations, or take a one-off dump ourselves:
$ kubectl get all --all-namespaces -o yaml > all-deploy-services.yml
Backup - ETCD
- The etcd cluster stores information about the state of the cluster: the nodes and every other resource created within it. It is therefore really important to back up the etcd server so the Kubernetes cluster can be recovered in disaster scenarios, such as losing all control plane nodes.
- Backup etcd
# Backup example
$ ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save snapshot.db
$ ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshot.db

# backup options
$ ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=<trusted-ca-file> \
    --cert=<cert-file> \
    --key=<key-file> \
    snapshot save <backup-file-location>

## Backup lab
$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    snapshot save /opt/snapshot-pre-boot.db
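- The endpoint, certificate paths and data directory used above can usually be read straight from the etcd static Pod manifest on a kubeadm cluster (path assumed to be the kubeadm default):
$ grep -E "listen-client-urls|trusted-ca-file|cert-file|key-file|data-dir" /etc/kubernetes/manifests/etcd.yaml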
- Restore etcd
# Restore example
$ ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot restore snapshot.db

# Restore options
$ ETCDCTL_API=3 etcdctl --data-dir <data-dir-location> snapshot restore snapshot.db
=> where <data-dir-location> is a new directory that will be created during the restore process

## Restore lab
$ ETCDCTL_API=3 etcdctl --data-dir=/var/lib/etcd-restore snapshot restore /opt/snapshot-pre-boot.db
$ vim /etc/kubernetes/manifests/etcd.yaml
  :%s/\/var\/lib\/etcd/\/var\/lib\/etcd-restore/
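- After the manifest edit, the kubelet recreates the etcd static Pod pointing at the restored directory. A quick sanity check (manifest path is the kubeadm default):
# confirm the --data-dir flag and the hostPath volume now point at /var/lib/etcd-restore
$ grep -E "data-dir|path:" /etc/kubernetes/manifests/etcd.yaml
# once etcd and the API server are back, verify the restored resources are visible again
$ kubectl -n kube-system get pods
$ kubectl get deployments,services -A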
Notes about etcd
- etcd can either run inside the Kubernetes cluster itself (as a static Pod on the control plane nodes, which is the kubeadm default) or as an external server outside the cluster.
- To grab the IP address of the external etcd server, use
$ ps -ef | grep etcd
or check the --etcd-servers flag in the Kube API server manifest.
# list the etcd process
$ ps -ef | grep etcd

# change context
$ kubectl config use-context cluster2

# get the members of the etcd cluster
$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.pem \
    --cert=/etc/kubernetes/pki/etcd/etcd.pem \
    --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
    member list

## Restore from external etcd
$ ETCDCTL_API=3 etcdctl --data-dir=/var/lib/etcd-data-new snapshot restore cluster2.db
# change the --data-dir in the /etc/systemd/system/etcd.service file
# make sure the new --data-dir has the right permissions and owner
# restart the kubelet service, and delete the scheduler/controller-manager pods
# reload the daemon and you are good to go.
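- A sketch of those last steps on the external etcd server itself, assuming the service runs as user etcd and the unit file is /etc/systemd/system/etcd.service:
# point the unit file at the restored directory (change --data-dir to /var/lib/etcd-data-new)
$ vim /etc/systemd/system/etcd.service
# the etcd process must be able to read and write the new directory
$ chown -R etcd:etcd /var/lib/etcd-data-new
# pick up the unit change and restart etcd
$ systemctl daemon-reload
$ systemctl restart etcd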