# On new install

* `tofu apply` to create the machines
* Change the hostname to the FQDN with `hostnamectl`. Changing the hostname on a running cluster will break the cluster.
* Register DNS with `knotctl add -z rut.sunet.se -n internal-sto4-test-k8sm-1.rut.sunet.se. -d 2001:6b0:6c::449 -r AAAA`
* `./prepare-iaas-debian ${each host}`
* `./add-host -b ${each host}`
* `./edit-secrets ${each controller host}`:

```
---
+microk8s_secrets:
+  kube-system:
+    cloud-config:
+      - key: cloud.conf
+        value: >
+          ENC[PKCS7,MIID7gYJKoZIhvcNAQcDoIID3zCCA9sCAQAxggKSMIICjgIBAD
+          B2MF4xCzAJBgNVBAYTAlNFMQ4wDAYDVQQKDAVTVU5FVDEOMAwGA1UECwwFRV
+          lBTUwxLzAtBgNVBAMMJmludGVybmFsLXN0bzQtdGVzdC1rOHNtLTIucnV0Ln
```

* Add to cosmos-rules:

```
'^internal-sto4-test-k8sc-[0-9].rut.sunet.se$':
  rut::infra_ca_rp:
  sunet::microk8s::node:
    channel: 1.31/stable
  sunet::frontend::register_sites:
    sites:
      kubetest.rut.sunet.se:
        frontends:
          - se-fre-lb-1.sunet.se
          - se-tug-lb-1.sunet.se
        port: '30443'
'^internal-sto4-test-k8sw-[0-9].rut.sunet.se$':
  rut::infra_ca_rp:
  sunet::microk8s::node:
    channel: 1.31/stable
'^internal-sto4-test-k8spg-[0-9].rut.sunet.se$':
  rut::infra_ca_rp:
  sunet::microk8s::node:
    channel: 1.31/stable
```

* Generate a provisioning key on the first management node with `microk8s add-node`
* Add all other _Controller_ nodes with `microk8s join 89.46.21.119:25000/12345678987654345678976543/1234565`
* Add all other _Worker_ nodes with `microk8s join 89.46.21.119:25000/12345678987654345678976543/1234565 --worker`
* Taint the controller nodes so they won't get workload: `microk8s.kubectl taint nodes --selector=node.kubernetes.io/microk8s-controlplane=microk8s-controlplane cp-node=true:NoExecute`
* Taint the Postgres nodes so they won't get workload: `microk8s.kubectl taint nodes --selector=sunet.se/role=cnpg pg-node=true:NoExecute`
* `kubectl get nodes` should show something like:

```
NAME                                     STATUS     ROLES    AGE   VERSION
internal-sto4-test-k8sc-2.rut.sunet.se   NotReady   <none>   16d   v1.28.7
internal-sto4-test-k8sw-5.rut.sunet.se   Ready      <none>   15m   v1.28.7
internal-sto4-test-k8sw-1.rut.sunet.se   Ready      <none>   15m   v1.28.7
internal-sto4-test-k8sw-2.rut.sunet.se   Ready      <none>   14m   v1.28.7
internal-sto4-test-k8sc-3.rut.sunet.se   Ready      <none>   16d   v1.28.7
internal-sto4-test-k8sw-3.rut.sunet.se   Ready      <none>   18m   v1.28.7
internal-sto4-test-k8sw-4.rut.sunet.se   Ready      <none>   16m   v1.28.7
internal-sto4-test-k8sw-0.rut.sunet.se   Ready      <none>   21m   v1.28.7
internal-sto4-test-k8sc-1.rut.sunet.se   Ready      <none>   16d   v1.28.7
```

* Enable the addons needed for rut:

```
microk8s enable ingress
microk8s enable cert-manager
microk8s enable community
microk8s enable cloudnative-pg
microk8s enable metrics-server
```

* `kubectl create namespace sunet-cnpg`
* `kubectl label node internal-sto4-test-k8spg-0.rut.sunet.se sunet.se/role=cnpg`
* `kubectl label node internal-sto4-test-k8spg-1.rut.sunet.se sunet.se/role=cnpg`
* `kubectl label node internal-sto4-test-k8spg-2.rut.sunet.se sunet.se/role=cnpg`
* Set up the storage class: `rsync -a k8s internal-sto4-test-k8sc-0.rut.sunet.se: && ssh internal-sto4-test-k8sc-0.rut.sunet.se kubectl apply -f k8s`
* **Profit**

# Setting up auth (satosa) and monitoring with thruk+naemon+loki+influxdb

* Get the shib-sp metadata with `curl https://monitor-test.rut.sunet.se/Shibboleth.sso/Metadata > internal-sto4-test-satosa-1.rut.sunet.se/overlay/etc/satosa/metadata/monitor.xml`
* Get the satosa metadata with `curl https://idp-proxy-test.rut.sunet.se/Saml2IDP/proxy.xml > internal-sto4-test-monitor-1.rut.sunet.se/overlay/opt/naemon_monitor/satosa.xml`
* Publish the backend metadata to SWAMID: `ssh internal-sto4-test-satosa-1.rut.sunet.se cat /etc/satosa/metadata/backend.xml | xmllint --format - > rut.xml`
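Before committing the exchanged metadata to the overlays, it can be worth checking that both endpoints actually serve well-formed XML. A minimal sanity check, reusing the URLs from the steps above (this check is an illustrative addition, not part of the original procedure):

```
# Well-formedness check only, no SAML schema validation; xmllint exits
# non-zero if the fetched document does not parse as XML.
curl -s https://monitor-test.rut.sunet.se/Shibboleth.sso/Metadata | xmllint --noout -
curl -s https://idp-proxy-test.rut.sunet.se/Saml2IDP/proxy.xml | xmllint --noout -
```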
## Day 2 operations

### Rolling upgrade

Drain one controller at a time with:

```
kubectl drain internal-sto4-test-k8sc-0.rut.sunet.se --ignore-daemonsets
```

After the first node is drained and upgraded, restart the calico controller with:

```
kubectl rollout restart deployment calico-kube-controllers -n kube-system
```

After that, restart the calico-node pod running on that host by deleting it. It will be recreated automatically by its DaemonSet:

```
kubectl delete pod calico-node-???? -n kube-system
```

Continue with the workers (including the PG nodes):

```
kubectl drain internal-sto4-test-k8sw-0.rut.sunet.se --force --ignore-daemonsets --delete-emptydir-data --disable-eviction
kubectl delete pod calico-node-???? -n kube-system
```

### Calico problems

Calico can get into a bad state. Look for errors such as `Candidate IP leak handle` and `too old resource version` in the calico-kube-controllers pod logs. If these are found, calico can be restarted with:

```
kubectl rollout restart deployment calico-kube-controllers -n kube-system
kubectl rollout restart daemonset calico-node -n kube-system
```

This will disrupt the whole cluster for a few seconds.

### Backup

Install Velero from https://github.com/vmware-tanzu/velero/releases, then configure the backup location and create the backup:

```
velero install --provider aws --plugins velero/velero-plugin-for-aws:v1.2.1 --bucket velero --secret-file ./credentials-velero --use-volume-snapshots=false --backup-location-config region=sto3,s3ForcePathStyle="true",s3Url=https://s3.sto3.safedc.net

velero backup create rut-backup --selector 'backup notin (ignore)'
```
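Velero can also run this backup on a schedule and restore from it. A minimal sketch, assuming the same label selector as above (the schedule name `rut-nightly` and the cron expression are illustrative, not part of the original setup):

```
# Nightly backup at 03:00 using the same selector as the one-off backup above
# (the schedule name and cron expression are examples).
velero schedule create rut-nightly --schedule="0 3 * * *" --selector 'backup notin (ignore)'

# List existing backups, then restore from a named backup when needed.
velero backup get
velero restore create --from-backup rut-backup
```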