Prod ops repo for the RUT project.

On a new install:

  • Run tofu apply to create the machines
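
The usual OpenTofu sequence from the repo root (nothing RUT-specific assumed here):

tofu init
tofu plan
tofu apply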
  • Change the hostname to the FQDN with hostnamectl; changing it on a running cluster will break the cluster
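
For example, using one of the controller names from this setup:

hostnamectl set-hostname internal-sto4-test-k8sc-0.rut.sunet.se
hostnamectl status   # the static hostname should now be the FQDN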
  • Register DNS with knotctl add -z rut.sunet.se -n internal-sto4-test-k8sm-1.rut.sunet.se. -d 2001:6b0:6c::449 -r AAAA
  • ./prepare-iaas-debian ${each host}
  • ./add-host -b ${each host}
  • ./edit-secrets ${each controller host} and add the cloud config, for example:

---
microk8s_secrets:
  kube-system:
    cloud-config:
        - key: cloud.conf
          value: >
            ENC[PKCS7,MIID7gYJKoZIhvcNAQcDoIID3zCCA9sCAQAxggKSMIICjgIBAD
            B2MF4xCzAJBgNVBAYTAlNFMQ4wDAYDVQQKDAVTVU5FVDEOMAwGA1UECwwFRV
            lBTUwxLzAtBgNVBAMMJmludGVybmFsLXN0bzQtdGVzdC1rOHNtLTIucnV0Ln
  • Add to cosmos-rules.yaml:

'^internal-sto4-test-k8sc-[0-9].rut.sunet.se$':
  rut::infra_ca_rp:
  sunet::microk8s::node: 
    channel: 1.31/stable
  sunet::frontend::register_sites:
    sites:
      kubetest.rut.sunet.se:
        frontends:
        - se-fre-lb-1.sunet.se
        - se-tug-lb-1.sunet.se
        port: '30443'
'^internal-sto4-test-k8sw-[0-9].rut.sunet.se$':
  rut::infra_ca_rp:
  sunet::microk8s::node: 
    channel: 1.31/stable
'^internal-sto4-test-k8spg-[0-9].rut.sunet.se$':
  rut::infra_ca_rp:
  sunet::microk8s::node: 
    channel: 1.31/stable
  • Add nodes by generating a provisioning key on the first controller node with microk8s add-node
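
The join string used in the next two steps comes from the output of add-node on that first node; the address and token it prints are placeholders here:

microk8s add-node
# prints something like:
#   microk8s join 89.46.21.119:25000/<token>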
  • Add all other Controller nodes with microk8s join 89.46.21.119:25000/12345678987654345678976543/1234565
  • Add all other Worker nodes with microk8s join 89.46.21.119:25000/12345678987654345678976543/1234565 --worker
  • Taint controller nodes so they won't get workloads: microk8s.kubectl taint nodes --selector=node.kubernetes.io/microk8s-controlplane=microk8s-controlplane cp-node=true:NoExecute
  • Taint Postgres nodes so they won't get workloads: microk8s.kubectl taint nodes --selector=sunet.se/role=cnpg pg-node=true:NoExecute
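
One way to check, read-only, that the taints landed on the intended nodes:

microk8s.kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints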
  • kubectl get nodes should show something like:
NAME                                     STATUS     ROLES    AGE   VERSION
internal-sto4-test-k8sc-2.rut.sunet.se   NotReady   <none>   16d   v1.28.7
internal-sto4-test-k8sw-5.rut.sunet.se   Ready      <none>   15m   v1.28.7
internal-sto4-test-k8sw-1.rut.sunet.se   Ready      <none>   15m   v1.28.7
internal-sto4-test-k8sw-2.rut.sunet.se   Ready      <none>   14m   v1.28.7
internal-sto4-test-k8sc-3.rut.sunet.se   Ready      <none>   16d   v1.28.7
internal-sto4-test-k8sw-3.rut.sunet.se   Ready      <none>   18m   v1.28.7
internal-sto4-test-k8sw-4.rut.sunet.se   Ready      <none>   16m   v1.28.7
internal-sto4-test-k8sw-0.rut.sunet.se   Ready      <none>   21m   v1.28.7
internal-sto4-test-k8sc-1.rut.sunet.se   Ready      <none>   16d   v1.28.7
  • Enable the addons needed for rut:

microk8s enable ingress
microk8s enable cert-manager
microk8s enable community
microk8s enable cloudnative-pg
microk8s enable metrics-server
  • kubectl create namespace sunet-cnpg
  • kubectl label node internal-sto4-test-k8spg-0.rut.sunet.se sunet.se/role=cnpg
  • kubectl label node internal-sto4-test-k8spg-1.rut.sunet.se sunet.se/role=cnpg
  • kubectl label node internal-sto4-test-k8spg-2.rut.sunet.se sunet.se/role=cnpg
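
A quick check that all three PG nodes carry the label:

kubectl get nodes -l sunet.se/role=cnpg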
  • Set up the storage class: rsync -a k8s internal-sto4-test-k8sc-0.rut.sunet.se: && ssh internal-sto4-test-k8sc-0.rut.sunet.se kubectl apply -f k8s
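
The class name depends on the manifests in the k8s directory, but after the apply something should show up in:

kubectl get storageclass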
  • Profit

Setting up auth (satosa) and monitoring with thruk+naemon+loki+influxdb

  • Get shib-sp metadata with curl https://monitor-test.rut.sunet.se/Shibboleth.sso/Metadata > internal-sto4-test-satosa-1.rut.sunet.se/overlay/etc/satosa/metadata/monitor.xml
  • Get satosa metadata with curl https://idp-proxy-test.rut.sunet.se/Saml2IDP/proxy.xml > internal-sto4-test-monitor-1.rut.sunet.se/overlay/opt/naemon_monitor/satosa.xml
  • Publish the backend metadata to SWAMID: ssh internal-sto4-test-satosa-1.rut.sunet.se cat /etc/satosa/metadata/backend.xml | xmllint --format - > rut.xml
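
Optionally, sanity-check that the fetched metadata files are well-formed XML before committing them (same paths as above):

xmllint --noout internal-sto4-test-satosa-1.rut.sunet.se/overlay/etc/satosa/metadata/monitor.xml
xmllint --noout internal-sto4-test-monitor-1.rut.sunet.se/overlay/opt/naemon_monitor/satosa.xml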

Day 2 operations:

Rolling upgrade:

Drain one controller at a time with:

kubectl drain internal-sto4-test-k8sc-0.rut.sunet.se --ignore-daemonsets

After the first node is drained and upgraded, restart the calico controller with:

kubectl rollout restart deployment calico-kube-controllers -n kube-system 

After that, restart the calico-node pod running on that host by deleting it. It will be recreated automatically.

kubectl delete pod calico-node-???? -n kube-system
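
The drain leaves the node cordoned, so once it is upgraded and Ready again it needs an uncordon before moving on to the next node (a step assumed here rather than spelled out above):

kubectl uncordon internal-sto4-test-k8sc-0.rut.sunet.se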

Continue with the workers (including PG nodes):

kubectl drain internal-sto4-test-k8sw-0.rut.sunet.se --force --ignore-daemonsets --delete-emptydir-data --disable-eviction
kubectl delete pod calico-node-???? -n kube-system
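
One way to find the calico-node pod for a given host, assuming the stock k8s-app=calico-node label:

kubectl get pods -n kube-system -l k8s-app=calico-node -o wide | grep internal-sto4-test-k8sw-0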

Calico problems

Calico can get into a bad state. Look for problems like "Candidate IP leak handle" and "too old resource version" in the calico-kube-controllers pod. If these are found, Calico can be restarted with:

kubectl rollout restart deployment calico-kube-controllers -n kube-system
kubectl rollout restart daemonset calico-node -n kube-system

This will disrupt the whole cluster for a few seconds.
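
To look for those messages before deciding to restart (the grep patterns are loose approximations of the error strings above):

kubectl logs -n kube-system deployment/calico-kube-controllers --tail=1000 | grep -Ei 'ip leak|too old resource version'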

Backup

Install the Velero client from https://github.com/vmware-tanzu/velero/releases, then configure the S3 backup location and create a backup:

velero install --provider aws --plugins velero/velero-plugin-for-aws:v1.2.1 --bucket velero --secret-file ./credentials-velero --use-volume-snapshots=false --backup-location-config region=sto3,s3ForcePathStyle="true",s3Url=https://s3.sto3.safedc.net
velero backup create rut-backup --selector 'backup notin (ignore)'
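
The credentials-velero file referenced by velero install is the usual AWS-style credentials file used by the S3 plugin; the key values below are placeholders, and the last two commands just confirm that the backup ran:

cat > credentials-velero <<'EOF'
[default]
aws_access_key_id = <s3-access-key>
aws_secret_access_key = <s3-secret-key>
EOF

velero backup get
velero backup describe rut-backup --details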