# on new install:
* `tofu apply` to create machines
* Change the hostname to the FQDN with `hostnamectl` before the cluster is set up; changing the hostname on a running cluster will break the cluster
* register dns with `knotctl add -z rut.sunet.se -n internal-sto4-test-k8sm-1.rut.sunet.se. -d 2001:6b0:6c::449 -r AAAA`
* `./prepare-iaas-debian ${each host}`
* `./add-host -b ${each host}`
* `./edit-secrets ${each controller host}`
+ kube-system:
+ cloud-config:
+ - key: cloud.conf
+ value: >
+ lBTUwxLzAtBgNVBAMMJmludGVybmFsLXN0bzQtdGVzdC1rOHNtLTIucnV0Ln
* Add to cosmos-rules:
channel: 1.31/stable
- se-fre-lb-1.sunet.se
- se-tug-lb-1.sunet.se
port: '30443'
channel: 1.31/stable
channel: 1.31/stable
* Generate a join token on the first management node with `microk8s add-node`; it prints the `microk8s join` command to run on each new node
* Add all other _Controller_ nodes with `microk8s join`
* Add all other _Worker_ nodes with `microk8s join --worker`
* Taint controller nodes so they won't get workloads: `microk8s.kubectl taint nodes --selector=node.kubernetes.io/microk8s-controlplane=microk8s-controlplane cp-node=true:NoExecute`
* Taint Postgres nodes so they won't get workloads: `microk8s.kubectl taint nodes --selector=sunet.se/role=cnpg pg-node=true:NoExecute`
* `kubectl get nodes` should show something like:
internal-sto4-test-k8sc-2.rut.sunet.se NotReady <none> 16d v1.28.7
internal-sto4-test-k8sw-5.rut.sunet.se Ready <none> 15m v1.28.7
internal-sto4-test-k8sw-1.rut.sunet.se Ready <none> 15m v1.28.7
internal-sto4-test-k8sw-2.rut.sunet.se Ready <none> 14m v1.28.7
internal-sto4-test-k8sc-3.rut.sunet.se Ready <none> 16d v1.28.7
internal-sto4-test-k8sw-3.rut.sunet.se Ready <none> 18m v1.28.7
internal-sto4-test-k8sw-4.rut.sunet.se Ready <none> 16m v1.28.7
internal-sto4-test-k8sw-0.rut.sunet.se Ready <none> 21m v1.28.7
internal-sto4-test-k8sc-1.rut.sunet.se Ready <none> 16d v1.28.7
* Enable needed addons for rut: `microk8s enable ingress` `microk8s enable cert-manager` `microk8s enable community` `microk8s enable cloudnative-pg` `microk8s enable metrics-server`
* `kubectl create namespace sunet-cnpg`
* `kubectl label node internal-sto4-test-k8spg-0.rut.sunet.se sunet.se/role=cnpg`
* `kubectl label node internal-sto4-test-k8spg-1.rut.sunet.se sunet.se/role=cnpg`
* `kubectl label node internal-sto4-test-k8spg-2.rut.sunet.se sunet.se/role=cnpg`
* Set up the storage class: `rsync -a k8s internal-sto4-test-k8sc-0.rut.sunet.se: && ssh internal-sto4-test-k8sc-0.rut.sunet.se kubectl apply -f k8s`
* **Profit**
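
As a quick sanity check after the steps above, the node listing can be filtered for anything that is not `Ready`. This is only a minimal sketch: the helper name `not_ready_nodes` is made up for illustration, and it assumes the default column layout of `kubectl get nodes --no-headers` (name, status, roles, age, version).

```shell
#!/bin/sh
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
# not_ready_nodes is a hypothetical helper, not part of microk8s or kubectl.
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}

# Usage on a controller node (not executed here):
#   microk8s.kubectl get nodes --no-headers | not_ready_nodes
```

With the example listing above, this would print only `internal-sto4-test-k8sc-2.rut.sunet.se`. A cordoned node, which reports a status such as `Ready,SchedulingDisabled`, would also be listed, which is probably what you want right after an install.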
# Setting up auth (satosa) and monitoring with thruk+naemon+loki+influxdb
* Get shib-sp metadata with `curl https://monitor-test.rut.sunet.se/Shibboleth.sso/Metadata > internal-sto4-test-satosa-1.rut.sunet.se/overlay/etc/satosa/metadata/monitor.xml`
* Get satosa metadata with `curl https://idp-proxy-test.rut.sunet.se/Saml2IDP/proxy.xml > internal-sto4-test-monitor-1.rut.sunet.se/overlay/opt/naemon_monitor/satosa.xml`
* Publish backend metadata to swamid. `ssh internal-sto4-test-satosa-1.rut.sunet.se cat /etc/satosa/metadata/backend.xml |xmllint --format - > rut.xml`
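
Redirecting `curl` straight into the overlay file replaces the old metadata even when the download fails. A more defensive variant could look like the sketch below. The function name `fetch_metadata` and the `FETCH`/`XML_CHECK` override variables are assumptions made for this example; `curl -fsS` and `xmllint --noout` are the real tools it defaults to.

```shell
#!/bin/sh
# Download metadata to a temporary file and only overwrite the destination
# if the download succeeded and the result passes an XML well-formedness check.
# FETCH and XML_CHECK are hypothetical knobs so the tools can be swapped out.
fetch_metadata() {
    url=$1
    dest=$2
    tmp=$(mktemp) || return 1
    if ${FETCH:-curl -fsS} "$url" > "$tmp" && ${XML_CHECK:-xmllint --noout} "$tmp"; then
        # Copy the content instead of moving the temp file so that the
        # permissions of an already existing destination file are kept.
        cat "$tmp" > "$dest"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        echo "metadata from $url failed download or XML check; $dest left untouched" >&2
        return 1
    fi
}

# Usage (not executed here):
#   fetch_metadata https://monitor-test.rut.sunet.se/Shibboleth.sso/Metadata \
#       internal-sto4-test-satosa-1.rut.sunet.se/overlay/etc/satosa/metadata/monitor.xml
```

The check only confirms that the file is well-formed XML; it does not verify that it is valid SAML metadata.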
## Day 2 operations:
### Rolling upgrade:
Drain one controller at a time with:
kubectl drain internal-sto4-test-k8sc-0.rut.sunet.se --ignore-daemonsets
After the first node is drained and upgraded, restart the calico controller with:
kubectl rollout restart deployment calico-kube-controllers -n kube-system
After that, restart the calico-node pod running on that host by deleting it; it should be recreated automatically by its DaemonSet.
kubectl delete pod calico-node-???? -n kube-system
Continue with the workers (including PG nodes):
kubectl drain internal-sto4-test-k8sw-0.rut.sunet.se --force --ignore-daemonsets --delete-emptydir-data --disable-eviction
kubectl delete pod calico-node-???? -n kube-system
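
The `????` suffix in the delete commands above has to be looked up per node. One way to do that is sketched below. The helper name `calico_pod_on_node` is invented for this example, and it assumes the calico-node pods carry the label `k8s-app=calico-node`, as in the upstream Calico manifests; check with `kubectl get pods -n kube-system --show-labels` if unsure.

```shell
#!/bin/sh
# Print the calico-node pod(s) scheduled on the given node, as "pod/<name>".
# Assumption: the DaemonSet pods are labelled k8s-app=calico-node.
# calico_pod_on_node is a hypothetical helper name.
calico_pod_on_node() {
    kubectl get pods -n kube-system \
        -l k8s-app=calico-node \
        --field-selector "spec.nodeName=$1" \
        -o name
}

# Usage (not executed here):
#   kubectl delete -n kube-system "$(calico_pod_on_node internal-sto4-test-k8sc-0.rut.sunet.se)"
```

Because `-o name` already returns the `pod/` prefix, the result can be passed directly to `kubectl delete`.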
### Calico problems
Calico can get into a bad state. Look for messages such as `Candidate IP leak handle` and `too old resource version` in the logs of the calico-kube-controllers pod. If these are found, Calico can be restarted with:
kubectl rollout restart deployment calico-kube-controllers -n kube-system
kubectl rollout restart daemonset calico-node -n kube-system
This will disrupt the whole cluster for a few seconds.
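
Since the restart is disruptive, it can help to confirm the symptoms first. The following is a rough sketch that counts log lines containing either of the two messages named above; the function name `calico_problem_count` is an assumption for illustration only.

```shell
#!/bin/sh
# Count log lines that contain one of the two known symptoms.
# Reads log text on stdin and prints the number of matching lines.
# calico_problem_count is a hypothetical helper name.
calico_problem_count() {
    # grep -c exits non-zero when the count is 0, so mask that exit status.
    grep -c -E 'Candidate IP leak handle|too old resource version' || true
}

# Usage (not executed here):
#   kubectl logs -n kube-system deployment/calico-kube-controllers | calico_problem_count
```

A result of `0` suggests the controller logs are clean and a restart is probably not needed for this particular problem.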
### Backup
Install Velero backup from https://github.com/vmware-tanzu/velero/releases
wget https://github.com/vmware-tanzu/velero/releases/download/v1.15.2/velero-v1.15.2-linux-amd64.tar.gz
tar xzf velero-v1.15.2-linux-amd64.tar.gz
cp velero-v1.15.2-linux-amd64/velero /usr/local/bin/
Get the S3 credentials from sto3 and save them in a file called `credentials-velero`, which is used during the install:
velero install --features=EnableCSI --use-node-agent --provider aws --plugins velero/velero-plugin-for-aws:v1.2.1 --bucket velero --secret-file ./credentials-velero --use-volume-snapshots=true --backup-location-config region=sto3,s3ForcePathStyle="true",s3Url=https://s3.sto3.safedc.net --snapshot-location-config region=sto3 --wait
velero schedule create prod-schedule --schedule="0 3 * * *"
velero backup create rut-prod --default-volumes-to-fs-backup=false --from-schedule prod-schedule
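
The `credentials-velero` file passed to `--secret-file` uses the AWS shared-credentials INI layout with a `[default]` profile. A minimal sketch for writing it is shown below; the function name `write_velero_credentials` and the environment variable names `S3_ACCESS_KEY` and `S3_SECRET_KEY` are assumptions for this example, not something Velero defines.

```shell
#!/bin/sh
# Write an AWS-style credentials file for `velero install --secret-file`.
# Argument: optional output path (defaults to ./credentials-velero).
# S3_ACCESS_KEY and S3_SECRET_KEY are hypothetical variable names that are
# expected to hold the key pair obtained from sto3.
write_velero_credentials() {
    out=${1:-./credentials-velero}
    (
        umask 077   # keep the secret readable by the current user only
        printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
            "${S3_ACCESS_KEY:?set S3_ACCESS_KEY}" \
            "${S3_SECRET_KEY:?set S3_SECRET_KEY}" > "$out"
    )
}

# Usage (not executed here):
#   S3_ACCESS_KEY=... S3_SECRET_KEY=... write_velero_credentials ./credentials-velero
```

The schedule expression `0 3 * * *` used above is standard cron syntax for a daily run at 03:00.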
### Kubernetes logs and introspection
vi logging.yaml #Change destination to prod monitoring server
scp logging.yaml internal-sto4-test-k8sc-0.rut.sunet.se:
ssh internal-sto4-test-k8sc-0.rut.sunet.se
helm repo add grafana https://grafana.github.io/helm-charts
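helm upgrade <release-name> grafana/k8s-monitoring -f logging.yaml #<release-name> is the name of the installed release; logging.yaml is the file copied above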