On a new install:

  • tofu apply to create machines
  • change the hostname to the FQDN with hostnamectl; changing it on a node that is already part of a running cluster will break the cluster
  • register dns with knotctl add -z rut.sunet.se -n internal-sto4-test-k8sm-1.rut.sunet.se. -d 2001:6b0:6c::449 -r AAAA
  • ./prepare-iaas-debian ${each host}
  • ./addhost -b ${each host} (a loop over all hosts is sketched after the secrets example below)
  • ./edit-secrets ${each controller host} and add the cloud credentials, something like:
---
microk8s_secrets:
  kube-system:
    cloud-config:
      - key: cloud.conf
        value: DEC::PKCS7[
[Global]
auth-url = https://v2.api.sto4.safedc.net:5000/v3/
application-credential-id = 12ed12312d123123423412
application-credential-secret = Shuoghij4thie4liith0peiki5eVul3oogaibaechiej8xaodeePhahjaemange4Ahla
region = sto4]!
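
The per-host steps above can be scripted; a minimal shell sketch (host names are from this cluster, the loop bounds are an assumption and should match whatever tofu actually created):

# run on each new VM before it joins anything: set the FQDN (example name)
hostnamectl set-hostname internal-sto4-test-k8sc-0.rut.sunet.se

# then, from a checkout of this repo, prepare and add every host
for host in internal-sto4-test-k8sc-{0..2}.rut.sunet.se \
            internal-sto4-test-k8sw-{0..7}.rut.sunet.se \
            internal-sto4-test-k8spg-{0..2}.rut.sunet.se; do
  ./prepare-iaas-debian "$host"
  ./addhost -b "$host"
done
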
  • Add to cosmos-rules.yaml:

'^internal-sto4-test-k8sc-[0-9].rut.sunet.se$':
  rut::infra_ca_rp:
  sunet::microk8s::node: 
    channel: 1.31/stable
  sunet::frontend::register_sites:
    sites:
      kubetest.rut.sunet.se:
        frontends:
        - se-fre-lb-1.sunet.se
        - se-tug-lb-1.sunet.se
        port: '30443'
'^internal-sto4-test-k8sw-[0-9].rut.sunet.se$':
  rut::infra_ca_rp:
  sunet::microk8s::node: 
    channel: 1.31/stable
'^internal-sto4-test-k8spg-[0-9].rut.sunet.se$':
  rut::infra_ca_rp:
  sunet::microk8s::node: 
    channel: 1.31/stable
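
To check that these rules took effect on a node (assuming sunet::microk8s::node delivers microk8s as a snap, the Tracking column should show the configured channel):

ssh internal-sto4-test-k8sc-0.rut.sunet.se snap list microk8s
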
  • Generate a join token on the first controller node with microk8s add-node
  • Add all other Controller nodes with microk8s join 89.46.21.119:25000/12345678987654345678976543/1234565
  • Add all other Worker nodes with microk8s join 89.46.21.119:25000/12345678987654345678976543/1234565 --worker
  • Taint controller nodes so they won't get workload: microk8s.kubectl taint nodes --selector=node.kubernetes.io/microk8s-controlplane=microk8s-controlplane cp-node=true:NoExecute
  • kubectl label node internal-sto4-test-k8spg-0.rut.sunet.se sunet.se/role=cnpg
  • kubectl label node internal-sto4-test-k8spg-1.rut.sunet.se sunet.se/role=cnpg
  • kubectl label node internal-sto4-test-k8spg-2.rut.sunet.se sunet.se/role=cnpg
  • Taint Postgres nodes so they won't get workload: microk8s.kubectl taint nodes --selector=sunet.se/role=cnpg pg-node=true:NoExecute
  • kubectl get nodes should show something like:
NAME                                     STATUS     ROLES    AGE   VERSION
internal-sto4-test-k8sc-2.rut.sunet.se   NotReady   <none>   16d   v1.28.7
internal-sto4-test-k8sw-5.rut.sunet.se   Ready      <none>   15m   v1.28.7
internal-sto4-test-k8sw-1.rut.sunet.se   Ready      <none>   15m   v1.28.7
internal-sto4-test-k8sw-2.rut.sunet.se   Ready      <none>   14m   v1.28.7
internal-sto4-test-k8sc-3.rut.sunet.se   Ready      <none>   16d   v1.28.7
internal-sto4-test-k8sw-3.rut.sunet.se   Ready      <none>   18m   v1.28.7
internal-sto4-test-k8sw-4.rut.sunet.se   Ready      <none>   16m   v1.28.7
internal-sto4-test-k8sw-0.rut.sunet.se   Ready      <none>   21m   v1.28.7
internal-sto4-test-k8sc-1.rut.sunet.se   Ready      <none>   16d   v1.28.7
  • Enable the addons needed for rut: microk8s enable ingress && microk8s enable cert-manager && microk8s enable community && microk8s enable cloudnative-pg && microk8s enable metrics-server
  • kubectl create namespace sunet-cnpg
  • Set up the storage class: rsync -a k8s internal-sto4-test-k8sc-0.rut.sunet.se: && ssh internal-sto4-test-k8sc-0.rut.sunet.se kubectl apply -f k8s (sanity checks follow this list)
  • Profit
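
A few sanity checks after the list above (a sketch; run on a controller or with an exported kubeconfig):

# taints on the controller and postgres nodes
microk8s.kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# the three cnpg-labelled nodes
microk8s.kubectl get nodes -l sunet.se/role=cnpg
# storage classes applied from the k8s directory
microk8s.kubectl get storageclass
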

Setting up auth (satosa) and monitoring with thruk+naemon+loki+influxdb

  • Get shib-sp metadata with curl https://monitor-test.rut.sunet.se/Shibboleth.sso/Metadata > internal-sto4-test-satosa-1.rut.sunet.se/overlay/etc/satosa/metadata/monitor.xml
  • Get satosa metadata with curl https://idp-proxy-test.rut.sunet.se/Saml2IDP/proxy.xml > internal-sto4-test-monitor-1.rut.sunet.se/overlay/opt/naemon_monitor/satosa.xml
  • Publish the backend metadata to SWAMID: ssh internal-sto4-test-satosa-1.rut.sunet.se cat /etc/satosa/metadata/backend.xml | xmllint --format - > rut.xml
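
The fetched metadata files can be sanity-checked before they are committed, for example:

# both files should be well-formed XML
xmllint --noout internal-sto4-test-satosa-1.rut.sunet.se/overlay/etc/satosa/metadata/monitor.xml
xmllint --noout internal-sto4-test-monitor-1.rut.sunet.se/overlay/opt/naemon_monitor/satosa.xml
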

Day 2 operations:

Rolling upgrade:

Drain one controller at a time with:

kubectl drain internal-sto4-test-k8sc-0.rut.sunet.se --ignore-daemonsets

After the first node is drained and upgraded, restart the calico controller with:

kubectl rollout restart deployment calico-kube-controllers -n kube-system 

After that, restart the calico-node pod running on that host by deleting it. It will be recreated automatically by the DaemonSet controller.

kubectl delete pod calico-node-???? -n kube-system
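
The ???? suffix differs per pod; to find the calico-node pod on a given host without guessing (k8s-app=calico-node is the upstream Calico label, which microk8s uses):

kubectl get pods -n kube-system -l k8s-app=calico-node \
  --field-selector spec.nodeName=internal-sto4-test-k8sc-0.rut.sunet.se
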

Continue with the workers (including the PG nodes):

kubectl drain internal-sto4-test-k8sw-0.rut.sunet.se --force --ignore-daemonsets --delete-emptydir-data --disable-eviction
kubectl delete pod calico-node-???? -n kube-system
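
Note that kubectl drain leaves the node cordoned, so once a node is upgraded and healthy it has to be allowed to take workload again, for example:

kubectl uncordon internal-sto4-test-k8sw-0.rut.sunet.se
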

Calico problems

Calico can get into a bad state. Look for problems like "Candidate IP leak handle" and "too old resource version" in the calico-kube-controllers pod logs. If these are found, Calico can be restarted with:

kubectl rollout restart deployment calico-kube-controllers -n kube-system
kubectl rollout restart daemonset calico-node -n kube-system

This will disrupt the whole cluster for a few seconds.
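
To watch the restart complete:

kubectl rollout status deployment calico-kube-controllers -n kube-system
kubectl rollout status daemonset calico-node -n kube-system
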

Backup

Install the Velero backup tool from https://github.com/vmware-tanzu/velero/releases

wget https://github.com/vmware-tanzu/velero/releases/download/v1.15.2/velero-v1.15.2-linux-amd64.tar.gz
tar xzf velero-v1.15.2-linux-amd64.tar.gz
cp velero-v1.15.2-linux-amd64/velero /usr/local/bin/

Get S3 credentials from sto3 and save them into a file called credentials-velero to be used during the install:

[default]
aws_access_key_id=123123123123123123123123123123
aws_secret_access_key=12312312312312312312312213123

Export the kubectl config file for velero to use:

microk8s config > ~/.kube/config
velero install --features=EnableCSI --use-node-agent --provider aws --plugins velero/velero-plugin-for-aws:v1.2.1 --bucket velero --secret-file ./credentials-velero --use-volume-snapshots=true --backup-location-config region=sto3,s3ForcePathStyle="true",s3Url=https://s3.sto3.safedc.net --snapshot-location-config region=sto3 --wait
velero schedule create prod-schedule --schedule="0 3 * * *" --snapshot-move-data=true
velero backup create rut-prod --snapshot-move-data --default-volumes-to-fs-backup=false --from-schedule prod-schedule --exclude-namespaces velero
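
To confirm that scheduled backups are actually being produced:

velero schedule get
velero backup get
velero backup describe rut-prod --details
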

Restore

  1. Install a new cluster according to the new install instructions. The addons for rut are not needed as they are included in the backup.
  2. Install Velero.
  3. Do the actual restore.

velero restore create --from-backup test-schedule-20250411030038
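
The restore can be followed with:

velero restore get
velero restore describe ${restore name} --details
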

Kubernetes logs and introspection

vi logging.yaml #Change destination to prod monitoring server
scp logging.yaml internal-sto4-test-k8sc-0.rut.sunet.se:
ssh internal-sto4-test-k8sc-0.rut.sunet.se
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install ${release name} grafana/k8s-monitoring -f logging.yaml
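
A slightly fuller helm invocation for reference; the release name and namespace below are examples only, nothing in this repo pins them down:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# release name "k8s-monitoring" and namespace "monitoring" are assumptions
helm upgrade --install k8s-monitoring grafana/k8s-monitoring \
  --namespace monitoring --create-namespace \
  -f logging.yaml
# confirm the release and its pods
helm list -n monitoring
kubectl get pods -n monitoring
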