net-ops

Author	SHA1	Message	Date
Patrik Lundin	aa88795ee0	sunet-fleetlock: also handle ReadTimeout. Turns out this was not caught by ConnectionError.	2024-07-03 14:13:22 +02:00
Patrik Lundin	e315282bc5	Use more strict exception checking. This is probably wide enough, and we do not need weird extra handling of our own exception etc. Thanks to @mickenordin for keeping me honest :).	2024-06-17 12:40:12 +02:00
Patrik Lundin	4b8b8887f6	sunet-fleetlock: handle connection errors. In order to handle upgrades of the fleetlock server when running only one server, we need to gracefully handle connection errors such as connection refused or timeouts. Because there are several different ways the connection can fail and it is hard to keep track of them all, just catch everything. We then also need special handling of our own timeout exception so we are not accidentally stuck retrying forever. Also fix so we actually use the request_timeout arg for individual HTTP requests instead of the global timeout. While here, run isort to keep imports tidy.	2024-06-17 12:07:22 +02:00
Patrik Lundin	7baf9affb1	Add fleetlock support to run-cosmos. Makes run-cosmos request a fleetlock lock before running the cosmos "update" and "apply" steps. This is helpful for making sure only one (or several) machine out of some set of machines runs cosmos changes at a time. This way, if cosmos (or puppet) decides that a service needs to be restarted, this will only happen on a subset of machines at a time. When the cosmos "apply" is done, a fleetlock unlock request will be performed so the other machines can progress. The unlock code in run-cosmos will also run the new tool sunet-machine-healthy to decide whether things are good before unlocking. This way, if a restarted service breaks, the unlock attempt is stopped, which in turn keeps the other machines from breaking their service as well, giving an operator time to figure out what is wrong.	2023-06-17 08:10:00 +02:00

4 commits
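
The three sunet-fleetlock commits above describe retrying when the server is unreachable, catching only specific exceptions, and noticing that a read timeout was not covered by ConnectionError. A minimal sketch of that pattern follows; it is not the actual sunet-fleetlock code. It assumes the tool uses the requests library, where ReadTimeout is not a subclass of ConnectionError. To stay runnable without a network or third-party packages, the sketch uses Python's built-in ConnectionError and TimeoutError as stand-ins, which have the same relationship. The names call_with_retry, LockTimeout and make_flaky_request are hypothetical.

```python
import time


class LockTimeout(Exception):
    """Hypothetical error raised once the overall deadline has passed."""


def call_with_retry(do_request, retry_on=(ConnectionError, TimeoutError),
                    total_timeout=5.0, retry_sleep=0.1):
    """Call do_request() until it succeeds or total_timeout seconds elapse.

    Only the exception types listed in retry_on are retried; anything else
    propagates immediately, which is the "strict exception checking" idea.
    """
    deadline = time.monotonic() + total_timeout
    while True:
        try:
            return do_request()
        except retry_on as exc:
            # Give up once the overall deadline has passed, so a server
            # that never comes back cannot keep us retrying forever.
            if time.monotonic() >= deadline:
                raise LockTimeout("gave up waiting for the server") from exc
            time.sleep(retry_sleep)


def make_flaky_request():
    """Build a stand-in request: fails twice, then succeeds (illustration only)."""
    failures = [
        ConnectionRefusedError("connection refused"),  # a ConnectionError subclass
        TimeoutError("read timed out"),                # NOT a ConnectionError subclass
    ]

    def do_request():
        if failures:
            raise failures.pop(0)
        return "locked"

    return do_request


if __name__ == "__main__":
    print(call_with_retry(make_flaky_request(), retry_sleep=0))
```

Passing retry_on=(ConnectionError,) to the same helper lets the TimeoutError escape on the second attempt, which mirrors the situation the ReadTimeout commit describes. With requests, the tuple would presumably list requests.exceptions.ConnectionError and requests.exceptions.ReadTimeout instead.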