Our health checks and deployments to certain target servers have started failing today, and I have been unable to determine a cause. It is specifically the servers to which Octopus should connect using SSH - all of them.
I’ve checked that we can connect to those servers ourselves, and I’ve checked the IP address whitelists to ensure that Octopus should be granted access, and yet all attempts to have Octopus connect to these servers have failed.
The failures seem to have started between about 5 and 21 hours ago - sometime overnight, our time.
Any advice on how we can more precisely determine the cause of the failure? Is there anything going on at the Octopus end that we should know about, that might be a factor?
I can pull the call stack of the failure out of the logs, but it doesn’t seem to say anything very interesting. We already know that the SSH simply cannot find the target.
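For narrowing down where a connection like this breaks, a probe along the following lines could be run from a machine outside the target network. This is only a sketch, assuming Python 3 is available there; `probe_ssh` and the host name `target.example.com` are hypothetical names used for illustration. It checks only name resolution, the TCP connection and the SSH banner, not authentication, and it cannot reproduce the network path that Octopus itself uses.

```python
import socket


def probe_ssh(host, port=22, timeout=5.0):
    """Report how far a plain TCP connection to an SSH port gets.

    Returns a (stage, detail) tuple. This does not authenticate;
    it only separates DNS, TCP and banner problems.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # An SSH server sends an identification line starting with "SSH-".
            banner = sock.recv(256).decode("ascii", errors="replace").strip()
    except socket.gaierror as exc:
        # The host name could not be resolved.
        return ("dns_failed", str(exc))
    except socket.timeout:
        # No answer in time, often a firewall silently dropping packets.
        return ("timeout", f"no response within {timeout} seconds")
    except OSError as exc:
        # Refused, unreachable, reset and similar errors.
        return ("connect_failed", str(exc))

    if banner.startswith("SSH-"):
        return ("ok", banner)
    return ("unexpected_banner", banner)


if __name__ == "__main__":
    # Hypothetical host name; replace with a real deployment target.
    print(probe_ssh("target.example.com"))
```

A `timeout` result tends to point at a firewall or whitelist rule, while `dns_failed` points at name resolution rather than the server itself.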
I’ve checked the instance and other than an update to 2020.3.3 three days ago, there haven’t been any other changes to the instance.
Scanning the Octopus server logs for the past day, the only issue I can see is that your storage limit was reached around 7-8 hours ago, causing a lot of failures with log files and any other process that needs to write to the storage.
It looks like the space being used may have decreased very slightly since then, but it is still hovering near the limit: 19.01 / 20.00 GB. You will need to review the retention policies in place within lifecycles and the package repository to work on reducing this.
To look into the issue any further I’ll need to log in to your instance. Are you happy for me to do this?
Regarding the storage space, yes, I’m aware that we hit the limit earlier today, and I deleted a number of unneeded packages to reduce the amount of space in use. Obviously this is only a stopgap measure.
Well, just as mysteriously as the connections started failing one night, they started working again the next night.
Since the target servers are also cloud services, there’s only so much “local network infrastructure” to check, but I’ll trawl through logs on the other end to see if I spot anything. At the moment, though, it’s working, so all good.
Thank you for your attempts to identify a cause. I’ll post again if it fails again.