Cloud SSH Targets disappearing

We have migrated to the cloud version of Octopus and, IP whitelisting and manual variable transfer aside, found it to be a relatively painless exercise. Except for SSH targets. I can successfully create them; they have a static IP on our side and are happily used day in and day out. Except that randomly, usually after a weekend, we come in and they have disappeared completely from the system. No audit tasks/logs. No machine references anywhere. Even the target roles have vanished… The servers themselves have not changed at all.
I recreated the SSH target, and Octopus treats it as a new machine, brings it all back up (it is set up to redeploy) and everything is right… until the next time the SSH target disappears…
Has anyone experienced the same? Is this a known issue?

Hi Nick,

Thanks for getting in touch!

Sorry to hear that you’re having issues with SSH targets. At first glance, it appears that it’s related to machine policies - I’m guessing that it’s not able to contact the machine, and then it’s getting deleted after it times out. Can you check to see if there’s a machine policy set up?

If that is the case, it’s a bit odd, as these deletions should appear in the audit log - we’ll need to investigate.

If it isn’t the case, and you’re happy to give us permission to log in to your instance, we can take a closer look and see what’s happening. Make sure you let us know what your instance name is too.

Look forward to getting to the bottom of this.

Cheers,
Matt

Hi Matt,

Another weekend, and again no more SSH targets. I have just checked and we are using the Default machine policy. I have made no changes to this, but the SSH Endpoint Script Policy is set to Use Custom Script, although the script is empty. The check window is 60 minutes. This is interesting, as we had SSH targets in our old self-hosted version and we never had this issue - and it has the same machine policy…

I will re-add the instances this morning, but I am happy for you guys to jump in and take a look. Am I able to DM you the details?

Thanks

Hi Nick,

Matt has passed this on to me for further investigation. I was able to log into your instance using our administration process, so no need to send details.

The easiest way to view why deletion occurred is to go to the Audit log and filter by Event Category of Document Deleted. You can use this link as a shortcut.

I can see that filtering by Machine Deleted does not show that record, which is most likely the filter you applied. I’ll have to follow up and see if that is a bug.

The audit log reports that the machine was deleted after 2 days of being unavailable. Your machine policy is set to automatically delete deployment targets after 2 days (under the Cleaning Up Unavailable Deployment Targets header).
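
If you want to double-check that setting outside the UI, something like the sketch below will read it back over the REST API. Treat it as a sketch only: the instance URL and API key are placeholders, and the field names (MachineCleanupPolicy and friends) are my assumption of how the policy resource is shaped rather than something I verified on your instance.

```python
# Sketch only: read the machine policies and print their cleanup settings.
# OCTOPUS_URL and API_KEY are placeholders; the MachineCleanupPolicy field
# names are assumed, not confirmed in this thread.
import requests

OCTOPUS_URL = "https://your-instance.octopus.app"
API_KEY = "API-XXXXXXXXXXXXXXXX"
HEADERS = {"X-Octopus-ApiKey": API_KEY}

response = requests.get(f"{OCTOPUS_URL}/api/machinepolicies/all", headers=HEADERS)
response.raise_for_status()

for policy in response.json():
    cleanup = policy.get("MachineCleanupPolicy", {})
    print(policy["Name"],
          cleanup.get("DeleteMachinesBehavior"),         # e.g. "DeleteUnavailableMachines"
          cleanup.get("DeleteMachinesElapsedTimeSpan"))  # e.g. "2.00:00:00" (2 days)
```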

Unfortunately the Health Check logs from that period have already expired, so I can’t see why the machines were reported as unavailable. Are you expecting them to be unavailable?

I’ve removed my support account from your instance.

Regards,

Rob W

Hi Robert,

No, I am not expecting the machines to be unavailable.

The interesting thing here is that, compared to the hosted version we are migrating away from, there is no discernible difference in our configuration when it comes to SSH targets. The 2 days would be the weekends, during which time we have not deployed anything. What I do not understand is how the health check is failing… Admittedly we have not set a custom script, but we did not set one in the hosted version either. Reading the docs, my understanding is that there is a default check that is run when I do not enter anything in the custom bash script config. Again, reading between the lines, I assume that it a) carries out a connectivity check and b) does a system requirements check. It does not do a disk space check.

So given that these are my assumptions and that the machine policy is set to check every 60 minutes, why do I lose my machines over the weekend? Fair enough on the 2-day deletion, but the implication is that there are zero connectivity checks being run on the SSH targets, as opposed to the Tentacles, which are obviously being checked every 60 minutes. Or my assumptions are incorrect (this may, therefore, require a documentation update) and I either need to write my own bash script OR choose to do just a connectivity check.

The interesting thing is, I just checked the machine I re-added yesterday: there have been no deployments in the last hour, but it says a connectivity check was done 2 minutes ago. So now I am more perplexed - the signs indicate that connectivity checks are happening, but at some point we are losing the servers. If it was just one server, I could maybe put it down to an AWS issue, but it has been all of them (only 3 at this point), and none are affected in the hosted version… and the SSH targets have been accessed from both the cloud and hosted versions. When I review the task audit logs for health checks, I don’t see any recorded events for Saturdays or Sundays from any machines… (Filter: Health Check, Status: All)

Should I be expecting to see Health Check Audit records for the weekends? Can you see them on your side?

Thanks

Nick

Hi Nick,

Is the automatic deletion of machines something you want to happen (the option was enabled on the 1st of August, perhaps inadvertently)? If not, you can change that setting back to not automatically delete, or perhaps extend it to 3 days to cover the weekend.
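
If you’d rather script that change than click through the UI, it would look roughly like the sketch below. Again, this is only a sketch with assumed field names, and 3.00:00:00 is just the 3-day example.

```python
# Sketch: extend the auto-delete window on a machine policy to 3 days.
# The policy name, field names, and URL/key are assumptions for illustration.
import requests

OCTOPUS_URL = "https://your-instance.octopus.app"
API_KEY = "API-XXXXXXXXXXXXXXXX"
HEADERS = {"X-Octopus-ApiKey": API_KEY}

policies = requests.get(f"{OCTOPUS_URL}/api/machinepolicies/all", headers=HEADERS).json()
policy = next(p for p in policies if p["Name"] == "Default Machine Policy")

policy["MachineCleanupPolicy"]["DeleteMachinesElapsedTimeSpan"] = "3.00:00:00"
# Or disable automatic deletion entirely:
# policy["MachineCleanupPolicy"]["DeleteMachinesBehavior"] = "DoNotDelete"

resp = requests.put(f"{OCTOPUS_URL}/api/machinepolicies/{policy['Id']}",
                    headers=HEADERS, json=policy)
resp.raise_for_status()
```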

The health check runs for both Tentacles and SSH targets. You can see the results and which machines were checked by opening one of the tasks on this list. If no script is supplied for the SSH targets, we check whether Mono is installed (unless the machine is set to not require Mono). We also check whether Calamari is present. This is the script that runs.
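
If it helps, a rough sketch like the one below can list the recent health check tasks and their outcomes, so you can see which runs failed and when. The name=Health filter and the response shape are assumptions about the tasks API rather than anything confirmed here.

```python
# Sketch: list recent health check tasks with their state and start time.
# The "name"/"take" query parameters and the "Items" field are assumptions.
import requests

OCTOPUS_URL = "https://your-instance.octopus.app"
API_KEY = "API-XXXXXXXXXXXXXXXX"
HEADERS = {"X-Octopus-ApiKey": API_KEY}

tasks = requests.get(f"{OCTOPUS_URL}/api/tasks",
                     params={"name": "Health", "take": 20},
                     headers=HEADERS).json()

for task in tasks.get("Items", []):
    print(task.get("StartTime"), task.get("State"), task.get("Description"))
```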

I can currently see that the health checks failed before 5:38am due to the SSH target not responding. The health check at 6:38am succeeded again (QLD time).

Are you looking at the task list or the audit log? We only keep the last 7 health check logs (we should probably increase that), so the ones from the weekend would have been removed.

If you look at the audit log, and filter by the event group Machine Health changed (direct link), you can see that around 10pm every weekday the machine goes offline, then around 6am it comes back online. My guess is that you have an AWS auto-scale rule or automation script setup that turns the machines off to save cost.
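
If you want to pull that pattern out yourself, a sketch along these lines will walk the audit event feed and print the health-related entries. I’m matching the category and message text client-side because the exact filter parameter names are an assumption on my part.

```python
# Sketch: scan the audit event feed for machine health transitions, to
# confirm the ~10pm offline / ~6am online pattern. The matching rule below
# is an assumption; adjust it to whatever your audit log entries show.
import requests

OCTOPUS_URL = "https://your-instance.octopus.app"
API_KEY = "API-XXXXXXXXXXXXXXXX"
HEADERS = {"X-Octopus-ApiKey": API_KEY}

events = requests.get(f"{OCTOPUS_URL}/api/events",
                      params={"take": 200},
                      headers=HEADERS).json()

for event in events.get("Items", []):
    if "Health" in event.get("Category", "") or "health" in event.get("Message", "").lower():
        print(event.get("Occurred"), event.get("Category"), event.get("Message"))
```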

Regards,

Robert W

Thanks, Robert,

Yes, you are right, we have added in the delete, and that is the difference between our on-prem and cloud versions. The big reason is that as we migrated to the cloud we also migrated everything else to be more “auto-magical”, as opposed to hand-baked instances to start dev off with. So separately we need to deal with machine deletion for discarded instances, but I will look at getting a script to run in the ASG lifecycle hook when an instance is shut down, to decommission it from Octopus. Having unreachable instances obviously causes deployments to fail - and that is our preference - but I only want legitimately unreachable instances registered as infrastructure.
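
For what it’s worth, my rough plan for that lifecycle hook is a small Lambda along the lines of the sketch below. The Octopus URL and key are placeholders, and matching the target by private IP is an assumption I still need to verify against how our targets are registered.

```python
# Sketch of a Lambda for the ASG termination lifecycle hook: look up the
# instance's private IP, find the matching Octopus SSH target, delete it,
# then let the Auto Scaling group continue terminating the instance.
# The Octopus URL/key and the "endpoint contains the IP" matching rule
# are assumptions for illustration only.
import boto3
import requests

OCTOPUS_URL = "https://your-instance.octopus.app"
API_KEY = "API-XXXXXXXXXXXXXXXX"
HEADERS = {"X-Octopus-ApiKey": API_KEY}

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")


def handler(event, context):
    detail = event["detail"]  # EventBridge event for EC2_INSTANCE_TERMINATING
    instance_id = detail["EC2InstanceId"]

    # Find the instance's private IP so we can match it to an Octopus target.
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    private_ip = reservations[0]["Instances"][0]["PrivateIpAddress"]

    # Delete any deployment target whose endpoint references that IP.
    machines = requests.get(f"{OCTOPUS_URL}/api/machines/all", headers=HEADERS).json()
    for machine in machines:
        endpoint = machine.get("Endpoint", {})
        address = endpoint.get("Uri", "") or endpoint.get("Host", "")
        if private_ip in address:
            requests.delete(f"{OCTOPUS_URL}/api/machines/{machine['Id']}",
                            headers=HEADERS).raise_for_status()

    # Tell the ASG it can finish terminating the instance.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
```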

On another note, we don’t have a shutdown policy in place to save costs (though we should), so I find it interesting that we are having that SSH connectivity issue, unless there is a maintenance window in there caused by OpsWorks. Is there a way to change the timing of the Octopus health checks? I would be happy to have them hourly still, but maybe kick them off at 01 past the hour instead of 45 past.

Thanks

Nick

Hi,

Unfortunately there isn’t a way to set the health checks to run on that schedule. The best you could do is set the next one to run in 16 minutes (so it runs on the hour) and then set it immediately back to an hour. However, that will be thrown off by a reboot of the server.
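
If you did want to script that nudge rather than do it by hand, it would look roughly like the sketch below; the MachineHealthCheckPolicy/HealthCheckInterval field names are my assumption of the policy resource shape, so treat it as a sketch.

```python
# Sketch of the "run the next check in 16 minutes, then revert" workaround.
# Policy name, field names, and URL/key are assumptions for illustration.
import requests

OCTOPUS_URL = "https://your-instance.octopus.app"
API_KEY = "API-XXXXXXXXXXXXXXXX"
HEADERS = {"X-Octopus-ApiKey": API_KEY}

policies = requests.get(f"{OCTOPUS_URL}/api/machinepolicies/all", headers=HEADERS).json()
policy = next(p for p in policies if p["Name"] == "Default Machine Policy")


def set_interval(interval):
    policy["MachineHealthCheckPolicy"]["HealthCheckInterval"] = interval
    requests.put(f"{OCTOPUS_URL}/api/machinepolicies/{policy['Id']}",
                 headers=HEADERS, json=policy).raise_for_status()


set_interval("00:16:00")  # nudge the next check onto the hour
# ...after that check has run, set it back:
set_interval("01:00:00")
```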

I just had a look again, and the SSH target was again unavailable between 10pm and 6am. Is there perhaps a firewall or network rule that blocks that port outside of those hours?

Robert,

I am going to have to do some digging on that one.

Thanks for your help

Nick
