External security groups sync job failure issue

Had an interesting issue this morning with the external security groups sync job. We use role-based group membership controls in Active Directory to control who can do what within the Octopus system, along with many other systems.

One of the two nodes running Octopus Deploy ran out of memory shortly before the external security groups sync job ran, which caused a COM exception on every call to AD asking for group membership.

Although this obviously isn’t an Octopus-caused issue in itself, the way Octopus handled it made for an interesting bit of troubleshooting. When the calls to AD failed, Octopus removed the group memberships from each user, leaving everyone unable to do anything in the system!

Waiting an hour for the job to run again would obviously have solved the issue. Instead, I updated the DB so that Octopus thought a run of the job was overdue, which let me resolve it fairly quickly.

Just wondering if it might be worth considering not dropping the user group membership if the call to AD fails?

Hi James,

Thank you for reaching out to us with your query.

You are correct that if the call to Active Directory fails the end result will be the removal of group memberships. This is because the internal mechanism that is used to ensure a user has the correct group memberships works on a clear-and-repopulate approach. It’s unlikely this will be changed as it aims to ensure that Octopus doesn’t end up with stale group memberships.
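
To illustrate why a failed directory call leaves users with no groups, here is a minimal Python sketch of a clear-and-repopulate sync. It is only a model of the behaviour described above, not the actual Octopus implementation: the function `clear_and_repopulate`, the `fetch_groups` callback, the in-memory `memberships` dictionary, and the sample users and groups are all hypothetical.

```python
from typing import Callable, Dict, List, Set


def clear_and_repopulate(
    memberships: Dict[str, Set[str]],
    users: List[str],
    fetch_groups: Callable[[str], Set[str]],
) -> None:
    """Hypothetical model: wipe each user's groups, then refill from the directory."""
    for user in users:
        # Clear first, so stale groups can never survive a sync.
        memberships[user] = set()
        try:
            # Repopulate from the directory.
            memberships[user] = fetch_groups(user)
        except Exception:
            # The lookup failed, so the user keeps the cleared (empty) set.
            pass


def failing_lookup(user: str) -> Set[str]:
    # Stands in for the COM exceptions raised on every call to AD.
    raise OSError("directory unavailable")


memberships = {"alice": {"Deployers"}, "bob": {"Admins"}}
clear_and_repopulate(memberships, ["alice", "bob"], failing_lookup)
print(memberships)  # every user now has an empty set of groups
```

In this model the clear happens before the lookup, so any failure during the lookup, whatever its cause, ends with the user having no groups until the next successful run.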

It might be useful to note that it is possible to manually re-run the relevant task using the API. This will fix the group memberships and should restore everything (assuming that whatever issue was preventing syncing has been resolved). You can find the script below:

I hope this is helpful. Please let me know if you have any questions.

Best Regards,

Charles

Thanks for taking the time to reply. This is my first significant issue with Octopus in a good number of years’ use.

That’s a reasonable aim, although I’d be more selective with my error handling personally.

There are likely to be exceptions where I’d consider this valid (failure to retrieve individual users etc. - we see this on a regular basis and are quite happy with the permissions being removed), but a failure to communicate with AD would probably be cause, in my mind, to abort the sync process and try again later.
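
As a rough sketch of the distinction I have in mind (Python, purely illustrative): the exception types `DirectoryUnavailable` and `UserNotFound`, the function `selective_sync`, and the lookup callbacks are all made-up names, and I am assuming the sync can tell the two kinds of failure apart. It says nothing about how Octopus is actually written.

```python
from typing import Callable, Dict, List, Set


class DirectoryUnavailable(Exception):
    """Hypothetical: the directory could not be reached at all."""


class UserNotFound(Exception):
    """Hypothetical: the directory answered, but this user could not be resolved."""


def selective_sync(
    memberships: Dict[str, Set[str]],
    users: List[str],
    fetch_groups: Callable[[str], Set[str]],
) -> bool:
    """Return True if the sync completed, False if it aborted without changes."""
    updated: Dict[str, Set[str]] = {}
    for user in users:
        try:
            updated[user] = fetch_groups(user)
        except UserNotFound:
            # Per-user failure: removing this user's permissions is acceptable.
            updated[user] = set()
        except DirectoryUnavailable:
            # Cannot talk to the directory: abort and keep existing memberships.
            return False
    # Only apply the new memberships once every lookup has been attempted.
    memberships.update(updated)
    return True


def down_lookup(user: str) -> Set[str]:
    raise DirectoryUnavailable("cannot reach AD")


def partial_lookup(user: str) -> Set[str]:
    if user == "bob":
        raise UserNotFound(user)
    return {"Deployers"}


memberships = {"alice": {"Admins"}, "bob": {"Admins"}}

ok_during_outage = selective_sync(memberships, ["alice", "bob"], down_lookup)
after_outage = {user: set(groups) for user, groups in memberships.items()}

completed = selective_sync(memberships, ["alice", "bob"], partial_lookup)
```

With the directory down, the sync reports failure and nobody loses access. Once it is reachable again, the user who cannot be resolved is emptied while everyone else is refreshed.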

Hi James,

Thank you for getting back to me.

It’s great to hear that you’ve had good experiences with Octopus!

> There are likely to be exceptions where I’d consider this valid (failure to retrieve individual users etc. - we see this on a regular basis and are quite happy with the permissions being removed), but a failure to communicate with AD would probably be cause, in my mind, to abort the sync process and try again later.

I can completely see how taking a more selective approach would make sense. There is a difficult balance to strike between security and reliability, so there may not be a single correct approach. I’ll share your comments with the product team so they can consider whether this is something we might want to change.

Thank you for sharing this feedback with us!

Best Regards,

Charles