Health metrics endpoint

Is there a way to get health metrics about the Octopus Deploy servers? Things like:

Num of current running tasks,
Num of async tasks,
Num of logs queued (Not sure if this is valid since switching to halibut),
Maintenance mode status,
Thread count,
Total Num of packages/projects/environments/users,
Any internal metrics that are not currently external or easily found from physical access to the box.

The reason for this is we currently have a Datadog script for 2.5.6 which loads information about the number of logs queued, total numbers for projects and so on, as well as the current number of active tasks. However, getting all this information takes several requests, and in 3.0 some of it is no longer accessible. It would be nice to have this accessible from one endpoint to make it less intensive and provide a consistent way to get health stats from Octopus. We already plan on releasing our health check to Datadog to add to their large list of integrations, so having this could really make this health check provide a lot of useful information and help you guys debug issues for customers that already have Datadog.
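To illustrate the several-request approach described above, here is a minimal Python sketch that gathers a few numbers into one flat dictionary. The resource paths, the `X-Octopus-ApiKey` header, the `TotalResults` and `State` field names, and the metric names are all assumptions for illustration and may not match a given Octopus version. It also only looks at the first page of tasks.

```python
import json
import urllib.request
from typing import Callable, Dict

# Assumed resource paths; adjust to whatever your Octopus version exposes.
COUNTED_RESOURCES = {
    "projects": "/api/projects",
    "environments": "/api/environments",
    "users": "/api/users",
}


def http_fetch(base_url: str, api_key: str) -> Callable[[str], dict]:
    """Return a function that GETs a path and decodes the JSON body."""
    def fetch(path: str) -> dict:
        request = urllib.request.Request(
            base_url.rstrip("/") + path,
            headers={"X-Octopus-ApiKey": api_key},  # assumed header name
        )
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.load(response)
    return fetch


def collect_metrics(fetch: Callable[[str], dict]) -> Dict[str, int]:
    """Issue one request per resource and flatten the results into one dict."""
    metrics = {}
    for name, path in COUNTED_RESOURCES.items():
        # Assumes paged collections report their size as "TotalResults".
        total = fetch(path).get("TotalResults", 0)
        metrics["octopus." + name + ".total"] = int(total)
    # Assumes each task item carries a "State" field; first page only.
    tasks = fetch("/api/tasks").get("Items", [])
    metrics["octopus.tasks.executing"] = sum(
        1 for task in tasks if task.get("State") == "Executing"
    )
    return metrics
```

Calling `collect_metrics(http_fetch("https://octopus.example.com", "API-KEY"))` (both values are placeholders) makes four requests per run, which is the overhead a single health endpoint would avoid.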

Hi Brent,

Thanks for getting in touch! We don’t currently expose this information, though you might be able to get it by querying SQL directly or using the API.

This is a nice list though. I’d really like there to be a good monitoring story for Octopus. If we did expose this information, what’s the best way to expose it? Would performance counters work?

Paul

Paul,

We currently have something in place for 2.6.X which reports metrics like the ones in the picture. We currently get the data by hitting the API in several different calls via a custom Datadog check. This approach seems to work fairly well, but some of the metrics are no longer valid in 3.0, and I was hoping monitoring would be a first-class citizen in Octopus Deploy. I’m not sure what the best way to expose this information is, but an HTTP request seems easy and simple to implement.

By performance counters do you mean WMI metrics? I feel this is a more complex solution, and it limits how you can interface with the health data, though even with Datadog you can report on the metrics, as it just uses Python and can interface with WMI information.

I would be willing to talk more and explain how we currently monitor our instances, as this is rather important to us. We love Octopus Deploy and want to be able to spot issues before they arise.

Guessing you have been busy with the Octo 3.0 release but wasn’t sure if there was anything more on this issue?

Hi Brent,

We’ve made a few changes:

  • In 3.1 we’ve implemented a reporting feature, which has a table you can query for everything related to deployment history, plus an XML feed that you can pull into Excel/PowerBI. It makes answering questions like “How many failed deployments this month” easy. And the data remains even when retention policies run, so you can see historic values.
  • In 3.0 we also added HTTP logging for individual HTTP requests, to find UI performance issues.

I’ll write up the documentation on 3.1-beta reporting soon, then maybe we can do a screen sharing session and I can demo it, and we can discuss any changes that you might need?

Paul

Forgot to add - I think that some of the other metrics (# of packages, # of projects, etc.) might be best queried from the database - I don’t think we’d build a specific endpoint anytime soon for those.
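As a rough sketch of what querying those totals from the database could look like, the Python below counts rows per table over any DB-API connection. The table names are hypothetical placeholders, not the confirmed Octopus schema. SQLite is used here only so the example runs standalone, and a SQL Server driver connection would be the real target.

```python
import sqlite3

# Hypothetical table names: check the real schema before relying on these.
TABLES = {"projects": "Project", "users": "User"}


def count_rows(connection, tables=TABLES):
    """Return {metric_name: row_count} for each table in the mapping."""
    cursor = connection.cursor()
    counts = {}
    for metric, table in tables.items():
        # Table names come from the fixed mapping above, never from user input.
        cursor.execute('SELECT COUNT(*) FROM "{}"'.format(table))
        counts[metric] = cursor.fetchone()[0]
    return counts


def demo_connection():
    """Build an in-memory stand-in database with a few rows."""
    connection = sqlite3.connect(":memory:")
    connection.execute('CREATE TABLE "Project" (Id TEXT)')
    connection.execute('CREATE TABLE "User" (Id TEXT)')
    connection.executemany('INSERT INTO "Project" VALUES (?)', [("p1",), ("p2",)])
    connection.execute('INSERT INTO "User" VALUES (?)', ("u1",))
    return connection
```

With the stand-in data, `count_rows(demo_connection())` gives one count per entry in `TABLES`.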

Paul

Paul,

I would definitely like to see that functionality; however, I feel these are quite different use cases from what we previously discussed. The use case previously described was reporting health metrics to see if the actual server is healthy and to find problems before they arise. If we need to pull this information from different sources, we will certainly find a way to do this in 3.0, even if we have to connect to the database directly to pull some information.

To provide some feedback on what you described above: that is awesome news!! We currently do have a reporting solution for getting deployment information; however, this involves polling the API every night to pull information out and push it into a SQL database to process with Tableau. It actually works pretty well right now and allows us to see whether changes we roll out help increase the speed of deployments, cause failures, etc. The problem with it, however, is that it runs once a night and is very slow, so hopefully this will help with that!!
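For context, the load step of a nightly job like that can be sketched in Python roughly as follows. The record fields (`id`, `state`, `duration_seconds`) and the table layout are made up for illustration and are not our actual schema or the Octopus API shape. SQLite stands in for the real SQL database.

```python
import sqlite3


def store_deployments(connection, deployments):
    """Upsert polled deployment records so reruns do not create duplicates."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS deployment ("
        "id TEXT PRIMARY KEY, state TEXT, duration_seconds REAL)"
    )
    rows = [(d["id"], d["state"], d["duration_seconds"]) for d in deployments]
    # INSERT OR REPLACE overwrites a record that was already stored under the same id.
    connection.executemany("INSERT OR REPLACE INTO deployment VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)


def failure_rate(connection):
    """Fraction of stored deployments whose state is 'Failed'."""
    total, failed = connection.execute(
        "SELECT COUNT(*), SUM(state = 'Failed') FROM deployment"
    ).fetchone()
    return (failed or 0) / total if total else 0.0
```

Keying on the deployment id means the job can be rerun, or run more often than nightly, without double counting.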

Thanks for the update.