I work on a project that has one production environment and several test environments. Each environment contains the main application as well as a number of services that work in conjunction with it. If any service fails in any of those environments, we need to take action to bring it back up. In production, a failure triggers a support group to place a pager call, but the test environments don't have that level of support. To close that gap, we are developing a global monitoring system. It checks each system and each service to make sure it is still running, and the results are shown in a desktop tool that we run. If anything fails, its green indicator turns red and we can see what caused the problem. This lets us respond to problems as they happen rather than waiting for our clients to realize that something is wrong and ask us to fix it. A tool like this could be useful in many different environments.
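As a rough illustration of the idea, here is a minimal Python sketch of the check-and-report loop described above. It is not the tool itself: the `Service` record, the `is_running`, `check_all`, and `failures` helpers, and the choice of a TCP connection as the "is it still running" test are all assumptions made for this example.

```python
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One monitored service (hypothetical record for this sketch)."""
    environment: str
    name: str
    host: str
    port: int


def is_running(service, timeout=2.0):
    """Assumed check: the service counts as running if its TCP port accepts a connection."""
    try:
        with socket.create_connection((service.host, service.port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(services, probe=is_running):
    """Map every (environment, name) pair to "green" or "red"."""
    return {
        (s.environment, s.name): "green" if probe(s) else "red"
        for s in services
    }


def failures(statuses):
    """Return the (environment, name) pairs currently shown as red, sorted."""
    return sorted(key for key, colour in statuses.items() if colour == "red")
```

A desktop front end would call `check_all` on a timer and repaint each indicator from the returned mapping. The `probe` parameter is there so that a different check (an HTTP request, a process lookup) can be swapped in per service type without touching the reporting code.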