More and more customers are moving to cloud architectures to fulfill the alternating resource requirements in the IT. Traditional monitoring approaches with checking the availability of a single system or resource instance does only make limited sense in this new era. Resources are provisioned and removed on dynamic request and have no long term life date.
It no longer matters whether a named system exists or not, it is about the service, and its implementing pieces. The number of systems will vary in accordance to the workload covered by the service. In some cases the service itself may disappear, when it is not permanently required. The key metric is the response time the service consumers achieve. But how can we assure this key performance metric at the highest level, without being hit by an unpredicted slowdown or outage.
We need a common monitoring tool watching the key performance metric keys on a resource level and frequently check the availability of these resources, like:
Application containers will also be handled like resources, e.g.:
Also resources from database systems, messaging engines and so on are monitored. With IBM Monitoring we have a useful and easy to handle tool, available on-premise and in the cloud.
With this data achieved by the monitoring tool, we can now feed a predictive insight tool. As described in a previous post, monitoring is the enabler for prediction. Prediction is a key success factor in cloud environments. It is essential to understand the behavior of an application in such an environment in a long term.
The promise of the cloud is, that an application has almost unlimited resources. If we are getting short on resources, we simply add additional ones. But how could be detect, that the application is behaving somehow suspicious? Every time we are adding additional resources these are eaten up by the workload. Does this correlate to the number of transaction, to the number of users or other metrics? Or is it a misbehaving application?
We need a correlation between different metrics. But are we able to oversee all possible dependencies? Are we aware of all these correlations?
IBM Operations Analytics Predictive Insights will help you in this area. Based on statistical models, it discovers mathematical relationships between metrics. A human intervention is not needed to achieve this result. The only thing to happen is, that the metrics are provided as streams in a frequent interval.
After the learning process is finished, the tool will send events on unexpected behavior, covering uni-variate and multivariate threshold violations.
For example, you have three metrics:
Number of request
Number of OS images handling the workload
Raising number of OS Images wouldn’t be detected by a simple threshold on a single resource, covered by the traditional monitoring solution.
Either the response time shows no anomaly nor the number of users does. Also the correlation between these to data streams remains inconspicuous. However, adding the number of OS images shows an anomaly in the relation to the other values. This could lead to a situation, where all available (even the cloud resources are limited, because we can’t afford it) resources are eaten up. In this situation our resource monitor would send out an alarm at a much later point of time.
For example, first, the OS agent would report a high CPU usage. Second, the response time delivered to the end users would reach a predefined limit. The time between the first resource event and the point in time where the user’s service level agreement metric (response time) is violated is too short to react.
With IBM Operations Analytics Predictive Insights we earn time to react.
So what is your impression? Did you also identify correlations to watch out for after analyzing the reason for a major outage and the way to avoid this outage?
Follow me on Twitter @DetlefWolf, or drop me a discussion point below to continue the conversation.
In my next blog I will start a discussion which values make sense to be fed into a prediction tool.