
5 “Must Haves” of Monitoring Containerized Apps

We recently spoke on a panel called “Artificial Intelligence Powered Analytics in Production Operations” (you can watch the video replay here). Perspica’s CTO, JF Huard, was joined by Justin Fitzhugh, VP of Technical Operations for Instart Logic, and Manoj Choudhary, CTO of Loggly. One of the recurring themes of the panel was that manual thresholds don’t work in today’s big data world. Below is a summary of that thread of the conversation.

I was talking with a customer who runs technical operations in what he refers to as a “hyper-scale” data environment. When the conversation got to managing containers, he made a really interesting comment:

“Containers aren’t the problem – it’s that the infrastructure around them and the tools to support them don’t exist. Containers don’t fit into any of the molds or paradigms we have now.”

He hit the nail on the head. And it got me thinking: what needs to happen to the infrastructure and app monitoring tools so that people like him can monitor and manage containerized applications?

Must Have #1:

New Tools Need To Keep the Short Life Span and High Density of Containers in Mind

The average Docker container has a life span of 3 days, compared to roughly 12 days for a VM, and a typical host runs 5 containers simultaneously. Add to that the fact that containers live in a bit of a monitoring “no man’s land” between the application and hardware layers, where neither application performance monitoring nor traditional infrastructure monitoring is effective, and it’s clear a new approach to monitoring is needed.

Must Have #2:

Monitor the Applications, Not Just the Infrastructure

This is a key mind-shift for modern applications: look at what your customers see, not at the infrastructure. Meaningful KPIs focus on customer-facing performance like transactions per second and response time. Containers, by definition, are plentiful, portable and disposable. They exist only in service of the application.
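
To make that concrete, here is a minimal sketch of instrumenting a service with customer-facing KPIs rather than host metrics, using Python and the prometheus_client library. The metric names, the port, and the simulated work are illustrative assumptions, not anything specific from the panel.

```python
# Minimal sketch: expose customer-facing KPIs (transactions, response time)
# instead of host-level metrics. Assumes the prometheus_client package;
# metric names and the port are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

TRANSACTIONS = Counter(
    "app_transactions_total", "Completed customer transactions"
)
RESPONSE_TIME = Histogram(
    "app_response_time_seconds", "End-to-end response time per transaction"
)


def handle_transaction():
    """Stand-in for real request handling; the timing is what we track."""
    with RESPONSE_TIME.time():
        time.sleep(random.uniform(0.01, 0.2))  # simulated work
    TRANSACTIONS.inc()


if __name__ == "__main__":
    start_http_server(8000)  # scrape endpoint for the monitoring system
    while True:
        handle_transaction()
```

The monitoring system scrapes the endpoint and sees transactions per second and response time directly, regardless of which container happens to be serving the request at that moment.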

Must Have #3:

Get Visibility Into Your Entire Application Infrastructure Stack

Applications run on complex and dynamic infrastructures whose underlying resources are constantly changing to meet those applications’ performance requirements. By knowing which objects, logs, events, and metrics are associated with your application, you can understand what needs your attention and narrow a problem down to the relevant components.
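
As a rough illustration of what that context looks like, here is a hypothetical Python sketch of a map that ties containers, hosts, and metrics back to the application they serve, so an alert on any one object can be narrowed to the handful of components worth inspecting first. The object names are invented.

```python
# Hypothetical sketch: a tiny "context map" relating infrastructure objects
# (hosts, containers, metrics) to the application they support.
from dataclasses import dataclass, field


@dataclass
class AppContext:
    application: str
    containers: set = field(default_factory=set)
    hosts: set = field(default_factory=set)
    metrics: set = field(default_factory=set)


# Illustrative topology; in practice this is discovered automatically.
checkout = AppContext(
    application="checkout-service",
    containers={"checkout-7f9c", "checkout-8a21"},
    hosts={"node-03", "node-07"},
    metrics={"app_response_time_seconds", "app_transactions_total"},
)


def narrow_problem(ctx: AppContext, alerting_object: str) -> list:
    """Given an object that fired an alert, return the related components
    worth inspecting first, instead of the whole environment."""
    if alerting_object in ctx.containers or alerting_object in ctx.hosts:
        return sorted(ctx.containers | ctx.hosts)
    return []


print(narrow_problem(checkout, "node-03"))
```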

Must Have #4:

Use Analytics (Not Just Monitoring) To Understand What’s Happening In Your Infrastructure

Current monitoring tools are not enough for the complexities of containerized applications. They rely too heavily on the tribal knowledge of your staff to “connect the dots” between the various silos of your infrastructure, and they require trial and error to fix problems when they do occur.

Today’s analytics tools add a layer of intelligence on top of monitoring that brings all of your siloed data together. By continuously ingesting data and logs from every layer of the infrastructure, the system understands the relationships between the objects across a virtual environment’s topology, recognizes the difference between inconsequential anomalies and actual performance problems, and instantly shows you the root cause along with recommended actions for remediation.
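
As a simplified illustration of connecting those dots, the hypothetical sketch below intersects a window of degraded application response time with infrastructure events from the same period, surfacing candidate root causes instead of leaving an operator to eyeball separate dashboards. The events and timestamps are invented.

```python
# Hypothetical sketch: correlate an application-level slowdown with
# infrastructure events that occurred in the same time window.
from datetime import datetime, timedelta

# Invented example data: when response time degraded, and recent infra events.
slowdown_start = datetime(2017, 3, 30, 14, 5)
slowdown_end = slowdown_start + timedelta(minutes=10)

infra_events = [
    (datetime(2017, 3, 30, 13, 40), "node-03", "kernel: OOM killer invoked"),
    (datetime(2017, 3, 30, 14, 7), "checkout-7f9c", "container restarted"),
    (datetime(2017, 3, 30, 14, 8), "node-07", "disk I/O latency spike"),
]


def candidate_causes(events, start, end):
    """Keep only events that overlap the degradation window."""
    return [e for e in events if start <= e[0] <= end]


for ts, obj, msg in candidate_causes(infra_events, slowdown_start, slowdown_end):
    print(f"{ts:%H:%M} {obj}: {msg}")
```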

Must Have #5:

Use Machine Learning to Automate Your IT Operations

Containers, clouds, and microservices have enabled us to create applications of unprecedented complexity. Throwing more people at the problem doesn’t work. You need machine learning analytics to ingest and analyze streaming data in real time and point your people to the right place to fix the problem.

Today’s IT operations tools trigger numerous alarms, but they have no inherent capability to distinguish critical, service-impacting events from false positives that do not require an operator’s immediate attention.

Machine learning changes all that.

By using AI to correlate application performance and availability metrics with events from the application infrastructure, the system can automatically baseline normal behavior. As these patterns change, the system adjusts the thresholds automatically. Continuous analysis of this machine-generated data allows the system to identify behavior that is outside the norm and alert your team accordingly.
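
As a toy illustration of automatic baselining (not Perspica’s actual algorithm), the sketch below tracks an exponentially weighted mean and variance of a metric and flags values that fall several standard deviations outside that moving baseline, so the effective threshold adapts as the pattern changes. The smoothing factor, band width, and warm-up period are assumptions.

```python
# Toy sketch of adaptive baselining: an exponentially weighted moving
# average and variance track "normal", and points far outside that
# moving baseline are flagged. Alpha, the 3-sigma band, and the warm-up
# period are assumptions chosen for illustration.
import math
import random


class AdaptiveBaseline:
    def __init__(self, alpha=0.05, sigmas=3.0, warmup=30):
        self.alpha = alpha      # how quickly the baseline adapts
        self.sigmas = sigmas    # width of the "normal" band
        self.warmup = warmup    # samples to observe before alerting
        self.n = 0
        self.mean = None
        self.var = 0.0

    def update(self, value):
        """Return True if the value falls outside the learned normal band,
        then fold the value into the moving baseline."""
        self.n += 1
        if self.mean is None:
            self.mean = value
            return False
        deviation = value - self.mean
        band = self.sigmas * math.sqrt(self.var)
        anomalous = self.n > self.warmup and abs(deviation) > band
        # Exponentially weighted mean/variance: the baseline drifts with the data.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous


baseline = AdaptiveBaseline()
for t in range(500):
    value = random.gauss(100, 5) + (40 if t == 300 else 0)  # inject a spike
    if baseline.update(value):
        print(f"t={t}: value {value:.1f} is outside the learned baseline")
```

A threshold set by hand would either miss the injected spike or fire constantly as the workload drifts; here the band moves with the data, which is the point of baselining behavior instead of hard-coding limits.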