A significant architectural shift toward containers is underway and, as with any architectural shift, it brings new operational challenges. Many legacy monitoring tools struggle to monitor and introspect container platforms in fast-moving, ephemeral environments. The good news is that newer cloud-based offerings can make monitoring as scalable as the services being built and monitored. These solutions have evolved to address the growing need to monitor your stack from the bottom to the top.
Two new ways to monitor your container environment come from the Dev side and Ops side.
- Application performance monitoring (APM) instruments your custom code to identify and pinpoint bottlenecks or errors.
- Infrastructure monitoring collects metrics about the host or container, such as CPU load, available memory and network I/O.
In this post, I’ll describe three monitoring tools designed for the Ops side of the house that also have some components rooted in APM.
The first one is the three-pronged open source approach of Grafana-Alertmanager-Prometheus (GAP). Because it’s open source, it offers great power and flexibility. This is a good starting point for monitoring your infrastructure and creating alerts on the items you configure.
Prometheus is a time series database and a Cloud Native Computing Foundation (CNCF) project. Rather than relying on an agent that pushes data, Prometheus scrapes metrics from endpoints exposed on your hosts and stores them in its own time series database. It gives you fine-grained metrics at huge scale. Prometheus’s biggest strength is as a data source: the labels attached to the metrics it gathers let you organize your data and aggregate it across different components. With the right configuration, Prometheus can handle millions of time series. One of its shortcomings is that it is difficult to use for capacity management or long-term reporting because of the large amount of data (a month's worth) it collects.
You can front the data being scraped by Prometheus with Grafana. Once Grafana is up and running, add your Prometheus URL as a data source, then import one of the predefined Prometheus dashboards. Once those two components are linked, the only thing left to do is set up some alerts for Alertmanager to forward to a service such as Slack or PagerDuty. A complete list of alerting service integrations can be found here.
The GAP monitoring solution gives you a highly configurable, open source answer to your Ops monitoring challenge. You can follow this Git repository to get started installing the three-pronged GAP approach.
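As a rough illustration of the pull model and label-based aggregation described above, the sketch below renders a few samples in the plain-text exposition format that Prometheus scrapes, then sums them by one label. The metric name, label names, and values are hypothetical, and the `sum_by` helper only mimics what a PromQL query such as `sum by (service) (http_requests_total)` would compute; it is not Prometheus code.

```python
from collections import defaultdict

# Hypothetical samples: (metric name, labels, value). The names and labels
# are illustrative only and do not come from any real exporter.
SAMPLES = [
    ("http_requests_total", {"service": "cart", "pod": "cart-1"}, 120.0),
    ("http_requests_total", {"service": "cart", "pod": "cart-2"}, 80.0),
    ("http_requests_total", {"service": "auth", "pod": "auth-1"}, 45.0),
]


def render_exposition(samples):
    """Render samples as text exposition lines that a scraper could read."""
    lines = []
    for name, labels, value in samples:
        # Sort labels so the output is stable from one scrape to the next.
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


def sum_by(samples, label):
    """Add up the values of all series that share the same label value."""
    totals = defaultdict(float)
    for _name, labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


if __name__ == "__main__":
    print(render_exposition(SAMPLES), end="")
    print(sum_by(SAMPLES, "service"))
```

Each pod is its own time series, and the `service` label is what lets the two cart pods be rolled up into a single figure without changing how they are collected.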
Our next Ops monitoring tool comes from Sysdig. Sysdig offers an open source component that lets you monitor your systems much as htop does, as shown in these examples. I will highlight some of the features here:
|Task|Command|
|---|---|
|See the top processes in terms of network bandwidth usage|`sysdig -c topprocs_net`|
|View the CPU usage of the processes running inside the wordpress1 container|`sudo sysdig -pc -c topprocs_cpu container.name=wordpress1`|
|See all the GET HTTP requests made by the machine|`sudo sysdig -s 2000 -A -c echo_fds fd.port=80 and evt.buffer contains GET`|
|See the top processes in terms of disk bandwidth usage|`sysdig -c topprocs_file`|
|See the top processes in terms of CPU usage|`sysdig -c topprocs_cpu`|
|See the files where apache spent the most time|`sysdig -c topfiles_time proc.name=httpd`|
As you can see, the open source tooling (sysdig.org) gives you some great troubleshooting capabilities, but it does not feed into a full-service Ops monitoring solution or store your data for trending and playback. This open source tool is supported by the community only. But don’t worry; Sysdig also offers enterprise customers a full-service monitoring solution called Sysdig Monitor, available either on-premise or as SaaS.
Some of Sysdig’s strengths include a containerized implementation deployed as a DaemonSet, which makes installation very easy. It also offers an agent-based solution to cover your non-container platforms. Another strength is the ability to apply RBAC in the tooling so that users see just what they need and nothing else. Whether that is project-based access or service-based access, you get what you want and only what you want. One drawback of this offering is that Sysdig requires the installation of kernel headers, though the company has tried to ease the pain of being tied to a specific kernel version by automatically rebuilding its module when a new kernel is present.
Sysdig has a service level agreement style implementation that lets you view and alert on network response times as well as memory or CPU. It can also capture data during an alert event for a deeper dive into your issue. Sysdig has already created many "canned" dashboards that you can use to reduce your configuration work. Some of the dashboards are: Apache, Cassandra, Go, Elasticsearch, etcd, and ZooKeeper. In summary, as stated by the company, “Sysdig Cloud is our commercial offering, designed to aggregate information from thousands of hosts, tens of thousands of containers, across multiple clouds. It offers a web-based interface for dashboarding, exploration, and alerting.”
Here is the official Sysdig link to the installation instructions for the SaaS DaemonSet install.
Our final infrastructure monitoring tool is DataDog. DataDog is a full-service, SaaS-only Ops monitoring solution that incorporates APM into its product line. Its benefits include a flexible install process, either via an agent or within a container. It also provides alert monitoring and dashboards via the UI. Another feature is APM for Ruby, Python, and Go (with Java integration coming soon).
DataDog has aligned its product with statsd, a standard in the industry, by integrating the once-standalone daemon into the DataDog agent. According to DataDog, statsd is “where abstract performance or resource utilization metrics can be directly linked to application or product metrics that are directly relevant to the business.” DataDog has enhanced the statsd implementation by supporting tagging, which lets you add dimensions to the metrics your applications emit.
DataDog can also consume data from sources like Nagios and New Relic (an APM provider/competitor), or connect with cloud provider environments. Another handy feature is the ability to process other APIs that expose their data. Within its dashboards you can overlay graphs to further troubleshoot your environments. DataDog uses a push model for its metrics: each host becomes a statsd aggregator, which allows for full-stack monitoring.
DataDog also has anomaly detection, which applies an algorithm to determine what is not normal for your environment. A similar feature is outlier detection, which identifies hosts that are misbehaving compared to others in the cluster and alerts appropriately. DataDog also offers service discovery to continuously monitor your Dockerized containers across hosts and environments.
Here is the official DataDog link to the installation instructions for the SaaS DaemonSet install.
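To make the tagged push model a little more concrete, here is a minimal sketch of a client that formats a counter in the statsd wire format, with tags appended in the style the DataDog agent accepts, and pushes it over UDP. The metric name, tags, and helper functions are hypothetical, and the address assumes an agent listening on the conventional statsd port 8125 on the local host.

```python
import socket


def format_datagram(name, value, metric_type, tags=None):
    """Build a statsd datagram, appending tags after a '|#' separator."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def push_metric(datagram, host="127.0.0.1", port=8125):
    """Push one datagram over UDP; fire-and-forget, no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Hypothetical counter ("c") with two tags, for illustration only.
    line = format_datagram(
        "checkout.requests", 1, "c", ["env:staging", "service:cart"]
    )
    print(line)
    push_metric(line)
```

Because the application pushes over UDP and does not wait for an answer, a slow or absent agent does not block the code being measured; the trade-off is that a dropped datagram is simply lost.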
As the container world continues to grow, so will the tools required to monitor it. These three are great options to get you started monitoring your container infrastructure. There are some key differences that your company has to weigh. Open source? Paid SaaS? On-premise solution? These are just some of the decisions you need to make to choose your tooling. I have provided a decision table below to help you clearly understand some of the options for these tools.
|Criterion|GAP|Sysdig|DataDog|
|---|---|---|---|
|LOE To Install Tool|Medium|Low/Medium (on-premise)|Low|
|Kernel Header Install Required|No|Yes|No|
|LOE Alert setup|Medium|Easy|Easy|