Monitoring and Alerting

Vinay Rawal answered on March 8, 2023 Popularity 5/10 Helpfulness 1/10

To collect metrics, we use Prometheus, an open-source tool originally developed at SoundCloud. It uses a pull model to scrape metrics from many endpoints (normally specific pods in the cluster) and stores them as time series data that you can query, generally through a frontend like Grafana. Our cluster-level monitoring system provides basic metrics for all pods, and developers are encouraged to expose custom business-logic metrics, which are scraped automatically as well. To provide the basic metrics, we deploy the node-exporter DaemonSet, which runs on every node and exposes metrics for the pods on that node: CPU and memory usage, disk and network I/O, etc. Alerts (using the Prometheus Alertmanager) are triggered once specific conditions have been met, e.g. if the number of replicas backing a given service drops below a certain threshold or if the latency for a given SQL query is too high.
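To make the "custom business-logic metrics" part concrete, here is a minimal sketch of an endpoint that Prometheus could scrape. It uses only the Python standard library and writes the Prometheus text exposition format by hand. The metric name `orders_processed_total`, the `outcome` label, and port 8000 are made up for illustration; a real service would more likely use an official client library, and nothing here is taken from the setup described above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical business-logic counter; the name and label values are illustrative.
ORDERS_PROCESSED = {"success": 0, "failure": 0}


def render_metrics(counts):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP orders_processed_total Orders processed, by outcome.",
        "# TYPE orders_processed_total counter",
    ]
    for outcome, value in sorted(counts.items()):
        lines.append(f'orders_processed_total{{outcome="{outcome}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values on GET /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(ORDERS_PROCESSED).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given port (assumed value)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Because Prometheus pulls rather than receives pushes, the service only has to keep its counters current and answer `GET /metrics` when scraped.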

We quickly reached the limits of a single-instance Prometheus setup and have since federated metrics across multiple instances. This has allowed us to scale to many thousands of metrics, capturing everything from CPU usage to individual SQL query latencies or inter-service RPCs.
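In a federated setup, a higher-level Prometheus scrapes selected series from other instances through their `/federate` endpoint, passing one or more `match[]` selectors. The sketch below only builds such a URL, to show what that request looks like. The host name `prom-shard-0` and the selectors are hypothetical, and in practice this is usually expressed in the Prometheus scrape configuration rather than in application code.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build the /federate URL that a federating Prometheus would scrape.

    Each selector becomes one repeated `match[]` query parameter.
    """
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical shard address and series selectors, for illustration only.
url = federate_url(
    "http://prom-shard-0:9090/",
    ['{job="kubernetes-pods"}', '{__name__=~"job:.*"}'],
)
print(url)
```

Narrow selectors keep the federated instance from pulling in every series from every shard, which would recreate the single-instance scaling problem.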

Source: Grepper

Tags: monitoring whatever
