Hands-On Infrastructure Monitoring with Prometheus

By: Joel Bastos, Pedro Araújo

Overview of this book

Prometheus is an open source monitoring system. It provides a modern time series database, a robust query language, several metric visualization possibilities, and a reliable alerting solution for traditional and cloud-native infrastructure. This book covers the fundamental concepts of monitoring and explores the Prometheus architecture, its data model, and how metric aggregation works. Multiple test environments are included to help explore different configuration scenarios, such as the use of various exporters and integrations. You’ll delve into PromQL, supported by several examples, then apply that knowledge to alerting and recording rules and learn how to test them. After that, alert routing with Alertmanager and creating visualizations with Grafana are thoroughly covered. In addition, this book covers several service discovery mechanisms and even provides an example of how to create your own. Finally, you’ll learn about Prometheus federation, cross-sharding aggregation, and long-term storage with the help of Thanos. By the end of this book, you’ll be able to implement and scale Prometheus as a full monitoring system on-premises, in cloud environments, in standalone instances, or using container orchestration with Kubernetes.

Preface

Introduction to the book and the technology

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Section 1: Introduction

Monitoring Fundamentals

Definition of monitoring

Whitebox versus blackbox monitoring

Understanding metrics collection

Summary

Questions

Further reading

An Overview of the Prometheus Ecosystem

Metrics collection with Prometheus

Exposing internal state with exporters

Alert routing and management with Alertmanager

Visualizing your data

Summary

Questions

Further reading

Setting Up a Test Environment

Code organization

Machine requirements

Spinning up a new environment

Summary

Questions

Further reading

Section 2: Getting Started with Prometheus

Prometheus Metrics Fundamentals

Understanding the Prometheus data model

A tour of the four core metric types

Longitudinal and cross-sectional aggregations

Summary

Questions

Further reading

Running a Prometheus Server

Deep dive into the Prometheus configuration

Managing Prometheus in a standalone server

Managing Prometheus in Kubernetes

Summary

Questions

Further reading

Exporters and Integrations

Test environments for this chapter

Operating system exporter

Container exporter

From logs to metrics

Blackbox monitoring

Pushing metrics

More exporters

Summary

Questions

Further reading

Prometheus Query Language - PromQL

The test environment for this chapter

Getting to know the basics of PromQL

Common patterns and pitfalls

Moving on to more complex queries

Summary

Questions

Further reading

Troubleshooting and Validation

The test environment for this chapter

Exploring promtool

Logs and endpoint validation

Analyzing the time series database

Summary

Questions

Further reading

Section 3: Dashboards and Alerts

Defining Alerting and Recording Rules

Creating the test environment

Understanding how rule evaluation works

Setting up alerting in Prometheus

Testing your rules

Summary

Questions

Further reading

Discovering and Creating Grafana Dashboards

Test environment for this chapter

How to use Grafana with Prometheus

Building your own dashboards

Discovering ready-made dashboards

Default Prometheus visualizations

Summary

Questions

Further reading

Understanding and Extending Alertmanager

Setting up the test environment

Alertmanager fundamentals

Alertmanager configuration

Common Alertmanager notification integrations

Customizing your alert notifications

Who watches the Watchmen?

Summary

Questions

Further reading

Section 4: Scalability, Resilience, and Maintainability

Choosing the Right Service Discovery

Test environment for this chapter

Running through the service discovery options

Using a built-in service discovery

Building a custom service discovery

Summary

Questions

Further reading

Scaling and Federating Prometheus

Test environment for this chapter

Scaling with the help of sharding

Having a global view using federation

Using Thanos to mitigate Prometheus shortcomings at scale

Summary

Questions

Further reading

Integrating Long-Term Storage with Prometheus

Test environment for this chapter

Remote write and remote read

Options for metrics storage

Thanos remote storage and ecosystem

Summary

Questions

Further reading

Assessments

Chapter 1, Monitoring Fundamentals

Chapter 2, An Overview of the Prometheus Ecosystem

Chapter 3, Setting Up a Test Environment

Chapter 4, Prometheus Metrics Fundamentals

Chapter 5, Running a Prometheus Server

Chapter 6, Exporters and Integrations

Chapter 7, Prometheus Query Language - PromQL

Chapter 8, Troubleshooting and Validation

Chapter 9, Defining Alerting and Recording Rules

Chapter 10, Discovering and Creating Grafana Dashboards

Chapter 11, Understanding and Extending Alertmanager

Chapter 12, Choosing the Right Service Discovery

Chapter 13, Scaling and Federating Prometheus

Chapter 14, Integrating Long-Term Storage with Prometheus

Other Books You May Enjoy

Leave a review - let other readers know what you think

Chapter 9, Defining Alerting and Recording Rules

Recording rules can help take the load off heavy dashboards by pre-computing expensive queries, aggregate raw data into time series that can then be exported to external systems, and assist in the creation of compound range vector queries.
Using different evaluation intervals across rule groups is discouraged for the same reasons as in scrape jobs: queries might produce erroneous results when they combine series with different sampling rates, and keeping track of which series has which periodicity quickly becomes unmanageable.
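
One way to keep a single sampling rate for rule output is to set the evaluation interval once, globally, and leave the per-group interval unset. The following is a minimal sketch, not a recommended production configuration: the interval values, the rules/*.yml path, the group name, and the job:up:count rule are all assumptions made for illustration.

```yaml
# prometheus.yml (fragment)
global:
  scrape_interval: 30s        # assumed value, for illustration only
  evaluation_interval: 30s    # default evaluation interval for every rule group
rule_files:
  - "rules/*.yml"             # hypothetical path
---
# rules/example.yml (hypothetical rule file)
groups:
  - name: example_recording_rules    # hypothetical group name
    # No per-group `interval` key, so this group is evaluated at
    # global.evaluation_interval like every other group.
    rules:
      - record: job:up:count         # illustrative rule, not from the book
        expr: count by (job) (up)
```

With this layout, every recorded series is produced at the same cadence, so there is no need to track which group runs at which interval.
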
The name instance_job:latency_seconds_bucket:rate30s indicates that the series needs to have at least the instance and job labels. It was calculated by applying the rate function to the latency_seconds_bucket_total metric, using a 30-second range vector. Thus, the originating expression could probably be as follows:

rate(latency_seconds_bucket_total[30s])
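
For context, an expression like this would typically sit in a rule file next to the name it is recorded under. The sketch below shows how the two might fit together; the group name is hypothetical, and the rule itself is only the probable reconstruction described above, not a confirmed definition.

```yaml
groups:
  - name: latency_rules    # hypothetical group name
    rules:
      # Naming convention: level:metric:operations
      #   level      = instance_job            (labels kept on the output)
      #   metric     = latency_seconds_bucket  (source metric without _total)
      #   operations = rate30s                 (rate over a 30-second range)
      - record: instance_job:latency_seconds_bucket:rate30s
        expr: rate(latency_seconds_bucket_total[30s])
```

Because rate keeps the labels of its input series, the recorded series carries instance and job along with any other labels present on the source metric.
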

As that label changes its value, so will the identity of the alert.
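
A common way to keep the alert identity stable is to keep changing values out of labels and put them in annotations instead, since annotations are not part of what identifies an alert. The following is a minimal sketch under stated assumptions: the group name, alert name, recorded series name, threshold, and severity value are all hypothetical.

```yaml
groups:
  - name: example_alerts    # hypothetical group name
    rules:
      - alert: HighRequestLatency                       # hypothetical alert name
        expr: instance_job:latency_seconds:mean5m > 0.5 # hypothetical series and threshold
        for: 5m
        labels:
          severity: warning    # static label, so the alert identity does not change
        annotations:
          # The sample value belongs here: annotations can change between
          # evaluations without creating a new alert.
          description: "Latency on {{ $labels.instance }} is {{ $value }}s"
```

Had {{ $value }} been placed under labels, each evaluation that returned a different value would produce a differently labeled alert, which is the identity change described above.
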
An...
