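Join members of the Prometheus maintainer team to learn about its design, goals, and history. Get to know the basic concepts of Prometheus and why it has become popular in the industry. First, learn what time series are and what characterizes them; next, look at the non-hierarchical data model and how it is represented; finally, tie the two together with the PromQL query language. By the end you should understand Prometheus well enough to start building with it.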

1. Intro to Prometheus
With a dash of operations & observability
Richard Hartmann & Ben Kochie, @TwitchiH
2018-11-14
Sections: Introduction, Background, Operations & observability, Outro

2. Who are we?
Richard "RichiH" Hartmann
- Swiss army chainsaw at SpaceNet
- Project lead for building one of the most modern datacenters in Europe
- Debian Developer
- FOSDEM, DebConf, DENOGx, PromCon staff
Ben "SuperQ" Kochie
- Staff engineer at GitLab
- Bit plumber
- Retired SRE
- FOSDEM infrastructure, PromCon staff

3. Show of hands
- Who has heard of Prometheus?
- Who is considering using Prometheus?
- Who is POCing Prometheus?
- Who uses Prometheus in production?

4. Prometheus 101
- Inspired by Google's Borgmon
- Time series database
  - int64 millisecond timestamp, float64 value
- Instrumentation & exporters
- Not for event logging
- Dashboarding via Grafana
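
To make the sample shape concrete, here is a minimal Python sketch of one sample as an int64 millisecond timestamp plus a float64 value. The little-endian byte layout and the helper names `pack_sample` and `unpack_sample` are assumptions for illustration only; they are not Prometheus's actual storage format.

```python
import struct
from typing import Tuple

# Hypothetical layout for illustration: little-endian int64 timestamp
# in milliseconds, followed by a float64 value.
SAMPLE = struct.Struct("<qd")


def pack_sample(timestamp_ms: int, value: float) -> bytes:
    """Encode one (timestamp, value) sample into raw bytes."""
    return SAMPLE.pack(timestamp_ms, value)


def unpack_sample(raw: bytes) -> Tuple[int, float]:
    """Decode raw bytes back into a (timestamp, value) sample."""
    timestamp_ms, value = SAMPLE.unpack(raw)
    return timestamp_ms, value


raw = pack_sample(1542153600000, 1027.0)
print(len(raw))            # 16
print(unpack_sample(raw))  # (1542153600000, 1027.0)
```

The 16 bytes printed here are simply 8 bytes for the timestamp plus 8 bytes for the value.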

5. Main selling points
- Highly dynamic, built-in service discovery
- No hierarchical model, n-dimensional label set
- PromQL: for processing, graphing, alerting, and export
- Simple operation
- Highly efficient
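
One way to picture "no hierarchical model, n-dimensional label set" is the Python sketch below: each series is a flat set of labels, and a selection may filter on any subset of them in any order. The `select` helper and the in-memory list are assumptions made for this illustration, not how Prometheus stores or indexes series.

```python
from typing import Dict, List

Labels = Dict[str, str]

# Each series is identified by a flat label set; there is no fixed
# path such as "prod.post.200" that dictates the lookup order.
SERIES: List[Labels] = [
    {"__name__": "http_requests_total", "env": "prod", "method": "post", "code": "200"},
    {"__name__": "http_requests_total", "env": "prod", "method": "get", "code": "200"},
    {"__name__": "http_requests_total", "env": "test", "method": "post", "code": "200"},
    {"__name__": "http_requests_total", "env": "test", "method": "post", "code": "400"},
]


def select(series: List[Labels], **matchers: str) -> List[Labels]:
    """Return every series whose labels equal all given matchers."""
    return [
        labels
        for labels in series
        if all(labels.get(name) == value for name, value in matchers.items())
    ]


# Slice along any dimension without caring about the others.
print(len(select(SERIES, env="prod")))                 # 2
print(len(select(SERIES, method="post", code="200")))  # 2
```

With a hierarchical naming scheme, the second query would need wildcards at the right path positions; with labels, every dimension is equally easy to filter on.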

6. Working assumptions & concepts
- Prometheus is a pull-based system
- Black-box monitoring: looking at a service from the outside (Does the server answer HTTP requests?)
- White-box monitoring: instrumenting code from the inside (How much time does this subroutine take?)
- Every service should have its own metrics endpoint
- Hard API commitments within major versions
- No built-in TLS yet, use reverse proxies for now
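
As a rough sketch of the pull model, the Python below uses only the standard library to give a service its own `/metrics` endpoint and then fetches it once, the way a scraper would. The metric name `demo_scrapes_total` and the helpers `render_metrics` and `scrape_once` are hypothetical; a real service would more likely use an official client library.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

SCRAPES_SERVED = 0  # hypothetical counter exposed by the demo service


def render_metrics() -> str:
    """Render the current state in the text exposition format."""
    return (
        "# TYPE demo_scrapes_total counter\n"
        f"demo_scrapes_total {SCRAPES_SERVED}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global SCRAPES_SERVED
        if self.path != "/metrics":
            self.send_error(404)
            return
        SCRAPES_SERVED += 1
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape_once() -> str:
    """Start the service on a free local port, pull /metrics once, stop."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/metrics")
        text = conn.getresponse().read().decode("utf-8")
        conn.close()
        return text
    finally:
        server.shutdown()
        server.server_close()
```

Calling `scrape_once()` returns the text the service exposed at that moment. The service itself never pushes anything; it only answers when asked.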

7. Time series
- Time series are recorded values which change over time
- Individual events are usually merged into counters and/or histograms
- Changing values are recorded as gauges
- Typical examples
  - Access rates to a webserver (counter)
  - Temperatures in a datacenter (gauge)
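
The difference between the two metric types can be sketched in a few lines of Python. The `Counter` and `Gauge` classes below are simplified stand-ins written for this illustration and are not the API of any Prometheus client library.

```python
class Counter:
    """Merges individual events into a value that only ever goes up."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Records a current value that may go up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


# Webserver accesses: each request is an event merged into a counter.
requests_total = Counter()
for _ in range(3):
    requests_total.inc()

# Datacenter temperature: only the latest reading matters.
temperature_celsius = Gauge()
temperature_celsius.set(21.5)
temperature_celsius.set(19.0)

print(requests_total.value)       # 3.0
print(temperature_celsius.value)  # 19.0
```

The access rate is then derived at query time from how fast the counter grows, rather than being stored directly.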

8. Efficiency
- 1,000,000+ samples/second no problem on current hardware
- 200,000 samples/second/core
- 16 bytes/sample compressed to 1.36 bits/sample
- Cheap ingestion & storage means more data for you
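
A quick back-of-the-envelope calculation in Python shows what these figures imply. It uses only the numbers from the slide; the assumptions are a steady ingestion rate of exactly 1,000,000 samples/second and no index or metadata overhead.

```python
SAMPLES_PER_SECOND = 1_000_000
SAMPLES_PER_SECOND_PER_CORE = 200_000
RAW_BYTES_PER_SAMPLE = 16
COMPRESSED_BITS_PER_SAMPLE = 1.36
SECONDS_PER_DAY = 86_400

# Cores needed for ingestion at the quoted per-core rate.
cores_needed = SAMPLES_PER_SECOND / SAMPLES_PER_SECOND_PER_CORE

# 16 bytes are 128 bits, so the compression ratio is 128 / 1.36.
compression_ratio = RAW_BYTES_PER_SAMPLE * 8 / COMPRESSED_BITS_PER_SAMPLE

# Daily storage, uncompressed versus compressed.
raw_bytes_per_day = SAMPLES_PER_SECOND * RAW_BYTES_PER_SAMPLE * SECONDS_PER_DAY
compressed_bytes_per_day = (
    SAMPLES_PER_SECOND * COMPRESSED_BITS_PER_SAMPLE / 8 * SECONDS_PER_DAY
)

print(f"cores needed:       {cores_needed:.0f}")
print(f"compression ratio:  {compression_ratio:.1f}x")
print(f"raw per day:        {raw_bytes_per_day / 1e12:.2f} TB")
print(f"compressed per day: {compressed_bytes_per_day / 1e9:.1f} GB")
```

Under these assumptions, about 5 cores handle ingestion, and a day of data shrinks from roughly 1.38 TB to roughly 14.7 GB, a ratio of about 94 to 1.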

9. Exposition format
    http_requests_total{env="prod",method="post",code="200"} 1027
    http_requests_total{env="prod",method="post",code="400"} 3
    http_requests_total{env="prod",method="post",code="500"} 12
    http_requests_total{env="prod",method="get",code="200"} 20
    http_requests_total{env="test",method="post",code="200"} 372
    http_requests_total{env="test",method="post",code="400"} 75
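
To show how little structure a line in this format carries, here is a simplified Python parser for lines like the ones above. It is a sketch under stated assumptions: it ignores comment lines, optional timestamps, and escaped characters in label values, all of which a real parser has to handle. The name `parse_line` is hypothetical.

```python
import re
from typing import Dict, Tuple

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_line(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split one sample line into metric name, label set, and value."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_line(
    'http_requests_total{env="prod",method="post",code="200"} 1027'
)
print(name)    # http_requests_total
print(labels)  # {'env': 'prod', 'method': 'post', 'code': '200'}
print(value)   # 1027.0
```

Each line is one time series at scrape time: the metric name plus its label set identifies the series, and the number is the current value.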

10. PromQL vs SQL
PromQL:
    avg by(city) (temperature_celsius{country="germany"})
SQL:
    SELECT city, AVG(value) FROM temperature_celsius WHERE \
      country="germany" GROUP BY city
PromQL:
    rate(errors{job="foo"}[5m]) / rate(total{job="foo"}[5m])
SQL:
    SELECT errors.job, errors.instance, [...more labels...], \
      rate(errors.value, 5m) / rate(total.value, 5m) \
      FROM errors JOIN total ON [...all label equalities...] \
      WHERE errors.job="foo" AND total.job="foo"
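
The two PromQL expressions rely on grouping by a label and on matching series with identical label sets. The Python sketch below mimics both on already-computed values; the sample data, the helper names `avg_by` and `divide`, and the omission of `rate()` itself are simplifications for illustration, not PromQL's actual evaluation logic.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[Tuple[Dict[str, str], float]]


def avg_by(vector: Vector, label: str) -> Dict[str, float]:
    """Average values grouped by one label, like `avg by(label) (...)`."""
    groups = defaultdict(list)
    for labels, value in vector:
        groups[labels.get(label, "")].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


def divide(lhs: Vector, rhs: Vector) -> Vector:
    """Divide series pairwise where both sides have the same label set.

    Assumes non-zero denominators to keep the sketch short.
    """
    rhs_index = {frozenset(labels.items()): value for labels, value in rhs}
    result = []
    for labels, value in lhs:
        key = frozenset(labels.items())
        if key in rhs_index:
            result.append((labels, value / rhs_index[key]))
    return result


# Hypothetical readings, already filtered to country="germany".
temperatures = [
    ({"city": "berlin", "sensor": "a"}, 10.0),
    ({"city": "berlin", "sensor": "b"}, 12.0),
    ({"city": "munich", "sensor": "a"}, 8.0),
]
print(avg_by(temperatures, "city"))  # {'berlin': 11.0, 'munich': 8.0}

# Hypothetical per-second rates for job="foo".
error_rates = [({"job": "foo", "instance": "a"}, 2.0)]
total_rates = [({"job": "foo", "instance": "a"}, 40.0)]
print(divide(error_rates, total_rates))
```

The label matching in `divide` is what the SQL version has to spell out as a `JOIN` over every label equality.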

11. Grafana
- Supports dozens of data sources
- Modern UI
- Allows for complex data manipulation and visualization
- Native Prometheus support
- New feature: Interactive exploration of Prometheus data

12. Toil
"Toil is manual, repeated work with no lasting benefit which scales linearly with your service"
- If teams are busy firefighting, they don't have time to engineer
- Keep legacy systems working, but have a clear path forward
- Keep extra effort on the team low, if possible
- Strive for immediate benefits
- Focus on removing repeated, manual tasks of no lasting benefit
- Show that you free up time and reduce toil

13. Sanity & sleep
- If it's not actionable, it's not an alert
- If it's not urgent, it's not an alert
- Important but non-urgent incidents are handled during business hours
- Predict your usage so you add capacity during business hours
- If there's no playbook, it does not go into production
- If a service does not have proper SLOs and alerts, it does not go into production
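
"Predict your usage" can be as simple as fitting a straight line through recent samples and extrapolating. The Python sketch below does this with ordinary least squares; PromQL offers a `predict_linear()` function for the same purpose, but the helper `predict_linear_sketch`, the free-disk-space readings, and the choice to extrapolate from the last sample are assumptions made for this example.

```python
from typing import List, Tuple


def predict_linear_sketch(
    points: List[Tuple[float, float]], seconds_ahead: float
) -> float:
    """Fit value = intercept + slope * t and extrapolate past the last point."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in points)
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    if variance == 0:
        raise ValueError("all timestamps are identical")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = points[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


# Hypothetical free disk space in GB, sampled once per hour.
free_gb = [(0.0, 100.0), (3600.0, 90.0), (7200.0, 80.0)]

# Expected free space four hours after the last sample.
print(round(predict_linear_sketch(free_gb, 4 * 3600), 6))  # 40.0
```

An alert on "predicted free space drops below zero within a few days" fires early enough that capacity can be added during business hours, not at night.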

14. Perspective & Incentives
"An engineer can talk for hours about source code; try that with the CEO"
- Managers: revenue, process execution
- Architects: clean design, process definition
- Product/Service owners: powerful dashboards
- Team leads: morale, quick execution
- Operators: reduce toil, increase sleep
- Tell everyone what they need to hear (but never lie)

15. Post-Mortems
- Mistakes happen
- It is important to learn from mistakes so as not to repeat them
- To write a good incident report, there must be no fear of retribution
- Blame-free post-mortems allow everyone to document exactly what went wrong and in what order
- It is important to build trust among the teams and management

16. Leverage
- One combined system allows for correlation and combination
  - Power usage against service load
  - Optical networks against outside temperature
  - Datacenter power feed load against new deployments
  - ...and lots more
- Metrics are the starting point of most observability stories

17. Oracle
- One source of truth for
  - Tactical overview for current state
  - Dashboards for drill-down
  - Auto-generated PDFs for customers
  - Global SLO statements for sales
  - Usage exports for accounting
- If all you have is a hammer... choose your hammer well

18. Thanks!
Thanks for listening!
Questions?