Besides our regular evening TechTalkThursdays, we tried new time slots: during lunch and at 08:00 in the morning. Neither worked better than the evening slot, as both drew fewer participants.
This article summarizes the talks by Demian Thoma and Daniel Lorch.
Nine hosts and manages thousands of servers for its customers. The company recently moved to a new monitoring solution based on the open-source tools around Prometheus. Nine’s Demian Thoma talks about how Nine implemented the new monitoring solution and how it gave the team more insight into their infrastructure.
Nine was using Nagios before switching to Prometheus. Changing the monitoring stack allowed them to simplify the setup, gain more insight into their services, and remove a separate analytics infrastructure stack.
What does reliability mean to you? In his talk, Daniel Lorch reiterates the claim that reliability is the most important feature of any system. But a service only needs to be reliable enough to keep its users happy: investing too much in reliability results in higher cost (engineering time and infrastructure) without added benefit. Investing too little, on the other hand, results in unhappy users.
How do you determine and agree upon what “reliable enough” means for your services and your organization? Site Reliability Engineering provides tools and concepts to formalize this discussion, notably:
Watch the 30-minute talk below to learn about these concepts and see how an example SLI/SLO is defined for a fictitious game platform. Links to further information are provided at the end of the talk.
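To make the idea of an SLI measured against an SLO a little more concrete, here is a minimal sketch in Python. It is not taken from the talk: the request counts, the 99.9% target, and all names (`SloReport`, `evaluate_slo`) are hypothetical. It assumes a simple availability SLI, defined as the share of good requests among all requests in a time window, and it derives the error budget in the usual SRE sense, as the number of failed requests the SLO still tolerates.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    slo: float               # target fraction of good requests
    budget_total: float      # failed requests the SLO tolerates in this window
    budget_remaining: float  # tolerated failures not yet used (negative = overspent)

    @property
    def slo_met(self) -> bool:
        return self.sli >= self.slo


def evaluate_slo(good_requests: int, total_requests: int, slo: float) -> SloReport:
    """Compare an availability SLI against an SLO target for one time window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the interval (0, 1]")

    sli = good_requests / total_requests
    failed_requests = total_requests - good_requests
    budget_total = (1.0 - slo) * total_requests
    return SloReport(sli, slo, budget_total, budget_total - failed_requests)


# Hypothetical window: one million requests, 500 of which failed.
report = evaluate_slo(good_requests=999_500, total_requests=1_000_000, slo=0.999)
print(f"SLI: {report.sli:.4%}, SLO met: {report.slo_met}")
print(f"Error budget left: {report.budget_remaining:.0f} of {report.budget_total:.0f} requests")
```

A positive remaining budget suggests there is room to ship changes and take risks; a negative one suggests that reliability work should take priority. What counts as a "good" request (for example, a successful response within a latency threshold) is exactly the kind of definition that has to be agreed upon between the service owners and its users.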
On this occasion, we would like to once again thank our speakers for presenting!
Find future TechTalks on our Meetup page: