Building a Data-Driven Engineering culture

Setting up Observability the right way

Software is eating the world. This is a cliché that most of you know!

But when you build software and something breaks, what’s the first reaction of your team?

A data-driven engineering culture enables teams to mitigate and resolve conflicts. But where does one start? Setting up observability, and setting it up the right way, can accelerate your team’s journey to becoming data-first.

Observability practices to drive a data-first culture:

From our personal experience and interactions with senior engineering leaders, here are 6 ways to set your team up for success.

1. Enable democratic access to observability data and monitoring dashboards:

While an engineer might not typically bother to check the health of a service that’s not directly related to their domain, it’s common to have indirect dependencies that need to be checked. Avoiding data silos ensures that teams can get to the root cause without making your DevOps team or admins the bottleneck.

2. Create a culture of team accountability:

Empower your engineering team, and hold it accountable, for follow-up communication (post-outage) to all stakeholders (both external and internal):

  • What was the root cause of the issue?

  • Why wasn’t it mitigated previously?

  • What remediation actions have been taken to avoid it in the future?

The same document should be circulated among your engineering team as well as to the relevant/impacted stakeholders in the company.

For some inspiration, here’s how Heroku’s engineering team publishes follow-up reports.

3. Make code performance and monitoring part of developers’ KPIs:

Modern engineering teams deploy code in tandem, so expecting the DevOps/SRE team to monitor every alert is inefficient. Instead, hold developers accountable for the performance and monitoring of their own work, starting pre-deployment. Some top-performing teams include the following in the developer’s charter:

  • Check for instrumentation of observability within the CI/CD pipeline

  • Measure the performance of the code after deployment/integration

  • If it’s a new feature/service, create a monitoring dashboard that can track the health of the service appropriately. If it’s a code change to an existing service, update the existing dashboard if needed.
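To make the first point concrete, here is a minimal sketch of an instrumentation check that a CI step could run. It assumes a team convention that does not come from this article: every public top-level function must carry a decorator named `timed`. Both the decorator name and the helper `uninstrumented_functions` are hypothetical, so treat this as one possible shape for such a check rather than a standard one.

```python
import ast

# Hypothetical team convention (an assumption, not a real library):
# every public top-level function must be wrapped in a "timed" decorator.
REQUIRED_DECORATOR = "timed"


def uninstrumented_functions(source: str) -> list[str]:
    """Return names of public top-level functions lacking the required decorator."""
    tree = ast.parse(source)
    missing = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if node.name.startswith("_"):
            continue  # private helpers are exempt in this sketch
        decorator_names = set()
        for dec in node.decorator_list:
            # Handle both @timed and @timed("label") / @metrics.timed(...)
            target = dec.func if isinstance(dec, ast.Call) else dec
            if isinstance(target, ast.Name):
                decorator_names.add(target.id)
            elif isinstance(target, ast.Attribute):
                decorator_names.add(target.attr)
        if REQUIRED_DECORATOR not in decorator_names:
            missing.append(node.name)
    return missing
```

A pipeline step could run this over the files changed in a pull request and fail the build when the returned list is not empty.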

4. Avoid data fatigue:

Have you noticed that your team members frequently leave alerting Slack channels because the notifications are irrelevant or too frequent? This builds a habit of ignoring data during investigations. Here are some strategies that you can consider:

  • Only have “Actionable” alerts - pair an alert with the “impact” of the alert to make it actionable.

  • Contextual alerting - map notifications to the relevant stakeholders (both horizontally across teams & vertically within the team)

  • Continuous improvement - have the on-call engineer create a report at the end of their rotation showing what % of alerts were relevant.
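The end-of-rotation report can be as simple as a few lines of code. The sketch below is an illustration only: the `Alert` fields, the `is_actionable` rule (an alert counts as actionable when it states its impact), and the report keys are all assumptions, not a format prescribed by any alerting tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    impact: str         # what breaks if the alert is ignored; empty if nobody wrote it down
    owner_team: str     # team that was notified
    was_relevant: bool  # marked by the on-call engineer during the rotation


def is_actionable(alert: Alert) -> bool:
    """In this sketch, an alert is actionable only if its impact is stated."""
    return bool(alert.impact.strip())


def rotation_report(alerts: list[Alert]) -> dict:
    """Summarise a rotation: % of relevant alerts, noisy alerts, alerts without impact."""
    total = len(alerts)
    relevant = sum(1 for a in alerts if a.was_relevant)
    relevant_pct = round(100.0 * relevant / total, 1) if total else 0.0
    return {
        "total": total,
        "relevant_pct": relevant_pct,
        "noisy": sorted({a.name for a in alerts if not a.was_relevant}),
        "missing_impact": sorted({a.name for a in alerts if not is_actionable(a)}),
    }
```

Alert names that keep showing up under `noisy` or `missing_impact` from one rotation to the next are good candidates for tuning or removal.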

Here’s a good article by the Atlassian team on some best practices to reduce alert fatigue.

5. Avoid scattering data across multiple tools:

There are too many developer tools in the market. Period.

Adopting a new tool is like building a muscle: it needs conscious effort from your engineers over a prolonged period of time. In this case, the effort could be getting used to the user interface or the query language. Create guidelines for setting up monitoring dashboards so that they are easy to reach in times of crisis or urgency. As teams build habits around their respective tools, it only gets harder to migrate, so the sooner you consolidate, the better.

6. Look at the Total Cost of Ownership (TCO) and not just the tool pricing while evaluating options:

Just because a tool is open source or free doesn’t make it the go-to option. Sometimes, orchestrating and managing open-source tools can be demanding. If your team is very lean or stretched thin, avoid tooling that will require constant maintenance and development.

Choose tools that save your team’s time. A quick time-to-value also improves adoption. Once you reach a scale where the tool cost pinches too much, the TCO will start weighing toward the open-source option.
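A rough way to compare options is to put engineering time on the same scale as the invoice. The sketch below assumes a simple model that is ours, not an industry standard: monthly TCO = license cost + infrastructure cost + maintenance hours × hourly engineering rate. The function names and any figures you plug in are illustrative.

```python
def monthly_tco(license_cost: float, infra_cost: float,
                maintenance_hours: float, hourly_rate: float) -> float:
    """Monthly total cost of ownership under the simple model described above."""
    return license_cost + infra_cost + maintenance_hours * hourly_rate


def cheaper_option(options: dict[str, float]) -> str:
    """Return the name of the option with the lowest TCO."""
    return min(options, key=options.get)
```

With made-up numbers, a paid tool at 500 per month that needs 2 hours of upkeep at a rate of 60 per hour comes to 620, while a free self-hosted tool with 150 in infrastructure and 20 hours of upkeep comes to 1350. At a larger scale, where the paid license grows to 6000 and the self-hosted setup needs 900 in infrastructure and 40 hours, the comparison flips to 6120 versus 3300.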

We're Just Getting Warmed Up

At Dr Droid, our team is building tools to simplify the lives of engineering teams. And we are listening to what you have to say!

If you have any engineering practices to share that help drive data-first culture, tell us in the comments below.