Buffer’s March Engineering Report: 5 Whys, Reliability, Open Source, and More

March was an incredible month for the engineering team at Buffer. Looking back, it’s exciting to see how much we accomplished and how well we executed on some of our engineering goals. Here are some stats and a tl;dr for March:

  • 1 person converted to full-time (yay Dan Farrelly!)
  • 6 “5 whys” conducted
  • 1 new open source project started
  • 15 minutes of system-wide downtime
  • Switched to New Relic for platform monitoring
  • Bufferbot (our HipChat Hubot) now deploys our app

5 Whys

One of the big and exciting changes we made in March was to become more disciplined about reflecting when something unexpected or unintended occurs. Every startup faces this, and we’re no exception. Almost daily, something happens that’s not ideal and could have been prevented, or could be prevented in the future. Examples include unintended bugs, downtime, a mistake made by a developer, or a negative experience for a customer. We’ve been heavily influenced by Eric Ries’ lean startup methodology, and one thing he suggests is conducting the 5 whys to learn from mistakes.

March was a great time to test this out, as we had our fair share of issues. Conducting the 5 whys was slightly rough initially as we learned what works, but we’ve now gotten into a great flow of doing this more often.

The idea behind the 5 whys is to dig at least 5 levels deep into a particular problem. It is credited to Taiichi Ohno, who used it while developing the Toyota Production System. The method holds that an unintended consequence could usually have been prevented on at least 5 different levels. During a 5 whys we have a “5 whys master” who leads the discussion. We try to include on the call everyone who was involved in some way with the unintended consequence we’re discussing (we conduct these over Google Hangouts). The “5 whys master” asks the group 5 whys, and the group answers them. At the end of the exercise, we go through each question/answer pair and come up with 5 corresponding “corrective actions” that we agree on. We assign each corrective action to one person who owns it, so that the issue is hopefully prevented in the future.
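To make the shape of a session concrete, here is a minimal sketch of how one could be recorded. It is an illustration only, not a tool we use: the `Why` and `FiveWhys` classes, their field names, and the `problems` check are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    """One level of the exercise (all field names are hypothetical)."""
    question: str           # the "why" asked by the 5 whys master
    answer: str             # the answer the group agreed on
    corrective_action: str  # the action agreed on for this level
    owner: str              # the one person who owns that action


@dataclass
class FiveWhys:
    incident: str
    master: str
    participants: list[str]
    whys: list[Why] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return reasons the session is incomplete (empty list if none)."""
        issues = []
        if len(self.whys) < 5:
            issues.append(f"only {len(self.whys)} of 5 whys recorded")
        for i, why in enumerate(self.whys, start=1):
            if not why.corrective_action.strip():
                issues.append(f"why {i} has no corrective action")
            if not why.owner.strip():
                issues.append(f"why {i} has no owner")
        return issues
```

A session is only considered done once `problems()` returns an empty list, which mirrors the rule that every level ends with an agreed corrective action and a single owner.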

What I really like about this is that it lets us deal with issues when they happen and work towards ensuring they won’t happen again. At the same time, it means we don’t have to worry about issues that haven’t happened. I now trust that if something comes up that we didn’t foresee, we’ll conduct a 5 whys and learn from it. We let the 5 whys dictate what documentation we need in place and what adjustments to make to our onboarding process.

Here are some examples of the 5 whys we conducted this month:

  • System-wide outage for 15 minutes due to weekly digest processing
  • Some paying business users were sent an email asking why they had chosen not to continue, even though they were still on the trial

Blogging

One of the big things we’re trying to do more of is engineering blogging. Transparency has been one of the key values at Buffer, and with that in mind I’m hoping the engineering team builds a good habit of sharing lessons and experiences with the rest of the world.

We had a great first month with this new focus. I wrote a more in-depth post on Medium about the evolution of how Buffer’s scheduling core works, and it was well received. Andy wrote a post showing a sneak peek at the new Buffer iOS 7 app (that’s now available!). And here’s Colin’s blog post from March: Yaks, Alligators and Bikes.

Open Source

Continuing with our value of transparency and our hope to really contribute back to the community, we finally found some time to open source our date time picker. Our date time picker is unique in that it lets you first pick the date and then a time in a distinct flow. You’ll notice it’s used whenever you set a custom scheduled post. There have been some calls in the bootstrap-datetimepicker community for us to open source it, and Niel was able to find time in his busy schedule to set this up. We’ve already had some great pull requests to make this much better!

Reliability

Coming off a great February, we unfortunately had a bit more downtime in March. There was about 15 minutes of downtime while we were sending out weekly digest emails on Monday, March 10, at around 4:40am PDT. We learned a great deal from this through our new 5 whys post-mortem habit. So far, we haven’t had any further trouble sending weekly digests in the weeks since.

One thing I was also very excited about was setting up status.bufferapp.com. We plan to get into a good pattern of updating status.bufferapp.com whenever we’re investigating, working through, or have fixed any sort of issue that comes up. We know that when our service isn’t working as it should, the only thing that makes that experience better is being honest and keeping everyone in the loop as early as possible. I’m very excited to have status.bufferapp.com play a key role in our aspiration to do that.

We also switched our monitoring tool to New Relic. After a couple weeks of trialling New Relic, we found some amazing new insight and detail into how our platform was behaving. We especially enjoyed digging into profiling traces of various important transactions. Integrating New Relic into our architecture was a breeze, and we hope to lean on it heavily to ensure Buffer is as reliable and performant as it can be.

Security

As in the past couple months, we’ve worked through a few different reports that were submitted through our security page. We’ve closed up a few more security holes, added more CSRF checks, and added protection against brute-force login attacks. We’ve been seeing an uptick in brute-force login attempts, so this was key for us.
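For readers curious what brute-force protection can look like, here is a minimal sketch of one common approach: counting recent failed logins per key (an email or IP address) in a sliding window. This is not our actual implementation; the `LoginThrottle` class, its thresholds, and the in-memory storage are assumptions made for illustration, and a real deployment would likely keep the counters in a shared store.

```python
import time
from collections import deque


class LoginThrottle:
    """Hypothetical sliding-window limiter for failed login attempts."""

    def __init__(self, max_failures=5, window_seconds=300, clock=time.monotonic):
        self.max_failures = max_failures      # assumed threshold
        self.window_seconds = window_seconds  # assumed window length
        self._clock = clock                   # injectable for testing
        self._failures = {}                   # key -> deque of failure timestamps

    def _recent(self, key):
        """Drop failures older than the window and return what is left."""
        attempts = self._failures.setdefault(key, deque())
        cutoff = self._clock() - self.window_seconds
        while attempts and attempts[0] <= cutoff:
            attempts.popleft()
        return attempts

    def is_blocked(self, key):
        """True if the key has too many recent failures to try again."""
        return len(self._recent(key)) >= self.max_failures

    def record_failure(self, key):
        self._recent(key).append(self._clock())

    def record_success(self, key):
        # A successful login clears the key's failure history.
        self._failures.pop(key, None)
```

The login handler would call `is_blocked` before checking credentials, then `record_failure` or `record_success` depending on the outcome.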

Bufferbot Changes

One of the challenges for a few of the engineers on the team is slower internet. Niel especially has often had trouble deploying our main application from his local repository, most likely because of his local internet speeds in South Africa. Motivated in part by our third Buffer retreat being in South Africa, we decided to move our deployment flow away from a local script and push. Now, whenever a developer pushes to a branch, the unit tests run on our Jenkins server. We’ve had our own Hubot, which we call Bufferbot, that’s set up to always be on in our HipChat. We extended Bufferbot so that it helps us deploy from Jenkins to our Elastic Beanstalk environments super easily! It also has the nice side effect of coordinating deploys in HipChat. We now simply type @Bufferbot deploy Production-Web, and Bufferbot takes care of the rest!
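Here is a rough sketch of the logic behind a chat deploy command like this. Bufferbot itself is a Hubot, so this Python version is only an illustration of the idea, not our bot’s code. The `DeployCoordinator` class and the `trigger` callable (standing in for whatever kicks off the Jenkins job) are hypothetical; only the command format and the Production-Web environment name come from our setup.

```python
import re

# Matches messages such as "@Bufferbot deploy Production-Web".
DEPLOY_PATTERN = re.compile(r"^@Bufferbot\s+deploy\s+(?P<env>[\w-]+)$")


class DeployCoordinator:
    """Hypothetical handler that allows one deploy per environment at a time."""

    def __init__(self, environments, trigger):
        self.environments = set(environments)
        self.trigger = trigger  # hypothetical callable that starts the deploy job
        self.in_progress = {}   # environment -> user currently deploying

    def handle(self, user, message):
        """Return the bot's reply, or None if the message is not a deploy command."""
        match = DEPLOY_PATTERN.match(message.strip())
        if match is None:
            return None
        env = match.group("env")
        if env not in self.environments:
            return f"Unknown environment: {env}"
        if env in self.in_progress:
            return f"{self.in_progress[env]} is already deploying {env}"
        self.in_progress[env] = user
        self.trigger(env)
        return f"Deploying {env} for {user}"

    def finish(self, env):
        """Mark a deploy as done so the environment can be deployed again."""
        self.in_progress.pop(env, None)
```

Because every deploy goes through the chat room, the bot can refuse a second deploy to the same environment while one is running, which is the coordination side effect mentioned above.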

Looking forward to April

Seeing as I’m writing this post on April 14, I’m quite excited to write about all that’s happened in April so far. We got quite a lot done on our third Buffer retreat, and I’m excited to share that and more with you in a couple weeks. If you have any questions at all about what we did in March, I’d love to answer them! Just comment below or tweet me!