LISA 2019 Conference Report

[Image: the Portland waterfront]

LISA is one of my favorite infrastructure conferences, in large part because of the concerted effort it has made in recent years to be more welcoming and open. It has pushed strongly towards DevOps and SRE work and away from the traditional system administration it was known for six or seven years ago. I also appreciate that, unlike SRECon, it is not solely focused on building things out for a massive, global deployment. That, and the fact that it draws such a broad cross section of attendees and speakers, from Google SREs to people working at universities to folks at SMBs, is why I think it’s also a great conference for people to give their first talk at (and why I like speaking there).

The big theme I saw this year in Portland was a heavy emphasis on containers, especially on using them with Kubernetes. This has been a growing trend over the last few years, but this year probably half the workshops I saw were about various aspects of working with them, from CI/CD pipelines to security to monitoring.

This year, my talk was What Connections Can Teach Us About Postmortems, which tied together two interests of mine: history and incident analysis. However, I also wanted to write up some of the other talks I saw at LISA this year, to give folks a taste of what the conference has to offer and what I think is worth a look if you didn’t have a chance to go. If you’re thinking about attending a conference in the infrastructure space, or giving a talk that’s relevant to it, I highly recommend going next year in Boston (where it will be colocated with SRECon Americas East).

Keynotes

The opening keynotes this year were both really good; the first was Alice Goldfuss’s Container Operators Manual, a talk I’d seen a version of from her appearance at Lead Dev London. The general thrust of her talk was that the hype around containers hides the cost of running them in production, something she has been doing for several years now. It was a strong introduction to the complexity of working with containers from her particular point of view (a deployment with many microservices on bare metal), but it was also a good demystification of containers as a technology. She led off by making it clear that containers aren’t really anything new: they are just processes run from tarballs inside namespaces, which are inside cgroups (and she broke down what each of those is). She also provided a good rundown of what works well in containers versus what does not, and countered the myth that containers are magic that will reduce headcount by explaining just what else is required to make container-based environments work well. Even if you already think you know everything about containers, this talk is a joy to watch; if you’re just getting into them, it’s a great preview of what to look out for.
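She didn’t show code, but as a rough illustration of that “just a process inside namespaces inside cgroups” framing, here’s a minimal Python sketch of launching a process in fresh namespaces (Linux-only, needs root; the flags and the choice of /bin/sh are my own for illustration, and the cgroup setup and rootfs tarball are left out entirely):

```python
# Minimal sketch, not production code: a "container" is roughly a normal process
# started inside new namespaces (cgroup setup and the rootfs tarball are omitted).
import ctypes
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)

CLONE_NEWNS = 0x00020000   # new mount namespace
CLONE_NEWUTS = 0x04000000  # new hostname/UTS namespace
CLONE_NEWPID = 0x20000000  # new PID namespace (applies to children we fork)

def run_in_namespaces(argv):
    # Detach this process from the host's namespaces, then fork and exec.
    if libc.unshare(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID) != 0:
        raise OSError(ctypes.get_errno(), "unshare failed")
    pid = os.fork()
    if pid == 0:
        os.execvp(argv[0], argv)  # the "container": just this process
    os.waitpid(pid, 0)

if __name__ == "__main__":
    run_in_namespaces(["/bin/sh"])
```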

The other opening keynote this year was Rich Smith’s In Search of Security Shangri-La, a great talk about the breakdown of the relationship between ops and security, with some amazing pull quotes (“the security industry generates FUD to sell hope” and “hope is not a strategy but it makes a lot of money”). His talk was very critical of the security industry for failing to serve actual end users (often citing “human error” as the cause of incidents and producing “solutions” that are unusable by most people) and for failing to be a good partner (security folks often exalt obscurity in order to cultivate an air of mystery around their profession) — but he turned that around and also blamed ops for letting them get away with it. He outlined a number of antipatterns in security (most of which we’ve avoided here at Truss as we build out our security practice) that he has seen drive poor behavior — his big example was internal phishing tests, which are often used the wrong way (how many people click through is a bad metric; how quickly the phish gets reported is a much better one). I thought this talk was really good as someone who is *not* coming from the security end of things.

The closing plenaries this year started off with J. Paul Reed’s When /bin/sh Attacks: Revisiting "Automate All the Things", a look at the tradeoffs of automation and, more broadly, at the human factors space. Specifically, he talked about how automation and the drive to scale can increase the dangers in the event of an incident, and how we can be a bit myopic in how we look at incidents (and not-incidents). His talk began with some air traffic control audio from JFK that was eye-opening to say the least, and had a lot of other great bits that you need to see (or hear) to get the full effect. The big takeaways were that improving incident response takes actual effort and that automating too many things can make it difficult to understand what is going on. We end up eliminating many of the not-so-bad failure cases so that only the truly devastating ones remain, and those are both more dangerous and harder to diagnose and fix.

The second closing plenary, Denise Yu’s Why Are Distributed Systems So Hard?, which had probably the cutest slides of the entire conference, was a great discussion of where distributed systems grew from and an overview of the problem space around them. While I was familiar with most of the topics she covered, she did a great job tracing the origins of many of our current assumptions (shoutouts to Sun Microsystems in the 1990s, and who knew the CAP theorem only dates from 2000?), all illustrated with her hand-drawn slides. I think it would be a great introduction for folks who have not worked in the infrastructure space (or even in a technical role) to the issues we have to think about when designing distributed systems.

Talks/Workshops

  • Fuzzy Lines: Aligning Teams to Monitor Your Application Ecosystem from Kim Schlesinger and Sarah Zelechoski was about the challenge of aligning dev and ops teams around monitoring, focusing on three specific arenas: people, process, and tools. For each of these, they had recommended practices for getting both sides working toward shared goals.
    For people, they talked about creating a group narrative (where did we come from and what problems are we solving) that anyone in the group can tell — it requires a shared culture and a shared metaphorical language. The group also needs to commit to a set of shared values that everyone at all levels of the organization is held to, and the members need to regulate each other (they gave the example of telling someone “we don’t do that here” when those values are violated). This sounded very familiar to me based on Truss’s method of promoting its values.
    For process, they talked about many of the same things we do at Truss — open communication and talking in shared Slack channels instead of DMs, for instance. They also talked about setting clear expectations around responsibilities rather than tasks — for instance, monitoring isn’t any one person’s job, but app developers are responsible for the health of the application and ops folks for the underlying infrastructure, so both will be doing monitoring. And in their weekly syncs, they try to build transparency and share the impact and value of the work they’re doing.
    For tools, they emphasized how important it is that everyone has confidence in the monitoring tools. To get there, they use a single shared monitoring platform, and all of their monitors are written as code so they can be templatized and are less ambiguous (and, as they noted, code works better with screenreaders). There’s a rough sketch of what that monitors-as-code idea can look like after this list.

  • How Math, Science, and Star Trek Help Us Understand the Value of Team Diversity from Fredric Mitchell was a great talk about a subject we feel very strongly about here at Truss. He started off with a number of examples from science where discoveries didn’t reach their full potential, or weren’t really understood, until people with different perspectives were brought into the mix — the two big examples being the development of Warfarin and the way sperm fertilize ova. This reminded me a lot of my talk on Connections, where one of Burke’s points is that you can’t know where the next leap in one field will come from, because it could easily come from a completely different field.
    He then went on to illustrate the value of diversity by highlighting the different roles on a team using the crew of Star Trek: Voyager, which was great (despite that show not being great, at least in my opinion). There was a lot of good material here, more than I can boil down, but one point that stuck with me was that part of the value of diversity is that we’re more likely to question ideas coming from a “different” teammate, which leads to more questioning of assumptions overall. We should reframe the “prove me wrong” mentality as “show me what I’m missing,” because it is less confrontational and more of a challenge to see someone else’s perspective.
    His advice on how to grow the diversity of your organization came down to setting SMART goals for recruiting, sending challenges and your tech stack to underprivileged organizations, following more people who aren’t like you, doing that reframing of “prove me wrong,” and, when conflict comes up, making your goal to be successful rather than to be right.

  • Alex Hidalgo brought us Earthquakes, Forest Fires, and Your Next Production Incident, a retrospective on the Incident Command System and where it came from. Alex started with the Mount Laguna fire in 1970, went through the 2002 Department of Homeland Security directive that mandated its use by emergency services, and then described how the same model can be applied to our environments. Most of this was well-trod ground for me — it was mostly a good reinforcement that this is a solid model to follow, emphasizing the importance of anticipating problems, training to deal with them, and testing that training with chaos engineering and drills.

  • Brad Shively’s Storytelling for Engineers covered a topic near and dear to me (and very similar to what I would go over in my own talk), so I was looking forward to it. Most of his talk focused on email, but I think it applies equally to most forms of asynchronous written communication. In addition to highlighting the importance of storytelling (Jessica Hilt’s Strategic Storytelling talk from LISA 2016 goes into this a lot more), he did a good job giving tangible, engineer-targeted advice. I liked his framing of the subject of your email as its “character” and character arc (what is changing and how it is changing), couching the knowledge required to understand the email in terms of global and local variables, and comparing bad emails to bad unit tests (they incur tech debt).

  • Pulling the Puppet Strings with Ansible from Brian Atkisson was not immediately relevant to most of the work we do here at Truss, where we try to work with immutable deployments as much as possible. However, it was an interesting account of Red Hat IT’s migration from deploying desktops with Puppet to using Ansible, including an intermediate state where they were running both at the same time. If you’re working in an environment where you need to make this kind of transition, this is probably a great talk to look at for ideas on how to deal with the problems of switching your tools midstream.

  • Ops on the Edge of Democracy, by Chris Alfano and Julia Schaumburg of Code for Philly, was a good talk on civic tech at a smaller scale. The running theme was that civic tech at this scale has to optimize for something other than global reach or profit, which is the big problem with a lot of top-down Silicon Valley “solutions.” Tech in this space has to be built at a smaller scale with a longer-term vision; these solutions may be around for decades, and they need to cater to their local communities. Some of the examples given from Code for Philly were a transit map built specifically for disabled commuters and work to improve school budget transparency, both done in close cooperation with members of the affected communities. My only real quibble with this talk was the citing of RMS’ Four Freedoms, which feels a little problematic at this point. That said, I think this is a great talk for anyone who has an interest in working with local civic tech nonprofits.

  • Dan O’Boyle and Brian Artschwager presented Expect the Unexpected! A Method for Handling Unplanned Work, which covered how their ops team at StackOverflow plans for unplanned work. This was definitely more oriented towards the kind of work I’ve done at previous jobs, where the majority of tasks would dribble in over the course of the week and were very difficult to plan for. To a large extent, this was a more detailed description of how they do kanban there, but they had a few specific tips from bootstrapping the process: 1. Planning meetings are not the time for disagreements; 2. Hold meetings (with an agenda) to define tasks before planning; and 3. Let the process define the tools, not vice versa. They also had some specific Trello tweaks and other mechanisms they added. You can see a summarized breakdown of their entire process at https://www.evil.cards.

  • Friday morning, I went to Madhu Akula’s Defenders’ Guide to Container Infrastructure Security; this was good for me to see as someone who hasn’t done a ton of work in this area. The talk focused on straight-up Linux containers and Kubernetes run on your own hosts, so it was a little less useful for me, since here at Truss we try to use bare containers and Kubernetes is overkill for most of our projects. He did point out a few good tools, though: trivy, amicontained, dockle, dive, and truffleHog all looked interesting.

  • Right before my talk, Frances Hocutt presented Testing For the Terrified, an introductory talk on how and why you should get started writing tests, even for the “simple” glue code that has traditionally made up the bulk of ops folks’ programming work. Most of this wasn’t new to me at this point, but for infra folks who haven’t had much exposure to writing tests, it was a good introduction.

  • Qui Nguyen presented Fast, Safe, and Reliable: The Future of Configuration, a semi-deep dive into Yelp’s system for doing service configuration. They wanted to make it easy for developers to change configurations but also make it safe to do so, so they looked at a number of options and decided to put configurations into files. Unlike keeping them in a separate code module, files make it easy to have environment-specific options, and unlike environment variables, changes don’t require a restart to take effect (the application can watch the file and know when to update; there’s a rough sketch of that pattern after this list). Qui pointed out that one of the driving reasons for building this system was the realization that configuration changes contributed to as many incidents as code changes, and so needed to be handled similarly. They’ve also thought about moving to a datastore with an API, but files have a lot of great features: people are already used to working with them, git works great for version control, and rsync is easy to scale for distribution, so for now they have decided to stick with what works.

  • Corey Quinn gave us Post No AWS Bills: Cloud Cost Optimization Without APIs, which was basically his rant about why most technical solutions for cloud cost optimization suck (though he might just be a tiny bit biased on this subject). However, his argument had some interesting parallels with a few of the other talks that pointed out the flaws of optimizing everything for infinite scale. He places most of the blame on the fact that optimizing cost is an extremely difficult task if you want better insights than “turn off all those idle instances in your DR deployment.” He also highlighted the gap between what the finance folks are concerned about and what the engineering folks are concerned about. There were a lot of interesting insights here and Corey is a great speaker, so this is worthwhile to watch, especially for folks building large cloud infrastructures. At the end of the talk he gave five quick tips on where to look for cost savings: start with the biggest numbers (don’t worry about trying to save 5 cents off a $2 charge when you have another $10,000 charge elsewhere), look at data transfer (this is not easy to figure out from AWS — it’s cheapest to transfer between us-east-1 and us-east-2, for instance, but that isn’t necessarily obvious), look at ALBs (the formula for computing LCUs is pretty wacky), check your managed NAT gateways (often a source of most of your traffic), and review your data lifecycle policy (do you have a bunch of unnecessary data being kept around?).
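As promised above, a footnote to the Fuzzy Lines talk: the speakers didn’t name a specific monitoring platform, but the monitors-as-code idea they described might look something like this hypothetical Python sketch, where one shared template gets stamped out per service (the Monitor fields, query syntax, and service names here are all invented for illustration):

```python
# Hypothetical sketch of "monitors as code": a single template, parameterized per
# service, rather than hand-edited monitors in a UI. Fields and query syntax are
# made up; a real setup would render these into whatever platform you use.
from dataclasses import dataclass

@dataclass
class Monitor:
    name: str
    query: str
    threshold: float
    notify: str

def latency_monitor(service: str, threshold_ms: float, channel: str) -> Monitor:
    # One definition shared by dev and ops; only the parameters vary per service.
    return Monitor(
        name=f"{service} p95 latency",
        query=f"p95(latency{{service='{service}'}})",
        threshold=threshold_ms,
        notify=channel,
    )

monitors = [
    latency_monitor("checkout", 250, "#team-checkout"),
    latency_monitor("search", 400, "#team-search"),
]
```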
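And here’s a rough sketch of the file-watching pattern from Qui Nguyen’s configuration talk. This is not Yelp’s implementation, just the simplest form of the idea: poll the file’s modification time and re-read it when it changes (the path, format, and poll interval are assumptions, and a real system would also handle validation, partial writes, and safe rollout):

```python
# Minimal sketch of reloading configuration from a file without restarting the
# process: poll the file's mtime and re-read it when it changes.
import json
import os
import time

CONFIG_PATH = "/etc/myservice/config.json"  # hypothetical path

class ConfigWatcher:
    def __init__(self, path):
        self.path = path
        self._mtime = None
        self.config = {}

    def maybe_reload(self):
        # Re-read the file only when its modification time changes.
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:
            with open(self.path) as f:
                self.config = json.load(f)
            self._mtime = mtime
            return True
        return False

if __name__ == "__main__":
    watcher = ConfigWatcher(CONFIG_PATH)
    while True:
        if watcher.maybe_reload():
            print("config updated:", watcher.config)
        time.sleep(5)  # poll interval chosen arbitrarily
```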

In addition to the talks I saw, there were a few others I heard good things about but didn’t have the opportunity to see because of conflicts. These are all up on YouTube now, so you can catch them there.

And that’s my view of LISA this year! I hope to see at least some of you at LISA next December in Boston!