The Leanpub Podcast: Roberto Vitillo, Author of The System Design Manual: Learn how to design, build, and operate large scale distributed systems

Len: Hi I'm Len Epp from Leanpub, and on this episode of the Frontmatter podcast I'll be interviewing Roberto Vitillo.

Based in London, Roberto has over 15 years of experience in the tech sector in a variety of roles, including software engineer, tech lead, and manager. He currently works for a company you may have heard of called Microsoft, where he has worked on a number of projects, including the launch of SaaS products, and he is responsible for one of the largest data pipelines in the world, which processes second-by-second events from billions of devices all around the world.

You can follow him on Twitter @ravitillo and check out his website at robertovitillo.com.

Roberto is the author of the Leanpub book The System Design Manual: Learn how to design, build, and operate large scale distributed systems. In the book, Roberto will show you how to design, build, and maintain large distributed systems, by discussing not only some of the details but also basing his observations on a firm grasp of universal core principles for managing systems.

In this interview, we’re going to talk about Roberto's background and career, professional interests, his book, and at the end we'll talk about his experience using Leanpub to self-publish his book.

So, thank you Roberto for being on the Leanpub Frontmatter Podcast.

Roberto: Thank you for inviting me.

Len: I always like to start these interviews by asking people for their origin story. So, I was wondering if you could talk a little bit about where you grew up and how you first became interested in computers and technology?

Roberto: I'm Italian, but I actually grew up in Switzerland. And so my passion for computing started at - I think, around five - when I watched Back to the Future. I became obsessed, wanting to build a time travel machine. I thought having a computer was one of the requisites to do so. And so I worked on my parents until they gave me one. It was kind of like love at first sight. I didn't know how to use it, and neither did my parents. All I had was this book, and it told you how to use MS-DOS. And there were like three pages on QBasic.

That's how I started getting into programming - initially writing some simple translators from German to Italian, and then I got into games, like creating maps for - I don't know? Quake and that kind of thing.

And then the internet came, and the real fun started. I finally had access to all the knowledge in the world. I could learn whatever I wanted. I got deeper into this gaming thing. I built a 3D chess game and I started getting into physics, thanks to the gaming thing. So that's kind of like how my passion grew.

I realized in high school that I loved math and physics even more maybe than Computer Science. But, I chose to get a Computer Science degree. Because that was the one that could give me more opportunities.

Len: Just before we go on, I've got a couple of questions. I checked out your LinkedIn profile, so I think I know what maybe some of the next steps are in your career. But it's so interesting to talk to - I mean, most of the people that we interview on the podcast are people who are in software and technology and things like that. And there's a big difference between people who started out before the internet and people who started out after the internet. And those who started out before often reference books - paper books. That you went to the bookstore and you bought a book to learn how to do things.

Roberto: Yes.

Len: The next level of old timer was magazines.

Roberto: I remember those, yeah.

Len: And I mean, and the real old timers - like Jerry Weinberg liked to say - he was the first computer he ever met. His first job was to be a computer.

But so - the question I always like to ask people, because of this vast range of experience - and it's funny, you can be old when you're not that old in computing - if you were starting out now with the intention of having the kind of career you ended up having, would you do a formal Computer Science degree or not?

Roberto: I don't think it's really necessary these days, because of the wealth of knowledge available. Now what you need, though, is someone to guide you. The knowledge is all there, but you also need someone that can tell you, "Okay, here are the 10 things you need to learn, and these are the pointers to it." I think when that's missing, you might still sort of learn the wrong things, or learn in the wrong ways. And that tends to stay with you. So you still need that guide. But if you have that guide, do you need a formal degree? Maybe not.

Also, Computer Science can be very theoretical; in Italy, especially so. We do very little practice. So, a lot of math. Proving theorems and whatnot. I enjoy that. Now, do I get to prove theorems every day? No, I don't. So there is that as well. But I think there's a way you can have a good career without any formal degrees.

Len: We'll talk a little bit about the systems design part of interviews and stuff like that, but if you're - and a lot of the people listening to this podcast are often people who are trying to build their careers, whatever sort of stage they're at - if you're applying to work for a big company like Microsoft, does having a Computer Science degree make a big difference? Or does it depend on the circumstance? Is there any general advice that one can give even?

Roberto: Usually the candidates we get, like as engineers - we get candidates that already went through the recruiters. The recruiter's kind of like a firewall. They have some criteria. At Microsoft, in general, I think we look for Ivy League schools and that kind of thing. Usually the recruiters actually look for keywords. So I think that's a general part of an industry. They basically do pattern matching.

When I look at resumes, I don't even look what schools the candidate went to. I don't care what school they went to. If they have some work experience, I look at the work experience. Of course, if they don't have any work experience, then you need to look at something. So, the school might help, but if the school's not there - what if the person has maybe, I don't know, contributed to open source projects? Did they do anything of interest? I definitely will look at that as well. But yeah, I don't even actually remember last time I looked at schools actually.

Len: That's really good to hear. I should say, it is a really interesting observation though that - for people who haven't applied for a job before - someone has to have something to look at, they don't know who you are - and all they're going to know is what you've done. And even if what you've done isn't even directly relevant to the field - if you've done something, it gives people something to look at, and something to maybe pique their interest.

This is one of the reasons that, famously or infamously, extra-curricular activities can be very important when you're an undergraduate, if you're going to be applying for jobs without job experience.

Thank you for sharing that, that's really interesting to hear. I get lots of different answers to that question about, "Would you do things the same way now as you did back then when things were so different?"

And so you grew up in Switzerland, and I believe you studied for Computer Science in Italy.

Roberto: Yeah, that's right.

Len: And then - maybe I'm skipping something in the middle - but you ended up doing a Master's in Computer Science as well.

Roberto: Yeah, that's right. In Italy, we have an internship at the end of the Bachelor's. You usually work in a company, like an internship. So I went and worked at the Italian Institute for Nuclear Physics. And that's where physics came in. I still liked physics, and I wanted to be involved in something that had that, and some good with it.

The Institute was sort of how my career started. I worked on monitoring software for the ATLAS experiment. It's one of the big four experiments at the Large Hadron Collider. The software I was working on was fetching health data from the hardware, and tracking that everything was fine. So it was tracking the health of its electronic parts, and it was displaying that in the control room of this experiment, and there were interesting things.

I joined the project when the Large Hadron Collider was turned on. And so all of the software was tested with cosmic rays. It's kind of like particles that come from outer space. And then when they turned on the machine, we had the particles go from there that were actually getting smashed in the machine. And those were happening with a way bigger frequency. So all the software was doing way more than it used to do before, and there were a lot of crashes and things not going well.

Len: That's really interesting. So I saw that on your LinkedIn bio. So you worked on the ATLAS experiment at the Large Hadron Collider. I didn't know you were actually there when it was turned on. And for those who don't know about it - the Large Hadron Collider is a giant physics experiment that was meant to sort of probe the fundamental kind of qualities of matter, and I mean everything else, right?

And when you're talking about the health of the systems, this is like - I mean, I'm just saying this as like a sort of pop science fan, I listen to Sean Carroll's Mindscape podcast and that kind of thing, that's about my level of understanding. But when you talk about the health of the systems, this is fundamentally important, because the things that are being probed are so sensitive, that if a cosmic ray comes in and flips a bit somewhere, it’ll change the recorded result of the experiment. I'm just guessing this is hopefully a somewhat correct explanation of what you were there for.

And so, you need to make sure it works. I actually had an acquaintance who was a theoretical physicist who worked on the LHC project for a while. And he once said to me, something that just really struck me, that like, "Everybody thinks about the science. And of course that's important. But getting thousands of scientists and programmers and stuff like that, working at a very high level, to cooperate coherently and effectively from all around the world on such a giant project," he goes, "That's the real achievement".

And so, what was it like for you? I know you were at the Berkeley Lab. So you were in the US for a while, I imagine?

Roberto: Yeah, that came later.

Len: Oh that came later. Okay, okay.

Roberto: In-between that, I continued to work on the ATLAS experiments, but just--

Len: So you went like halfway around the world to work on the same project?

Roberto: Exactly. I was still going to Geneva, getting flights - just the flights were a bit longer. I got very lucky, like that - there was an opening at the time. I was working on those sort of things. And so I could continue in Berkeley what I was doing in Italy. I had stopped working in monitoring, then I went on working on analysis frameworks. Think of it like the pandas of CERN, that sort of tooling, or the Spark of CERN, if you will? They have their own system tools and whatnot.

So, what was it like? For me as a scientist, it was extremely exciting. Because there were people from so very different disciplines from all over the world, and I really liked that - the fact that I was talking to a mathematician in the morning, then at lunch with the statisticians, and then in the evening with physicists.

The thing that always struck me is that physicists have such a big breadth of knowledge. They know a little bit of everything. They really understand things really well. They're well-versed often in math and programming and whatnot. So I always had very enlightening conversations and very interesting talks there. I loved that part. And I kind of miss like working with the variety of different people.

Len: Just to go into a little bit of detail, how do you check the health of a system like the ATLAS experiment at the Large Hadron Collider? What does that mean?

Roberto: Yeah. So the health is - it depends on the hardware you're checking. And what it means is like - the hardware needs [?] - this is all the things I'm doing right now. And knows what are good ranges. It's kind of like a car, that's - it's really a machine, and the machine has certain ranges in which it can operate well. And when it goes out of it, then some alarm bleeps and someone needs to look at it, if eventually the particular piece of hardware needs to be replaced, or another works for what is going on.
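The range-based alarms Roberto describes here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ATLAS monitoring software: the component names, the healthy ranges, and the idea of ranking alarms by how far a reading sits outside its range are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    component: str  # hypothetical component name
    value: float    # measured value
    low: float      # lower bound of the healthy range
    high: float     # upper bound of the healthy range


def out_of_range_fraction(r: Reading) -> float:
    """Distance outside the healthy range, as a fraction of the range width.

    Returns 0.0 when the reading is inside its range.
    """
    width = r.high - r.low
    if r.value < r.low:
        return (r.low - r.value) / width
    if r.value > r.high:
        return (r.value - r.high) / width
    return 0.0


def alerts(readings):
    """Return only the out-of-range readings, worst offender first."""
    flagged = [r for r in readings if out_of_range_fraction(r) > 0.0]
    return sorted(flagged, key=out_of_range_fraction, reverse=True)


# Made-up readings, for illustration only.
readings = [
    Reading("hv_supply_1", 1500.0, 1400.0, 1600.0),  # inside its range
    Reading("cooling_loop_a", 30.0, 10.0, 20.0),     # above its range
    Reading("fan_7", 900.0, 1000.0, 3000.0),         # slightly below its range
]

for r in alerts(readings):
    print(r.component, round(out_of_range_fraction(r), 2))
```

Sorting by severity is one simple way to make sure the person watching a dashboard sees the worst problem first; a real system would likely weigh other factors as well.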

So it's a lot of like thresholding alerts, and making sure that whoever was looking at those dashboards was looking at the most impactful piece of information, because there are so many different pieces of hardware, so you won't be overwhelmed if you're just looking at everything. We had to find ways to prioritize what the people in the control room were looking at, so that they were catching the most important bits that were acting up.

Len: And did you have sensors that were like looking for cosmic ray bursts or something like that? Again, I'm speaking from the sort of like podcast-listener level of understanding. So, you were checking the systems, but you were also checking the things that were affecting the systems at the same time?

Roberto: Yeah. So, we did have satellite dishes and whatnot. But actually when the machine was - the whole part, we always kind of shielded it. It was on the ground, and so there were -

Len: It was shielded, yeah. That's just fascinating. And so you - I mean, we could talk about that for a long time. But you moved on from there to Mozilla, another company people have probably heard of. And you had been in San Francisco for a while, and then you moved to London and worked for Mozilla. What was it like? I should say - long time listeners to the podcast know, I at one point moved to London - actually I have had a few points in my life I moved to London. But what was it like working for Mozilla?

Roberto: That was very unexpected. My love for computer games was still there, so I was hacking some stupid game or something together. And I found a bug in the WebGL compiler. So I sent a pull request to fix that. And the crew then got in touch with me. So it was completely random. I didn't even know there was a company behind Mozilla, I thought it was just completely open source at the time. And then I realized, "Oh, there's actually a corporation behind it. There are 1,000 more people working on it."

Mozilla is a very interesting company. When I joined initially, I realized immediately how deeply management there cared about end users. Mozilla doesn't store any private information about users. I mean, they really do the right thing. And it also saddened me a bit, knowing that it's competing against those big giants. And they just have so much money that they control their products - how can a small company like Mozilla succeed, right?

But being there, I was - I mean, I still use Firefox, because I know the people behind it, they really care about me. They will never try to get my data or make money out of it. I trust them that they will do the right thing. So that really struck me when I joined, and it's one of the reasons I loved working there.

Len: On that note, speaking of sort of sad things. Mozilla's been in the news lately, and it's had some - I mean, a lot of people might have thought of it as kind of a big company, but it is - as you say in the context - a small company. It's had a bit of a setback. How do you feel about that?

Roberto: I think they let go of 25% of the workforce. I know some of those people. Yeah, it's - I don't know the details behind it. I mean, I've read the public posts and - yeah, I know the user base is going down. On the other hand, I also know they renewed their contract with Google. Which is actually their main competitor, which is always - it's a bit weird, right?

Len: Yeah.

Roberto: Their main competitor's also their main source of income.

Len: Yeah.

Roberto: So I think they need to try to find new ways of monetizing before the float. I think they're downsizing some parts of Firefox. They can go and explore different avenues. I know they launched a VPN. I don't know what else they have in store. But yeah, we'll see how it goes.

Len: Thanks very much for sharing that. That's just so fascinating. And it is interesting to hear a little bit about the story of how Mozilla doesn't keep your data, but their biggest competitor is also their biggest - or was their biggest financial supporter, Google.

Roberto: Yeah.

Len: It's a really fascinating story. And for anyone interested in tech, if you don't know it, I recommend looking into it.

And so, before we go on to talk about your work at Microsoft and on big systems and your book, of course - I wanted to talk a little bit more about where you are right now. We were talking a little bit before this, and Roberto actually lives in the neighborhood I first lived in when I moved to London, Balham. But they're very different. Was that the first neighborhood you moved to when you moved to London?

Roberto: No, it was not. I was in West Kensington before.

Len: Okay.

Roberto: And then I came here.

Len: Okay.

Roberto: I don't know how it was when you came here, but now this is like the baby area. If you have a baby, you come here. Because all the baby shops are here and you see a lot of people with strollers, and everything is very different.

Len: It was not like that when I moved to Balham in 1999. I mean, it was not like that. It was seen as - it wasn't like a romantically scary area, but it was seen as a bit of a scary area. And it had a little bit of rough and tumble about it, but you could - even a sort of dumb prairie guy like me could tell that it was kind of at the beginning of a transition to something different.

I remember going back there about six years ago and being just shocked to see a coffee shop. And there was nothing like that there, but it's sort of fun to hear that now it's kind of like this pleasant sort of stroller and babies place. But it was not like that.

What was that like for you when you first moved to London? Did you enjoy the life there?

Roberto: Yeah, it's - I didn't like the weather. That was the first thing, coming from San Francisco, and every day you open the window, it's blue. Blue sky. And here's like grey sky. It's not even rain, because it doesn't rain that much. It's just this grey sky, perpetually grey sky which is - yeah, it takes a bit to get used to. That was probably the biggest difference.

The other thing I will say is that London doesn't feel like a big city really. Because everything - there aren't all of these skyscrapers and what not. So you always feel like you're in a neighborhood. I like that as well.

And the public transport is just amazing. I don't need a car or anything here. It's - you wait one minute, there's the next underground train coming. And in San Francisco with the bus, it's like - yeah, 20 minutes wait.

Len: Are you telling me the Northern Line is reliable now?

Roberto: Oh yeah, it is. Oh it wasn't like that?

Len: No, no.

Roberto: During the summer - I mean, it is livestock transportation. During the summer it's like you squeeze in and you can barely breathe. But it is reliable if you can get in.

Len: That's great to hear. You're saying things that are making me very nostalgic. But yeah, that's fantastic.

I wanted to ask you something specific that comes up. Starting a few months ago, I time-stamp the interviews with a scary little, "This interview was recorded on such and such a date" at the beginning, because things are changing so much during the pandemic, and the date does matter. The weeks can matter. And so one thing I've been doing with guests is asking - how has the pandemic affected you, and particularly the people around you, in the place where you live?

Roberto: I think we started shelter-in-place here around the end of March, maybe before that? I guess for me, the main thing was - I have a small child, so it's one year old - or it was one year old at the time. So that was very hard, working from home with a small child - he obviously doesn't understand why you cannot be with him, although you're right next to him. So that was very hard. But personally, the employer I work for understands that, and I think that went on for about a couple of months.

And then the schools and nursery started re-opening. I think they re-opened at the end of June, mid-June or so? So that was a new challenge, so we weren't really sure if we should send our son there. It's like, "Okay, what do we do here? Should we do it, should we not?" But the thing is, we just couldn't function as human beings without help. So we decided to send him back to nursery. And that's when we got time to breathe again, and started having some semblance of life.

Len: And were you working from home when shelter-in-place started?

Roberto: Yeah.

Len: Or were you already working from home?

Roberto: Yeah. Both me and my wife were working from home. We were just doing turns. I was working two hours, and then she was with our kid. And then vice versa. Obviously I was working less. But yeah, we found kind of a way to manage.

Len: And had you ever worked from an office in London, or had you always worked from home?

Roberto: Microsoft has offices. So I usually work at their office. Maybe I was working one day from home in normal times, but usually I was more working in the office than from home. And Mozilla as well. They used to have an office as well here. So although I wasn't going every day, I was still happy to go there and talk to my colleagues, see them in person. I used to work from home, but I also know it can easily spiral out, where you don't spend any time with people. I like going to an office and seeing people, and I miss that as well. I'm a bit fed up of all the conversation over cameras and what not. Like I really feel like, "Okay, although I'm an introvert, I would like to see them in person." So yeah. I don't know, how is it for you?

Len: It's interesting that you say that. Leanpub is a distributed team, everybody works from home. I worked in offices in London when I worked there, and I always hated commuting. Maybe it was because I was on the Northern Line. But I really hated it. It seemed very arbitrary to me.

Specifically actually, I have a funny story. I come from the only place - I think it was the only place in North America that didn't have daylight savings time. So I just didn't know about it really. And I remember one day I got up and I went to the Balham Tube stop - and it was empty - and I was like, "What?" - as opposed to the sardine can you were describing before. I was like, "What's going on?" And I was an hour late.

I got to work, and I remember my boss - who was a really nice guy, but he kind of like tap, tap, tap on his watch. The only time in two years I had ever been-- Or well, it must have only been six months. But like, the only time I'd ever been late. And I was like, "Yeah, I made a mistake, what's the problem? There's nothing time-sensitive." It just made no sense to me, so I've always found the office world to be a little bit arbitrary.

But I will say - just yesterday, a friend of mine came to town - and we actually got to hang out in person, and that was the first time I'd hung out in person with anyone other than my colleague Peter from Leanpub for like four months. And yeah -

Roberto: Oh wow.

Len: Being in person with people makes a big difference. I'm never going to be a convert to like every day nine-to-five office life. But I'm definitely a convert to the idea that in-person interactions are really important. It makes a difference. Like a little tear came to my eye when I saw him. It really matters. And I think a lot of us are - even us work-from-home introverts are kind of realizing what we're missing.

And so, moving on. You worked for Mozilla for a while, but then you switched to working for Microsoft. I know you're working on like a huge, huge system for them. So could you explain a little bit about what you're doing for Microsoft now?

Roberto: Yeah. I'm working on internal service, it's a popular service platform. The idea's that if your office subscribes and come to us, we give you some tools that you can use to instrument your applications - and then those applications send back data to us. And the kind of data they send is usage data such as - how long does it take the program to start [?]? How many garbage collection cycles there are. So it's mostly performance and health measurements.

And our platform takes all the data coming in from over a billion devices in the world all over the place, and it aggregates it and makes it easy to consume. So that's kind of like what the platform does in a nutshell.

So I started my journey at Microsoft working on ingestion. So it was quite a - in a way, a career change for me - because I always worked on analytic stores. And when I joined Microsoft, I switched to the ingestion side of things. So it's really the first sort of service you hit when your phone sends a packet of data to Microsoft. So I worked on that. There were a lot of new, interesting problems to solve. And then later, I sort of started taking over the pipeline, the entire pipeline of this internal SaaS product.

It's been a great learning experience, a lot of fun. And the main reason actually why I joined Microsoft was I wanted to get experience on these large systems. There's only a few places where you can really work on systems that ingest a trillion events per day. And Microsoft is one of those. So, yeah.

Len: It's interesting. We'll start talking about your book finally, in a moment. But from what I gather, a lot of your work is about like anticipating what can go wrong.

Roberto: Yes, exactly.

Len: And it can be anything. It really can be anything. It can be like someone unplugs something accidentally, or there's a cosmic ray flips a bit, or anything that can go wrong. I guess I'll ask you two questions. What's the real world problem you've encountered that most scared you?

Roberto: Cascading failures. So, what is a cascading failure? It's when you get so much load, and this happened actually - we got close to it during COVID, during the COVID emergency. So the thing, your system is able to handle a certain amount of load.

Len: And "load," we mean like just data packets coming in?

Roberto: Yes.

Len: Like bits being thrown at you?

Roberto: Yes. Bits being thrown at you. It could be the number of impressions, the number of users. There are different dimensions of the mesh rolled in. But usually you're going to have - just think, "Okay, the system can handle X amount of load," and then you've got some buffer in it. And - but the problem is, if you get way more than what you planned for, then the system starts to crash. It's like - you've seen it on your computer, everything starts to slow down. And this happens also in distributed systems. It's just that it happens at a bigger scale. More machines are crashing at the same time. So what happens then, is that usually those machines are behind load balancers. And those load balancers detect, "This is crashing, let me remove it from the pool." Now you have one less machine to handle the same load. And then another - so the point is like, you keep removing machines behind this load balancer, while the load is still there. And so the machines that remain there need to handle way more than they used to do before, right?

So this creates a cascading, it's called "cascading failure," which is very hard to recover from. Because it kind of creates like this ring of doom where the machines that have been removed from this pool, come back online. But by the time they come back online, they get beamed to death again. So it's like this weird ping pong behaviour when everything just goes to hell and -
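The feedback loop described above can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions, not a model of any real Microsoft system: the function name `simulate`, the even split of load across machines, and the rule that exactly one machine drops out per round are all made up for illustration, and the recovery "ping pong" of machines rejoining the pool is left out.

```python
def simulate(total_load: float, machines: int, capacity: float,
             max_rounds: int = 20) -> list[int]:
    """Return the number of healthy machines after each round.

    Each round, the load is split evenly across the healthy machines.
    If the per-machine share exceeds capacity, one overloaded machine
    crashes and the load balancer removes it from the pool, which
    raises the share for every machine that remains.
    """
    healthy = machines
    history = [healthy]
    for _ in range(max_rounds):
        if healthy == 0:
            break  # nothing left to serve traffic
        per_machine = total_load / healthy
        if per_machine <= capacity:
            break  # the pool absorbs the load; the system is stable
        healthy -= 1  # one machine crashes and is removed from the pool
        history.append(healthy)
    return history


# Load within capacity: nothing happens.
print(simulate(total_load=900, machines=10, capacity=100))

# Load 10% over capacity: every removal makes the overload worse.
print(simulate(total_load=1100, machines=10, capacity=100))
```

In the second run the pool never stabilises: each removal pushes more load onto the survivors, so a modest initial overload ends with no healthy machines at all.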

Len: And just to give people an image of that - I mean, this is the image that I was having, was like, there's a data center we're talking about. And there's actually stacks of literal machines. And there's a stream of information coming in. And if too much is being directed at one machine, the load balancer removes that machine from receiving the information for a period of time.

Roberto: That's right, yeah.

Len: Right, okay.

Roberto: And the way you can see it, there are load balancers at different levels. So in the data center, on the rack, you might have a load balancer. In the data center you might have another load balancer. And then across--

Len: Oh so it's a piece of hardware?

Roberto: Yeah. You can do it in hardware or in software. Usually these days it's done in software.

Len: Okay.

Roberto: And it basically means those are all the machines that take your data before you get to see it in the first place. But their software is very lean, so the only thing they do is really just forward it to you and then do a bunch of health checks. It can get complex there, but it depends what you do with it. But at once, it's a piece of software that just acts as a proxy to your service. And you can have different levels of load balancers. So - to give an example - if one region has a problem, then the whole load for that whole region might be redirected somewhere else. I don't know? From North Europe to South Europe, and so on. So you can have these cascading failures that can affect your system at the global scale.

Len: And you said that this real world example, that you experienced with sort of the COVID surge. So you're talking about like all of a sudden everybody's using these services for things they used to do occasionally all day long.

Roberto: Exactly. It was initially a slow increase, but significant. And then all of a sudden we went into double-digit percentage increase every day. And at that point, like - from one wire to the other, you might get 30% more traffic. And, yeah - but that was quite a lot of fun.

So to avoid those cascading failures, it was really all hands on deck. People scaling out services. And there were not enough machines available. So, everyone in the Azure world worked really hard. Because those things are used in hospitals and whatnot. So we did our best to make sure they kept being available and useful for the people that needed them. I actually think most companies did a quite good job. But most people don't think about who is behind those services. And, yeah - probably a lot of engineers with sleepless nights.

Len: It's interesting you say that, because that will - after I ask you my last pre-book question, that's a nice segue to the book. Where a lot of people - even a lot of software engineers actually just don't think about the kind of stuff that we're talking about. The real machinery that makes things go.

But so, the last question before I talk about your book, is - so, I asked you what's the scariest real world thing that you've experienced. What's the scariest thing that you worried about that hasn't happened, but that like - if it did happen, that would be a calamity?

Roberto: Yes. The scariest thing will be probably leaking user data. That's probably the worst thing you can do. Like let's say - somehow data that has been entrusted to a company, gets out in the open. That's the biggest - one thing is like "the service is unavailable." It's bad, but you can survive that. But leaking user data means you breach the trust. Like imagine your documents being dumped somewhere in the open? So yeah, that's the thing that worries me all the time. When I review code or when I write code, I always think, "Okay, can this create a privacy or security incident in any way?" That's my biggest worry, yeah - always, always.

Len: Thanks very much for sharing that. I was guessing you were going to say "solar flare." But as you say - I mean, as long as the hospitals can keep running and things like that - yeah, leaking user data, that would be the worst. Because once it's out there, it's out there.

Roberto: Yeah, totally, yeah.

Len: It's gone. And the information about you and your past doesn't change. And once someone has it, they have it. So, that's a really, really scary thing.

Moving onto your book, The System Design Manual: Learn how to design, build, and operate large scale distributed systems. I wanted to ask you - what was your inspiration for writing this book?

Roberto: Actually, the inspiration came from the desire to create a product. I always have some sort of side thing I do, and, yeah, I think at the end of last year, I decided to create some sort of product. I didn't know what it was. It was just something I wanted to do on the side, and learn how to do it well. And so I'm thinking, "Okay, what can I do, and what can I not do?" And the thing is - I have a full-time job, I have a kid. So whatever it was, it would have to be something where I had some sort of unfair advantage, so that I could reduce the number of hours I was working on it.

And so a book or a course came naturally to me. Initially actually I wanted to do a class, and then I realized that recording is way harder than I originally expected it to be. And you cannot go back and change words, like in a book. I started with a class - and yeah, I went to a book. So that's how I got to here.

And then I started thinking, "Okay, what can I teach that provides most value for a reader?" And from all the things I thought I could teach, I thought, "Okay, maybe I could write something about distributed systems?" Because what's out there - it's not great. There's a lot on the theoretical side, a lot on the practical side. Things like Kubernetes and tools. But not much in the middle. And when I learned about this, I wanted to learn about distributed systems - I recall downloading books and reading about theoretical things, like, who understands how they fit in practice?

I was reading as well about things like Kubernetes, or what existed before Kubernetes - and that was one thing. I said, "Okay, I have this tool, I know how to use it - but how do you actually design a system? How do you get a blank canvas and design something from scratch? What is the process here? What do I need to know?"

And so, that's where the book came in. I said, "Okay, let me try to sort of condense that information in a book." And then that was basically just a table of contents. So I shared that with different people. I started getting a feeling, said, "Hey, does it make sense talking about this specific topic or not?" I showed it to some colleagues. And then I went from there.

Len: Thanks very much for sharing that. We can talk a little bit in the last part of the interview about the nuts and bolts of how trying to create a video class is very different from writing a book, and has its own advantages and challenges as well. But it's just fascinating for us to - I mean, I've heard this story from people before, where they're like, "I wrote 100 pages of speech that I read out, and recorded it. Ten hours of videos. And then something changed, and I can't go -"

You can't go back and change like a 10 second blip, without really ruining the whole experience. Whereas with a book, you can actually just go and change it. So it's like - but then again, with a book - as amazing as books are, it's a lot different than a video. And I interviewed someone - I don't know if you've heard of Nigel Poulton? He lives in the UK, and he actually shifted his whole career towards doing courses and creating products. And he talks about how you have to get it exactly right when you're doing video courses if you want them to be good.

And so like, you have to re-record entire segments. But then like the entire course - if all of a sudden something's out of kilter, it kind of ruins the experience. But -

Roberto: Yes, especially with technical topics. Like you maybe want to redefine a term. But you defined it at the beginning, and that really bugs me. It's like, "Oh, but this term - I need to redefine it. I have a way better way to express what I want to say."

The other thing is - if you're not a native English speaker, it's very hard as well. I mean I generally suck at languages. So there's that. Like writing's one thing, talking's really different.

Len: I think you're probably being modest, growing up Italian in Switzerland. I suspect--

Roberto: I speak many languages--

Len: I suspect you probably speak many languages better than most of us speak our own, but I know what you mean, and it's a very important point. That like, if you get something wrong in a text, you kind of select, delete, and rewrite.

Roberto: Exactly.

Len: And if you get something wrong in a video, it's like - you have to go back to the space you were in, and you have to put on the same shirt. And then you have to have the same ambient sound. It's actually kind of impossible. You have to re-shoot the whole segment. Which is one of the things that makes it so difficult.

I interviewed someone named Jane Friedman for this podcast. She's in the publishing world, not the software or technology world. And she did a great course on how to publish a book. They make sure that you've got every word written down before you record anything. It all has to be spelled out. And that's one of the reasons that we actually created a courses platform on Leanpub, in addition to books - it was the insight that if you've got a course, you've actually got a book already.

Roberto: Yeah, that's true.

Len: And we're trying to see if that's true. And even if it is true, if we can convince people.

So the inspiration for your book was partly that you wanted to create a product, and you wanted to create a product where it was something that you had an advantage on. But also, something that wasn't really out there.

You talk on your blog and in the book about the system design interview. And this is one of the reasons you wrote your book, was to help people who might confront that.

And so what this is, is that if you're applying for a job as a software engineer at a company - if you're applying to the kind of company that may have multiple rounds of interviews, one of those rounds of interviews might be the system design interview. And one of your observations is that people are often unprepared for this. So I was wondering if you could talk a little bit about what is the system design interview, and what are the most important things - other than buying your book - for people to do to prepare for this?

Roberto: The system design interview is where you get to design some piece of software. And usually in this day and age, it's a large scale system. It doesn't have to be. It could be also something smaller, it depends on where you're applying. But yeah - at least where I work, it's usually a large-scale system. And it's typically used for senior candidates. Because the typical round is, you do a bunch of coding exercises, and working puzzles. Which I'm not too fond of.

But then for senior candidates, usually you have several [?] of system design. So the idea, or what you're trying to answer there is like - can this person work independently on a solution to a problem? Because a senior [employee] is expected to do that. And just asking someone to write five lines of code to solve some moderate, weak puzzle, doesn't really show any of that, any of those skills. So the system design is - yeah, it's trying to answer that question.

Usually if you have someone with experience and you want to interview - I mean this is an interview question for someone with experience, and - yeah, you want the person to sort of to take the lead. You're kind of like following along. A person usually takes the lead, they'd ask questions. And kind of having an interaction with a person just like with a co-worker. You have a bunch of conversations. And if the person has some experience, then you will think about things that are usually missed.

So I feel like, yeah, the more senior the interviewers are, the more they think about those edge cases. What can go wrong? And how can we guarantee privacy and security? And well, if you're more junior - maybe you think, it's - the only thing that matters is scale. It's like, "Oh, how do we make this thing scale?" So actually, how do we make things scale is sometimes - most of the time, it's not the hardest part. It's - how do you make things that don't break a few days after you release them? And how do you make things that are maintainable? That are easy to operate and secure. So that's actually the challenge, rather than scalability.

Len: Thanks very much for sharing that. It's interesting. It's been a while since I've been in the - getting ready for the interview thing. But you do encounter the odd - you did talk earlier about how there are gatekeepers, and they have their own ways of doing key word searches and things like that. But sometimes you do encounter - I mean in the investment banking world where I came from, the version of the question that maybe people think is really cool - like in the olden days it was, "How many phone booths are there in Manhattan?" And you were supposed to sit there and come up with an answer. And that was what the consultancies asked you, right? And the investment banker answer is like, "I don't know." Right? Not make some shit up. And the sort of like answers that you're - that's the kind of questions that you're talking about - are like, "How can this system fail?" It doesn't matter how clever you are. You're going to be in a - "Well, I'm hiring you to be in a situation where you're going to actually have to solve a problem."

And so, I'm just sort of making a kind of theoretical observation about interviews. You have to be prepared for both the kind of tacky ones, because in order to get in, you're probably going to have to get through some gatekeepers. But if you get to the other side, there's another kind of resourcefulness that you're going to have to have available to you.

And one of the things you talk about in your book, you say - quote, "The tricky part is understanding failure modes, trade-offs, and costs, which is what skilled interviewers focus on". I was wondering if you could talk a little bit about an example of what a tradeoff would be? I think I'd listened to a talk you gave on - I saw it on YouTube about - I think there was a tradeoff between resilience and consistency?

Roberto: Oh yeah, that was maybe something I mentioned around the CAP theorem.

Len: Okay.

Roberto: Yeah. Let me give an example. So - and that's one of the main choices you have to make. Let's say you have a system that reads configuration settings in. And those configuration settings, they change continuously. It might be the quota for new users. Each user has a quota, and that user can only send up to the quota. And those quotas can change anytime, based on how they're built and users coming in, and whatnot. So what happens if the system that holds your configuration is no longer there? Do you just stop processing any requests, or do you continue? But then if you continue, you're no longer correct. Maybe all the quotas changed, or something has changed and you're letting things through. So this is one way to see what the CAP theorem is saying - if you have a network partition - so you have these two systems, they're no longer able to talk to one another - your servers and the configuration system. Then you need to choose consistency or availability. So that's one of the tradeoffs you have to make. And there are many, many tradeoffs like this when you build a system. Should you use a SQL database, should you use a NoSQL one?

And the thing is, there is no right answer. It's really all about tradeoffs, and for you to think about your specific use case, and think, what makes more sense to you. And actually a large part of those engineers, is trying to explain the tradeoff. It's kind of like trying to understand, "Why does the person I'm talking to - why did the person make the choice he did or she did?" And there is a tradeoff there, and - yeah, you're trying to get to the bottom of it.
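The quota example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the book or from any real system: `QuotaGate`, `Mode`, `ConfigUnavailable`, and the `fetch_quota` callback are hypothetical names standing in for "your servers" and "the configuration system". When the configuration system cannot be reached, one mode rejects requests (choosing consistency) and the other keeps serving with the last quota it saw (choosing availability, at the risk of being wrong).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict


class Mode(Enum):
    PREFER_CONSISTENCY = "consistency"    # reject when the quota cannot be confirmed
    PREFER_AVAILABILITY = "availability"  # keep serving with the last known quota


class ConfigUnavailable(Exception):
    """Raised by the hypothetical config client when the store is unreachable."""


@dataclass
class QuotaGate:
    fetch_quota: Callable[[str], int]  # hypothetical call to the configuration system
    mode: Mode
    cache: Dict[str, int] = field(default_factory=dict)

    def allow(self, user: str, used: int) -> bool:
        """Decide whether a user who has already sent `used` items may send one more."""
        try:
            quota = self.fetch_quota(user)
            self.cache[user] = quota  # remember the latest confirmed value
        except ConfigUnavailable:
            if self.mode is Mode.PREFER_CONSISTENCY:
                # Stop processing: never act on a quota that may have changed.
                return False
            if user not in self.cache:
                # Nothing to fall back on, so even this mode has to refuse.
                return False
            # Continue with a possibly stale quota: available, but maybe incorrect.
            quota = self.cache[user]
        return used < quota
```

Neither mode is the right one in general. Which one makes sense depends on whether letting a few requests through on a stale quota is cheaper than turning users away.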

Len: That's really interesting. So I guess - when things are going wrong, like when you're having say a potential sort of cascading kind of failure, like, who decides what to do?

Roberto: Yeah. So in that case - yeah. Usually it's the person on call. So, for example, in my team - we develop the systems, but we're also on call for it. And so - yeah, whoever ends up being on call during that period of time, needs to make those decisions.

Len: And what you're going to have to do afterwards is explain why you did what you did, but not in a theoretical, "I'm trying to get a job" context. But like an actual like, "Why did I turn off these 10 million customers and not those 10 million customers?"

Roberto: So that's one thing. And then the next thing the person does, is - usually you have a postmortem, and then you find a way to automate this. You try to automate all of the failure scenarios you've seen. Because in order to guarantee availability, the kind of availability that people expect these days - your system can be unavailable just for a few seconds every day. And to get there, you really have to automate all the things. And in the documentation you need to make those tradeoffs. But if the documentation is lacking, then whoever is on call will have to make those decisions on the spot.
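To put a number on "unavailable just for a few seconds every day", the arithmetic can be written out. This is a small sketch, not something from the interview: the function name `downtime_budget_seconds` and the availability targets in the loop are assumptions chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_seconds(availability: float,
                            period_seconds: int = SECONDS_PER_DAY) -> float:
    """Return how many seconds of downtime per period a given availability allows."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_seconds


if __name__ == "__main__":
    # Illustrative targets only; the interview does not name a specific one.
    for target in (0.999, 0.9999, 0.99999):
        budget = downtime_budget_seconds(target)
        print(f"{target:.3%} available -> {budget:.2f} s of downtime per day")
```

At 99.99% availability the budget is roughly 8.6 seconds per day, which is too short for a person to notice a page, diagnose the problem, and act. That is why the failure scenarios have to be handled automatically.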

Len: That's really fascinating. On a much smaller scale, that is actually how we approach things at Leanpub as well. I think it's sort of just best practice for any company that's got people that they can get to automate things. It's like - if you encounter a problem, you deal with it. And then afterwards you look at it and say, "What can we do to make sure this will never happen again?" And that's what you mean by automating it. Like if the signals start coming in that this kind of bad thing is happening, how can we fix it automatically? And this is sort of the secret to why you and I can have a video chat, and it's seamless now, whereas we couldn't have 10 years ago - because people adopted these kind of practices, and they've really worked out in a lot of ways.

So moving on just to the last part of the interview, where we talk about your experience writing. You started out doing a video class, and then before you finished that, you decided to do a book - if I understand it correctly?

But you're writing your book, you're publishing your book in progress. So often I ask people, "Why did you pick Leanpub for your platform?" If you're publishing in-progress, that probably answers the question. But how did you come about deciding to publish an unfinished book bit by bit?

Roberto: Yeah, so I've - when I got into this, I started reading about it, different approaches. And yeah, the one thing I didn't want to do is create something that nobody wanted to read. And so I really tried to iterate. The first MVP for me was the table of contents. Actually the very first MVP was a landing page. And then the table of contents came, and then later, some of the chapters - I tried to pick the chapters that seemed more relevant. So that's kind of like our approach to things.

And the reason is, I want to make something that was actually useful. I want to hear feedback from my readers. I want to understand how I can improve. So, maybe explain some things better, and avoid certain topics. And the only way to do that is by releasing your book continuously.

And actually one of the things that I realized then - there is, like, we are so used to - you buy this physical book, which then is no longer updated. But why does it have to be that way? Why can't a book be evergreen? Why can it not be like any other software, where you get updates from time to time? And, yeah - why can't it all be like that?

So I actually - initially I was very frustrated thinking, "Okay, what should I put in this table of contents?" And then I felt like it's never going to be fully complete, and I will never have all the things I want to put there. And then I realized, "This is just like any other software. I can keep adding things to it. I don't need to stop."

Len: Thanks very much for sharing that. That's music to my ears. It's funny, sometimes people ask us, "Is Leanpub only for programming books?" And it's like, "No, it's not." But then they ask, "Well, then why are they almost all programming books?" And it's because programmers have the attitude that you just expressed, better than I think anyone has on this podcast in the past, which is, "I want to be able to update it. I want to be able to change it. I want to be able to fix it. I want to be able to improve it. I want to be able to add to it and subtract from it when I need to." And the idea of - I mean, it used to be the case that programming was actually kind of like that. When you shipped discs in cellophane wrappers. When that release went out, that release went out.

But we programmers don't live in that world anymore. For most things, I mean. They can update stuff anytime. And that's one of the reasons that Leanpub has been so attractive to people who are in programming, is they're like, "Ah, there's a typo. I want to fix it." And if that typo is in a code sample, the code sample won't run when everybody tries it.

Roberto: Exactly.

Len: And then you're going to have hundreds or thousands of people who are all like tearing their hair out and angry at themselves, until they get mad at you when they realize it's not their own fault. And so, yeah - that's one of the reasons that like people with a certain kind of mindset have been attracted to Leanpub. Because we make it so easy to update books.

And so, you talked about how - and this is a very important thing to us, that it's actually kind of something where - a dimension where we have so much more work to do, but - interacting with readers. How do you do that? Do you put your email address at the beginning of your book? I know you have a newsletter that people can sign up for as well.

Roberto: Yeah, exactly. I collect a bunch of email addresses from potential readers through the landing page. I also get sometimes feedback from within Leanpub in the product, the other part. And I get random messages on LinkedIn actually from readers. Actually, I think most of the messages have been on LinkedIn.

Len: Oh, that's interesting.

Roberto: Yeah. I actually got very good feedback on LinkedIn. There's even some guy who keeps sending me feedback and finding small typos. I added him now in the acknowledgements, because he's doing such a great job. Yeah, and they just comment all over the place. I think everyone picks whatever he or she feels more comfortable using to contact the author.

Len: That's interesting. LinkedIn is one of the sort of services that we've actually not paid as much attention to as the others. But it's something that's been coming more and more on our radar, and that we might think more about sort of giving people an opportunity to - you can enter your Twitter handle and your Instagram and your Facebook on your author profile - but we don't have a LinkedIn option yet, and that sounds like something that we should probably add.

And so when you mentioned the landing page for your book - you have both a landing page for your book on Leanpub, but you also have one on the web as well, at systemdesignmanual.com, which I recommend everyone check out and sign up for the newsletter, and things like that.

I guess my last question about your process of writing the book - do you have - I mean, you want it to be evergreen - but do you have a schedule, like, "Every month I want a new chapter out," or something like that?

Roberto: Yeah, you're precisely right, actually. It's every month a new chapter. I mean that's - I've been on holidays for two weeks, so that definitely helped. I don't know how realistic it is with work and everything. But it's good to have a goal. Actually the goal is not just there in order to make sure you ship it, but also to cut scope. Sometimes if you don't have any timing constraint, then you just sort of go on tangents. And then you will never ship the full product, because you just spent so much time on something that nobody cares about.

So the fact that there are deadlines means, "Okay, I know I can talk about this at length for another 400 pages. But I need to stop, because I need to finish this chapter, and I have only two weeks left." So that helps a lot. And then later, once you have all the chapters out, once you have an MVP of those chapters, you can always go back and add more to it. But at least you have something to build on top of.

Len: Thanks very much for sharing all of that. I should mention actually, I should've done this before - MVP means "Minimum Viable Product." I'm sure most people listening to the podcast will know that. But that's a term from sort of lean startup world. And the idea there is partly that - whether it's a book or an app or something like that - get out the minimum thing, and show it to people - even if it's just a table of contents. Seeing if anyone expresses interest, and getting their feedback, is a really good idea for deciding both what to do and what not to do.

And so, one of the inspirations behind Leanpub was actually getting people to stop working on books - rather than spending three years writing a book - if your intention is to reach a wide audience, and you spend three years in isolation, and you then release your book and it reaches no one - it would've been much better to have released a table of contents, and then discover that you're not going to reach your right audience with that idea.

Well, thank you very much Roberto, for taking the time out of your day to do this, and coming direct from Balham. Which, just - I'm going to be thinking about that for the rest of my day. Thanks very much for being on the Frontmatter Podcast and for being a Leanpub author.

Roberto: Right, thank you for having me here.

Len: Thanks very much.

And as always, thanks to you for listening to this episode of the Frontmatter podcast. If you like what you heard, please rate and review it wherever you found it, and if you'd like to be a Leanpub author, please visit our website at leanpub.com.
