The Leanpub Podcast

General Interest Interviews With Book Authors, Hosted By Leanpub Co-Founder Len Epp

Find us on Stitcher, Player FM, TuneIn, CastBox, and Podbay.

Lee Baker, Author of Getting Started With Statistics: A Series of Bitesize Guides For Beginners

A Leanpub Frontmatter Podcast Interview with Lee Baker, Author of Getting Started With Statistics: A Series of Bitesize Guides For Beginners

Episode: #197, Runtime: 01:02:26

Lee Baker is the author of the Leanpub book Getting Started With Statistics: A Series of Bitesize Guides For Beginners. In this interview, Leanpub co-founder Len Epp talks with Lee about his background, his early work in medical physics and artificial intelligence, how he got into statistics, setting up his own company, how to protect yourself from being misled by statistics, his books, and at the end, they talk a little bit about his experience as a self-published author.

This interview was recorded on March 4, 2021.

The full audio for the interview is here: https://s3.amazonaws.com/leanpub_podcasts/FM175-Lee-Baker-2021-03-04.mp3. You can subscribe to the Frontmatter podcast in iTunes here https://itunes.apple.com/ca/podcast/leanpub-podcast/id517117137 or add the podcast URL directly here: https://itunes.apple.com/ca/podcast/leanpub-podcast/id517117137.

This interview has been edited for conciseness and clarity.

Transcript

Len: Hi, I'm Len Epp from Leanpub, and in this episode of the Frontmatter podcast I'll be interviewing Lee Baker.

Based in Scotland, Lee is a physicist, statistician, and programmer, who is CEO and co-founder of Chi-Squared Innovations, which helps scientists and businesses analyze, visualize, and model their data.

You can follow him on Twitter @eelrekab which is Lee Baker backwards, basically, and check out his website at chi2innovations.com.

Lee is the author of a number of books available on Leanpub, including Multivariate Analysis – The Simplest Guide in the Universe: A Holistic Strategy To Discover All The Relationships in Your Data, How to Lie with Numbers, Stats & Graphs: A Box Set Containing Truth, Lies & Statistics and Graphs Don't Lie, and most recently, Getting Started With Statistics: A Series of Bitesize Guides For Beginners.

In this interview, we’re going to talk about Lee's background and career, professional interests, his books, and at the end we'll talk about his experience with writing and self-publishing.

So, thank you Lee for being on the Leanpub Frontmatter Podcast.

Lee: It's great to be here, Len.

Len: I always like to start these interviews by asking people for their origin story. So, I was wondering if you could talk a little bit about where you grew up, and how you found your way into a career in data science?

Lee: Ooh, well that's actually a really, really long story - but I'll try and give you the bite sized version of it.

I started out originally as a physicist. I never had any interest in doing anything with data, really. But I just sort of fell into it, almost by accident. I did a couple of Master's degrees, in which I sort of moved away from my real passion, which was medicine. I did my first Master's in medical physics, and it was only after that that I started to work with data in different areas.

I started to get really interested in what you can do with data. These were really huge datasets at the time. What you might call "big data," although that particular term has been - it's completely out of proportion now from what it was back then. You need specialist tools now, which you didn't need back then.

But I was analyzing these large datasets, and starting to get a passion for telling stories with the data. The data itself, I didn't really particularly like - and nor did I like statistics.

I'm a self-taught statistician. I didn't like the statistics. It's a really annoying thing to have to do. But I just absolutely loved shaping the data and discovering things in it - and being able to tell stories. The story of that data.

That's my passion, and that's what I really love doing. And so I write a lot of - I know we're going to talk about the books, but I write books for beginners - to help them along that journey, to be able to tell stories with the data, rather than thinking of statistics as being just some kind of a tool.

Len: And are you from Scotland originally?

Lee: I'm not. I'm a very proud Yorkshireman. I came up to Scotland many years ago, back in 1999, just to finish off my education. I think they must have closed the gate behind me, because they've not let me out since.

Len: And you studied artificial intelligence as well.

Lee: I did. My PhD was in that, and that was a bit of a strange time, because I got a supervisor who was interested in me doing a PhD, and gave me a CD with some data on it, and an academic paper - and said, "Can you do something like that with these data?" These were agricultural data. It was all about soil. I'd never done anything to do with soil before. And so I looked at it, and this paper was about using artificial neural networks with some soil data.

I had a good read of it, and didn't understand most of it. But I took my time with it, spent a few days combing through it. And after a while, I thought, "Actually, yeah - I can do something with this." So he offered me a scholarship doing a PhD.

And so that's where I started doing neural networks. I'd never done anything like it before, but - yeah, I was really excited about it. It was a really interesting time. A really steep learning curve. I'd never done anything quite like it before. But it was a good time.

Len: It's really curious, actually - speaking specifically about the time that was starting about 20 years ago, I'm just looking at your LinkedIn profile - that you began your PhD. What was the work like at the time? I mean, did you have a big lab that you had to go to every day? Did you have a supercomputer available, or anything like that?

Lee: Oh God no. Supercomputers, that would've been nice. No, what we had was - I just turned up to my office every day. I got a bog standard computer, and most of the neural network stuff - we had to code it ourselves, and I was an absolutely terrible programmer. I had to learn to program, so I could then program neural networks - and then do all of that. And then - during the summertime, when all the students disappeared and all the computing labs were free - we'd then go and link computers up, so we could get some extra computing power.

Of course we have things like Hadoop and Spark now, to be able to do these things for you. But you couldn't do it back then, it didn't exist. You had to actually go in and physically link these computers up, and tell them to work with each other.

I had to get the tech guys to come and help me to do it. But we linked up all these computers, got all this extra computing power - and so most of the work actually was done during the summers, when there's no students there. I spent months then back at my desk, wondering why the hell it's so slow, when I don't have 20 computers all linked up.

Len: Thank you very much for sharing that. It's really interesting, this podcast - in a way - because so many people have worked with computers that have written Leanpub books, the podcast has become a bit of a kind of archive of stories of people from various eras in computing, telling about how they used to do things.

And it makes me think that - I remember when I started going to university in the mid-1990s, hearing stories about people like, "We used to be able to smoke in class," and "We also had to do punch cards for the computers." I'm sure that people who started university like just a few years ago, are thinking, "Well, actually physically hauling around and physically linking up computers to get more power, that's - I understand that's the way it used to be, but it sounds nuts."

Lee: Yeah, it does. And it's strange thinking back to the different eras. Because when I just first started doing my PhD in the office that I was talking about, it had just recently been refurbished. And before that, it used to be the staff bar. I mean, where do you get a staff bar in a university these days?

Len: You don't.

Lee: And times, they just keep changing - and trying to keep up with things can be - it can spin your head a bit.

Len: Yeah. And actually, one thing before we move on, I just wanted to ask you - you said you did your first Master's in medical physics. And for people who aren't accustomed to hearing those two words together, I was wondering if you could talk a little bit about what medical physics is?

Lee: Sure. It's where medicine and technology meet. So, you've got the doctors that go and make their diagnoses. But how do they make their diagnoses? They'll put people into a CT scanner or a PET scanner or some kind of a scanner, ultrasound scanner. And then they get the results back, and they use those results to be able to make their diagnoses. Well, all those machines that you use, that's what medical physics is all about. It's about creating those machines and making them better. Basically, it's machines to look inside the body without having to open the body up and get inside.

Len: All right, so, bombarding the body with some kind of rays to get information about what's going on inside, and things like that?

Lee: Yeah, that's usually the way - yeah.

Len: Okay.

Lee: It may be radiation or it may be sound waves or light waves. There are many, many different facets to medical physics. But yeah, it's usually done with some kind of radiation.

Len: Okay, okay - thanks. And so, okay - so getting back to your story. You studied medical physics. You ended up doing some artificial intelligence and soil work, and you finished the PhD. And then you faced the choice that everyone who finishes a PhD faces - which I did myself at one point - which is, "Should I pursue a career in academia, or should I do something else?" And you chose to do something else. I was wondering if you could talk a little bit about why you made that choice?

Lee: Yeah, actually there's a big chunk in-between there. After I finished my PhD, I then went back into the medical arena. And very strangely, I got - It's a convoluted path, and I'm not going to go through every step. But I ended up becoming a medical statistician, without ever having done any real statistics. I found myself in a research group. And I was very strong at maths. I was really good at working with data, but I'd not really done very much statistics. And I found myself helping everybody else to do their data analysis.

So they offered me the job of statistician. Then, I had to start doing a lot of statistics and learning about it all. And I started actually - rather than doing the statistics, because I didn't like doing it - I started programming the statistics, writing computer programs to do it for me. To do all the hard work for me. And I really, really loved doing that. Because it made sure that I didn't have to do the one thing that I didn't want to do, which was actually do the statistics. It is a really important path. Because it was at that point that I realized there was a huge gap out there in society for computer statistics.

Statistics is something where you start doing a piece of analysis. And then you've done a little bit of analysis, and then you have to do exactly the same analysis all over again, on a slightly different piece of data. And then you do the same analysis again on another piece of different data. And you keep repeating the same tasks over and over and over again. And it can take you months to do it. And that's why I started programming the statistics.

Now, I think you can imagine that there are a lot of companies out there that have got a lot of data, and they need to be able to do the same kind of analyses time and time and time again. Basically, they need to automate it. And there aren't that many companies out there who have the ability to be able to create statistical applications for companies to use - bespoke applications specifically written, built for them. So I realized there was a gap out there, and I decided, "It's time for me to -" basically, "expand my horizons, get out there into the real world, and start to help as many companies as I can." And that was, that's the origins of my company, Chi-Squared Innovations.

Len: I actually wanted to ask you specifically about that. A number of Leanpub authors that we've interviewed for the podcast are sort of independent types, who've made the choice to go and create their own companies and do consulting, and things like that. I was wondering - if you remember back in the day, how did you go about getting your first clients?

Lee: There's a certain amount of doing your own leg work. Getting out there. You go and network and you meet a lot of people, you shake a lot of hands, you drink a lot of coffee. And there's also - the other side of it - where people who already know you, say, "Oh, I've - this company's got a problem, and I know somebody who can sort this out." And they come and put you together. So there's a lot of work that you have to do for yourself to try and find clients. But there's also that little piece of luck, of knowing the right person - and that person thinking of you at just the right time.

But, yeah - it was a real hard slog at the beginning. It's the same for every company. It's not just us. Trying to get your first clients. And not just get your first clients, but then getting your next set of clients. Especially if you're working really hard on a project for a client right now - you're on your computer, you're working hard - every day, every week, every month. And when you get to the end of it, you don't have another client, because you've been too busy on the project that you've been doing. While you're in your office doing work, you're not out there in the world getting new clients.

And so, you can't be the CEO and a coder at the same time. You can't do two jobs. You can't do both of those jobs. You're either in the office doing work - or you're getting out of the office, trying to get new work in. And that was - it's obvious to anybody out there that's got their own business - and they've been through exactly the same path. But if there's anybody listening that's thinking about setting up their business, it's a big learning curve. You can't be in the office and out of the office at the same time.

Len: Actually, that gives me an opportunity to do a cheesy segue into the next part of the interview, which we introduced into these interviews about a year ago now, where we talk about the pandemic. You've mentioned getting out of the office. And so, I was wondering if you could talk a little bit about just how the pandemic has affected your life in Scotland, and how it's affected your work?

Lee: Ooh, it's been really, really tough in one aspect - and really easy in another. I'll explain that in a second.

The hard thing for us is that - we've got something over here called Brexit, and this has been going on for quite a few years now - and it's only just being resolved. Even though it's not been resolved, people think it's been resolved. But Brexit has been a big, big problem for us, because we are a research and development company.

And during Brexit, companies have been saying, "Now is not the right time for us. I want to talk about the things that you can do for us, the help that you can give us. But now's not quite the right time to actually go ahead and do things." So we've had a really difficult time during the Brexit period.

This is one of the reasons why I've written so many books. Because, frankly - I told all of our staff, "Get out there and create a new business for yourself that's going to bring money in, that is going to keep us ticking over. Because we've got to work through this Brexit period until it's all done."

And then, just as soon as we started to get a little bit of clarity over Brexit and companies started to come back to us and say, "It might be time for us now." As soon as that happened, the pandemic hit and everybody's closed again. And so we've had another year on top of that.

So it's been a really, really difficult time for us as a company, to have to try to find alternative sources of income. And Leanpub has just been great for us with that. We're publishing our books there. We've got other outlets as well. But it's been really tough for the business.

On the other hand, it's been really simple for me personally. Because more and more I was withdrawing to the office, because there wasn't so much work for us to be had out in the real world. My work became in the office. So for the last few years, I've been mostly office-bound, not been getting out there very much.

So when the pandemic hit, it was just business as usual. My office is up the stairs and turn left. It's my spare room. I've been in here for several years now. It feels like I'm never going to get out.

But that's where we are. So it's very mixed messages from me, when it comes to the pandemic - difficult in some ways, and very easy in others.

Len: Thank you very much for sharing that. That's actually not too different from the experience that I've had described to me from other Leanpub authors. Often it is people working for - especially if they work in things having to do with computing and consulting, and stuff like that. Often the office is at home, and the idea of remote work is leaving your house - not working in it. But people who do client-based work, that's been a rough go for some people. And it's been better than ever for others.

I actually wanted to ask you specifically about Brexit. I lived in the UK for about eight years, and ended up working in finance in the City in London. I've been watching this Brexit stuff with a mixture of horror and fascination for a few years now, and I wanted to ask you specifically about Scotland. I'm just asking - this is a total, like "your opinion" kind of question. But what's the mood like in Scotland - generally right now, with respect to things like independence, and stuff like that?

Because, I mean - we could go into all the complicated politics and history of it - but there is an independence movement in Scotland. Voters in Scotland - most of them didn't vote to leave the EU. And you've got Boris Johnson and his ilk running the show. And they're typically not - they're neither Scotland- nor Wales- nor Northern Ireland- nor Cornwall-friendly. And I know I added one there that's not a country. But what's the mood generally like in Scotland right now? Are people really worried? Are they thinking that like the doldrums might never end?

Lee: I think it's probably fair to say that there isn't a single mood across anywhere in the UK about Brexit. Across the UK, they voted to leave the EU - by, I think it was something like 52% to 48%. So it was close. And then Scotland decided to have a referendum to secede from the UK. It went pretty much the same numbers. 52% to remain in the UK, and 48% to leave. So, again - very close. It all depends on who you talk to, as to whether they think it's a good idea to leave the EU or not. And about whether it's a good idea for Scotland to leave the UK or not. It's very, very easy to get into heated arguments with people about it.

They're both very, very complicated issues, and they enrage passions. For me, as a business owner - what we've got to do, the way that I looked at it - is that I had a vote, a single vote. Just as everybody else did, as to what we should do. Whether we should leave the EU or not. And I cast my vote. I'm not going to say which way I went. It doesn't matter. But I cast my vote. But that's where my responsibility ends. I'm not responsible for the UK leaving, deciding to leave or not. That's the entirety of the population.

And as a business owner, I just have to say, "Whatever will be will be." My job is to look at the situation and manage it as best I can. So whether we stay in the EU, I've got to deal with that. If we leave the EU, I have to deal with that too. Whatever the situation happens, I have to find a way to deal with it. There are advantages and disadvantages to whichever path you go down. And it's my job to navigate those choppy waters. It's the same for every business owner. You've just got to look for the opportunities, and see if there are ways of making gains throughout it. And so that's what we've done.

In fairness, we've had a very, very difficult time. Not because of Brexit, but because of the uncertainty over whether Brexit was going to happen or not. And if voices continue about Scottish independence, then maybe that will have issues for us as well. But at the moment, it's not an issue right now. The Scottish National Party, which is in the majority here in Scotland - they're the ones that are driving it forward. They're the ones that are wanting to secede from the UK. And at the moment, there is not a vote on the cards. That may change in the next year or two. Who knows? And we just have to see what the situation is from day to day, and try and make the best of whatever situation we find ourselves in.

Len: Thank you very much for that very pragmatic answer. One thing, the only - I mean, not living there or being from there, the one thing I guess I could maybe add is that - from my perspective - is that the uncertainty around Brexit has affected business owners all around the world. Anybody who does business from or into the UK, including online commerce - has been disrupted by the uncertainty and by the handling of the transition from one state of affairs to another. And we're not on the other side and to a new, permanent state of affairs yet. There's still so many things regarding trade and commerce to be worked out. And, yeah - it's something that - everybody's had to be as realistic as they can be about it, at the same time as we've all got these passions roiling underneath.

Lee: Sure.

Len: I mean I can say, "If I'd been there, I definitely would've -" And I've voted in UK elections. I definitely would've voted against it.

Okay, well - moving onto the next part of the interview. I wanted to ask you about your books. You've published many on Leanpub, and as you said, they're available elsewhere as well.

One of my favorite ones is the one about How to Lie with Numbers, Stats & Graphs. I was wondering if you could talk a little bit about the motivation behind that book? Because we're all surrounded by data and statistics and numbers, and we're presented every day - you don't have to read two headlines before you've found one with some kind of statistic in it. I was wondering if you could talk a little bit about what your motivation was for writing that book, and who it's for?

Lee: Sure. I'd written quite a few statistics books before I wrote that one. They were quite technical books, for beginners, so they're not too difficult. But nevertheless, they were quite technical books. And I decided just to have a little flight of fancy, and to write about something that was much less technical, and a lot more entertaining. So I decided to go into all the different ways in which you can use numbers and statistics to deceive people. I didn't write the book to teach people how to deceive. I wrote the book to show people how they themselves are being deceived.

Of course, if you already know how to lie with data - as a lay person, you need to be able to arm yourself against that. So you need to know the tricks of the trade that's used by the politicians, by the pharmaceutical companies. By the marketers. They've all got these various tricks that they use to deceive.

And sometimes it's not just lies. Sometimes it's lies, and sometimes it's - they use absolutely correct numbers, but they use them in a slightly different way to what they really should, and they lead you up the garden path. They lead you to think things that they want you to think, rather than the things that you should think.

I decided to write these books. This was two books that I put together into one. One was Truth, Lies & Statistics, and the other was Graphs Don't Lie. I put them together as a little box set. One of them is about how to lie with numbers, and the other is about how to lie with graphs. I decided that I was going to write these in as humorous a way as I possibly could, and keep them really light and entertaining - rather than deep in statistics.

I talk a lot about things like averages - how to lie with averages, or how to lie with pie charts, and things like that. So not very technical at all, but very entertaining. I've had a lot of feedback from people saying they absolutely love those books. And I feel good about that. It gives me encouragement to write more.

Len: And can you give us an example of how to lie with pie charts, in particular?

Lee: Lying with pie charts, yeah. There was one example that I gave in the book. And this was from a respected UK publication.

They'd put a pie chart together. And when you look at the sections, you've got different sizes of different chunks. A larger chunk had got a smaller percentage than a smaller chunk, which made no sense at all. And there was something that was on the graph, that actually had a percentage of zero. But how can you have a section on the pie chart with a zero? And it was - you just look at it and think, "Oh my God, where do I start with this?" So it was a bit of a crazy situation. Pie charts are one of the very best ways in which people deceive, to use graphs to lie to us, and tell us that it's one way - when it's actually not, it's another.

Len: And lying with numbers in particular - you're really clear about, or you have some really good examples about, lying with averages and things like that. And I find - this was one of those things where it's like really easy to know what's wrong, but somehow, our intuitions are easily manipulated with things like this. I think you have one example of like, if there's some street ending in a cliff on the sea and you tell people, "The average person living on this street is a millionaire." But that can actually be incredibly deceiving, because now you think everyone's a millionaire.

Lee: Yeah. You've got all the houses along the street that are all very normal, with normal price tags. And then, yeah - the millionaire Ozzy Van Hendrix and then -

Len: That's the name.

Lee: Yeah, I love that name. And that house price puts the average house price of the entire street up. But it only puts the mean value up. It doesn't put the median value up. So if you want to, you can publish the mean value - if you want people to think your house prices are higher in this region. Or you can publish the median value, if you want them to think that it's lower.

And even the mode, the mode itself - the most frequently occurring number. If you've got a - within the street, if you're talking about the income of the people who live in these houses as a demonstration of average income on the street - and you've got several people who are out of work with no income, then that's the mode. The most frequently occurring number. And that's one measure of average. So the average wage in the street is zero. And if that's the number that you want to put out there, it's absolutely correct. Because the numbers tell you it's correct. But is that the real story of the data, or are you trying to misrepresent the data? That's the important thing. That's the whole crux of the book.
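To make the street example concrete, here is a minimal Python sketch. The house prices and incomes are made-up numbers chosen purely for illustration, not figures from the book, and the variable names are hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical house prices: nine ordinary houses and one mansion
# at the end of the street (illustrative values only).
house_prices = [200_000] * 9 + [10_000_000]

mean_price = mean(house_prices)      # dragged upward by the single mansion
median_price = median(house_prices)  # the middle value, unmoved by the outlier

# Hypothetical annual incomes: several residents are out of work.
incomes = [0, 0, 0, 25_000, 30_000, 32_000, 40_000, 5_000_000]
mode_income = mode(incomes)  # the most frequently occurring value

print("Mean house price:  ", mean_price)
print("Median house price:", median_price)
print("Modal income:      ", mode_income)
```

All three results are legitimately called an "average" of the same street, which is why it matters which one a publisher chooses to quote.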

Len: Yeah, and it's really good. I recommend it to anyone who's interested. And, again - it's very entertaining in learning a little bit more about the ways that you can be manipulated, and even the ways that people who put together data can sort of lead themselves down the wrong path. So, they're not necessarily in a sense lying, but they are misrepresenting things.

And the thing I find, personally - being outside the industry, but I find so fascinating, is that the presentation of numbers and charts themselves can be a lie. No matter what they're saying, or no matter what their background is.

For example - a typical thing that you might see in the book publishing industry, is some consultancy puts out a presentation that says, "The book publishing industry is going to grow by a CAGR of 3.25% over the next five years." And you're like, "Oh, .25 - to two decimal places? Well, they must really know what's going to happen in the future, right?" They could show you the spreadsheet that the intern put together that came up with those numbers. But it's the presentation of numbers themselves that make - people just assume, "They must've measured something measurable. They must've quantified something quantifiable."

And that's something that I find really fascinating. You see it in psychology where like, "Teenagers these days are, on average, 80% less empathetic than they were 20 years ago." And it's like, with all those numbers and terms like "average," and then they say "research," and "study," you assume there must be something real behind it, but often there isn't.

Lee: Yeah, absolutely. Just this week - we're home schooling here, so my eleven year old daughter - she was having to calculate some averages. So she'd got some numbers, and she worked out an average. And she worked it out, and it was to - she'd written it down to, I think - something like about eight decimal places. And these were all integers. So she got eight or nine decimal places. And I said, "Well, you've got the right answer, but -" And I, so I explained to her about how accurately you can put things.

The numbers that she was talking about, these were - I think they were heights of people. Heights in centimeters, or something like that - or in millimeters. And I said, "Well, but do you have a ruler to be able to measure to eight decimal places?" "Uh, no." I said, "Okay, well it can't be accurate to eight decimal places." "Well what should I do then?" "Well, fewer - fewer decimal places. Got to go fewer."

So I basically had to teach her that whatever the data you've got - if it's measured to so many decimal places, if you do an average of them - your average can't be more accurate than the same number of decimal places. So if it's measured to two decimal places, you can't be more accurate than two decimal places. It doesn't matter that the number's coming out of your calculator, giving it twelve decimal places. You can't go that accurate. You just can't do it.

But it gives people reading the results a false sense of security in the numbers. Because they think - because you've measured it - you've got a number to so many decimal places, "This must be really accurate." Well, no it's not. Not at all.
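The rule of thumb Lee describes can be sketched in a few lines of Python. The heights below are invented for illustration, and `measurement_decimals` is a hypothetical name for the number of decimal places the original measurements were recorded to.

```python
from statistics import mean

# Hypothetical heights, measured to the nearest whole centimeter.
heights_cm = [142, 150, 147, 139, 151, 145, 149]

# What the calculator shows: a long string of decimal places.
raw_average = mean(heights_cm)

# The measurements have zero decimal places, so report the average
# to no more than that, following the rule of thumb described above.
measurement_decimals = 0
reported_average = round(raw_average, measurement_decimals)

print("Calculator output:", raw_average)
print("Reported average: ", reported_average)
```

The extra digits in the raw result come from the division, not from the ruler, so dropping them loses no real information.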

Len: One thing I found, in a past life in presenting financial information to people, is that - and that's often based on probabilities, and upside cases and downside cases, and things like that - is that, there is a certain constituency, that no matter how much you qualify the numbers and the charts that you're presenting, saying, "This is based on a projection. Here are the assumptions. They're all made up. We're assuming inflation of this over the course of the next ten years and we're assuming this doesn't happen and this does" - no matter how much you qualify it, some proportion of your audience is just going to believe that you've made a prediction, or you've just told them the state of affairs.

Have you found that with clients sometimes? That no matter how much you tell them, like, "This data that I'm presenting you is based on -" like, "I've done a lot of" - the technical term is, "cleaning." You used the term at the beginning of the interview, where you talked about telling a story and things like that.

But people often - I find it's very difficult with some people - to get it across that, "This is an artificial thing that I'm presenting you with, not a natural object that I found."

Lee: Yeah. Human beings can't help but be human beings. We all come into every conversation that we have with our preconceived ideas. And it can be quite difficult to get past those preconceived ideas.

One of the things that we've had very, very big difficulties with as a company, is that - we create these statistical applications, and a lot of the time we do it using artificial intelligence means. It can be very, very useful to do that.

And as soon as you mention artificial intelligence, everybody's eyes light up. They think it's some kind of magic bullet. It's going to fix all your problems. It's going to be, "Wow, fantastic, wonderful."

It's not. It's just a tool like any other. And you've got to be able to use, to know how to use that tool.

It also confuses the hell out of them when I tell them how much it's going to cost, too. Because they think you can just grab something off the shelf, and "Oh, it's done. It's five minutes, it's done. It's plug and play. You just throw your data in, and out comes a prediction at the other end. It's brilliant, it's great." Yeah, it doesn't work like that.

We've all seen the TV programs, your CSI and all these things. And anything to do with artificial intelligence, and they're just - somebody's sitting at a keyboard. They tap away "tickatickaticka (keyboard noise)" for five seconds, and, "Oh, the answer is this." Are you kidding me? That would've been eight years' work for somebody in the real world. And so people have got the wrong idea about so many things in society. Artificial intelligence is one of them.

But there are many different things in which people will come into a conversation, and they think that it's going to go a certain way - and it doesn't, it goes in a completely different direction. Sometimes it can be quite challenging for them to change their mindset. And it could be challenging for you to try to get them to change their mindset. They need to change their mindset, because maybe they've got the wrong idea about things.

Len: That's really interesting. It reminds me, of - I think this is probably like analogous, is something called, the CSI effect. Have you ever heard of that?

Lee: Well CSI, that's my company - Chi-Squared Innovations.

Len: Oh that's true. No, the show, Crime Scene Investigations. And apparently it had an effect on juries. Like, a real-world effect on juries, where they're like, "Can't you just put it into -?" I forget the name. There's some system. Like, "Can't you just put their DNA into the system and like -?"

Lee: Covis or something like that.

Len: Covis. [It's probably CODIS - eds.] I was going to say, yeah something like that, "Can't you just put their DNA into Covis and just like have a montage and play some rock music or something? And then like we're going to know the exact trajectory of the bullet through the five windows." And -

Lee: In seconds.

Len: In seconds, yeah. And looking great while you're doing it. And, yeah - no, I hadn't thought of that before. But I guess, I can imagine that the audience for certain kinds of data analysis - when they're told things like - when they're told there's artificial intelligence involved in data science and things like that, that they might have all kinds of expectations about what's happening behind the scenes, that don't necessarily match the kind of messy reality.

Lee: Yeah, it's right. Messy is the word. Because when we're talking about data, it's hugely messy. And the biggest elephant in the room is dirty data. It's a huge, huge problem.

Of course, if you're going to analyze any kind of data - whether you're analyzing it with statistics or with artificial intelligence - or anything, if you put dirty data into your system, your system's going to crash. It's just not going to handle it. You've got to have perfectly clean data. And anybody that's done any data analysis or statistics knows this. And it is a huge, huge problem.

There was somebody I was talking to, a few years ago now. He was saying that their institute, a large company - I'm not going to mention the name, it's a pharmaceutical company - and they had to submit their data set to a national database for use by the public. This was mandatory. But before they could do it, they had problems with the data. It wasn't perfectly clean. It needed to be cleaned. So they went to a company to have it cleaned before uploading it to this national database. And they got their quotation. 72 million dollars to clean this database. Oh my God. 72 million dollars. So data cleaning's a huge, huge problem.

Len: Yeah. And I was actually just - that gives us an opportunity to talk a little bit about what is dirty data, what's an example of dirty data? And how would you go about cleaning it?

Lee: Yeah. Dirty data. If you were to take a single column in a data set, and in this column you've got - you can have two possible entries. The word "positive," or the word "negative." Whatever they mean. Doesn't matter. How many different ways are there of spelling the word, "positive?" You can have it spelt all lower case, all upper case. You can have an upper case "P," and all the rest of it lower case. You can mix the case all the way through. And if you take all this data and you put it into a statistics program, it will see all these different ways of writing the word as different data. So it sees them as different things.

So where you've got two possible entries, you've suddenly got lots of possible entries. You've got to clean all those up. And these are all just legitimate ways of writing down the word, "positive." What about if you misspell it - if you leave the "e" off the end, or you add a full stop? How many different ways are there of misspelling the word "positive?"

Actually the answer is, there are an infinite number of ways of misspelling the word "positive." And you cannot possibly account for every way of spelling it or misspelling it. And you can shorten it to "POS," and you could just put the sign, a plus sign in there instead - instead of the word, "positive."
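
A small Python sketch shows how quickly two intended categories multiply. The entries in `raw_column` and the `normalize` helper are hypothetical, chosen only to mirror the case and punctuation variants Lee lists. This simple clean-up does not handle misspellings or abbreviations such as "POS" or "+".

```python
from collections import Counter

# Hypothetical raw entries for a column that should hold only two values.
raw_column = [
    "positive", "Positive", "POSITIVE", "PoSiTiVe",
    "positive.", " positive", "negative", "Negative",
]


def normalize(entry: str) -> str:
    """Trim surrounding whitespace, drop trailing full stops, lower-case."""
    return entry.strip().rstrip(".").lower()


# A statistics program treats every distinct string as its own category.
raw_counts = Counter(raw_column)
clean_counts = Counter(normalize(entry) for entry in raw_column)

print(len(raw_counts))     # 8 distinct "categories" before cleaning
print(dict(clean_counts))  # {'positive': 6, 'negative': 2}
```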

This is a great example of dirty data, and it's absolutely horrible. Somebody like me has got to go in and clean all of that up. And the best way of doing it is actually to use various artificial intelligence means to be able to do it. There are lots of different ways in which you can do it, because each different type of problem has got a different type of solution.

One possible way is using something called "fuzzy matching." So if it sees the word "positive," but you've missed a letter out - it can look at it and say, "Ah, this has got a really strong similarity to the word that we're looking for. So we know that it belongs to the word 'positive,' and not the word 'negative.'" So it can force it to go one way or the other, to belong to the correct category.
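
As a rough illustration of the idea, here is a minimal fuzzy-matching sketch using `difflib` from the Python standard library. It is not the method Lee's company uses, which the interview does not describe. The `fuzzy_category` helper, the category list, and the 0.8 similarity cutoff are all assumptions made for the example.

```python
from difflib import get_close_matches

# The two legitimate categories from Lee's example.
CATEGORIES = ["positive", "negative"]


def fuzzy_category(entry, cutoff=0.8):
    """Return the closest known category, or None if nothing is similar enough.

    The cutoff of 0.8 is an arbitrary choice for this sketch.
    """
    cleaned = entry.strip().lower()
    matches = get_close_matches(cleaned, CATEGORIES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(fuzzy_category("positiv"))  # positive
print(fuzzy_category("Postive"))  # positive
print(fuzzy_category("negatve"))  # negative
print(fuzzy_category("+"))        # None
```

Entries that come back as `None`, such as a bare plus sign, are not similar enough to either word and would still need a separate rule or a manual check.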

I'm not going to go any further into the different ways of doing it - but there are many, many different ways. And artificial intelligence is actually the way to clean up all this dirty data.

Len: I'm really curious, just about the way the terms are used, like "clean" and "dirty." So would, for example - a survey that has selection bias in it. I think a famous one is with political polling, where people call on landlines only. And so - not only does someone have to answer the phone and then have a landline and answer the phone, and then go, "Yes, I'd love to talk to this stranger for ten minutes about my political views." And then you report the result saying 10% of people support this or something like that. Would that be -? If a survey like that were part of a bigger data set of results, would that be called "dirty data?" Would you just not apply that term to that situation?

Lee: No. You wouldn't apply that term. Dirty data is data that is incorrect in some way. So, as I said - the word, "positive," if you misspell it, that's dirty data. You have to clean that data up.

What you gave to me there was an example of a really poorly-managed survey, where somebody has thought about a survey, but they've not thought about it well enough. They've not planned it correctly. And so, all the data that they've got is perfectly legitimate data. It is clean data. It is not dirty data. But, when you analyze it, you're going to get the wrong answer. Because they didn't plan their study correctly in the first place.

Len: Depending on what your motivation is, right?

Lee: Very much, yeah. This is right into the center of my Truth, Lies and Statistics book.

Len: Yeah, it's really clear about that. So for example, that someone can honestly be reporting - and accurately reporting the results of something they did. But if they've designed it in such a way that - you want to present the general population as having the views of only people with landlines who like to talk on the phone to strangers - because you have a suspicion that people like that might have certain, be more likely to respond the way you want them to - then you can bake the result into the design of the study in the first place, and still be totally honest, when you say, "I'm accurately reporting the results of my survey."

Lee: Absolutely. I think I gave an example in the book of - it's really, really easy to construct a survey which is absolute rubbish. And all you do is - you send a survey out to somebody, and the survey asked them one question, "Do you like filling in surveys?" And all those people that like filling in surveys will send it back to you. And all those people that don't like filling in surveys will file it straight in the bin. And so, what you get is 100% of people responding to you saying, "Yes, I love surveys." "100%, our survey said 100% of people love surveys." Yeah, but it means nothing, because it was poorly planned.
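
Lee's one-question survey is easy to simulate. In this sketch the population size and the true share of people who like surveys (`TRUE_SHARE`) are invented numbers. The only rule taken from the interview is that people who like surveys return the form and everyone else throws it away.

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

TRUE_SHARE = 0.3  # assumption: 30% of the population actually like surveys

# True means "likes filling in surveys".
population = [random.random() < TRUE_SHARE for _ in range(10_000)]

# Only the people who like surveys send the form back.
responses = [likes for likes in population if likes]

true_rate = sum(population) / len(population)
survey_rate = sum(responses) / len(responses)

print(f"True share who like surveys: {true_rate:.0%}")
print(f"Share reported by the survey: {survey_rate:.0%}")  # 100%
```

Every returned form is an honest answer, yet the survey reports 100% whatever value `TRUE_SHARE` takes, because the design decides who responds.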

Len: That actually leads me to ask, I'm going to ask - sometimes I ask kind of selfish questions, because - just a sort of particular hobbyhorse of mine. But one of mine is - for example, with respect to things like surveys - I guess it's a known thing that if you ask questions in a different order, you'll get different responses to the very same questions sometimes.

But of course, no one can - there's no systematic way of explaining, of predicting or explaining - predicting what the results are going to be if you shift the order around or explaining why. But often I've found - and this is, again - just as a lay person who sort of reads the news articles with statistics in it and stuff like that - this mysterious magical phrase, "corrected for," that often appears. "We corrected for age bias," or something like that. And it's like -

Lee: Yeah.

Len: That's always struck me as complete bullshit. What do you mean you corrected for unknowable things? What am I missing?

Lee: Well, it all depends on who it is that's done the analysis on that one. It might've been complete bullshit, and it might not have been. Because we have something in statistics called confounding. And you can correct for various confounding things using various statistical techniques. So when you say, "Yes, we corrected for age bias," they may well have done that. There is a very strong statistical way of being able to do this. So it could be legitimate. I think I'm not going to go any further on this, because it's going to start to get technical in statistical things. And I don't think everybody listening is going to be a statistician.
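
Lee does not say which technique he has in mind, so the following is only one simple possibility: adjusting for age by stratification (sometimes called direct standardization). The towns, age bands, and counts are invented for the sketch. Support for a policy is compared within each age band and then averaged using a shared age mix.

```python
# (town, age_band) -> (respondents, supporters); all numbers are hypothetical.
counts = {
    ("A", "young"): (80, 40),
    ("A", "old"): (20, 4),
    ("B", "young"): (20, 10),
    ("B", "old"): (80, 16),
}
TOWNS = ("A", "B")
AGE_BANDS = ("young", "old")


def crude_rate(town):
    """Support rate ignoring age: total supporters / total respondents."""
    respondents = sum(counts[(town, band)][0] for band in AGE_BANDS)
    supporters = sum(counts[(town, band)][1] for band in AGE_BANDS)
    return supporters / respondents


def age_adjusted_rate(town):
    """Support rate if the town had the age mix of both towns combined."""
    total = sum(n for n, _ in counts.values())
    rate = 0.0
    for band in AGE_BANDS:
        band_total = sum(counts[(t, band)][0] for t in TOWNS)
        respondents, supporters = counts[(town, band)]
        rate += (band_total / total) * (supporters / respondents)
    return rate


for town in TOWNS:
    print(town, round(crude_rate(town), 2), round(age_adjusted_rate(town), 2))
# A 0.44 0.35
# B 0.26 0.35
```

In this toy data the towns differ by 18 points on the crude figures, but within each age band the support rates are identical, so the gap disappears once both towns are given the same age mix. Age is the confounder here.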

Len: No, that's okay, thank you for that. I mean, I put my question so naively that it really didn't give you an opportunity to either answer very specifically or generally. So, my apologies for that. But it is something that I've just always kind of laughed at myself about. And I know there's something underneath it that I completely don't understand.

But speaking of getting into statistics, just before we move onto the last part of the interview - so your latest book is actually a collection of books, called Getting Started With Statistics.

And I was just wondering if you could talk a little bit about the bite-sized guides that are currently in this collection? One's Data Collection, the other one's Data Types. The third one is Hypothesis Testing, and the last one is Bayes' Theorem and Bayesian Statistics.

Lee: Sure. There are a lot of statistics books out there, and most of them are written by statisticians for statisticians. And frankly - they're really, really difficult to penetrate. A lot of statistical books - despite me being in statistics for 20 years, I can't understand them. So there are very few books written for beginners. People just getting started. People that have, they've collected some data, they might be doing some kind of research project.

They've collected their data, and they've got the data - and now they've got to analyze it. And they've got no idea where to start. Where do they get started? Well, they need some help. And we've all been there, where we needed some help in a new subject. Something that we've just got to get done. And so I write my books for the beginner, for those people that really are just getting started.

And that's what this series of books is all about. I basically looked around our website at articles that I've previously written, and said, "Can I do more with these? Can I try to reach a new audience and a bigger audience and try to get this information out there? To get people inspired and get them started, and to get them to realize that - actually statistics is not as hard as what they think it is?"

Statistics, actually - most of statistics is not difficult. It's actually quite easy. When you read academic papers and you read statistics books, you think it's really hard. But most of it isn't.

And if you can learn to do the simple things well in statistics, you'll have probably 80% of all of statistics covered - and you'll be a good statistician. Not the best, because the best statisticians are those that have got a PhD in statistics. That mantle is reserved for those guys. That's fine.

But your job is not to be the best statistician. It's just to make good decisions with data. And if you do simple things and you do them well, you will do a really good job with your data.

That's what I'm trying to get across with people. And so that's what this particular series of books was about. It's about trying to pull together things that I've written before that can inspire people to learn those easier things. The things that are not so difficult.

Len: Having looked at them, I think you succeeded quite well at that. They're really readable. They're great for beginners, and a really good introduction to getting into statistics, if it's something you're curious about and as ignorant about as I personally am.

Just moving onto the last part of the interview, where we talk a little bit about your experience writing and publishing.

We were talking a little bit before we started recording the interview. You've been on Leanpub for quite a few years now, and I was wondering if you could talk a little bit about why you chose us as one of the platforms for delivering your books?

Lee: Ah, that's taking me back a long time. Because my books have been on Leanpub almost as long as you've been at Leanpub.

At the beginning, I looked around. I've got no idea where I could publish my books. I think I - I seem to think I started out at Smashwords, which is a distributor. So I published books on Smashwords, and then it would go out to lots of different places, Barnes & Noble and Kobo and lots of different distributors. I just looked around.

And I was seeing a lot of books from Leanpub. I was particularly interested at that time in trying to promote free books for our readers on our blog, because I want to get them to learn more about statistics and data science. And "free" is my favorite price, it's a wonderful price. I'm sure it's everybody else's favorite price too. So I wanted to point people to free books. And Leanpub was a great source of free books when I first went on there.

So I thought, "Well, okay - if other people are producing books at Leanpub, why can't I produce books? Whether for free or paid, it doesn't matter. The process is the same. The only difference is whatever price you decide to put on it." And so I just decided I'd put them on Leanpub as well. And it's gone really well. We didn't sell many books at Smashwords through those channels.

But Leanpub has been really, really good. And I think it's because - I know there's lots of different types of books at Leanpub, but it seems to be a real big hub for technical books. Programming and data science are really big outlets there. So it's a bit of a home for the tech community, rather than - such as a distributor, which takes books on romance or pirates or whatever.

Len: Thank you very much for sharing that. The ability to have a free minimum price for books is actually something that's been very attractive - particularly people doing exactly the kind of thing you're describing you were trying to do.

So for example, I mean - but from another perspective - people who might have a very popular MOOC or Massively Open Online Course, who also want to provide some learning materials along with it, wouldn't mind maybe getting some revenue from that. But that's not the primary purpose.

And so using any platform that has variable pricing like we do, is really attractive, right? Because then - if you've got someone - it doesn't matter where they are or what their financial situation is, or what their relative currency value is - they can get the learning material, they can get on their way in their career or in their class or whatever it is. And so having a free minimum price is actually really attractive, particularly to people in - who are trying to teach, basically.

I leave these kinds of questions for the very end of the interview, where we go into the weeds. So, you use our Bring Your Own Book workflow on Leanpub. What that means is that you can publish your book, and you can do coupons and pricing and sales and stuff like that on the Leanpub store, but you actually write your books in your own book production workflow - and then you upload them to Leanpub. I was wondering if you could talk a little bit about what tools you use to produce your MOBI and EPUB files?

Lee: Sure. This process has changed over the years. I've been through lots of different things.

What I struggled with at first, was that - if you went to a distributor, you had to put your book there in a certain format. And if you went into Leanpub, it had to be in a certain format. And if you put it in Amazon, it has to be in a certain format. And sometimes these formats don't go well together. So it could be quite difficult to get all these different formats.

So, it's taken me a long time to figure out how to do the whole process, and come out at the other end with all of the formats of the book that I need for all of the different places that we're going to put it - for no extra work. And actually, there's an alternative distributor to Smashwords, and its name is Draft2Digital - I'd forgotten it for a second.

But Draft2Digital. So what I do is - I write my book in Word. Everybody knows how to write in Word, we all do it. And then I take the book and I upload it to Draft2Digital, and it produces a MOBI format and an EPUB format, and even if you want it as a PDF format, albeit that that is only black and white - and once you've got those formats, you download them and you've got them. And then you can use them in, wherever. So I can use those - the MOBI and the EPUB, and I upload them directly into Leanpub and into Amazon and into other places that we put them - with no issue whatsoever.

I go through Draft2Digital anyway, because they're a distributor. So I go through that process. And then at the end, I get the downloads of the book that's been converted into the right formats that I need, and I immediately take them to the other outlets and put them there.

The process is really, really easy now. It used to be horrible before. I used to have to write it in Calibre, and having to mess about with style sheets and HTML and CSS and trying to get it right. And, oh - I could spend days just in formatting. That's after I've already written the book. And then I've got to format the whole thing so that it works in the right format. I don't have that anymore; it's a very, very simple process for me.

Len: Thanks very much for that, it's really interesting. So, what Lee's talking about, for anyone listening who hasn't gone through it, is that often - any service that is going to convert a Microsoft Word document or a Calibre document or anything into a MOBI or an EPUB file, will give you a guide, right? So you have to format things in accordance with this guide, and those have definitely improved over the years.

But the formatting that you're doing isn't what you would typically think of as formatting. Like, formatting it, so how's it going to look on the page? You're formatting it to make sure that it corresponds to the guidelines that you've been provided with, so that you get the right output at the end. And those services have really, really gotten a lot better over the years, which is really good for self-published authors. Particularly those who like to work in Word.

Lee: They needed to, because the Smashwords guide to producing your book is - I think is about 60 pages long. It's huge. And if your book doesn't fit to the guidelines, it will get thrown back to you and you've got to go and fix it. So it can be a bit of a difficult process. It's been so much easier since I've switched to Draft2Digital. Sorry Smashwords, it was great while it lasted - but Draft2Digital is my outlet of choice now.

Len: Oh yeah, everybody has their own preferences. We've actually had Mark Coker - the founder of Smashwords - on the podcast before, talking about it. So we're friends of Smashwords and, like - people like some things and they don't like others. The one thing I'll just say from our perspective, that we don't exactly find frustrating, but we like to say to people: "If you just learn Markdown in five minutes, and then use Leanpub, you won't have to learn all those style guide things."

But a lot of people are like, "What are you talking about? I know how to use what I already know how to use. I don't want to have to learn anything new."

So yeah, there's some people who are like, "Oh thank God I don't have to do everything the way I've always been doing it." And other people are like, "The last thing I want to do is learn something new. I just want to write, because I'm a writer and I want to reach people with my writing." So everybody has their own way in and their own way out, too.

The last question I always like to ask a guest, if they're an author who's publishing on Leanpub, is - if there was one magical feature we could build for you, or if there was one thing you hate about Leanpub that we could fix for you - is there anything you can think of that you would ask us to do?

Lee: I think, to be flippant - I would say, if you were to give us 150% of the revenues. That would be nice. To be honest, I'm really excited about a new development at Leanpub. I'm not quite sure how new it is. But I've recently discovered that you, on your platform you now have [courses](http://help.leanpub.com/en/articles/4828488-getting-started-creating-a-leanpub-course-in-the-in-browser-text-editor-writing-mode). I've just been looking into it. I've been doing courses on our platform for quite some time now, and I'm going to put some courses onto Leanpub.

I recently produced some courses for a producer, and I have - although I had a deal with them, the deal was that I still retained the rights, and I could still do whatever I want with the course. I'm going to be putting those on Leanpub very soon. So, this will be my first foray into courses at Leanpub. So, you've already done what I wanted to do. Your platform does courses.

Len: Oh, well thank you very much for that. Yeah, you can use Leanpub to make ebooks and publish them, and you can also use it to make courses. It's still a relatively underused part of Leanpub, but part of our hope is that - I mean, partly because so many Leanpub books are intended to help people learn things, that actually, if you've got a book and then you create a course to accompany it, then the person who's read the book can then get a kind of like social proof - by getting a certificate, by completing the course.

Because right now, you've read this whole book. You've learned everything in it. But you can't put a badge on your website or your CV or anything like that, saying, "And I took the course and I got an A," or something like that.

So that's part of the idea, there - is that hopefully the courses will complement existing books. But of course, a lot of the courses just exist on their own as independent things as well.

Well, Lee - thank you very much for taking the time out of an evening to talk to us, and thank you very much for using Leanpub as one of the platforms for your books.

Lee: Well, thank you very much for talking to me. I've had an absolute blast. It's been great to be with you.

Len: And as always, thanks to you for listening to this episode of the Frontmatter podcast. If you like what you heard, please rate and review it wherever you found it, and if you'd like to be a Leanpub author, please visit our website at leanpub.com.