An interview with Rafael Irizarry
  • April 24th, 2019

Rafael Irizarry, Author of Introduction to Data Science: Data Analysis and Prediction Algorithms with R

In this Episode

Rafael Irizarry is the author of the Leanpub book Introduction to Data Science: Data Analysis and Prediction Algorithms with R. In this interview, Leanpub co-founder Len Epp talks with Rafael about his background and how he got into statistics, the fascinating story of his participation in preparing a Harvard study on establishing a mortality rate for Puerto Rico following Hurricane Maria that went viral, the current debate about statistical significance in the scientific community, how statistics are changing sports, his book, and at the end, they talk a little bit about his experience as a self-published author.

This interview was recorded on March 27, 2019.

The full audio for the interview is here. You can subscribe to the Frontmatter podcast in iTunes or add the podcast URL directly.

This interview has been edited for conciseness and clarity.

Transcript

Introduction to Data Science: Data Analysis and Prediction Algorithms with R by Rafael Irizarry

Len: Hi, I'm Len Epp from Leanpub, and in this episode of the Frontmatter podcast, I'll be interviewing Rafael Irizarry.

Rafael is Professor of Applied Statistics at Harvard and Professor and Chair of the Department of Data Science at the Dana-Farber Cancer Institute. His research is wide-ranging, but is focused on problems in genomics and computational biology.

Throughout his career, Professor Irizarry has won many awards for his research, including, in 2009, the Committee of Presidents of Statistical Societies President's Award. The same year, he was named a Fellow of the American Statistical Association, and more recently, in 2017, he won the Benjamin Franklin Award from bioinformatics.org, for his open access work in the life sciences.

In addition to all his research, Rafael has also created several very popular courses in the burgeoning field of data science, that you can find on edX. And along with his colleagues from Johns Hopkins - Jeff Leek and Roger Peng - he blogs at simplystatistics.org, a site I highly recommend for anyone interested in data science.

You can also follow him on Twitter @rafalab.

Rafael is the author of two books on Leanpub, Data Analysis for the Life Sciences, which he co-authored with Professor Michael I. Love of the University of North Carolina at Chapel Hill, and most recently, Introduction to Data Science: Data Analysis and Prediction Algorithms with R.

In this interview, we're going to talk about Rafael's background and his research, what data science is, and how it is evolving, his involvement in the - I guess you could safely say - somewhat controversial assessment of mortality rates following Hurricane Maria in Puerto Rico, and, at the end, we'll talk about his experience writing and self-publishing his very popular ebooks.

So, thank you Rafael for being on the Frontmatter podcast.

Rafael: Thank you for having me.

Len: I always like to ask people, when I'm starting these interviews, for their origin story. So I was wondering if you could talk a little bit about where you grew up and how you got into statistics?

Rafael: I grew up in San Juan, Puerto Rico. I got into statistics, because I was good at math. I went to college looking for something good to do with math, and tried many different areas where math was used. But I never really found anything that I liked, that was clearly a career choice for me.

So I finished math, and during that time, I also spent some time in these summer camps for mathematics. In one of them, I took a probability class, which was my favorite class in the summer camp - and it became one of my favorite classes I've ever taken.

That got me thinking that maybe I should study probability and statistics. So I did that, I went to graduate school in Berkeley, at the UC Berkeley Stats Department, where I thought I was going to study probability, but as I learned more about what that was, I realized maybe it wasn't for me. Probability starts getting quite abstract and mathematical, as you go forward in your career.

But I did discover applied statistics while I was there, and that's what I'm doing now. I did a thesis study in musical sound signals with David Brillinger, who is an applied statistician who likes working on as many different things as possible. He's actually a student of Tukey, who also liked to do that. There's a famous quote from Tukey, I think it's from Tukey - saying that, "Statisticians get to play in everybody's backyard."

After that, I applied broadly for positions, and was lucky enough to get a position in the Department of Biostatistics at Johns Hopkins University, where they had no interest in musical sound signals. But they did have the insight that I would be able to apply those skills in other areas. That's quite common in statistics, where you can learn the methodological, mathematical parts and how to apply them in one specific area, but then apply them elsewhere.

The department was right, and I started working in other areas that involved what are called "time series," which is what musical sound signals are. I worked with things like brain signals, circadian patterns from mice, and measurements taken from foetuses, where they tried to measure health through their activity counts and heart rates.

These were all time series data, and it was a lot of fun working on all these applied problems. But then I eventually got involved in helping people do analysis of microarrays, which were new back in 1999. There was no one in the department who was an expert in that area, because it was a new area. So my chair, Scott Zeger, thought that it would be a good match - given that musical sound signals have a lot of numbers, 44,000 per second, which is relatively big for what we were doing back then. And the microarrays also had a lot of numbers, relative to what we had back then. So I started working in that area, and then it just took off.

The pressures in research academia, where you have to get NIH funding, make it so that we often tend to work in a specific area and stay in that area, because we can get funding for that. So I've been working in that area for 20 years now. It's fun, I like it. It's very challenging. There's a lot of interesting practical problems. And throughout the whole time, I've been analyzing data, applying statistics, coding, and doing all the things that kind of define data science today.

Len: Thank you very much for sharing that. You were reminding me, when I was researching your background for this interview, of a friend of mine, when I was studying in the UK, he was doing a doctorate in maths. And one day he's like, "Oh, I can't make it tonight, I've got to go to Glasgow to help hatch some penguins." It was a bit of a lesson for me in just sort of how wide-ranging the applications are in everyday life for mathematics and statistics, and what an adventure it can be.

And so, eventually you ended up doing a lot of work in genomics. I understand that's a broad discipline, and I was wondering if you could talk a little bit, specifically about some of the work that you've done?

Rafael: Sure. So, when I mentioned microarrays, I should've been more clear. For the more general public, microarrays is how I got into genomics. It's a measurement technology that can measure gene expression for thousands of genes at a time.

One of the things that many of us saw at the time was that the measurements coming out of these instruments were noisy, and could be of better quality. Part of the reason they were low quality was that the preprocessing - the data analysis part that goes from raw data to what investigators were given - could be improved. I worked on that aspect of it.

For a very specific technology, we came up with an algorithm that worked quite well, a statistical method that we disseminated through open source software. That became widely used, the software got downloaded a lot. It was a gratifying experience.

Then I started working on similar problems with other technologies. Microarrays were one, but this field moves fast, and one of the things that moves fast is the technology used to measure different molecular endpoints. I could list for you a long list of applications and different measurement technologies, but I won't do that now.

But for several of these, I have developed statistical methods and software to basically clean up the data and make it more usable for the end users - the investigators who use what comes out of these technologies.

Len: And was that initial work - I saw a talk you gave online about, forgive me if I get this wrong, CpG island shores?

Rafael: No, this was something else - an algorithm called RMA. That was the first one that we built, and one of the first algorithms that was part of the Bioconductor project.

The CpG island discovery - or the discovery related to CpG islands and CpG island shores - came a little bit after. That was one of the subsequent collaborations that I had with a biologist researcher at Hopkins, Andy Feinberg, who was very interested in understanding how DNA methylation changes across tissues, and between cancers and normals.

Back then it was very difficult to measure this, and the technologies were particularly noisy. But we had this very productive collaboration where his group and my group together developed new technologies, taking into account the statistical analysis we were going to do once we had the measurements. It was a very productive collaboration in that sense, that there was this feedback between the two groups - of us saying, "This is the best way to analyze it, if you can build it this way." And then they would build it that way, and there was a lot of back and forth.

That is an example, that you brought up, of how we improve the measurements that come out of these technologies. In this particular case, there was a discovery made, because we were the first to see this, because we had cleaned up the data enough to see it. So that was another very gratifying experience in applied statistics, where the statistical approach actually made it possible for the biologist to make the discovery.

Len: It's a really interesting story. I'll make sure to put a link to the video in the transcription.

Your team was under competitive constraints and budgetary constraints, and so operating within those constraints - you had a richer competitor, as I understand it, who -

Rafael: Now you're onto a second thing. After that first project, there was a follow-up where a technology change occurred, and now we could look at even broader sections of the genome.

There was a lot of competition, and this played an important part in making it possible for our smaller group to actually come out with the results at the same time as the others.

Len: Yes, it's really interesting the way your team brought in a computer scientist to help write the algorithm, so that you could actually confidently draw conclusions from less data than you otherwise could have.

Rafael: Actually, that part was the statistical part. We used statistical techniques to, as we call it, "borrow strength" across measurements, to improve the precision of the measurement. That lets you get a little bit more - well, sometimes a lot more - information with less starting material, or starting data. The computer scientist, Ben Langmead, was instrumental, because this is a very difficult challenge to implement computationally. If I had written the code, it would've taken six months instead of the few hours it eventually took. That is not something I'm an expert in. I'm not an expert in developing fast algorithms.

So a component of this collaboration was a computer scientist. This happens more and more in my area, where you need a collaborator, or you use tools developed by a computer scientist, to make what you want to do statistically even possible. Basically, you write the code using their algorithms - the clever ideas in their algorithms - to make it possible to achieve what you want to achieve.

People who work in statistics are going to be aware of this, that sometimes you have a mathematical solution that is not even that hard to describe or to write down, but to actually get it from the data - it could be a very difficult algorithmic challenge.

Len: I've got some questions I'd like to ask you in a little bit, about one of the emphases that you and your colleagues on the Simply Statistics blog have, on the importance of actually doing the work - getting actual experience dealing with data in real-world environments, and how this is actually really important. It's sort of like - I don't want to say just theory, as though to diminish the importance of theoretical understandings of things; but particularly in data science, really getting your hands on things seems to be crucial.

But before we move on to talk about that, and the way technology has actually changed, and driven what data science is today - I wanted to talk to you about something that I discovered about you, when I was researching for this interview.

One of the pleasures of this podcast is that I get to talk to authors from all over the world and ask them about things that they might know a little bit more personally, that the rest of us have only seen in the headlines.

And so in September 2017, there was a devastating hurricane named Maria in Puerto Rico, where you're from. The island is still suffering from the damage caused by the hurricane. And it was in the news just yesterday - because of comments the President of the United States made about how he thinks too much money has been allocated to help people on the island recover. I wanted to ask you about this, because I was very surprised to discover that the Harvard study that became well-known in the aftermath of the hurricane was partly worked on by you. And you recently gave a talk about it called Mortality in Puerto Rico after Hurricane Maria.

I wanted to ask you a couple of questions about this, because it gets to one of the beating hearts of why things like statistics are so important.

My first question is - can you give us a bit of a sense of how the hurricane affected life on the island for people?

Rafael: The power was out for months - I think, for most people, it was out two to three months. That was the main problem. The other things you associate with natural disasters weren't as big a factor as this, as there not being any power. That was, I think, what caused most of the problems that we were seeing. That's how it affected people; members of my family were without power for months. And a lot of people left the island, but some have come back; it's not completely back to what it was before, I don't think. There's also been an emigration problem that started before - but that's for another day.

Len: How did people cope with no power for months? I mean literally lighting fires at night? Candles?

Rafael: Candles, batteries, yeah. There are power generators. You go buy diesel and power them up. My parents had one that could power their fridge, and a couple of fans. That's how you coped. Fans are important, because there's mosquitoes, not just heat, so it's already hard to sleep without some kind of wind or AC. That's how the more fortunate people, that could afford generators, would cope. Some people had just candles, battery powered flashlights and were eating canned foods. That was tough.

Len: I can only imagine. So after the storm, some strikingly low - it's a crude term, but death count numbers came out in the media. And you became part of an effort to see if those numbers were correct.

I know it's a big story, but I was wondering if you could talk a little bit about how you got to the release of the first batch of information, when there were like various estimates that went up into - potentially into the thousands.

Rafael: What version do you want? How many minutes do you want this answer to be?

Len: It's a very interesting story and like I said, it gets to one of the hearts of why statistics are so important - can you do it in 10 minutes?

Rafael: 10 minutes, okay. I'll give you the 10 minute version. So, the report you were talking about was the low death count. The first one I saw was October 3rd. It was at a meeting between the President of the United States and the governor of Puerto Rico. They were claiming that the death count was 16. This is a week and some after the hurricane hit - there were very few people with any kind of sense of what was happening. The lack of power in a hospital, for example, and the effects it could have - those weren't really behind this number.

So immediately there was an effort from many people to try to figure out what it was. Because if there's something happening - a public health crisis occurring - you want to know about it, so you can take action. So the fact that this is what they thought was happening tells you that they weren't aware of whatever was happening - if in fact something was happening. There were several efforts that started at this moment.

We didn't know about them back then, but we found out later - the New York Times was on it. They were trying to get to the information somehow. Other news outlets too. There's a Puerto Rican group, the CPI (Centro de Periodismo Investigativo), that was also on it. And there was a group of researchers at Penn State who, I think, worked with, or had knowledge of how, the demographic registry in Puerto Rico worked, so they were able to get data and start to answer the question.

And then there was a group at Harvard that had heard from people on the ground - this wasn't me, this was Caroline Buckee, who was a main person behind it. She'd heard from a person she knew who did field work in Puerto Rico, who was telling her, "There's no way that's right. From what I've seen, I know it's not true." She got motivated to try to figure it out as well.

What she thought of doing was a survey, which is not what you would normally do in a place like Puerto Rico - which is part of the United States, where you have a demographic registry that keeps track of everybody who's died. You can figure out how many more people have died than you would expect, by looking at the observed minus the expected numbers. There's some statistics to be done there. It's not that complicated, but you need the data.

So at that moment, what she thought, and others thought, was - even if we can get those numbers, they might not be right, because, for whatever reasons, there are people who maybe go unnoticed, uncounted. So they decided to do this survey.

The idea is you send people to Puerto Rico, and you design the study - I helped them design the study. Eventually, she came looking for me, because she was looking for a statistician, and somebody she knew who knew me recommended she talk to me. We didn't know each other. We're both here at Harvard, but we didn't know each other. So we started collaborating and working together on this. I was providing advice on statistical design and analysis and things like that. And also, because I'm from Puerto Rico, I was providing some intel about who you can go talk to on the island to do different things. For example, who you contact to carry out the survey, which is not easy.

So that started around late October, October 23rd, I think was the day that we started to talk about this, and we started planning it.

At the same time, I started trying to get the data from the government - the daily counts. And that turned out to be complicated. I was not getting the data that I needed to answer this: the current counts for September and October, as well as data from previous years - so we could compare, and also form the expected count.

There were these researchers at Penn State that obtained the data from previous years - and the Secretary of Public Health in Puerto Rico made a statement of how many people had died in September and October. It wasn't that they got it from a database; he actually said it in an interview. And once you saw the numbers he gave, it was clear that there was an excess of about 500.

And he said something that was very worrisome, at some point. By the way, I have a Simply Statistics blog post that has this timeline written out - for those that want to go see it, because it's complicated. And right now, I might be going back and forth in time.

But he makes a statement saying, "The number of deaths isn't in excess, because in December of the previous year about as many people died." The big problem with that is that there's a seasonal effect to deaths.

It's not just in Puerto Rico, it's in many other places - because of viruses and flu, it happens mostly in the winter time. So you expect higher numbers in the winter. If you compare the September number that he had to the previous September numbers, the excess was about 500. And if you make a plot - now that we have the data, we can make a plot. And it's very, very striking - very easy to see. It's clear that there was something bad happening.
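[To make the "observed minus expected" calculation and the seasonal pattern concrete, here is a minimal R sketch using made-up monthly death counts - not the actual registry data - Eds.]

```r
# Minimal sketch of "observed minus expected" with simulated monthly counts.
# All numbers are made up purely for illustration.
library(dplyr)
library(ggplot2)

set.seed(1)
historical <- expand.grid(year = 2010:2016, month = 1:12) %>%
  mutate(count = round(2350 + 150 * cos(2 * pi * (month - 1) / 12) +
                         rnorm(n(), sd = 60)))

# Expected September count: the average of previous Septembers
expected_sep <- historical %>%
  filter(month == 9) %>%
  summarize(expected = mean(count)) %>%
  pull(expected)

observed_sep <- 2800                 # hypothetical post-hurricane September count
observed_sep - expected_sep          # the excess

# Plotting the monthly counts makes the seasonal pattern (and any excess) easy to see
ggplot(historical, aes(month, count, group = year)) +
  geom_line(alpha = 0.4) +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Deaths")
```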

Anyway, so then the New York Times actually gets their hands on the data from the registry. I remember I contacted them, asking them, "How did you get it, and can you share it?" And they told me that they had made 100 phone calls and emails to get this data. Which explains why I wasn't able to get it - because I didn't have the time to spend doing that. And I didn't think of trying to do that. But they showed that there was an excess of about 1,000 - that was their count. So yeah - so we're going from the 16 that the government was claiming, to the thousands.

So at this point, we're starting to wonder if we need to continue the survey. The two things that happened that made us continue were - well, there was one more group that also published a number around 1,000. Now you had three independent groups stating this. So at this point, you think, "Well, that's it, the government should see this and be convinced that there's something going on." But they don't. They insist on the lower counts, and then the Secretary of Public Safety makes this comment that shows he's comparing Septembers to Decembers. It means he doesn't have a very clear understanding of how epidemiology, statistics, and demography work.

So we decided to continue the study, and it was done very quickly, the survey. It was an impressive group led by Domingo Marques in Puerto Rico, that was able to get, I think, all the data in about two or three weeks.

And then we came up with a very imprecise estimate, because we only had about 10,000 people in the survey, 3,000 households. That leads to a confidence interval that's almost 10,000 wide. It's not very precise. But the center of that interval was around 4,000. So when we published our paper, I actually didn't think it was going to get that much attention, because there were already three other groups that had published something, including the New York Times.
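[A simplified R sketch of why a survey of roughly 3,000 households and 10,000 people yields such a wide interval once it is scaled up to the island; the death count below is hypothetical, and this ignores the study's actual design and adjustments - Eds.]

```r
# Why a ~10,000-person survey yields a wide interval when scaled to ~3.3 million
# people. Numbers are illustrative only, not the study's data or method.
persons    <- 10000        # people covered by the ~3,000 surveyed households
deaths     <- 14           # hypothetical deaths reported for the period
population <- 3.3e6        # approximate population of Puerto Rico

rate <- deaths / persons
se   <- sqrt(rate * (1 - rate) / persons)   # ignores clustering, which widens it further
ci   <- rate + c(-1.96, 1.96) * se

round(rate * population)   # point estimate of deaths island-wide
round(ci * population)     # the interval spans several thousand
```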

But I think the combination of Harvard and the center of the interval being so high just caught the attention of the media, and it went viral. The fact that it went viral and got reported in a way that was a bit sensational - that wasn't good.

But one good consequence was that it put pressure on the government. The article comes out around May 29, 2018, I think. At that point we're still trying to get the data, the registry data. They still haven't made it public, and there are other groups trying to get it.

And three days after our study comes out, there's an interview with Anderson Cooper on CNN, where the main question Anderson Cooper was asking was, "Why didn't you share the data with the Harvard investigators?" Because when they asked us, "Why did you do a survey, when you could've just looked at the demographic data?" we said, "Because we couldn't get it. The government wouldn't share it with us." And the Governor then says that that's not true, that the data is available.

He had some statement like, "Heads are going to roll if the data isn't made public." It wasn't exactly that, don't quote me on that. But he said something very strong like that. Like, "The data needs to be public." And that afternoon, they agreed to give it to us. And once we had that data - then we started analyzing the demographic registry, and we got an estimate that's around 3,000. And by the way, these all have uncertainty attached to them, quite a bit of uncertainty.

This gets a little technical, but when I say 3,000, that's the estimate of the observed minus the expected. But that doesn't take into account the fact that you have a lot of variability from year to year in how many people die - for many reasons, for example, viruses. So that number needs to be understood as what it is: the observed minus the expected. In another year with a bad virus, you could get 1,000 people dying in excess. You see what I'm saying? There could've been a virus going around Puerto Rico that was really bad, that accounts for some of those deaths. We can't know that.

So my first analysis of that is in Simply Statistics, and the code is included [Links to many of the resources mentioned in this interview can be found here and here - Eds.]. So you can see what I did. The data was shared as a PDF, so there was a lot of data wrangling - like 80% of the code is data wrangling, just extracting all those numbers from the PDF. Now, we're working on a paper on that data set, describing statistical approaches that one can use once you have demographic data. And we're also comparing it to other hurricanes, like Katrina and Sandy - all these big hurricanes in the US. So we're applying the same ideas to those hurricanes.
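[For readers curious about that wrangling step, here is a generic R sketch of pulling counts out of a PDF; the file name and table layout are hypothetical, not those of the actual registry report - Eds.]

```r
# Generic sketch: extract (year, month, count) rows from a PDF report.
# File name and layout are hypothetical.
library(pdftools)
library(stringr)

txt   <- pdf_text("registry_report.pdf")          # one string per page
lines <- unlist(str_split(txt, "\n"))

# Keep lines that look like "2017  9  2,345" and parse the three fields
m <- str_match(lines, "^\\s*(\\d{4})\\s+(\\d{1,2})\\s+([\\d,]+)\\s*$")
m <- m[!is.na(m[, 1]), , drop = FALSE]

counts <- data.frame(
  year  = as.integer(m[, 2]),
  month = as.integer(m[, 3]),
  count = as.integer(gsub(",", "", m[, 4]))
)
head(counts)
```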

The other thing I should mention, is that there was a study commissioned by the government, by George Washington University, that was commissioned in February. By that time, we already had our data. And they came out with their report in - I think late July or August of 2018, and they also come to the conclusion that about 3,000 is the point estimate of the excess. So that's good, there's agreement there.

Once you have the demographic data, that's what comes out. If you're interested in more of that, I have a couple of Simply Statistics blog posts on this topic [see links above - Eds.].

There's also a bioRxiv paper with some preliminary results from our current research in how to analyze that data.

Len: Thanks very much for sharing all of that, it's just such a fascinating story - given the interconnection of actual tragedy and politics and money and the media, and bureaucracy and things like that.

Rafael: And with the media - it's something that I've been thinking a lot about since this happened. I learned about how to better communicate results, knowing that things can get sensationalized or can go viral. It's a little bit harder than just explaining it so that anybody can understand it. It really goes beyond that. You have to take into account, "If I explain it this way, is there room for someone to grab a part of what I just said, and make it sound worse than it is, or more interesting than it is?" That is something this experience has gotten me thinking about. How should statisticians communicate results - not just so that people understand them, but so that it's hard for someone to misconstrue what you're saying?

Len: Thanks very much for bringing that up. That's actually what I wanted to talk about with you next.

But one of the things that happened - I'm going to read the line from a Washington Post article that came out on May 29th, 2018, just to give people an impression of how all this work that you did was conveyed to people.

The line goes "The Harvard study's statistical analysis found that deaths related to the hurricane fell within a range of about 800 to more than 8,000, settling on a midpoint overall estimate of 4,645."

You talked about the confidence interval, which is a term of art in statistics. And in the Harvard study, you and your colleagues tried to convey, as you were just saying, the technical aspects of what these numbers really mean. But in many cases there wasn't even an attempt to pass that information on to people - they were just presented with numbers. When I was reading about this, it reminded me that a particular challenge of communicating these things is that some of these concepts are actually just complicated.

Rafael: Yeah.

Len: But one of the problems is that - and this might sound like a very specific thing, but I used to be an investment banker, and one of the things that I did was show people numbers and charts and things like that. And one thing I discovered - and I also discovered this when I eventually ended up in the tech startup world, and would pitch products and startup ideas - is that if you show people numbers, they see reality.

If they haven't had the experience of being the person who typed those numbers into the spreadsheet and then hit the button to do the calculation, they don't understand the concept that there are ranges and estimates and confidence intervals and things like that. I found, even in person, one on one - trying to explain to people, "No, no what you're seeing are projections based on assumptions, based on factors like, what's the interest rate going to be, what's inflation going to be like?" Blah, blah, blah, blah. People just believe what they see, what you show them at first.

And trying to get beyond that is actually extremely difficult. I haven't had the sort of like experience with things as serious as you have, but it is just the case that people believe what you show them.

Rafael: Yeah, I know. And let me say - that quote you just read, was not bad - compared to others.

Len: Oh.

Rafael: I mean, they actually gave you the interval. So yeah, you're right. And I think we have to take that into account when we communicate - we have to take into account that what we're communicating is complicated. This idea of there being assumptions, and that you could change them and get something different - that's not even uncertainty, right? That's us. That's what's called researcher degrees of freedom.

For example, in our recent study, in a recent manuscript - we have a page at the end where we change assumptions about the population that's left in Puerto Rico after the hurricane. Because we don't really know how many people are left. So we had a guess, and we tried different approaches, to estimate that. Then you can see how much the estimate changes. Not because of uncertainty, but because of the choice that we made on how to estimate the population. Those are hard things to describe, yes.

One thing that I learned is that, I teach people to not put significant digits that aren't necessary, right? And especially when you're using R or Python - and it spits out a number and it has eight significant digits after the period, and the standard error's .1, you don't need all those other numbers. You don't have to say .1, 2, 3 - you just cut it at the first one, or whatever your uncertainty tells you you should cut it at. So in our case, the 4,645 - we should have really given some rounder number, right? Because it's the same thing. It should have been 4,000 or more than 800.
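[A small R illustration of rounding an estimate to the precision its uncertainty supports; the standard error below is made up for the example - Eds.]

```r
# Round the reported estimate to what the uncertainty justifies.
estimate <- 4645
se       <- 1900          # illustrative, not the study's actual standard error

round(estimate, -2)       # 4600: nearest hundred
round(estimate, -3)       # 5000: nearest thousand
signif(estimate, 2)       # 4600: two significant digits
```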

Len: Exactly. And the precision - I've had this experience in different areas, but the precision actually conveys to people a sense of confidence. If you've been behind it, you understand that - let's say, to take an example - "the publishing industry is going to grow by 3.25% in the next five years." You know that's complete horseshit, and that this is just some analyst who was given the task of coming up with a number, and they came up with something. But if you do go like 4,645 -

Rafael: That's right.

Len: Down to like the single decimal - people actually take that as a sign that you're very confident in what you're doing, even if, in the same sentence, it's presented as a range from 800 to more than 8,000.

Rafael: That's right. That's something I would do differently next time, definitely. And yeah, it makes it seem like we counted it. That was one of the biggest misconceptions - that we actually were counting people that died.

Again, it's a hard thing to understand. That concept of a survey - that's not something everybody learns. It's not part of the high school curriculum, I don't think?

Len: That leads me to the next topic I wanted to talk to you about. Which is - like I said, when you've been the person behind the spreadsheet with the assumptions in it, doing the analysis - you have a hands-on awareness of what output numbers and charts and stuff like that really mean. But at the same time, these ideas aren't just hard for people who are not experts to understand. They're actually hard problems internally.

I don't know if you saw this, but just by coincidence recently - there was a comment published in Nature online, with over 800 signatories from the scientific community called, "Scientists rise up against statistical significance."

Rafael: Yes. It's like all over the internet.

Len: I imagine. I wanted to talk to you about that. Because, if even the experts are - so, I'm not in this area, but it didn't strike me as surprising that people were just using something called the p-value in a roughshod way. If the number comes above this number, then it's not statistically significant. If the number comes out underneath, then it is.

Rafael: Yeah.

Len: Can you talk a little bit about this issue of statistical significance, and why it's a matter of debate right now within the scientific community?

Rafael: Oh man, that's another podcast - and you can get real experts talking. There's people who really discuss this at length, that in a way, have made it part of their careers to think about this. It's a little philosophical too, it's not just mathematical.

But I'll tell you what I think, and I'll also explain where this comes from. I think it was Fisher who came up with that cutoff back in - I don't know - 100 years ago?

Len: .05.

Rafael: .05. I can't remember what it was he was doing, and statisticians might get angry at me for not knowing my history. But he comes up with that number, and we still use it. It's completely arbitrary. And then there's this idea of dividing things into significant or not - which, I think, has its place in some parts of science. For example, if you're going to approve a drug, it might make sense to have a hard cutoff on something like a p-value, or something else - some kind of hard cutoff, so we take subjectivity out of the equation. With something like a drug, you really don't want, I think, too much human intervention in how we decide whether we should approve it or not, right? It should be a set algorithm that protects us from false positives - drugs being made available that aren't actually helpful - and also doesn't keep too many true positives from going to the market.

But what should that number be? Well, we can have a discussion about it, but it should be something.

Now, in other aspects of science, it makes absolutely no sense to talk about statistical significance. What we have taught in statistics classes for - at least I have, and most of my colleagues have, for decades - is that if you do an experiment, there's randomness in your estimate, and sometimes there's a hypothesis that you want to reject, or not.

But let's forget about that for a second. You're estimating something. You should simply give your estimate, and give a confidence interval for it. That's one improvement over what is done today, which is just to say, "significant or not?"

Also, because then you're saying, "I think the effect is this much, and the data's noisy, so the range is - there's this natural variability, all these other things. So it could've been some other number." That's what you're trying to say.

I'm being very vague here. I'm not using technical terms. But what has become the standard in many journals is that, in many cases, if the question is, for example, "Is there an effect or not?" - "Does coffee cause this disease? Does smoking cause lung cancer?" - then rather than stating the effect you think it has, like the risk - how much more risk does a smoker have over a non-smoker? - the rule used in many publications is: "Is the p-value bigger or smaller than .05?"

Now what's an alternative to that? One simple alternative to that is just to say what it is. "It's .043." Or, "It's .069." Or, "It's .15." You just say what it is. You don't make this distinction.

And that's one of the things I think this opinion piece is saying - that it's very arbitrary to jump from yes to no when you go from .04999 to .05, or .051. That's, I think, what the argument is, and I agree with it. So that's just one easy improvement - to actually report the actual p-value.
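[A minimal R sketch of reporting the estimate, an interval, and the actual p-value instead of a yes/no verdict, using simulated data - Eds.]

```r
# Report the estimate, its interval, and the p-value itself - not just "p < .05".
set.seed(1)
control   <- rnorm(20, mean = 0,   sd = 1)
treatment <- rnorm(20, mean = 0.5, sd = 1)

fit <- t.test(treatment, control)
fit$estimate    # the two group means
fit$conf.int    # interval for the difference
fit$p.value     # report this number as is, e.g. 0.043
```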

One of the bad consequences of having this threshold is that, in many circumstances, you only get to see the papers that achieve it. So if 20 people did the experiment, and one of them had a p-value less than .05, and the others didn't - it's probably the case that there is no effect. But you only see the paper that found that there was an effect. You don't see the 19 that had smaller estimates. You see the one that had the higher estimate. And that introduces a bias - they call it "publication bias."
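[A toy R simulation of the publication-bias scenario described above: twenty identical experiments on a true null effect, with only the "significant" results surviving the filter - Eds.]

```r
# Twenty labs run the same experiment; there is no true effect, but only
# results with p < .05 get "published".
set.seed(20)
one_study <- function(n = 30) {
  x <- rnorm(n)
  y <- rnorm(n)                       # same distribution: no real difference
  fit <- t.test(y, x)
  c(estimate = unname(diff(fit$estimate)), p = fit$p.value)
}
studies <- as.data.frame(t(replicate(20, one_study())))

mean(studies$estimate)                # close to zero: the truth
studies[studies$p < 0.05, ]           # the (possibly empty) set readers get to see,
                                      # with estimates far from zero
```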

So, there's many problems with this idea of just dichotomizing results, in this way. There's many other arguments people make about this - you can have a whole podcast about it. If you want to have it, I can give you some names of people who think about this more than me.

Len: Thanks for explaining that. One of the reasons I wanted to talk to you about it is because, when we get these precise numbers, or we see these precise charts, but there are all these dimensions behind it - like politics, like reality on the ground, like bureaucracy, like how it's communicated through the media - which isn't the same as how it's communicated to the media in the first place - and now you've brought it up, that there's these issues of theory, and basically philosophy of science, but also institutional practice and motivations and things like that - so behind these numbers that we get, "Studies show that 45% of people who do this get that" - there's just so much complexity behind it and messiness.

I wanted to use that as a chance to go into the next part of the interview. You've mentioned things about the data being noisy and cleaning up the data. I think that hooks into something I brought up earlier, which is something that you and Roger and Jeff write about on the Simply Statistics blog, and in your books, which is, how important it is to have hands-on experience doing data science.

I wanted to ask you, why is that so important? Because I think in a lot of people's minds, it's like, "Well you've got data, that sounds precise. You've got computers, that sound precise. You've got million-dollar experimental machines running microarrays, that sounds precise. You've got all this scientific theory behind you. Why would you need hands-on experience? Why do you need to clean up data? Why do you need to get rid of noise?"

Rafael: That has a complicated, long answer. I'll try my best to shorten it up.

Before I continue, I want to clarify that - I used the word "noisy." It's coming from my roots, from signal processing, where they use the term "noise." But a more precise term would be "variability." There's unwanted variability, there's variability that's interesting, and statistics is all about parsing those two out, right?

So when I say "noisy," I sometimes include variability that is - for example, in cases like that - just nature. What I was calling "noisy" meant that one year there are more deaths because of a virus, and another year there are fewer deaths, because we got lucky and there was no bad influenza or anything. So when I was using the term "noisy" before, I was using it very, very generally, to signify stochastic variability in general.

Alright, so now let's get to your question about the importance of looking at data. What you said earlier, about how important theory is, and how important what we call statistical methodology is - one of the reasons it's important, is that it keeps us from reinventing the wheel. There are many problems that I face as a data analyst, where I don't invent anything new. I can use something off the shelf that someone invented 100 years ago, and it's just proven to work over and over again in situations like this.

There's a book of stuff that has worked in the past. There's a ton of stuff. You have it, you read it, you have a toolbox. You don't want to reinvent that toolbox. Because if you're going to be doing data analysis, you really do want to learn about all the statistical methods that have been published, that have been used in the past.

Now with that said, what you don't want to do is use that as a recipe book, that you follow as if you are baking a cake. Because that also doesn't work. And the reason it doesn't work is because every problem - at least in my experience - almost every problem has some nuances that make it so that a recipe won't work.

You have to tweak this, or maybe use another method that you thought was not the right one at first. But then you realized it was. Or you realize that you have to transform the data before you do that, or you realize that whoever was - there's all kinds of things that can happen, with taking the data, with making mistakes. These are things that the books don't tell you about. This is why it's important to gain experience.

You gain experience by doing. You start seeing problems that are somewhat common. You see them, and then when you see them again, you know what to do. You gain a sense of how to search for problems, how to find problems. In particular, data visualization is very important, perhaps the most important part of analyzing data. Before following a recipe, you want to look at the data, to make sure that it's appropriate to follow that recipe - and very often, it's not.

That's why it's so important to work with data. Because there's no such thing as a recipe book for doing data analysis. You have to learn by experience.

There are other things you learn by experience too, like how to be efficient when you code, and you analyze data. How to make your data reproducible, coding techniques that make you more efficient. There's all these other things you learn as well. So that's what I would say.

Another thing that I would say about statistics - it relates to this data analysis question you asked me - is that there are certain ways that we get confused with data, and they arise more often than not. Simpson's paradox is an example of this - well, maybe not Simpson's paradox. Let's just say confounding is an example of this, where you think x causes y, but it turns out that there's another variable, z, that causes both x and y to change. And that's something that, with experience analyzing data, you see happening often, and you know to look for it. That's another thing that you learn by learning statistics and epidemiology, in the case of confounding.
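[A small R simulation of the confounding scenario Rafael describes: z drives both x and y, so a naive regression makes x look important - Eds.]

```r
# Confounding: z causes both x and y; x has no effect on y of its own.
set.seed(42)
n <- 1000
z <- rnorm(n)
x <- z + rnorm(n)
y <- z + rnorm(n)

coef(lm(y ~ x))        # naive fit: x appears strongly associated with y
coef(lm(y ~ x + z))    # adjusting for z: the x coefficient drops toward zero
```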

Len: It sounds like one of the really important things to learn, is to not necessarily believe your first intuition when you see something.

Rafael: Yeah. You learn to test things out. To test how rigorous what you did is. There's a lot of things you learn by actually analyzing data, as opposed to just learning the methods and the theory. And that's what you hear Roger, Jeff, and me talk about often.

Len: It's interesting - just yesterday, when I was preparing for the interview, I read a piece by Roger where he brought up von Clausewitz, when he was talking about the distinction between theory and practice. He articulated things in terms similar to the ones you just used, where you should learn everything that worked in the past, but essentially when you go to war - in this metaphor - you've got to pay attention to what's happening in real time on the ground, and be aware of what's worked in the past, but be willing to innovate in the moment.

You mentioned visualization being very important. That reminded me of a post you published recently, which had me laughing - although it is serious - where you said, basically, you're not a fan of bar charts.

Rafael: Dynamite plots, yeah.

Len: I was wondering if you could explain a little bit about that, because this is something that we all see and use all the time.

Rafael: First of all, let me clarify. Bar plots are fine if you want to show 10 numbers, and you want to do it through a graph, using visual cues - the bar plot is the tool to use. Each number gets one bar, and length is the best visual cue for the eye to connect a shape to a number. So it's what you do.

My critique was really of - they are called "dynamite plots," or they have another name, bar and something else.

These are widely used in science to summarize multiple data points. And that's where I have a problem. Not just me, there's whole articles written about it. In my blog post, I link to those articles, so people can go see them. It's something that has been actually published about before.

So, the problem is that if you're making a comparison between two groups, say a control and a treatment group, and you have 20 data points for each one - if you want to compare those two, I, as a reader, want to see those 40 points. It's not hard to do. You just plot them. You plot the 40 points, and I can see them. You put them next to each other. Or if that's too much to ask, you make a little box plot, so I can know the range of the data. Or two histograms.

But that's not always done in science. Sometimes, they make a bar plot that shows you the mean of each group. And then they have a little antenna, that tells you the standard error of each group.

So you're now showing me four numbers, and there are many, many ways in which two groups of 20 points can have the same mean and the same standard deviation. So you're really over-summarizing the data, and actually not showing me much. You're showing me four numbers. And if you want to think about it in terms of ecological impact - if you're printing a journal, you're wasting a lot of ink, a lot of toner and paper on this graph that's just showing you four numbers.
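[A short ggplot2 sketch of the alternative Rafael suggests - showing the actual points (with a box plot) instead of a bar with an error "antenna" - using simulated data - Eds.]

```r
# Show the data: jittered points plus a box plot for two simulated groups of 20.
library(ggplot2)
set.seed(3)
d <- data.frame(
  group = rep(c("control", "treatment"), each = 20),
  value = c(rnorm(20, mean = 10, sd = 2), rnorm(20, mean = 12, sd = 2))
)

ggplot(d, aes(group, value)) +
  geom_boxplot(outlier.shape = NA, width = 0.4) +
  geom_jitter(width = 0.1, alpha = 0.7) +
  labs(x = NULL, y = "Measurement")
```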

Len: Thanks for explaining that. I think I understand your point a little bit better now.

Before we move on to talk about your latest book and how it came about, I wanted to ask you a selfish question. I ran a hockey stats website for a few years, and I really enjoyed doing it. And I know you wrote a post a while ago about LeBron James, and you talk about Moneyball and stuff like that.

I was wondering, as someone who knows all about these things, how do you see sports statistics evolving in the sort of near term?

Rafael: Oh man, they just keep getting more and more sophisticated. It's impressive. I'm a huge baseball and basketball fan, since I was a kid. I've always looked at the stats and thought about the stats. It's something that I've always liked. That's why you see me posting about it every once in a while.

I like how statistics is used in sports. And it's also one of the first very clear examples of someone doing better using stats.

The Moneyball movie is a summary of what happened, of how someone used statistics to improve, to find inefficiencies in the system.

But now you're seeing - I mean, I've seen things in the NBA where the data is actually where every player was at every - I don't know - second, or some time frame, and where the ball was. That's a pretty complex data set.

And from there, they're trying to figure out what plays are more efficient, which groups of players play better together. It's super complicated, and very sophisticated. I think it's very interesting, and I see it continuing to thrive, and to be a big part of sports.

Len: Maybe one day some AI will be calling the plays.

Rafael: Sure, and in baseball, I think there are teams that almost do that. Not quite, but you basically have a computer calling what to do next. There's not like a robot in the dugout, but there's an algorithm that the manager follows: "If this happens, you do this. If this happens, you do that. If this pitcher comes in, this batter's going to hit against him."

Len: It's going to be really interesting to watch what happens. I've read a little bit about gathering essentially sort of three-dimensional data from the entire court. And it just gives people so much to think about.

Rafael: There's two things with sports. One is how to build a team, how to pick plays. That's one aspect of it. But then there's the other part - which is now part of a lot of organizations; they hire people to do this, and they appear to do it quite well - the part that I get more involved in, not too much, but every once in a while, which is using statistics and data to argue that one player is better than another. That's what my post about LeBron is about. I used LeBron mainly because I see his stats and I just can't believe that there's someone this good - and that anybody could question his greatness.

Len: Moving onto the last part of the interview, I wanted to talk just a little bit about your book, Introduction to Data Science: Data Analysis and Prediction Algorithms with R.

I believe it sort of springs from a previous book, and from your MOOCs - your edX courses. I was just wondering if you could talk a little bit about the book, and how you used GitHub to produce it?

Rafael: Let's see, so where do I start? The first book was for the first MOOC. We were creating R Markdowns for the lectures. When we prepared lectures, we would have R Markdown files that created the plots we were going to show, in some cases. I have to think back, because now I'm using something called "bookdown," but that wasn't around, or I didn't know how to use it, back then.

So Mike wrote this script that would turn the R Markdowns, automatically, all at the same time, into a web page. There were tools to do it, and GitHub was part of it, because it facilitated this. The R Markdowns would automatically turn into a web page with rendered output, so that people could read it and see the graphs, not just the code.

I really liked that. The students could go in and actually see the code, and copy and paste it. If they wanted to download the original R Markdown, they could do that too. I really liked that approach. It helped me organize and be efficient. It was much more work, because you had to prepare these R Markdowns in advance, but in the end, I was very satisfied with that.

At that point, when we had all these R Markdowns, either Jeff or Roger told me that I should just turn it into a Leanpub book, because I'd already done all the work - it was already all in Markdown. So we did that. It wasn't as easy as they said it was going to be, but it was relatively easy. The difference is that my book had much more code than theirs, so it had more places to break. And much more LaTeX than theirs, so there were even more places to break. The LaTeX was challenging.

But it worked out. We put it up there, and it got a lot of people downloading it. I love the fact that you can make it free and give a suggested price, because people do pay. It's like 10%, I think, that pay - at least for that book; there might be other stats for other books. But yeah, it was great. Because it was free, and we were still making a little bit of money on it. And at the same time, it was a great resource for students. So it was good all over.

Len: Thanks very much for that explanation. For anyone listening who's not aware, one of the features of selling a book on Leanpub is that we have a variable pricing model. So, the author can set a suggested price and a minimum price. And that minimum price can be free. And as Rafael's found, a lot of people, even when a minimum price is free, will pay for it. Which is a nice bonus for the authors.

One of the reasons people do that is that we pay an 80% royalty rate, which is high for the industry. And we actually show how much the author earns when you choose what to pay. So it establishes this nice connection between yourself as a customer or reader, and the author.

But also, having that free minimum price is really important for certain types of projects - particularly those aimed at a very wide audience, for educating them. Because not everybody can afford to pay.

And if your goal is to train the next generation - well, the first generation of data scientists around the world, that free minimum price actually plays a really practical role in achieving your goal.

Rafael: Yeah. In our classes in particular - because we're teaching, and this is a book for a course - we had a substantial number of people with much less purchasing power. That's just a reality. And it would seem unfair to have them not be able to have the book that all the other students have. That was one of my favorite parts of the Leanpub model - that you could do that.

The other thing I was going to say - for others thinking about doing this - is that for my second experience, I used something called "bookdown" for the book. For my second set of MOOCs, I decided, "This is the way to do it: before every lecture, I'm going to write an R Markdown." It was a lot of work at first, but at the end I basically had a book done.

"bookdown" is a package in R that makes it quite easy to turn R Markdowns into an online book. It's not like Leanpub, where you actually have a PDF with pages, it's more like a web page. And then, bookdown also lets you turn it into a PDF. So Leanpub let's you submit a PDF [In our "Bring Your Own Book" writing mode - Eds.] So all I had to do was just submit this PDF, and there it is, and it's now published for Leanpub. It was so easy. So easy compared to the previous time. Because I didn't have to LaTeX anything, because bookdown took care of all that.

Len: Thanks very much for that very clear explanation. It'll really help anyone listening - if they're thinking about publishing something to accompany a MOOC on Leanpub, or just publishing on Leanpub generally.

The last question I always like to ask people here - it's a bit of a springing-on-you kind of question - if there were one thing we could build for you, or one thing we could fix for you on Leanpub - is there anything you can think of?

Rafael: Well, last time it would've been the LaTeX, because there was some weird LaTeX you guys use. I can't remember what it was - but I didn't have to deal with that this time, because I created the PDF myself. That last time, we had to write a script to change our LaTeX to the LaTeX that Leanpub took. That was really the source of the pain point for me, but it wasn't really that bad.

I don't really have much to say - I like the system. The other thing - I guess you guys tell them all the time, but authors get like 80% of the proceeds, which is unheard of. Am I right about that, is it 80?

Len: Yes, it's 80%. Thanks very much for that, for that feedback. I'll make sure to communicate that to Scott, my colleague.

And thanks very much for taking the time to do this interview. I had a lot of fun. You were very game to cover so much ground. And thanks also for using Leanpub to publish your book.

Rafael: Thank you guys for providing the service.

Len: And thanks as always to all of you for listening to this episode of the Frontmatter Podcast. If you like what you heard, please rate and review us in iTunes, or wherever you found our podcast. And if you're interested in becoming a Leanpub author yourself, please visit our website at leanpub.com.

Podcast info & credits
  • Published on April 24th, 2019
  • Interview by Len Epp on March 27th, 2019
  • Transcribed by Alys McDonough