Special Guest: Chris Mattmann, Author of Machine Learning with TensorFlow, Second Edition
A Leanpub Frontmatter Podcast Interview with Special Guest Chris Mattmann, Author of Machine Learning with TensorFlow, Second Edition
Special Guest Chris Mattmann is the author of Machine Learning with TensorFlow, Second Edition. In this interview, Leanpub co-founder Len Epp talks with Chris about his background, working at NASA's Jet Propulsion Laboratory, re-architecting the way NASA missions handled massive changes in the amount of data being gathered from instruments thanks to technological changes in the mid-2000s, about some fascinating NASA and DARPA projects he's worked on, the particular challenges of dealing with telemetry data, Kobe Bryant, and about his book and his experience as an author.
This interview was recorded on February 10, 2020.
The full audio for the interview is here: https://s3.amazonaws.com/leanpub_podcasts/FM143-Chris-Mattmann-2020-02-10.mp3. You can subscribe to the Frontmatter podcast in iTunes here: https://itunes.apple.com/ca/podcast/leanpub-podcast/id517117137.
This interview has been edited for conciseness and clarity.
Transcript
[Note: There are discount codes for Manning books at the bottom of this transcription. Please note some of them have limited uses and are temporary! - eds.]
Len: Hi I'm Len Epp from Leanpub, and in this episode of the Frontmatter podcast I'll be interviewing Chris Mattmann.
Based in Pasadena, Chris is Deputy CTO and Principal Data Scientist at NASA's Jet Propulsion Laboratory.
You can follow him on Twitter @chrismattmann and check out his profile at scienceandtechnology.jpl.nasa.gov/dr-chris-mattmann.
Chris is the author of the Manning book Machine Learning with TensorFlow, Second Edition.
In this interview, we’re going to talk about Chris's background and career, professional interests, and his book.
So, thank you Chris for being on the Frontmatter Podcast.
Chris: Len, it's great to be here. Thanks for having me.
Len: I always like to start these interviews by asking people for their origin story. So, I was wondering if you could talk a little bit about where you grew up, and how you first became interested in Computer Science and technology generally.
Chris: I'll try and do it in a way that doesn't take the rest of this podcast. I've been accused by people of being long-winded, so I'll try my best.
I grew up in a town about an hour north of Los Angeles. I was born in Los Angeles, born and raised. And RIP Kobe and Gigi - everybody out here is still hurting every day about that. He was born in '78, I was born in '80.
The town, about an hour north of LA, was Santa Clarita. Probably the thing it's best known for, on the west coast, is that it's where Six Flags Magic Mountain is located. A number of my friends and others worked there growing up.
I was not into computers growing up, I was into sports. Everyone around me - Santa Clarita's a pipeline for going to UCLA and Cal State Northridge, and places like that. And so naturally, I wanted to be different, so I wanted to go to USC - the University of Southern California, which is an amazing private school that I didn't have the money to go to.
I grew up in a trailer in Santa Clarita, and I like to tell people I went from the trailer to the Ph.D. And it was through - to be honest, it was through some of that - same things that Kobe preached and others, that Mamba mentality - are just grinding harder than everybody else. I'm not the smartest person in the room, but my philosophy has always been, "I'll outwork you."
And at least for me, what happened after probably my senior year, was - I got interested in computers a little bit. I had tinkered with them a little. A long time ago, I had an Apple IIe, and I figured out that if you pressed Ctrl-C during an adventure game on the Apple IIe, you could change the things that the characters said to one another. And so I changed the characters to say swear words to one another - all the mages and the wizards and things like that. That was when I was 12, and that was most of my computer experience until I got to SC.
At SC, I picked Computer Science. I don't know why, I couldn't tell you. But when I got in there, I started studying Computer Science - and to be honest, I felt kind of inadequate. A number of my family - my uncle in particular - convinced me to hang around, and it was the best thing that ever happened to me. I just felt like everyone else was smarter than me. And then I realized - it wasn't that, it was just myself, and doubting myself. And so I stayed.
While I was an undergraduate, I had to work, basically, to be in school.
And so one summer I was in the computer lab at night, and an opportunity came to work at JPL, the Jet Propulsion Laboratory in Pasadena. It was for an earth scientist, a guy named Dr. Robert Raskin. He was looking for computer programmers - because back in the day, card-carrying scientists - "Big S" scientists, I like to call them, unlike my generation of "little s" data scientists; maybe that's the feeling of inadequacy from 20 years of being here - didn't do the programming themselves. Rob was looking for someone that could help him program.
And so I got my start basically programming databases for earthquake and earth scientists, working with Caltech and other places.
I went from there. I graduated, after working part-time as an academic. And after about six months off, I was still interested in continuing my education, for a couple of reasons. First, at the time, if you got a Master's degree, you could get a raise at JPL. And so I wanted to do it. I had a long-time girlfriend at that point - my soon-to-be wife. We were buying a house, and so that was kind of important, to take care of the family.
And then the other was - I just got interested in Computer Science. Right around the time I started my Master's, I had the opportunity to watch the Mars rovers - Spirit and Opportunity, the twin rovers launched in 2003 that landed in 2004. And I remember being there, watching NASA TV at night at my new house in Highland Park that we had just bought - on my TV that I couldn't afford, but I bought it anyways - and sitting there with my wife, and being like, "Wow, I work there." Governor Schwarzenegger is shaking the hands of friends of mine that are in Mission Control and other places. And so that made me want to re-dedicate myself to computers and science and learning more about it.
I started working on missions after that. Between 2005 and 2009, my team kind of re-architected the way that we deliver instrument and ground data systems for missions, for earth and planetary science. To cut to the chase - and I know I'm being a little long-winded, I'll wrap up soon - missions fundamentally changed during that era. Really, sensors in general changed. A couple of years later the iPhone would come, everyone would have an amazingly powerful camera on their phone, and things like that. The same thing was happening with instruments.
The Orbiting Carbon Observatory - OCO - really got going as a project in 2005, which is when I kicked onto it. It was going to change the way that we took our science. OCO's goal was to measure global carbon from space. The prior earth science mission, a scatterometer measuring winds called QuikSCAT, took 10 gigabytes of data in 10 years, and ran on the order of tens of processing jobs per day. OCO would be 10,000 jobs per day and 150 terabytes of data - a thousand times the daily jobs, and more data in its first three months than the entire prior mission took in an entire decade.
And so that was not unique to OCO. It was all the missions at the time, and so we had to kind of re-architect the way that we did it.
And so, I was being heavily influenced at the time by open source, what was going on in the open source community. In particular, I got involved in the Apache Software Foundation. I started to build search engines around the same time at USC - I just kept going. I had a really inspiring mentor there, Dr. Nenad Medvidović - who convinced me to stay on and do a Ph.D.
Basically, I was living the dream. I was redesigning and staffing and putting together the team that would build the next generation of missions for basically big data science and things like that.
And then getting to basically take JPL from the era of C and C++ programming, which was basically what hardcore people did for flight software - and convincing them that this new thing called "Java" could be useful, sort of in that context. And, yeah - so we moved the missions on to Java and things like that.
Len: Thank you very much for that answer. I should mention, we like long-winded on the Frontmatter podcast. There's nothing better than having someone spontaneously speak in whole paragraphs for minutes at a time. That's a wonderful thing, and bodes well for a good interview.
So, there's a lot to unpack there. One of the questions I wanted to ask you - specifically with respect to data science and working for NASA - is there something unique to dealing with telemetry, as opposed to other types of data that poses a challenge for you as a data scientist?
Chris: Absolutely, Len. I think there are two unique things about telemetry. Just for your listeners and everybody - telemetry is typically time series data that comes back from instruments and from different science hardware and software. It's usually packetized, kind of like the internet, where packetized data is transferred from computer A to computer B across possibly different networks - possibly across the globe. You can extend that concept to space. That's how computers on the ground communicate with instruments in space - they use telemetry. That's how they send their data. There's a whole layer about that.
The challenge with telemetry is twofold. The first challenge is that you have engineering and housekeeping telemetry, and you have science telemetry. And it's all really messy. It feels like it should be very organized. But it turns out that across all missions and all instruments, there's not a nice terrestrial commodity network architecture and set of standards. There are some standards for space - but they're at the very low level, and they're managed by an organization called CCSDS, the Consultative Committee for Space Data Systems. And they do set, like I said, lower-level standards.
But at the application and higher and other levels, they tend to do kind of their own thing. And so you're not guaranteed if you have readers and writers and software, for one mission's telemetry to be able to line it up with others. So it's really hard to compare across missions.
There's actually a big effort at JPL nowadays - a project here called ORCHIDS - which is basically trying to make telemetry data what we call "analysis ready": ready for machine learning and other things like that.
The other challenge with telemetry is that - we deal with earth science and closer-to-terrestrial orbiters, but because we also deal with deep space at NASA, there are a lot of issues with, for instance, missing data - data that's been messed up in particular ways, that you've got to calibrate and validate, especially at the telemetry level. And so you basically have to account for the fact that, "Well, the orbiter only sent back three hours, and it was supposed to send back six."
And when that happens, we keep going, we can't stop. We've got to do processing. And so there's a lot of messy data issues on that end. Not just in terms of the formatting, and not being able to compare across instruments - but even if you could, the record itself might be incongruous, non-continuous. You've got to impute data, do things like that, and make guesses. So that's the other big challenge about it.
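[To make the gap problem concrete, here is a minimal sketch of the kind of time-series imputation Chris describes, using pandas; the channel name, sampling rate, and dropout are invented for illustration, not taken from an actual JPL pipeline. - eds.]

```python
import numpy as np
import pandas as pd

# Hypothetical housekeeping channel sampled once a minute; the pass was
# supposed to return six hours, but the middle stretch never came down.
times = pd.date_range("2020-02-10 00:00", periods=360, freq="1min")
temps = 20 + np.cumsum(np.random.normal(0, 0.05, size=360))
frame = pd.DataFrame({"temp_c": temps}, index=times)
frame.loc[frame.index[120:240], "temp_c"] = np.nan  # simulate the dropout

# Two common ways to keep the processing pipeline moving despite the gap:
frame["interpolated"] = frame["temp_c"].interpolate(method="time")  # line across the gap
frame["held"] = frame["temp_c"].ffill()  # hold the last known value

print(frame.iloc[118:123])  # inspect values around the edge of the gap
```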
Len: One of the really fascinating things, I imagine, about the kind of work you and your colleagues do at NASA, and particularly in the Jet Propulsion Laboratory - is that lives are at stake in the work that you do, in addition to billions of dollars. And so when we think about the kind of work that a data scientist might be doing at a company - let's say a conventional company like Amazon, where the stakes are still high in some scenarios, but they're not as high as they are for you - are there unique approaches? I imagine there are across the defense industry as well. But are there unique approaches to testing the way that data is used in the work that you do?
Chris: Yeah, I think there are unique approaches there to testing data, and even in the way that we handle processes and things like that. JPL - like any major federal lab, or what we call a Federally Funded Research and Development Center, an FFRDC - is one of the national labs, so we're supposed to be the centers of excellence for the government and our parent agencies. Everybody at JPL - just for your audience and for yourself - we're all Caltech employees. Caltech manages JPL on behalf of NASA and the government, just like Berkeley manages Lawrence Berkeley National Laboratory on behalf of the DOE.
One of the things about the national labs is that we're supposed to do first-of-a-kind things. If it becomes commodity, if it becomes routine - whether in hardware, software, spacecraft, remote sensing - then industry's got to do it. We can't compete with them. But we can do the Mars program - no other nation in the world has put something on the surface of Mars besides the US. It's a crown jewel. We can do things like the Deep Space Network - these huge 70-meter dishes, as big as a football field, in Canberra, Australia; Madrid, Spain; and Goldstone, California.
We can operate that type of stuff. And because we're doing those big-ticket items, we have two key components. The first is operations. And operations, like you said - it's a bigger deal, lives are at stake, science is critical. We're informing policy and research and things like that.
So we have operations. If you're a data scientist and you work on a Mars program or a Mars project, you may work on what they call "Martian time." Mars has sols - a Martian sol is not equal to an earth day. And so you might be deployed into our control center, and work Mars shifts on Mars sols. That's just one way it might be different. They take it very seriously - the maintenance, operations and commanding of robots on other planets, which is JPL's center of excellence within NASA.
The other key component - the other big thing - is that we need a healthy R&A program. I don't want to give everyone the impression that everybody at JPL is working around the clock, and all we do is lock ourselves in the big Mission Control and work on Mars time. My boss at JPL - the CTO here, Tom Soderstrom - likes to call JPL "an 80-year-old startup." Because from his perspective, our office - that is, the innovation office - really operates like a startup.
I manage a team of people here called The Innovation Experience Center. These are people that work on everything from the next generation of flight hardware - basically designing what it would take to have radiation-hardened, GPU-like devices in the future - to Internet of Things and Smart Campus. I've got data science and analytics in our org; visualization - people that could be doing data visualization at the New York Times or the Guardian, I have working here. I've got the Cloud Innovation team. I've got a Machine Learning team.
So all those people, they're like a startup. And so that's a little bit different. They're not deployed on Martian time. They're cross-functional, innovative people helping the rest of the laboratory - but they work a little bit differently. They're more like a skunkworks.
So JPL's a big place. It's amazing. We can operate in a lot of modes - and we have to, to basically serve the nation and to really explore the galaxy, and help mankind and the universe.
Len: You talked about living the dream. So, I've got a bit of a cheesy question for you. Amongst the many things you've worked on at the JPL, what's the coolest project you've been involved with?
Chris: I would say my favorite project over the years - well, there's one that I've recently worked on that's near and dear to my heart, but I'll talk about the one I'm maybe most known for, that I really, really enjoyed working on over a number of years - was a project called the Airborne Snow Observatory, ASO. It was a sub-orbital project, meaning that it wasn't in space. A lot of times we will test things on airplanes, or in situ on the ground, and validate instruments before we put them up in space. That's for risk reduction, and things like that.
But yeah, ASO was an airborne mission with two instruments that flew over the Sierra Nevada. What it was looking to measure was snow melt - the rate of snow melt - and how much snow has accumulated in the mountains.
So, water's a big deal in the Western US. There's land rights, there's water rights. Believe it or not, even though we've got that big Pacific Ocean to the left of us, to the west of us, we can't just take the water from that and drink it; desalination and other things are too expensive. And so we've got to deal with the water, largely, that comes out of the mountains. Out of the Sierra Nevadas, and other places.
And right now, the way they measure that - unfortunately, still to this day; ASO is changing it, although they need a whole fleet of ASOs - is that a guy or gal goes up into the mountains and sticks a big pole in the ground, and measures how much snow is there. And you can imagine that's treacherous, that's dangerous. Lots of bad things can happen, not to mention that it's a poor way to get really great measurements. But that's the way that it works.
And so the goal of ASO was to use a spectrometer to measure light reflections off the snow, to measure the rate of melt - and to use a scanning LiDAR to measure the snow accumulation, or the snow depth. With those two combined, we could inform the California Department of Water Resources and water managers - like, "Hey, here's how much snow you really have left. Here's how fast it's melting. Here's how much water you should release."
And the key part about ASO, besides those cool instruments, was the compute system that we set up - the first-ever NASA airborne compute system that took in all the data the same day. The plane hits the ground, the flight people walk over - we set up a remote compute laboratory at the Sierra Nevada Aquatic Research Laboratory - SNARL - at Mammoth Lakes. At that remote compute laboratory that my team built and commissioned, they get off the plane and take a terabyte brick off of each instrument.
They plug the brick into the compute system, and they press a green "go" button. They interact with our operators through internet relay chat - we created bots to monitor the processing. And in less than 24 hours, we went from raw data - raw telemetry, like you talked about - to produced maps and products for water managers showing snow water equivalent, on a seven-day period in that range. It fundamentally changed the way that they can make predictions about that.
And the reason I said they need a fleet of ASOs - and back to my earlier point about JPL doing the first-of-a-kind thing - is that JPL did the first of its kind in 2013, and we demonstrated it for about three years, and now there are efforts to have companies do this. We're transitioning the technology to companies and so forth.
So that was my favorite project, probably ever. It involved combining all the mission work I learned before - the technology, the open source work, bots, intelligent digital assistants - the stuff that we're doing today, machine learning - like the book. And so yeah, that was my favorite project.
Recently, the other project - just to let you know that I worked on it; I promise I won't talk about it as long. We have a project called Send Your Name to Mars. What that is, is - for the Mars rovers, like Curiosity, which launched in 2011, and now Mars 2020 - which is leaving JPL today to get on a plane and fly to the Cape for commissioning and eventual launch to Mars in the summer.
One of the things we do is invite children around the world - K through 12 - to offer up a name for the rover.
That's how Curiosity got its name - its technical code name at JPL was Mars Science Laboratory, MSL, and "Curiosity" came from a contest with children to name it. Mars 2020 has two very similar contests. The first is to name it, and it's down to eight names. You can look it up right now - "Name Your Rover," or "Name the Rover." Just Google that. [Here is a link - eds.]
And the other cool part is that we allowed people to etch their name into the rover. So, Len - for your family, with your kids or your wife, or just your own name - you could go to "Mars Send Your Name." It was a way for you to get a ticket, a boarding pass to Mars. It puts the name you submitted - and your address, whatever you want to put - on a hard drive. And that hard drive is on 2020 right now, and it's going to go to Mars physically.
It's like the Carl Sagan concept from Voyager, with the golden record, but a modern approach to do it. We got 11 million names submitted to that, or 11.9 million - which was just a huge success.
Len: And how many of the names submitted were Rover McRoverface?
Chris: Ah, everybody asks that question. So let me tell you a funny thing. We're working on a paper right now for KDD, which is a big data science conference. We used machine learning - things like TensorFlow and whatever - because we had to build a profanity filter. And it was a dual thing, because it wasn't just Rover McRoverface. It was names like - I don't know - the prank names you might see on The Simpsons, that Bart used to call Moe with, things like that.
And so we had to use some - believe it or not - advanced machine learning and AI. We used an LSTM network, which is a recurrent neural network, to learn what really good names look like, and to catch all of the sophisticated ways that people tried to sneak things past the filter. So even in that environment, we're innovating and we're figuring out how to do machine learning. But yes - many, many people do try and submit Rover McRoverface.
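[Chris doesn't share the actual filter, but here is a minimal sketch of the kind of character-level LSTM classifier he describes, using TensorFlow's Keras API; the toy names and labels are invented stand-ins for the real training data, which isn't public. - eds.]

```python
import numpy as np
import tensorflow as tf

# Toy data: 1 = acceptable submission, 0 = should be filtered out.
names = ["Perseverance", "Tenacity", "Ingenuity", "SneakyBadName", "WorseName"]
labels = np.array([1, 1, 1, 0, 0])

# Encode each name as a padded sequence of character ids (0 is the pad id).
vocab = sorted({c for name in names for c in name.lower()})
char_to_id = {c: i + 1 for i, c in enumerate(vocab)}
maxlen = 16
x = tf.keras.preprocessing.sequence.pad_sequences(
    [[char_to_id[c] for c in name.lower()] for name in names], maxlen=maxlen)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(vocab) + 1, 8),
    tf.keras.layers.LSTM(16),  # the recurrent layer Chris mentions
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, labels, epochs=20, verbose=0)

print(model.predict(x))  # probability that each name passes the filter
```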
Len: Thanks very much for explaining all those cool projects. I've got to say, one of the fun things about researching for this interview was getting to read about some of the interesting things that you've done over the years. And one of those projects was with DARPA, called Memex. I know you've done a couple of projects with them, but the Memex one involved investigating the nature of the deep web and the dark web. And so, I was wondering if you could talk a little bit about that project and your contribution to it?
Chris: Sure, Len. Now you're moving past 2009 in my origin story. So, perfect segue.
In 2009, I was at my wit's end. We were about to deliver OCO. We delivered it - and then the instrument fell out of the sky on launch in 2009. But we're NASA, so we said, "You know what? It's in the President's budget - we're going to do another OCO." We did OCO-2: built the same instrument to spec. And in 2014, we succeeded in launching it and putting it up into space, and that was great. The same data system we built back then worked again - and there's an OCO-3 on the ISS now, too. So all of that software is used.
But post-2009, I was done with missions, and I wanted to go into technology development. I had learned all the things that we did wrong, or could do better, when I was building the missions - I'd become the chief architect in our instrument and science division by then. And I was looking at technology infusion: how could we do it better? So I looked at DARPA. And DARPA, to me, is the center of excellence in computer science technology innovation.
They fund a lot of work. A lot of it has to do with cyber and defense and things like that. But in doing so, they provide the requisite resources necessary to make some of those advancements - advancements like the internet, right from the ARPANET project, or Siri, and things like that. And to me, Memex was to search what the ARPANET was to the internet, in my opinion.
And my background was being involved in open source search engine projects - the Apache Nutch project, like I referenced way back when, parts of which eventually became Apache Hadoop - and helping to contribute to that.
I was really interested in search, and one of the things I thought we could do better with science data at JPL and NASA was the way that we search it. As it turns out, we have a big physical oceanography data center here at JPL - one of the nine earth science data centers for NASA.
The experience going to that site is very clunky - you search for the data, you select the data, you order it in a cart. You download it via FTP or you might have to click through some JavaScript, or - and then you get the data. It's not like a text file, it's in a binary file like HDF or NetCDF, which are proprietary binary data formats. Actually, HDF might be open source. But anyways, they're not easy text files to read.
And so that experience deterred a bunch of users, up until recently, when things like Google Dataset Search made it a little easier. But back then, it was a big deterrent. And one of the things that DARPA was trying to solve with Memex is that law enforcement experienced the same problems when trying to, for instance, do bulk analysis on the internet to find the bad guys doing things like human trafficking, or going on forums and selling weapons illegally.
As it turned out, that same experience - going on a forum, clicking on ads, trying to figure out whether this ad or this picture of an escort or this picture of a guy was someone doing something illegal - and if so, get the tactical information that you needed to be able to go intercept and save a life, or figure out some connection to international terror or something like that - were the same types of issues that they were trying to solve basically on Memex, that we needed to solve at JPL for our science data.
And so I got involved in the Memex project - I wrote a proposal to it, and it was successful. On that project, we built technology to crawl and do bulk analysis - what we call "domain-specific vertical search" - of both the public internet and the deep web. The public internet, or the surface web, is about three percent of the internet; it's the web you can see. The deep web is all the web behind JavaScript and Ajax, behind forums that you have to log into.
And then once you get to the content in the law enforcement case - images, videos and things like that, of people and other things that you actually really want to turn into features or text - using things like machine learning; you want to know this person has these characteristics. They're in this type of hotel room. And it's X percent likely that there's a case of human trafficking, or this is an illegal AR15, or things like that. And so the technologies that we built on that program allowed extraction of information in bulk from the web.
There was a big 60 Minutes segment on Memex in 2015, with the program manager Chris White. He's sitting there explaining to the anchor interviewing him that Memex contributed the first understanding of the span of human trafficking on the web - it was about 80 million ads and 40 million images. At that scale, the other thing you need is a commercial-grade capability for acquiring that scope of data from the web. It's not something any old research group could do. You needed something that would scale, like a Google, and things like that.
And so the three main technologies that came out of Memex are: first, crawlers that could go acquire that data in bulk. The second thing is to featurize that data, and turn it into properties that you can search for - no matter the multi-modal content, be it image, video, text, whatever, turn it into features that you can search for. And the third main thing was to do that at scale, and to provide analyst interfaces to interact with it.
Memex technology was transitioned, really, all over the globe. My role in it was to make it all open source, and make sure that company A today doesn't have to struggle to get those features - that anyone could build a [?], or anyone could build a better JPL science data archive. Or law enforcement could buy the stuff as a commodity from providers. They don't have to recreate that technology. So yes, that was my role in Memex. On the non-NASA side, that may have been my favorite project ever.
Len: You mentioned features and featurizing a couple of times. There's a specific meaning to that in the machine learning context, which I think we'll get to when we're talking about your book.
But before we go onto that, I wanted to ask you - you mentioned open source. And so, yet another dimension to your work, is open source work. I was wondering if you could talk a little bit about the Tika framework?
Chris: So, alright - open source, Len. Open source, it's funny for me. It's like the Illuminati - it's who you know and who you're connected to. It's this small set of people that don't seem like they'd all be connected, but there are these weird connections there. When I was studying for my PhD, my adviser was Nenad Medvidović, and his adviser, Dick Taylor at the University of California, Irvine, ran - in like the mid-90s - what I would call a super skunkworks software group. And this is going to get to open source in one second.
Dick Taylor's students were people like Roy Fielding, who invented the REST architectural style - Representational State Transfer - basically the modern web service, the way we compute and get data on the web. There was a gentleman, Jim Whitehead, who invented WebDAV, another key component of the modern web - things like Dropbox are based off of that. Besides Jim, there was Jason Robbins, who invented ArgoUML, the de facto open source tool to model software, and things like that.
And there was Nenad, who basically invented the modern way of understanding software architecture. So through that connection - through Roy and through those students - there was Apache: Roy was one of the original founders of the Apache Software Foundation. Another one of Dick Taylor's students, Justin Erenkrantz, was the president of Apache at the time. And so my academic uncle and cousins were deep, deep into really the fundamental open source organization that all modern ones are based off of.
And when I got involved in Nutch, that was actually without being in contact with Justin. Then I started talking to him about it - "Oh, you're the president here, and Roy and all that" - I figured out the relationship, and Justin just encouraged me to keep going and keep being involved in open source.
And so I caught the bug. From a PhD to studying software architecture, to search, to getting involved in Nutch, to helping to build Hadoop and things like that. In doing that, I learned all the big data technologies and things, and I also learned the thing that I'm really interested in in that domain - which is the analysis of information and content and multimedia.
That's where Tika came from. Tika was a piece of Nutch. Hadoop came from the people building the open source Nutch web crawler realizing that the underlying computation and data framework was actually standalone - it could be its own project, and eventually its own companies, like Cloudera and things like that.
And so they split Hadoop off from Nutch. Nutch was sort of the grandfather - or great-grandmother - of all of these other projects: Hadoop, HBase - which is like Google's Bigtable - all these projects that were built from Google's papers.
And so Tika was basically pulling out of Nutch the parser framework that extracted text and metadata from any type of content, because the search engine needed to be able to do that. It was also all of the code in Nutch that dealt with language identification - to enable search and text mining across multiple languages, you have to detect the language. And then there was the ability to detect any file type: the MIME (Multipurpose Internet Mail Extensions) detection framework. Because at scale, you can't be bothered working out what type each file is by hand, or just looking at the extension. You need a more sophisticated framework to do that.
So that was the initial proposal for Tika, and the initial creators of it were myself and a guy named Jérôme Charron, who was building a French search engine at the time, called "Frutch." Jérôme got busy in life - he became the CTO of a company - and eventually a guy came to replace him: Jukka Zitting, a Finnish gentleman who was working on content management at the time. He was working on what would eventually be bought by Adobe, and what would eventually go into things like Alfresco and modern content management systems.
But Jukka knew Apache a lot better than me at the time, and he helped turn Tika into a mature project. And then what I did was make sure Tika was used everywhere - from FICO, generating your credit scores, to NASA. Eventually - you can look this up on Wikipedia - Tika became one of the two key technologies used to unravel the Panama Papers. It's a very powerful tool in data journalism and digital forensics. I like to call it "the digital Babel Fish."
You put any digital file into it, and - just like the Babel Fish from The Hitchhiker's Guide to the Galaxy, which understands any language - Tika will give you text, metadata, language information, and, after Memex, people, places, things, locations and things like that. Tika is really my passion project. I get 30 students a week emailing me - they want to work with me, they want to come do whatever and work together. I tell them, "If you're interested in Tika, we can play ball." Otherwise - just like Kobe said, it's time for my family and my kids. Tika's about the only thing I make time for anymore, besides working and managing.
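[As a taste of the "digital Babel Fish," here is a minimal sketch using the tika-python bindings to Apache Tika; "report.pdf" is a placeholder path, and the bindings need a Java runtime, since they start a local Tika server on demand. - eds.]

```python
# pip install tika
from tika import detector, language, parser

# Any file in, structured information out.
parsed = parser.from_file("report.pdf")       # placeholder path
print(parsed["metadata"])                     # content type, authors, dates, ...
print((parsed["content"] or "")[:500])        # extracted plain text

print(detector.from_file("report.pdf"))       # MIME type detection
print(language.from_file("report.pdf"))       # language identification
```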
Len: I should mention, before we move on to talking about your machine learning book - you have a book on Tika with Manning, which you co-authored with the second colleague you mentioned, and we'll make sure to link to that.
But before we go on to move on to talk about your book, one of the really fun things about this podcast is that I get to interview authors from all around the world. And so I get to ask them questions about things that are local to them, that the rest of us might only know from the headlines.
You've mentioned Kobe Bryant a couple of times in this episode already, and I wanted to ask you if you could give us - imagine you're talking to people who aren't from the United States, who don't follow the NBA. Who was Kobe Bryant, and why is he so important to Lakers fans, and people - I mean, everybody who's a fan of the NBA?
Chris: Yeah. So for me, Kobe Bryant's important for a couple of reasons. Kobe Bryant was an international basketball star who played a 20-year career in the NBA. He died tragically, along with eight other people - including his daughter Gigi, who was 13 - in a helicopter crash over in Las Virgenes, which is near Thousand Oaks, maybe about 20 minutes, I would say, northwest of LA. It's been a couple of weeks since he passed away.
And why does he mean a lot? He means a lot to me in a couple of different ways. First, he and I are around a similar age, and I grew up watching him. Kobe Bryant was one of the youngest people ever to play in the NBA. He was drafted right out of high school, and around that time, they weren't doing that. Nowadays we have LeBron James and other people who I think are setting the standards. But back then, it was Kobe, and maybe Kevin Garnett - maybe Shawn Kemp too, I think - who had been drafted out of high school, without having played a year or two in college. And so there was a lot of pressure on Kobe early on.
The way I like to tell his story is, it was sort of a redemption and an inspiration story. Kobe Bryant started out - you either loved him or hated him, coming out as a young gentleman out of high school. Initially in his first few years, I don't think he lived up to - everybody's expectations. Everyone wanted to compare him to Michael Jordan, the greatest player ever.
And what Kobe did instead was define his own career. There were other things that came up in Kobe's life - I won't even mention them. But there were issues, and what happened after all that stuff is that Kobe rededicated himself, and he dedicated himself all the time. He wanted to get better and just leave it all on the court every day.
And all he ended up doing was winning three championships with Shaquille O'Neal, in the 2000s with the Lakers. And then many people said he only did it because Shaquille was there. So when Shaquille left the Lakers, basically no one thought Kobe Bryant could win another championship. And all he did was win two more after that, and he became an inspiration to all of the younger players. Because they looked at his work ethic, and he really became the leader of the team.
That's his basketball career, and his work ethic, and things like that. Along the way, he became a family man. He really dedicated himself to his wife, Vanessa, and to their three, and then four, children. And when he laid his shoes down and had his retirement, he was young - he was born in 1978, so he was barely 40, maybe into 41, when he died. After his 20 years in basketball, he said he wanted to dedicate time to his family. And he did.
Instead of going to the Staples Center - you can read about it from Arash Markazi in the LA Times; he's a friend of mine who wrote a great story about this - unlike other basketball players or celebrities who would always pop up at things, all Kobe did was spend time with his family and work at being a great father. Because he missed a lot of that growing up in the NBA. And a lot of that relates to me.
A lot of people are doing the hustle and just working so hard, trying so hard. You can ignore what's happening with your kids, or not pay as much attention. And it goes by so fast. My oldest now is almost eleven. I've got three kids - boy, boy, girl. Almost eleven, almost five, almost three. And it just goes by so fast. So my heart hurts for him - because right when he got to spend time with his kids, right when he started to be able to do it - it's just a life lost too soon.
And the guy is the only basketball player I know to win an Oscar for his amazing movie, Dear Basketball.
When he finished his basketball career he didn't stop. He kept hustling and grinding in different ways. The guy was going to make a dozen movies. He had so much more ahead of him. And so yeah, anyways - that's why it hurts. It hurts us in LA, because Kobe helped define LA. He built the Staples Center, it's the home that he built for us. And so that's why many of us are still hurting.
Len: There have been lots of stories coming out about his generosity and kindness since he died, and I just wanted to mention - you mentioned kids. One of the more moving stories I heard was that he would often take the opportunity, when he flew somewhere, to visit a child in the hospital, and he had one rule - or at least one rule - which was, "As long as you don't publicize it." There would've been all kinds of really obvious reasons for publicizing it. But he wanted to make it about the kid he was visiting, not about himself. I found it just very moving to hear that that was one of the things - the many things - that he did in his career. So thank you for sharing all of that, I really appreciate it.
Moving on to the next part of the interview, where I'd like to talk to you about your book. You're working on the second edition of the Manning book, Machine Learning with TensorFlow. And so I wanted to ask you - just to start from the bottom, and I'm sure you get this question all the time - if you were to explain what machine learning is to someone who's never heard of it, what is machine learning?
Chris: I'll try my answer out on this for you. Because to be honest, I was that person. I'd heard of it, but I would not have called myself a machine learning expert, say, circa two years ago. And to be honest, I got tired of everybody that I was managing talking about this machine learning and deep learning stuff. I was telling my wife one night, "I need to learn what the hell they're talking about. It's bugging me." I was a trained statistical person in some other areas, in search. And so I was like, "What is this machine learning thing?"
To me, machine learning is using data or information to make predictions, to group and cluster things, to give confidence in those predictions or clusterings or groupings - and to basically make sense of the world from data, in a data-driven way.
And so to me, if you look at machine learning, and you look at any book that takes you through machine learning, including my own, it's going to start off talking about, "Okay, you've got numbers, you've got a bunch of numbers that either fit a line or a curve or some pattern."
Let's talk about regression. Regression is the process of trying to fit a curve or a line to a bunch of numbers. Then it's going to go into things like classification, which are tasks that involve, "Okay, I've got a bunch of data and I've got a bunch of labels for that data. Now make a prediction of what labels some unseen or new data is, given this data that you've seen in the past." And/or add one label to it. Add two labels to it. So that's classification, or multi-class classification.
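[A minimal sketch of the regression task Chris describes - fitting a line to noisy points - using TensorFlow's Keras API; the data, slope, and intercept are invented for illustration. - eds.]

```python
import numpy as np
import tensorflow as tf

# Noisy points scattered around the line y = 3x + 2.
x = np.linspace(-1, 1, 100, dtype="float32").reshape(-1, 1)
y = 3 * x + 2 + np.random.normal(0, 0.1, size=x.shape).astype("float32")

# A single dense unit is exactly a learned line: y = w*x + b.
model = tf.keras.Sequential([tf.keras.Input(shape=(1,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
model.fit(x, y, epochs=100, verbose=0)

w, b = (v.numpy().squeeze() for v in model.weights)
print(f"learned w={w:.2f}, b={b:.2f}")  # should approach w=3, b=2
```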
And then it goes into, "What if you don't have any labels, and I just give you a bunch of data?" The classic analogy there is, "Your daughter messes up your Blu-rays at home and throws them all over the house, and you've got to regroup them again." That's never happened to me, I'm making it up. But anyways, it happens to me all the time.
So if you have that, what you're naturally going to do is sort all those discs according to genre. Or if you're really sort of OCD like me, maybe you'll then also sort by actor, or things like that. That process is an unsupervised grouping, or unsupervised clustering, task. That's another element of machine learning.
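[Here is a minimal sketch of that unsupervised clustering idea - a plain k-means loop in NumPy over invented two-dimensional data; an illustration, not code from the book. - eds.]

```python
import numpy as np

# 150 unlabeled points drawn around three invented "genres."
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])

k = 3
centers = data[rng.choice(len(data), k, replace=False)]
for _ in range(20):
    # Assign each point to its nearest center, then move each center
    # to the mean of the points assigned to it.
    assignments = np.argmin(((data[:, None] - centers) ** 2).sum(-1), axis=1)
    centers = np.array([data[assignments == j].mean(axis=0) for j in range(k)])

print(centers)  # three centers near (0,0), (5,5), and (10,10)
```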
My book, after that, takes you through a couple of different things. It takes you through explainable models. It talks to you about probabilities - basically, what if you need to observe things, or make predictions in which you're not fully confident? So it teaches you about the Markov property - using local information to make probabilistic predictions - through hidden Markov models.
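[A minimal sketch of the Markov idea: the forward algorithm for a toy hidden Markov model in NumPy. All probabilities are invented for illustration. - eds.]

```python
import numpy as np

initial = np.array([0.6, 0.4])        # P(hidden state at t=0)
transition = np.array([[0.7, 0.3],    # P(next state | current state)
                       [0.4, 0.6]])
emission = np.array([[0.9, 0.1],      # P(observation | state)
                     [0.2, 0.8]])

observations = [0, 0, 1]              # an observed sequence

# alpha[i] = P(observations so far, hidden state = i): the Markov property
# means each step only needs the previous alpha, not the whole history.
alpha = initial * emission[:, observations[0]]
for obs in observations[1:]:
    alpha = (alpha @ transition) * emission[:, obs]

print(alpha.sum())  # total probability of the observed sequence
```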
And then the book shifts into the modern era of thinking about deep learning, and some of the trends in that. Deep learning is the big fad nowadays. It's the neural network paradigm: how do we model decisions, predictions, classifications, clusterings - all of those prior statistically-based machine learning techniques - the way that our brain works, as neurons that fire?
And how do we build architectures that represent ways to predict things just like our brain does? So it starts off talking about autoencoders, which are neural networks that encode information and decode it with a small loss in the data, and things like that. It then talks about reinforcement learning, which is the process of making decisions and receiving rewards, and then trying to make the decisions that yield the best rewards over time, instead of just in the instant.
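[A minimal autoencoder sketch in TensorFlow's Keras API, squeezing 784-pixel MNIST digits through a 32-unit bottleneck and reconstructing them; the layer sizes are arbitrary choices for illustration, not the book's. - eds.]

```python
import tensorflow as tf

(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

autoencoder = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(32, activation="relu"),      # encoder: compress
    tf.keras.layers.Dense(784, activation="sigmoid"),  # decoder: reconstruct
])
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(x_train, x_train, epochs=3, batch_size=256, verbose=0)

# Reconstructions are close to, but not exactly, the originals:
# the small loss Chris mentions.
reconstructed = autoencoder.predict(x_train[:1])
```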
And then it goes into deep learning networks that model the visual domain: the classic convolutional neural networks, which have fundamentally changed the way we do tasks like object detection, object [?] net recognition, and things like that - so that machines can achieve better-than-human results in some cases nowadays.
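[For a concrete picture, here is a minimal convolutional network in TensorFlow's Keras API for 28x28 grayscale images; the layer sizes are illustrative choices. - eds.]

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),  # learn local filters
    tf.keras.layers.MaxPooling2D(),                    # downsample feature maps
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),   # one score per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```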
And then the book finishes off with intelligent digital assistants and chatbots - how to do intelligent agents and natural language processing, NLP, tasks. That's basically the book.
Len: One of the really interesting things about the challenges for machine learning that I just find so fascinating - and not being an expert at it myself, I just read a couple of books - is the fact that you really have to confront the challenge of defining what a thing is.
You mentioned features before. So for example - we all know what a T-shirt is. But if you had to give a machine instructions to decide whether it's looking at a T-shirt or not, that's actually a really hard and interesting challenge. For example, a T-shirt lying flat in front of you has a horizontal symmetry: if you fold it in half, the two halves match. And so one way you can tell a computer, "Am I maybe looking at a T-shirt here?" is, "If I rotate the object in front of me, can it achieve a situation in which it's got that symmetry?" So, just to really drill down into the details of it - what is a label, and what is a feature, in machine learning?
Chris: Yes, it's exactly how you just described it, in the sense that a label is some string - or some set of strings - that we want to assign to a piece of data, to allow us to model some decision that we make about it. The mistake people sometimes make, and that I made myself early on, is thinking that there's one correct set of labels, or one set of features, that fits all decisions.
So the challenge is exactly like you said. They call what you just described a machine learning task, and they call the process of thinking about what things we should model about it feature engineering.
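[One way to express the kind of hand-engineered feature Len describes is a hypothetical symmetry score; this NumPy sketch assumes images are 2-D arrays with values in [0, 1]. - eds.]

```python
import numpy as np

def horizontal_symmetry(image: np.ndarray) -> float:
    """Return 1.0 for a perfectly mirror-symmetric image, less otherwise."""
    flipped = image[:, ::-1]               # mirror the image left-to-right
    return 1.0 - np.abs(image - flipped).mean()

# A T-shirt laid flat scores high; most asymmetric objects score lower.
shirt_like = np.zeros((8, 8))
shirt_like[2:6, 2:6] = 1.0                 # a symmetric blob
print(horizontal_symmetry(shirt_like))     # 1.0
```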
One of the challenges is the feature explosion problem. Which is, if you get really started down a path of saying, "Hey, I've just got to model everything." It's a car, and I've got to model its make, I've got to model its color, I've got to model down to - basically what type of glass was used to build the lights," and things like that. You can get into a feature explosion problem where you just - you have too many parameters to learn, and it's sort of impossible. Or you never really learn anything.
So what people are thinking about in machine learning nowadays - and this is why deep networks and deep neural architectures are so important right now - is that they almost take the feature engineering task away from you. They leave it up to the network itself to discern what the important features are, in the sense of which neurons fire or not, depending on some input and some expected outcome.
The advantage of these deep neural architectures is that you can model many, many different types of tasks with a single neural architecture and a single set of expected predicted values, or things that you want to learn. But the challenge - and they call this the explainability problem; there are several very large, 100-million-dollar-plus investments from DARPA and other government agencies aimed at it - is to try and make modern AI, which today standardly means deep neural architectures, explainable. That doesn't mean you shouldn't learn regression, classification, statistics and things like that.
And I actually make the argument - both in the book, and to my innovators in our innovation office, and to anyone in my classes that I teach - that you need to learn the basics of this stuff. You don't want to throw a neural architecture at everything, because a) it's overkill, and b) one thing that classification, regression and the other techniques have to their advantage is that they're directly explainable. The challenge with these neural architectures is exactly that explainability. Yes, you skip the feature engineering task of figuring out, "What are all the important things we need to model?"
Given enough data, the network really does figure that out for you. The challenge is: what is it figuring out? There's more on this than we could cover on this podcast, but ways of answering that are coming out - I'd recommend your listeners look at the DARPA XAI program, whose results are being published out in the open nowadays. And there are other efforts in the government to make AI explainable, and commercial companies are doing this stuff too.
The other major challenge in machine learning, I would say, while we're on the subject, is this. In deep neural architectures, there is this notion that the more data you throw at it, the more it will learn with confidence the right neurons to turn on and off, the right hyperparameters to tune, and so on and so forth - to make it do what you want for a specific machine learning task.
One of the challenges is getting a wealth of that labeled data, because it's expensive. If we're labeling cat videos and things like that - if I had to quantify it at a cent per label, and maybe I need a million labels to build a great model - what's that? We're talking on the order of ten thousand dollars.
But say I want to do a different task - Martian geology. I've got smart rovers tomorrow, and I want the rovers - since they'll have GPUs on them - to run deep neural nets, and automatically tell me when there's plain bedrock and outcrops. Because I know when that happens, the rover is going to have an easier drive. It's not going to have to expend as much energy.
Okay - how do we label Martian terrain? There's a lot of free public data that we can look at - Mars surface images. But where are the labels that tell me surface, bedrock, outcrop, and things? They don't exist. And I argue that producing those labels takes specialized postdocs with PhDs, and the cost of those labels - just doing it with humans - is on the order of a dollar to five dollars per label, right?
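[The back-of-the-envelope arithmetic behind those two scenarios, using Chris's figures. - eds.]

```python
n_labels = 1_000_000

crowd = n_labels * 0.01            # commodity labels at about a cent each
expert_low = n_labels * 1.00       # specialist labels (e.g. Mars geology)
expert_high = n_labels * 5.00

print(f"crowdsourced: ${crowd:,.0f}")                        # $10,000
print(f"expert: ${expert_low:,.0f} to ${expert_high:,.0f}")  # $1M to $5M
```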
And so the reality is, now your model costs millions of dollars and takes fifteen years to develop, not half a year. We've got to bridge that gap - because, to be honest, all of the really challenging problems, both in the government and elsewhere, are that second example. It can't cost that much to do. We've got to make it like the cat-video labeling problem. And so there are efforts right now - there's a DARPA program called "Learning with Less Labels" that we're helping to implement, and we're not the only ones working on it.
Len: Just to give an example of - if I understand it correctly - the explainability or interpretability problem. One thing that is described in the book is that if you get a set of data, you want to do the first stage of machine learning on, say, 60% of it, and keep 40% of it. You want to keep some portion unused, so that you can test what you developed from the first 60% on the remainder.
And one of the reasons you do that is that if you learned on 100% of the data, and then tested on 100% of the data, it could be that all the machine did was, as it were, memorize the right answers. It might not have figured out anything fundamental about what it's looking at. And so if you then went and tested it on some unseen data, it would just completely fail - because all it did was memorize what it had seen.
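[A minimal sketch of the 60/40 holdout Len describes, in NumPy with placeholder data. - eds.]

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(1000, 10))        # placeholder features
y = (x[:, 0] > 0).astype(int)          # placeholder labels

# Shuffle, then hold out 40% that training never sees.
indices = rng.permutation(len(x))
split = int(0.6 * len(x))
x_train, y_train = x[indices[:split]], y[indices[:split]]
x_test, y_test = x[indices[split:]], y[indices[split:]]

# Fit only on (x_train, y_train); report accuracy only on (x_test, y_test).
# A model that merely memorized the training set falls apart here.
```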
One of the reasons I asked you the question earlier on about the stakes involved in the work that you do, and your organization does - is that interpretability or explainability is a really big problem. So like, let's say - I mean, to pick an example from medicine.
Let's say we've got robotic surgeons, and someone needs a brain tumor removed. And we've got a machine learning system that - or a deep learning system that we put some information in, about the patient's situation. And then it comes out with an experimental - or with a surgical protocol for that particular patient's cancer removal.
Do you trust it? How do you know how it arrived at that decision? And so for example, if you ask a system, plan the - I don't know, I actually don't know the terminology. But like let's say I want to get a rover from Earth to Mars - I've got to plan how big the explosion's going to be when it takes off from the platform, and then I've got to plan a trajectory. Imagine if you had a machine, that - where you could press a button, and it could just magically do that. That would save a lot of money, but you wouldn't really know necessarily why it was telling you to do what to do. And are you going to strap yourself to the top of that rocket?
Chris: Yeah, that's exactly right Len. And here's one that you can use in your future podcasts, if they come up with anyone that's interested in Mars.
Here's a real one from JPL. Future Mars rovers after 2020 - starting with Mars Sample Return and the Fetch rover - will have autonomy at a scale where they might need to do a lot of actions without human intervention, and they'll also have the computing power on board to be able to do it.
One thing we've been strapped with on prior Mars missions is that we use radiation-hardened flight hardware to be risk averse - in other words, hardware that can withstand cosmic radiation, so that when it gets irradiated, which it will, it won't flip bits or mess up the hardware. Because of that, all of the Mars rover stuff you've seen to date is basically running off of the RAD750 chip, which has about the computing power of the first iPhone. So we're running off of a computer from 2007. All of the advanced machine learning and deep learning - all that stuff that you see terrestrially here - we can't do. It's all simulated, with a human in the loop, right?
But tomorrow's rovers will be able to do it, because they will have what we call "High Performance Spaceflight Computing" - a GPU-like, multicore chip that can do machine learning on board. In those scenarios, we've been working on killer apps for the rover. And one of the killer apps for the smart rover is what we call "Drive By Science." It was invented by a researcher here, Masahiro Ono.
And the concept is simple. Today we give a command to the rover and tell it to go do some science - drive a couple hundred meters and send us back a couple hundred pictures a day. Because of the light time from Earth to Mars - about an eight-minute round trip, right? - we can only get a couple hundred pictures a day to plan what to do tomorrow.
Tomorrow, what we would like to do is have the rover run a machine learning model to generate a million text captions for the images that it sees - know that it's bedrock outcrop, things like that - believe those captions, and send the captions back. We can send back a million captions for a million images, because it's only text. It's much, much smaller, and it makes more efficient use of that very small pipe. If we can run machine learning on board the rover, and if we can believe it - then we can do Drive By Science.
We won't miss things. The rover will be smart, in other words, and it can redirect and be more autonomous in where it goes. So that's just one example of a killer app where machine learning makes a big difference. This 2.6 billion dollar investment from our nation - and really our planet - in exploring Mars could be that much more efficient. It could go from 200 pictures to a million captions - roughly 5,000 times more observations.
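[Note: Here is a back-of-the-envelope sketch of that caption-versus-image tradeoff - eds. The byte sizes are illustrative assumptions, not JPL figures.]

```python
# Compare downlink cost: a day of raw images vs. a day of text captions.
IMAGE_BYTES = 2_000_000   # assume ~2 MB per compressed rover image
CAPTION_BYTES = 80        # assume ~80 bytes per text caption

today = 200 * IMAGE_BYTES              # ~200 images downlinked per day
tomorrow = 1_000_000 * CAPTION_BYTES   # a million captions instead

print(f"today:    {today / 1e6:.0f} MB for 200 scenes")          # 400 MB
print(f"tomorrow: {tomorrow / 1e6:.0f} MB for 1,000,000 scenes")  # 80 MB
print(f"coverage gain: {1_000_000 // 200:,}x on a smaller downlink")
```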
Len: On that note, as someone who has access to knowledge of things that the rest of us might not, are you worried about an employment crisis from automation?
Chris: It's definitely something that I'm concerned about. I think skills training isn't as simple as people make it sound when they say, "Oh, well, we'll just transfer the skills." Obviously there's a technology gap and a skills gap in automation, and I think it will displace people in some areas. I know they're already thinking about this with respect to smart cars and smart trucks and things like that that are coming out. But to be honest, they're thinking about it - I don't know that they have all the answers.
And right now the government is putting out draft regulations related to AI. They talk about ethical AI, and about crowdsourcing community input on the regulations, and so forth. I think that's just the start of the conversation.
Fundamentally, if we're successful in doing this - and if you think about it holistically, not just as a skills-and-training transition - what automation has the power to do is shift work. People think it will take away their jobs.
But really, if we're doing it right, it frees them to work on things that they're more interested in. That's the whole concept of JPL and the work that we do: if it becomes routine, commercial industry should do it, and we should be working on the next great challenge and the next great things. The challenge is the people who have been doing tasks with the potential to be automated - we've got to provide a path for them to transition, to work on something different or commensurate, rather than falling back on an old skill or a skill that's a dead end.
They have to be able to evolve, and to leverage their existing skills in some commensurate next task. If we're handing things over to machines, the least we can do for those people is transition their skills into something. And I'm skeptical about the idea of just retraining everyone to code - I don't know about all that. But we need an answer, and I haven't seen it yet.
Len: It's really interesting. You mentioned ethical AI - reading about autonomous vehicles is just a sort of hobby that I have. It's something that I really hope happens someday. And I'm not talking about autonomous vehicles on Mars, I'm talking about autonomous vehicles on our roads outside. And we're talking about interpretability, or explainability, and things like trust.
And when it comes to questions about ethical AI, I often think: why are we so trusting of people? An example I like to bring up is a very sad story from Quebec, where a father and his daughter were on a motorcycle, and they died because a woman had stopped her car on the highway just over a hill, to let some ducks cross the road.
Chris: Wow.
Len: And that's just one example. One million people die in car crashes around the world every year, and 20 to 40 million are injured. I just find it so fascinating that when machines are involved, all of a sudden people are like, "Oh, hold off here. I'm not sure I trust that thing." And it's like - you trust the tired drunk who just broke up with his girlfriend, in that car speeding at you at 60 miles an hour? Really?
I just bring that up as a sort of funny way of saying that there's reason to be optimistic about putting things in the hands of machines. Where the questions get really tough - and you brought this up - is when the machines are making the machines, and we're a step removed. And that step is perhaps not even conceptually transparent to us - in principle not transparent to us. It doesn't mean that we should give up, but it means we need to develop some supervening system on top of that, so we can manage this machinery that's making these decisions.
Chris: Yeah, totally Len. I feel the same way, and I feel that you homed in on exactly the challenge, which is the machines making machines, or the machines automating other things - that level of indirection. Some people talk about some of the founders and the tech company people being far out there because they said they're concerned about AI, and things like that. I actually think it's good to be concerned about things like that. Mostly because, like you said, we don't have a good grip on exactly the types of challenges that will be faced.
A lot of it has to do with the speed and the velocity of it. Because in those environments, the machines will be onto the third and fourth and fifth and sixth order things while we're still catching up. And actually, going back to your point - we've been automating things forever. If you look at the ag industry and things like that, it's actually not so new. Where it's new is in the commodity areas. It's almost like the highway itself is being automated.
And really, the last great modernization of the highway - at least in the US - was when they built the freeway system and the roads and things like that. That was a big deal. And now we're talking about robotic trucks being on the highway, and if you think about it, in less than five years that could change a lot of those shipping jobs. So to go back to your analogy, I would actually ask, "Do you trust more the poor gentleman who's been driving for 18 hours straight and is really tired, or the computer that's not going to get tired?" To be honest, the computer doesn't get tired. It might have other issues, but it won't have that one.
And so there are tasks where it's not just a money thing - I believe it really does make sense to automate them in certain ways. But for those second-order ones like you're talking about - anyone who says they have the answer to that now is really selling you a bridge to nowhere.
Len: Yeah, it's really interesting too, the kinds of solutions that can be arrived at through things like this. I remember reading an article a few years ago about, I think it was a Google data center, where there's a system, basically, for keeping the machines cool, which involves opening and closing vents and things like that.
They just set loose a machine learning system to manage the cooling, and it decreased energy use dramatically - by 15% or something like that. No one knew the basis for the decisions, as it were, that it had made. But you just sort of pressed a button and got a 15% reduction in energy use.
And so the potential - I'm not actually personally a techno-optimist. But if you understand the potential applications of this thing, there's just a ton of potential there, and it can often seem counter-intuitive.
Chris: Totally, Len. One real quick example analogy, and then we can move on or wrap up, wherever we're at next.
But the other quick analogy to that is AlphaGo. You've heard about AlphaGo, and that's exactly what happened. AlphaGo is Google's deep neural network system, where they basically trained two networks against each other to become better than the best player in the world at the game of Go. And the way they did it was adversarial, in the spirit of GANs - Generative Adversarial Networks - which are networks that train one another by fighting against each other, each trying to outwork the other. Which is an amazing way of learning. It learned to be the best Go player.
Well, for the longest time they didn't know why. And maybe they do now - maybe Google DeepMind does, and they haven't told any of us. But it's the same type of thing. They looked at it and said, "Oh my God, it can beat the best Go players in the world." But they didn't know why, right? It learned something, but it's one of these neural networks. It's like, "Why is this weight turning on here? Why is that one not?" And it's a lot of work to understand that.
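[Note: For readers who want to see the adversarial idea in code, here is a minimal sketch - eds. It uses the tf.keras API on a toy one-dimensional dataset, and it illustrates GAN-style training generally - not AlphaGo's actual setup, which used self-play reinforcement learning.]

```python
import tensorflow as tf

# Generator: turns random noise into a candidate "real-looking" sample.
gen = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
# Discriminator: outputs a logit scoring real (1) vs. generated (0).
disc = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

g_opt = tf.keras.optimizers.Adam(1e-3)
d_opt = tf.keras.optimizers.Adam(1e-3)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(real):
    noise = tf.random.normal([tf.shape(real)[0], 1])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = gen(noise, training=True)
        real_logits = disc(real, training=True)
        fake_logits = disc(fake, training=True)
        # The discriminator tries to tell real from fake...
        d_loss = (bce(tf.ones_like(real_logits), real_logits)
                  + bce(tf.zeros_like(fake_logits), fake_logits))
        # ...while the generator tries to fool it.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(
        d_tape.gradient(d_loss, disc.trainable_variables),
        disc.trainable_variables))
    g_opt.apply_gradients(zip(
        g_tape.gradient(g_loss, gen.trainable_variables),
        gen.trainable_variables))

# "Real" data: samples from a normal distribution the generator must mimic.
for step in range(3000):
    train_step(tf.random.normal([64, 1], mean=4.0, stddev=1.25))

# After training, generated samples should cluster near the real mean of 4.
print(float(tf.reduce_mean(gen(tf.random.normal([1000, 1])))))
```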
Len: Moving on to the last part of the interview before I let you go. I know you have to go in a few minutes.
In the last part of the interview, we usually talk about the author's approach to writing. And so, your book is in the Manning Early Access Program, so it's being published in progress. I was wondering if you could talk a little bit about your approach to that? Do you have everything planned out in advance? And then do you have scheduled times when you bang out the next chapter? How often are new versions going to be published? If you could just talk a little bit about things like that for a couple of minutes.
Chris: Sure, yeah. Manning's had the Early Access Program for a long time - even back when I wrote Tika in Action, they had it. It's a good program: it gives you access to chapter dumps, basically as the author is producing them.
For me right now, the first four chapters are online. I've got pretty near completed drafts of chapters through chapter nine. I just finished chapter nine, and there's going to be, I think, eighteen or nineteen chapters in the book. And so I'm almost halfway done. And so, what they'll do probably very soon is, release another cache basically of the next set of chapters.
My approach is basically that on a weekly to bi-weekly basis, I'm cranking out a chapter. And this is the second edition of Machine Learning with TensorFlow. The first edition was written by an author, Nishant Shukla, whom I reached out to a year and a half ago as I was reading his book and working through it all, learning machine learning and teaching it to myself.
I actually really enjoyed that first version of the book. It was amazing - a really good fundamental introduction to machine learning, and a fundamental introduction to Google's TensorFlow machine learning toolkit.
But what happened is, at the end of every chapter in the first edition, Nishant - being the academic that he was - would throw out flippantly or anecdotally, like, "Now that I've taught you convolutional neural networks" - let me just pick one - "hey, you might want to build a facial recognition model." And you could actually go get this open dataset called the VGG Face Data, build a facial recognition model, and apply your convolutional neural network knowledge to do that.
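[Note: As a flavor of that exercise, here is a minimal sketch of CNN-based face matching - eds. It uses the pretrained VGG16 ImageNet model from tf.keras.applications as a stand-in feature extractor; the book's project uses the actual VGGFace model and data.]

```python
import numpy as np
import tensorflow as tf

# Pretrained VGG16 (ImageNet weights) as a stand-in feature extractor.
extractor = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, pooling="avg")

def embed(image):
    """Map a 224x224 RGB image to a feature vector."""
    batch = tf.keras.applications.vgg16.preprocess_input(
        image[np.newaxis].astype(np.float32))
    return extractor.predict(batch, verbose=0)[0]

def similarity(a, b):
    """Cosine similarity between two embeddings; higher = more alike."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random arrays stand in for two face crops you'd load from disk.
face1 = np.random.randint(0, 256, (224, 224, 3))
face2 = np.random.randint(0, 256, (224, 224, 3))
print(similarity(embed(face1), embed(face2)))
```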
As it turns out, every time he said that, or gave a sample project at the end, I went about trying to do it. I teach graduate courses at the University of Southern California in big data and data science and things like that. And let me tell you - every time Nishant suggested one of those projects, it ended up taking me five to nine weeks, requiring a supercomputer, and the result ended up being publishable in its own right.
And so what I did is, I did that over about nine months at night, when my kids were sleeping and my wife's like, "What are you doing? What are you working on?" And I said, "I'm learning machine learning. This is amazing."
And basically, then I approached Manning. I said, "Hey, I think I have enough material for a second edition of this book, and I think I've got all the missing pieces to it." That was basically my pitch, and they signed me real quick after that. My approach is, I have all the code written using Jupyter notebooks, which are reproducible notebooks in Python. It's all using a newer version of TensorFlow - though not the 2.0 or 2.x series, because in my opinion that's not quite stable enough yet, and the architecture is a little too new. Most of the people I know, at least in the research community, are still heavily using TensorFlow 1.14, which is about 26 versions or so after the original book was written. So it's still a much newer version of TensorFlow.
Basically, I've got all these Jupyter notebooks, where I take all the first edition chapters and revise them. But then I'm adding more than double the chapters with new ones, which are all the long-form Jupyter notebooks in which I've applied all the techniques and said, "Here's how you can do it in real life." As someone who didn't know machine learning before, I've gone out and done all these suggested assignments - which ended up, in my opinion, being graduate-level, half-semester assignments in their own right - and you can see how to apply it.
So for regression, he suggested at the end of the chapter, "Go grab New York City's 311 data and try to predict call volume over a month." I take you through that in my chapter four. He talks about sentiment analysis for classification and suggests going out and getting movie review data - I get all the Netflix data and build a kickass sentiment classifier. He talks about convolutional neural networks and building a facial recognition model - I show you how to recreate the VGG face model: go re-download all the data, do it on your own. So that's basically the model for the book.
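[Note: As a flavor of the 311 exercise, here is a toy regression sketch - eds. It uses synthetic data standing in for the real 311 call counts, and the tf.keras API rather than the book's TensorFlow 1.14 notebooks.]

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for daily 311 call volume: a weekly cycle plus noise.
days = np.arange(365, dtype=np.float32)
calls = (4000 + 500 * np.sin(2 * np.pi * days / 7)
         + np.random.normal(0, 100, size=days.shape)).astype(np.float32)

# Encode day-of-week as sine/cosine so a linear model can fit the cycle,
# and standardize the target so training converges quickly.
X = np.stack([np.sin(2 * np.pi * days / 7),
              np.cos(2 * np.pi * days / 7)], axis=1).astype(np.float32)
mean, std = calls.mean(), calls.std()
y = (calls - mean) / std

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # linear regression
model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss="mse")
model.fit(X, y, epochs=200, verbose=0)

# Predict the coming week's call volume, un-standardizing the output.
next_days = np.arange(365, 372, dtype=np.float32)
X_next = np.stack([np.sin(2 * np.pi * next_days / 7),
                   np.cos(2 * np.pi * next_days / 7)],
                  axis=1).astype(np.float32)
print(model.predict(X_next, verbose=0).ravel() * std + mean)
```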
The next dump will include, I think, a total of three to four of my own new chapters on the way to chapter ten, along with revisions of all the other chapters. Just all the notes that I think you need to learn machine learning, and to use TensorFlow to do it. So that's my approach.
Len: Well, thanks very much, Chris, for that explanation of your approach to the book. Best of luck in your journey to completing it.
For everyone who's listening, that book is Machine Learning with TensorFlow, Second Edition, which you can find at manning.com.
And yes, thanks Chris very much for taking the time out of what I imagine is a beautiful day, to talk to me and to our audience - and for being so game to cover so much ground.
I should mention there's a lot more ground we could have covered. We didn't even mention your work at the University of Southern California on information retrieval and things like that.
So thanks very much for being a guest on the Frontmatter podcast.
Chris: Len, thanks for having me. It was a pleasure. You enjoy it up there and, yep, it's beautiful down here in Southern California. Hit me up when you're around.
Len: Thanks very much.
And as always, thanks to you for listening to this episode of the Frontmatter podcast. If you like what you heard, please rate and review it wherever you found it, and if you'd like to be a Leanpub author, please visit our website at leanpub.com.
As promised, here are some discount codes!
These four will get you a free ebook at Manning, but please note they have limited uses and will expire:
frntmr-4CE4
frntmr-C7AF
frntmr-82E6
frntmr-DCBA
Listeners to the podcast can also use this permanent discount code to get a discount on a purchase from Manning:
podmatter19

