#071: Reinforcement Learning with Matt Gershoff

Let’s pretend your goal as an analyst is to eloquently and accurately explain reinforcement learning. Now, let’s pretend that you get to try that explanation again and again, and we’ll give you an electric shock every time you state something inaccurately and a cookie every time you say something right. Well, you’re an analyst, so you’re now wondering if this is some clever play on words about cookies. As it happened, we didn’t give Matt Gershoff from Conductrics any cookies of any kind in his return to the show. Instead, we gave him a lifetime’s supply of opportunities to say, “Well, no, it’s not really like that,” which is a special kind of nourishment for the exceptionally smart and patient! In other words, the gang walked through a range of analogies and examples about machine learning, reinforcement learning, and AI with Matt, and no electric sheep were harmed in the process.

Schtuff in This Episode


Episode Transcript



00:04 Announcer: Welcome to The Digital Analytics Power Hour. Tim, Michael, Moe, and the occasional guest discussing digital analytics issues of the day. Find them on Facebook at facebook.com/analyticshour, and their website, analyticshour.io. And now, The Digital Analytics Power Hour.

00:28 Michael Helbling: Hi, everyone. Welcome to The Digital Analytics Power Hour. This is Episode 71. Can you calculate the exact time and date an anonymous visitor becomes good at the game, Super Mario Bros? Well, we’ve been looking into it. And apparently, there is this thing called reinforcement learning and artificial intelligence that has all the answers. Our problem was we’re struggling to come up with the questions, 42 and all that. So to be clear, this episode may or may not be about how computers dream of electric sheep. We have TensorFlow simulating the podcast in real time. So just bear with us, it will get better over time. I wanna introduce some of our standard co-host variables, in the form of Tim Wilson.

01:15 Tim Wilson: Philip K. Dick reference, good in the opening. Hi-de-ho.

01:19 MH: And Moe Kiss.

01:22 Moe Kiss: Hi guys. How are you going?

01:23 MH: Going great.


01:26 MK: Oh God, every time. Every time.

01:30 MH: What? No. I was ready. I said “Going great.” Is that okay? Is that a…

01:36 MK: That’s a great response.

01:39 MH: Adapt and learn. And I am Michael Helbling. All right, our guest variable is fulfilled by none other than Matt Gershoff. He’s the CEO of Conductrics. He is a second-time guest on the podcast, and a frequent mentionee on the show. Welcome back, Matt.

01:57 Matt Gershoff: Thanks for having me.

01:58 MH: Awesome. Well, Matt, we’re excited that you can help us reinforce what we’ve learned about reinforcement learning. ‘Cause that’s sort of where we’re headed tonight.


02:10 MG: Oh no.

02:13 MH: You see what I did there?

02:14 MG: Yeah. It’s too weird.

02:15 MH: No… To kinda kick things off, I think this goes all the way back to February at SUPERWEEK and the talk you gave there. And maybe as a starting point, define those relationships between machine learning, AI, and reinforcement learning to create a base level for our listeners.

02:37 MG: The one way to think about machine learning, which is basically I think an easier way to think about it without getting hung up on whether this particular method is part of machine learning, or whether some other method is actually machine learning, is really just to think about machine learning as the intersection between statistics and computer science. With computer science, it’s really the study of programs, right? And it’s usually programs that humans write… Human beings write the programs so that they’re executed on a computer. Machine learning is where we’re gonna use ideas from statistics, and we’re gonna learn, we’re basically gonna learn programs such that they execute for some sort of particular task. When you normally write a program, it’s in order to do something. And the idea here is that rather than having a human being write the program, we’re gonna have a procedure, almost like a factory program, a machine learning process, whose outcomes are gonna be programs. And the programs can be of various complexities. But the main idea is that, rather than having to write the thing, we’re gonna be able to learn the program.

03:43 TW: Stepping back a little bit, what is the application of machine learning? We say the intersection of computer science and statistics, the applications of it… I guess, coming from a digital analyst perspective, and I think I’ve sorta struggled with this and I have momentary glimpses where it makes sense and then it goes away. But in the context of digital analytics, or digital marketing, or optimization, or making the site work better with data as an input, with machine learning, does there inherently need to be a feedback loop where you’re trying to control things in addition to… Is it measure and control? Is that machine learning? Or, is reinforcement learning where you have to be in measure and control?

04:28 MG: Yeah, I think let’s not confuse… Let’s just start simply. And that’s just basically you have a task. We have some sort of task, and so what are some of the tasks that a digital marketer has? Well, maybe a task is to send out particular email such that you get people to sign up. Or maybe it is you have a website and you wanna have it present a particular product or offer that’s compelling to the particular user. You have some sort of decision, you have some sort of task that you need to take. And so the idea behind machine learning is rather than have us write the rule, as you might currently have done with a business rule where I’m gonna map some scenario to a particular action. If I see a particular type of user, say a new user, we should offer them this product. If we see a repeat user, we should offer them some other product. Instead of having the person write that, or the marketer write that, instead we’re gonna learn what those rules should be, what the mapping should be. And if you really think about it, just a set of rules really is like a program. It’s just sort of a simple program. That’s really all you really should think about is you have a task, and how do we go about learning a procedure for executing on that task in a way that has high efficacy, does really well on whatever performance metric is. That’s really what we’re trying to do.
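A minimal sketch of the contrast described above, written in Python: a hand-written business rule next to a rule learned from logged outcomes. The function names, the offer names, and the toy log are all hypothetical and only illustrate the idea that a learned mapping can take the place of a written one.

```python
from collections import defaultdict

# Hand-written business rule: a person decides the mapping up front.
def handwritten_rule(user_type):
    return "product_a" if user_type == "new" else "product_b"

# Learned rule: derive the same kind of mapping from logged outcomes.
# Each record is (user_type, offer_shown, converted).
def learn_rule(records):
    shown = defaultdict(int)
    converted = defaultdict(int)
    for user_type, offer, did_convert in records:
        shown[(user_type, offer)] += 1
        converted[(user_type, offer)] += int(did_convert)

    # For each user type, keep the offer with the highest observed conversion rate.
    best = {}
    for (user_type, offer), n in shown.items():
        rate = converted[(user_type, offer)] / n
        if user_type not in best or rate > best[user_type][1]:
            best[user_type] = (offer, rate)
    return {user_type: offer for user_type, (offer, _) in best.items()}

# Made-up log, for illustration only.
toy_log = [
    ("new", "product_a", True), ("new", "product_a", False),
    ("new", "product_b", False), ("new", "product_b", False),
    ("repeat", "product_a", False), ("repeat", "product_a", False),
    ("repeat", "product_b", True), ("repeat", "product_b", True),
]
learned = learn_rule(toy_log)
print(learned)
```

On this toy log the learned mapping happens to match the hand-written rule. The point is only that the mapping came out of the data rather than out of a marketer's head.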

05:45 MK: And just bringing it back, so before we kicked off the show, we obviously were going back talking about this topic a little bit. And one of the points you mentioned that I found really interesting was that often the role of an analyst and when we’re trying to understand, say for example, email sends, we’re looking at historical data, we’re looking at the past and then trying to use that data to inform our decisions going forward. And it sounds like one of the real benefits here is that because you’ve got this program that’s learning, it’s actually based on current and future behavior. Is that accurate?

06:19 MG: Well, it could be, it depends upon the data that you have, but if we’re just talking about it as a general approach, its goodness or efficacy, or its quality, is gonna be a function of what you put into it and your skills, or your analysts’ skills, or the skills of whoever is building the program or doing the machine learning, so it’s like art. And it’s okay if something is crappy, it still can be art. There’s a difference between art and good art and so there isn’t necessarily some bundling between goodness and machine learning. And machine learning is really the process of generating programs through either the interaction with the environment or through data. Then we can decide or assess how good it is and whether it’s been worthwhile, because there’s a cost and we’re adding complexity and we have to build these things, and we have to maintain the data, and we have to ensure it’s consistent with the experiences that we’re trying to automate. And there’s a lot of complexity that we have to factor in to whether or not in aggregate it’s been worthwhile to go down this road.

07:26 TW: And maybe taking one more run at this ’cause I do feel like there’s a group of people who are operating on an understanding of what it is we’re talking about and then there’s a group of us who maybe are still struggling. One of the things that I’ve now pointed people to and gone back and watched multiple times was from your SUPERWEEK presentation, and I think a lot of people have seen it, is the Atari Breakout scenario that was using Google DeepMind. Putting aside that it’s DeepMind and all that sort of fancy stuff, but conceptually, it made a lot of sense to me because it was, you have very limited levers you can pull, which, if I equate that to a website or something I’m doing, let’s say all I can control is the copy in this box or all I can control is one of these four offers. And then I’ve got a defined outcome with Atari Breakout, which is I wanna maximize my score. And then with what was a simple, and you alluded to it or you stated it earlier, you have a clearly defined performance metric and you have clearly defined what can be varied. It is a trial and error that is smart enough to start figuring out what works and what doesn’t. Is that a fair description of the Breakout example? And then I wanna try to equate that to a website.

08:52 MG: Yeah. I just wanna get back to Moe’s original question, so I just wanna make sure I’m clear about that. At a higher level, you can think about machine learning, and then underneath machine learning, there are different sub problems of machine learning. One of the sub problems is normally known as unsupervised learning. And that’s like when you guys sometimes are talking about cluster analysis or segmentation, it’s really about grouping things together or maybe you have a lot of documents and you’re trying to figure out what the topics are for the documents. And we’re trying to group like things and we’re trying to separate unlike things. And then there’s also something called supervised learning, where we already have the data and that’s something like let’s say you wanted to do image classification.

09:35 MG: You have a set of images and you’re trying to find out whether or not there’s a dog in the image, or there’s a person in the image, or a cup of milk in the image, whatever it happens to be. Maybe we already have images that have labels assigned to it. There really isn’t any interaction that needs to take place in the environment. We have the data and then we’re gonna learn a model; we’re gonna learn a map between the pixels and the images and the labels. And that’s also supervised learning, but that’s machine learning. Now, there is no trial and error there, there is no learning from the environment, there’s just a set of data that we have and so we know whether we got it right or wrong ’cause we have the truth. We already have the data set which says what the right answer is. And then there’s this third thing, which is what you were alluding to with what Google DeepMind is doing and it’s maybe a little bit about what we’ll talk about with reinforcement learning, which is where we don’t have the data.

10:24 MG: And so we actually have to go out and interact in the world and we don’t actually ever get the right answer. And it’s just like when you’re in a new town and you move to a new town and you’re trying to find out what restaurants you like. When you go to a restaurant, after you eat there, you don’t get a message coming back to you to say, “You just ate at the best restaurant” or “You just ate at the worst restaurant” for you. You just get a utility, like, “Wow I really liked that restaurant” or “I really hated that restaurant” or somewhere in between. You just get a reward metric, and so it’s just like… Reinforcement learning originally comes, I believe, from psychology, with this reinforcement learning with like a dog and operant and classical conditioning, and you give a reward for good behavior and you withhold the reward for bad behavior. In that case with Breakout, that was a case of where Google DeepMind was using reinforcement learning to learn how to play Atari 2600, for those who remember from back in the, I guess early ’80s, late ’70s? I don’t know.


11:20 MG: But the thing is is that Google DeepMind learned, did a lot of work with deep learning and reinforcement learning, blending the two together. They learned how to play these video games. And all that really is, is it’s a map between the pixels on the screen and whether to move the joystick up, down, left, right, or to fire the button. But the only way they could do that, they didn’t have data on all the moves, they actually had to play the game. That’s the distinction: either you have the data and you have the labels or the right answer for each classification or each regression problem, or you don’t and you have to use this trial and error learning. And that trial and error is much more like how us humans or animals behave in the world, is we act in the world. We do something in the world. We see how the world responds to us. We either like where we are in the world or we don’t like where we are in the world. Maybe you put your hand over the flame when you’re younger and you don’t like that, or you eat some ice cream and you do like that, and so you get positive/negative rewards. And over time, you start to learn what actions to take in different scenarios, based upon the reward. It requires these extra resources to actually play out different experiences or try out different actions in the world to see how well it works.

12:34 TW: And is there a piece of that where also the world is changing, so that, in the case of something like Breakout, you would expect the computer to just get better and better and better and better until it reached some logical max, whereas in the world that we’re in, the seasons change, the products change, the consumer changes. All of that is changing, so that if you’re running your… If you’re doing, and maybe continuous optimization is the wrong way to say it but… We talk about the classic A/B test, that you run an A/B test, you get the answer and we probably treat that… I know we treat that way too definitively. This is 13% better than this other, and we wanna treat it in that black and white way, and then we move on. We feel we kind of set it and say we fixed that. We came up with the better solution and went with it. And there’s, I think, a ton of problems with that mindset. But if the machine is running and the machine is continuously learning, it can find out if the flame is hot and it burned me now, but maybe a week from now, it’s not gonna burn me because I’m gonna keep learning and keep feeding information back. Is that a fair statement?

13:56 MG: Yeah, Tim. I think you bring up a good point, and what you’re talking about here, I believe, is the idea that the world might be non-stationary, right? Things can change over time. But that isn’t particular actually to the reinforcement learning problem or the learning from trial and error. That’s actually true of our classification problem. It could be also true of that, and it can also be true of our clustering or segmentation problem. I don’t want you to think or I don’t want your listeners to think that the idea of the fact that things can change over time or if we’re only looking at something in a particular season, or a particular day of the week, that’s an issue that’s only something that they’re gonna have to deal with in this trial and error learning idea with reinforcement learning. That’s actually true with the classification. There could be something with images where labels sort of change over time or the distribution of the images that you get changes and so you actually have to keep learning that as well. That’s an important point, but it’s not something that is unique. It’s not an issue that’s unique just to reinforcement learning.

14:56 MG: But it is true that if you’re doing something online where it’s adaptive, where rather than… I mean, there’s two ways to do machine learning. That’s not exactly true, but there’s… One way to look at it is that you can either do it in a batch, so you collect the data and then you build a model off of that batch. And so you have your learning from that big bundle of data that you have. Another way to do it is what’s known as online learning, and you can do online learning for unsupervised problems, for supervised problems, as well as for reinforcement learning. It’s not unique just to one of those types of machine learning. And online, it’s basically the data comes and you treat the data as events or sort of a stream. And you keep updating your model as the data comes in and you take advantage of the data as it comes in and what that lets you do is it sort of lets you forget older data or older history and put more weight on newer data.

15:53 MG: There’s this idea that part of machine learning or part of AI or part of learning in general is forgetting. In a way, in order to learn as things change through time, you have to be able to forget. You can model that or you can embed that in your machine learning algorithms. And so that would be an approach. That would be something you would want to do if you had a problem where you did have some sort of belief that you thought the world is changing over time or your models or your belief where you’re not… Your belief about the world is perishable. It’s like, “How long is this really gonna be useful for?” And then there’s some knowledge that we have that is 100 years old and still true today or still is useful today. And there’s other things that, as the world changes… Like knowing how to use a fax machine or whatever is really not that useful anymore.
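The batch versus online distinction, and the idea of forgetting older data, can be sketched in a few lines of Python. This is only one common way to put more weight on newer data (a constant step size, which gives an exponentially weighted average); the function names, the step size of 0.1, and the made-up stream are assumptions for illustration, not anything stated in the episode.

```python
def batch_mean(values):
    """Batch view: collect everything, then weight every observation equally."""
    return sum(values) / len(values)

def update_estimate(estimate, observation, step_size=0.1):
    """Online view: nudge the running estimate toward each new observation.

    A constant step size means older observations fade geometrically,
    which is one simple form of forgetting.
    """
    return estimate + step_size * (observation - estimate)

# Hypothetical stream where the world changes halfway through:
# the outcome is 0.0 for the first 50 events and 1.0 for the next 50.
stream = [0.0] * 50 + [1.0] * 50

estimate = 0.0
for observation in stream:
    estimate = update_estimate(estimate, observation)

print("batch mean:     ", batch_mean(stream))
print("online estimate:", round(estimate, 4))
```

The batch mean sits halfway between the old world and the new one, while the online estimate ends up close to the recent value. A smaller step size forgets more slowly, which would suit beliefs that are less perishable.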

16:43 MH: Okay. How much data do I need to be able to do this?

16:49 MG: Well, it really depends upon the problem. I would think, in general, you probably want to be… Honestly, you probably wanna be in a fairly data rich environment. But why do you want it?

17:04 MH: Data rich environment means terabytes of data?

17:10 MG: No. I don’t think so. In fact, if you’re doing any of this reinforcement learning stuff or you’re doing… Think about it this way. When you’re running an A/B test, you’re really trying to learn what action to take, right?

17:20 MH: Right.

17:20 MG: Well, often times… There’s two forks to why you might be doing an A/B test or an experiment. If you’re a scientist or you’re doing some sort of research, you’re trying to learn some general statement about the world. You’re trying to learn some generalizable inference about how the world works. And you might actually be doing that as a marketer. You might wanna understand some generalized behavior about how customers behave in certain circumstances that you could apply across the board. That’s a case where you wanna be fairly rigorous in your experimentation. But then there’s other types of problems which is where I think most companies are probably using A/B testing, which is really they’re trying to optimize a process. In that type of problem, it’s really about… A lot of times, you have people running tests and they don’t have terabytes worth of data. They have tens of thousands of observations or hundreds of thousands, but certainly not terabytes. There’s no need for that.

18:19 TW: Is it faulty to look at it as the size of the data set? I could do one e-mail blast to a million people and I could wind up with a lot of data, but that hasn’t been any data that I’ve been able to, I guess if I very, very…

[overlapping conversation]

18:40 MH: Serve longitudinally.

18:42 TW: Or if I’ve even very carefully designed it, sure. If I sent 10 variations and mixed it all up, maybe that would work. I guess I’m back to, it feels like it’s easier to understand and maybe this is ’cause, Matt, I’ve seen enough of the Conductrics stuff and how it works and it makes total sense when you’ve said I have three options I can control here, three things I can change here, three things I can change here and I know x, y, and z about every visit that’s coming in and I have a steady inflow of visits. And in that sense, sort of… And you and I, a couple of weeks ago had a little bit of an exchange. Like, I look at when Adobe went from Adobe Test&Target and everybody thought of them as a testing platform to Adobe Target, that part of that’s what they’ve been… Even Adobe, and I guess Google probably as well, I don’t wanna go down the vendor path. But this idea that you know a bunch of things about your visitors and some of the stuff matters and some of it doesn’t, and there’s a limited number of things you can control, either because you haven’t created content to provide various options or it’s just not something that makes any sense to be changed or the infrastructure doesn’t support it. And don’t you have to start by defining that universe of saying this is… And I want to have a steady stream coming in so that I can change things. No? You’ve got the look of, “What the hell did you just say?”


20:27 TW: “Oh my god, Tim. You are so wrong.” I’m trying to figure out how to get this out of like a very narrow, I can define areas where I say this totally makes sense and then I look at the other… I feel like 97% of questions that analysts are asked or trying to answer and I can’t bridge that gap.

20:45 MG: Yeah. First of all, there isn’t gonna be one way to do something. An analyst or anyone working in this space, they have a broad set of tasks that they need to perform. Now, if one of the tasks is to try to figure out how to map experiences to customers, I think that’s what we’re talking about. That’s what we, with Conductrics, let you do. That’s, I assume, what Adobe Target does and there’s others out there. And if that’s the task, there’s really two to three types of sub problems there. One, we’re trying out new experiences. Let’s use a… What’s a case? Maybe it’s a travel site and they’re trying to sell vacation bundles. And there’s maybe the family vacation, there’s the single vacation, and then there’s the adventure vacation, let’s say. There’s three possible product offerings that they wanna present to different users, to the users, and they’re not sure which one has the best efficacy and for whom.

21:49 MG: Number one, these are new bundles and so they have to first do some trial and error learning, and one way you might do it is you might use A/B testing. I think that’s the general, mental framework or the general approach that is usually discussed or promoted in the industry. But really, it’s not necessarily about A/B testing, it’s really about this trial and error learning. Because you don’t have data about these three different offers, so the only way to learn about them is to actually try them out. It’s starting to fall under our reinforcement learning framework, because you have trial and error learning and that’s one of the components. We actually have to see how well each one does, so we have to sample.

22:26 MG: And then two, we’re gonna need to do some predictive targeting. We’re gonna have to learn some sort of mapping between different types of users and the value of each one of these, the conversion rate of each one of these vacations. We’re gonna wanna look at, as we offer up each vacation package, we’re gonna also wanna look at attributes of the user. In the standard web setting, it might be where they’re coming from or it might be the type of browser, but it could very well be that there’s additional data that the client actually has, and it could be their tenure, it could be an RFM model, it could be a lifetime value calculation, it could be anything that is an input to this. And that’s where this part is similar to the supervised learning problem or a regression problem or a classification problem where we’re trying to learn a map between attributes of the user and then what vacation to offer them. Like what the conversion rate is for each one of those. What we need to do is we need to learn that relationship between the user and what offer is best, but the only way to collect the data is to actually go out there and interact with the world. And that’s basically the reinforcement learning problem or that’s basically what’s known as a multi-armed bandit problem. A contextual bandit problem.

23:44 MK: Can I just clarify? I wanna make sure I’m following this, because to be perfectly frank, this is something that I’m just starting to get my head around. Let’s take your e-mail example. We’ve sent out our three e-mails to our three groups. If you were using reinforcement learning in this context, would it be that… So the emails go out, person A converts on the family email, and then part of that feedback loop is that it says, “Yes, you did good.” This is the treat reward side coming in. We learn that this is the behavior that we want versus person B that didn’t convert so that’s the behavior that we don’t want. Is that breaking down what the learning’s actually doing?

24:26 MG: Yeah, it’s as simple as that. Let’s just use a website so it’s a little bit more continuous as opposed to adding a little extra complexity from the email idea, which is gonna be a little bit more discrete. A user comes in and you take a look at their attributes, like just data about them, if you have it. Let’s say we do have some data. Let’s just say it’s the day of the week, day part, and the geo, whether they’re coming from Sydney or Melbourne or Brisbane or whatever, and the browser. Okay, that’s not great data, but for this, it doesn’t really matter just to think about it. We take a look and we see what those features are, and then we randomly sample, and we select one of the vacation packages for that user. And then we wait, and then we say, “Well, what do they do?” Well, in this case, do they convert or do they not convert? Or if they do convert, what’s the value of the package, the vacation?

25:23 MG: And then, we get that information back, and then what we do is we… apportion the value of that conversion event, that reward event, that good thing that’s happened in the world, back to each one of those features that was active at the time. If they were coming in on Chrome and they’re coming in from Sydney at night, then each one of those features would get a little share of the conversion event. Their weights, their input in predicting the value of that particular vacation, let’s say with the family vacation, goes up a little bit. And then, let’s say, another user comes in, and we offer up the single package, and they don’t convert. We had predicted for them. We looked at the weights for each one of those features, and we come up with a little score, and now we’re overshooting because the person didn’t convert. We’re actually gonna decrease the value for each one of those features that was active at the time.

26:16 MG: This is just basically simple online regression. This is gonna be a little complicated talking about over a podcast, but that’s basically how it works. In an online approach, you take an action, you see what the reward is, and then you go back and assign… You either push up the value for each one of those features if you had a good thing happen or you bring down the value of each one of those features. And over time, it starts to converge and you can start saying, “Well, if the user is coming at night on Chrome from Sydney, the best predicted option is the family vacation option,” and if they have other sets of features, maybe they’re coming in on Facebook during the day and coming from Brisbane, maybe it’s the single vacation that has the best conversion prediction, and then you play that.
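The update Matt walks through can be sketched as a small online linear model in Python. This is a minimal illustration under stated assumptions, not Conductrics’ implementation: the class name, the learning rate of 0.05, the feature strings, and the choice of a plain linear score with a squared-error style update are all hypothetical. Each offer keeps one weight per feature; the predicted value is the sum of the weights of the active features, and the prediction error is shared back across those same features.

```python
from collections import defaultdict

class OnlineOfferModel:
    """One weight per (offer, feature); unseen weights start at 0.0."""

    def __init__(self, learning_rate=0.05):
        self.learning_rate = learning_rate
        self.weights = defaultdict(lambda: defaultdict(float))

    def predict(self, offer, features):
        # Score = sum of the weights for the features active on this visit.
        return sum(self.weights[offer][f] for f in features)

    def update(self, offer, features, reward):
        # Positive error (converted, we undershot) pushes the weights up;
        # negative error (no conversion, we overshot) pulls them down.
        error = reward - self.predict(offer, features)
        for f in features:
            self.weights[offer][f] += self.learning_rate * error
        return error

    def best_offer(self, offers, features):
        # Play the offer with the highest predicted value for this visitor.
        return max(offers, key=lambda offer: self.predict(offer, features))

model = OnlineOfferModel()
visit = ["browser=chrome", "geo=sydney", "daypart=night"]

before = model.predict("family", visit)
model.update("family", visit, reward=1.0)    # visitor converted
after_win = model.predict("family", visit)
model.update("family", visit, reward=0.0)    # a later, similar visitor did not
after_loss = model.predict("family", visit)

print(before, after_win, after_loss)
print(model.best_offer(["family", "single", "adventure"], visit))
```

In a real system the offer shown would still be sampled with some randomness, as described above, so that every offer keeps getting tried; `best_offer` only shows the "play the best prediction" step.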

27:01 TW: Let me try, I’d say flipping it around to what I’d think is the traditional way that testing and optimization teams will approach… They may start with their first test and say we have these three vacation options, and we’re gonna run a test, even split across these. We’re gonna run it and see which one performs best, and maybe nothing comes out as being best, but then we always say, “Ah, but don’t stop there,” even if there’s not a clear winner. Then go and do your segmentation after the fact because that’s where you’re gonna uncover the gold and then potentially you say, “Well, it seems like what I discovered was that people coming from Sydney prefer this option.” Now, let’s do maybe another test where we actually do some targeting, and it’s kind of a human, slow and very discrete steps of running for the entire population then take that batch of data and start trying to find ways that you can sort of tease it out. And then either decide you’re just gonna do targeting on that or do further testing. Is that a… Whereas what we’re talking about here is a little bit of turning that…

28:17 MG: It’s automating it.

28:17 TW: Upside down.

28:19 MG: Well, it’s automating it, really, right? It’s kind of…

28:21 TW: Automating it.

28:22 MG: Automating that process. And that’s gonna be a trade off, and automation in and of itself isn’t necessarily gonna be always the best approach. It really just depends upon the scale of the problem and whether your organization has the existing infrastructure and sort of sophistication to manage that, but there’s nothing wrong with what you first suggested. It isn’t necessarily wrong, especially if you’re gonna rerun a test after you think you’ve found some potential areas of opportunity, as long as you rerun those tests just to see if there is some sort of return. There’s nothing wrong with that, but what we’re really trying to talk about here, if you’re talking about machine learning and you want to apply it, you really wanna ask yourself, well before, and really just think through like, what types of returns do we think we’re gonna get by doing this?

29:09 MG: And the whole idea of it is to help augment and automate part of the procedures so that you can do this faster and at scale and at a complexity that may not be so easy to do with having an analyst involved. The idea is at this micro level, and so the idea is to push down some of these low level, high transaction, low risk decisions into the application layer itself, and free up the analysts to be doing more important stuff, or just to be maintaining this and managing it and help guide it because it’s not something that’s gonna be done in a vacuum. I mean, it’s still a hard problem. Just saying the word automation doesn’t mean it’s necessarily gonna do particularly well, and it certainly needs to have…

29:56 TW: But I think even in a static world, where you’re trying to decide between three vacation types where that’s fairly static. You could say well, the cost to do automation may not be worth me doing it as a smart, experienced, analyst. But, if I know that I’m gonna be rolling out new offers and I know seasons are gonna be changing, there’s this sort of image in my brain of, it’s running and it maybe starts off static, I’ve got these three vacation packages to deliver and it’s figuring it out and it’s cranking along, but now, I can come up with a different vacation package and I should be able to, I think, just kind of throw that into the machine and say, that’s all I had to do was come up with another idea. Maybe look at what seemed to be working and didn’t work to inform that offer.

30:46 TW: But now I don’t have to go through all of the steps of setting up planning, figuring my start date, figuring my stop date, figuring my new analysis, if I’ve automated that piece, it seems like, and I may be oversimplifying it, if I know that I’ve got certain areas where I’m making offers on the site, I know I’m gonna have new offers coming out, driven by all sorts of external factors. What I’d like to know is that I’m gonna be able to add that offer into the mix, and then let the whole thing reach a new equilibrium, figuring out that, oh, well now, instead of being new visitors from Melbourne, it’s now, new visitors from Sydney actually fit better here, and returning visitors from Melbourne, or something like that. Look at us pretending that we’re gonna drop… I’m gonna drop in Brisbane and Perth, and then I’m done.


31:37 MK: I’m loving all the Australian references.


31:40 MH: Yeah, how you going, Tim?


31:45 TW: I’m going great.

31:46 MG: You could definitely do that. And in the literature that’s known as, I think it’s, “The Restless Bandit,” and so, it’s basically, you have arms that are degrading, and you’re pushing in new arms. And we definitely let you do that in our software, but that’s not… It’s still hard, and you still need an analyst to be thinking exactly what the problem is. So yeah, in that type of use case, and you really just care about optimizing the process. You’re getting different types of information out from, when you’re doing something like RL, and the bandit is just a subset of a reinforcement learning problem, where you were adaptively making these selections over time. It’s really about trying to find the best arm, and we’re really trading off trying to find the best arm quicker, with uncertainty about the conversion rate or the value of the suboptimal arms. We’re kind of re-allocating resources. That’s really all an adaptive procedure is doing.

32:46 MG: When you’re running a regular test, normally it’s like, each option, A, B, and C, or each of the vacations, gets the same amount of data. And so there’s roughly the same amount of uncertainty in the estimated value of each of the vacations, each of your A, B, and Cs, roughly. ‘Cause there’s the same amount of resources, the same amount of samples have gone into each one. When you’re doing something adaptive, like a bandit, even if it’s doing something, some of the simpler procedures, we don’t really have to talk about each of the procedures’ different approaches unless you guys want to. But, regardless, it’s really about, over time, re-allocating resources from what we think are the lower performing arms or options, to the higher performing ones. It’s almost like, you’ve got your soccer team or your football team, and so you have all these players, at first. And they all get time on the pitch. But over time, as you start getting closer to the season or get closer to the championship, you start benching a lot of the players. And only the ones that seem to have been performing very well, are the ones who wind up playing the game on the pitch. It’s like that. At first, everyone gets a chance, and then you get less and less playing time if you’re not performing. And that’s really the same type of thing.
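The re-allocation Matt describes can be sketched in a few lines. This is only an illustration, not Conductrics' method: it assumes Thompson sampling with a Beta-Bernoulli model as the adaptive procedure, and the conversion rates, the function name `run_thompson_bandit`, and the seed are all hypothetical.

```python
import random


def run_thompson_bandit(true_rates, rounds, seed=0):
    """Simulate a Beta-Bernoulli Thompson sampling bandit.

    true_rates are hypothetical conversion rates that the bandit never sees.
    Returns the number of times each arm (option) was shown.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms

    for _ in range(rounds):
        # Draw one plausible conversion rate per arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Show the arm whose sampled rate is highest.
        arm = samples.index(max(samples))
        pulls[arm] += 1

        # Observe a simulated conversion and update that arm's record.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return pulls


# Three made-up vacation offers; the third converts best.
pulls = run_thompson_bandit([0.05, 0.10, 0.30], rounds=3000)
print(pulls)
```

Early on every arm gets "time on the pitch" because the posteriors are wide. As evidence accumulates, the weaker arms are sampled less often, which is the benching effect in the football analogy. The cost is that the estimates for the benched arms stay more uncertain than they would in an equal-split test.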

34:00 MK: That analogy actually really helps things click in. But, I’m just wondering, I’m thinking back to that…

34:04 TW: And, we’re out of time.


34:10 MK: Oh.

34:12 MG: Now, don’t listen to Tim.

34:14 TW: Okay, go ahead.

34:15 MK: I’m thinking back to that Atari example, which I watched and found very entertaining. I feel like, coming client side and bringing that analyst point of view, people have just gotten used to, or they’re used to A/B Testing, they pretty much understand what it’s doing, people are comfortable. As an analyst, if I’m like actually, here’s another thing I wanna try out, I can already see it. One of the big hurdles to get over, is that people will wanna look at the Atari game, in that very early stages, where it hasn’t learned yet how to do things well, and it’s not performing really well. How do you manage that, until it gets to a stage where, the reinforcement learning is performing well and really giving you those results? How do you manage that, in a business context of getting people on board, getting them used to trying something different when, yeah, early on it might not be great?

35:12 MG: Well, it’s even worse than that. It’s even worse than that. [chuckle] You’re talking to the wrong person. I’m a bad sales guy, so, ’cause I’ll tell you it’s hard.


35:21 MG: Everyone else always says how easy all this stuff is, and how their software’s gonna be, but, no, it’s an inherently hard problem and it’s harder than that because your assumption is actually one that, I don’t want the listeners to think is actually true. It’s not true, actually, that it doesn’t look like it’s performing good at first, and then, suddenly, it will perform well. You have no idea if it’s gonna perform well. Because, remember, it turns out that the marketing problem, and this is why the analyst and the marketer is so valuable, actually. They actually want this stuff to be hard. Because if it wasn’t hard, they wouldn’t be valuable. And it turns out that they are very valuable ’cause these are very hard problems, and they’re harder than, and in a certain respect, they’re much harder than the Atari problem because the Atari problem if you think about it, while it’s a very complex representation that the algorithm needs to map the pixels into in order to learn how to play the game well, we know that you can play the game. You already know that there’s a solution because you play the game; you can look at it and we know going into it that the images on the screen, the pixels of the screen, we can look at that and we can play in a way that performs well, we know there’s structure there.

36:32 MG: But let’s think about the marketing world, that’s not the case at all. How many times have you or your listeners run A/B tests and it’s been inconclusive? In fact, Tim’s first analogy there, of the discussion was it usually doesn’t work at first for the whole population. Well, think about it, that means that actually there is no structure, there is nothing to be learned. It’s A, B, and C are equally poor or equally good. That’s actually the hardest problem, and that’s something that actually at Conductrics we spent a lot of time working on, is that the problems that the marketer, or the agent or the intelligence, is gonna find itself in are often gonna be these very noisy problems. That’s why it’s really hard, and so really what it is, is it may never find anything because there may be no map.

37:20 MG: And so what you want the system to do is to not, you don’t want it to present the… Because remember, the output of this is a little program, and by program I just mean if the user has these attributes then there’s this, it’s we represent it as a decision tree but it can be represented as a bunch of numbers, there’s lots of ways to represent this program, but it’s really just a way of putting in input about the user and getting an output of what action you should take. And the thing is what we don’t want is for that program to be very long if there is no structure, because how complex that program is, is really saying how much structure there is in the problem. And what I mean by structure it’s like well, if a user is coming in from Melbourne and they’re coming in at night, then they have to get this vacation. But if they have this, if they’re coming in from Melbourne but it’s during the day but they’re actually on Chrome, then they should get this other vacation.

38:13 MG: And you have a whole set or list of rules. The more rules you have that actually make a difference that implicitly, it’s like the flipside of how much structure there is in the problem. If you think about it like the edge case, if there’s no structure, you don’t need any rules. It’s just like pick randomly; pick A, B, or C, it doesn’t matter. And so you have a really short program. You can do anything you want and it won’t matter. And that actually often is a lot of the scenarios, and it’s important for people to realize that you don’t wanna sell them… What’s important, I think, if you wanna manage expectations, is even if you could sell it if you were to make it seem like it’s magic, long term, that’s detrimental and you don’t want to sell people a bill of goods and you wanna be honest with them. And it’s like these are hard problems. Now, the question is if you think that your marketers are coming up with different experiences that do differentiate your users then, yeah, this is the type of thing you want to do because you can do it at scale.
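The "little program" idea above can be made concrete. This is a minimal sketch, assuming a user is a plain dictionary; the attribute names, the option labels A, B, and C, and both function names are hypothetical, and the rules simply restate Matt's Melbourne example.

```python
import random


def structured_policy(user):
    """Hypothetical targeting rules; each rule is kept only if it changes the outcome."""
    if user["city"] == "Melbourne" and user["time_of_day"] == "night":
        return "A"
    if (
        user["city"] == "Melbourne"
        and user["time_of_day"] == "day"
        and user["browser"] == "Chrome"
    ):
        return "B"
    return "C"  # default for everyone the rules do not cover


def no_structure_policy(user, rng=random):
    """If no attribute matters, the whole program collapses to a single line."""
    return rng.choice(["A", "B", "C"])


night_user = {"city": "Melbourne", "time_of_day": "night", "browser": "Safari"}
day_chrome_user = {"city": "Melbourne", "time_of_day": "day", "browser": "Chrome"}
print(structured_policy(night_user), structured_policy(day_chrome_user))
```

The length of the rule list is the point: the more rules that genuinely change the outcome, the more structure the problem has. A long program learned from a problem with no structure would just be fitting noise.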

39:09 MG: Now, what we do is that we actually, we have like a meta A/B test. You don’t have to do this but this is how we think this is best practice, and so we have our machine learning, our RL program, which is trying to learn this assignment between users and experiences and it’s doing all this adaptive stuff. What we do is we always have some subset of the users that are just getting randomly assigned just like you would in a regular A/B test. The adaptive, the machine learning is just another arm or is just another experience in this larger A/B test, this meta A/B test. And so we look at A, like the vacation, the three different vacations. Some users get randomly assigned to each one of those, and then we have, we keep an account of when users were assigned by our little machine learning, our little RL algorithm, and then we can see, well, how much difference is there actually when people are going through the machine learning versus if we were just to like randomly select over these three or we just picked A, B, or C.

40:10 MG: And then you can see what the marginal lift is, what’s the incremental return to targeting. Which is really what you want and in a lot of times there might not be. If you think about it, let’s say you had some sort of problem where… I know vacation is really lame, I wish I thought of a better example but it’s like some vacation that everyone wants, it’s to Tahiti. It doesn’t matter. You don’t have to target on that, maybe everyone wants to go to this free vacation to Tahiti. Targeting is gonna have no value because no matter what, the answer is always Tahiti. And so if the answer always comes back to Tahiti, then it’s you get no return from doing any targeting. You’re just gonna pick Tahiti.
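The comparison behind the meta A/B test can be worked through with expected values. This is a sketch under loud assumptions: the segments, shares, and conversion rates are invented, and all function names are hypothetical. In a real test these rates are unknown and would be estimated from the randomly assigned users and the machine-assigned users.

```python
def random_assignment_rate(segments):
    """Expected conversion rate when every option is equally likely (the control)."""
    return sum(
        s["share"] * sum(s["rates"].values()) / len(s["rates"]) for s in segments
    )


def best_single_option_rate(segments):
    """Expected conversion rate if everyone gets the option that is best on average."""
    options = segments[0]["rates"]
    return max(sum(s["share"] * s["rates"][o] for s in segments) for o in options)


def targeted_rate(segments):
    """Expected conversion rate if each segment gets its own best option."""
    return sum(s["share"] * max(s["rates"].values()) for s in segments)


def return_to_targeting(segments):
    """Incremental return of targeting over simply picking one winner."""
    return targeted_rate(segments) - best_single_option_rate(segments)


# Hypothetical world with structure: the two segments prefer different offers.
structured_world = [
    {"share": 0.5, "rates": {"A": 0.10, "B": 0.02, "C": 0.02}},
    {"share": 0.5, "rates": {"A": 0.02, "B": 0.10, "C": 0.02}},
]

# Hypothetical "Tahiti" world: every segment prefers the same offer.
tahiti_world = [
    {"share": 0.5, "rates": {"Tahiti": 0.10, "B": 0.02, "C": 0.02}},
    {"share": 0.5, "rates": {"Tahiti": 0.10, "B": 0.02, "C": 0.02}},
]

print(return_to_targeting(structured_world))
print(return_to_targeting(tahiti_world))
```

In the structured world, targeting is worth about four percentage points over the best single offer (0.10 versus 0.06). In the Tahiti world the return is zero, because the targeted policy and "always pick Tahiti" are the same policy.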

40:50 MG: It’s that type of thing which is what the analyst needs to be thinking about and how to approach it, and how to ask, well, what types of returns do we need to get on average for this type of approach to be useful, and then I would just position it as an experiment. Look, if you’re in an organization that actually understands the value of experimentation, not the ritual of doing experimentation, but you’re incurring some sort of risk, some sort of cost to collect information so that you can maybe use it in the future as some sort of future value, right? Perhaps.

41:25 MK: Yes.

41:26 MG: Then that’s how you position it. Let’s do this as an experiment and see how this works. You can’t guarantee anything, and I know that’s unsatisfactory but the reality is, is that if you’re in the world of statistics and we’re in this inductive learning, we’re using experience to make statements about the world to learn, our knowledge is coming from experience. There’s always uncertainty, and anyone who’s trying to convince you that they can give you this, “the answer is A” or “the answer is B,” is either ignorant or is disingenuous because you can only say what the answer is on average and you can only say, you have to assign some sort of certainty like we’re, with some sort of… You’re making decisions under uncertainty. And so the question is like we could do this extra, we can collect more information that would reduce our uncertainty and then you can estimate the value of reducing uncertainty in your problem. I know that sounded a little bit harsh but I do think it’s important that this can’t be magic. This is like scientific marketing. It’s the science of it and so we need to be very honest with our stakeholders.

42:40 MH: What I think I just learned there is that we should always have an AI running in the background that does nothing.

42:46 MG: That is absolutely not correct.


42:51 TW: So close.

42:53 MH: I had this, I had…

42:55 TW: He’s just saying that ’cause really he’s thinking, he’s like, “That’s a great idea for a product. Who can tell me I’m… ” [laughter] Well, he said disingenuous or ignorant and I was like, “Why not “and” ignorant.” So, you know.


43:11 MH: But that’s the other thing is going back to that issue that you just discussed, there’s also the prediction of external factors that an AI’s not gonna be ready for, right? And how good at adapting should we expect a reinforcement learning system to be, if I know Lorde is from New Zealand but if she finishes a concert and then she yells out to everybody, “I buy all my clothes from the Iconic.” And then all of a sudden, the website’s flooded with new customers, [laughter] the reinforcement learning, I don’t know if Lorde is big in Australia, maybe that’s not a thing, she’s big in the US.

43:48 MK: She is. She is, it’s all good.

43:51 MH: Okay, I could go Kylie Minogue but that would be…

43:53 TW: Lorde is a musician. Okay.

43:56 MK: Yup, got it.

43:58 MH: Tim, you would actually like some of her stuff, if you would like… Out of your folk prison.

44:06 MK: Oh geez.


44:07 TW: My Americana World.

44:09 MH: Yeah. Does that make sense? That’s a long way to ask a question but that’s Tim’s thing so I’m trying to steal it and… [chuckle] No, I’m sorry.

44:20 TW: Fuck, fuck you, Helbling.


44:24 MG: That’s always true. That’s not anything that’s an additional constraint on an automated approach. That would be true if…

44:32 MH: It’d be true if it was humans running it too.

44:35 MG: Yes. Yeah, so you, of course.

44:38 MH: It’d still be a challenge for any system, human run or machine learning procedurally generated.

44:45 MG: If it does not have, like you have these one-off events that are these random stochastic events, then yes. But now, as an analyst you might say, “Oh actually, what we need to do is model whenever these concerts are happening.” And then actually model day from one of these concerts. If she’s always up there talking about the Iconic and how awesome the clothes are, then you might wanna include that in the model. And include that in the bit of data about the world that you wanna be paying attention to. And if it’s a one-off, then we get back to the point about forgetting and… So what, you’re off by that one day and if there’s some effect to maybe the model but if you’re doing something online or you re-run the models or whatever, you can refresh the thing. But yeah, if you’re always having random events, if there’s no real structure and it’s just this chaotic world you’re in, then you couldn’t learn. The world has to be amenable to learning.
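The "forgetting" Matt refers to can be illustrated with an exponentially weighted estimate. This is one common way to let old data fade, assumed here for illustration rather than taken from the episode; the step size, the daily rates, and the function name `update_estimate` are hypothetical.

```python
def update_estimate(estimate, observation, step_size=0.1):
    """Exponentially weighted update: recent data counts more, old data fades."""
    return estimate + step_size * (observation - estimate)


# Hypothetical daily conversion rates: steady at 5%, one concert-day spike to 50%.
daily_rates = [0.05] * 10 + [0.50] + [0.05] * 30

estimate = daily_rates[0]
history = []
for rate in daily_rates:
    estimate = update_estimate(estimate, rate)
    history.append(estimate)

print(round(history[10], 4))  # estimate on the day of the spike
print(round(history[-1], 4))  # estimate 30 days later
```

With a constant step size, the one-off event nudges the estimate for a while and then washes out, so the model is "off by that one day" rather than permanently skewed. A recurring event would be better handled the way Matt suggests, by adding it to the model as an input.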

45:46 MH: Well, or those random events repeat themselves. What if annually, this event was random but it kept happening? Then it would learn.


45:58 MK: Oh geez. Matt, can I just ask a quick question?

46:02 MG: Yes, I know that that means it wouldn’t be random anymore. Just for the record.

46:07 MK: Can I just ask a quick question? If our listener is an analyst out there and they wanna start learning a little bit about this and I guess dipping their toe in, have you got any suggestions of, I don’t know, where to start, or a problem that might be a little less complex that they could kick things off with? Have you got any pointers?

46:25 MG: Oh, you mean other than looking at conductrics.com? Then…

46:30 MK: Other than that.

46:30 MG: Yeah.

46:31 TW: There are a lot of great blog posts on Conductrics. I will say that Matt has done a yeoman’s task of trying to explain some of these concepts on the Conductrics blog.

46:41 MG: Yeah, and they’re very non product-y too. There is actually real content on there and it’s not, there is no real sales component there. That was actually gonna be one of my last calls, but yeah, we’ll put up a set of resources for the listeners and you guys, if you are so inclined, you can put that on the podcast home page and there’ll be a set of lists.

47:08 MH: Awesome.

47:09 MK: That’d be great.

47:11 MH: Speaking of last calls, unfortunately, yeah, we’ve gotta start towards our wrap up. Heavy stuff but I think I learned about three or four new things. I am excited to bring back to the team that actually does this kind of work that I can talk to ’cause this is not my field but [laughter] you know, I talk to people who do some of this stuff. But yeah.

47:34 TW: You’re gonna open with the scheduled annual random event?

47:36 MH: That’s right. And now all rise for our random event.


47:44 MH: Is it a sale? You’ll never know. Okay. One of the things we like to do on the show is we like to go around and do a last call. Something that’s going on that we think is interesting, maybe be of interest to our listeners. Matt, you’re our guest. Do you have a last call?

48:02 MG: Yes.

48:02 MH: Besides the one you just did?

[overlapping conversation]

48:06 MG: That was actually the first call. That’s first call.

48:09 MH: That was first call. There you go.

48:12 MG: No. Actually, since we’re talking about… I know there’s a lot of folks out there who are interested in learning more about machine learning and deep learning. And so I highly recommend, there’s this really great guy. He’s over at Google DeepMind now, but he used to be at Oxford, and before that, at one of the Canadian universities, which by the way is actually, that’s really where a great deal of the world class research is coming out of, actually out of Canada at the moment. But this guy, Nando de Freitas, and he has a deep learning set of lectures that he gave at Oxford. And I know that it sounds super intimidating. But if you have some basic capacity in analytics, I highly recommend going to his YouTube channel and there’s about 16 lectures, 17 lectures. And even though it’s called, Deep Learning is the title of the class, the first 10 lectures really actually just start from the very beginning, talk about basic linear regression, talk about logistic regression. It’s actually a really amazing arc of starting from the very basics of machine learning and going and building in a very modular way, which actually is what deep nets do, they [49:23] ____ their module in learning from the very beginning to some of the more advanced stuff like the convolutional nets, which is what most of the image recognition software is around. If they actually wanna know the mechanics about it, a great resource.

49:38 TW: Nice. Michael? Michael you wanna go second?

49:41 MH: I would love to, and this is actually serendipitous as a random event that, Matt, you are our guest. But I recently started rereading the Foundation series by Isaac Asimov, and that’s what I wanna recommend to people to read those books if you haven’t read them or re-read them if you read them a long time ago.

50:05 MG: Are you at Second Foundation yet?

50:06 MH: No, I’m still on the first book so I’m working my way through the first one. But it’s hilarious ’cause I’m just like, “Oh my gosh. I forgot all these things about this.” So I’m having a good time.

50:16 MG: It’s also total data science, too. Second Foundation is like the data science, big data hypothesis, so I’ll be interested in your take on that.

50:26 TW: Yeah. We have Douglas Adams and Philip K. Dick in the intro and then all of a sudden you’re actually currently reading Asimov. I feel like you’ve… You’re covering the gamut.

50:33 MH: I’m going through a phase, is what I’m doing.


50:41 MH: Alright, who’s next?

50:43 MK: I’ll go next. I had to change mine anyway because I’ve clearly established that Tim was trying to steal my thunder once again. This is I guess for all the listeners out there who are still a little bit scratching their head and still when you say reinforcement learning aren’t kind of sure, like it hasn’t sunk in. In doing research for this show, I came across a blog that I’ve never actually seen before and it’s called… I’m gonna mispronounce it, but I’ll make sure to put the link in the show notes, it’s Analytics Vidhya?

51:18 TW: Analytics Vidhya, yeah. Analytics Vidhya.

51:20 MK: Okay. Of course, Tim nails it and knows it, big shock. But there is an article there on what is reinforcement learning, which talks about dog training which we did touch on in the show. But I’m obviously really big into my dog and we do clicker training and it was somehow about that example. It just helped everything sink-in. I really recommend that one if you’re still a little uncertain.

51:42 TW: That was, you actually posted that before the show and I got to read it. That was a very good, a good primer. Mine, just to stay somewhat technical, I guess, I stumbled across this in a funny roundabout way. Mark Edmondson has a slack group for basically anybody who’s using any of his Google R packages. He had thrown in there a link to Google Vision, which is a package for R that uses the Google Cloud Vision API. It’s comical as you follow the chain to the actual documentation for the package. The authors have in their usage instructions they say, “Oh, you need to authenticate with Google. Use the fantastic GoogleAuthR package.” They actually, in the comments of their sample code, commented on how awesome the GoogleAuthR package is, which is a package created by Mark Edmondson. I went in the complete circle and then circles or an image, so then I actually tried out the package. I don’t really know that I’ve got an analytics application for it, but it was fun. I took some pictures from various recent trips and you just throw the image, use just a little bit of code and you say, “Tell me what landmarks you see in this picture. Tell me what tags you see in it.”

52:54 TW: And it does a pretty good… If you’ve got a landmark, if you’ve got a picture on Red Square that, it will identify it as Red Square and give you the latitude and the longitude. That is leveraging the power of some of the stuff Google has without me fully understanding exactly the magic of how they’re doing it. But I may post a link to my silly little bit of code I did. I’ll certainly post a link to the couple of posts on it. But it really was, it’s almost if you were looking to dabble in R, it really is one that within an hour or two, if you’ve got your own images where you say, “Oh, I’d like to see. Can you find the faces? Can you identify the nose, ears, eyes and whatever other parts of a face there are in this picture?” it’s pretty cool. It’s kind of an idle aside, but kinda nifty. Google is doing some amazing stuff and people are writing R packages to hook into it.

53:48 MH: Awesome. Well, if you’ve been listening and you have thoughts or you’ve had a breakthrough in reinforcement learning and you really wanna talk to the analytics community about it, Matt Gershoff is on the Measure Slack. You can reach him there also on Twitter and so are we. We’d love to hear from you. And you can reach us on our Facebook page, on Twitter, on the Measure Slack about reinforcement learning, AI, whatever. Only Matt will really understand you, the rest of us will refer you to Matt.


54:22 MH: Although Moe is starting to become very dangerous.

54:26 MG: It’s not a competition.

54:27 MH: We’ll see.

54:28 MG: It’s not a competition.

54:28 MK: There was some pointing going on in their shows, I’m not sure what that was about.

54:32 MH: That was me saying that was a really good question and I was about to ask the same thing so.

54:36 MK: Mm-hmm, mm-hmm, sure.

54:38 TW: So you claim.

54:38 MH: No, it was.

54:39 TW: If you are in Measure Slack, the data science channel is the channel where people would most likely be talking about that.

54:48 MH: In any case, we’d love to hear from you. And if you listen to the show regularly and as a further qualification, you identify as someone who really likes the show a lot, [chuckle] and only if those two things are true. No, I’m just kidding, [laughter] we would love it if you would go rate us on iTunes. When people submit ratings via iTunes, what that does is somehow lines our pockets eventually, uh no. It does not do that.


55:17 MH: What it does is it helps create more visibility for the show. We’d like to get out to more listeners and if you created some ratings and reviews of the show, if you like it, I guess if you don’t like it… Nah, I can’t even do that, yeah. Only people who like the show. Sorry. [laughter] No, I’m just kidding. Anybody can rate and review, but we would love it if you were willing to submit a rating or review on iTunes, that would help create more visibility. Anyways, we’ve loved having you, Matt, again, you know we talk about you on the show all the time and in a good way though [laughter] so it’s okay.

55:55 TW: And we’ll see Matt in New York at e-Metrics here coming up end of October.

56:00 MH: Sooner than that. I will see Matt at I believe the DA Hub at the beginning of October, so yeah I get to see Matt…

56:09 TW: Which is just imminent.

56:11 MH: Twice in October, and yeah, it’s coming right up here. Anyways…

56:14 TW: Couple of weeks.

56:15 MH: But everybody listening, keep on doing that, we appreciate it. And for my co-hosts, keep learning.


56:27 Announcer: Thanks for listening, and don’t forget to join the conversation on Facebook, Twitter, or Measure Slack group. We welcome your comments and questions. Visit us on the web at analyticshour.io, Facebook.com/analyticshour, or at analyticshour on Twitter.

56:48 Speaker 6: So smart guys wanted to fit in so they made up a term called analytic, analytics don’t work.

56:56 TW: Ahh.

56:58 MK: You okay there, Tim?

57:00 TW: Maybe I should have had something stiffer.

57:03 MK: You okay, guys?

57:04 MH: Yes. We’re very excited about cuisine in Australia. Relative freshness. [laughter] Moe, I don’t know what you put in your tea but it’s slightly slowing your speech. [laughter] I’m just kidding. Tim, have you gotten down to a light simmer?

57:30 TW: I’ll be fine.

57:32 MH: Pardon me, I’m hunting for a pen while you’re talking. I don’t mean to be a distraction.

57:37 MK: Oh, shit. It’s gonna be a disaster.


57:47 MH: This is gonna be a fucking nightmare.


57:50 MG: Yeah, what do you want me to do?

57:53 TW: That was good.

57:54 MH: Yeah, anyways.

57:55 MH: Okay, good stuff.

57:57 MG: Wait. Wait, let’s do that again, Hold on, let’s do that… Let’s do that again, I’m gonna do Tim. That was good.


58:10 MH: Tim has had a rough week.

58:15 TW: Rock flag and reinforcement.


