#253: Adopting a Just In Time, Just Enough Data Mindset with Matt Gershoff

While we don’t often call it out explicitly, much of what and how much data we collect is driven by a “just in case” mentality: we don’t know exactly HOW that next piece of data will be put to use, but we’d better collect it to minimize the potential for future regret about NOT collecting it. Data collection is an optionality play—we strive to capture “all the data” so that we have as many potential options as possible for how it gets crunched somewhere down the road. On this episode, we explored the many ways this deeply ingrained and longstanding mindset is problematic, and we were joined by the inimitable Matt Gershoff from Conductrics for the discussion!

Links to Resources and Short Stories Mentioned in the Show


Episode Transcript

0:00:02.2 Announcer: Welcome to the Analytics Power Hour. Analytics topics covered conversationally and sometimes with explicit language.

0:00:14.0 Michael Helbling: Hi everyone, welcome. It’s the Analytics Power Hour and this is episode 253. Almost every time I’ve attended the SUPERWEEK conference in Hungary over the past seven years, a major theme is how much our industry is changing, and lately, especially with privacy regulations, the new laws that impact our industry. The other thing I usually take from that conference is new ideas about where the industry’s heading and how we adapt to these changes. And I think this conversation will be similar in a lot of ways. With new constraints on how, when, and where we collect and store data, it’s high time to embrace new paradigms and find new ways of thinking about data collection and usage in a privacy-first world. So let me introduce my co-hosts. Julie Hoyer, welcome.

0:01:10.3 Julie Hoyer: Hey there.

0:01:11.9 MH: That’s awesome. And Tim Wilson, who has been with me many times at SUPERWEEK. Welcome.

0:01:17.3 Tim Wilson: Thought we weren’t gonna talk about that.

0:01:18.7 MH: It’s on video.

0:01:19.8 TW: What happens at SUPERWEEK stays at SUPERWEEK?

0:01:23.0 MH: Stays at SUPERWEEK. Well yeah, we won’t talk a lot about it. All right. And I’m Michael Helbling. And our guest today needs no introduction, but let me do a little bit. He is the CEO of Conductrics, an amazing thinker and speaker, and is our guest again for the third time. Welcome to the show, Matt Gershoff.

0:01:44.5 Matt Gershoff: Thanks for having me. A real honor.

0:01:44.7 MH: It’s awesome. I’m thankful to have you too. And actually, as I was thinking about it, I realized you’ve been at most of these SUPERWEEKs as well, and your company Conductrics sponsors that event. I remember that very fondly; seeing you there all those years is also really fun. But one thing, Matt, on this topic specifically: you are consistently, for me anyway, one of the people in our industry who is about five years ahead of what a lot of people are talking about. And so I think it’s really interesting that one of the things you’re talking about now, in this world of new privacy laws and things like that, is adopting a mindset of just in time or just enough data, a sort of privacy-first mindset around data. So maybe as a starting point: what got that going for you? And what spurred that as a major area of thinking and writing for you over the last couple of years?

0:02:47.4 MG: Sure. Well, first, thanks for having me, and thanks, Tim and Julie. This is gonna be fun. Looking forward to it. Actually, just to step back a little bit: the work that we’ve been looking into and working on within Conductrics around privacy engineering and data minimization is really less about privacy per se, and really more about thinking about why we’re doing analytics and experimentation in the first place. And so I think for us, we have a slightly different view of the value of experimentation. Just so that the listener understands where I’m coming from: Conductrics is in part an experimentation platform where you might do A/B testing and multi-armed bandits and that type of thing, where you’re trying to learn basically the marginal efficacy of different possible treatments. And for us, we really feel like the value of experimentation is that it provides a principled procedure for organizations to make decisions intentionally, to make them explicitly, and to consider the trade-offs between competing alternatives.

0:04:10.7 MG: And ultimately, the reason for doing this is to act as advocates, sort of the front line, for the customer. And so we have a much more, I guess, hospitality or omotenashi approach to why one really should be doing experimentation. And I think that’s true of analytics more generally. Really, why are we doing it? And I think one of the issues that I’ve seen in the, I don’t know, almost 25 or 30 years that I’ve been in the analytics space is that sometimes analytics tends to kind of lose that focus. We tend to have programs that become almost ritualized. We sometimes start doing behaviors just to do them, and we kind of lose sight of why and what the ultimate objective is. And so part of the reason we’ve gravitated towards privacy engineering and data minimization is, one, that it’s really about respect and being customer-focused. But two, it really forces one to think intentionally.

0:05:38.4 MG: And we ask the question, what is sort of the marginal value of the next bit of data? Like, why should we collect this next piece of data or the added data? And to really have some sort of editorial and expertise about why we might be getting more information about the user when we might not really need it in the first place. And so this idea of intentionality is really what underpins both experimentation for us as well as why we were interested in moving towards having a more data minimization approach to the experimentation platform.

0:06:21.2 TW: So you mentioned ritualized behavior. As I recall, you came up with two mindsets and then added a third. You said there’s the mindset of I wanna get data just in case, just in case I need it. And that, I think, falls under that kind of ritualized behavior: gather all the data, not considering the incremental value of it. You contrasted that with just in time, and then you added just enough, I think, a little bit later. But does that fit? We’re kind of making a broad generalization in analytics, and I think even in experimentation, there’s a tendency to say that for the next bit of data, the cost to collect it is near zero, so let me collect it just in case, for down the road. And that has just kind of ballooned out: you add on a million additional data points, and now you’re just in the habit of collecting everything, and you’ve sort of lost the idea that you’re actually trying to figure out what you’re doing with it.

0:07:29.5 MG: Yeah, that’s a good question, a good comment. Really, if you think about it, most of the conversation around the GDPR and data privacy has been about compliance, about what you can’t do. And a lot of that is really procedural thinking: do you follow certain procedures for risk mitigation? But what I think the privacy legislation is really about is encouraging privacy to be embedded in technology and embedded in processes by default. It’s not that you shouldn’t collect data if it’s required. If you have a task and you need the data in order to achieve the task, no one’s saying that one shouldn’t collect it. It’s really about asking, for a particular task, whether or not the data is pertinent. And it’s about being respectful to users and not collecting more than is needed. Now, that privacy by default is in contrast to what I think a lot of the thinking had been, or currently is, in analytics and data science, which is really a data maximalist approach: collect everything by default.

0:08:46.0 MG: And again, as you say, there’s the marginal cost of the next level of granularity, right? We can think of more data as finer and finer levels of granularity for any particular data element, or as additional data elements, or as additional linkage. That’s the whole 360 idea, where every element or event can be traced back to, or associated with, an individual. So you have those three dimensions of expansion of data. And what I was really trying to point out is that a lot of that data collection is somewhat mindless. It’s just that just in case, and underpinning it there isn’t really an explicit objective, right? We don’t have a particular task, and we’re not collecting data for a particular purpose. In an experiment, which is what I was calling just in time, we have the task: I need to know the marginal efficacy of one treatment over another, one experience over another, and so I need to go out and collect data for that task. Versus just in case, which is really: I don’t know what the question is that I’m gonna ask, but I’m gonna collect it anyway.

0:10:04.1 MG: Now, why am I gonna collect it? Well, really there’s sort of a shadow objective, one based upon magical thinking: all of the value is in that next bit. It’s almost like the gambler at the table who’s losing and just has to believe that the next hand is where the big giant payoff is. That often gets rationalized in data science and venture land as fat tails, right? There are huge payoffs out there lurking in the shadows, and you just need to reach some threshold of critical mass in order to achieve them. And I’m not saying that doesn’t exist, but it’s unlikely that it exists at the probabilities that people think. So that’s one side of things, this magical thinking that all the value is in the data that I haven’t collected. And then secondly, it’s about minimizing regret. It’s like, well, I don’t wanna not have collected it in case I need it in the future, or my boss asks for it. And so we collect it. And that’s collection by default. And that is not consistent with privacy by default.

0:11:22.5 MG: And that’s really the law. That’s not to say, though, that discovery isn’t also important. So it’s not about being paternalistic and saying, don’t collect data, or there’s a certain way that you have to do it. Really, all we’re talking about is being thoughtful and intentional about it. So a company or a client may think, hey, if we had X data, then we could solve tasks A, B, C, and D, whatever. That seems totally reasonable to me. Then you have a reason to go collect that data and then check: okay, well, does it look like this data is informing these decisions or helping us make decisions? But that’s entirely different than just collect everything. And I think that just in case, collect everything approach, being mindless, with no objective for having the data other than to have it, really opens organizations up to grift. The sales pitch, which is: can you afford not to collect it?

0:12:33.8 MG: A lot of that stuff. And that’s prevalent in our industry. So I really think it’s about being mindful. It’s about this idea that the real value is not in the data or in any statistical method or any technology. It’s really in the editorial, the expertise, and really the taste. Does the company have the taste to be thinking about what is gonna be useful for their customers, to be cognizant of what the customers need, to have empathy for them, and to use information about them in a way that’s respectful? That’s really what underpins all of this.

0:13:16.2 MH: It’s time to step away from the show for a quick word about Piwik PRO. Tim, tell us about it.

0:13:23.1 TW: Well, Piwik PRO has really exploded in popularity and keeps adding new functionality.

0:13:28.8 MH: They sure have. They’ve got an easy to use interface, a full set of features with capabilities like custom reports, enhanced e-commerce tracking and a customer data platform.

0:13:40.4 TW: We love running Piwik PRO’s free plan on the podcast website, but they also have a paid plan that adds scale and some additional features.

0:13:46.7 MH: Yeah. Head over to piwik.pro and check them out for yourself. You can get started with their free plan. That’s piwik.pro. And now let’s get back to the show.

0:13:58.2 JH: Well, it’s funny, too, working with a lot of clients that do the just in case collection, because, again, it is widespread. It’s the norm across the industry, I would say. I have run into so many situations where they ask a very important business question, and we start with that question first, and then they say, we have all this data that we can pull in, we have so much, we should be able to answer this, no problem. And time and time again, I start getting into the actual requirements of what the data needs to be able to do to answer this great question, and then we find out that even though, just in case, they’ve been collecting all of it, it’s not in the right structure, or things can’t be joined the right way, whatever it is between the tool and the actual data structure itself, and we can’t answer the question they care about. And so it would still come down to defining, in that moment, going forward: what do we actually need to be collecting for you to answer this business question?

0:14:50.3 JH: And it’s funny because one of the examples I had was actually working in Adobe Analytics, or actually Adobe CJA. And we were bringing in a data set from, let’s say, like Salesforce. And I started to have this conversation with my stakeholders saying, you’re asking great questions, but you’re asking questions that we’re used to being able to ask the data that would come in through Adobe that we were used to for years with Adobe Analytics. And now you have this data coming in from Salesforce, which was structured and designed to answer different types of questions. And so they don’t map perfectly together. And so now we’re starting to talk to them about how could we rework this and actually bring in the data in a way to answer the questions you care about and that your stakeholders coming to you actually need.

0:15:36.2 MG: Yeah, the main thing is to be intentional. But to be fair, some of those companies that you’ve mentioned were, in the past, sort of masters of this collect everything and magical stuff is gonna happen pitch. And then all of the use cases wound up being error handling because the site was broken. So that’s not really a community that has been totally innocent of maybe overselling collecting data. I mean, data is not information. And I think it’s important to think about the entropy of what you’ve collected, like how compressible the data is. A lot of times you have data, but it’s not information: it doesn’t help you reduce uncertainty in a particular question that you’re asking, and that’s what information does. Just because there are bits being collected does not mean there’s more information.
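To make the compressibility point concrete, here’s a rough sketch in Python. The event strings are made up for illustration; a general-purpose compressor is only a crude proxy for entropy, but it shows how a mountain of bits can carry almost no information.

```python
import os
import zlib

def compressibility(payload: bytes) -> float:
    """Crude proxy for information content: compressed size / raw size.
    Highly redundant data compresses toward zero (lots of bits, little
    information); incompressible data stays near or above 1.0."""
    return len(zlib.compress(payload)) / len(payload)

# A hundred thousand "events" that are all identical: many bits stored,
# almost no information gained.
redundant = b"page_view,homepage,mobile\n" * 100_000

# The same volume of random bytes: nearly incompressible.
noisy = os.urandom(len(redundant))

print(round(compressibility(redundant), 4))  # tiny, well under 0.05
print(round(compressibility(noisy), 4))      # roughly 1.0
```

The redundant event stream shrinks almost to nothing while the random bytes barely compress at all, even though both occupy the same space on disk.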

0:16:41.3 TW: Well, my concern is that it’s already a problem. And you said it was kind of the laziness of avoiding thinking, of saying, well, just collect everything. I mean, the number of times I’ve had experiences where somebody said, oh, the data collection requirements are pretty straightforward: just collect everything. And it’s like, well, no, that’s lazy and simple for you to articulate. It’s actually showing that you’re not thinking through what you’re going to do. I feel like we’ve been in that mode for a while, with lots of forces pushing that idea, the idea of I wanna have the option to look at this data and hopefully it’s structured well. And now a chunk of the world of AI and the next generation of technology vendors are jumping on that train, spinning the, well, to do AI, the more data, the better.

0:17:40.7 TW: And we’re already running out of data to train the models. I’m afraid that’s pouring kerosene on an already raging, poorly functioning fire, because now people get to wave their hands and say, I’m doing this for the future of AI. It’s just the next level of a lack of intentionality: surely if I get even more data, then the AI will be able to run through it. But it’s really just amplifying the same problem that you articulated. Very clear and concise questions may mean that you need to collect a very small amount of data for the next month, as opposed to the boatloads of data you’ve captured over the last five years that actually aren’t that helpful, but that you’re gonna force yourself to go wade through. Having that historical data actually makes it harder to have the discussion of what’s the best data to collect, just enough of it, just in time, to answer that question, than if you had intentionality and said, I’ll just go forward.

0:18:50.7 TW: Oh, that’s that new data. And it’s like, well, new data, what are you talking about? We have this ocean of data. What can you do with that? Well, what I can do with that is a much more complicated, messier, actually less good at answering the question. But yes, we’re checking off the box that you can point to your just in case mindset is having, helped me answer a question. It actually wasn’t the best way to answer the question in many cases.

0:19:21.8 JH: Yeah, and so many times I get, just do what you can with the big messy historical data that we just in case captured, when I tell them, oh, well, to really answer this, maybe it should be different data, looking forward, in a test. And they’re like, eh, yeah, well, we don’t wanna do that, so what’s the best you can give us from the other stuff?

0:19:42.9 MG: Yeah, and just to be fair, I didn’t use the word lazy. I just think maybe just unaware. I think the value is in being aware and being explicit. That’s what data teams and companies should be doing, and I think that’s where the success is. It’s not in doing analytics; it’s analytics in the service of having a well-thought-out understanding and model of the customer and the environment that you’re in. But again, this isn’t to be paternalistic. It’s not for me to say what companies in particular contexts should or shouldn’t be doing. I just know that for us, when we re-architected the software back in 2015, we were aware of the GDPR, and we read up on privacy by design, which is a set of principles that came out in the mid ’90s, I believe, from Dr. Ann Cavoukian. There are seven main principles, and the GDPR and other privacy frameworks have incorporated those principles into their legal frameworks.

0:20:55.9 MG: And one of them is principle two, which is privacy by default. And so, and I think principle three or four might actually be by embedding. And this idea is that the software and systems should have these, should be privacy by default, by design, and it shouldn’t be like a bolt-on. And so customers should be able to use the services by default in a privacy-preserving way. And it’s really only in cases, you need to like move up from the default as opposed to the current approach, which is collect everything and moving down from that. It’s really inverted and it really should be, you should be collecting as little as possible to solve the task. And we just realized that actually experimentation at least, and I’m not saying everything, but at least in experimentation, many, if not most, and actually most of the tasks in A/B testing experimentation can be done following a data minimization principle, which means we really do not need to link all the information together. We do not need to collect IDs.

0:22:03.8 MG: And we can store data in what are known as equivalence classes. You can kind of think of that as like a pivot table. And so the data is stored at basically an aggregate level. But even though the data is stored in an aggregate way, which allows us to use ideas from privacy approaches such as K-anonymization, we can talk about that if that’s of interest, we kind of use ideas of K-anonymity to help the client A, be able to audit what data has actually been collected in a much more efficient way.

0:22:39.5 MG: So it’s very easy to know what you have and whether or not it’s in breach of any privacy guidelines you might have. But also it means that we can do the analysis in a much more computationally efficient way. And so there’s a lot of nice benefits from following or embedding privacy by design principles into your systems and procedures, which are beyond just having less data about the individual. The main thing is that it encourages this idea of intentionality, just being aware of what you’re collecting and why. But that doesn’t mean it’s appropriate in all cases. That’s not what I’m saying here. It’s just more of an option.

0:23:26.7 TW: Well, and Matt, because I’ve now read and seen you talk about this, it kind of blew my mind a little bit when it clicked. And I think that was an indication of how stuck in the standard way of doing things I was. If we just talk simple A/B testing on a website, let’s just go with A and B: you’re treated with A, you poke around on the website some more, you convert or you don’t convert, store a row. You’re treated with B, you poke around on the website, you convert or maybe you don’t, and there’s the amount. And it seemed like, well, obviously, you have to have every one of those rows, and then when you’re done, you pivot it and you compare the conversion rates, and you gotta do some other little t-test kind of math. And what kind of blew my mind is you were like, well, wait a minute, what if instead you just incremented counters? Because that step that I just glossed over, taking 10,000 rows of individual users and rolling them up so that I could do the actual calculations that are done behind the scenes... you were like, well, wait a minute, if what you need is a count, you can just increment how many people got A and how many got B.

0:24:45.6 TW: If you need the sum of how many converted, you don’t have to have all those rows; you can just increment a counter and say, you’re A, I need to track you in the session long enough to increment the counter. I don’t need to store a whole row, I just need to increment a counter. And then where it really clicked for me was: oh, and if you need the sum of squares, I can square each value and then do the sum. So you’re literally going from what was 10,000 rows to a couple of rows that you’re just incrementing. And that was kind of your point: I can give you all the results that you get from a standard A/B testing platform in a standard basic A/B test. And that’s just one scenario, but I didn’t even gather IDs. I just had to track you in a very limited, temporal way until I could log which class you went in and what the result was, and I can just keep incrementing that. So, one, did I state that fairly? Like, if the listeners are going, what is he talking about, anonymization?
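To make Tim’s counter idea concrete, here’s a minimal sketch in Python. It illustrates the general technique (keeping only per-variant sufficient statistics), not Conductrics’ actual implementation; the class and variable names are made up.

```python
import math

class VariantCounters:
    """Per-variant sufficient statistics: three counters, no per-user rows."""

    def __init__(self):
        self.n = 0            # how many users saw this variant
        self.total = 0.0      # sum of outcomes (revenue, or 0/1 conversions)
        self.total_sq = 0.0   # sum of squared outcomes, needed for variance

    def record(self, outcome):
        self.n += 1
        self.total += outcome
        self.total_sq += outcome * outcome

    def mean(self):
        return self.total / self.n

    def variance(self):
        # Sample variance recovered from the aggregates alone.
        return (self.total_sq - self.total ** 2 / self.n) / (self.n - 1)

def welch_t(a, b):
    """Welch's t statistic computed purely from the two sets of counters."""
    se = math.sqrt(a.variance() / a.n + b.variance() / b.n)
    return (a.mean() - b.mean()) / se

a, b = VariantCounters(), VariantCounters()
for outcome in [0, 1, 1, 0, 1]:   # tiny made-up conversion outcomes for A
    a.record(outcome)
for outcome in [0, 0, 1, 0, 0]:   # and for B
    b.record(outcome)
print(round(welch_t(a, b), 3))    # 1.265
```

From just the count, the sum, and the sum of squares per variant, you can recover the means, the variances, and the t statistic, with no individual-level rows ever retained.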

0:25:57.7 MG: Yeah, I don’t wanna get into too much minutiae here, because I don’t wanna lose the listener. But yes, you’re right. And really, the realization was something some of the listeners, I’m sure, are aware of, but some may not be.

0:26:13.3 TW: To be fair, you headed down the K-anonymization path before I tried to do my summary. So I don’t want to be like, oh Tim, oh Tim, you’re getting too detailed in the weeds.

0:26:25.2 MG: No, and really, let’s blame Julie, because we said beforehand that she was supposed to keep us from this. But just at a high level, it turns out that what underpins most of the analysis of the tasks that folks in experimentation need to do is really regression. It’s like least squares. We don’t have to go into how it’s done and all that stuff, but it turns out that one is able to do various regression analyses on data that has been stored in equivalence classes in a certain way. So the main takeaway is that we can store data in an aggregate way such that we can do most of the same types of analysis as if we had the data at the individual level. And so what are the types of tasks we can do? Well, as you said, we can do t-tests, which is the basic frequentist approach for doing an experiment, where we’re trying to evaluate the treatment effect and account for the sampling error. But also things like multivariate analysis and ANOVA, analysis of variance, which you might do for multivariate tests. You might be doing something like interaction checks. So maybe you have some sort of alerting system, like Conductrics has, where we’re checking whether one A/B test might be interfering with another.

0:28:07.3 MG: Underneath the hood, that’s really for the folks who know some stats in your listener base, it’s really just doing like a nested partial F-test between two regression models, a full model and a reduced model. All of those things can be done and even.

0:28:19.4 TW: I was gonna say that, but I was trying to keep it a little high level.

0:28:25.2 MG: It’s more than just t-tests. There’s a lot of buzz, and I think exaggeration, in the experimentation space around things like CUPED, which is really regression adjustment. Even that can be done on aggregate data. Now, the main point about it being aggregated is really about data minimization, which is, one, reducing the cardinality of any data field, that is, the number of unique elements that we might wanna store. So instead of storing the user’s pre-experiment sales data at some arbitrary precision of cents, maybe it makes sense to put it in some sort of 10 bins that each represent the average value of that bin, where the average value in the top bin is like $1,000 or something. The main idea is to reduce the fidelity and down-sample some of the data that you’re collecting, so that you have fewer unique elements within each data field, to collect fewer data elements, and maybe to decide when you wanna co-collect elements.

0:29:40.0 MG: So one can collect the data such that, let’s say there’s 10 segments, types of segment data that we might wanna collect within the experiment. We can store those as 10 separate tables so that you can do 10 separate analyses or you can have them stored, you can collect them, co-collect them. Maybe we wanna have these two or three collected at the same time or maybe up to 10. As you add, you co-collect data, you increase the joint cardinality, the number of unique combinations and that’s the thing that you kind of wanna manage. It’s like how many unique combinations of segment information do we wanna collect? And the measure that we might wanna use is the number of users that kind of fall within each one of those groups, each of those combinations. And maybe we wanna have at least 10 users that fall into each one of those combinations such that we’re never really collecting data on any individual user, we’re collecting data on collections of users who look exactly the same. And so that’s really that idea of K-anon is how many other people look exactly the same in the data set. And so you might wanna have some sort of lower bound on that, say five or 10. And that’s a good way to measure, it doesn’t provide privacy guarantees, but at least it’s a good measure to be aware of how specific or the resolution of the data you’re collecting about each individual.

0:31:19.4 MH: I like what you’re saying. I think one of the challenges I’m thinking of right now, and maybe it’s just a dumb question, is that a lot of organizations lack the underlying knowledge to make those groupings or buckets in the first place. And so my question is: how do they get that level of information or knowledge to be able to take that next step?

0:31:46.1 TW: Or is it that, emotionally, when they’re making the buckets, they’re like, but buckets are less precise, I need to be more precise. And that’s just the...

0:31:55.4 MH: I feel like that goes back to the first thing, which is that our nature is to just try to glom on to every piece of information possible. But there are also people with a genuine lack of knowledge. So let’s say somebody said, hey, I’m gonna fight my instincts and try to do this privacy by design, and now what I need to do is group users the way you just described to do K-anonymization. How do I know how to set those groups up so that they’re gonna be realistic?

0:32:21.8 MG: Well, how do you know what data to collect in the first place? You’re already making that decision at a certain level of granularity anyway; it’s implicitly being done. Secondly, I just wanna step back: the main takeaway here really is about at least being thoughtful about it. It may be that you don’t change your behaviors at all. That may be totally fine, and in whatever context someone is working in, it may be appropriate. One use case: let’s say you’re in a financial organization or healthcare, a regulated industry, where you have to collect the data anyway, and let’s say it is private data, but you wanna do analysis. There’s this idea of global and local privacy that comes from differential privacy. Global privacy is where you have a trusted curator, right? So you have the data. A good example of this would be the US government and the census. The data that’s collected by the census is extremely private information about citizens, and when that data is released, it needs to be released in such a way that private information about any individual is not leaked. In that case, the trusted curator is the Census Bureau, and they have a mandate to release information for the public.

0:33:55.1 MG: And so you could be in a situation where you’re an organization that has this information and you wanna do analysis. So you might wanna release the private data to your analyst team in a form that has been privatized in some way. One way would be to use data minimization and this idea of K-anonymity, but there are other approaches, like differential privacy. I just spoke at the PEPR Conference, which is a privacy engineering practice and respect conference. Meta is there and Google is there and whatnot, and they often have situations where they collect data and wanna build tools or analytics on it, but internally they release data that has been subject to either differential privacy or various data minimization principles. So that’s one of these approaches.

0:34:44.0 TW: Can you define it? How easy is it to give a high-level explanation of what differential privacy is and how it works?

0:34:53.2 MG: Well, I’m not an expert on it and it’s not super easy. But at a high level, as far as I understand it, I believe it’s the one approach that actually provides privacy guarantees. So you actually have a particular privacy guarantee around it. And the main idea is that you inject a certain known amount of noise into the data. So the data is perturbed by a certain quantity of noise, which is defined by what’s known as a privacy budget. You inject noise, usually either Laplacian noise or Gaussian noise, into the data set such that when a query comes back, it’s a noisy result. And it essentially has certain guarantees that you have a difficult time differentiating between two data sets: one that has a particular individual in it, and an adjacent data set that’s the same except it does not have that individual in it. The query results are consistent with or without that individual. And so that is probably terribly unclear to the listener, but the main idea is that you inject noise into the data set. It’s actually quite complicated. At first it looks amazing. We took a look at it and were thinking about doing it. And I believe the census now is using differential privacy, and it is useful in a situation where you need to release a lump of data.
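
The noise-injection idea can be sketched in a few lines. This is an illustrative toy, not the Census Bureau’s or any vendor’s implementation: a count query released via the Laplace mechanism, where the noise scale is the query’s sensitivity divided by the privacy budget epsilon.

```python
import math
import random

def laplace_noise(scale):
    # The difference of two exponential draws is Laplace(0, scale);
    # using 1 - random() keeps the argument of log strictly positive.
    return scale * (math.log(1.0 - random.random()) - math.log(1.0 - random.random()))

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.
    A smaller epsilon (tighter privacy budget) means more noise."""
    return true_count + laplace_noise(sensitivity / epsilon)

# A tight budget (epsilon = 0.1) gives a much noisier answer than a loose one (epsilon = 10)
print(dp_count(1000, epsilon=0.1))
print(dp_count(1000, epsilon=10.0))
```

Adding or removing any one individual changes a count by at most 1 (the sensitivity), so noise at scale 1/epsilon is what makes the two adjacent datasets hard to tell apart.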

0:36:35.2 MG: You need to release one particular query, like the census: they release the results and they’ve applied a differential privacy mechanism to it. It gets a lot more complicated when there are a lot of ongoing queries on the data, because there’s a privacy budget and there’s this idea of composition, simple composition, advanced composition. It’s actually deeply related to Neyman-Pearson hypothesis testing. And so these ideas about inflation of type one error rates and all that stuff are not completely dissimilar to the idea of consuming privacy budget. And so it’s not clear to me, one, how you would actually manage it in an organization and two, whether or not organizations would accept noisy data. People kind of freak out about that. But there is this trade-off, of course, between privacy and utility. Again, the interesting bit, I think the takeaway, is that privacy by default is the law, at least in Europe and to various degrees in different states. And what I found can be often frustrating is that most of the privacy conversation is around, again, procedure and compliance.
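
To make the composition point concrete, here is a toy sketch of a privacy-budget ledger (hypothetical, not any particular product’s mechanism): under simple composition, the epsilons of successive queries just add up, and the curator must refuse queries once the budget is spent, which is exactly the management problem for ongoing queries.

```python
class PrivacyBudget:
    """Toy ledger for simple (additive) composition of differentially private queries."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Refuse the query if answering it would exceed the overall budget
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)  # first query
budget.charge(0.4)  # second query
# budget.charge(0.4) would now raise: only 0.2 of the budget remains
```

Advanced composition gives tighter accounting than this simple sum, but the organizational question is the same: who tracks the ledger, and what happens when it runs out?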

0:37:54.5 MG: It’s like, you can’t do this. And that’s not productive. It’s like, well, help, give me some tools to think about what we actually can do, if you care about outcomes. What might be of interest for the listener is to look into privacy engineering, which is really a community and a set of approaches, design-based thinking to build systems that have privacy properties in them. And that gives a way forward to actually build stuff, to build stuff that has these privacy properties as part of it, as opposed to what I feel a lot of the privacy conversation is about: not doing stuff, people trying to block you from doing anything, very bureaucratic in its approach, very legalistic. This is a much more engineering approach, really.

0:38:50.0 MG: This whole conversation that we’re having is really just about providing an example of a company that has applied these privacy engineering principles to their software. Now it’s really gonna be up to everybody else to decide when and where it’s appropriate for them, but it is a way to actually build stuff as opposed to just not being able to do anything.

0:39:14.7 TW: So it’s interesting, I never read the seven Privacy by Design principles until prepping for this episode. And you bring up principle number two a lot, but principle number seven is respect for user privacy, keeping the interest of the individual uppermost. And I feel like that may be a cudgel that I start swinging around. Watching on LinkedIn, people are posting these diatribes: if you’re not taking your first-party data and pumping it into this other system and giving it to that, what are you doing? This is insane. And you quickly watch the comment thread. Some people say, yeah, I use my tool to do that. You have other people arguing about the logistical complexity of doing it. And then there’s a tiny little thread asking, is that in the individual’s best interest? Sometimes it is. I think you were using an example earlier that if you need data from somebody in order to provide them something that they want, it is in their interest to provide it. But that feels like another whole tranche of the MarTech industrial complex where there is nothing about that principle number seven, keeping the interest of the individual uppermost. Which I think is maybe just another little hobby horse I can mount and gallop around on.

0:40:45.3 MG: Yeah. Well, seven and two I bring up mostly because of privacy as the default. That’s the key bit: it should be the default. And I definitely think one should not be getting their guidance from the marketing tech industrial complex. That’s a problem because there are perverse incentives there. That industry is incentivized to push collect-everything and magical thinking; people will sell a magic box if people wanna buy a magic box. And I think that’s the antithesis of being thoughtful and mindful about why you’re doing something. Unless the optics of buying a magic box have value, that’s okay. It’s not for me to judge why you’re doing something. It’s just that one should have thought about why they’re doing something.

0:41:44.6 JH: But it feels like this way of thinking will end up being more productive for people long term. Because we are, to your point, going to continue to run into restrictions privacy-wise. And there are people still holding onto this idea that, I have all this historical data, and if I can just look backwards and understand each individual and watch their entire path through my website, I’ll be able to answer any question I need to make any decision about the business. But it feels like if someone could let go of some of that baggage, of the way the industry and the story’s always been told to us, you can start by saying, what is the best question to answer right now for the business to make a decision moving forward? And what’s a way to actually ask and answer it looking forward, by doing experimentation rather than trying to do a very complex historical analysis? And then you can go about actually designing and engineering the data, again, moving forward. I run into this so much with my clients, where you just get stuck in the cycle of looking backwards. It’s refreshing to hear tactical steps and a way of selling that forward-thinking mindset instead, and to see that it could be really freeing for probably a lot of companies.

0:43:03.7 TW: I don’t think it has to be experiments. You could even have stuff where, if you’re not tracking something and they’re like, well, what’s going on here? Say we’re at a physical store and somebody says, we wanna know how many people look at produce versus toilet paper. One option would be, well, we gotta have cameras mounted so we’ve tracked all of that, so we can answer it just in case you ask that question. Or, if all of a sudden that becomes a very important question to answer, say, cool, we’re gonna take all that money we didn’t invest in this super complicated tracking system that had to store everything, and we’re just gonna send some resources.

0:43:46.3 TW: It’s gonna take me two weeks to answer the question, but very, very precisely, ’cause I know exactly what you’re looking at. And it may not even be an experiment. It does seem like such a radical shift. I’m not optimistic that we’re gonna be able to effect that sort of a shift, because there are a lot of pressures that don’t want it. And Matt, I think to your point, it’s so easy to get sucked into the compliance mindset for privacy: my default is everything, so what do I have to turn off, or what layers do I have to put on, so that I’m backsliding at a slower rate from what I’m used to doing? As opposed to, and you hit on this quickly, the simplicity of the computation. There’s a simplicity to having no data and a really clear question and saying, what’s the minimal data I need to collect to answer that question? That in many cases becomes a lot simpler for a lot of the questions. Now, the problem is you’re leaving out a few questions that you could have answered otherwise, I guess.

0:44:57.2 JH: And just to be clear, you’re not tied to the old way they were collecting it. So many times you ask a good question and the data they have on that topic isn’t in a form you can even use. So I love that this frees you up to say, how exactly do I need the data to answer the question? Instead of, again, being married to the baggage of what’s already been done, and they’re like, well, I spent a lot of time and money and effort, so you gotta figure out how to use it.

0:45:19.7 MG: Also… That’s a great point. And also, just to be clear, this isn’t like Gershoff’s point, this isn’t like me, this is like, it’s encoded in the law. That’s what…

0:45:32.3 TW: It’s Gershoff’s law.

0:45:33.2 MG: No. Yeah. It has nothing to…

0:45:34.1 TW: It is now.

0:45:37.0 MH: 100%

0:45:37.7 MG: It’s not like I’m bringing this to the table. Privacy by design is embedded in things like GDPR, Article 25, and data minimisation in Article 5(1)(c), I think. So it’s not like I am suggesting that people do this special thing. This is what’s out there. This is part of the expected behavior, at least in Europe, I guess. And what are some ways that we might wanna think about it? And it also, I think, supports this idea, which I think is really the main point from my perspective: the value is not in this technology. It’s not in our software or other companies’ software. It’s not in any statistical method or in the analytics method.

0:46:28.7 MG: It’s really about being thoughtful about what it is you’re trying to do, being thoughtful about what the customer might care about, being explicit about how you’re allocating resources, and then thinking about things at the margin. And a nice added benefit of thinking about data minimisation in privacy engineering is that it is consistent with thinking that way. That’s really the main thing. That’s what’s nice about it: it helps us think through and have clarity about why we’re doing stuff. What you wind up doing is not for me or any of us to say; it’s really gonna be up to everyone in whatever context they’re in. It’s really just calling out that we can actually have outcomes. It’s not gonna be my last call, but it’s Jennifer Pahlka, who wrote Recoding America.

0:47:30.8 MG: There’s a really good episode with her on Ezra Klein’s podcast. And I think she has great clarity where she talks about procedural thinkers and outcome-based thinkers. She frames it in a way that I think about all the time: a lot of privacy conversation is really procedural. It’s like, have you followed this process? Have we hit the check marks? Yeah. Great. But it doesn’t tell you how to do anything. It doesn’t tell you how to improve your outcomes. Whereas the privacy engineering side of things is really outcomes-based: how do we actually do stuff? And I think the one theme that runs through analytics, and marketing analytics specifically, is outcomes. We really should be caring about outcomes and actually being productive.

0:48:26.8 TW: You can say that it’s not you saying this, but as you’re saying it, I think you’re pointing out: if you look at all of the hand-wringing around GDPR and different kinds of privacy legislation in Europe, and then, oh, these countries are saying that their interpretation is Google Analytics is not valid. As soon as that becomes the debate, it becomes: the regulators don’t understand digital, that’s not reasonable, and let us rationalise why the way that we’re doing things is fine. That just sucks all the oxygen out of the conversation: what’s the ruling gonna be as to whether this platform is allowed in this region based on this argument? And it feels like it by default moves four steps away from the underlying intent and the principle, and then has a debate in the wrong space. Where you’re pointing out, no, no, no, where it started is valid, and let’s not rip it away from there and go have an argument somewhere else that’s already missed the point.

0:49:48.5 MG: Yeah. And you don’t have to be part of that argument. That’s a decision that you make: is that what you care about? It’s not what I care about. We just wanna make good product that’s respectful of our users and consistent with some of these principles. And it has some nice benefits. Chatting with you all right now is really: here’s A, an example, and then also B, making sure we just don’t mindlessly collect data. Now, there’s a reason to push back on that, which is that privacy, or data minimisation, is the default. So make of that what you will. It’s really gonna be up to everyone else, but I think it’s valid just to point it out.

0:50:40.2 MG: But yeah, there’s a lot of nonsense out there, Tim. So what? If you’re getting your information primarily from LinkedIn, what’s LinkedIn? It’s a lot of people self-promoting their stuff. Are they really experts? You look at it, a lot of people aren’t, and there are a lot of nonsense multipliers. There are a lot of agencies out there, a lot of folks selling product and selling services, and what is new is often something that they can use to sell. You gotta step back and think about what the perverse incentives are, and there are a lot of perverse incentives out there. And I just think, I don’t wanna overuse the word intentional, but just being thoughtful and mindful is a protection against acting in a way that isn’t rational, and you can bump what they’re saying up against your actual needs to see if it’s consistent.

0:51:45.6 MG: And again, I sell software, so I have my biases as well, and I’m well aware of that. But this is stuff that is not made up by us, by me. It’s kind of the law, and just a way of thinking about it. And we’re not saying there’s one way to do things, and we’re not being paternalistic about it. It’s not for me or any of us to say how others should… Well, some of you are consultants, so I guess it is kind of for you to give guidance. But ultimately, the way we look at it, it’s almost like being a doctor: there are various treatments, and we may have a preference about what type of treatment we think works, but it’s ultimately up to the client to think through the trade-offs between different interventions and whether one approach works better for them.

0:52:47.1 MG: They are in a better position to know. It’s really our job to give them options, and ultimately, if they wanna do an approach that isn’t what we would’ve done, that’s totally fine. It’s not for us to say. Our job is to act in good faith and give them options.

0:53:05.6 MH: I love that we’ve got this conversation done now, ’cause I think we’re gonna be referring to it again and again over the next many years. This is good on a lot of levels. One, because when we start seeing vendors in five years talking about this, we’ll know where it came from. And two, as we seek out and pursue almost a new set of first principles as analysts around how incorporating privacy in a proactive manner works, it’s starting at this juncture. It’s a lot of food for thought. All right. This has been outstanding as per usual, and thank you, Matt. Thank you very much.

0:53:55.6 MG: Well thank you so much for having me. It’s been a real pleasure.

0:53:58.1 MH: It’s good. I’ve got a lot of thoughts going on, as I usually do when we talk, and none of them are very well formed and most of them probably don’t make any sense. So it’s gonna take a while. But this is really good, and I echo what you were saying, Julie: this is the first time I’ve looked at privacy stuff and not felt like, oh, they’re just crushing our fun and we have to follow all these rules. Now there’s a path forward and I can get excited about that. I’m intrigued and I wanna go learn more about how to incorporate that as a central part of my path out from here.

0:54:38.5 JH: Can I just say, I do, to echo that, Michael, I started to feel at the very end I was starting to culminate all my thoughts finally into something coherent of, I really like that this way of thinking gets rid of the fear of feeling like they’re losing something with the privacy laws out there and the new regulations coming. Because I feel like that’s what always the conversation is about is we’re losing this, we’re losing that, oh no, you wanna hold on tighter because you feel like things are being pulled away from you. But this kind of breaks that fear cycle and, yeah, it feels kind of like a new day. Like, oh, turn the page. There’s a new way to start. You can start fresh, it’s okay.

0:55:18.7 TW: None of our tools support it yet, but then we can start going and building that future.

0:55:21.9 MG: No. Not yet. Come on. Come on.

0:55:24.8 TW: Yeah. There might be one.

0:55:25.8 MG: That was quick. That was a quick… [laughter] That took all of 43 seconds.

[laughter]

0:55:37.8 TW: There’s always somebody who’s been thinking about this since back in 2015.

0:55:41.8 MH: Oh. Like I said, in five to seven years when some of the vendors start talking about this, you know where you heard it first. All right. One thing we would love to do on the show is go around the horn and share a last call. Something that might be of interest to our audience. Matt, you’re our guest. Do you have a last call you’d like to share?

0:55:57.1 MG: Sure. Actually, is it okay if… I have a couple.

0:56:00.8 MH: Yeah. Go for it.

0:56:02.0 MG: One is, since we were talking about this, and I just wanna be clear that I am adjacent to it, I’m not an expert in the privacy engineering space, but there are experts there. It’s just an amazing community, and I highly recommend anyone who’s interested in any of this to attend PEPR, the Privacy Engineering Practice and Respect conference. It just happened last month and it’s coming up next year. I can give you all a link if you wanna put that on the page for the podcast. Really some of the most inclusive…

0:56:36.9 TW: Which actually, is it through USENIX? We’ll link to it. The talk you did there is available on YouTube, right?

0:56:44.6 MG: Yep. It’s that conference, and really, it’s some of the smartest people you’ve ever met and also one of the warmest and most inclusive communities. It’s very Star Trek rather than Star Wars in vibe. So it’s great. And then, kinda more literary, we talked a little bit about cardinality and ideas of information: I recommend the short stories of Borges, the Argentinian writer, The Garden of Forking Paths and The Library of Babel. If you wanna be an in-the-know data scientist, a literary data scientist, those are two good short stories to have read. And once you start reading those, you’ll get hooked. So that’s my last call.

0:57:38.0 TW: Wait. I assume it will make it through the editing, but I was introduced to The Library of Babel by Joe Sutherland as we were working on this book. It’s actually in the book that we’re working on as an explanation and illustration. So I should actually read the short story, I guess, instead of just the Wikipedia entry.

0:57:58.6 MG: Oh, no, it’s great. Yeah, you should read both. And definitely Garden of Forking Paths, which is often referenced in research design, which is, people refer to that when talking about researcher degrees of freedom and reproducibility of studies and whatnot. So there’s a lot of the ideas that are adjacent to what we work on are embedded in these great short stories.

0:58:23.8 MH: Very nice. All right. What about you, Julie? What’s your last call?

0:58:30.7 JH: My last call is actually inspired by a previous show not long ago with Katie Bauer. I was looking through some of her articles and came across one titled Deciding If A Data Leadership Role Is Something You Actually Want To Do. It was an interesting read overall if that’s the point in your career you’re at, but I just felt like she broke down a lot of helpful ways she thought about deciding what next role she wanted. And she talked a lot about titles and the way she thinks about titles, which a lot of people run into at different points in their career. So I thought that was a great way of framing it. She then listed a bunch of great questions that she actually used when going through interviews for different roles, and I started to think about how they would be super helpful even for me as a consultant: can I ask, or can I figure out, the answers to these types of questions about where my stakeholder sits in their org, what their actual job is, what their role is compared to their peers?

0:59:36.2 JH: What is their manager like, who are they working with? What are their relationships like? And she just outlined a lot of different great scenarios of how data teams fit within organizations. And so whether you’re using those questions to ask when you are interviewing for new roles or like I said, I’m kind of inspired to use them in different scenarios. I thought it was a great read.

0:59:57.1 MH: Excellent. All right, Tim, what about you?

1:00:00.2 TW: So I feel like I’m gonna be pulling some of these, as we’ve turned in the initial full draft manuscript for the book, which means I’ve learned a few things that I’d either forgotten or that were new things coming out of the brain of Joe Sutherland. And one of them is an oldie but a goodie. It’s an academic paper published on the National Library of Medicine at the NIH, titled Parachute Use to Prevent Death and Major Trauma Related to Gravitational Challenge: Systematic Review of Randomized Controlled Trials. It’s from 2003, and it’s a brief academic paper where these two people basically kind of dared each other; the notes at the end hint at what happened. But basically they were saying, if scientific evidence really requires a randomized controlled trial for high-stakes things, then surely we should just go do a survey of all the randomized controlled trials around the efficacy of parachutes.

1:01:01.2 TW: They had a whole plan for how they were gonna find the outcomes, their meta-analysis, what they were gonna do. And the result: their search strategy did not find any randomized controlled trials of the parachute. So it’s a little bit of poking fun at the scientific community, but in a delightful way, with some pretty funny footnotes. And it actually did get published. It’s just a good reminder of being clear on the question you’re trying to answer and what your options are for answering it. So that’s random. What about you, Michael? What’s your last call?

1:01:39.6 MH: Well, it’s interesting. I had a conversation recently with my niece, who’s getting ready to start the school year and is taking an AP Statistics class, which I didn’t even know existed in high school. We started talking about some of the pre-work she got assigned, and I realized I was starting to explain some foundational statistics concepts that she was struggling with. And it reminded me of this book I read early in my career called The Cartoon Guide to Statistics, which was actually recommended to me by Avinash Kaushik way back in the day. Whenever I go back to those first things, I’m always reminded of that book. So that’s my last call. I think I may have done it before, but it’s been many, many years, and that conversation brought it back up. If you’re getting into statistics or you just wanna have a better foundation, it’s a great book to have on your shelf to pull off and read. Some of the stuff we talked about today I kept up with because I’ve read that book, and it’s a cartoon, so it’s easy. So anyway, The Cartoon Guide to Statistics.

1:02:43.8 TW: That’s funny.

1:02:44.6 MH: There you go.

1:02:45.5 TW: It’s on my shelf and I never could make it through it. I should. I should go back and read it now. I feel like I was… Didn’t… Yeah. I should try it again.

1:02:53.2 MH: It probably would make more sense. Yeah. ‘Cause you… What was funny was how much I realized I’d actually learned over the years about statistics in just trying to explain a couple things. And I realized like, wow, I actually know a couple of things about statistics now, which I think that’s important I should know. But it’s…

1:03:11.4 MG: And I think, if we’re being honest, all due to the Conductrics quiz.

1:03:16.7 MH: Oh yeah. Absolutely. Absolutely.

[laughter]

1:03:20.2 JH: Full circle.

1:03:21.5 MH: It’s a full circle moment, 100%. Well, this has obviously been such a great conversation, and I know as you’re listening, you may have questions, you may have input, there are things you might wanna share, and we would love to hear from you. The best way to do that is through the Measure Slack chat community; we’re on LinkedIn as well. And you could email us at contact@analyticshour.io. And I think, Matt, you’re pretty active on that community as well as on the TLC.

1:03:49.9 MG: Yeah. Highly recommend folks sign up for the Test & Learn Community run by Kelly Wortham. That’s a great space to learn about all things experimentation in an inclusive space.

1:04:02.4 MH: Yeah, absolutely, and we heartily recommend it as well. It’s a great place to explore these ideas and keep this conversation going. So we’d love to hear from you, and keep learning more about privacy engineering, privacy by design, k-anonymization, differential privacy, all new and amazing concepts for me today. All right. And of course, no show would be complete without a huge thank you to Josh Crowhurst, our producer, for all you do behind the scenes to make this show happen. We thank you very much, sir. And of course, thank you, Matt, so much for coming back on the show. It’s always a pleasure. It makes me reminisce about all the awesome times we’ve had at SUPERWEEK and other places. It’s always a delight to hang out and talk.

1:04:53.6 MG: Thank you so much for having me. I really appreciate you all welcoming me back and it was great to meet you, Julie.

1:04:57.8 JH: Yeah, you too.

1:05:02.5 MH: Awesome. And I think I speak for a random assortment of co-hosts that I may have, that I’ve incremented a couple of times when I say, no matter how you’re trying to drive forward with privacy, remember, keep analyzing.

1:05:18.9 Announcer: Thanks for listening. Let’s keep the conversation going with your comments, suggestions, and questions on Twitter at @AnalyticsHour, on the web, at analyticshour.io, our LinkedIn group and the Measured Chat Slack group. Music for the podcast by Josh Crowhurst.

1:05:37.2 Charles Barkley: So smart guys want to fit in, so they made up a term called analytics. Analytics don’t work.

1:05:42.3 Michael Wilbon: Do the analytics. Say go for it, no matter who’s going for it. So if you and I were on the field, the analytics say go for it. It’s the stupidest, laziest, lamest thing I’ve ever heard for reasoning in competition.

1:05:58.3 MG: Text was like, Tim and Mo were supposed to be cool, almost like secret agents and like just had their shit together. And Michael was just kind of like, did you ever see, what’s that movie with Matt Damon and Alec Baldwin? And it’s like all Boston and Wahlberg. And there’s that scene where Alec Baldwin is like the police commissioner and he’s all like frantic and he’s sweating and he’s just like, totally discombobulated. That was how I thought of Michael, which just like totally out of sorts, just… And, then Tim and Mo would just kind of come in and just be like cool cucumbers and like, just have their shit together. And Michael never played it correctly. And he edited it out. He wouldn’t say… Oh, but anyway. I sent… I had a dialogue for him. No. That was the whole bit.

1:06:53.8 JH: Oh, man.

1:06:54.6 TW: But how did you really feel?

1:06:57.6 MG: But Michael, I can’t believe, like I thought he would just like lean into it, but no, he was too embarrassed or he like didn’t like, he’s like, his ego was too great to play.

1:07:03.1 TW: He just didn’t commit.

1:07:06.0 MG: Yeah. He just didn’t wanna play it. I think, he just couldn’t play it up. He’s like, I’m too serious for this. I’m not gonna be the one who doesn’t know what’s going on. Well, you’re not the one who’s answering the questions. That was the whole point.

1:07:15.6 MH: I just didn’t understand the vision. I’m not cut out for high-level acting.

1:07:24.0 MG: Julie picked up on it. Julie picked up on it. That was…

1:07:28.2 JH: No, Michael said that verbatim in one of the episodes. He literally stopped midway into the quiz and he goes, why am I always panicking? Why am I so frantic in this?

1:07:36.7 MG: That’s the whole bit. That was like the narrative theme. Mo and Tim were just like the 007s.

1:07:46.1 TW: Rock flag and there’s a lot of nonsense out there. Nice.

[laughter]
[music]
