#274: Real Talk About Synthetic Data with Winston Li

Synthetic data: it’s a fascinating topic that sounds like science fiction but is rapidly becoming a practical tool in the data landscape. From machine learning applications to safeguarding privacy, synthetic data offers a compelling alternative to real-world datasets that might be incomplete or unwieldy. With the help of Winston Li, founder of Arima, a startup specializing in synthetic data and marketing mix modelling, we explore how this artificial data is generated, where its strengths truly lie, and the potential pitfalls to watch out for!

Articles, Events, and a Paper Mentioned in the Show

Photo by Anton Shuvalov on Unsplash

Episode Transcript

00:00:15.13 [Announcer]: Welcome to the Analytics Power Hour. Analytics topics covered conversationally and sometimes with explicit language.

00:01:14.75 [Michael Helbling]: Hey everyone, welcome. It’s the Analytics Power Hour, and this is Episode 274. And today we’re diving into a topic that sounds like maybe it came from a sci-fi script, but it’s actually very much part of the real-world data landscape. That’s right, synthetic data. You know, whether you’re using it for machine learning, protecting privacy, or just giving your dashboard something to chew on when the real data won’t play nice, it’s definitely having a moment in our industry. And unlike original data, it won’t ghost you with missing values or weird outliers or those inexplicable rows that look like someone fell asleep on the keyboard. We’ll talk about how it’s made, where it shines, and where it might fall short. So whether you’re deep in data science or just data curious, I think this podcast will be for you. And it’ll be kind of like synthetic data, hopefully generated with purpose and surprisingly useful. But first let me introduce my co-host, Val Kroll. How are you going? Or how are you doing? I’m so used to introducing you.

00:01:16.15 [Val Kroll]: Yeah, I’m not Mo.

00:01:18.64 [Michael Helbling]: I know. How are you?

00:01:20.70 [Val Kroll]: I’m doing good, Michael.

00:01:27.03 [Michael Helbling]: Happy to be here. And also joined by Julie Hoyer. Julie, welcome. How are you doing?

00:01:29.98 [Julie Hoyer]: I’m doing great. I cannot wait to talk about this topic.

00:01:57.83 [Michael Helbling]: I know. I’m excited as well. And I’m Michael Helbling. And so for this show, we absolutely needed a guest. Winston Li is the founder of Arima, a startup specializing in synthetic data and marketing mix modeling. Prior to that, he led data science teams at PwC Canada and Omnicom Media Group. He’s also a lecturer at Northeastern University and sits on their program advisory committee for the Masters in Analytics. And today he is our guest. Welcome to the show, Winston.

00:02:01.98 [Winston Li]: Thank you. Thank you. Great to meet you. Great to meet everyone. And thanks, Michael.

00:02:22.28 [Michael Helbling]: Awesome. Well, I think a great place to start on this topic of synthetic data is really just to talk about what it is. So, you know, how would you define synthetic data and what makes it fundamentally different from anonymized or sample data, let’s say? Yeah, very good question.

00:05:14.51 [Winston Li]: So synthetic data, to put it simply, it’s data sets that are generated by an algorithm as opposed to being collected from some sort of real event. So specific to consumer data, which is the space that we work in, we’re not going out to conduct surveys. We’re not tracking people. We’re not asking people to provide anything to sign up. None of that. Synthetic data simply means we develop a computer algorithm where we could generate data in a way that mimics real data, so to say. So we’re not just randomly spitting out numbers, we’re generating data based on certain patterns. Obviously, there are two things that we should note. One, we’re not making up data. A lot of people, when they say synthetic data, they think of the word fake. They think it’s fake data. It’s not. The algorithms that we use to generate synthetic data are indeed trained on real data. So it is based on learnings of patterns from real data, from which we generate synthetic data. So the first point is that it’s not fake. It’s just like real data. And it is very useful for statistical analysis. We’ll get to the whole discussion of privacy a little later on, which is the main motivation of synthetic data. But from a utility standpoint, in theory, it should be as useful as real data. The other thing I also want to point out is there’s actually a lot of synthetic data in our day-to-day lives that we simply don’t realize is synthetic data. We call it something different, but it is, in fact, along the same lines of the same motivation, so to say. If you imagine something like Midjourney, you know, we’re generating synthetic images. If you consider images to be data, then synthetically generated images like synthetic faces or, you know, synthetic pictures of different places, animals, sceneries, whatever, that’s a form of synthetic data too. We’re simply generating synthetic pixels, so to say, but based on the patterns so that they look like a picture in the end.
Same thing with, you know, even ChatGPT or Gemini. If you consider, again, if you consider words to be data, that too is a form of synthetic data. So the fact that we use computers to generate some form of information, let’s say, based on learnings from real information is much more common than we see and much more common than we recognize. And broadly speaking, that is synthetic data.
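Winston’s description, learn patterns from real records, then generate new ones, can be sketched in a few lines of Python. Everything below is illustrative: the income figures are made up, and a single per-column Gaussian fit is a deliberately tiny stand-in for the much richer joint-pattern models a real generator would use.

```python
import random
import statistics

def fit_and_sample(real_values, n):
    """Learn simple statistics (mean, stdev) from real data, then
    generate synthetic values from that fitted distribution.
    A toy sketch: real generators model far richer joint patterns."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
real_incomes = [42_000, 55_000, 61_000, 48_000, 53_000, 70_000]
synthetic = fit_and_sample(real_incomes, 1000)

# The synthetic sample mimics the real data's statistics without
# reproducing any actual record.
print(round(statistics.mean(synthetic)))
```

The point of the sketch is the one Winston makes: the output is not "fake" noise, it is sampled from patterns learned from real data.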

00:05:51.62 [Val Kroll]: I was 100%, I’ll admit, in the camp of thinking synthetic data just materialized out of thin air. But I have to say that I did have the benefit of getting to see you present, Winston, at MeasureCamp New York slash New Jersey a couple of months ago, which was a fantastic presentation. I learned a ton about it. I think one of the other questions I’d love to ask you as we’re kicking this off is: what are some of the really common use cases that, you know, people within our field are using synthetic data for? Like what problems is it really solving?

00:06:36.86 [Winston Li]: Yeah. In some ways, synthetic data does not pertain to new use cases, so to say. It’s not like there is something synthetic data can do that real data cannot do. People consider synthetic data more as a way to, let’s say, be able to do things that, you know, privacy laws don’t otherwise allow them to do. So in some sense, you know, people are doing certain things with synthetic data because trying to do the same thing with real data, while technically possible, is from a procurement or from a legal standpoint very, very difficult. So there’s actually nothing special about synthetic data other than the fact that it is.

00:06:42.09 [Val Kroll]: So should we just wrap it up here?

00:07:44.29 [Winston Li]: I don’t want people to think synthetic data is fundamentally different from real data in some way where you have to pick one or the other. The best analogy I can think of is to think of it like a photocopied document, if you will. You have an original document. You don’t want to use it. You’re scared to lose it for various reasons. You make a photocopy of it. That photocopy version will bring enough use, bring enough utility, just like the original document. It’s much safer to work with the photocopy version because you can write on it. You’re not afraid that you’re going to lose it. In a data scenario, obviously, you don’t have to worry about people suing you for doing various things. You don’t have to worry about revealing the identity of somebody. You don’t have to worry about people opting out. So it brings you all of the safety benefits while achieving essentially the same thing that real data will help you do.

00:09:09.64 [Michael Helbling]: Hey folks, let’s talk about data pipelines. You know, nothing says fun like scalable architecture and secure ETL, right? Okay, maybe not. But you know what is fun? Not actually sweating over whether your sensitive data is floating around the cloud uninvited. That’s where Fivetran’s Hybrid Deployment comes in. It’s the magical middle ground between “we need full control” and “please, somebody else manage this thing for me.” You get self-hosted pipelines in your own environment, but with all the ease of monitoring and updates handled by Fivetran. It’s like owning a fancy car and having Fivetran’s pit crew keeping it running 24-7. Cloud-based, on-premises, secret volcano lair, it doesn’t matter. Fivetran’s got a unified platform that will manage it all, with governance and security features that keep compliance teams calm and engineers caffeinated. Bottom line: your data is secure, your life is simpler, your pipelines are always on. Go to fivetran.com/aph and start a 14-day trial today. That’s F-I-V-E-T-R-A-N dot com slash aph. Go ahead, give your data the treatment it deserves.

00:10:12.96 [Julie Hoyer]: So my head always goes to: how do you make sure it reflects the real world enough, though? Because I know you said it’s not something that people just make up. It’s supposed to be based on the real world. But I can’t wrap my head around the use case you’re talking about, because I’ve talked to colleagues about this in the past. And I was really uncomfortable with this idea of, we didn’t call it synthetic data, we called it modeling missing data. And it was all around people not opting in. Okay, well, what would that traffic be doing if they had opted in? Like, can we fill those blanks in our data sets? And in my head, right or wrong, I was like, I can’t compute how this will work. How can it not be biased? And how do we know about people who inherently don’t want to opt in, who are going through the steps to opt out? Maybe they act completely differently than the people we can sample to do the modeling on. So maybe we want to go down that rabbit hole, like how do you make it not biased, or try not to?

00:11:38.84 [Winston Li]: Uh, yeah. Well, the data sources are also changing, are also being updated regularly. So building a synthetic data set is very much like writing software. You don’t just write code and kind of leave it there, you know, code will go stale. Somebody has to maintain it. Same thing with the data sets over time. So first of all, even things like panels, every year or every six months they have an update, and at least in our case, we pick that up and we rerun, retrain the models. In some cases, a data source becomes stale over time, like maybe a source loses its reputation or their panel becomes bad. We also had data sources, this is a case in Canada, we had a source where they tried to save money by going to a different panel that’s less trustworthy, and their data sets sort of deteriorated over time. So part of the work is not only to generate synthetic data, but also to source the ingredients, right? It’s kind of like cooking, you know, your job as the chef is not only to cook the food, but also to know where to buy and what food to buy. That’s also part of the equation as well.

00:12:22.82 [Julie Hoyer]: That makes total sense. But then it kind of makes me feel like the specific use case I was referring to, modeling missing data, may still not be a good candidate for synthetic data unless you believe your assumption that those that you are tracking are representative enough of those that you’re not tracking. And I think to your point, it sounds like you’d want some other type of research or something to hopefully validate whether that assumption would hold or not, but otherwise you’re kind of going under the use case of, like, I’m gaining more data on the population I know, you’re not necessarily replacing the unknown population that you have really no data on.

00:15:07.11 [Winston Li]: There is always sort of the consideration of how much can you extrapolate and how much extrapolation is appropriate. And this is not only the case with building synthetic data or the synthetic generation models we developed. You know, this is one of the first things I learned as a statistician when I was in second or third year of university. When they teach us to build regressions, they might say, if you build a regression model, let’s say, where you forecast housing prices based on the square footage of the house, and if the range of your data was from, let’s say, 1,000 square feet to 5,000 square feet, if that was the range of your input data, then you cannot ask the model to forecast a house that’s 5 million square feet. That’s outside of your scope. And that’s too much extrapolation. So the same kind of consideration here with synthetic data, too, is how much extrapolation is appropriate. And obviously, you know, the whole point of building models is we want to extrapolate in some way. If we didn’t want to extrapolate, then there would be no need for a model, right? So, you know, an appropriate scenario might be something like, let’s say, if we were to look at pet ownership, if we didn’t have any data, let’s say, in the state of Arizona, but we had data from other states on what kind of people own dogs and what kind of people own cats. And then for the state of Arizona, we have, let’s say, the general population data. That’s a pretty good extrapolation. If you generally know what kind of people own dogs and what kind of people own cats, especially for the surrounding states, you can make the assumption that people living in Arizona will likely behave the same way. And that sort of extrapolation is appropriate.
Whereas if your question was, I don’t know, how long somebody could survive on Jupiter or something like that, where there’s absolutely zero data whatsoever on that particular topic, any extrapolation you’re going to make is inappropriate. So there is a little bit of a judgment call as to what is appropriate, and there’s definitely no, you know, sort of fixed formula to say, well, just plug these numbers in and, you know, here comes the synthetic data and then we’re done. It’s definitely not that. There is certainly a level of maintenance, both in maintaining the source of the data, in maintaining the methodology, but also in keeping your judgment up to date, or keeping your assumptions up to date, so to say.
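The square-footage example above can be made concrete with a tiny range guard: fit a line on the observed 1,000–5,000 square foot range, and refuse to predict far outside it. The exact prices are made up for illustration; the guard itself is just one simple way to operationalize "that's too much extrapolation."

```python
def fit_line(xs, ys):
    # Ordinary least squares for one feature: price ~ a + b * sqft.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    lo, hi = min(xs), max(xs)

    def predict(x):
        # Refuse to extrapolate beyond the observed input range,
        # echoing the 1,000 to 5,000 sq ft example from the discussion.
        if not lo <= x <= hi:
            raise ValueError(f"{x} sq ft is outside the training range [{lo}, {hi}]")
        return a + b * x

    return predict

sqft = [1000, 2000, 3000, 4000, 5000]
price = [150_000, 230_000, 310_000, 390_000, 470_000]
predict = fit_line(sqft, price)
print(predict(2500))  # interpolation within range: prints 270000.0
# predict(5_000_000) would raise ValueError: too much extrapolation
```

In practice the judgment call Winston describes is softer than a hard cutoff (nearby states are fine, Jupiter is not), but the idea is the same: know the range your model learned from.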

00:15:48.41 [Val Kroll]: Yeah, it sounds like it. Absolutely. So one of the questions that I had, because you kind of teased this a little bit earlier, Winston, about how synthetic data is helpful in some of the contexts for privacy concerns, or where personally identifiable information is really sensitive. And a lot of times, you know, the solution is like, oh, we’ll scrub that data, or we’ll de-identify it, or we’ll aggregate it. But sometimes then that isn’t the type of data you need for the problem at hand either. So is synthetic data walking this balance between those two in some way? Or if you could just expound a little bit on that and talk about some of the ways that synthetic data helps overcome some of the challenges that we’re seeing more and more with PII?

00:19:54.30 [Winston Li]: For sure. Sanitizing the data often is not good enough. And it is increasingly so because the lawmakers lag technology, so to say. Like, you know, a piece of technology becomes available, or a new kind of way of tracking people becomes available, and then, you know, data sets get collected. And then people look at it and go, oh, this is dangerous. Then, you know, the lawmakers realize it many months later, and then they update the privacy laws on using this kind of data and not being able to use this kind of data. So let me give you an example. One of the data sets we have access to is mobility data. Mobility data basically tracks people’s location based on your lat and long. If you carry your phone, it tracks you through a number of apps. They have a whole portfolio of apps where you’re asked to share your location. And once you agree to it, they will ping you every couple of seconds, or every minute, or every few minutes, and they will get your exact lat and long. So this is a type of data that people sell on the market. You can fairly easily acquire that data set. It is not considered PII in the sense that there are no names attached to it. There’s no email. There’s no phone numbers. The only identification you get is what’s called a device ID, which is usually IDFA or, you know, depending on whether you’re on an iOS phone or an Android phone, but usually a hashed device ID, so you would not be able to identify the person, but you can stick it into a media buying platform and target that person. So that is not considered PII because it’s hashed. However, every couple of minutes, you’re going to get a ping for somebody, and you’re going to find the exact lat and long of that person, and phone GPS is pretty accurate these days. I guess it goes down to maybe 10, 20 feet in accuracy.
So you could totally look at somebody’s phone, plot it on the map, figure out where they are at 3am at night, and then knock on their door, and you know exactly where that person has been in different parts of the day. So that is an example where just anonymization isn’t really enough. It’s a little bit of a gray area now. There’s no policy on what mobility data can be used for, what exactly is considered PII, and so on. And then if you imagine joining that data set with, let’s say, something like the census, where you might be looking at a very small area, the smallest census block in the US has less than 50 people. So you could pretty easily identify somebody. If you had someone’s device, knowing they lived there, and then took the census block with, you know, less than 50 people, you could very easily guess a lot of data about that person. So that’s just one example where each of the individual data sources is anonymized and quote-unquote privacy compliant, and yet they each reveal something about that person. And when you piece all these different pieces of information together, you get a pretty good idea of who that person is. So this is why just sanitizing data isn’t enough in most use cases. And this is the whole point of why we sort of recreate, you know, synthetic copies of data sets, to deal with that problem altogether. So if there’s one takeaway from what I just said, that would be to uninstall all of the apps on your phone, if possible. Otherwise, you’re going to get tracked, and your information is going to get bought and sold, and it might end up in people’s databases.
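The small-census-block risk Winston describes is essentially a k-anonymity check: an "anonymous" device whose nighttime pings all land in a tiny block is effectively identified. A minimal sketch, with made-up block IDs and populations, and a threshold chosen only for illustration:

```python
# Toy re-identification risk check. If a device's home census block is
# tiny, the hashed device ID stops being meaningfully anonymous.
# Block IDs, populations, and the threshold are all hypothetical.
block_population = {"B001": 48, "B002": 1200, "B003": 30}

def reident_risk(home_block, k=50):
    """Flag devices whose home block has fewer than k residents
    (a k-anonymity-style threshold)."""
    return block_population[home_block] < k

assert reident_risk("B001")      # 48 residents: risky
assert not reident_risk("B002")  # large block: safer
print("risky blocks:", [b for b, p in block_population.items() if p < 50])
```

Real privacy reviews combine checks like this across joined data sets, which is exactly why "each source is compliant on its own" is not a sufficient argument.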

00:20:29.99 [Julie Hoyer]: So I turned all mine off the other week, actually, because I got some, like, scary, you know, reel telling me, like, they have all your locations. So hearing you say that makes me feel good that I did that one. So do you, are you then saying a use case for synthetic data could also be that by adding in synthetic data, you’re almost adding, like, representative, maybe, like, volume? Therefore, I don’t want to call it noise, but it’s almost like well-designed noise. So would it help with the anonymity of, like, combinations of data sets? Am I understanding that?

00:22:33.20 [Winston Li]: It’s not so much noise. That’s a different sort of area in data privacy. No, no, noise is not a wrong term at all. There are people who do what’s called differential privacy, which is actually along that line. Basically, you take some data set, you add some noise to the data, and then people could not go back and identify where that data comes from. So that is a different area, but it’s trying to address the same problem too. For us, synthetic data is much more about recreating the data set in such a way that statistical properties are preserved, but the actual cells are different, so to say. One example of that would be, you know, let’s say if the only thing I cared about was the average. I have three numbers, let’s say one, three, five as an example, and their average in this case is three. A synthetic recreation of that might be something like two, three, four: different numbers, but the average is also three. So in this case, one, three, five and two, three, four are two sets of different numbers, but for the purposes of calculating the average, these two sets serve the exact same purpose because they give you the exact same average. Now, of course, it’s more complicated than that, because statisticians or data scientists look for many more metrics. They’ll look at the mean, they’ll look at the variance, they’ll look at the correlations. There’s a whole bunch of checkboxes, and your synthetic data has to kind of tick all of the boxes. But that’s just a simple example of saying, I’m not adding any noise. My new data set could be 2, 3, 4. It could be 4, 3, 2. It could be 4, 2, 3. It doesn’t matter what order it is; all I care about is the statistical metric in the end, and that doesn’t change. So that’s really what synthetic data is.
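Winston's one-three-five example checks out in a couple of lines, and it also shows his "whole bunch of checkboxes" caveat: these two toy sets match on the mean but diverge on variance, which a fuller statistical-equivalence check would catch.

```python
import statistics

real = [1, 3, 5]
synthetic = [2, 3, 4]

# Different cell values...
assert real != synthetic
# ...but identical for the one statistic we said we cared about.
assert statistics.mean(real) == statistics.mean(synthetic) == 3
# A fuller equivalence check compares more metrics, where these two
# toy sets diverge: variance is 4 vs 1.
print(statistics.variance(real), statistics.variance(synthetic))
```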

00:22:43.34 [Julie Hoyer]: But you could beef up the 50 in that little town in hopes that it’d be harder to pin it down to one person. Yeah, awesome. Exactly.

00:23:28.63 [Val Kroll]: That was a really good explanation, I’m so glad you asked that question, Julie. One of the things I saw in one of your papers, Winston, was the phrase “assisting with low resolution data,” and that framing really connected with me, because my interpretation of that is it helps with exactly what Julie is talking about, assisting by kind of, like, increasing that small base size a little bit and dealing with some of the anonymity. But also it helps when, like, it’s aggregated to the point where it’s no longer helpful, right? So it needs to provide more granularity, but not in a way that’s identifiable. Is that what is intended by that? Or what are some ways that the low resolution kind of problem, if you will, is solved in some other cases?

00:26:44.40 [Winston Li]: Yeah, to answer that question, let me maybe just discuss a little bit about what we mean by resolution when we talk about data sets. In consumer data, it’s very common for people to aggregate data sets. What we mean by that is you have lots of individuals, but because of, again, privacy, you can’t give away information about each of the individuals, but we can group them, we can put them into cohorts and give you averages. And a common way of doing that is by using geography. So in the US, for example, that could be census blocks, that could be a zip code, that could be a county, that could be a DMA, whatever. So rather than saying Bob makes $50,000 a year and Sarah makes $30,000 a year, we say, well, in this zip code, the average income is $100,000. There are 10,000 people living in that zip, and that’s the average income. In this other zip, the average income is $30,000, for example. And there are many advantages of using geography as a way of aggregating data, partly because geography is quite well-defined and straightforward. Everybody knows what zip they live in, which geography they belong to. But most importantly, geographies don’t complain. A zip code will not complain and say, oh, you know, you infringed my privacy, you need to take me off of your data set or whatever. So it’s a very common way for people to aggregate data. Now the issue with aggregating by geography is what geography to use, because if the geography is too small, like if, hypothetically, every other house was grouped into some kind of a geography, that’s probably not very helpful, because my neighbor and I are in the same group. If we’re talking about average income, I know mine, so I can immediately calculate what his or her family income is, right?
So that’s an example of where your geography is too small, and people could still infer personal information or personal data. And then on the other hand, you have a problem where if your geography is too big, let’s say if your geography was the whole country, you’re averaging too many people, and that’s not very helpful. So if people say, well, the average age of people living in the United States is, let’s say, 48 years old, well, guess what? That’s probably true for every other country out there, because you’re simply averaging too many people. And when you average too many people, the variation sort of washes out, and you just end up with, well, the average. So finding the right balance is important. And that’s what we mean when we say high resolution: we mean the data is more granular, closer to the individual level. Low resolution means, you know, a lot of them have been grouped up and averaged, and you lost information in the process. In the US, you know, a census block is a pretty common choice; zip is a pretty good choice as well. So again, depending on the organization, depending on what their risk tolerance is, those are usually the geographic units that people work with.
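The aggregation step Winston describes, individuals grouped into geographic cohorts with only averages released, is easy to sketch. The records, zip codes, and minimum-cohort threshold below are all made up; real releases use much larger populations and stricter rules, but the shape is the same: too small a cohort is suppressed because neighbors could infer each other's values.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual records: (zip_code, income). The raw rows
# never leave this layer; only cohort averages get published.
people = [
    ("10001", 50_000), ("10001", 70_000), ("10001", 60_000),
    ("10002", 30_000), ("10002", 90_000),
    ("10003", 120_000),  # a cohort of one: too small to release safely
]

MIN_COHORT = 2  # illustrative threshold; real releases use larger minimums

def aggregate_by_zip(records, min_cohort=MIN_COHORT):
    groups = defaultdict(list)
    for zip_code, income in records:
        groups[zip_code].append(income)
    # Suppress cohorts below the threshold: a geography that is too
    # small lets individuals be inferred from the published average.
    return {z: mean(v) for z, v in groups.items() if len(v) >= min_cohort}

print(aggregate_by_zip(people))
```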

00:27:19.72 [Julie Hoyer]: So then, I guess, to loop back to Val’s question, low resolution would not be describing the scenario we were talking about, where there’s 50 people in a small census block and you’re combining it with another data set where it becomes somewhat more identifiable. So I guess, do we want to circle back, Val? Because you were kind of asking about the low resolution problem, and you said synthetic data can help a lot there, right? So what would an example be, I guess, where synthetic data helps when you’re using more aggregated, low resolution data sets?

00:28:54.98 [Winston Li]: Well, when you’re dealing with low resolution data sets, there’s a little bit of a trade-off. High resolution data: more information, more granular, but, you know, less safe from a privacy standpoint. And then on the other side, you have low resolution data that’s more aggregated, less granular. Information is being lost in the process, but the information being lost also means identities are lost, and it’s harder to connect back to the individual. So, especially working with low resolution data sets, you know, synthetic data could be a very helpful augmentation to getting more information. If you have high resolution data, if you have individual level data, then you don’t need synthetic data. You know, you have what you need, so you can just go do whatever you need to do. The whole point of leveraging synthetic data is when that high resolution, granular data is not available. And then synthetic data becomes your alternative to get to what you need to do, at least in a statistically meaningful way. Not to say, you know, your data set necessarily matches exactly the same as the high resolution data, if you were to have it, but at least if you were to build models or do analysis on this synthetic data set, you can expect it to work, you know, just as well as if you were to use the real high resolution data.

00:29:28.22 [Julie Hoyer]: Oh, okay. Sorry, I feel like I’m asking really naive questions, but this is so helpful, selfishly, for me. So when you’re doing synthetic data for low resolution data, are you doing it to get more of the low resolution, like, data points or rows of data, whatever it is? Or are you saying you’re doing synthetic data to create a synthetic, safe, it’s-not-real-people, higher resolution data source to augment your low resolution? Or maybe you could do both?

00:30:24.23 [Winston Li]: More of the latter. So the way I describe it is, it’s a little bit like data compression, where you have high resolution data, let’s say you have individual level data, and people are aggregated into, let’s say, zip codes or census blocks. So people are aggregated to preserve privacy. And then, now, can we take the aggregated data set and apply some modeling, apply some algorithm, to reconstruct the individuals that have been aggregated? So it’s like data compression and decompression, except you’re not compressing data to save disk space, like, you know, how you put a bunch of files into a zip folder. You’re not doing it to save disk space, but from that compression and decompression process, you took away the identity.
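The "decompression" direction, going from published cohort aggregates back to synthetic individuals, can be sketched as below. The zip codes, counts, averages, and the simple Gaussian spread are all illustrative assumptions, not Arima's actual model; the one property the sketch does guarantee is the one Winston emphasizes, that the synthetic cohort reproduces the published statistics.

```python
import random

# Aggregated "low resolution" release: per zip, (people count, average
# income). Figures are made up for illustration.
aggregates = {"10001": (3, 60_000), "10002": (2, 60_000)}

def decompress(aggs, spread=10_000, seed=0):
    """Reconstruct synthetic individuals from cohort aggregates.
    Like unzipping a file, except identity, not disk space, was removed."""
    rng = random.Random(seed)
    people = []
    for zip_code, (count, avg) in aggs.items():
        draws = [rng.gauss(avg, spread) for _ in range(count)]
        # Shift the draws so the synthetic cohort matches the published mean.
        correction = avg - sum(draws) / count
        people.extend((zip_code, round(d + correction)) for d in draws)
    return people

synthetic_people = decompress(aggregates)
for person in synthetic_people:
    print(person)
```

None of these synthetic rows is a real person, but any analysis built on cohort means would come out the same as on the original data.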

00:30:25.63 [Julie Hoyer]: Oh my gosh. Okay.

00:31:29.44 [Val Kroll]: It’s clicking so much more. Thank you, for me too. This is really helpful. And I think about this from, like, a digital marketing perspective, like when you’re trying to identify and pick audiences. So you know how many people have, like, the pets, like you were talking about before, and now you want to know, also, how many of them are interested in, you know, some higher education program. So you have to understand or try to figure out, what is the size of my audience if I’m picking “has this, this is true, also this, this”? And those combinations aren’t easy to come by, to try to, like, estimate how many people fall into that bucket, because all you have is, you know, 30% of people have dogs in this zip code, right? So is that kind of, like, where you get better with your estimations or audience sizes? Because you’re making fake people where, like, the statistics are still... what was it, how would you phrase it earlier? I forget what you called it, that you would make sure it’s still... yeah.

00:33:22.36 [Winston Li]: Yeah, statistically equivalent. Yeah, exactly. Exactly. So let me give you a simple scenario. Let’s say in a neighborhood, we have 10 people. According to the census, let’s say half of them are male, half of them are female. And, let’s say according to the census again, half of them own dogs and half of them own cats. And now you ask, how many males own dogs in that neighborhood? Well, just by looking at the aggregated information, you won’t know. You could have one extreme where all of the five males own dogs, or you could have it flipped, where all of the five males own cats and all of the five females own dogs, or you could have a pretty random mix of the two. And in all cases, your aggregated counts are the same, but they would mean very, very different things. So the way synthetic data works, or at least how we develop our synthetic data, is to be able to take that aggregated information and model out 10 individuals, so that we can now look at the 10 individuals and say, you know, of the 10 individuals, three of the five males own dogs and two of the males own cats, or something along that line. Now, again, this is not necessarily exactly true to the population at the individual level. You could go to that neighborhood, and you could pull up three men and ask them, do they own dogs, and they won’t necessarily agree with our synthetically created individuals. But if you look at the national level, if you count up a lot of males, whether they own dogs or whether they don’t own dogs, that should agree with, let’s say, the real stats.
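The ambiguity in Winston's 10-person neighborhood can be written out directly: the published marginals (5 males, 5 dog owners) are consistent with six different joint tables, from zero male dog-owners up to five. A simple independence assumption, one possible stand-in for the richer pattern information a real synthetic model would bring from elsewhere, gives an expected cell count:

```python
# Published marginals for a 10-person neighborhood (census-style):
total = 10
males, dogs = 5, 5  # half male, half own dogs

# Under an independence assumption (used here only for illustration,
# absent better data from, say, surrounding states), the expected
# count of male dog-owners is:
p_male = males / total
p_dog = dogs / total
expected_male_dog = total * p_male * p_dog
print(expected_male_dog)  # prints 2.5

# Every integer table below is consistent with the same aggregates,
# which is exactly the ambiguity synthetic modeling has to resolve
# with extra pattern information.
for male_dog in range(6):
    female_dog = dogs - male_dog
    male_cat = males - male_dog
    assert male_dog + female_dog == dogs
    assert male_dog + male_cat == males
```

This is why the synthetic individuals are only statistically equivalent: any one of those six tables matches the aggregates, so no individual row is guaranteed true, but counts over many neighborhoods should agree with the real stats.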

00:33:55.90 [Val Kroll]: Hmm. That’s interesting. And I’m also curious, when you’re doing some of this modeling, is it all, like, descriptive characteristics? Because I’m thinking about my past life in market research, where we did a lot of work like awareness, trial, and usage studies. And so it was like: aware of certain brands, likes certain brands, you know, what percentage likelihood to purchase that brand in the future. Can you kind of do the same thing with, like, attitudinal data, or is it more just, like, descriptive, categorical?

00:34:23.52 [Winston Li]: No, absolutely, both work. It could be synthetically modeling categorical data, so yes or no, or, here are five different brands, which ones have you purchased? Or it could be numerical, a number: for example, how many minutes do you spend on social media every week? It can model that too from knowing the average. So yeah, you can do lots of things.

00:34:36.90 [Michael Helbling]: Where do you see people using synthetic data in ways that are not useful? What are some pitfalls to avoid with synthetic data?

00:37:46.51 [Winston Li]: There are at least two that I can think of. One is where people think synthetic data can just miraculously invent things for them. And this we talked about in the how-much-extrapolation-is-too-much discussion a little earlier. Some people might say, oh, I don’t have this data set, can you get a synthetic version for me? Well, that’d be pretty hard to do. You need to base it on something, right? Although in the last little while people have been discussing the ability to integrate something like an LLM into the synthetic data generation process, with the assumption that if the LLM has been trained on every resource it could access across the internet, then it should in theory know everything. But that’s a little bit of a different topic. Understanding what data comes out based on what your input is remains very important. So I think this is definitely one thing. The other thing that people sometimes don’t get with synthetic data is that it’s only statistically equivalent. When we say statistically equivalent, you need to look at the data not just at one individual, but across a group of them. For example, when someone looks at our synthetic society data set, the first thing some people will do is go to their zip code or their census block, pull up the person with their exact age and exact gender, and start comparing that record to themselves. And then they say, this person is not the same as me, your data is not accurate. And this is true if you just look at one individual; that individual might not be the person you really wanted it to be. But again, that’s not the idea of synthetic data.
It’s almost like insurance. As a statistician, I worked pretty closely with actuarial science back in my days in the university. And in actuarial science, people essentially calculate when you’re going to die. That’s what they do. They’ll say, this kind of person will die at the age of 50, this kind of person will die at the age of 70. It doesn’t mean you will literally drop dead on your 50th birthday. It just means that across a group of people like that, on average, they’re expected to die at the age of 50. Synthetic data is like that too. It’s not to say this exact record is Michael and this exact record is Winston. It’s that across a bunch of people similar to Michael or similar to Winston, our synthetic data is also going to have people like that. So yeah.

00:38:31.20 [Julie Hoyer]: Well, it’s interesting, too. Is it ever hard, with stakeholders you’ve worked with, to have the conversation where you’re asking them to define and clarify which statistics matter to the questions they’re trying to answer? Because it sounds like, to your point, you’re trying to create the synthetic data to check certain boxes of which statistics it preserves in the data set. So do you ever run into customers or stakeholders who struggle to define that, making it hard to create useful synthetic data for them?

00:41:20.69 [Winston Li]: Not so much, but let me backtrack a little bit. When I say not so much, it’s because we tend not to create custom synthetic data for every customer we work with. We have one synthetic society product built out, and that particular product is based on 10 or 20 different data sources, and we tell people this is the one thing we have. So let’s say for the United States, this is the one synthetic society data set we have, and you can use that. There are lots of fields in there, maybe 50,000. You might not need all 50,000 variables, you might only need a part of that, but generally speaking, what’s there is enough for most people’s needs. So we don’t tend to create custom instances for every individual. There might be a couple of customers who come in and say, oh, I have a couple of variables, I’ve commissioned, let’s say, Kantar to do a custom study for me, and here are several questions, can I embed that into the synthetic society? So there’s that kind of request. Now, that said, people do care about data quality. So even though we’re not generating data sets ad hoc for every customer, we do get people who ask, how do I trust your data? How accurate is it? Why should I use your data set versus something else? So we do get that kind of question. And depending on the level of sophistication of the person who’s asking, we give them different things. Sometimes we have very technical customers, and that’s when I go through the methodology with them: I share our paper, I share how the method works. We have some customers who want to validate the data through use cases. Maybe they’ll say, okay, give me some data, I’m going to use it to enhance my model and see if my lift increases.
There are also people who look at logos and say, oh, what kind of clients do you work with? And if you can give them big logos, they’ll be like, okay, fine. So we deal with all kinds of clients, but one of the things I try to do is be as transparent as possible. I’m very transparent about the methodology, the exact steps we use to generate the synthetic population, versus saying this is all trade secret and you just have to trust it. I come from somewhat of an academic background, so to me this kind of openness, open source, let’s say, is very important.

00:41:53.57 [Val Kroll]: I like it. Going back to one of your watchouts, it made me think of where you were talking about materializing data out of thin air, essentially. Michael, I’m going to put you on the spot, and if you don’t recall, then we’ll just cut this part out. That company that you sent me because you knew about some of my market research roots, that was like, forget surveying people, just put your questionnaire into this tool and it will model out the responses based on the different respondent archetypes you’re interested in, right? Am I representing that well?

00:42:18.47 [Michael Helbling]: Yeah, and I’ve seen more than that now; there’s another one. So they’re using LLMs, like you mentioned, Winston, to come up with basically a response to, let’s say, a survey. You give it something like, here’s who I’m targeting, and they’ll say, okay, we’ll construct an LLM version of that person, who will then respond to your survey and give you the feedback that you want to hear.

00:42:32.33 [Val Kroll]: Oh, boy. I would rather every app on my phone track me than make business decisions off of whatever garbage LLMs are spitting out for those different micro segments. I feel like that’s dangerous.

00:47:12.73 [Winston Li]: Well, actually, this is something we do as well. And if you do it properly, and properly is a bit of a strong word, because everybody thinks they do it properly, and we think the way we do it is properly, you can get some fairly decent results. Let me describe how we do it and why we think it’s actually quite legit. Last year, right before the US election, one of my academic collaborators tagged me and asked whether I wanted to write a paper. He’s a professor at the University of Southern California and I’ve known him for a long, long time. So I said, sure, what do you want to write about? And of course he’s well aware of what we work on, our synthetic population and all that. And he said, right around the election, a lot of people are talking about using LLMs as a way to predict who somebody will vote for. But you’ve got this unique advantage in that you have a data set that matches the exact distribution of the population. So in theory, we can go through every individual in your population and ask the LLM to impersonate that person. For example: you’re a 24-year-old Black female with this kind of family income, with this many kids, living in the county of X in the state of Y. Now it’s election time, here are the two candidates, here are their policies, and who are you going to vote for? You can repeat that 300 million times, or however many people are eligible to vote, and do a bottom-up approach, as opposed to just asking the LLM, how will California vote, or how will this state vote? Rather than doing it top-down, you do it bottom-up.
The assumption, of course, is that the LLM may contain some bias, in that it may be trained, let’s say, mainly on Reddit data, and Reddit users are generally young people leaning towards one party or one candidate. The hope is that if we supply the LLM with enough information, like the person’s age, the person’s financial situation, all these things about this person, it can tell us how they will vote. So that’s how we connect LLMs with our existing synthetic society: going bottom-up, laddering up each county, laddering up each state, to see how people will respond to certain things. The end result was interesting, and I will share it because it’s funny: it’s both right and wrong. It’s wrong in the sense that it did not predict Donald Trump would win, but it was fairly close to what the market research polls were saying. We had very similar numbers to the polls. So if you ask, does it predict the election correctly? The answer is no. If you ask a different question, does this give similar answers to polls, but at a cheaper and faster rate, then the answer would be yes. And along the way, after the fact, we tried different things. For example, instead of just prompting the person to say, do you vote for candidate A or candidate B, you may first have to ask, will you vote in the upcoming election? And only if the answer is yes do you prompt the machine again to say, okay, given you’re going to vote, here are the two candidates, who are you going to vote for?
So then the LLM would know this kind of person is more likely to vote, this kind of person is less likely to vote, and so on. There are nuances to how these kinds of things are done. People are always looking at ways of improving market research, and this is one of the things that people have talked about a great deal in the last little while.
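[Editor’s note: the bottom-up workflow Winston describes, impersonate each synthetic individual, apply a turnout gate, then aggregate the votes, can be sketched as a skeleton. The LLM call is stubbed out with a toy rule so the sketch runs end to end; `poll_persona`, `will_vote`, and the field names are all illustrative assumptions, not Arima’s actual pipeline.]

```python
from collections import Counter

def poll_persona(persona: dict) -> str:
    # Stand-in for the LLM call: a real pipeline would prompt a model to
    # impersonate this persona (age, income, county, ...) and return a vote.
    return "A" if persona["age"] < 40 else "B"

def will_vote(persona: dict) -> bool:
    # Turnout gate: only personas judged likely to vote get the second
    # prompt, mirroring the two-stage prompting described above.
    return persona.get("likely_voter", True)

def bottom_up_poll(population) -> Counter:
    """Aggregate individual-level responses into a poll result."""
    votes = Counter()
    for persona in population:
        if will_vote(persona):
            votes[poll_persona(persona)] += 1
    return votes

population = [
    {"age": 24, "state": "CA", "likely_voter": True},
    {"age": 61, "state": "TX", "likely_voter": True},
    {"age": 33, "state": "NY", "likely_voter": False},  # filtered at turnout stage
]
print(bottom_up_poll(population))  # Counter({'A': 1, 'B': 1})
```

Because each record carries its geography, the same loop can be grouped by county or state to ladder results up, which is the bottom-up aggregation Winston contrasts with asking the LLM directly how a state will vote.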

00:47:21.66 [Val Kroll]: That’s very cool. I mean, when Michael first sent that to me, I’m like, what type of garbage water is this? But that’s legit. That sounds pretty interesting.

00:47:34.77 [Michael Helbling]: Well, in the interim, Val, I’ve stumbled across a couple more companies in the same space. So like, obviously it’s getting some kind of traction. Yeah, it’s a thing. It’s AI, smarter than all of us.

00:48:18.40 [Julie Hoyer]: But I wonder if it comes down to the validity of the questions you’re asking, or the topic you’re wanting to poll them on. Kind of back to what we were saying, you’d have to have sufficient data on the factors that would influence that answer. I feel like politics is an area where there’s a lot of data that would help. And so I’m wondering, too, if you’re needing market research and going to a company like that, you’d probably have to think through a lot of the assumptions you have, or again, what are your knowns and unknowns, and weigh: do I believe that this is going to be representative? So that’s interesting.

00:50:54.42 [Winston Li]: Correct. And there are, again, a couple of different layers to that question. For one thing, even how you prompt the LLM will make a difference. But in market research, how you ask the question will also make a difference, so it’s not a problem unique to LLMs; it’s also a problem for people. And then the other thing is, prompting is really important, as with almost anything when you deal with LLMs. When you prompt that person, what kind of biographic information you give about that individual will also make a very big difference. If you want to make your simulated poll more accurate, so to say, you need to prompt it with the right information. So let me give you an example. We have the California Wine Institute as a client. When the U.S.-Canada trade war first started, all of the American wines got pulled off the shelf in Canada. This client we work with, who is a neighbor of one of our team members, is the head of the California Wine Institute for Canada. The next morning she’s looking at the news and going, how do I even do my job? They don’t sell American wines anymore in Canada. So she came to us and said, can you do a poll for me asking who will resume buying California wine when Trump stops calling Canada the 51st state, or when the trade war is over? In answering this question, you can prompt the LLM in very different ways. We have data on, for example, who’s currently buying American wines, so that would be a piece of information that we supply to the LLM.
Obviously, basic things like your age, your gender, your income, and where you live would play a role as well. But there could be a lot of psychographic data that you put in, too. So depending on what data you already have and how you prompt the machine, you’ll get slightly different answers. Just like real polls, polling an LLM has its own nuances, so to say.

00:50:56.30 [Val Kroll]: So interesting. All right.

00:51:28.57 [Michael Helbling]: This is so interesting. And I think this is a topic that’s not on a lot of people’s radars, so thank you so much, Winston, for coming in and sharing some of your knowledge with us. We do, unfortunately, have to start to wrap up the show, but this has been really fun to talk about and hear from you on, so I really appreciate it. One thing we do on the show is the last call: we go around and share something that might be of interest to our listeners. Winston, you’re our guest. Do you have a last call you’d like to share?

00:51:50.55 [Winston Li]: Sure. I’ll be in Chicago, I think at least Val is in Chicago, I’ll be in Chicago in September. We’re sponsoring the ANA Measurement conference. So if any listeners of the podcast are based in Chicago, I would love to connect.

00:51:54.90 [Michael Helbling]: That’s awesome. And are you going to come to MeasureCamp Chicago too?

00:51:59.80 [Val Kroll]: I’ll try. I’ve already started to twist Winston’s arm to stay till Saturday.

00:52:05.48 [Michael Helbling]: Oh, okay. Cause we’ll be there. We’ll be there. So great to see you again.

00:52:13.06 [Val Kroll]: You have to go to me next because that’s going to be my last call. Sorry, I’m teeing myself up with the last call.

00:52:18.94 [Michael Helbling]: It’s a natural next step, Val, that maybe we should hear what your last call is.

00:53:01.79 [Val Kroll]: Oh, I’m so glad you asked, Michael. Well, I wanted to encourage all of our listeners to consider joining us in Chicago for our second annual MeasureCamp. It is on Saturday, September 13th, and we will be downtown again at the Leo Burnett Building. If you’re not familiar with Chicago, the city was built to be enjoyed from the river, and the building is right on the water. It’s a great location. We had just over 200 people join us last year, and we expect even more this year to celebrate, connect, share, and learn. We would love to see you. MeasureCamp Chicago: you can grab your tickets now.

00:53:03.63 [Michael Helbling]: I have my ticket already.

00:53:09.42 [Michael Helbling]: That’s awesome. All right, Julie, what about you? What’s your last call?

00:54:56.65 [Julie Hoyer]: Well, my last call happens to be an article by Cassie Kozyrkov, one of our faves. I know we talk about her a lot, but it’s a recent article she wrote, and I just found it really interesting: how to pick a college major in an AI-first world. I really liked the way she framed it. She breaks it down into two different types of majors: a major in clarity or a major in curiosity. And she talks about why each one is going to be really beneficial in an AI world. It made me feel better about my major, which technically is one in clarity. If you go for clarity, it’s rigorous and black and white, there’s a right answer or a wrong answer, and it helps you a lot with logic and thinking through a problem: how do you frame a problem, how do you stay within bounds? Curiosity, on the other side, is very much about majors that help you with adaptability and lifelong learning: how do you fuel your strengths and what you’re curious about? I just really loved her framing, because a lot of people now are asking, should I go to college? Do I need to go to college? How do I get ready for a job that’s going to use AI inevitably? So it was just a really good read, whether you’re thinking of going to college right now or not. It’s a good one.

00:55:01.52 [Michael Helbling]: I’ve been to so many graduation parties over the last couple of months. So this is good.

00:55:06.63 [Julie Hoyer]: It’s timely, right? Yeah. Yeah. What about you, Michael?

00:55:33.03 [Michael Helbling]: Well, I also have something that’s AI related. There’s a paper that recently came out that took a group of users and had them take a quiz. While they were taking the quiz, it compared the ability of an LLM to influence their answers versus a human who was paid money if they were able to influence those answers, correctly or incorrectly. And the LLM vastly outperformed the humans in its ability to persuade people.

00:55:33.87 [Val Kroll]: I just had goosebumps.

00:57:12.45 [Michael Helbling]: Yeah, so it’s already there: LLMs are extremely persuasive. The paper walks through some of the findings, and we’ll put a link on the website if you want to check it out. It’s a pretty interesting initial read. There’s a lot more research to be done and better experiments to run in this field, but as a first read, it was quite interesting. All right. Well, it’s been such an honor and a privilege, Winston. Thank you so much for coming on the show and spending some time with us. Really appreciate it. Yeah, likewise, thanks for having me. Synthetic data: I think it’s got a big future, so it’s really cool to break into this topic for the first time on the show. We’ve been doing the show for 10 years and we’ve never talked about synthetic data, so it’s finally time. And of course, as you’ve been listening out there, you might have more questions or want to join the discussion. We would love to hear from you. The best way to do that is through the Measure Slack chat group or on LinkedIn, or you can send us an email at contact at analyticshour.io. And of course, we always want to give a big shout out to Josh Crowhurst, our producer, for everything he does. Thank you, Josh. And I think I can speak for both of my co-hosts, Val and Julie: whether you’re using real data or synthetic data, they’d agree that people should keep analyzing.

00:57:31.20 [Announcer]: Thanks for listening. Let’s keep the conversation going with your comments, suggestions, and questions on Twitter at @analyticshour on the web at analyticshour.io, our LinkedIn group, and the Measure Chat Slack group. Music for the podcast by Josh Crowhurst.

00:57:51.93 [Other]: So smart guys want to fit in. So they made up a term called analytics. Analytics don’t work. Do the analytics say go for it, no matter who’s going for it? So if you and I were in the field, the analytics say go for it. It’s the stupidest, laziest, lamest thing I’ve ever heard for reasoning in competition.

00:57:51.93 [Val Kroll]: Rock flag and low-resolution data.
