When it comes to data, there are data consumers (analysts, builders and users of data products, and various other business stakeholders) and data producers (software engineers and various adjacent roles and systems). It’s all too common for data producers to “break” the data as they add new features and functionality to systems, because they focus on the operational processes the system supports and not the data that those processes spawn. How can this be avoided? One approach is to implement “data contracts.” What that actually means… is the subject of this episode, which Shane Murray from Monte Carlo joined us to discuss!
0:00:05.8 Announcer: Welcome to the Analytics Power Hour. Analytics topics covered conversationally and sometimes with explicit language. Here are your hosts, Moe, Michael and Tim.
0:00:22.1 Michael Helbling: Hey everybody, welcome to the Analytics Power Hour. This is episode 213. I once had a conversation with a data leader, I was bemoaning the frequent ETL problems that we were having with some data sources, and they were like, Oh yeah, we call our ETL “self-healing” ’cause in a few days, it’ll probably start working again, and I was like, Oh, that’s actually a pretty good outlook for that, but it just speaks to how our data systems and platforms are largely pieced together and at really early stages of use and understanding, and at the same time, our industry is growing and maturing very quickly. And with that, we don’t just need more capability, we need to have that capability understood and used effectively across our companies by everybody who needs to interact with that data, and that’s sort of where the concept of the data contract comes in.
0:01:17.0 MH: What is that you might ask. Well, we’re gonna talk about it, and I’m pretty excited to introduce my co-host for this episode, Josh Crowhurst, senior manager of digital performance analysis at AIA. Welcome.
0:01:29.4 Josh Crowhurst: Hey, happy to be here.
0:01:29.9 MH: So some listeners might notice that Moe is absent, she will actually be back for the next episode, but she’ll be on a little break, and Josh, thanks for joining in as a co-host for this episode. And Tim Wilson, full-time quintessential analyst and industry bon vivant. Welcome.
0:01:51.0 Tim Wilson: Wow. We gotta work on that.
0:01:52.5 MH: I don’t know, what am I supposed to call you now? You’re a man that can’t be defined.
0:02:00.5 TW: We can all set up a meeting and we can workshop something.
0:02:01.8 MH: Okay, we’ll workshop some titles.
0:02:03.5 TW: Anything would be better. How’s that? Yeah.
0:02:05.0 MH: The, I was thinking like the pie chart slayer or something like that. We could, I don’t know. We’ll come up with something. Anyways, I’m Michael Helbling, Managing Partner, Stacked Analytics, but we needed a guest, someone who could help us make sense of all this data contract stuff. Shane Murray is the Field CTO at Monte Carlo, and prior to that, he was the Senior Vice President of Data & Insights at the New York Times. He’s also a founding member of the InvestInData angel investor collective. And luckily, today he is our guest. Welcome to the show, Shane.
0:02:37.4 Shane Murray: Thank you. Thanks for inviting me. Excited to be here.
0:02:41.7 MH: Yeah, you’re the second person I’ve met from the InvestInData collective. So I might know a couple of more people, but I definitely know at least one other person, it’s a pretty amazing group of people, and it’s pretty cool to see what you’re doing with that, which is not what this show is about. All right, well, let’s start with something, just I think, Shane, it’ll be probably really helpful for all of us to just start at the very beginning of this, which is just defining what even a data contract is, and then I think we can sort of use that as a jumping off point for more discussion about it.
0:03:12.5 SM: Sure, so a data contract is typically an, I’d say, an internal agreement between data producer and data consumer or consumers, it could cover things like the schema of the data, the range of the values expected, SLAs around the delivery of that data or the availability of that data, as well as things like access and security to the data, usually what I think we’re finding is that this is accompanied by a shift in process and culture within the organization towards more upfront gathering of data requirements, and even perhaps an emphasis on data modeling on the side of the data producer to ensure the quality of that data that’s delivered towards the warehouse, ultimately with the goal of building more reliable and trustworthy data.
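To make that definition concrete, here is a minimal sketch of such an agreement written down as structured metadata rather than prose. Everything in it is an assumption made for illustration: the `DataContract` and `FieldRule` classes, the `subscription_events` dataset, the team names, and the thresholds are hypothetical and do not come from the episode.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, List, Optional


@dataclass
class FieldRule:
    """One field in the contract: its type and an optional allowed range."""
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None


@dataclass
class DataContract:
    """Hypothetical agreement between one producer and its consumers."""
    name: str
    owner: str                       # producing team accountable for the data
    schema: Dict[str, FieldRule]     # field names, types, and expected ranges
    freshness_sla_minutes: int       # delivery / availability expectation
    allowed_readers: FrozenSet[str] = frozenset()  # access and security

    def violations(self, record: Dict[str, Any]) -> List[str]:
        """Return human-readable problems found in a single record."""
        problems = []
        for name, rule in self.schema.items():
            if name not in record:
                problems.append(f"missing field: {name}")
                continue
            value = record[name]
            if not isinstance(value, rule.dtype):
                problems.append(f"wrong type for {name}: {type(value).__name__}")
                continue
            if rule.min_value is not None and value < rule.min_value:
                problems.append(f"{name} below minimum: {value}")
            if rule.max_value is not None and value > rule.max_value:
                problems.append(f"{name} above maximum: {value}")
        return problems


# Illustrative contract; every name and number below is made up.
subscriptions = DataContract(
    name="subscription_events",
    owner="commerce-engineering",
    schema={
        "user_id": FieldRule(str),
        "amount_usd": FieldRule(float, min_value=0.0, max_value=10_000.0),
    },
    freshness_sla_minutes=60,
    allowed_readers=frozenset({"finance-analytics"}),
)
```

Calling `subscriptions.violations(record)` on an incoming record returns an empty list when the record honors the contract and a list of problems otherwise. The cultural shift described above (gathering requirements upfront) is what decides which fields and ranges go into `schema` in the first place.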
0:04:06.3 MH: Am I right in thinking about it, ’cause I wasn’t super familiar with them until I started reading, and it seems like you and Chad Sanderson are dominating the search engine results on data contracts. What I was gleaning from it, which seems obvious in hindsight, is that the challenge is that when we say data producers and data consumers, the data producers are often software engineers building operational systems. If you tell a data consumer that they’re a data consumer, they’ll be like, Yeah, I’m a marketer, I’m a data consumer, I’m an analyst, I’m a data consumer, it feels more natural.
0:04:41.9 MH: If you tell a software engineer that they’re a data producer, that’s kind of the fundamental challenge, that they… They’re like, No, I’m building a system, I’m working, maintaining a system to run some operational process, data is an artifact of that, so sure, knock yourself out. The data is flowing somewhere else, I assume that it is, it’s fine, it’s just fine, but they don’t really see themselves in a role of producing data; that just kinda happens, and that’s kind of the root of the problem.
0:05:13.2 SM: Yeah, I think that’s on point. And to your point, I don’t think many software engineers would label themselves as a data producer, they would call themselves a software engineer or a product engineer, and often I would say, and I found this from experience, they’re pretty unaware of the downstream uses of the data emitted from their systems. Often, we’d call it data exhaust, where we’re taking in this exhaust and we’re making it something valuable. Refining it into something valuable.
0:05:46.9 SM: And these engineers I’d also say, are often pushing frequent changes to production, altering the schema and breaking those downstream data pipelines, and I think the shift we’re really seeing, and this is amongst the data consumer class, is that data products are just becoming more and more critical to organizations both the machine learning product, so the customer-facing applications that are running in production, but also the analytics, you can have financial analytics that’s supporting decision-making or pricing analytics that’s just critical to the business, and data teams are just looking for ways to ensure that data is high quality, it’s reliable and it’s trustworthy.
0:06:32.0 JC: So I guess one of the things I was thinking about, just kind of related to that point where you have software engineering teams that are not typically looking at themselves through a lens of data producer, and with the data contract, you’re asking them to do something a little bit differently and to change. So would you say that you typically would see some level of resistance to bringing this in or is it more, as those use cases become more and more critical, it’s self-evident that there’s value in this. I guess, how do you get that cross-functional buy-in from those teams that are more on the producer side where I guess the benefit is less clearly felt by them?
0:07:12.4 SM: Yeah. It’s a really great question, Josh. And I think this is still some of the battle that data teams are fighting, is getting the seat at the table to make data a first class citizen in these products. And I do think there’s an element of where you pick your battles, there are certain relationships where you have a clear, I would say like a one-to-one relationship between the software engineering team and the data consuming team, where it’s a lot clearer how you’re going to approach that team to ensure reliability, and there’s other cases like 50 product teams that are contributing to an event schema that’s constantly evolving, and that maybe serves 20 downstream products, so you have this kind of more complicated relationship.
0:08:02.2 SM: What I’ve found though, is that first, it starts with partnering with those software engineering teams to drive awareness of how the data is used to actually bring them kind of in the trenches of responding to data downtime, and I certainly saw this at The Times, both with how we were using commerce data to create subscriptions and financial data, as well as how we were using things like machine learning based on these upstream systems, but actually making them a party to the success of that product and breeding awareness in the use cases of the data is kind of this critical awareness step, and then I’d say one of the bigger shifts, and this is maybe a bit of a shout-out to Monte Carlo but is actually the… What observability can bring to the table, that’s really bringing for the data teams a workflow of detection and resolution and prevention of data incidents, but it’s also giving you this mapping of lineage where the upstream teams can see when they create, say, a schema change incident or a freshness incident, how many teams and reports are affected downstream, and the data consumers can also see upstream where the originating problem is.
0:09:27.0 SM: And so I think observability has a big role to play here, and then I think thirdly, you kind of get to this point where you’re like, all right, we’ve got the observability, we’ve got the awareness, but we’re actually still seeing regular schema changes that are affecting our downstream systems, how can we put a contract into place?
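The lineage mapping described above can be pictured as a graph walk: given an asset where an incident started, list everything that reads from it, directly or indirectly. The sketch below is only an illustration under stated assumptions. The `LINEAGE` graph and its asset names are invented, and this is not how any particular observability product implements lineage.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: each asset maps to the assets that read from it.
LINEAGE: Dict[str, List[str]] = {
    "commerce_db.orders": ["warehouse.orders_clean"],
    "warehouse.orders_clean": ["finance.revenue_report", "ml.churn_features"],
    "ml.churn_features": ["ml.churn_model"],
    "finance.revenue_report": [],
    "ml.churn_model": [],
}


def downstream_of(asset: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen: List[str] = []
    queue = deque(lineage.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.append(node)
        queue.extend(lineage.get(node, []))
    return seen


def upstream_of(asset: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Return every asset that `asset` depends on, by walking reversed edges."""
    reversed_lineage: Dict[str, List[str]] = {}
    for parent, children in lineage.items():
        for child in children:
            reversed_lineage.setdefault(child, []).append(parent)
    return downstream_of(asset, reversed_lineage)


affected = downstream_of("commerce_db.orders", LINEAGE)
origins = upstream_of("ml.churn_model", LINEAGE)
```

Here `affected` answers the producer's question (what breaks if I change this table?) and `origins` answers the consumer's question (where could this problem have started?).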
0:09:48.6 MH: On the one hand, it sort of seems like there’s a little bit of a chicken and an egg problem with this, not totally, but can feel like it because in terms of kinda how you talk about data people are trying to get a better seat at the table, but it’s sort of like to get that seat at the table, you need some of that observability sort of those things, and so data contracts kinda solve that. Do you also see that there needs to be a lot of times structural changes in data teams to support it, because in my observation, a lot of times data teams lack what I’ll call sort of like a management layer or effective management layer, whether you call that a project manager or a product manager or whatever it is, data product manager, whatever, I don’t care what title you wanna call it, but a lot of times data teams are pretty small and they live under other teams that don’t effectively understand what their role or what they should be doing, like what structural changes do organizations need to make to leverage this as well?
0:10:49.3 SM: Yeah, I guess I’d say from my experience, one of the biggest ways to drive this change, to drive the awareness and the proximity to the problem is through some sort of embedding model, and so pushing out whether it’s engineers or analytical engineers or analysts, in the case of my experience, it was with analysts pushing them into the product engineering teams to work side by side with them as like a data expert partner and bring some of those best practices to play and then you just see that that software engineer or that group of software engineers they have a lot more clarity around how the data is used and they’re part of a cross-functional team that’s responsible for the downstream impacts of their data.
0:11:39.4 SM: I still think you have… You’re going to have these complex relationships between a central team that probably still houses a lot of the most critical data assets for the organization, a financial reporting data asset or my experience at The Times, we actually had the machine learning team sitting centrally, but then depending on various upstream teams that were emitting the critical data for those applications.
0:12:07.0 TW: So when it comes to the ways you started to go down the observability and having the lineage, it feels like there’s one extreme is you don’t try to deal with this at all, and you just deal with as the SEV one and SEV two issues come up, you scramble, you hope you get a fix quickly, you might not. It causes a bunch of chaos and at some point that becomes painful enough to enough people that it’s like, “Hey, we’ve gotta change,” like you wait for it to break, although there’s probably the more insidious you have things fail so gracefully that you don’t realize they’re broken, so you just have bad models running.
0:12:46.0 TW: The data contract feels like it’s like the other… The observability kinda falls in the middle of saying, Let’s at least proactively identify things that look like they’re not going well, the data contracts feel like it’s like the upstream, the mindful, and we should probably talk a little bit about when you talk about data contracts, it’s like it’s not a legal document with sub section A, B and C, one of the posts is like, “No, this is maintained in a system, this could be JSON or this could be… Somehow it is structured, it is metadata itself that is part of the data contract,” and presumably that’s not gonna solve it, it’s not like if you have data contracts in place, you’re never gonna have data breakage, but the goal is to say at least we’re all on the same page as to what the expectations are, so we don’t have a software engineer dropping a column without… They may do it once and then realize they’ve violated the contract, and then maybe they now know, “Oh, before I drop a column or change a schema, I better go take a quick look at the contract and make sure that’s legit.”
0:14:01.0 TW: But thinking about them across a spectrum of, it’s kind of trying to get to a more proactive, thoughtful way to approach things as opposed to just reactive, is that a fair framing?
0:14:15.6 SM: Yeah, I think it is. The reality if you… And even going back to Michael’s opening, a lot of data teams were built on these brittle processes, whether it was kind of change data capture from copies of the DB or in some cases even APIs that were provided at some point in time by software engineering teams built for a specific single purpose, but that team left or it didn’t have the sort of evolution of that schema. I think the phase we’re in right now is teams are, as I said before, teams are starting to see these critical reporting applications or machine learning applications where you just need a higher standard of data quality and you need it to be to your point, Tim, you need it to be more proactive as opposed to the kind of cost you’re seeing in the downtime of being mostly reactive, and I think in many ways it’s like has the cost of that downtime become greater than the cost it would take to actually resolve these issues proactively by having some upfront contract and communication and evolution of that contract.
0:15:34.7 SM: And so I think just in terms of the use cases, the two I think about from my New York Times experience, one was the financial data, where you had to be accurate to the penny and available, and you had a clear team upstream that was responsible for producing the data that we were curating into financial reporting, and that team was on the hook every time we had a high severity incident to respond to it with us, so they had a vested interest in putting in place a contract that we would evolve together and using these sort of schema registry and contract techniques to actually put that in place.
0:16:19.4 SM: The other one I think about is like a machine learning use case, so at The Times, we had an algorithmic content recommendations team that was heavily reliant on what we called the publishing pipeline, and that was data that was published to Kafka on the articles that were being published throughout the day. And so that was heavily relied upon by the machine learning team, and it made sense to have a schema registry and a contract of sorts between those teams, albeit somewhat primitive.
0:16:54.1 JC: So can I just ask maybe a silly question, when it comes to these data contracts, the actual artifact that represents the contract, does this have to be, Tim mentioned like JSON code, is it something that’s integrated into the development process and formally gets checked against when new code is deployed to make sure that nothing is breaking and there’s sort of a testing element, or can you even just have like just have a Google Doc that says like, This is the SLA, this is the expected format, this is the uptime that’s expected, et cetera. Those sorts of details. What is the actual form factor of a data contract ’cause yeah, it looks same as I think Tim mentioned it’s not something that I’ve seen in the wild yet, so I’m like, could we… Obviously, I think you could go to that level where you do integrate it into your systems in a more technical way, but can you as an easier start, maybe for some teams when you’re starting to experiment with it, could you have more of a lightweight sort of a form factor and yeah, how does that look?
0:18:00.9 SM: Yeah, and the more lightweight Google Sheets approach, I’d say, is often the tool of choice for analysts today dictating tracking requirements where you’re doing extensive implementation across product teams, so, completely, it feels like for data teams that would be a good place to start. Some places where that might fall down, of course, are the inability to enforce checks and quality assurance on changes, the inability to evolve the schema and iterate on it over time, and you’re probably not integrating that very tightly with the software engineer workflow, so it’s similar to something like a catalog that has this problem of perhaps sitting outside of the tool set that’s used. I think typically, the tools that have been adopted here, we mentioned JSON, there’s also Protobuf and Avro that are often used for actually defining the schema, and then you have in places like Kafka the ability to hold a schema registry that can be stored and retrieved in order to put these checks in place on the changes that are made.
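As a rough picture of what "checks on the changes that are made" can mean, the sketch below compares a proposed schema against the one already registered and reports changes that would break existing readers. It is a simplified assumption, not the compatibility algorithm of Kafka's schema registry, Avro, or Protobuf: schemas are modeled as plain dictionaries of field name to type name, and the `breaking_changes` function and the article fields are hypothetical.

```python
from typing import Dict, List


def breaking_changes(registered: Dict[str, str], proposed: Dict[str, str]) -> List[str]:
    """List changes in `proposed` that would break readers of `registered`.

    Simplifying assumption: removing a field or changing its type is breaking,
    while adding a new field is not.
    """
    problems = []
    for name, dtype in registered.items():
        if name not in proposed:
            problems.append(f"removed field: {name}")
        elif proposed[name] != dtype:
            problems.append(f"type changed for {name}: {dtype} -> {proposed[name]}")
    return problems


# Hypothetical schema for articles flowing through a publishing pipeline.
registered_schema = {
    "article_id": "string",
    "published_at": "timestamp",
    "section": "string",
}

# A producer drops `section` and adds `byline`.
proposed_schema = {
    "article_id": "string",
    "published_at": "timestamp",
    "byline": "string",
}

problems = breaking_changes(registered_schema, proposed_schema)
```

A check like this could run in the producer's deployment pipeline and fail the build when `problems` is non-empty, which is the enforcement step a spreadsheet or document cannot provide on its own.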
0:19:22.2 TW: But in those, if it’s in a system is it… And it seems like it’s talked about a lot in the data mesh, it’s a whole modern data stack-y thing, this whole data contract, so it’s like, presumably, if it’s stored in a system, it feels like it would need to be… The format and the structure needs to be stored in the services layer system, or it needs to be stored in the more centralized you can’t say, I’m gonna store my data contracts in all of the data producing systems. Is that right? So you have to have the data producing systems have to hook into the data contract that comes from the basically the same place that’s gonna be using the data. Am I thinking about that right?
0:20:16.6 SM: Yeah. Often it’s being attached to like a Google Pub/Sub or Kafka like a data bus approach, like a messaging system, where you’re actually saying to those teams, you can actually publish to this messaging bus and that’s what we’ll then consume from as a data team, so it’s a switch in terms of how they’re approaching that role of a data producer, but I think as we mentioned Chad Sanderson earlier, I think there’s some deep dives into the different ways this can be implemented in the various forms of ETL that teams are using.
0:21:00.9 MH: Well, it seems like there’s a lot of different kinds of data contracts as well, and that might also impact sort of like back in my earlier days, I was an analyst at Lands’ End and we had a set of reports that we sent out to the business Monday mornings, and we did that by 11:00 AM, and so in a certain sense, that was our data “contract” with the business that that reporting and that analysis would go out to the team by that time with that data, and then we tried to get our vendors who provided that data to have SLAs that they would process and present the data to us so that it would be ready at that time. Interestingly enough, our vendors refused to participate in those SLAs, that was before… it was in the Omniture days, way before Adobe. But that’s one of the things we were doing. So in a certain sense, that’s sort of a “data contract” if you will, and so I think that’s where it gets a little…
0:21:57.7 TW: Uh, you’re killing me. You’re just… You’re trying to bogart a term that was, it was defined, it was like…
0:22:02.4 MH: It’s an SLA. It’s an SLA.
0:22:05.3 TW: Yeah, it’s like kind of an SLA.
0:22:06.6 SM: But I would say like the SLA is being talked about in… alongside the more proactive preventative contract. We’re seeing a few variations of this discussed, we deal with a lot of customers, work with a lot of customers at Monte Carlo who are going from observability to establishing SLAs and SLOs over their data products. And this is, I think, a natural path in just adopting some of these software engineering and the best practices that we’re seeing from the engineering space. I do think there’s also then the question of how much restriction or gating you put on software engineering teams to deploy code without first considering the changes to your data environment.
0:22:57.8 SM: And then there’s been some sort of thought around even how can data teams use that gating mechanism to prevent bad data getting to production, even if you can’t prevent it being published by the software engineering teams, can you keep it in kind of a staging environment and deal with it there.
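One way to picture that gating idea is a step that sits between what producers publish and what reaches production tables: records that pass a check are promoted, and the rest are held in staging for someone to look at. The sketch below is only an illustration. The `gate` function, the `plausible_order` rule, and the 100,000 USD ceiling are hypothetical choices, not anything described in the episode.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]

# Hypothetical ceiling for a single order; a real threshold would come from the business.
MAX_PLAUSIBLE_ORDER_USD = 100_000.0


def plausible_order(record: Record) -> bool:
    """Accept only orders whose total is within the assumed plausible range."""
    total = record.get("total_usd")
    return isinstance(total, (int, float)) and 0 <= total <= MAX_PLAUSIBLE_ORDER_USD


def gate(
    records: Iterable[Record], is_valid: Callable[[Record], bool]
) -> Tuple[List[Record], List[Record]]:
    """Split records into (promoted to production, held in staging)."""
    promoted: List[Record] = []
    held: List[Record] = []
    for record in records:
        (promoted if is_valid(record) else held).append(record)
    return promoted, held


orders = [
    {"order_id": 1, "total_usd": 59.99},
    {"order_id": 2, "total_usd": 40_000_000.0},  # likely a keying error
    {"order_id": 3},                              # missing total
]
promoted, held = gate(orders, plausible_order)
```

Nothing is deleted here: the held records stay available in staging, so a genuine but unusual value can still be released after review.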
0:23:15.8 TW: Yeah, yeah, we had a client who had a customer service rep key in a $40 million order, and it triggered a measurement protocol hit and so now that’s in… So it’s like, “Oh, now how do we get that out of there?” It’s like, “Well, whatever.” But those kinds of things happen all the time. It’s just that’s very common. So let me lay out where I’m trying to go with this, because I think the first thing is sort of like, how do we categorize them, what category should people think in in terms of like where I should go set up data contracts, and then the follow-up from that is, how do I as an organization start to think about tackling data contracts, should I go in and do an entire review or should I start with one team and progressively build like how should we go in and tackle this? That’s sort of where I’m headed with this line of thinking.
0:24:09.9 SM: Yeah, I do think it’s initially understanding where you have critical downtime of data products that’s affecting the bottom line, right? And this is where now I think the data contract conversation is getting more nuanced. It’s unlikely you’re implementing this across your entire data environment, there’s a lot of room where analysts in data science just expect to be doing exploratory prototyping work around data. But I think once you get on this spectrum of production, you’re actually starting to look at where you can draw the line and say, “We need… It’s imperative for us to manage the quality of this data product and to have a higher standard than we do right now for our other products.”
0:25:01.8 SM: I think one of the… Some of the things you have to deal with here is, can you put in place the right balance of incentives, so data teams have to think about demonstrating why this is valuable, what tools they’re going to use to make it easy and build in the feedback loops, I mentioned earlier like where is the cost of slowing down less than the cost of the downtime you’re experiencing today, and then where are you… Where is it acceptable to essentially interpolate and extrapolate data where it’s missing or deal with data in more of a prototyping sense until you find that production standard algorithm that’s going to drive the business.
0:25:48.5 TW: It’s all trade-offs. It seems like it is the tough… The tough case is, it’s almost every kind of data quality or data reliability thing, is that if it’s going really well, there’s a cost for it to go well, because and you’re not feeling the pain, like if it starts to slip, it’s a gradual decline. Even if you have things in really good shape, there can still be a critical data failure and you have people pointing fingers of saying, “I thought we were investing in these data contracts, so that that wouldn’t happen?” So it seems like there’s still a level of the risk and return, it’s gotta be diminishing returns, if you get down to where you’ve got a data contract for the data that’s tracking how often the toilets are flushed in the office, which nobody actually cares about, the cost is way high, but it’s data no one’s using. So it does seem like taking that…
0:26:47.2 SM: Yeah, there’s a relevant concept in software engineering around as much reliability as required and no more, and I think that’s what you have to look at here around your analytical data, around your machine learning data, around your critical decision-making data is what’s the level of reliability you need. And when does it go beyond the point where you actually need to put in place a preemptive contract that prevents changes making it into production.
0:27:21.1 TW: And they’re like is every… Are software engineers completely. That idea of is that the analogy to draw with them, is that common in software engineering that they would say, “Oh yeah, yeah, of course, I wanna have as much reliability as required,” is that part of the discipline of learning software engineering is to not over-engineer?
0:27:42.6 SM: Well, I think this is the SRE concept that’s kind of taken hold over the past 10 years, and most tech organizations these days have an SRE org that’s working with the teams on reliability and resilience of their systems, and this is just something that I think is now making its way to data, but has been around tools like Datadog that are used heavily in organizations to ensure the uptime of applications and systems, but again, they’re looking at what is the minimum reliability that’s useful for this product because that is effectively a budget, and I need to be able to spend as much time as required to ensure that level and no more, and I think there is a parallel to data teams where they’re starting to think about how do I ensure there’s reliability so I don’t lose trust in the data, so I don’t lose revenue for the business, so I don’t lose reputation, but what’s the minimum I can spend on that for each data product I’m building.
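The "reliability as a budget" idea is often expressed in SRE practice as an error budget: a target below 100% implies an amount of downtime that is acceptable over a window. The sketch below works through that arithmetic for a data product. The function names and the 99.5% target over 30 days are assumptions chosen for illustration, not figures from the episode.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed downtime in the window for a given target (0 < target <= 1)."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the interval (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Budget left after observed downtime; negative means the target was missed."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# Assumed example: a 99.5% availability target over a 30-day window.
budget = error_budget_minutes(0.995, 30)          # roughly 216 minutes
remaining = budget_remaining(0.995, 30, 300.0)    # negative: budget exhausted
```

A team with budget left can keep shipping changes quickly. A team that has overspent it has a concrete reason to slow down and invest in prevention, which is the trade-off discussed above.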
0:28:53.7 JC: Right. And I guess also the other thing I’m thinking about is from the analytics side, you also still wanna keep some flexibility because if you put too much structure around it, you lose that ability to explore quickly and iterate quickly, which sometimes a lot of what we’re doing is that. So I guess it’s not like you wanna have this really locked down everywhere across the board because you lose some of that iteration speed. Is that fair?
0:29:22.4 SM: Yeah, I think the promise of data contracts and the only way they really work is if they are iterative, but I completely agree with you, Josh, that a lot of the times and I’ve been an analyst in various forms, could we predict all of our requirements on the data upfront? I’m not sure.
0:29:44.7 JC: Not always.
0:29:45.2 SM: We’re often experimenting with the right data product for the right use case and what that required, and so having this sort of iterative relationship where you don’t have to nail things down is useful for analysts.
0:29:58.5 MH: So that actually makes me think we’re talking, we’re framing it as, there’s this contract and that helps the software team make sure they’re not introducing breakage, but as I’m thinking about it, the depth and robustness of the contract, I can also see analysts saying, Oh, I found this other field. It got added, it didn’t violate the contract at all, it was another column that was added, and I started tinkering around with it and found it useful, but if I as the data consumer didn’t realize, “Oh, wait a minute, this needs to work into the contract,” it seems like the data consumers or the producers of the data products also need to have some ownership of the contract, of whether everything that they’re doing is covered by the contract as well.
0:30:54.9 SM: Yeah, I should be clear, they will have data that’s more development, it’s not part of a contract that can naturally be picked up and experimented with and just doesn’t have the same level of requirements around it. One area, just for an example, I’m working with a bunch of customers that are implementing forms of data mesh, and of course they’re looking to enable data product development at scale across many teams, and where they’re seeing the need for contracts is any fields that run across domains. And so rather than having two different teams go and implement a different schema for the same type of data, they’re actually locking down contracts with those teams specifically around those cross-domain fields and the requirements for them, so they have a single format, a single schema. Common semantics for that.
0:31:51.6 MH: So but does that mean… Where does that common schema then live, in what part of the… I’m not deep on data meshes, but that common schema, does that fall on the software team or is that kind of as it’s getting federated, that it’s saying, “We’re gonna federate this out and keep a common schema that serves different domains’ needs,” or is that the same thing?
0:32:16.5 SM: I guess if I’m answering this, your question, put more simply, the software engineer owns the contract, they own the SLAs around that contract and they own the way it’s delivered, they sign up for it, but in these data mesh instances, it’s often this kind of governing team that’s actually determining what those requirements are for cross-domain fields, so they’re actually playing the role of consumer there for the org and defining the requirements.
0:32:51.2 TW: And it does seem like in any data org, you’re gonna have layering of this to a certain extent, ’cause if that analyst or data scientist becomes a consumer of the data that software engineers are creating and making available, they themselves may then be creating data products that are out there, and so there’s another layer of data contract that may need to go into effect. How is that managed? And I think maybe you’re talking about that with the data mesh concept in a certain sense that as teams are working across, they need a layer up. How is that governed typically?
0:33:27.7 SM: Yeah, I think in data mesh commonly we’re seeing it take the place of or take the form of SLOs more so than contracts at this stage, I’m not saying it won’t go to that level, but data teams are effectively putting in place SLOs around the data products that they’re creating, ’cause they’re part of this interoperable mesh of tables or data products that need to be consumed and need to be trustworthy, and so the way that they’re building that trust is by accountability, expectations and those expectations might take the form of a service level objective that they can communicate out to the organization, and then communication over data incidents when they occur, so you have… Some of our customers effectively have like a data marketplace where they can show an indicator light of the current health of a data product and whether it’s ready to use.
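The "indicator light" on a data marketplace can be thought of as a small function from an SLO and the latest observation to a status. Below is a minimal sketch for a freshness SLO. The `health_light` function, the three colors, and the rule that yellow means "late by up to one more SLO period" are all assumptions for illustration, not a description of any customer's marketplace.

```python
from datetime import datetime, timedelta, timezone


def health_light(last_updated: datetime, now: datetime, freshness_slo: timedelta) -> str:
    """Map the age of a data product to a status light.

    green:  updated within the freshness SLO
    yellow: late, but by no more than one additional SLO period (assumed rule)
    red:    later than that
    """
    age = now - last_updated
    if age <= freshness_slo:
        return "green"
    if age <= 2 * freshness_slo:
        return "yellow"
    return "red"


# Assumed example: a table that should be refreshed at least every 6 hours.
now = datetime(2023, 1, 10, 12, 0, tzinfo=timezone.utc)
slo = timedelta(hours=6)

fresh = health_light(datetime(2023, 1, 10, 10, 0, tzinfo=timezone.utc), now, slo)
late = health_light(datetime(2023, 1, 10, 3, 0, tzinfo=timezone.utc), now, slo)
stale = health_light(datetime(2023, 1, 9, 12, 0, tzinfo=timezone.utc), now, slo)
```

Publishing the status next to each data product gives consumers the accountability and expectation-setting described above without requiring a formal, enforced contract.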
0:34:32.0 MH: So we keep saying talking about sort of the data mesh world, I’m gonna steal something Josh had noted before we started, which do data contracts make sense, there is again, modern data stack data mesh, these ideas are kind of like that’s sort of the future, but they’re I would assume that still the preponderance of organizations, or there are many, many organizations that are kind of not yet modern and have a little bit more monolithic, they still may have… There’s still gonna have multiple systems pushing into a data warehouse, what are your thoughts on, does it makes sense to get a data contract in place there and say, “Sure, we may be going to a more federated modern architecture in the future, but if I get a data contract in place now that’s probably gonna help with that transition,” is there still value… Feeds have been getting broken going into data warehouses for 20 years, so does it make sense to do data contracts, not in kind of the sexy modern data stack world?
0:35:40.9 SM: Yeah, for sure. I think the reason we sort of land on talking about meshes is it’s almost this kind of idea of scaled product development and this interoperability, which means that there’s almost a cumulative effect of the problems in data, if you’ve got multiple teams engaging across data products without necessarily knowledge of the owner directly and their need to build trust in that system, but I think the concept of data contracts doesn’t have this direct link to data mesh, and even the example I referred back to of my experience at the New York Times, we were owning these core data products centrally, we had some teams that were operating with meshy principles, but a lot of the core data products, the most critical and valuable data products were managed centrally, and this is quite common in organizations, and that’s where you have this higher standard of proactive enforcement on the schema.
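[Editor's note: To make "proactive enforcement on the schema" a little more concrete, here is a minimal Python sketch, assuming the contract is nothing more than a mapping of required field names to expected types. The `CONTRACT` fields and the `contract_violations` function are hypothetical, invented for illustration; the episode does not describe a specific implementation.]

```python
# Hypothetical contract: required field names mapped to expected Python types.
CONTRACT = {"user_id": str, "event_type": str, "amount_cents": int}


def contract_violations(record: dict, contract: dict) -> list[str]:
    """List the ways a record breaks the contract (empty list means it conforms)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return problems


# Made-up example records from an upstream producer.
good = {"user_id": "u1", "event_type": "purchase", "amount_cents": 1299}
bad = {"user_id": "u1", "amount_cents": "12.99"}
print(contract_violations(good, CONTRACT))
print(contract_violations(bad, CONTRACT))
```

[Running a check like this in the producer's build or deploy pipeline, rather than after the data lands, is what would make the enforcement proactive instead of reactive.]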
0:36:50.8 MH: So now, and maybe this was just a fad that passed, or maybe it’s still there, there was definitely a period where data lakes were kind of the… It’s like playing Whac-A-Mole. But the idea of like oh, just have your systems throw data into the data lake that’s gonna be a data swamp, whatever, and then we’re kinda gonna get it out of there and clean it. Is that intermediary step one, I guess I’m just not looped in enough to know if data lakes had their day and people have realized that was creating more problems than it helped, but if they are still kind of part of the… An option for architecting an overall system, then does the data contract fall to where it’s actually what’s going into the data lake, ’cause it always seemed like data lakes were talked about just like throw it in there, which feels not very contract-y, but that then breaks your link between the consumer and the producer.
0:37:50.4 SM: Yeah, I think if we talk about data lakes there’s both probably the culture or the how it came about and also the technology, and I think the technology has stuck around. What we’re dealing with in cloud lakes is the step beyond what we had with Hadoop and sort of S3 and how we were managing data lakes in the past, but this idea that you could just drop all your data into the lake and then do all the transformation later, I think…
0:38:25.0 SM: I think this is why we’re seeing both data mesh and contracts emerge as better philosophies to manage trustworthy data products, ’cause data teams were spending the whole time, you know, cleaning data and dealing with the fallout from everyone just throwing it in the lake. So I think mesh even positions itself as a move away from that paradigm of the monolithic, throw-it-all-in-the-lake to data producers taking responsibility over the data they’re producing and the data products that result from that.
0:39:06.6 MH: Man, it’s confusing.
0:39:08.0 TW: It’s just a bigger version of collect everything on your website, right… Just throw it all in there. We’ll figure it out someday.
0:39:19.2 MH: Yeah, that was very helpful. I say it’s all confusing… That was like a broad, entire-topic confusing. That was actually a very, very helpful and clarifying explanation, so it’s my inner voice coming out. So this is kind of looping all the way back to the… Talking about the New York Times, and we’ve talked a little bit about prioritizing where there are data contracts, and you were kind of noting financial data, and all of a sudden this thought bubble of like, oh, wow, this is like Sarbanes-Oxley coming up, which is a US-centric thing, but when it comes to that critical data, it does seem like this is just a little bit of an idle thought that we don’t…
0:40:03.3 MH: It seems like people sort of figured out Sarbanes-Oxley, I think it’s still there, there are still people, executives on the hook for saying this data is good and valid, that feels like a way also to get some buy-in, at least on some of the data sources, to say this is gonna add to your confidence if you are legally being required to sign off on certain numbers. If you step down from Sarbanes-Oxley, well then what’s the data that’s being reported to the board, maybe that’s not federally regulated, but you’re still gonna be in hot water if it’s wrong. So do you see that kind of a corner office kind of buy-in to say, we really need to know at least these core data products and the core data within them that we’re looking at, we are willing to invest in this sort of cross-departmental thing of a data contract to make that data more robust and reliable?
0:41:06.1 SM: Yeah, I actually, I think with the case of financial reporting, the case to the software engineering team that runs those upstream systems is pretty clear, ’cause if they’re rolling out changes that are affecting the downstream data model, they are the ones hearing about it and having to resolve the issues and on the hook, the same way you are as a data team for financial reporting, and so driving that change with those teams, I think can be pretty straightforward.
0:41:37.8 SM: I think as you get into murkier territory where you’re working with the team that owns the app or the team that owns part of the website that are emitting event data, that’s then being used for machine learning models that create some value downstream. You have to kind of weigh up, then, does that value created give you a sufficient seat at the table to actually say, we’re going to proactively put in place a contract that will potentially slow down these engineering teams in launching features, or at least make them iterate the contract and treat it with the same importance as another feature that’s launching.
0:42:23.3 MH: Alright, well, such is our “data contract” with our listeners that we do have to start to wrap up. It’s not…
0:42:31.7 TW: No. It’s not a stated contract.
0:42:33.2 MH: I know it’s not a data contract.
0:42:33.5 TW: These are not funny. These are not funny. Damn it. Analytics engineering got taken over. It’s not funny. You can’t make jokes where you take a well-defined term and apply it elsewhere.
0:42:44.1 MH: Nobody let you… Tim doesn’t…
0:42:46.4 TW: I don’t see any humor in that.
0:42:48.8 MH: He sees no humor in that. Alright, Shane thank you so much, this has been a super engaging conversation for us, I think, because as you could tell, we’re trying to wrap our heads around this, and I think this has actually been quite helpful, at least for me. I think Tim’s still pretty lost, but I learnt a lot. So thank you for that.
0:43:06.3 TW: Wait a minute. I think I can do both. I think I learnt a lot and I’m still pretty lost.
0:43:12.3 MH: Josh learnt the most. So, Josh, you’ll carry this forward and maybe it’s too late for Tim and I.
0:43:19.7 MH: No, but listen, one of the things we definitely gotta do is… Well, we don’t always do, which is go around and do a last call, something that our interest might find… Our audience might find interesting, and Shane, you’re our guest, do you have a last call you’d like to share?
0:43:36.6 SM: Sure. This is maybe a little off topic, but I wanna…
0:43:40.9 MH: No, it’s A-okay.
0:43:41.0 TW: That’s encouraged.
0:43:44.0 JC: Yeah.
0:43:45.9 SM: As everyone’s been talking about and experimenting with ChatGPT, I thought I’d reference artist Nick Cave’s takedown of ChatGPT on his website, The Red Hand Files. I don’t know, has anyone seen that?
0:44:00.4 MH: Oh nice. I have not.
0:44:02.3 TW: I haven’t.
0:44:02.4 SM: So just to give a brief synopsis, a fan sent him a song saying, this was written by ChatGPT in the style of Nick Cave, and in his response, he makes some really valid points about limitations of AI to be genuine or original, but he also just has some great lines in there like, data does not suffer, so it’s worth a read I’d point you towards it.
0:44:29.7 MH: Nice. That is awesome. And yeah, that is definitely top of mind, at least in the Twitter world I live in.
0:44:37.4 TW: I have a cousin who’s, a little aside, second cousin, who she does some content marketing and we were talking and she was like, “Yeah, but it’s changing the game, like I’ve gotta put prompts in to generate this content,” I’m like, “It’s just… But why, like why?” The content marketers’ use of it to try to do an SEO battle just kills me. I’m like, so you’re saying you just wanna throw more crap out there and hope that you bubble up. She’s like, “I use this tool that takes and tells me what the top ranking post, what they use, and it writes a better post and it’s an arms race.” I’m like, “No, it’s actually generating nothing adding value to society. You’re just creating”…
0:45:18.4 MH: Nice.
0:45:20.0 TW: Sorry.
0:45:20.4 TW: Another rant… Cut me on that, I’ll stop.
0:45:21.1 MH: Well, Tim, since you’re on a ranty moment, why don’t you share your last call.
0:45:27.5 TW: So there will be no AI, no ChatGPT: MeasureCamp North America, coming up shortly on March 11th, on a Saturday. So it is North America, but it also is virtual, so there is no travel required. Depending on which hemisphere or where you are, maybe it’s not practical to attend, but there are tickets. That has now, I think, gotten enough of a foothold… This is, I don’t know, their fourth or fourth one maybe, I’m not sure exactly, but I feel like it’s got the staying power, and I will be there on Saturday, March 11th…
0:46:06.0 MH: Nice.
0:46:07.5 TW: Learning from real humans who are not just auto-generating content…
0:46:12.9 MH: Now I’m committing to a session where I’m gonna auto-generate at least one slide to see if you notice, but I actually got a ticket too, so I’m gonna be there also. Nice. Alright, Josh, what about you? What’s your last call?
0:46:28.2 JC: Hey, yeah, so I saw this one, I think it actually came out a couple of years ago, but I thought it was pretty interesting, hownormalami.eu. And it’s a website. Yeah, did you see this one?
0:46:41.3 TW: Yeah, it’s out of… It’s a government… It’s like someone in the EU that’s actually… It’s a government…
0:46:47.1 JC: Yeah, it is, it is. They’ve got the logo on there, and I’m not sure exactly how, maybe they provide access to the data or funding or something, but… Yeah, basically what it is, it’s this artist, technology critic, privacy designer, I’m gonna butcher the… Pronouncing the name, but I think it’s Tijmen Schep, put this together and it’s kind of like an in-your-browser, real-time documentary that uses your webcam and tracks your interaction with the site, and what it does is it uses facial recognition to make judgments about your emotional state and things like, it guesses your age, your BMI. It has an algorithm that guesses your life expectancy, which, it hooks you with this sort of entertaining thing. But as you go through the documentary, it sort of gets into this deeper message about biometric tracking and how this can be used to sort of threaten privacy in some pretty creepy ways. So it’s super entertaining. I shared it with all my friends when I saw it, so I thought if anyone missed that one, give it a look ’cause it’s pretty entertaining, and it’s also got that educational piece to it as well, so… Yeah, hownormalami.eu.
0:48:05.2 TW: I remember we did that on-site with a client who had not heard of it, which was kind of interesting watching a client sitting in the conference room and go through it, and luckily, it did pick her as being younger by a few years than she actually was…
0:48:19.9 JC: Yeah, yeah.
0:48:19.9 TW: Then it kinda hit some other things and she was like, wow, so…
0:48:23.4 JC: Yeah.
0:48:23.4 TW: Yeah.
0:48:23.9 JC: It’s pretty accurate though. For me, age it nailed. I mean life expectancy, I don’t know. But yeah, it’s pretty cool.
0:48:32.6 TW: Does it rate the excellence of your facial hair, ’cause you’d just break the system.
0:48:38.6 JC: Yeah, no doubt. Maybe not as much training data available.
0:48:44.2 TW: What about you, Michael?
0:48:45.4 MH: Right, well, in the spirit of all things ChatGPT, another big topic on that line is this idea of prompt engineering, which is sort of emerging as sort of like a new specialization out there, I guess, in terms of the people who are good at getting the AIs to reply back with things. To that end, Microsoft has been working on this and they recently released a paper basically outlining how they are trying to use extensible prompts to sort of create more inputs than a natural language model would actually produce. In a certain sense, it’s sort of like, instead of saying a bunch of things, you sort of combine all those things into one fake word, and that fake word drives the response of the language model. Anyways, it’s kind of interesting. I think there’s also some interesting potential challenges or problems with that as well, so we’re just in a very interesting time with all things like ChatGPT, AI, all those things, so it’s interesting to see on the front edge of that, but we’ll link that paper in the show notes, so people can take a look at it.
0:49:50.4 MH: It’s kind of neat because they basically had a fake word that would make the model respond in the voice of Satya Nadella, the CEO, and then one that would respond in the voice of Sheldon from ‘Big Bang Theory,’ and they read exactly like you think they should read based on those extensible prompts. So it’s kind of nifty what they’re doing there. Anyways, interesting paper. Alright, thank you so much again. Shane, pleasure having you. This topic is one that I think is only gonna get more and more important. So we feel really lucky to be able to get a chance to ask you a lot of questions and have you come on. So thank you again for that.
0:50:29.1 SM: Thank you. It’s been fun.
0:50:31.8 MH: And as you’ve been listening, you’re probably like, wow, I wanna talk to you about data contracts, and also chastise Michael for trying to make jokes about it that don’t quite match up and all those things, and you can… You can reach out to us. So the best way to do that is through the Measure chat Slack group, or on Twitter or on our LinkedIn page. So we’d love to hear from you, mostly good, but if it’s bad, just reach out to Tim and… No, I’m just kidding. And it’s very fun for me to say that, hey, no show would be complete Josh, without thanking you for all you do behind the scenes, and now you get to do it in person right to your face, so that’s pretty awesome.
0:51:11.1 JC: Happy to help.
0:51:12.3 MH: And Moe, we miss you and we’re excited for what’s going on in your life right now, but I know that no matter the size of your data org and whether you’re in a lake house or a mesh, just… I think I speak for both of my co-hosts, Tim and Josh, when I say, keep analyzing.
0:51:34.2 Announcer: Thanks for listening. Let’s keep the conversation going with your comments, suggestions, and questions on Twitter @AnalyticsHour, on the web at analyticshour.io, our LinkedIn group and the Measure chat Slack group. Music for the podcast by Josh Crowhurst.
0:51:51.6 Charles Barkley: So smart guys want to fit in, so they made up a term called analytics, analytics don’t work.
0:52:00.4 Kamala Harris: I love Venn diagrams. It’s just something about those three circles and the analysis about where there is the intersection, right?
0:52:09.0 MH: Get rolling. I’ll give us a five count and then we’ll jump into this. Here we go, in five, four…
0:52:17.8 TW: Good, I’m glad you checked to note that I just started recording, so I was like, “Oh boy!”
0:52:23.9 JC: Yeah, it was tight.
0:52:25.2 TW: Well, I’m rolling.
0:52:25.8 JC: No outtakes.
0:52:26.7 MH: I’ll give us a five count.
0:52:27.3 TW: Josh isn’t going to have much to work with. Okay. Alright. Yeah.
0:52:28.0 TW: It’s recording and it looks good.
0:52:33.3 MH: Sorry. I’m used to sort of having started recording about two minutes prior to this, so I just…
0:52:37.3 TW: Yeah, but I just felt that I was gonna say things that were embarrassing and I tried to not do that when I’m gonna arm Josh with the serious stuff…
0:52:44.7 MH: Yeah, yeah. Fair enough.
0:52:45.2 MH: Awesome. Okay, here we go. This time for real. In five, four…
0:52:51.1 TW: Now, you’re just triggering me that I have questions about digital transformation. I have no idea what that is. I don’t understand, but it’s always like… It seems like it’s talked about in the context of data and analytics, I’m like, but if it’s a digital transformation…
0:53:06.8 MH: If you’ll stop recording Tim, I’ll tell you what it is.
0:53:14.6 TW: Rock flag and let’s get data meshy.