#160: Data Reliability and Observability with Barr Moses

You know that sinking feeling: the automated report went out first thing Monday morning, and your Slack messages have been blowing up ever since because revenue flatlined on Saturday afternoon! You frantically start digging in (spilling your coffee in the process!) while you’re torn between hoping that it’s “just a data issue” (which would be good for the company but a black mark on the data team) and that it’s a “real issue with the site” (not good for the business, but at least your report was accurate!). Okay. So, maybe you’ve never had that exact scenario, but we’ve all dealt with data breakages occurring in various unexpected nooks and crannies of our data ecosystem. It can be daunting to make a business case to invest in monitoring and observing all the various data pipes and tables to proactively identify data issues. But, as our data gets broader and deeper and more business-critical, can we afford not to? On this episode, we were joined by Barr Moses, co-founder and CEO of Monte Carlo to chat about practical strategies and frameworks for monitoring data and reducing data downtime!

Articles, Research, and Other Interesting Resources Mentioned in the Show

Episode Transcript

[music]

0:00:04.4 Announcer: Welcome to The Digital Analytics Power Hour. Michael, Moe, Tim and the occasional guest, discussing analytics issues of the day and periodically using explicit language while doing so. Find them on the web at analyticshour.io and on Twitter, @analyticshour and now, The Digital Analytics Power Hour.

0:00:27.7 Michael Helbling: Hi everyone, welcome to The Digital Analytics Power Hour. This is episode 160. Most of us drive around in our cars without ever thinking about the horsepower to each wheel or the fuel-to-oxygen ratio as you move up to the gears. Well, that’s because for most drivers, performance is just kept within a certain threshold but jump over to car enthusiasts trying to eke out maximum performance out of their vehicles and suddenly you have laptops full of data and dyno machines and purpose-specific tuning for engines and every aspect of the vehicle. Its engine, tires, suspension, etcetera, is being brought together and measured to achieve the best possible result.

0:01:10.7 MH: What if in the field of analytics, we never looked under the hood of our performance engines? Well, when we try to hit the accelerator, we might find out that our acceleration was sluggish or non-existent. Worse, we wouldn’t have the first clue where to look for the problem. What we’re talking about is tracking our data so that our analytics platforms can perform. For any business, our data and analytics capability has to be at peak performance at all times, which means we need to be able to understand where our systems are limited, what we can do about it and how to fix it. Hey, Moe, do you know anything about the Aussie burnout scene at all?

0:01:49.7 Moe Kiss: I know a little bit about how to avoid it.

[laughter]

0:01:53.3 MK: A lovely festival in Canberra called “Summernats”, which I used to do my best to get out of town for.

0:01:58.6 MH: It’s so strange that there is this connection of data between car enthusiasts and data enthusiasts and yet mainly culturally, we don’t always align so well.

[chuckle]

0:02:08.1 MH: Tim, you ever get under the hood and tune any of your vehicles?

0:02:13.0 Tim Wilson: Occasionally. Funny enough, I wound up just connecting with a guy who used to work for me years ago, who was an autocross guy and literally out of the blue, we had a little bit of an exchange about that but I think this might have been one of your longest wind-ups to a topic in the history of the show.

0:02:30.9 MH: Yeah, well, I do watch a lot of YouTube videos about cars and car tuning. So, yeah…

0:02:37.0 MH: I gotta make the link, I gotta make the link to analytics.

0:02:40.2 MH: Well, it just sort of seemed to fit because that’s what we’re talking about but we do need a guest, somebody who can kind of make this connection a little stronger and dig a little deeper into what’s holding back analytics performance at most companies.

0:02:54.3 MH: So Barr Moses is the CEO of Monte Carlo, a data reliability and observability company. She is a member of the Forbes Technology Council, was previously a VP at Gainsight and a management consultant at Bain & Company and today she is our guest, welcome to the show, Barr.

0:03:11.4 Barr Moses: Hi everyone, great to be here.

0:03:13.5 MH: Yeah, it’s great you could be here. So does that analogy hold up for you?

[laughter]

0:03:18.5 MH: Or is that more of a personal question?

[chuckle]

0:03:24.1 BM: No, for sure, at Monte Carlo, we’re big Formula One fans over here, so it definitely resonates.

0:03:29.9 MH: Yeah, that’s another place that’s awash with data, right? Formula One, oh my gosh but yeah, let’s maybe start with just breaking down data reliability, observability more generally, ’cause I think that’ll just be a good jumping off point to kind of what that is. It’s not something that I think gets talked about a lot in the analytics industry, so it might be good for our audience to just get a little bit of a background.

0:03:56.5 BM: Yeah, for sure. I think the concept of… We kinda call it “data downtime”, if you will, as a corollary to application downtime and I think it really starts out from this change that we’re seeing in the last decade or so, where in the past, applications were really driven by logic largely and today more and more applications are driven by the data actually that’s feeding them and so when companies are using applications or data, whether to make decisions or to operate their companies, the importance of that data being accurate and being reliable, it becomes obviously way more important.

0:04:32.8 BM: The other thing that we’re seeing is that more and more people are using data in real time for real-time decision making and so in the past, if you had just a handful of people looking at small amounts of data maybe once a quarter or so, you could really get away with some issues in the data or data arriving late, ’cause you had a lot of time to make up for it, right? But today, you have organizations where literally everyone, from marketing to customer success, to sales, to product, is using data to make decisions. We can’t really afford to continue to have inaccurate data at the rate that we’ve been used to and so I think because of that, this issue is becoming more and more prominent.

0:05:15.5 BM: I really like your example of a car actually, because I feel like what we’re doing right now, so that the car… How you think about performance, what we’re doing right now is kind of crazy. It’s as if we… We’re putting together… Think about a car factory or putting together the car and all the parts but then the only time that we know if it’s working is when we actually drive the car 60 miles per hour and suddenly all the windows burst open and just everything blows up at 60 miles per hour. You’re like, “Well, if only we would have checked things upstream or put some mechanics in place to make sure that the inputs are accurate and then the process is good, we wouldn’t be in this situation.”

0:05:56.9 TW: Not to belabor an analogy and encourage Michael by any… But I grew up as the child of an engineer and he religiously, every time we filled up the gas, he would put in the date, the miles and how much gas he put in and then he would periodically go back and actually calculate the miles per gallon. This was the ’70s and I thought that’s just what you did. I thought everybody did that. I didn’t realize he was kind of… And I was like, when I figured it out, I was like, “Why do you do that?” He’s like, “Well, if the mileage starts to decline, that could be a symptom that there’s something wrong with the vehicle.”

0:06:33.3 TW: I don’t recall… I think he would actually get around to calculating the mileage maybe like once a year. So back to your kind of the frequency of it and for the first, I don’t know, 10 years that I was an adult, I was diligently tracking it as well and recognized that I never was actually checking the mileage. Whereas now, when you have that… You’re getting the mileage that you’re getting kind of in real time, if you’d say, “Gee, I’m starting to drive a wrong… ” Well I guess now, you just have sensors in the car that say, “You have a low tire or something.”

0:07:03.5 MK: But now you make so much sense as a human, like I understand you on a whole new level.

[laughter]

0:07:13.1 TW: I’m sorry.

0:07:13.4 MH: I’m still stuck back at Barr thinks my analogy was good. I’m kidding.

0:07:18.6 TW: Thanks everyone. See you next time.

0:07:21.6 BM: My dad actually just visited me last week and he was driving and he was on a five-hour drive and he called me to say that he ran out of gas. So there might be another lesson here too.

0:07:36.4 MK: So Barr, with this… I don’t know the word to frame it but this topic or people starting to care about observability, where have you found… What areas of the business are really driving this and making companies focus on it? Is it coming from the data teams? Is it coming from the execs?

0:07:57.1 BM: Yeah, so there’s a couple of different forces here, I think, to your point and again, I’ll also clarify what observability means in this context. So I think it’s not an obvious sort of term that we use today. So starting from sort of the pain of data downtime, of data being wrong, to your question, who actually feels that in the business? Is that something that execs feel or sort of engineers or data people? And the interesting thing about data downtime is it’s kinda like that sentence… I don’t remember how it goes, something like “Failure has many parents and success has one.” or something like that. There’s something around that. I’m probably misquoting it but I think the thing with data downtime, data problems is that, there’s no one who really wants to take some responsibility for it but everybody feels it.

0:08:45.8 BM: From the engineers who are responsible for setting up the pipelines to the data engineers who are maybe kind of munging data, or data scientists, to product managers, all the way to marketers who are using data in their marketing campaigns and executives who are looking at reports. So when data goes bad, actually, each of those folks might be feeling that. It is top of mind for them. The question is who’s actually responsible for resolving it and so, what we see from our customers is that oftentimes this topic comes from the tension between these organizations. When it is actually, oftentimes the data team that’s the last to find out about these issues, it’s the data consumers that catch them. So how often did it happen to you that you’re working on something or fray your morning on Monday and then you get your phone call from the CMO saying, “What’s going on with this marketing campaign? What’s going on with this report that I’m looking at?” And this throws your day and there’s an entire kinda race to find out who’s responsible for that pipeline, what exactly broke, pinging, if you’re lucky, 10 different people on Slack and figuring that out.

0:09:58.5 BM: And so in that tension, I think is where sort of this problem comes up and then the question is, how do you resolve it? The traditional ways that we’ve solved it are no longer sufficient. So in the past, we could basically say, “garbage in, garbage out” because there was one point in which we ingest the data and that’s it. So we just need to make sure that the data that we ingest is clean and we can call it a day but today, our architecture is way more sophisticated. Data can break anywhere along those lines and that’s where observability comes in. You actually, observability helps you get visibility into the health of your system and see it in real time wherever sort of data breaks.

0:10:34.7 MK: But do you think people got complacent along the way, because that scenario you described, I sat here nodding and nodding and nodding because that literally happens a couple of times a week at most organizations I’ve worked in and stakeholders start to get to the stage where they’re like, “Oh, did the data warehouse build work this morning, because all my numbers look off.” And stakeholders start to just be like, “Oh, it’s gonna happen a few times.” like they just start to expect it and it doesn’t necessarily mean that that’s a healthy thing, it just means people become complacent about expecting data to fail.

0:11:10.3 BM: Totally. I think there’s a couple of things. One, 100% of people have become complacent, for sure. It’s really hard that you would talk to someone and they haven’t experienced that. Also, by the way, it’s really kind of un-sexy to try to solve that. People are not excited about trying to solve… They wanna build something that’s core to their business and so it’s kind of easy to get complacent with that. I call BS on that a little bit. I think in today’s world, if we really want to get our organizations to be data-driven, we can’t really… We can’t cut that slack for ourselves anymore. We’ve been able to do it so far but if we’re headed in the direction that we wanna be headed, we’re not gonna be able to do that anymore. We’re just getting too many people, too many eyes on the data and it’s just its impact on business is becoming immense and so because of that, we have to rethink how we’re doing this.

0:12:02.7 TW: It does seem like… Well, one, I feel like the digital analytics industry has taken kind of a… Companies have cropped up with solving one little point solution and I frankly think not solving it well so I won’t name the three or four platforms. We’ll just crawl your site and make sure the raw data collection is happening but you’re… ’cause this is my big question is, you describe it as the system, all of those pipes and the pipes are running from the raw data into maybe the tool and then there’s a pipe from the tool into a data lake and then there’s a pipe from the data lake into three different data marts and so by the time somebody has found a problem… But it sounds like you’re saying well you do…

0:12:41.6 TW: You do kinda have to look at the whole system. By the time there’s a problem, the troubleshooting, which does tend to suck, although it does feel like organizations tend to have the common breakpoints and they’re like “Well, was it the data warehouse load again?” But then that also winds up competing with the people wanting to invest in that is the same as the challenges of technical debt that companies wind up with saying “We could spend $100,000 doing this cool new feature or we can spend $100,000 getting our house in order.” Which seems like it could be just refactoring stuff to remove technical debt, which just makes everything more robust and pays dividends in a hard unseen way down the road but does data observability also wind up in that?

0:13:26.6 TW: Say; well, we could spend $100,000 getting a really well-instrumented observability process platform system in place and people look at that and say “Yeah but I wanna go chase the cool new feature that I can go talk about.” are those similar in… I guess, do you run into those challenges? Is that where people push back?

0:13:46.7 BM: Yeah, for sure. It’s such a great question. I think, going back to the sort of corollary for application a little bit. Sort of, if you think about the best correlation here from observability for software engineering where it’s this is a very well-understood space. The best, or kind of an example would be like New Relic or AppDynamics or Datadog. I mean, you don’t really see an engineering team today operate without something like that and you’re right, there are some organizations who may choose their own home-grown solutions for that and invest in sort of what you’d call application or tech debt more generally but there aren’t that many. Oftentimes, you just use something that’s off the shelf.

0:14:27.1 BM: Observability in data is new. It’s not as well understood and I think we’re gonna get to a point where every data team will have something like New Relic but for your data and I actually agree with you that it doesn’t always make sense to build it in-house. It’s probably better to do off the shelf. I think in terms of kind of your questions on data debt and whether to invest that or not, that is 100% something that we see a lot across the board. It’s very hard to justify kind of investing in data debt. I think people also don’t have this negative reaction to investing in debt versus investing in the next cool feature, to your point and so I think the struggle for data organizations is how do they justify their activities in their investments?

0:15:09.9 BM: How do they tie that to real impact on the business? And I think observability in that perspective is easier to tie because it allows your organization to actually move faster and to focus on those new features because you have something else that kind of you can rely on to make sure your data can be trusted.

0:15:28.0 MK: So I imagine it is a similar process of a business case that you would do for anything at like… I know, for example, when our data warehouse goes down for a week, that means our attribution model has failed. That means our entire marketing team don’t know what they’ve spent or the ROI on that. It is actually very quick and easy to put a monetary figure on data failing but it doesn’t always seem like people… I don’t know. I feel like sometimes data people are not always great at making that case. I don’t know. I just see it as any other business case that you would do in the company, right?

0:16:01.2 BM: Yeah, for sure. So I think in some businesses, to your point, it’s very easy and very clear to tie that, to your point. Like the models aren’t working or oftentimes in organizations where data literally is the product. Think about an ecommerce website. I was speaking to a head of data at an ecommerce company and he was telling me, “If the data is wrong in our website, I’d actually prefer that our website would be down.” For example, this large retail company that’s moving their parts of the business into ecommerce and a common issue is coupon code issues where you’re literally giving the wrong discounts to customers. You might be overcharging or undercharging on a particular item that you’re selling. That’s real material impact to the business and so I think for folks who are in data, it’s quite clear, Moe, to your point. I think companies that are new to the data, potentially just moving to the cloud or figuring out how to use data as kind of a competitive advantage, obviously have to do more work to figure out the ROI, in that case.

0:17:08.6 TW: I’ll just say; it does seem like the case would be as you were saying the organizations that are relying on the data, the automation of dynamically making offers, dynamic personalization, those sorts of things, that if you say “Okay, that means that this is all the data that’s going into that model that is automatically deciding to do things.” If this data, if all of a sudden this flatlines, either I have to figure out that my model has to take into account for that and have built-in governance. Or I need to have a better kind of systemic solution so that the alerting system will say this is a problem and will know to sort of shut stuff down and not be throwing the wrong discounts out or making the wrong offers.

0:17:52.9 TW: So I wonder if it is… Whereas I’m looking at… If I’m an organization looking at monthly reports… Okay, so maybe the data was broken for the whole month but yeah, it feels like it’s more for when there’s active use and you need complete data and it needs to be detailed from multiple systems and if something goes wrong, you’re really screwed. That makes for an easier business case to make, and there are more and more companies that are operating there, than if you’re operating in the, well, let me look at the monthly report. It comes out two weeks into the following month. Like yeah, you’re probably not quite ready or you’ve got bigger fish to fry.

0:18:32.0 BM: Yeah, I think that’s right. If you’re looking at your report once a quarter. Maybe you’re not even a public company so you’re not reporting on it to anyone, it’s just internal. You have a lot of time and a lot of people. When I talk to folks in those categories, they typically tell me, “Yeah, we often have three to four people who need to look at every report before it goes out.” so you can afford to have three to four sets of eyes on a report to do that. To your point, the minute that it becomes either it’s reported to the street or reported to the board or it’s financial data that actually is being used as part of your model, as a part of your product, whether it’s to value a home or provide a loan or decide on a discount, etcetera, any of those uses that are more real time.

0:19:19.9 MH: I often use this kind of thinking as a marker for where an organization is in their analytics. Maturity is kind of an overused word but it’s indicative of is this an impactful program or a set of things that we’re doing or is it we’re just playing the same song we try to see everybody else playing because we think we’re supposed to be doing it? And so if you’re an analyst in that kind of an organization or in an organization, a really good test of does my work matter and do I have an exciting future here is basically how this kind of stuff is being treated. So just a pro-tip for everybody. If they don’t care about this stuff, they probably don’t care about anything else you do either.

0:20:00.2 BM: Amen to that. [laughter]

0:20:01.3 MK: Barr do you see, typically with companies… For context, the company I work at, I don’t feel like we’re totally shitting the bed in this space. I feel like…

0:20:13.6 TW: Is that where the bar is set?

0:20:16.2 MH: You’re only partially shitting the bed. [laughter]

0:20:19.8 MK: No but really I feel like we have a pretty good data team. People generally do care about data quality and the build breaking and stuff like that. All of the code that we release has tests built in place to make sure we know if things are gonna break before they’re released to production. We’ve just changed our whole data warehouse build from a daily to a weekly because we’re intentionally now trying to release once a week. We want everything to be tested in dev incrementally first and then we build, rather than like at the moment where our build fails a couple times a week because someone pushes something and we haven’t checked it. So I feel like we’re really doing the crawl and then walk and then run journey that probably most companies go through but I just wonder, in your experience, do you feel like most people do that, or do most companies go from crawling to running or straight to a tool, or do you feel like people try and solve this internally, they find that they don’t get the results they want and then they move to tooling? What do you see in the market?

0:21:25.5 BM: Yeah, so I’ll share with you a little bit about what actually got me interested in this space and how we got into this. So actually, I was at a company where I was responsible for a set of internal customer data and analytics and we started becoming more data-driven as a company and so we started becoming more sophisticated in how we’re using data and we had maybe a similar situation too, where I felt like our team is good. We care about data quality. We’re not completely shitting the bed, to your point, and yet data downtime still happened and I asked myself, “Am I crazy? Is the world crazy? Should I just accept this as is? What’s going on?” And actually, what I did was I ended up speaking with over 150 or so organizations ranging from very small start-ups to large companies like Netflix and Uber and Facebook, really folks who invested a lot in their data organization, and asked them first of all, is this happening to you all? Am I the crazy person here or does this actually happen?

0:22:29.3 BM: And two, what are you all doing about this? And I learned so much from that actually. I know I’m using the word maturity, but we put together a maturity curve based on these learnings. We basically plotted organizations where they are across this curve in terms of how proactive they are in addressing these and I would say that almost all organizations that were serious about their data find themselves treating this or addressing this problem in some shape or form. It could be in the very early stage where they’re very reactive about it, so just whenever there was a fire drill, there’s this SWAT team, everybody just jumps on it, to the very other end where they actually, Netflix for example, invested a lot in anomaly detection and different observability features that they’ve built in a more scalable solution, if you will and then in the middle there’s a couple other stages where folks might find themselves on that journey and so to your point on what are some of the things that folks do along the way.

0:23:31.4 BM: Testing is definitely a big one that comes up and that I think is very important to do. I will add that similarly to how in software engineering testing is one component but you still have monitoring and observability. I really think they go hand in hand, so it’s very hard to specify tests for everything under the sun, if you will, and so you want to augment that testing with strong monitoring and observability to help catch things that you probably might not catch, similar to how you do in software engineering.

0:24:06.3 TW: So how does the monitor, you mentioned at the end of the curve, anomaly detection. I can see in theory, you’re saying well, if we just put taps at all of the critical spots and then we look at all of the data and then we just look at all the points and say “This used to have 500 unique values and all of the sudden it’s coming in with two.” Is that… Is it kinda conceptually like you’ve mapped out the whole system and you figured out the points where you’re going to observe and your observing is looking for anomalies or is there… Back to the defining observability, what does that really mean?

0:24:49.8 BM: Great question. First of all, I’ll start by saying I do think that in order to solve this, it has to be end-to-end, so it has to work with your data stack, whatever that is, including sources, data lake, data warehouse and BI or machine learning to simplify that. I think a strong observability solution is one that connects end-to-end to your existing stack, so it doesn’t actually require you to replace anything, doesn’t require you to write in a specific language. It literally just connects to your existing stack. The other thing that’s important is that I think there needs to be a strong focus on security. When you’re connecting to… This is kind of like the heart and soul of an organization and so you need to have a strong security-first architecture for whatever you’re connecting to your data and then, to your point, what does it actually mean? What on earth are we observing?

[laughter]

0:25:42.3 BM: And we ask ourselves the same question. In software engineering, what you’re observing is very straightforward. It’s very well understood, whether it’s performance or storage or whatever it is that you’re measuring but in data, it’s not clear to us. We have different ways of thinking about that. So what we did is we compiled from these conversations that I mentioned we had and from our work, we compiled all the different reasons why data breaks and at first, we were like, “Data can break for so many millions of different reasons. [chuckle] What are we gonna do? How are we gonna get started?” But actually, we were surprised to see that everybody thinks that they’re a snowflake and everybody is a snowflake but there are some common patterns that we can match here and I’m not… This isn’t to say that with a great observability solution, your data will be 100% accurate. I don’t think that’s…

0:26:34.9 BM: Data will never be 100% accurate but in the pursuit of accuracy, you can get pretty damn close to excellence and so we came up with, based on kind of the work that we did, we consolidated all of these findings into five key pillars of data observability and from our experience, if you are systematically instrumenting and monitoring and analyzing these five key pillars, then you have a strong sense of the health of your data and can have that confidence. Those five pillars, the first is freshness, which you mentioned Moe that at some point your data warehouse was down. That happens all the time. You’re dependent on third-party data, oftentimes, or third-party infrastructure that might not be updating your data and so having a sense of the freshness of how up to date are your tables or your data assets is super critical. The second is volume, so pretty straightforward, like the completeness of your data tables. The third is distribution, so this is Tim, to your point, you were asking, “I have a field and it’s typically in a range between X and Y and today it’s in a very different range”… It could also just be other statistics about the data, percentage null values, for example.

0:27:49.5 BM: If you are used to having 0% null values in your credit card field and suddenly you have 70% null values in your credit card field, then obviously something is wrong. You’re not charging your customers, you need to get on that and then the fourth is schema. Schema is things like an audit log of who makes changes to your data and when. We talked about the different personas in data and across that end-to-end. Who’s making changes to your data lake and who’s making changes to a table that depends on that in your data warehouse and who’s making changes to the 10 other tables that are related to it? What kind of changes? And then the last one is lineage. Lineage really helps us answer questions like where exactly did the data break? How did it affect downstream dependencies? Maybe there’s a table that’s not getting updated but there’s literally no one watching, literally no one using that data, so who cares? [chuckle] But if there’s a table that… This is a basic example but if there’s a table that gets updated three times an hour and your entire C-suite is using it every morning at 7:00 AM and you need to make sure that they have data that is updated by then, that’s probably a data asset that you should have more monitoring and more checks on.
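To make the first three pillars a bit more concrete, here is a minimal Python sketch of freshness, volume and null-rate checks. It is only an illustration under assumptions: the function names, thresholds, timestamps and sample values are all hypothetical, and nothing here is meant to represent how Monte Carlo or any other product actually implements these checks.

```python
from datetime import datetime, timedelta

def check_freshness(last_updated, now, max_age):
    """Freshness: was the table updated recently enough?"""
    return (now - last_updated) <= max_age

def check_volume(row_count, expected_rows, tolerance=0.5):
    """Volume: is the row count within a tolerance of the usual count?"""
    return abs(row_count - expected_rows) <= tolerance * expected_rows

def null_rate(values):
    """Distribution statistic: fraction of values that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)

def check_null_rate(values, baseline_rate, max_increase=0.05):
    """Distribution: has the null rate risen well above its usual baseline?"""
    return null_rate(values) - baseline_rate <= max_increase

# Hypothetical snapshot of one table (all values are made up for illustration).
now = datetime(2021, 1, 4, 7, 0)
last_updated = datetime(2021, 1, 2, 15, 0)   # 40 hours before `now`
credit_card_field = ["tok_1", None, None, None, "tok_2",
                     None, None, None, None, "tok_3"]  # 7 of 10 values are null

results = {
    "freshness": check_freshness(last_updated, now, timedelta(hours=24)),
    "volume": check_volume(row_count=100, expected_rows=1000),
    "distribution": check_null_rate(credit_card_field, baseline_rate=0.0),
}
failed = [pillar for pillar, ok in results.items() if not ok]
print("failed checks:", failed)
```

In this made-up snapshot all three checks fail: the table is stale, far fewer rows arrived than usual, and the null rate jumped from 0% to 70%, which is the credit card scenario described above. Schema and lineage need metadata about changes and dependencies rather than the values themselves, so they are left out of this sketch.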

0:29:00.1 MK: Those five pillars are amazing, I love it. It’s like a really clean way of thinking about it but… So on the point of lineage, would you put in place different practices for say, like the table that you mentioned that the C-Suite used every morning at 7:00? There would be different standards of observability and monitoring applied to that versus the table that, you know, old mate built six months ago that they’ve never looked at.

0:29:23.9 BM: Yeah, I think there needs to be different levels of monitoring and different levels of observability and also different actions taken. For example, if there’s a data asset… Let’s say there’s kind of like… We can take a specific example, like a fintech company that has a report that the executive team is looking at every morning at 7:00 AM and that relies on thousands of different sources, so you need to make sure that by 6:50 AM, all of those sources have processed data and everything is up-to-date and if anything goes wrong, you want the right person in the organization to be notified ASAP.

0:30:00.5 BM: The actions you will take will be different, whereas if some table that’s not being used at all, perhaps… Maybe you would wanna know about that but maybe you can deprecate the table [chuckle] and not use it at all because if we have so many copies of tables… Maybe you can make better use of that.
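The idea of different actions for different data assets can be sketched as a small triage function. This is a hedged illustration only: the `TableInfo` fields, the `daily_readers` usage metric, the action names and the 10-minute lead time are assumptions made up for the example, loosely following the 7:00 AM report and unused-table cases described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class TableInfo:
    name: str
    daily_readers: int                      # hypothetical usage metric
    last_updated: datetime
    needed_by: Optional[datetime] = None    # e.g. a 7:00 AM executive report

def triage(table, now, lead_time=timedelta(minutes=10),
           max_age=timedelta(hours=24)):
    """Pick an action based on how the table is used, not just its health."""
    if table.daily_readers == 0:
        # Nobody reads it: maybe it should be deprecated rather than monitored.
        return "candidate_for_deprecation"
    stale = (now - table.last_updated) > max_age
    if stale and table.needed_by is not None and now >= table.needed_by - lead_time:
        # A critical asset is stale close to its deadline: notify the owner now.
        return "page_owner"
    if stale:
        return "log_warning"
    return "ok"

now = datetime(2021, 1, 4, 6, 50)  # hypothetical check time, 10 minutes before the report
tables = [
    TableInfo("exec_report_source", 40, datetime(2021, 1, 2, 23, 0),
              needed_by=datetime(2021, 1, 4, 7, 0)),
    TableInfo("old_experiment_copy", 0, datetime(2020, 6, 1, 0, 0)),
    TableInfo("team_scratch", 3, datetime(2021, 1, 1, 0, 0)),
    TableInfo("orders", 25, datetime(2021, 1, 4, 6, 0)),
]
actions = {t.name: triage(t, now) for t in tables}
print(actions)
```

The design choice here is that usage drives the response: the same staleness leads to a page for the executive report's source, a logged warning for a lightly used table, and a deprecation suggestion for a table nobody reads.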

0:30:20.9 TW: To do this… Does it start… And I’ve gone through this with a few clients and sort of found what seemed to be the most effective way to get at it… Getting an organization to even wrap their heads around what the data flows and the systems are and the… Lineage can’t start… You can follow one point all the way back through but are you trying to start by saying, we kinda have to understand the whole system, which… I’ll say the way that a colleague of mine put it was, “It’s a network diagram. You’ve got nodes, which are your data sources, and edges, which are the flows between them, and that is a very simple schema.”

0:31:00.4 TW: You have a table of your nodes and you have a table with your edges and you can put metadata against that and we’ve done that with multiple clients where they’re kind of like… We get diagrams in PowerPoint where they’re like, well, this is this part of the system and nobody’s figured out how to actually wrap their head around the entire infrastructure and then we wind up saying, well, the first thing you have to do to shore this up is actually capture the whole system and it seems like that would be the… Do you have to actually get a system, data system map, which is gonna be dynamic and fluid and changing before you start on this or do you start with a… Let’s find a critical thing that the executive team looks at and kind of work from there and then kind of expand out with the different kind of monitoring and observability. Or is it, it depends? ‘Cause you were a consultant, you were at Bain so you know.

0:31:55.8 BM: Yeah, I actually just recently read a blog post called, “Metadata is useless” so I’ll refer to that in particular. It’s a little bit strong but I think in the past, we had a tendency of collecting a lot of data and then not always making use of it. We just had tons of data and I think we’re a little bit at risk of doing the same thing with metadata. So we’re just collecting a lot of it and I’ll sort of include lineage under the category of metadata and so yeah, we see folks kind of invest actually years in putting together this map. Like, manual hard work, which is a serious undertaking for an organization, to map, to your point, the infrastructure end-to-end. I think those are oftentimes very important for compliance and financial reasons and all of that and the thing that I’ve seen work best, to your point, is when metadata is applied to a particular use case.

0:32:51.3 BM: So we can say, we can actually start with a particular area that’s either driving product, digital product or to your point, is being used by folks in the organization to make decisions and then use lineage to help drive the question of where is the data coming from? And can I trust it, right? So rather than generating a map that you look at in a meeting once a year, or that you look at and need to update, let’s put that map to use, right? So let’s generate a lineage that’s actually dynamic, that updates automatically. When someone deletes or adds a table, the lineage should be updating automatically without someone having to manually add that or edit that and let’s also marry that with information that helps us make decisions about the data.

0:33:36.8 BM: So for example, looking at a table knowing that it’s not up-to-date and then using lineage to understand what are the implications of that downstream so I really believe in applying metadata, lineage, whatever you’d like, for a very, very particular use case in a very, very narrow need for data consumers.
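
The nodes-and-edges view of lineage and the "what breaks downstream?" question can be sketched together. This is a rough illustration, assuming lineage is already available as a list of `(source, destination)` pairs; the table names are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each pair means "destination is built from source".
edges = [
    ("raw_orders", "stg_orders"),
    ("stg_orders", "fct_revenue"),
    ("fct_revenue", "exec_dashboard"),
    ("raw_customers", "stg_customers"),
    ("stg_customers", "fct_revenue"),
]


def downstream(edges, start):
    """Return every asset reachable from `start` by following edges forward."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    seen = set()
    queue = deque([start])
    while queue:  # breadth-first walk over the lineage graph
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# If stg_orders is stale, these assets are affected.
print(sorted(downstream(edges, "stg_orders")))
```

Starting from one stale table and walking forward keeps the exercise tied to a single use case, rather than requiring a complete map of the infrastructure first.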

0:33:55.9 TW: As opposed to the glorious, wouldn’t it be nice if we just had all this? I’m sure if we had all of this, it would take us five years to get there but man, then it would be so powerful and it’s like well, you’ll actually never get all of it and then nobody remembers what it’s supposed to be used for and it’s already out of date anyway. So that was a colossal waste of time. I like it.

0:34:12.9 BM: You got it. Yes.

0:34:15.0 MH: Oh, and by the way, we were collecting it wrong this whole time. We didn’t even get the thing we wanted.

0:34:20.7 BM: And I forgot why we even started this to begin with.

0:34:23.6 MH: Yeah. And also everyone involved quit three years ago.

0:34:27.5 BM: Exactly.

0:34:28.9 MK: Barr, I wanted to go back a little, because before we were talking about notifications and alerts and I realize that this is an area that engineering has solved fairly well but I’m thinking about it more from the “how do we get our data people to have that same mindset” angle, in that if you have too many alerts, people start to ignore them, if you have too many people on the alerts, there’s the bystander effect and no one takes responsibility for it. What is your best practice recommendation around that to make sure that when there is a really high priority build, it’s flagged with the right person and the right action is taken on it, versus like just getting 20 emails a day being like something’s broken, something else is broken, something else is broken.

0:35:20.2 BM: Yeah, alert fatigue is a real thing. In our work and our conversations, that has come up a lot. Actually, from day one, we’ve been thinking about that. To your point, I think there’s definitely a risk of sort of being inundated with information that you actually might not care about, right? And so the way that we’ve approached this is in a couple of ways. One is trying to, and this relates a little bit to sort of the question of metadata, add context to alerts so that you can quickly decide whether it is important to you or not, right?

0:35:58.3 BM: As an example, if you automatically set alerts for all of the tables that you ever had, including all of those that are not being used, you are very likely to be bombarded by alerts you do not care about but here’s the thing, you can actually know automatically, without your input, I can tell you which tables are being most used. I can tell you, here are the top 10 assets and you know what, they are being used every day by a large number of people and so you probably wanna know about most changes in these assets, for example and so that gives us clues or sort of strong hints into what are the parts of the data that you really want to know about.

0:36:36.5 BM: So that’s kind of like what I would say, we’re really trying to push how much we can automate here, based on metadata and statistics and query logs that we can use to inform alerts and to make them more practical and so there’s definitely way more that we can do just by observing your data. That’s first of all looking at who’s using the data and when and how often, how important it is, and then the second part is, there is a component where, you know, Moe, if you wake up every morning and you use this particular table specifically and you’d like to apply your own custom rule to that monitoring, that’s definitely an important part. I don’t think we can fully replace people with automation, for sure. There’s a lot that we can do here. I think we’ve actually done very little as an industry, on the automation of this but there’s definitely always a need for the specificity that people will know about their businesses that we add in.
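
One way to picture using query logs to decide where alerts belong is the rough sketch below. It assumes the log has been reduced to `(user, table)` pairs. The ranking rule (distinct users first, then query volume) and all names are assumptions for illustration, not a description of any particular product.

```python
from collections import defaultdict


def rank_tables_by_usage(query_log):
    """Order tables by number of distinct users, then by total queries, then by name."""
    users = defaultdict(set)
    queries = defaultdict(int)
    for user, table in query_log:
        users[table].add(user)
        queries[table] += 1
    return sorted(queries, key=lambda t: (-len(users[t]), -queries[t], t))


def tables_to_alert_on(query_log, top_n=10):
    """Only the most-used tables get alerts by default; owners can still add custom rules."""
    return set(rank_tables_by_usage(query_log)[:top_n])


# Hypothetical query log entries.
query_log = [
    ("ana", "fct_revenue"), ("ben", "fct_revenue"), ("cy", "fct_revenue"),
    ("ana", "dim_customer"), ("ben", "dim_customer"),
    ("cy", "tmp_old_export"),
]
print(rank_tables_by_usage(query_log))
print(tables_to_alert_on(query_log, top_n=2))
```

The rarely used table drops out of default alerting, which is the point about cutting alert fatigue; a person who depends on a specific table can still be layered on top as a custom rule.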

0:37:34.2 TW: Is there a way to guard against or I guess having worked with alerts or any sort of anomaly detection and finding, “Oh, whatever crude, imperfect kind of system is doing the monitoring says, ‘This is…’” And maybe it’s just looking at… Michael likes to talk about how he used to have a dashboard that he would just look at to see if anything happened and so you see something that looks like a data issue, your observability says, “Hey, something’s gone swirly here.” and you think it’s a data problem but the digging to address it actually turns out that the business did make a decision, “Gee, our revenue for this whole class of products flat-lined and well, it turns out that that whole class of products is no longer being sold or that whole line of products has a supply issue and it’s in inventory.”

0:38:23.2 TW: So I’ve lived through that a few times where it’s like “Oh, something happened in the data.” and I spent a few hours getting to the bottom of it and what I find out is, I just didn’t get to the right person who said, “Oh yeah, we turned that off.” or, “Oh yeah, we stopped doing that.” and so I think that would come into the distribution.

0:38:42.3 BM: Or the… Sorry. My favorite, the engineer who’s like, “Oh, we were doing load testing. Is that something you need to know about?”

0:38:51.4 TW: It seems like that would hit distribution category, where, “Oh, this data went wonky.” and you start tracing it saying, “Well, we gotta go through the lineage and figure out upstream, where did they go to and we’ve traced it all the way back to the source. It’s a problem. What’s going on?” And it’s like, “Well no, it was a business change.”

0:39:10.2 BM: Yeah, for sure, it’s a very good question. So here’s the kicker with observability that we’ve noticed, that I found fascinating is, the minute you start surfacing this information to people, the way that they start thinking about this changes, because now people know that when I’m making a change upstream, you’re gonna know about that because I’m tracking kind of schema changes as well and so if there’s something that’s happening upstream, you’re very likely to be known… To know about this.

0:39:38.4 BM: You’re also likely going to be more thoughtful about the… Before making that change, you’re gonna look at lineage, to see who’s relying on this, if there’s anyone that’s relying on it because you know that you’re gonna hear from that person and so the interesting thing is that the more that you make… I don’t like sort of these jargony words but the more that you make observability self-serve in the organization and actually share this information, imagine that every person who is sort of owning a set of reports or data tables or whatever it is, owning parts of… Owning some data assets, they actually have information, access to observability information that empowers them to ask questions earlier and we actually see the rate of data downtime or data incidents often goes down just because people have awareness of who’s relying on the changes that they’re making, who might be impacted and so, just by surfacing that information, that makes a significant impact and specifically to your question, yes, I agree with you.

0:40:39.3 BM: Oftentimes, or sometimes, big changes in data might be intentional, maybe I just turned off a data source or maybe we shut down the line of products. Here’s the thing, many times people still wanna know about that because what if the marketing team was still… They still have campaigns in Facebook to publish that line of products that you just deprecated. Well, we’d like to know about that and so oftentimes, there could be a notification or, “Yep, confirmed, this is intentional, we’re glad to see that you turned off the data so it’s not taking up space or something like that.”

0:41:16.1 MH: So, okay, I’ve got a question. Let’s say someone’s listening to this podcast and they’re like, “Wow, you’re right, we need to get started on this data observability.” What are the first steps companies should be taking? So how do they go from zero to 0.1 in this journey? Aside from the obvious of maybe buying the product that you’re the CEO of but we’ll leave that one out.

0:41:39.6 BM: Yeah, no worries, I was just gonna say, they call Tim up and within five minutes, it’s solved. I think there’s a number of different things that companies do. First, or what we actually see most often is, that folks go through recent issues that they had and then set up specific monitors for that. So schema changes, for example, that’s something that lots of people actually start with as a good place, to just have awareness of that. The second thing that companies do is, often set up some pretty basic checks, just look at row counts. It’s not very effective, it’s not the most robust but that’s pretty straightforward, you can just look at some monitors like that.
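
The basic row-count check mentioned here might look like the following sketch, which compares today’s count with the average of recent loads. The 50% threshold and the function name are arbitrary assumptions for illustration.

```python
from statistics import mean


def row_count_anomaly(history, today, max_relative_change=0.5):
    """Return True when today's row count deviates too far from the recent average.

    `history` is a list of row counts from previous loads. With no history
    there is nothing to compare against, so nothing is flagged.
    """
    if not history:
        return False
    baseline = mean(history)
    if baseline == 0:
        return today != 0
    return abs(today - baseline) / baseline > max_relative_change


# Hypothetical daily row counts for one table.
recent_counts = [1000, 1020, 980]
print(row_count_anomaly(recent_counts, 300))   # a sharp drop is flagged
print(row_count_anomaly(recent_counts, 1010))  # a normal day is not
```

As noted in the conversation, this kind of check is crude: it catches a load that silently dropped most of its rows, but not data that arrived in full and is wrong.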

0:42:27.4 BM: That’s oftentimes a place where people start. What they… If they decide to invest more in that, then they might start hacking together their own solution, using some SQL wrappers, or this could also be just defining SLAs between different organizations, defining SLAs for specific data assets that are more important that they wanna hold themselves accountable for and so, actually start building reports to track that: how often does data break? And then they start holding their teams accountable. And then you see teams building more, kinda like hacking their own solutions. So we really see a wide variety, in terms of where you’re at and so in my opinion, what you’d want is to have… Going back to those five data observability pillars, you think of it as like a spot check for your car, going back to the car analogy, you wanna make sure that what you’re looking at is a complete set of checks.

0:43:25.5 MH: Nice.

0:43:25.8 MK: I was just gonna say and Michael, surely that would also depend where you are in the organization.

0:43:31.2 TW: Sure.

0:43:31.3 MK: Like one little thing that I do, for example, on my dashboards is I always have the latest date of the data at the top so that if the data warehouse didn’t run this morning and the date is not yesterday, there’s a problem and then as soon as the stakeholder sees that the date isn’t yesterday, they see it, I straight away know “Hey, that’s something little I can do to flag that something is broken.” And I mean, that’s obviously the tiniest of little things you can do but at least then you know, “Okay, our build failed this morning, that’s the first thing I’ve got to do then is go talk to the data engineering team.” Where it’s like for the data engineering team, I’m sure there’s a gazillion other checks and steps that they do that I’m not even across.
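
The "latest date at the top of the dashboard" idea amounts to a small freshness check. A minimal sketch follows, assuming the latest date in the data is already known; the function name and the one-day allowance are illustrative assumptions.

```python
from datetime import date, timedelta


def is_fresh(latest_data_date, today=None, max_lag_days=1):
    """Return True when the newest data is no older than `max_lag_days`.

    With the default of one day, data dated yesterday (or today) counts as fresh.
    """
    today = today or date.today()
    return (today - latest_data_date) <= timedelta(days=max_lag_days)


# Hypothetical run on the morning of 2 February 2021.
run_date = date(2021, 2, 2)
print(is_fresh(date(2021, 2, 1), today=run_date))   # yesterday's data: fresh
print(is_fresh(date(2021, 1, 31), today=run_date))  # two days old: the build likely failed
```

The same function could be pointed at each upstream table rather than only the final dashboard, which is where the later point about freshness breaking at many steps comes in.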

0:44:17.4 MH: But no, I mean it fits within the freshness pillar, if you leverage that paradigm a little more.

0:44:24.0 MK: Yes, yeah.

0:44:24.1 MH: I’m totally gonna be using a ton of…

[laughter]

0:44:27.3 MH: Well, I think what you’ve done really nicely, is encapsulated the problem in a very approachable framework, that a lot of times it’s sort of like, Oh yeah, data and then you don’t really have a systematic way to approach it. Data people don’t necessarily have the mindset of an engineering organization, who’s developed best practice around this and they’ve got telemetry data, they’ve got this other data. We’re sort of like “Oh yeah, Facebook changed their API again and it broke everything.” Well, now what? Yeah, so this is where I feel like that’s just really super helpful.

0:45:07.6 BM: Yeah, we sort of call it the good pipelines, bad data problem, right?

[laughter]

0:45:12.2 BM: You’ve invested in awesome infrastructure, we have the best pipelines in the world but data is oftentimes not great and Moe, to your point, actually, that was such a great example because that’s sort of the first step: looking at freshness at the last stage and at least making sure that that works, right?

0:45:30.2 MK: Yeah.

0:45:30.6 BM: But thinking about it, you probably have seven or eight different maybe, I don’t know, five to 10, 20 different steps where freshness can break and so if you truly wanted to understand whether it’s fresh, you’d have to look at all of those steps kind of upstream to make sure, which, to your point, is maybe data engineering but that’s a great first step.

0:45:47.6 MK: Baby steps for me.

[laughter]

0:45:50.5 TW: Well, is there a case… And this is another… ‘Cause from the little tiny point solutions that are in the digital analytics side, where it’s the data collection side, I feel like it’s perpetually this closing the barn doors after the horses have gotten out. Like something happens, the data breaks, it causes an issue and they’re like, “Oh, we’re gonna put a monitor in place to make sure we catch that.” and it’s like, well, because it broke and a few heads rolled, there was already… The process was shored way up. So if this had a 5% chance of breaking before and it broke, now it’s down to a 0.5% chance of breaking and now you’re monitoring it and so something else is gonna break, so it seems like starting with specific areas, but saying I’m not trying to put observability in place for that one specific data field or that one specific thing, I wanna solve for that class of things, but not to the point where I’m trying to boil the entire ocean. Is that part of the…

0:46:53.3 TW: If I’m gonna put a freshness check in, let me put freshness in multiple places that make sense, so that I’ve kinda got this freshness check in a bunch of areas, not just for one table. Is that part of the approach, or… I’m making this stuff up as we go, so I have no idea if that actually makes sense. It’s just where my mind was going.

0:47:15.5 BM: Yeah, it does make sense. I think that’s right. I think oftentimes… I mean… Well, taking a step back for a second. You are probably already paying in some shape or form for this problem, right? Whether it’s in… You have sort of data issues and it’s impacting your business, or you’re building it and addressing it in some way, right? So this is costing your organization already. The question is, how do you wanna tackle that?

0:47:40.5 BM: You can start with doing a lot of post-mortems and figuring out point solutions, or you can think about a more holistic way to your point and say, “Let’s focus on the things that, one are easy to automate or relatively easy to automate and that will give us some coverage to start out with, so that we don’t have to spend our entire time on this and we can focus on other revenue-generating things.” And you’re right, part of that is looking at some…

0:48:07.7 BM: Looking first for ways to leverage what we can do automatically in a scalable way, freshness being one of them and the second is providing more of that coverage versus very point solutions, because at the end, the freshness of your data or the robustness of your data in a report or in a model really relies on your data warehouse and your data lake as well.

0:48:31.0 MK: And I really love the point you make, Barr, about using data to inform… You could potentially be like, “We’re gonna do this for the top 10 most important tables, or builds, or whatever, or five, or whatever it is” and incrementally show that value to the business, like, “Look, we stopped all this stuff from breaking as often, we’ve had less downtime…” For me, sometimes it’s even just getting some of the stakeholders that are involved in that data pipeline to realize the impact on the stakeholders and so sometimes I say to my stakeholders, I’m like “Can you go and put it in the channel that it broke?” Because you need to explain what the consequences of it breaking are, because they’re just like “Oh well, it broke again, we’ll fix it over the next few days.” whereas me and the marketers are having hot flashes.

[laughter]

0:49:16.4 MK: And so I love that idea of really being able to choose how you roll this out and show value and then take another step and another step. I think that’s a really… A nice way of thinking about it.

0:49:28.2 BM: Yeah, it’s definitely a part of thinking about how do you empower your data consumers to use data, right? Because if there’s this thing where the group of folks who’s responsible for generating the data, transforming the data, are often not the same people who are actually using the data and so there’s this gap between that, right? And so bridging that gap and to your…

0:49:54.2 BM: Telling someone to write it in Slack so that they can link it, is one great way to bridge that gap. The other is by surfacing information that helps both parties make better decisions about that. That means that the folks that are generating the data and transforming the data need to know that the changes that they’re making are affecting real business decisions and the data consumers need to know who to go to and what data they can use.

0:50:16.7 MH: Excellent. All right, we do have to start to wrap up. This is such a great conversation. I think two reasons why. The first is the fact that we’re talking about this makes me feel like, as analytics people, we’re finally growing up into a real, important industry. We’re gonna not only use data but we’re gonna try to be adult about it too. So, that’s really cool and secondly, I just love, Barr, that you’ve put the time and work into really conceptualizing this in such a good way. It’s like it’s just very obvious. You’ve really thought deeply about this problem space. So, I appreciate that.

0:50:52.9 TW: I was so sure the second one was gonna be that it was important because she liked your analogy and…

[laughter]

0:50:57.5 TW: To a certain point.

0:51:00.8 MH: Just wait till you hear my last call to… No, I’m just kidding.

[laughter]

0:51:03.3 MH: Okay. We do… That is that little pointer. Well, one of the things we do like to do on the show is go around the horn and share a last call. Just something we think might be of interest to our audience. Barr, you’re our guest. Do you have a last call you’d like to share?

0:51:18.8 BM: Yeah, so I love what you said about how this is… I think as an industry, we’re growing up and I think that’s right. Good for us. One of the things that I love seeing is collaborations with other folks and how other folks think about data observability and so actually we are collaborating with O’Reilly Media, who’s been a strong voice in the data space, on the very first data observability training course and I think that’s a real sort of milestone for the industry in solidifying how we think about data observability and yeah, looking forward to that.

0:51:55.4 MH: Excellent. Hey Moe, what about you? Last Call?

0:52:00.5 MK: Yeah. So my team got into a discussion over Slack, big shock about time on site and whether that’s a good metric and as you imagine…

0:52:11.0 TW: I believe your sister has some opinions and has written about that.

0:52:15.2 MK: I’m aware. Anyway, we have some very strong opinions that were floating around the group and big shock, one of my colleagues, Nick, who… I mean, I don’t know how he has time to absorb all this stuff but he does. He recommended this article called “Jointly Leveraging Intent and Interaction Signals to Predict User Satisfaction with Slate Recommendations”, which as you picked up in the title, is an academic paper. It is not light reading but it’s by about five people from Spotify and the thing that they talk about, which is really interesting, is that user interactions are really conditional on the specific intent that the user has and so a metric like time on site could be a really good metric based on user intent and it just kind of… It was the first time that really got me to think about that problem in a very different way.

0:53:02.0 MK: So yeah, I’ll share that link. It’s, like I said, not light reading but really, really fascinating and you can imagine obviously for a product like Spotify, time on site would be a really good metric. So capturing user intent really is important to nailing the right metrics.

0:53:17.5 TW: Harmless the companies that are selling toilet paper but think they’re Spotify. Like that’s the…

[laughter]

0:53:26.4 TW: Maybe toilet paper is definitely…

0:53:28.0 MK: So true.

0:53:28.4 TW: I’m sorry. Wait a minute. I’m now just thinking if we can go into it a little bit further, I just picked just a random consumer packaged good.

0:53:34.7 MH: Though given 2020 e-commerce toilet paper is not necessarily the worst idea I’ve ever heard.

[laughter]

0:53:40.7 MH: OK Tim, what about you?

0:53:45.0 TW: I’m gonna do a quick twofer and one of them is light reading and this has been out for a while, the OpenAI project, the DALL-E project. You guys familiar with that? It’s a GPT-3-based thing of generating images from text. So it’s D-A-L-L-dash-E. It’s a combination of Salvador Dalí and WALL-E, the movie but it’s this interactive thing. We’re basically taking… It’s almost like doing Mad Libs phrases. Sort of, except they’ve got pre-populated options and then it generates images. Like “an illustration of a baby daikon radish in a tutu walking a dog” and then it generates a bunch of images from it. It’s interactive and it’s a little addictive. It’s a little… I don’t think it’s scary. I think it’s delightful and awesome.

0:54:34.7 MH: Oh we’ll find a way to make it scary.

0:54:38.4 BM: Anything related to WALL-E can be scary.

0:54:41.4 TW: Yeah. That’s well…

0:54:43.7 MH: The message of that movie is pretty dark, honestly.

0:54:45.3 BM: Fair blame.

[laughter]

0:54:48.1 TW: And then my other one is, for several years now, I have wanted there to be an R package for the Version 2.0 API for Adobe Analytics and Ben Woodard has actually successfully gotten that out. It’s not on CRAN yet but you can install it from GitHub. He’s kicked the tires on it pretty hard. He’s got a site built, generated with pkgdown. So basic functionality for anybody who’s been using RSiteCatalyst and said, “I want that for the 2.0 API”: adobeanalyticsr.com. It’s been a long time coming but he did a great job on getting that initial beta out the door. So, that’s exciting for the R users out there.

0:55:35.9 MH: That’s outstanding. Now, how has he built in the five pillars of observability… No, I’m just kidding.

[laughter]

0:55:42.4 TW: I think on the email, Dr. Joe and others did a little bit of a code review and there were actually some things on logging that Joe said, where he was like, “Yeah, so putting a systemic kind of approach to logging as opposed to the initial version.” So, I do believe there’s actual… It’s certainly using something like R and having it scheduled to do some of the, I think, observability. I’ve done that in the past with Adobe data, where I’ve just built my own and run it daily to do basic… It would fall under, I guess, distribution and freshness type checks, but pretty crude. But so, ha! I can take your wit and turn it into a real thing.

0:56:23.9 MH: That is why you, Tim are the quintessential…

0:56:26.0 TW: So what’s your last call?

0:56:29.5 MH: Alright, well I also sort of have a twofer, so really quickly, the Digital Analytics Hub is a conference that comes up from time to time. It got postponed last year but I think Tim, you’re leading a huddle in March, about a month from now?

0:56:46.4 TW: Well yes, I am. I believe you’re leading a huddle as well.

0:56:50.2 MH: I am too, yeah so anyway so if you are someone who is excited about digital analytics and those kinds of things, that’s always been a really great conference. It’s gonna be virtual this year but they’ve also made the price to match so that’s a really good one to check out and then secondly, there’s a survey that I like to check out every so often, it’s a survey by NewVantage Partners around the state of corporate data initiatives or becoming more data-driven, you know, AI and those kinds of things. It’s the Big Data and AI Executive Survey. That’s the title I was looking for. Okay, anyways, they just came out with their 2021 version at the beginning of January. It’s always a fascinating read. I think the section that I always jump right to is when they ask executives why they haven’t made more progress in these initiatives and basically it’s 92% culture, process and talent, 8% technology and it just shows you where the imbalance is in that kind of thing but it is actually rich with lots of great information so it’s always a good read, whenever you’re looking for surveys to shoehorn an opinion in on a group of executives who are swayed by such things, look no further than the NewVantage Partners Big Data and AI Executive Survey. Alright, that’s the last calls.

0:58:09.7 MH: It is, this has been a great episode, I’m really excited about it and you know what’s interesting? Our episodes just get even better through the efforts of our producer and that is, Josh Crowhurst and so Josh we’re thankful for everything you do. Big shout out to you and we would love to hear from you, the listener and there’s a couple of ways that we would love to do that. So first, obviously, any time you’re on the Measure Slack, most of us are always there too and so we love to hear from you that way. Also, we have a LinkedIn group and on Twitter and Barr, are you on a Twitter or you have a Twitter presence?

0:58:50.4 BM: I am on Twitter, BM underscore Data Downtime.

0:58:54.7 MH: Alright, so you can also reach out to Barr or her team via Twitter as well, so probably a good follow, ’cause I think this will probably be something that’s talked about more and more in our industry. ‘Cause we’re grown up now, it’s 2021. We’re kind of kind of mature.

0:59:12.2 BM: Yeah. That’s how we roll.

0:59:15.2 MH: Yeah, I feel like I did something. I didn’t do anything.

0:59:18.9 BM: That’s great.

0:59:19.8 MH: Anyways, we’d love to hear from you and you know, not for nothing. Come give us a rating on iTunes. It couldn’t kill anything. Unless it’s less than a five and then we don’t wanna hear. No, I’m just kidding. We do, we wanna hear from you and just do that. All right, I know that you’re busy, that you’re doing analysis and you’re putting out reports and you’re delivering all these things to try to drive value for your business and things like data breaking and things like that do make your life harder and I thought one way to encapsulate that is, we are already paying for this data downtime, we just might not know how much we’re paying but as an analyst, you’re out there doing the work, so no matter what the quality of your data today, I know I speak for my two co-hosts, Tim Wilson and Moe Kiss, when I say, “Keep analyzing.”

1:00:12.8 Announcer: Thanks for listening and don’t forget to join the conversation on Twitter or in the Measure Slack. We welcome your comments and questions. Visit us on the web at analyticshour.io or on Twitter @AnalyticsHour.

1:00:27.2 Charles Barkley: So smart guys want to fit in, so they made up a term called, analytic. Analytics don’t work.

1:00:32.8 Thom Hammerschmidt: Analytics. Oh my God! What the fuck does that even mean?

1:00:40.8 TW: Red flag and five Dollars! Sorry, they won’t let me warn people that we’re gonna do that and that that’s gonna happen.

1:00:52.3 MH: You have literally said, “I do not wanna warn people that I’m gonna do that”. Like you literally said.

1:00:58.5 TW: I didn’t say no. I have said no. I have always said, and you said, “no, no, no”. I think it’s part of the charm.

Photo by Alex Franzelin on Unsplash
