#040: Google BigQuery with Michael Healy

Published: Jul 5, 2016

Subscribe: Apple Podcasts | Google Podcasts | RSS

In this episode, we dive deep on a 1988 classic: Tom Hanks, under the direction of Penny Marshall, was a 12-year-old in a 30-year-old’s body… Actually, that’s a different “Big” from what we actually cover in this episode. In this instant classic, the star is BigQuery, the director is Google, and Michael Healy, a data scientist from Search Discovery, delivers an Oscar-worthy performance as Zoltar. In under 48 minutes, Michael (Helbling) and Tim drastically increased their understanding of what Google BigQuery is and where it fits in the analytics landscape. If you’d like to do the same, give it a listen!

Technologies, books, and sites referenced in this episode were many, including:

Google BigQuery and the BigQuery API Libraries
Google Cloud Services
Google Dremel
Apache Drill
Amazon Redshift (AWS)
Rambo III (another 1988 movie!)
Hadoop
Cloudera
ObservePoint Tag Debugger
Our Mathematical Universe by Max Tegmark
A Brief History of Time by Stephen Hawking
A video of math savant Scott Flansburg

Episode Transcript

The following is a straight-up machine translation. It has not been human-reviewed or human-corrected. However, we did replace the original transcription, produced in 2017, with an updated one produced using OpenAI’s WhisperX in 2025, which, trust us, is much, much better than the original. Still, we apologize on behalf of the machines for any text that winds up being incorrect, nonsensical, or offensive. We have asked the machine to do better, but it simply responds with, “I’m sorry, Dave. I’m afraid I can’t do that.”

00:00:04.00 [Announcer]: Welcome to the Digital Analytics Power Hour. Tim, Michael, and the occasional guest discussing digital analytics issues of the day. Find them on Facebook at facebook.com forward slash analytics hour. And now, the Digital Analytics Power Hour.

00:00:24.25 [Michael Helbling]: Hi everyone, welcome to the Digital Analytics Power Hour. This is Episode 40. Big 40. Big 40. Tim Wilson, my co-host, did you ever think we’d make it to 40?

00:00:36.82 [Tim Wilson]: Big Tim, here to talk about BigQuery on the big 40th episode.

00:00:41.68 [Michael Helbling]: Yep, here we are. It’s episode 40, and we’ve almost never talked about big data. Specifically, Google’s BigQuery. And it’s been around for a long time. It’s very popular. So we thought we would talk about it a little bit. That and maybe other cloud platforms, cloud data platforms, if we feel like it. So we needed someone who knows a little something about it. Luckily, we know a guy. Michael Healy is a data scientist at Search Discovery. Prior to SDI, he was at Corsmart, where he built analytics tools from the ground up. And he is our friend and colleague. Welcome to the show, Michael.

00:01:20.94 [Michael Healy]: Well, hello to both of you.

00:01:22.41 [Michael Helbling]: So here we are to talk about BigQuery. First off, why is it named BigQuery?

00:01:28.62 [Michael Healy]: That’s an excellent question. Why is it named BigQuery is because you’re allowed to run very large queries against it. BigQuery is what they call analytics as a service internally at Google. And they describe the acronym as unfortunately ASS. But that’s what it is. And so their goal is to provide perfectly scalable and managed service, which allows you to scale your early analytics services to whatever you need to do.

00:01:51.59 [Tim Wilson]: Somewhere at some point there was a possibility that it would be big ass.

00:01:55.31 [Michael Healy]: I can’t speak to that because I never worked at Google, but I’m sure, my guess is, it came up. You’d have to search, like, random config files for that. So yeah, that’s what BigQuery is: a way to store and query data at basically any scale, completely platform managed. So you have very low overhead in terms of managing the data.

00:02:14.98 [Tim Wilson]: But it is a big data store for large volumes of data, even though in the name it’s kind of referring to the query. And that’s because the nature of the way the store is structured, it allows it to be queried.

00:02:26.54 [Michael Healy]: It may be a product differentiation. When we talk about Google Cloud Services internally at Search Discovery, it’s more about the Google Cloud Platform, which has in its platform a number of things: something called Google Compute Engine, which is a way to do your own servers, sort of like Amazon Cloud Services. They have Google File Storage or Google Simple Storage, I forget what it’s called. It’s a way to do dumb storage. And they have Google BigQuery, which allows you to query data at scale. So you’re right. It actually is both storage and query. Exactly why it’s called BigQuery is that BigQuery is built on top of a tool which Google calls Dremel, which, if you know what a Dremel is, it’s that fancy little grinder which seemingly everybody is buying or has bought to do sort of cool maker projects. So they have a tool called Dremel. From that, they’ve built BigQuery, which allows them to effortlessly query data. But in fact, it is actually able to store and query data both at the same time.

00:03:21.04 [Tim Wilson]: So I’ve got a picture of a Google engineer who’s got a bubble butt with a Dremel working away on silicone to make a data. So I think I’ve got this nailed. This is going to be a very short episode. I’ve got this all figured out.

00:03:33.16 [Michael Healy]: I’m not sure you’d want to use a Dremel on silicone. You might want to use a hot knife or something.

00:03:39.64 [Tim Wilson]: So wait, is Dremel a Google platform or is that a?

00:03:43.43 [Michael Healy]: Dremel is an internal Google technology, which I don’t believe has been exposed to the public in its entirety yet. Like pieces of it have been discussed in their Google academic papers. I don’t believe the whole thing is exposed as of yet or parts of it are. And so if you’ve ever used Apache Drill, which is a way to sort of query anything and query at very large scale, it’s somewhat similar to that. Apache Drill is built on the same technology. Sort of that’s collapsed inside of Dremel plus the storage. It becomes BigQuery.

00:04:11.80 [Tim Wilson]: So is there any, and this is again, this is all speculations. I don’t think any of us have inside knowledge, but you’re more equipped to speculate than I am. So when it comes to querying a massive set of data and returning something potentially very small based on what you’re querying. So is there a link to the actual Google search engine technology? Like if those merged at all or no.

00:04:34.08 [Michael Healy]: No, so the Google search engine technology is built for a specific purpose, which is to provide language search. This is built to basically query, to set up the data in a very conveniently scalable fashion. And so a little bit of behind the scenes for database nerds: it’s all stored as column-based data, which means that it’s easy to do scans and it makes table scans very cheap. Something I more recently found out is that each column in a Google BigQuery table structure is actually an individual file, and they do that for compression so that it’s easy to scan the whole thing. And so the data are stored in very specific manners, which makes it very easy to query. And I’m not sure if I answered your question, because my train of thought left me at the station.
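
To make the column-based point concrete, here is a toy sketch in Python. It is not how BigQuery or Dremel is implemented; the table, the field names, and the idea of counting "values touched" as a stand-in for disk reads are all assumptions made for illustration. It only shows why an aggregate over one column can be cheaper when each column is stored separately.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The table and its field names are hypothetical.

rows = [
    {"visit_id": 1, "browser": "Chrome", "revenue": 10.0},
    {"visit_id": 2, "browser": "Safari", "revenue": 0.0},
    {"visit_id": 3, "browser": "Chrome", "revenue": 25.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: a reader has to pass over every field of every row to reach
# "revenue". We count all stored values as a rough proxy for that work.
row_values_touched = sum(len(row) for row in rows)
row_total = sum(row["revenue"] for row in rows)

# Column layout: the same sum only needs the "revenue" column.
col_values_touched = len(columns["revenue"])
col_total = sum(columns["revenue"])

print(row_total, row_values_touched)
print(col_total, col_values_touched)
```

Both layouts give the same total; the difference is how much data has to be read to get it, and that gap widens as the table gets more columns.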

00:05:23.08 [Tim Wilson]: You did. And I mean, I remember back in, it had to be ’99 or 2000. We were reporting from a bunch of Lotus Notes databases hacked together with some static files, hacked together with some Oracle stuff. And we were trying to get, like, a better content management platform for our website. And I remember the one architect was lobbying for us to architect this whole thing in a column-based database. So the idea that column-based storage can be faster, more efficient, it still kind of blows my mind a little bit. I’d just about, kind of, barely grasped joining a couple of tables.

00:06:05.95 [Michael Healy]: Another product, which is not specifically the subject here, is the Amazon Web Services Redshift database. So Amazon Redshift is likewise a column-based database. Now, they do some other things over there. That’s more of a database as a service, or, you know, there’s more legwork in getting that up and running. What’s really nice about BigQuery is that it’s completely a managed platform. So, like, you don’t need a database administrator necessarily to say, hey, what’s my server uptime? Like, that’s taken off your plate 100%. So you don’t have to sit around going, oh, I need to reboot the server, or, I need to rebalance the tables. Google manages all that for you. So you’re basically coming down to a few discrete tasks. One is, you do want to have some sense about how your tables are set up, so some sort of design. Two is, how are you going to get your data in there? And three is, how are you going to query your data? Basically three jobs, which is not impossible. So the sort of technical overhead is completely taken care of for you on BigQuery, which is another nice feature of their platform. Now, what you can see if you go onto YouTube, or if you try it yourself, is that you can query terabytes of data in a second. So this happens not infrequently. Customers come to us and they’re like, we have really big data. You don’t understand. And then we talk to them and they’re like, yeah, we have a terabyte of data. And you’re like, that’s really not... that goes on one hard drive. It’s cool. Don’t worry about it. That’s not a problem. We can handle it.

00:07:34.81 [Tim Wilson]: It’s a lot if you try to print it.

00:07:37.67 [Michael Healy]: Yeah. If you want to print that out, that’s a lot of paper, but like it crashes Excel every time I try to load it. Yes, if Excel is the upper reaches of your technology, yeah, a terabyte’s big data. If you want to think in current terms, a terabyte is not very big at all. So I think Google’s advice is something like, if you have over 10 petabytes, they want you to reach out to them.

00:07:59.70 [Tim Wilson]: So you said three things, and I missed one of them. One was how you get the data out. One is how you get the data in. And I guess the third is how you store it, kind of how you structure it.

00:08:07.45 [Michael Healy]: Just kind of like, you want to have some concept of if you’re, and we can speak specifically to the Google Analytics data and then second, aren’t, aren’t two of those taking, like if you’re taking GA premium data, do they not, it’s kind of flipping a switch and. Yeah, so let’s kind of dovetail into that. So the first is if you don’t have a structured data, if it’s not coming out, like if you’re getting data from an API or already from another source, my initial recommendation would be to just shove it into BigQuery in the way that you get it. The less you handle the data, then the less overhead you have on your end.

00:08:42.19 [Tim Wilson]: So it’s an EL rather than ETL.

00:08:44.77 [Michael Healy]: Forget the T, just do extract and load. That’s typically what we do. Now, saying that, you do want to think about it. For us, when we’re doing data that’s not structured like Google Analytics data, so we work with AdWords data or other marketing channel data, sort of search engine marketing channel data, for those we structured the API exports correctly so that they would be easy to query later on. So that’s what I’m saying. You want to kind of think about this when you’re extracting the data. If you have the option to have input there, that’s a great time to step in and say, here’s what we want to do in the future. So let’s structure it so that we’re just putting it into BigQuery in the right format.
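
A minimal sketch of the "extract and load, skip the transform" idea, in Python. The conversation does not say which file format Search Discovery used; newline-delimited JSON is one of the formats BigQuery accepts for load jobs, so it is used here as an assumption. The `api_records` data and the `to_ndjson` helper are hypothetical.

```python
import json


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


# Hypothetical response from a marketing API, kept exactly as received:
# no renaming, no aggregation, no filtering before loading.
api_records = [
    {"campaign": "brand", "clicks": 120, "date": "2016-07-01"},
    {"campaign": "generic", "clicks": 45, "date": "2016-07-01"},
]

payload = to_ndjson(api_records)
print(payload, end="")
```

The resulting text could be written to a file and handed to a load job; any reshaping is then left to the queries that read it back out.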

00:09:24.63 [Tim Wilson]: And conceptually, are you sort of potentially saying, I’ve got an API, one call to the API builds or adds to one table, another call to the API adds to another table. And I know that because I’ve got the common key in those.

00:09:37.83 [Michael Healy]: Exactly. Simple. Yeah. So joins are very simple. Any sort of table queries are very simple. If you’ve ever worked with another database system, what we call a relational database system or RDBMS, like MySQL or PostgreSQL or Microsoft SQL Server, with one of those, you know, there’s a lot of overhead in making sure your tables are normalized, which means, in a very specific data term, an entry can only exist once in a table. And so you have to spend all this work normalizing these tables so that the query engine can run efficiently. And you also have to prune your data sets and archive your data sets. One definition of big data, and that’s from the BigQuery team, is when it’s cheaper to store the data than it is to try and figure out what data you need to get rid of. So with BigQuery, definitely the mindset is just dump it all in there, and you can prune whatever you need to on the query string back out.

00:10:31.07 [Tim Wilson]: Okay, so talk about how with, well, okay, so you’ve got GA-free, which you could still use BigQuery, but in that case, you’d just be making API calls and pushing stuff from a kind of traditional, that’d be kind of weird.

00:10:43.29 [Michael Healy]: It would be less than ideal.

00:10:45.31 [Tim Wilson]: You’re better off, they’re already storing it, so just make your queries.

00:10:49.88 [Michael Healy]: Yeah, just get GA Premium. I mean, so when you make queries from the API, essentially, you’re not getting what we call event level. And this is not Google Analytics event level. So in database terminology, the event level is essentially the irreducible record of what happened. So we can’t simplify this any further. So it would be each server call, essentially, in a nutshell. So if you want to really do data analysis, more advanced data analysis, you’d want to have essentially those server calls at some level. And you can’t get that out of the GA API, very intentionally, because even for a small company, that would be pretty significant, right? For each visit, you’d have an average of 10 events. If you have 5 million, how am I going to get that out of the API and serve that up in a scalable fashion? That’s not its intention. Its intention is to provide summary numbers. For Google Analytics Premium, yes, if you sign up for BigQuery, they will actually automatically pump the event-level data into an instance for you.
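
A small Python sketch of the difference between event-level rows and a summary number. The records and field names are hypothetical and are not the actual Google Analytics export schema. The last line reads the "5 million" in the conversation as visits, which is an assumption; at the quoted average of 10 events per visit that would be 50 million rows.

```python
from collections import defaultdict

# Hypothetical event-level records: one row per server call (hit).
events = [
    {"date": "2016-07-01", "visit_id": "a", "hit": "pageview"},
    {"date": "2016-07-01", "visit_id": "a", "hit": "click"},
    {"date": "2016-07-01", "visit_id": "b", "hit": "pageview"},
    {"date": "2016-07-02", "visit_id": "c", "hit": "pageview"},
]


def visits_by_day(events):
    """Roll event-level rows up to a count of distinct visits per day."""
    visits = defaultdict(set)
    for event in events:
        visits[event["date"]].add(event["visit_id"])
    return {day: len(ids) for day, ids in sorted(visits.items())}


summary = visits_by_day(events)
print(summary)

# Rough size of an event-level table, assuming 5 million visits
# at an average of 10 events each.
estimated_rows = 5_000_000 * 10
print(estimated_rows)
```

The rollup only goes one way: a summary such as visits per day can be rebuilt from the events, but the events cannot be recovered from the summary.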

00:11:59.98 [Tim Wilson]: And then what’s involved? So say you’ve got that set up, and then somebody says, I want to see visits by day, something that’s definitely a rolled-up, aggregated thing. Is that one where you say, yeah, for that stuff, still go back and use either the standard API or whatever? Or would you actually say, no, BigQuery, I can set up a bunch of those common things? Because you’re basically re-engineering the processing logic engine that GA is doing, right?

00:12:27.67 [Michael Healy]: To a degree. So let’s take a step back. How do you get it enabled, first and foremost? So hopefully you’re working with a partner or Google, and you have to get with your contact at Google, tell them you want to turn it on, and then they can enable it for you. The pricing structure, let me just mention that. For Google Analytics Premium, the way it works is that you get a credit for Google compute services per month. And you have to pay for actually using BigQuery. As a Google Analytics Premium customer, you have to pay for it. But you get a credit. And in every case that I’ve seen, the credit covers what it costs to use BigQuery. So you get this credit to use Google Analytics BigQuery. They charge you against that credit. And then at the end of the month, you settle your records. The one component you do need is to set up billing, just in case you do go over. I personally have not seen that yet, because it’s just so darn cheap. And every time there’s a price cut by Amazon or by Google, they kind of go in a pricing war every once in a while, and it just gets cheaper and cheaper.

00:13:31.21 [Tim Wilson]: So, well, and just so we can kind of rattle that off, because I remember my mind being a little blown by that as well. You get, like, what, a $500 a month credit, and we’re talking $5 per terabyte in a query, two cents per gigabyte per month for storage. So you look at that, I mean, yeah, it’s silly small.

00:13:55.83 [Michael Healy]: It’s so ridiculously small. Now, is it possible? I 100% believe that if you had massive data in there that you would be paying more than this, or if you weren’t a Google Analytics premium customer, you would have to pay a nominal fee. However, in every single case that we’ve implemented it, I would argue that the benefit to the company from having access to the entire data stream is greatly, it greatly outweighs the small cost. So even if you have to pay, you are achieving so much more value from your data that it’s well worth it.

00:14:27.20 [Tim Wilson]: So do you have any cases that are not Google Analytics Premium?

00:14:30.92 [Michael Healy]: Yeah, yes.

00:14:32.10 [Tim Wilson]: They’re W, they’re GA, and they are using BigQuery. Yes. Is that web analytics data going in, or is that because there’s other large? It’s web analytics data.

00:14:40.08 [Michael Healy]: Well, some of it’s search engine data, some of it’s other web analytics data. And I don’t want to get into specifics, but it’s so marginal.

00:14:47.36 [Tim Wilson]: It’s silly.

00:14:48.57 [Michael Healy]: Yeah. Now, if you utilize it a lot, like if everybody’s doing queries, then it’s possible your costs rise. Now, Google also helps you by caching query results, and they’re not out to, like, ding you on pennies. When you query, it’s $5 per terabyte, and the first terabyte per month is free. So, like, if you had a terabyte, then you’re free. I mean, I had one that I ran for a while, and I think it was like $10 a month or something, or five. I mean, it’s so inconsequential for most companies that it doesn’t really matter.
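
The numbers quoted in the conversation can be turned into a back-of-the-envelope estimate. The Python sketch below uses only the figures the speakers recall from 2016 ($5 per terabyte queried, the first terabyte each month free, two cents per gigabyte per month for storage, and a credit of roughly $500 a month for Premium customers). These are not current prices, the `monthly_cost` function is hypothetical, and real billing has more components than this.

```python
# Rates as recalled in the conversation (2016); treat them as assumptions.
QUERY_PRICE_PER_TB = 5.00       # dollars per terabyte scanned by queries
FREE_QUERY_TB_PER_MONTH = 1.0   # first terabyte queried each month is free
STORAGE_PRICE_PER_GB = 0.02     # dollars per gigabyte stored per month


def monthly_cost(tb_queried, gb_stored, credit=0.0):
    """Estimate one month's bill in dollars after applying any credit."""
    billable_tb = max(tb_queried - FREE_QUERY_TB_PER_MONTH, 0.0)
    query_cost = billable_tb * QUERY_PRICE_PER_TB
    storage_cost = gb_stored * STORAGE_PRICE_PER_GB
    return max(query_cost + storage_cost - credit, 0.0)


# 3 TB scanned and 500 GB stored, with and without the recalled credit.
print(monthly_cost(3, 500))
print(monthly_cost(3, 500, credit=500.0))
```

With these inputs the uncredited estimate comes to about twenty dollars, which is in line with the "ten dollars a month or something" scale mentioned above.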

00:15:21.59 [Tim Wilson]: That was my sense. So what about in helping you can jump in anytime because I wanted to cut to how you get data out too.

00:15:27.62 [Michael Helbling]: You keep it going. This is one of the first conversations we’ve had on this podcast where I feel this urge to be taking notes.

00:15:39.10 [Michael Healy]: I want to point out, Tim, that I gave a very detailed presentation about BigQuery internally probably a year ago. And I won’t say who missed it, but they’re probably taking notes right now.

00:15:49.30 [Tim Wilson]: They might be taking notes right now. Hey, we’re all a team. So what we haven’t talked about, which is another huge curiosity area, is how you get data out. Can you talk through that third piece that you referenced earlier?

00:16:05.38 [Michael Healy]: So the four ways you can really get data out are the web UI, command line, REST API, and then ODBC, which stands for Open Database Connectivity.

00:16:14.37 [Tim Wilson]: So you can literally take Microsoft Excel and hook into BigQuery with an ODBC?

00:16:19.51 [Michael Healy]: There is, to a degree. So let’s be careful, though.

00:16:22.11 [Tim Wilson]: Easy there, killer, huh?

00:16:23.91 [Michael Healy]: Well, yeah. So this literally happened to me before. I’ve literally seen this, where somebody made a query against a very large-scale database. And they’re like, “I’m having a problem with my computer.” And I’m like, “Wow, what’s the problem?” “It just won’t start.” I’m like, “Well, oh, wow. Hey, look, your disk is full. What did you query from the database? What did you think was going to happen if you did a huge query on a big, huge database?” Of course it’s going to kill your computer. That’s just common sense. So could you connect with Excel? Yeah. Is it, like, a really good idea? I’m not sure.

00:16:55.83 [Tim Wilson]: Well, but in that case, when you’re making a query, I’m assuming, regardless of which one of these mechanisms, or let’s say that it’s one that I’m initiating locally, either through the API or ODBC, it’s actually, if that query is gonna, it may be on two petabytes of data, but it’s gonna return 10 rows.

00:17:14.75 [Michael Healy]: So part of the ODBC standard includes things like row numbers, like you would actually get a concept of how much data you’re getting back. But so you mentioned the web UI. What’s great about the Google ecosystem is they handle your login fantastically. So if you’re already using Google Analytics or some other thing and you’ve connected your identity that you use for normal Google purposes with this BigQuery project, then you can just say, take me to my console, and it takes you right to the web UI. And that’s where I would say everybody should start, and you should basically live there, for a couple of reasons. One, it’s a great way to build or debug queries. Two, they do some limiting of the data returned, so you’re not going to run into these data scalability issues. So if you happen to get back 10 million rows, you don’t have to worry about it killing your computer. It’s all based in the browser. So it’s a much better way. If you’re doing data exploration or building or debugging queries, I totally use the web UI.

00:18:16.38 [Tim Wilson]: And then you can take, it’s kind of like using the Adobe or Google Query Explorer, you can then take what you’ve got and say, okay, now I want to run it through Python, through the REST API, is that?

00:18:25.83 [Michael Healy]: Exactly. And you can basically copy and paste that, or if you’re actually building the query in a tool, you can build the query. It also gives you some diagnostics, like, hey, this query took this long. So if you want to be really super concerned about pricing, like, how long are these queries running, is there anything we can do to make these faster, then it gives you some diagnostics inside the tool. A feature of BigQuery is that it actually allows nested data sets. So you mentioned the Google Analytics API, or the Google Analytics Premium data set, rather. It comes across as event-level data, which means each row is one event. Now, embedded in that event, there are actually JSON objects. And what that means is that, unlike a MySQL row, where each row has columns and each column holds a single value, in BigQuery a column like browser type may actually contain browser type dot build date or browser type dot platform. Some of these are nested data structures inside BigQuery. And so that’s something that, if you’re trying to get used to BigQuery, is actually a little bit hard to wrap your head around. How queries work with these nested data structures is not always intuitive, especially if you’re coming from a relational database system. And so the web UI really gives you a concept of, here’s how the data are structured, here’s where the nested data elements are, and how you want to start to unpack those. And so I would definitely go with the web UI for 99% of your building, debugging, or data exploration.
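
For readers coming from flat relational tables, a short Python sketch of what "browser type dot platform" means. The `event` record and its field names are hypothetical and are not the real Google Analytics export schema; the `flatten` helper is only an illustration of how a nested record maps onto dotted column paths. It leaves lists (repeated fields) untouched, which is the part that needs explicit unpacking in a query.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the nested record, carrying the path so far.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical event row with one nested record.
event = {
    "visit_id": "a",
    "browser_type": {"platform": "desktop", "build_date": "2016-05-01"},
}

flat = flatten(event)
for path, value in flat.items():
    print(path, "=", value)
```

In a strictly flat table each of those dotted paths would have to be its own top-level column; in a nested schema they live together under one parent field.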

00:19:53.65 [Tim Wilson]: So to that point, like if you in the web UI, will you actually get results and you have a little twisty or something next to it where you can.

00:19:59.20 [Michael Healy]: Yeah, I mean, in some ways it’ll expand, or it’ll give you the feedback like, hey, dummy, you forgot to explode this value and you can’t combine that. So you can actually expose what’s going on for nested values. They do have a command line. I always feel like command line is for maybe batch operations. I don’t really use it at all. That’s just the way it is. Maybe if you’re doing backups or something, like something simple that you wanted to shell script. That’s fine. We use the REST API. Our workflow is to build it and debug it in the web UI and then pull that out, put that into a REST API call inside a deployed application, whichever language we’re using at the time.

00:20:40.46 [Tim Wilson]: And so just as one of these days, I’ll listen back on this and say, Tim, you silly, like I’ve been hearing REST API for years. And am I right in thinking that means that that can be Python, that can be R, that can be how I can use JavaScript.

00:20:55.35 [Michael Healy]: Believe we use JavaScript today. So there are supported libraries.

00:20:59.24 [Tim Wilson]: I actually think I saw that that’s all linked to where they listed.

00:21:02.04 [Michael Healy]: There is a BigQuery connector for Excel. So, hey, good, good news for you.

00:21:08.20 [Tim Wilson]: Uh, but no, no, I’m going to be, I’m going to be doing it with R at this point.

00:21:11.40 [Michael Healy]: I mean, so it’s possible, but I would say, yeah, REST API is, you know, in different languages, they’ve tried to build basically a Google services package, which connects all these different things. So in Python, there’s a Google service package, and you basically authenticate. And what’s cool is that authentication, if you set it up correctly, can then be used to query Google Analytics. It can query BigQuery, whatever the case may be, as long as you understand how the REST API works.

00:21:38.91 [Tim Wilson]: And the REST API largely, I mean, there’s syntax, but in a large sense you’re putting in this kind of SQL-like query. Like you said, you’re not directly copying and pasting, but you’re close to copying and pasting from.

00:21:53.96 [Michael Healy]: Oh no, you’re basically copying pasting the query.

00:21:57.26 [Tim Wilson]: It’s just kind of embedded, embedded within some little piece of the API.

00:22:01.22 [Michael Healy]: Yeah, okay. Yeah, it’s embedded. I mean, other tools create what’s called a SQL engine. Other tools like Tableau, I think, actually have some kind of SQL engine, like they try to build it for you based on the table. But if you’re using the REST API, then that’s kind of up to you to say, I want to build this query and have this component. Obviously, you can parameterize things like dates or ranges or whatever. So to meet your needs, it can be parameterized within the confines of the language you’re working in, but that’s how we do it.
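
A hedged Python sketch of what "copy the query, then parameterize the dates" can look like. The speakers do not describe their actual code. The request-body field names below (`query`, `useLegacySql`, `parameterMode`, `queryParameters`) follow the BigQuery REST `jobs.query` format as best understood here and should be checked against the current API reference; the table name, column names, and the `build_query_request` helper are hypothetical. The sketch only builds the JSON body and does not send anything.

```python
import json
from datetime import date


def build_query_request(start, end):
    """Build a JSON-serializable body for a query over a date range."""
    sql = (
        "SELECT event_date, COUNT(*) AS hits "
        "FROM `my_project.my_dataset.events` "  # hypothetical table
        "WHERE event_date BETWEEN @start_date AND @end_date "
        "GROUP BY event_date ORDER BY event_date"
    )

    def date_param(name, value):
        # Named parameters keep the SQL text fixed while the dates vary.
        return {
            "name": name,
            "parameterType": {"type": "DATE"},
            "parameterValue": {"value": value.isoformat()},
        }

    return {
        "query": sql,
        "useLegacySql": False,  # named parameters assume standard SQL
        "parameterMode": "NAMED",
        "queryParameters": [
            date_param("start_date", start),
            date_param("end_date", end),
        ],
    }


body = build_query_request(date(2016, 7, 1), date(2016, 7, 5))
print(json.dumps(body, indent=2))
```

Passing the dates as parameters rather than pasting them into the SQL string keeps the debugged query text identical to what was tested in the web UI.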

00:22:32.15 [Michael Helbling]: Yeah.

00:22:32.37 [Tim Wilson]: Okay.

00:22:33.05 [Michael Helbling]: So I want to turn our discussion a little bit and talk a little bit about why companies would start to use this, right? So what does it look like for an organization to be like, you know what, we’ve got our analytics tool set up, we’re using it to analyze. What’s making us take this next step and think about using BigQuery? What are some of the things we want to do or should be thinking of as a business to do?

00:22:58.02 [Tim Wilson]: Well, yeah, what do they have today? What is their environment today?

00:23:02.20 [Michael Helbling]: Yeah.

00:23:03.18 [Tim Wilson]: And then both what they’re trying to do and then is it a migration or something new? Yeah.

00:23:07.79 [Michael Healy]: So typically, it’s a couple of different things that companies are looking for when they get into BigQuery. So they may already be a Google Analytics Premium customer, and they may want to enrich that data set at scale. So if you’re already a Google Analytics Premium customer, and your data are being uploaded, and then you want to upload a bunch of customer service data. You know what? It’s fast, cheap, and scalable to put it inside BigQuery. So we’re just going to go all in on BigQuery. So that’s definitely one component of it. The other is, if you’ve ever been involved in a data warehouse project, data warehouse projects tend to be very long, complex, and not inexpensive at all. As opposed to BigQuery, which is basically a very nominal fee.

00:23:48.26 [Tim Wilson]: You spend six months debating Kimball versus Inmon.

00:23:52.19 [Michael Healy]: Right. And not saying that’s wrong, but that’s the old archetype, which is to say, you know, we’re going to set up everything at scale, very specific, and that’s the way it’s done. Whereas with BigQuery, I’ve seen a lot of companies, particularly marketing organizations, but a lot of other, you know, more agile organizations, they want to be able to get a new data source, whether it’s seasonal or whether it’s something they’re going to try, and they want to look at the entire scan of the data. So maybe they’re doing some sort of seasonal marketing message with a new vendor which they want to try. You know, is that worth spinning up a data warehouse project internally? Probably not. To be honest, it’s going to take longer to do the project than it is to do this test. If you have your data stored in BigQuery, then you can just upload that data and analyze it as needed. And bada bing, bada boom, you’re done with the day. It’s much cheaper, faster, and more flexible than an existing data warehouse project. So that’s another component where people are seeing the value of, hey, it’s going to take us millions of dollars to build an in-house data warehouse. It makes more sense for us to just put it up in the cloud, put it in BigQuery, and we don’t have to worry about it. We have the same access to the data.

00:25:00.96 [Tim Wilson]: So is it ever the marketing organization saying, yes, we have a big Netezza whole infrastructure thing. We’re never going to be prioritized around the operations group or whatever. And they just decide, they’re like, fuck it, we’re going to go and stand this BigQuery thing up and start throwing stuff in it that we care about. Basically.

00:25:19.77 [Michael Healy]: And I mean, it might be, it’s probably a little more organized. Like they would hopefully involve the IT team, because they may have questions about what’s going on. It’s not a whole, like, you know, Rambo jump out of the helicopter shooting missiles at, you know, the Russians, Rambo III. You’re not quite doing that.

00:25:34.69 [Tim Wilson]: Was that Rambo three or was it?

00:25:38.32 [Michael Healy]: Was that Rambo II? Sorry. I have no, I don’t even know why, but the fact that you just rattled off the exact scenario, like you had a scene in your mind. That’s always, like, my nightmare, when people try to be like Rambo programmers. Rambo program people are just like, hey, I’m going to jump out, I’m going to do it all by myself. I’m like, no, we’re a team.

00:25:57.30 [Tim Wilson]: Well, I guess I’m not implying kind of a going rogue, but I definitely have worked with a lot of organizations where the historical data, you know, it’s an Oracle data warehouse and they’ve got their Rackspace stuff and they’ve got to provision everything. And it’s just kind of, after a year or two, somebody comes in and says, that’s fine, we’ll play nice with you. But, you know, we’ve got to get something done in the next month.

00:26:23.39 [Michael Healy]: And so that’s a lead in for them to get started, as well as a lot of companies, there’s technical debt, right? They’re already waiting for requirements to be filled in the data center, right? In most places, there, you know, wouldn’t you say that it’s true that in most places, there are requests for data warehouse upgrades that are already in the queue. And then you come along, oh, here’s another one that I need, but I need it right away. Well, it’s just gonna go at the end of the queue. So, you know, what is a better way to do this?

00:26:49.38 [Tim Wilson]: Or the fourth thing in the queue stays the fourth thing in the queue forever because there’s always something that comes in above it.

00:26:55.73 [Michael Healy]: So there’s a little more to it than just like, hey, let’s turn it on. Obviously, you know, we really leverage the REST API a lot. So you need to have some developer resource in there to kind of get this operationalized. Now, could you go in and manually upload some subset of files? Great. You could totally do that. You know, you could upload it into BigQuery, or into the storage and then put it into BigQuery, and fantastic, make queries against it. Like, you could do ad hoc, basically by hand. But if you wanted to operationalize it, you need to have a little bit of developer resources, but not a lot. And after that, like I said, you don’t have a whole data warehouse infrastructure team supporting you. You just need to make queries and get your data out.

00:27:35.35 [Tim Wilson]: It seems like there is that thing where people say, oh, I’ve got GA Premium. If I’m an analyst and I’m excited because, hey, we finally went to Premium because, maybe with something fairly obvious, you know, we were hitting limits. And so great, I’m going to get unsampled data. That seems like one of those sort of misnomers out there: I’m getting unsampled data. That unsampled data is really not the same thing as the event-level data of flipping the switch and loading it into BigQuery, right? So if an analyst says, no, I really have the capacity and the ability to work with basically the hit stream, and I want to access that. I may have GA Premium, but I’m not really going to get at that data until I flip the switch and turn on BigQuery. You made the point earlier that even with a relatively small site, the actual raw data can still be a decent-sized data set. And I would think, for doing true analytics on it, the more atomic you’ve got, the more data points you have, the more opportunity you have to actually find something. If you just get down to the, you know, stack all your dimensions up with whatever metric, that may be a bigger flat table, but it’s still stopping short of my actual atomic hit-level data.

00:28:55.59 [Michael Healy]: Yes, so you asked earlier if you’d have to rebuild reports in BigQuery, basically rebuild Google Analytics reports via BigQuery. To some degree, that’s probably true. If you want to know how many page views or unique visitors, you’d have to have that query and run it in BigQuery and do some sort of SQL aggregation function, but that’s super simple. If you can’t figure that part out, feel free to reach out to us or read a book, because it’s so simple. It’s so easy that basically anybody should be able to do it. So that is true. Now, with regards to deeper-level analysis, yes, it’s 100% true. You’re able to do further event-level analysis of everything that’s going on on your site. Where we really see the value proposition just explode, almost exponentially, is when you start looping in other data. So suppose you have call center data or some sort of customer churn data or some sort of offline data, which is not captured in your web analytics tool. So, linking up your web analytics data to other types of data via the Google BigQuery tool, then you become this, basically, magician that is able to do amazing things.
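
A minimal sketch of the kind of SQL aggregation described above, for readers who want to see it written out. It uses Python’s built-in sqlite3 module as a local stand-in for BigQuery, and the `hits` table, its columns (`visitor_id`, `hit_type`, `page_path`), and the sample rows are all hypothetical; a real Google Analytics export has a different, nested schema, so treat this only as an illustration of counting page views and unique visitors from hit-level rows.

```python
import sqlite3

# Hypothetical hit-level rows: (visitor_id, hit_type, page_path).
# These names are illustrative, not the real Google Analytics export schema.
HITS = [
    ("v1", "PAGE", "/home"),
    ("v1", "PAGE", "/products"),
    ("v1", "EVENT", "/products"),
    ("v2", "PAGE", "/home"),
    ("v3", "PAGE", "/home"),
    ("v3", "PAGE", "/checkout"),
]


def traffic_summary(rows):
    """Return (page_views, unique_visitors) computed from hit-level rows."""
    conn = sqlite3.connect(":memory:")  # local stand-in for a warehouse
    conn.execute(
        "CREATE TABLE hits (visitor_id TEXT, hit_type TEXT, page_path TEXT)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
    query = """
        SELECT COUNT(*)                   AS page_views,
               COUNT(DISTINCT visitor_id) AS unique_visitors
        FROM hits
        WHERE hit_type = 'PAGE'           -- ignore non-pageview hits
    """
    page_views, unique_visitors = conn.execute(query).fetchone()
    conn.close()
    return page_views, unique_visitors


if __name__ == "__main__":
    print(traffic_summary(HITS))
```

The same COUNT and COUNT(DISTINCT ...) pattern carries over to most SQL dialects, though table and field names would need to match whatever the actual export provides.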

00:30:05.61 [Michael Helbling]: Like what?

00:30:06.65 [Michael Healy]: So customer lifetime valuing. You could do very simple customer lifetime value if you identify your known customers.

00:30:12.18 [Tim Wilson]: So do you wind up doing cross device stitching?

00:30:14.26 [Michael Healy]: I mean, I would take it even further. So let’s assume that you’re basically doing all your device or web analytics tracking inside Google Analytics, that you landed on that. So we’re using Google Analytics for website, for device or whatever, but we also have voice call, or we also have customer supply chain, you know, things like back-end order tracking stuff, basically what would be considered ERP. So like, when was the order placed? When was the order shipped? When was the order delivered? This sort of thing. Connecting that data all together, stitching a complete customer life together inside Google BigQuery, is really where the value is. So the more you can flesh out your customer, take a step back, don’t just say, we’re going to capture device interactions. Those are table stakes.

00:30:59.82 [Tim Wilson]: Well, no, no, no. I mean, to be fair, you jumped in a little quick. I was going on the path of that, and then there’s an offline transaction. And so therefore, you can get across, yes, cross device. And then where I was kind of heading, because you still have to have a key, right? I mean, no amount of technology is going to magically make your key to do the join, right? You have to have one. And sometimes it seems like that has to be thought through in how your operational processes are designed. If they’re making an in-store purchase, have you thought through how that in-store purchase would be linked back to the call center, to how the IVR system is linked to your digital data? It’s not magic, in that just because you have the two data sets in one spot, you still have to have a key to join them together, right?

00:31:49.30 [Michael Healy]: Right. So that’s the pre-planning that is required. And sometimes your vendor is going to really have an opinion about how the data is going to be structured. So you kind of have to figure it out. And some of it is your implementation from Google Analytics, and obviously we’re assuming that someone is sort of an identified customer in both instances. So you’re assuming that on the website you’re identified, in the store you’re identified. And we can also get a key and put those two together and kind of stitch that together. If you don’t have that, you can do some sort of meta-analysis to say, okay, we have this in-store performance now. We have this bucket of anonymous visitors. You know, how did our anonymous visitors’ behavior as a whole change and potentially affect our identified in-store purchases? So potentially you could work around that, but it would get messy and be sort of complicated. Ideally you’d have, you’re right, some sort of foreign key that would link all these tables together. And that way you could say, hey, great. Here’s your customer ID number; once you’ve logged in, we know your customer ID. So if you have a vendor that is maybe tracking online e-commerce purchases, they can track you by email, or if they actually have you logged in, they may have a customer ID.
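
To make the point about needing a common key concrete, here is a small sketch, again using Python’s sqlite3 as a local stand-in for a warehouse. The tables `web_sessions` and `store_orders`, the `customer_id` column, and every value in them are assumptions invented for illustration. Each source is aggregated per customer first and then joined on the shared key, so anonymous sessions (no key) and customers who appear in only one source fall out of the joined view.

```python
import sqlite3

# Hypothetical data: web sessions carry a customer_id only when the visitor
# was identified (logged in); None marks an anonymous session.
WEB_SESSIONS = [("s1", "c100"), ("s2", "c100"), ("s3", "c200"), ("s4", None)]
# Hypothetical offline orders: (order_id, customer_id, amount).
STORE_ORDERS = [
    ("o1", "c100", 40.0),
    ("o2", "c200", 25.0),
    ("o3", "c200", 10.0),
    ("o4", "c300", 99.0),  # bought in store, never identified online
]


def customer_view(sessions, orders):
    """Join per-customer web activity to per-customer store revenue."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE web_sessions (session_id TEXT, customer_id TEXT)")
    conn.execute(
        "CREATE TABLE store_orders (order_id TEXT, customer_id TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO web_sessions VALUES (?, ?)", sessions)
    conn.executemany("INSERT INTO store_orders VALUES (?, ?, ?)", orders)
    query = """
        SELECT w.customer_id, w.sessions, o.orders, o.revenue
        FROM (SELECT customer_id, COUNT(*) AS sessions
              FROM web_sessions
              WHERE customer_id IS NOT NULL      -- anonymous rows cannot join
              GROUP BY customer_id) AS w
        JOIN (SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS revenue
              FROM store_orders
              GROUP BY customer_id) AS o
          ON o.customer_id = w.customer_id       -- the shared key
        ORDER BY w.customer_id
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in customer_view(WEB_SESSIONS, STORE_ORDERS):
        print(row)
```

Aggregating each side before the join is a deliberate choice here: joining raw sessions to raw orders would multiply rows for any customer with several of each.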

00:32:58.25 [Tim Wilson]: Now we’re heading down the… I don’t know that I feel like a little bit that sort of the vendors are saying all you have to have. Yeah, yeah, yeah, yeah, yeah, yeah, yeah, you have to have your common key. Yeah, let’s just assume you have the common key. And now we’ve just like ignored the reality of the other 85% of the business models that don’t.

00:33:13.95 [Michael Healy]: I’m not saying it’s always easy. What I’m saying is that I’ve seen it accomplished. And we’ve accomplished it, so it’s not impossible, but it’s definitely something that you have to think about and kind of understand from the beginning, that we want to understand the customer life cycle. You know, if it’s a situation where we’re going to a vendor where they have a CRM system in place already, then they already really should have, hopefully, you know, fingers crossed, thought out a customer model. So if they’re using Salesforce, then they have customer IDs in Salesforce and they’ve kind of already thought about all the different touch points. So if you go into that kind of environment, then it’s more, how do we get that Salesforce ID exposed back out to BigQuery. So it kind of depends on the industry.

00:33:56.20 [Tim Wilson]: So say I’m an Adobe user and Adobe’s got their data feed, where you’re not going to customize it at all. And I think it’s included with most versions, if you are willing to take an unfiltered, unsanitized data feed from Adobe. Have you seen that? Or I guess the simple question is, are there cases where somebody’s using Adobe Analytics where it makes sense for them to go into BigQuery? Is that, well, they’ve got other data that they’re already pumping into BigQuery for other reasons and they should bring it in? Or what’s kind of Adobe to BigQuery?

00:34:27.69 [Michael Healy]: I think there are two data feeds. One is the data feed, which everybody calls the data feed from Adobe, which is the flat file you get. Then there’s the Adobe streaming data, which is called Live Stream, I want to say, where they actually send the event out to you. So could you put either of those into BigQuery? Yes, absolutely you could. Would you ever do that? Typically, that would be a decision made on a new install. So when we’re working with clients who are on the data feed, if it’s an existing Adobe client, they may already be getting the data feed. And this may be a question that they have asked and answered of their IT organization, you know, five years ago. And in that case, you know, if they already have the process to read this data into Netezza or Teradata or some other on-site device, then what, are we going to go in and say, hey, you guys are morons, you’ve got to put it all inside BigQuery? No, of course not. Right. I mean, they already have a whole process to read and analyze the data. So it doesn’t make any sense. In the case where it’s a new install, then it comes in and there’s definitely a conversation about where do you want to store this data and how do you want to access it? And it really comes down to one of three options. Basically, one is Google BigQuery. Some companies do have concerns about either putting your data with Google or they want to stay with Adobe. So they look at the Amazon Redshift product, which I mentioned, which is very scalable. It’s less of a platform as a service. So you do have to actually understand how the databases are spun up and managed. You have to do a little bit of management. You don’t have to have them on site. They’re in the cloud, but you do need to do some load balancing or whatever the case may be. And the third option is, they may want to buy and build an internal database.
For the final one, I would say the people who are using internal on-site databases, that would be Netezza or Teradata, typically, or the ilk. If I’m forgetting one, please excuse me. That would be somebody that, you know, if you’re talking about a new install, not the legacy install, you know, like five or 10 years ago, if you’re talking about somebody starting new today, there may be infosec, information security, requirements or some sort of legal or

00:36:33.14 [Tim Wilson]: just old guard paranoia, I mean, potentially.

00:36:35.50 [Michael Healy]: Yeah, just some sort of logistical requirement where they say, you know what, we can’t put our data in the cloud. We just can’t. And I’ve seen this, typically it’s with financial firms, like they really need to be on site. And so that ends the conversation there. That’s our infosec requirements. So then the question is, you know, first, can you put your data in the cloud or not? If you can put it in the cloud, then it becomes, you know, are you comfortable with the way Google handles your data? Or, if they’re an IT organization that is already knee-deep in the Amazon Web Services cloud, so they’re using Amazon Web Services, Compute Engine, or what have you, they may be really interested in Redshift just for logistical concerns. Network IO is sort of the carbon monoxide in the room. So network IO is just the cost to transport the data from point A to B, your data, right? And so if you have a ton of data already being produced by your Amazon Web Services, then it’s pretty smart to just send it into Redshift. That’s the way it works. Now, if you’re coming into it fresh and you have nothing to worry about, then I would say, you know, consider, look at BigQuery. It’s the way to go for a lot of companies. The lack of infrastructure and organizational overhead that you have to expend to get it up and running is quite nice, and it makes it very, very affordable.

00:37:55.92 [Tim Wilson]: There’s one word we haven’t heard at all during this, and I just don’t know where it fits. So where does Hadoop fit into this entire universe?

00:38:04.23 [Michael Healy]: So Hadoop is actually an ecosystem of numerous things, of a few things. When Google published a paper which identified the Google File System, GFS, some very smart people read that, and they said, hey, we could basically implement something. We could kind of figure that out. And the technology isn’t exposed, but conceptually, we understand it. So we’re going to use Java to build what’s called HDFS, or Hadoop file system. So you started with GFS, the Google File System, then you went to Hadoop file system, they built that. Now Hadoop is actually the umbrella term for not just the Hadoop file system, but also the MapReduce engine, which Google wrote about in their paper. So they built these two components and put them together. And basically Hadoop would be for someone who also wanted to be on site, but not in a relational database service. So it’d be a different way to go. If you wanted to be on site, but not necessarily in a relational database format, you could go with a Hadoop vendor like a Cloudera.
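
For readers unfamiliar with the MapReduce idea mentioned above, here is a toy, single-process sketch in plain Python. It is not Hadoop and not Google’s implementation; the function names and the sample log records are made up for illustration. It only shows the three conceptual steps: a map step that emits key/value pairs, a shuffle step that groups values by key, and a reduce step that combines each group.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (page, 1) pair for each log record."""
    yield record["page"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse one key's values into a single total."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical page-view log used only for this example.
LOG = [{"page": "/home"}, {"page": "/products"}, {"page": "/home"}]

if __name__ == "__main__":
    print(map_reduce(LOG))
```

In a real cluster the map and reduce steps run in parallel across many machines and the shuffle moves data between them over the network; this sketch keeps everything in one process so the shape of the computation is easy to see.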

00:39:00.13 [Tim Wilson]: Okay, that’s it. I’m totally I’m fully knowledgeable now.

00:39:06.04 [Michael Healy]: You’re ready to go pitch some clients. Now I know what you’re going to talk about at Web Analytics Wednesday. Let me tell you how smart I am. Let me give you a one-hour discourse on the Google file.

00:39:15.55 [Michael Helbling]: Well, no, it’s really good. And actually, I’m just sitting here basking in the glow of proof that I’ve gone out and gotten people way smarter than me to work at Search Discovery. So that’s awesome. No, anyways, Michael, thank you so much for coming on the show. This has been great. And I think it’s something, you know, people who listen to our show will get a lot out of, maybe not as much out of it as Tim Wilson, but no, I think it’s a great topic. And I think, you know, it’s timely. We’ve talked a lot on the show about how data science is becoming more and more central to what it means to be a digital analyst. And so I think there’s a lot of interest in these kinds of topics and these kinds of things. So thank you very much. And I certainly learned quite a bit that I could have learned a year ago, apparently, if I had attended the internal… I’m sure I wanted to. I’m sure I wanted to. Anyway.

00:40:13.44 [Michael Healy]: Thanks for having me. It was awesome.

00:40:15.21 [Michael Helbling]: Yeah. Well, and so we do a thing on the show, Michael, where at the end of the show we do a thing called last call, where, you know, if there’s something cool you’ve seen in the last few weeks or whatever that you think is worth noting. So we just go around and do that. And I don’t know, Tim, why don’t we start with you? What’s your last call?

00:40:34.75 [Tim Wilson]: Okay, so I got a couple I could do. I think what I’m going to go with, and it’s going to be appropriate because now that my brain is full and, momentarily, until I wake up tomorrow morning and realize that I don’t have any understanding of any of this, I feel like I’ve got a better sense of BigQuery. I’m going to go with something sort of simple and basic, but I realized how much I am using it now where I wasn’t using it a year ago. So, plain old web analysts, we always have to check the tag, see what’s firing. For years, I mean, I’ve used Fiddler, I’ve used Charles Proxy, I’ve used the Google Analytics Chrome debugger for GA. I’ve used the DigitalPulse Debugger for Adobe. But a while back, I hit a spot where I needed to capture POST data, and that was not being captured anywhere. And Josh West, at Analytics Demystified, pointed out, he was like, oh, use the ObservePoint Tag Debugger. It’s free. It’s a plugin for Chrome. So it’s simple. I realized I use that 99% of the time now, if I’m not trying to debug a mobile site or mobile app where I need to run it through Charles. So I had a client a few weeks ago, I sent a screen capture and they’re like, what is that? That’s not the DigitalPulse Debugger. I’m like, no, it’s way more awesome. So that’s a plug. It’s free. It gets both your GA and your tag management and your Adobe tags all in one spot. It’s not perfect. I’d love to have, you know, the ability to do a little filtering in it, but yeah, it’s a handy little tool, little Chrome plugin. And it’s right there where you might be inspecting elements anyway. So that’s my last call.

00:42:13.48 [Michael Helbling]: Nice. What about you, Mr. Healy?

00:42:16.69 [Michael Healy]: So I have a book recommendation: Max Tegmark. He wrote a book called Our Mathematical Universe, which discusses his hypothesis that our physical reality is a mathematical structure, and his theory of the ultimate multiverse. So this is kind of like, several people have been discussing it lightly, about why is math… Why does math so…

00:42:38.18 [Tim Wilson]: …appropriately define our universe? Can’t say I understand it a hundred percent or can speak eloquently about what he’s talking about, but it certainly is very thought-provoking. It’s like reading A Brief History of Time, which I was awesome with for like the first two chapters, and then he didn’t… I’m like, how can this little fucking book, like, make me… Yeah, it’s gone.

00:43:02.37 [Michael Healy]: I read that, in A Brief History of Time, he was advised that for each equation… Can you guess how many readers, what percentage of readers, fall off?

00:43:12.19 [Michael Helbling]: There you go. See, analytics can solve this problem.

00:43:14.91 [Michael Healy]: 50%. So yes, what do you sell? 50% of your books. So he only has one equation in the whole book, and it’s E equals mc squared. So Our Mathematical Universe has quite a few more equations.

00:43:27.99 [Michael Helbling]: So what’s interesting, my last call is actually not very dissimilar from Michael Healy’s, but totally coming from my direction of not being as good at math as some people. And that is, recently I stumbled across this YouTube video by this guy by the name of Scott Flansburg, who is one of these guys, he’s like a human calculator. So he can add things up really fast and do math really quickly. And he had a YouTube video that I watched about how he learned math and some of the things he’d found. And it was just fascinating. I guess maybe I’m getting old or something, but I just found it really interesting. So if you’re ever looking for a way to waste 45 minutes or so, go check out Scott Flansburg’s videos on YouTube, and he’s got some pretty cool stuff. Sort of like, the number nine means something, and it ties into the mathematical universe because, like, math is totally universal. Anyway, if you’ve been listening to the show and you want to get in on the conversation, we would love to hear from you on our Facebook page, or on the Measure Slack. Michael Healy is also on the Measure Slack, dropping some science here and there. So come check us out, ask questions, hang out with us. We’d love to hear from you.

00:44:46.84 [Tim Wilson]: Can I throw in, on the Measure Slack, because we occasionally get people asking. So if you go to Bitly, addmeasureslack, one word, it can be all lowercase. Or you can capitalize the A and the M and the S, if that bothers you. They both work. And there’s a Google form. And jump on in there. We know we have contributed to that community through this podcast.

00:45:06.83 [Michael Helbling]: That’s a good thing to add in. Well, once again, Michael Healy, thank you very much. And for my co-host, Tim Wilson, keep analyzing.

00:45:20.25 [Announcer]: Thanks for listening. And don’t forget to join the conversation on Facebook, Twitter, or Measure Slack Group. We welcome your comments and questions. Facebook.com forward slash analytics hour or at analytics hour on Twitter. Our shit we’re recording too.

00:45:47.27 [Michael Healy]: You guys work with some divas, I’m sorry.

00:45:49.63 [Michael Helbling]: We do. We absolutely do.

00:45:53.66 [Michael Healy]: Oh god, not again.

00:45:57.98 [Michael Helbling]: Mr. Wilson and me.

00:46:02.23 [Tim Wilson]: I feel like maybe that’s one of the, one of the entries, kind of where Michael was starting, or, hell no, this is the first time we’re gonna. Yeah.

00:46:11.74 [Michael Helbling]: What’s your, what’s your guys’ internal convention? Doctor. Doctor. I go by my nickname. We do West Coast Michael and East Coast Michael. Westy and Easty? No.

00:46:25.89 [Michael Healy]: You’re sub-recording, right?

00:46:27.95 [Michael Helbling]: What’s the old saying, a prophet is respected except in his hometown?

00:46:32.79 [Tim Wilson]: So knowing we can fix this in post if it’s the world’s most dipshit question, that there’s… Our data, it’s so big and I’m like, let’s pump the brakes here, alright?

00:46:42.92 [Michael Helbling]: You’re not like either one of them, and neither are the two of them like each other.

00:46:48.95 [Michael Healy]: So what was your question again?

00:46:50.11 [Michael Helbling]: Pretty, pretty, pretty, pretty good. We don’t do this as a YouTube video because nobody could stand to look at me and Tim for that long. So, I’m having trouble right now. Hey! Hey!

00:47:09.84 [Michael Healy]: And bada bing bada boom, you’re done with the day.

00:47:13.26 [Tim Wilson]: West Coast, they’re still not realized what I’m talking about. I realize it, just don’t care.

00:47:16.84 [Michael Helbling]: Okay.

00:47:18.47 [Michael Healy]: I just don’t care. I mean, this is obviously like a Peabody award-winning episode.

00:47:26.60 [Michael Helbling]: Yeah, this is the one we’ll submit to all the podcast awards for sure. Rock flag and BigQuery!
