#043: Open Source Analytics with Simon Rumble

Somebody wants to overthink their analytics tools? Tell ’em they’re dreamin’! We wanted to talk about open source and event analytics, and Snowplow sits right at that intersection. Our guest Simon Rumble is the co-founder of Snowflake Analytics and one of the longest-standing users of Snowplow. We wrap up the show with all the places you can find Simon and Tim in the next few months. Fun fact: You will also learn in this episode that conversion funnels go down the opposite direction in Australia.

Episode Transcript

The following is a straight-up machine translation. It has not been human-reviewed or human-corrected. However, we did replace the original transcription, produced in 2017, with an updated one produced using OpenAI’s WhisperX in 2025, which, trust us, is much, much better than the original. Still, we apologize on behalf of the machines for any text that winds up being incorrect, nonsensical, or offensive. We have asked the machine to do better, but it simply responds with, “I’m sorry, Dave. I’m afraid I can’t do that.”

00:00:04.00 [Announcer]: Welcome to the Digital Analytics Power Hour. Tim, Michael, and the occasional guest discussing digital analytics issues of the day. Find them on Facebook at facebook.com forward slash analytics hour. And now, the Digital Analytics Power Hour.

00:00:27.52 [Michael Helbling]: Hi everyone, welcome to the Digital Analytics Power Hour. This is Episode 43 in our continuing series on topics that Tim Wilson is using the podcast to find out more about. Welcome Tim, by the way. We’re talking this episode about open source analytics. A few of these tools exist, but one that we see pop up on our collective radar more and more is Snowplow Analytics. So we’re very interested in it and we want to learn more. And since Tim and I don’t use the new and cool tools very often, except for Tim and R, which is, you know, the new love of his life, we needed a guest to help us along the way. And we’re very fortunate to have what we’ll call a very special snowflake with us today, Simon Rumble. He’s the co-founder of Snowflake Analytics, a consultancy built around the Snowplow product. In his previous roles, he was the head of analytics at Bauer Media Group. He’s very active in the Australian analytics community. He’s also very active on the Measure Slack, and we’re very pleased to have him on the show. Welcome, Simon.

00:01:38.56 [Simon Rumble]: Thank you.

00:01:39.52 [Michael Helbling]: Oh, wait, I wanted to do, since your last name is Rumble, I wanted to do: so let’s get ready to talk about Snowplow. I think if I say the whole thing that might violate, like, that guy’s trademark, so I just went with the first part.

00:01:59.78 [Tim Wilson]: Yeah, we clipped that out because our legal department advised us that we should not go there.

00:02:04.67 [Michael Helbling]: So I think a good starting point will be maybe Simon, if you wanted to walk people through kind of what Snowplow is, what purpose it serves kind of in the broader constellation of analytics tools and kind of where you, let’s start with that and then we’ll go from there.

00:02:20.39 [Simon Rumble]: Sure, so Snowplow is an open source web analytics platform, or at least that’s kind of how it started back in 2012. It started out, there’s these two really smart guys in London, they’re working for an ad server platform, and part of their task was to try and integrate the ad server stuff with the traditional web analytics tools, Google Analytics, Omniture and the like that everyone else was using. And what they discovered was that it was really old-fashioned and really kind of clunky, the way that the web analytics tools were doing this stuff, and it was really hard to integrate things together. So they decided, well, let’s write our own, as you do, because, you know, it seems like a fairly simple, straightforward problem, and five years later you find yourself still working on it. And they put it up on GitHub. They announced it. I discovered it a couple of weeks later, set it up, started running it and thought, wow, this is really awesome. And they’ve kind of built a business around it. And they’ve really kind of started to hit some traction around various parts of the world. So they started off building this entirely within the Amazon ecosystem and kind of got broader beyond that. But the thing that was really important about it is that they were very clever in the way they separated out the different parts of what it takes to collect and process web analytics data, and built well-identified and well-defined protocols between each of those stages. So if we kind of go through it, you’ve got like the JavaScript bit that sits in the browser that they call the tracker, which sends stuff to a collector, which sends stuff into somewhere that stores that data.

00:04:01.15 [Tim Wilson]: And the collector is the collector when it switches over to the server. Like it’s basically the tracker just like your Google Analytics or Adobe JavaScript. When it’s making that call, it’s saying literally, they’re just saying, let’s just isolate the capture of that call. That’s collection.

00:04:16.48 [Simon Rumble]: That’s right. Yeah, so the tracker is your analytics.js or your s_code. And then the collector is the server side stuff. So that’s the thing that collects the image pixel. Then it goes into some kind of storage mechanism. Then it gets processed or ETLed. And then it goes into some storage system which you can model and then operate on. And all of those components are interchangeable. And so that’s, I got involved fairly early on because the way they were dealing with cookies was very much the Google first-party cookie set inside the browser approach, and I needed to track sites that span multiple domains. So I set up our third-party cookie collector and just slotted it in, and it fit straight into the ecosystem and worked, which kind of blew my mind because it was a really easy way to get into this open source product. And since then lots and lots of other people have kind of jumped on and started submitting bits in.
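To make the "interchangeable components" idea concrete, here is a minimal Python sketch of the stages Simon lists: tracker, collector, raw storage, and an ETL step. It is not Snowplow's code. Every function name, the `e`/`url` field names, and the dictionary shapes are assumptions made for illustration. The only point is that two collectors can be swapped freely as long as they write the same raw format.

```python
from urllib.parse import parse_qs, urlencode


def tracker(event):
    """Browser side: serialise an event as a query string, like a pixel request."""
    return urlencode(event)


def first_party_collector(request, raw_log):
    """Server side: store the raw request untouched (hypothetical collector)."""
    raw_log.append({"collector": "first-party", "querystring": request})


def third_party_collector(request, raw_log):
    """A drop-in alternative that writes the same raw format (hypothetical)."""
    raw_log.append({"collector": "third-party", "querystring": request})


def enrich(raw_line):
    """ETL step: parse a raw log line into a flat record for modelling."""
    parsed = parse_qs(raw_line["querystring"])
    fields = {key: values[0] for key, values in parsed.items()}
    return {"collector": raw_line["collector"], **fields}


def run_pipeline(events, collector):
    """Run every event through tracker -> collector -> storage -> ETL."""
    raw_log = []  # stands in for the raw storage stage
    for event in events:
        collector(tracker(event), raw_log)
    return [enrich(line) for line in raw_log]


events = [{"e": "pv", "url": "/home"}, {"e": "pv", "url": "/about"}]
first = run_pipeline(events, first_party_collector)
third = run_pipeline(events, third_party_collector)
```

Because both collectors honour the same contract with the stages on either side, `run_pipeline` does not need to know which one it was given, which is the property Simon describes when he says his collector "just slotted in".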

00:05:10.20 [Tim Wilson]: So is that what, like with what you, when you put that little piece, did not only you, you built that and you slotted it in, but then you also added it. I’m not fully versed on the open source kind of checking in and checking stuff out, but does that mean you also then contributed it back so that now others can use the third party?

00:05:27.37 [Simon Rumble]: Yeah, that’s how it works with open source. Essentially, everything’s put under a license that allows people to share and allows other people to use whatever it is that you do. So my collector was actually a standalone component. So it didn’t become part of the standard Snowplow stack. It actually is a component that could be kept separately, and for that matter, if I had decided to license it as a commercial product, I could have done that. I could have put it under different terms, whatever I liked, because it just talks using the same protocols as all the rest of the stack. It can slot into that whole environment. But yeah, the way open source works is generally you’ll post stuff up and submit back your contributions under the same license, and that means that it can be reshared. And there’s various open source licenses available that you can use that put different kinds of conditions on what you can and can’t do.

00:06:19.29 [Tim Wilson]: So what of those different chunks? I would assume that some of those are lighter, like relatively speaking, thinking in the lines of code required that the, you know, collection can have a lot of, or the tracker can have stuff added to it and kind of grow collection seems like it’s kind of dumb and simple the modeling to me seems like the sort of thing where you’re really are having to make some fundamental decisions of how are you kind of stitching stuff together is it in a sense they’re each kind of independent stacks of code and is it oh this is the one that’s the real beast or they all kind of is it chunked in a way that when you break it down that way you really have kind of put an equal level of kind of complexity and weight and architectural decisions into each one of those buckets. Does that question make sense?

00:07:09.24 [Simon Rumble]: Yeah, yeah. So different bits have different levels of complexity, but then there’s different versions and flavors in each stage as well. So on the tracker side, the JavaScript, you know, that’s kind of evolved over time. And if you think about the stuff that’s gone into Google Analytics’s latest round of changes, you know, their tracker code is getting more and more complex. And the Snowplow stuff has started doing some of that as well. But then there’s also apps. So you can do Android and iOS and Microsoft, and they also have libraries for a whole bunch of embedded devices, server-side stuff, so there’s a whole bunch of different things there. But at the end of the day, they’re not doing a hell of a lot. Some of those do interesting things like store up beacons to send later, which, you know, is similar to what you do in embedded applications in other platforms. The tracker is pretty simple, yeah? That is a pretty simple piece. My initial tracker was something like 50 lines of code written in Node.js, you know, because it really wasn’t doing very much, and so that was pretty simple. So the way that the whole Snowplow platform has evolved, it has morphed into a more generic events framework. So it’s not really just focused on web analytics anymore. You can start shoving, you know, just about anything in there. And that means you could put in, you know, things like Internet of Things devices, really, really complex stuff coming out of apps, recording just about anything you like, server-to-server kind of communications, pretty much anything. So the place where a lot of the work has happened is in that transformation stage and then the modeling stage as well. 
So turning different kinds of event models into the right shape has been where the kind of action’s been. That means there’s a whole language for embedding arbitrary event types called contexts, so custom contexts, and that means basically you can send any little piece of JSON, so any kind of object that models what your event actually is. So that event might be someone loaded a web page or someone clicked on an exit link, you know, in the web analytics kind of space, but it could equally be someone just made a purchase at the point of sale in the store and here’s the loyalty card that they swiped with the transaction. Anything can be modeled in that and it all ends up in this one kind of unified event stream, which is kind of exciting.
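As a rough illustration of "any little piece of JSON" ending up in one unified stream, the Python sketch below wraps each payload in a schema-plus-data envelope. The envelope reflects my understanding of Snowplow's self-describing JSON convention, but treat it as an assumption. The `com.example` vendor, the schema names, and every field are hypothetical.

```python
import json


def self_describing(schema, data):
    """Wrap a payload with a reference to the schema that describes it."""
    return {"schema": schema, "data": data}


# Hypothetical schema references; "com.example" is a placeholder vendor.
page_view = self_describing(
    "iglu:com.example/page_view/jsonschema/1-0-0",
    {"url": "/home"},
)
purchase = self_describing(
    "iglu:com.example/pos_purchase/jsonschema/1-0-0",
    {"store": "sydney-01", "total": 42.50},
)
loyalty_card = self_describing(
    "iglu:com.example/loyalty_card/jsonschema/1-0-0",
    {"card_id": "ABC123"},
)

# A web event and an in-store event share one stream; the loyalty card
# travels with the purchase as a custom context.
event_stream = [
    {"event": page_view, "contexts": []},
    {"event": purchase, "contexts": [loyalty_card]},
]
serialised = [json.dumps(item) for item in event_stream]
```

Since each payload names its own schema, a downstream transformation step can decide how to shape it without the tracker or collector needing to understand what a loyalty card is.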

00:09:31.19 [Tim Wilson]: So how, as you were, I mean you were kind of there as external user number one. When you look at the learning, the adoption curve, well, presumably at that point there was not much of a community. There wasn’t a whole lot of, hey, we’ve headed down some rabbit holes that don’t pan out well. I guess one of the benefits of using kind of a mainstream, been-around-forever Google Analytics, Adobe Analytics is, you know, the benefit is a lot of the thoughts and decisions about how do you structure this stuff and what makes sense have already been made for you. The downside is that the way that you structure this and how you think about it has already been decided for you. So it seems like it would require, you wouldn’t want somebody who’s never done web analytics but is really sharp coming straight out of school diving in and trying to update or use or model something here, because they might just not know enough. It seems like you have to have kind of an architectural mind, or is that not? Has it been built such that you still can kind of be up and running with, call it, mainstream tasks that somebody else has figured out, call it just web analytics, and then are you being drawn to it because you’ve got some oddball quirky thing that you know you have to deal with, and that’s why you kind of have the aha of, oh wait, something like Snowplow is exactly what I’m looking for. I don’t know. Have you painted yourself into corners with it? Have you done things that you’re like, whoops, throw that out. Let’s start over.

00:10:56.99 [Simon Rumble]: One kind of architectural component that was in the original release was that it uses Amazon Elastic MapReduce, which is their version of Hadoop that you can use on demand. And the only way to do queries in the original version was using the Hive query language, which is a kind of SQL-ish query language for really, really, really large data sets. The upside of that is that no matter how much data you throw at it, so long as you throw more nodes at it, your query takes about the same amount of time. The downside of it is that at that point, the Hive queries took about seven or eight minutes. So you needed quite a bit of patience. So that is one particular kind of architectural rabbit hole. And there have been a bunch of places and bits of development that have happened that haven’t necessarily panned out. But that’s the nature of open source. And you know, there’s certainly some quirky corners of Adobe Analytics and Google Analytics that you can talk about as well. For example, the semicolon delimiting of everything, the fact that everything is a string in Adobe Analytics.

00:12:03.21 [Michael Helbling]: Oh, the product string.

00:12:06.27 [Simon Rumble]: Merchandising eVars and the like. Every time I teach someone how to do an s.products string and have to explain to them that the first semicolon delimits the category, which is deprecated. Yeah.

00:12:19.52 [Tim Wilson]: Yeah, there you go. There’s your street cred like there. This is not a guy who didn’t play around with Adobe before he went into Snowplow. And I think we’re going to talk a little bit kind of open source more broadly, but what is, like, Snowplow I have heard referenced. I had heard of it when you and I were chatting about it months ago, but I barely had heard of it. Like I don’t come across it with anybody that I’m working with, no one’s asking about it. And that may be just because I’m not working with clients that are sophisticated to the point that they’ve hit challenges where they’ve got to find something else. But is there data on what kind of adoption? I mean, 2012 was yesterday. You know, is it taking the world by storm or?

00:13:01.59 [Simon Rumble]: If you look at the BuiltWith graphs for Snowplow rollouts, so BuiltWith is a service that goes out and scans the web and finds basically what JavaScript beacons are in the pages. And if you look at that for Snowplow, it’s got a bit of a hockey stick. That could be somewhat misleading though. So there’s a whole bunch of different use cases around why you would use a tool like Snowplow, and one of them is when you’ve got a white label product that you want to push out and have lots and lots of different clients using it, but you want to have an idea of what’s going on on those sites, but you also want to allow them to install their own analytics on those services as well. So to use it for that, it has actually become a bit of a common thing. So there’s a few services that have hundreds or even thousands of sites that show up in that BuiltWith graph, but it’s not necessarily that the person who owns that site one day went, oh, I want to try out Snowplow. It’s actually that the service they bought uses it and embeds it as part of their product offering. So there’s examples of that. SmugMug is one, which is a photo sharing service. So every SmugMug professional site can have a custom domain pointed at it and you’d never know it was a SmugMug site unless you looked at the source. But all of those have a Snowplow beacon firing on them. So there’s a bunch of different places that are embedding it into their service. So it’s kind of really hard to gauge how much uptake there’s been.

00:14:25.73 [Tim Wilson]: Although that’s, I mean, running into those services that are embedded, you know, it’s so common right now for them to say, they just run Google Analytics. And often it’s usually one account and you sort of think they probably haven’t got Premium. Like if their service takes off, all of a sudden they’re going to have this big line item, plus they’re now going to be possibly getting into challenges with their client, you know, you’re installed on the site that’s running Google Analytics, and you also have us embedded on your site, and that all seems a little dicey. So it seems like there could be the appeal of open source, so therefore the licensing, in this case the scalability, presumably you can very easily control. Even if you are running on a site where Snowplow powers their mainstream web analytics, presumably you’ve got a collector and it doesn’t really care. Your tracker, you can have two trackers. I’m assuming probably you don’t have as many concerns, you don’t have to be as careful as if you’re doing Google Analytics and trying to run two versions of the code.

00:15:24.02 [Simon Rumble]: That’s right. Yeah, it’s perfectly supported to have more than one tracker running and sending stuff to different places. That’s totally doable.

00:15:34.08 [Tim Wilson]: Well, it’s interesting because we won’t go too deep into it, I don’t think, but when we were talking about this and I’d thrown out Piwik, and that’s because there’s a local company in Columbus that, they have some kind of unique needs for data collection and they’re dropping their stuff on other sites, and part of what they needed was a greater, like minute-level, time granularity, something they couldn’t get from GA. So those guys are kind of like, Piwik’s awesome. It gives us exactly what we need. Now they have, they’re much more of kind of a very specific set of things they’re tracking, but from that perspective, and that’s maybe two data points, and I’m trying to extrapolate that open source has an appeal to widgetized or componentized or service type things where it’s like, we’ve got these weird unique things about, like, our whole company is built because we do something that’s weird and unique and we provide that as an experience for a bunch of other sites to use. That does seem like kind of one natural way that you would use open source type tools that have flexibility and scalability and are free, so you don’t have to tack on some licensing costs for everything and

00:16:39.36 [Simon Rumble]: Yeah, that’s certainly one of the use cases, where you want to do something a bit out of the ordinary and you want to have control over it, and that means control over costs, control over exactly what’s collected in what circumstances and then what happens with it. So there’s a whole bunch of kind of use cases around that and it’s showing up in embedded products all over the place. I encountered recently, there’s an organization in the US called the Alliance for Audited Media, and they’re traditionally one of these circulation audit type organizations. So they put their rubber stamp on circulation figures for magazines and newspapers and say, yes, we’ve audited this and it genuinely does have this many copies in circulation. They’re attempting to launch a product around certifying people’s web analytics data. And they’ve actually partnered with their colleagues in Brazil who’ve built their own platform for doing exactly that. And they’ve built that with a collector and data processing pipeline built around Snowplow. So, you know, I kind of encounter these things a fair bit. You’re looking around and you find this embedded product and you go, oh, they’ve got like a little web analytics thing going on, and you have a look at it and you go, wait, that looks kind of familiar. And then you look at the code and you go, oh, that’s the Snowplow collector.

00:17:53.99 [Tim Wilson]: But it is interesting. So I’m not knowing where to look for it. Like I go to your, to snowflake.analytics.com, and Ghostery doesn’t pick up Snowplow as one of the tools running on that site. So is it still, is it kind of under the radar? Because you can actually customize. So it’s harder for them to put keys in to say we can identify this as it.

00:18:14.99 [Simon Rumble]: Yeah, I’ve actually talked to the Ghostery guys a couple of times to try and get them to put Snowplow in there. And I’ve tried to explain to them how the beacons look, so that they can construct it, but they seem to be fairly strictly only looking at URL components. And so they’re not able to detect it because the tracker code, so the JavaScript that you load, runs on your own server. It can run on your own domain. It can be any domain. And it doesn’t look like, you know, analytics.js. It just is a blob of JavaScript. So unless you know what you’re looking for, it’s not so easy to, programmatically at least, find it. The BuiltWith guys, however, have worked it out. So, you know, it’s doable. It’s just Ghostery hasn’t managed to do it.

00:18:55.86 [Tim Wilson]: Well, my impression is always that Ghostery is really more kind of around, they’re more interested in kind of the ad pixel type stuff than necessarily the, like, the analytics kind of gets thrown in. Yeah, if it was some crazy new type of retargeting thing, they might be more receptive.

00:19:09.60 [Simon Rumble]: Well, there’s this thing, there’s nothing to say that someone hasn’t already built some kind of crazy advertising retargeting thing based on Snowplow. You just might not know because it’s embedded.

00:19:21.75 [Tim Wilson]: Of course, now I’m looking at the now I’ve got the developer tools open and I’m looking at snowflakeanalytics.com and trying to figure out if I can see where the what the actual beacon is.

00:19:30.92 [Michael Helbling]: I was looking for the same thing.

00:19:33.20 [Simon Rumble]: You really shouldn’t be reading my code.

00:19:34.91 [Tim Wilson]: It’s terrible.

00:19:36.60 [Simon Rumble]: That website got put up in a day.

00:19:39.34 [Tim Wilson]: Trust us, there’s no danger of Michael Helbling or Tim Wilson. I’d like to stand out. You’re safe.

00:19:45.40 [Michael Helbling]: It looks very elegant to me. I don’t know.

00:19:48.47 [Tim Wilson]: Pretty sure the fonts. Yeah, it looks fantastic. I’m determining that it’s not. So give us a hint. What are we looking for?

00:19:54.02 [Simon Rumble]: So we’ve got all our pixels deployed through Google Tag Manager. So you’re not going to find anything in the source there. And we push everything through the data layer. We do best practice implementations, which that website isn’t. But it comes close.

00:20:06.77 [Michael Helbling]: So you have Snowplow integrated through Google Tag Manager in this case?

00:20:13.06 [Simon Rumble]: That’s right. And that’s excellent. When I work with any clients these days, that’s table stakes. We start with, have you got tag management? If not, that’s our phase one, because it’s just too painful otherwise. One of my big bugbears, and I’ve worked in a lot of publishers and media companies, one of my big bugbears with traditional web analytics, which still isn’t resolved, is the way that they record time spent. So, you know, time spent is just the difference between the two page view events. You know, that’s what they do for time spent. But that means that the last page in a visit gets a big fat zero, no time spent on the page. Now the trend with publishers is that most of their traffic now comes from Facebook. And Facebook, the in-app experience, has this inexorable draw back to your newsfeed. And so most publishers get one page view in each session from people coming from Facebook, unless it’s like a gallery or something like that. Now that doesn’t mean that people aren’t reading the content. So when I was at Bauer, there’s one of their mastheads does really detailed, very high quality journalism, and they have six and seven thousand word articles. And Google Analytics will show you that the average time spent on that is really, really small. But when we started digging a bit deeper on that, we discovered that actually people were reading the whole damn thing. You know, they were spending seven, eight minutes reading the thing, even on a mobile phone, even in the Facebook in-app browser, they’re still reading this and engaging with this content. And when you’re a publisher, what you’re selling is people’s attention. And so it’s a pretty important metric. Google Analytics, Omniture, all of those products, they just throw it away. They just don’t have it. Their numbers, while accepted by the industry, are basically garbage. So one of the things that Snowplow does that I’m quite keen on is the page pings. 
So that’s configurable. You set it up so that after 10 seconds, it’ll send a beacon every five seconds and then ping, ping, ping.
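The difference between the two ways of measuring time spent can be shown with a small Python sketch. It is a simulation only: the function names are made up, and the ping schedule simply follows the "first ping at 10 seconds, then every five seconds" setup described above, counting from page load. It does not call any tracker API.

```python
def time_spent_traditional(page_view_times):
    """Seconds per page as the gap to the next page view.

    The last page in the visit has no following page view, so it gets 0.
    """
    if not page_view_times:
        return []
    gaps = [later - earlier
            for earlier, later in zip(page_view_times, page_view_times[1:])]
    return gaps + [0]


def ping_schedule(first_ping_after, interval, seconds_on_page):
    """Ping times, in seconds after load, while the reader stays on the page."""
    return list(range(first_ping_after, seconds_on_page + 1, interval))


def time_spent_from_pings(ping_times):
    """Estimate time on page as the offset of the last ping received."""
    return max(ping_times, default=0)


# A reader arrives from Facebook, reads one long article for 432 seconds
# (a bit over seven minutes), and goes back to the newsfeed.
single_page_visit = [0]
pings = ping_schedule(first_ping_after=10, interval=5, seconds_on_page=432)

traditional = time_spent_traditional(single_page_visit)
ping_based = time_spent_from_pings(pings)
```

In this made-up visit the traditional calculation reports zero seconds for the only page, while the ping-based estimate lands within one interval of the real reading time. The estimate can only be as precise as the ping interval.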

00:22:07.29 [Tim Wilson]: Is that limited to when it’s an active tab? Because I think I’ve seen people, I want to say Analytics Ninja had written, you know, for Google Analytics. And granted, you’re inherently sort of hacking when you’re saying I’m going to write this ping thing to GA and how I’m recording it and how do I not fuck up all my other data. Is it because the other the other knock with time on page is you have no idea if it’s even the active tab. I mean, yeah, if I once a week, I’m trying to explain to somebody to please for the love of God, don’t put too much faith in that metric in Adobe or Google. But one of those points is what’s great. You loaded it and you wanted to come back to it later, but you went to another tab. So does it have that? And I’m again, I’m not a developer. So but you can detect when it’s an active. That’s right.

00:22:48.80 [Simon Rumble]: There is some smarts in the code there. There’s other products that do this. Chartbeat has a very good implementation as well. Yeah, that’s the one I’ve seen before, Chartbeat. Yeah, I’m happy watching these little pings now. So yeah, he actually does as well. And you know, there are tools that do this. And there are hacks to make it kind of work in Google Analytics, but they have their downsides as well. By doing that in Google Analytics, you quite radically change your bounce rate metric, if you use an interaction ping-like event. And I actually think that that’s probably a good thing, and if you were setting up a brand new website, I would strongly recommend that you do exactly that if you’re using Google Analytics. But if you’ve got a site that’s got a whole history of data and you suddenly change the definition of bounce rate, well, it’s a big deal. I think it’s probably the right decision to make, especially in publishing, but it’s a big change, and your bounce rate’s not going to match in any other analytics tool as well.

00:23:41.64 [Tim Wilson]: So that’s funny. I saw a thread in, I think, the DAA forums where somebody’s trying to figure that out. Look, the old, I’m comparing Google and Adobe. So that’s a good question. So if you have got, let’s say you’ve got Snowplow running. It’s on your homepage. You have the question of what is the average time on page for this page on my site? And that’s just kind of a specific example. Where I’m still a little fuzzy is, well, Snowplow, you’re not logging in saying, I’ve set all this thing up. Now let me log into my fancy web interface that has my 75 canned reports and my nice little segment builder, right? You need to know the data and then you’re querying it as opposed to working in an interface, or no?

00:24:25.50 [Simon Rumble]: Yeah, the interface for Snowplow, such as it is, is generally a database query. So that means your data ends up in either Elasticsearch or some kind of SQL database, or it can end up in files. If the volumes are really, really high, it can end up in files, and then there are ways to query that as well. So there is no generic front end for Google Analytics out of the box. If you go and install, sorry, if you go and install Snowplow Analytics, there is.
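To give a feel for what "the interface is a database query" means for Tim's average-time-on-page question, here is a small Python sketch using an in-memory SQLite database. The `events` table, its column names, and the sample rows are all invented for illustration and are not Snowplow's actual schema. Time on page is taken as the offset of the last ping received for each page view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified event table; real deployments will differ.
conn.execute(
    "CREATE TABLE events ("
    "page_view_id TEXT, page_url TEXT, event TEXT, secs_since_load INTEGER)"
)
rows = [
    ("a", "/long-read", "page_view", 0),
    ("a", "/long-read", "page_ping", 10),
    ("a", "/long-read", "page_ping", 15),
    ("b", "/long-read", "page_view", 0),
    ("b", "/long-read", "page_ping", 10),
    ("b", "/long-read", "page_ping", 15),
    ("b", "/long-read", "page_ping", 20),
    ("b", "/long-read", "page_ping", 25),
    ("c", "/home", "page_view", 0),  # left before the first ping
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Inner query: last ping offset per page view. Outer query: average per page.
query = """
SELECT page_url, AVG(secs) AS avg_secs_on_page
FROM (
    SELECT page_url, page_view_id, MAX(secs_since_load) AS secs
    FROM events
    GROUP BY page_url, page_view_id
)
GROUP BY page_url
ORDER BY page_url
"""
result = conn.execute(query).fetchall()
```

The analyst has to know the event structure well enough to write the inner query. Nothing comes canned, which is the trade-off being discussed here.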

00:24:55.28 [Tim Wilson]: Let me tell you something as it turns out there is a front end for GA.

00:24:59.20 [Michael Helbling]: Hold on, Tim. I’ve got this one. This is one I can answer.

00:25:05.31 [Simon Rumble]: Out of the box, Snowplow doesn’t really have a pretty front end that you can do queries with. It’s not like Google Analytics or Adobe Analytics where you get something for free as soon as you do it.

00:25:15.93 [Tim Wilson]: But is that an area that somebody will be building something for is that no it just makes no sense to even try to come at it that way.

00:25:22.62 [Simon Rumble]: There are actually a lot of opportunities in this whole space to build bits that plug into it. And this is why, you know, we kind of encourage people to get involved in this stuff and start selling the platform. So it’s open source, but the more people using it, the more cool stuff gets built. And there are opportunities around this stuff. So building a front end is one of those opportunities. That said, in implementations I’ve done in the past, we’ve built front ends for it, but they’re not necessarily interactive analytical front ends. They’ll be things like dashboards that show aggregated stuff, or they might be exports that are sent off to some system that then goes into a dashboard or inputs into, at Bauer we built a recommendation system. So all of the behavioral signals that went into that recommendations engine came from the Snowplow data. So it depends on your use case. We also have done lots and lots of deep detailed analysis using tools like Tableau and the like to explore it. And if you model the data right and then stick a front end on it, you certainly can give your end users a front end, but it’s not going to be an open slather, here’s your dimensions, here’s your metrics, slice and dice at will environment like Google Analytics or Adobe Analytics might be. But then there are, you know, there are other advantages, I guess. But that’s an opportunity for someone. If someone wants to build a front end, we’d love to see it.

00:26:46.47 [Tim Wilson]: Okay, cool. But you can, it’s got a limited number of, you know, we talked about BigQuery a few episodes ago, and I don’t think it’s, is that a, somebody could replace whichever one of those little components it is to shove it into BigQuery, but it’s not part of the core? Or is it that would be a bad idea, or is it not there yet, or what?

00:27:14.22 [Simon Rumble]: Yeah, at the moment the whole stack is fairly tightly coupled with the Amazon platform, but there is actually some work underway to try and decouple that a little bit. And that has kind of two possible endings that I can see. One is self hosting. So self hosting is one of those options that is going to be really important. So there are organizations that just don’t trust the cloud. They are not willing to shove their stuff in the cloud. There’s also organizations that aren’t willing to have their data go offshore, so go international. And so Amazon has data centers in lots of places, but there are plenty of jurisdictions where there isn’t an Amazon data center. So, you know, if you’re in Malaysia, there’s no Amazon data center there. If you’re in New Zealand, the local Amazon data center is in Australia. So there’s lots of reasons why someone would want to self host. Then there’s just, you might have ended up storing lots and lots of personally identifiable information, or maybe you had your CTO stand up in a conference and big note himself by saying no data will ever be stored offshore by our company, and so you’ve got to deal with that. So that’s one of the options, and the other is platforms like Google BigQuery. The Google stack is ridiculously cheap. BigQuery’s come a long way in the last couple of years. Like a few years ago it had those Hadoopy kind of big data problems of, you would do a query and it would take about two minutes. These days it is much, much faster. So a port of the Snowplow ecosystem across to BigQuery is definitely on the cards. There’s been a few people talking about doing that. We might end up doing it ourselves, that’s Snowflake, if we find the right kind of use case, because it’s kind of an exciting platform and it’d be good to be able to mix and match those components at will as well. Got it. Michael, I feel like I’m dominating the question.

00:28:52.30 [Michael Helbling]: Well, you certainly are, but I love the conversation. So I’m just sort of soaking it all in. One thing I wanted to talk about, because I always liked as we have these kinds of conversations, make sure that it’s not just all about Tim. It’s about the listeners.

00:29:08.90 [Tim Wilson]: OK, I’m ready now. You can stop again. I’ll continue with my questions.

00:29:12.52 [Michael Helbling]: OK, there you go.

00:29:14.06 [Tim Wilson]: I don’t like the turn this was taken.

00:29:16.75 [Michael Helbling]: One of the things to think about is, hey, if I’m thinking about maybe an application for Snowplow, what kinds of things should I be thinking about? Maybe a checklist for getting started. Then there are concepts I was reading about in some of their documentation around their event structure and how it’s set up. They have an immutable log concept, in terms of this data is not going to change. We don’t want it to change over time, which makes a lot of sense. But there are things that do change about visitors over time and those kinds of things. And so they recommend merging that with another data set at the time you do analysis. So I want to hear from you on both topics. How do people get started? And then as things change, what has your experience been in terms of, okay, this person is switching segments or cohorts? You know, what does that look like for that visitor? We don’t want to change those values historically.

00:30:10.03 [Simon Rumble]: So I’m going to answer the second bit first. Okay, great. One of the core principles that the Snowplow guys kind of wanted when they first started, and what they found as a limitation in the existing web analytics tools, was that you should never throw away data, because data storage is dirt cheap these days. So back when Josh James and co were creating Omniture back in the 90s, data storage was really, really, really expensive. So they took a model of throwing lots of data away.

00:30:39.31 [Tim Wilson]: Nothing compared to what Webtrends was throwing away. Yeah, that’s right.

00:30:43.17 [Michael Helbling]: Hey, Webtrends was stored in a very nice, flat CSV.

00:30:46.79 [Tim Wilson]: With the table limits of would you like five thousand or ten thousand it depends you can set up for different things let’s make it in.

00:30:56.74 [Simon Rumble]: So they were all built in an era when data storage and data processing were expensive. A simple example: if you want to reclassify your user agent string to mean something new, because some new user agent has popped up but your web analytics tool hasn’t started recognizing it as such, you can’t, because the data’s gone. You know, that user agent string is gone. So an example of that is the Facebook in-app browser, which Google Analytics still doesn’t recognize as a distinct browser, and certainly doesn’t allocate to the right acquisition channel. So, you know, those are the kinds of things you can do when you store absolutely everything. So that’s kind of part of the thought process. You store everything, you keep it forever. That doesn’t mean that you can’t change your mind about how you analyze it later on. So you might have a model where my definition of a session or a visit is blah. Let’s say we’ll match the standard one, which is no more than 10 minutes of inactivity and no longer than 24 hours. Let’s say we want to do that. Suddenly you work out that there’s actually some edge cases in your application where people are seemingly not doing anything for 10 minutes, but actually they are doing something. And so your sessions are getting chopped off. Hey, we want to make it 30 minutes. Well, if you’re collecting things in Google Analytics or Adobe Analytics, you’re bang out of luck. Your historical data is now wrong. In Adobe Analytics, I don’t think you even get control over what your sessionization is. With the Snowplow stuff, you’ve got the raw data. So you can go back and re-crunch it. It’s not necessarily going to be an easy process, but you have that possibility. I did a lot of analysis around that Facebook in-app browser when I was in publishing land, for much the same kind of reason. There’s a whole bunch of weird bugs in the iOS Facebook app where it doesn’t pass through a referrer when it sends a user to your site. So we would have this massive bucket of direct traffic and people would go, what the hell’s that? And we’re like, oh, I don’t know. We had the raw data to be able to find out that it was Facebook.
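The re-crunching Simon describes can be sketched in a few lines. This is only an illustration, assuming you already have one user’s raw event timestamps in hand; the function name `sessionize` and its parameters are hypothetical and are not part of Snowplow.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, inactivity_minutes=30, max_hours=24):
    """Group one user's raw event timestamps into sessions.

    A new session starts when the gap since the previous event exceeds
    the inactivity timeout, or when the session would run past max_hours.
    """
    timeout = timedelta(minutes=inactivity_minutes)
    max_length = timedelta(hours=max_hours)
    sessions = []
    for ts in sorted(timestamps):
        if (sessions
                and ts - sessions[-1][-1] <= timeout
                and ts - sessions[-1][0] <= max_length):
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical raw events for one user, with a 12-minute pause in the middle.
start = datetime(2016, 8, 1, 9, 0)
events = [start + timedelta(minutes=m) for m in (0, 3, 15, 18)]

ten_minute_sessions = sessionize(events, inactivity_minutes=10)
thirty_minute_sessions = sessionize(events, inactivity_minutes=30)
print(len(ten_minute_sessions), len(thirty_minute_sessions))  # prints "2 1"
```

Because the raw timestamps are kept, changing the rule is just a matter of re-running the same function with a different timeout over the historical data.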

00:33:02.38 [Tim Wilson]: How was the raw data enabling you to figure that out?

00:33:06.18 [Simon Rumble]: Because you’ve got the full user agent string, you can see that it was the Facebook in-app browser. So the Facebook in-app browser has FBAN in the user agent string. And it’s the only user agent that has that.
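A minimal sketch of that check, assuming each raw hit carries its full user agent string and its referrer. The channel labels and the sample user agent strings below are made up for illustration; only the `FBAN` marker comes from the conversation.

```python
def classify_hit(user_agent, referrer):
    """Reassign referrer-less hits that came from the Facebook in-app browser."""
    if referrer:
        return "referral"
    if "FBAN" in user_agent:
        # No referrer was passed, but the user agent gives the source away.
        return "facebook_in_app"
    return "direct"


# Shortened, illustrative user agent strings (not real captured data).
facebook_ua = "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3 like Mac OS X) [FBAN/FBIOS;FBAV/59.0]"
safari_ua = "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3 like Mac OS X) Version/9.0 Safari/601.1"

print(classify_hit(facebook_ua, ""))  # facebook_in_app
print(classify_hit(safari_ua, ""))    # direct
```

Run over the stored raw hits, a rule like this would split the unexplained bucket of direct traffic into true direct visits and Facebook in-app visits.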

00:33:21.18 [Michael Helbling]: Sometimes log file analysis is very useful.

00:33:23.95 [Tim Wilson]: I think I might have a use case.

00:33:25.35 [Michael Helbling]: I feel like I just jumped back 12 years log files versus JavaScript deployment.

00:33:30.28 [Tim Wilson]: But where is it when... because on the one hand, and this is kind of me not necessarily understanding exactly the split between the collection and the modeling and the storage, what decisions are you... You say you have the raw data, but it sounded like there’s modeling and transformation happening before it goes to storage. So is there not data loss in that? It goes tracker, collector, you said, then modeling, and then storage. Does the modeling not have some potential loss of fidelity, or no?

00:33:56.79 [Simon Rumble]: So in between, this is tracker, collector, and then ETL, and then out into storage, and then modeling on storage. And you keep the raw data at every step of the way. So that data that came raw out of the collector, you don’t get rid of that. That sticks around because it’s ridiculously cheap to keep it and you might as well.

00:34:22.09 [Tim Wilson]: Is it literally just shoved into a flat table with the raw hits?

00:34:25.60 [Simon Rumble]: I mean, it’s just files in S3, in fact. So it’s not a database. It’s much cheaper than that. It’s just S3. Okay. And, you know, you just keep that around forever, which means you can go back to it at any point and change your assumptions if there was something that you did. But the ETL process itself is actually pretty simplistic. Really, all it’s doing is there are kind of two lookups that it does. One is an IP address lookup using an IP-geo mapping database. And the other is looking up the user agent and classifying the user agent. That’s pretty much all that happens, apart from just, you know, doing format transformations.

00:34:59.23 [Tim Wilson]: It’s really just, yeah, that’s kind of enriching. It’s adding descriptors. I mean, super simplistically, it sounds like it’s adding a couple of columns to every hit to say, what is the plainer, better name for this user agent, and the geo information.
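As a rough sketch of that enrichment step: the raw hit keeps everything the collector saw, and the ETL adds a couple of looked-up columns alongside the name-value pairs from the beacon. The lookup table, the classifier, the field names, and the beacon parameters here are all hypothetical stand-ins, not Snowplow’s actual schema or enrichment code.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical stand-in for a real IP-geo mapping database.
GEO_BY_IP = {"203.0.113.7": "AU"}


def classify_user_agent(user_agent):
    """Toy user agent classifier; a real one uses a maintained database."""
    if "FBAN" in user_agent:
        return "Facebook in-app browser"
    if "Firefox" in user_agent:
        return "Firefox"
    return "Unknown"


def enrich(raw_hit):
    """Turn one raw collector record into a flat row with extra columns."""
    # Pull the name-value pairs out of the beacon's query string.
    row = dict(parse_qsl(urlsplit(raw_hit["request"]).query))
    # Keep the raw values, then add the two looked-up descriptors.
    row["ip"] = raw_hit["ip"]
    row["user_agent"] = raw_hit["user_agent"]
    row["geo_country"] = GEO_BY_IP.get(raw_hit["ip"], "unknown")
    row["browser_family"] = classify_user_agent(raw_hit["user_agent"])
    return row


raw_hit = {
    "ip": "203.0.113.7",
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0",
    "request": "/i?event=page_view&page=%2Fnews%2Fstory",
}
row = enrich(raw_hit)
print(row["geo_country"], row["browser_family"], row["page"])
```

Since the raw user agent and IP address stay in the row, a better classifier or geo database can be applied to old data later.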

00:35:16.81 [Simon Rumble]: That’s right. And otherwise, it’s just basically pulling out the query string parameters from the beacon, so the name-value pairs that are sent through, and shoving them into the appropriate fields. In some cases, it might do some string operations on those things so that it splits things apart. But really, that’s all it’s doing. It’s really simplistic. That means you can do all sorts of stuff. So I can give an example of something I’ve done there that was really hard to model, certainly in the Google Analytics environment. Publishers of entertainment websites get a lot of traffic from galleries. And galleries are generally a kind of Ajax-y type thing where the page doesn’t load on every slide. Instead you swipe through it and just the image changes. Now, if you did a standard Google Analytics install, you would only record one page view for that whole gallery. Instead, what you want to do is record that a gallery event happened. But Google doesn’t really treat any event other than page view as a first-class citizen in its UI. So you kind of end up recording it as a page view, which is kind of crappy. So then when someone comes to you and says, what’s the proportion of our page views that come from galleries, the answer is, I don’t know, sorry. You could go and also record another event as a custom event, but then you’re getting perilously close to your limits, or you’re increasing your costs if you’re on Premium. So there’s kind of no nice way to fix that. So in Snowplow implementations, when I do them, I model three different events. One is a page view, and that page view is the traditional concept of a page view, as in a page loads, the whole thing loads. That’s a page view. And then we have events for other things. So if you open a gallery, we might record that you opened a gallery. So that means a page view fires, a gallery open fires, and also the first slide shows, so we also say a gallery view happens. So there’s kind of three events to model there. That just doesn’t really map very well to Google Analytics because everyone wants to talk about page views.
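To make the three-event model concrete, here is a small sketch of how the gallery question becomes answerable once the events are recorded separately. The event names (`page_view`, `gallery_open`, `gallery_view`) and the way the share is defined are assumptions for illustration, not names taken from Snowplow or from Simon’s implementations.

```python
from collections import Counter

# Hypothetical event stream: one gallery visit with three slides,
# followed by an ordinary article page.
events = [
    {"event": "page_view", "page": "/gallery/red-carpet"},
    {"event": "gallery_open", "page": "/gallery/red-carpet"},
    {"event": "gallery_view", "page": "/gallery/red-carpet", "slide": 1},
    {"event": "gallery_view", "page": "/gallery/red-carpet", "slide": 2},
    {"event": "gallery_view", "page": "/gallery/red-carpet", "slide": 3},
    {"event": "page_view", "page": "/news/story"},
]

counts = Counter(e["event"] for e in events)

# One possible definition: gallery slide views as a share of all views,
# where "all views" means page views plus gallery slide views.
all_views = counts["page_view"] + counts["gallery_view"]
gallery_share = counts["gallery_view"] / all_views

print(counts["page_view"], counts["gallery_open"], counts["gallery_view"])  # 2 1 3
print(gallery_share)  # 0.6
```

Because page views and gallery views are never mixed into one counter, either definition of "a view" can be reported without re-instrumenting the site.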

00:37:17.28 [Tim Wilson]: So it maps, it does actually map, but it maps better to Adobe where you’ve got, you can set up different events and different, okay.

00:37:24.70 [Simon Rumble]: And they’re not second-class citizens in the UI. Everything’s a second-class citizen in the Adobe UI, so everything’s a custom report, basically.

00:37:33.60 [Tim Wilson]: So are we going to get the, we haven’t gotten to the, what’s the checklist for somebody starting?

00:37:38.28 [Simon Rumble]: Yeah. So if you want to start using Snowplow, the first step is to go to the Snowplow website and start reading some of the documentation. You do kind of need some background. You probably would need to know how to set up Amazon stuff. And there’s a few decisions that you need to make. The choices of tracker and collector are pretty straightforward. There’s a couple of choices at each step, but there is one really critical architectural choice that you have to make. That is whether you want to go for batch mode or whether you want to go for real time. So there’s kind of two processing pipelines.

00:38:13.00 [Tim Wilson]: Real time. It’s gotta be real time. We gotta have real time. We are. Yeah. Real time. Why?

00:38:17.62 [Michael Helbling]: Boom, boom, boom.

00:38:18.59 [Tim Wilson]: What are you talking about?

00:38:20.01 [Michael Helbling]: Yeah, gotta have it.

00:38:20.97 [Simon Rumble]: See, that goes back to one of the use cases as well, though, which is that running Snowplow on really large data sets can be ridiculously cheap, you know, really, really, really cheap. At Bauer, where I worked, our costs were in the couple of thousand a year for collection and processing. The big cost was actually the database to store it all in. But if you don’t need that, so for example, the audited media guys, they run in batch mode. They can handle ridiculous volumes and not cost very much because their output is very tightly constrained. They don’t actually need to do much deep analysis because it’s embedded in product. So once that processing is done to output, you know, the few columns that they need, well, that’s it, they’re done. And they can shut down the processing cluster and wait for another day and then run it again tomorrow. So batch mode definitely has its uses, and certainly if you’ve got really high volumes, it’s a lot cheaper. The batch mode is also a bit easier to set up for you.

00:39:19.21 [Tim Wilson]: It’s batch mode, but is that still batches running hourly? You specify the frequency of the batch?

00:39:26.64 [Simon Rumble]: That’s right. That’s right. And it’s hard to go much shorter than about an hour because it uses the Amazon Hadoop, Elastic MapReduce they call it, which has about a 12-minute startup time to start up the cluster. So you can’t go much less than an hour because sometimes they’ll stomp on each other. Got it. And so the real-time mechanism uses Kinesis, which is Amazon’s event streaming platform. Basically, you can stuff events in at one end, and they’ll persist and be available for querying at the other end by multiple consumers for 24 hours. The reason that’s kind of cool is that it’s pretty fast, it’s really real-time, but it also gets expensive quite quickly as well.

00:40:09.96 [Michael Helbling]: Oh, that’s cool.

00:40:11.10 [Simon Rumble]: Yeah. So essentially, to get started, you just need to start reading the documentation and follow some of the steps. Sometimes the documentation can be a bit crufty and gets out of sync with the code. This is a problem not just with open source products, but with everything. You know, you’ll find lots and lots of old Google Analytics documentation and forum stuff that doesn’t make sense anymore. If you have those problems, you’re welcome to ping me on Measure Slack. And there’s a very active community on Discourse that the Snowplow guys run if you’ve got questions about getting up and running. Or you can just pay me for it.

00:40:50.27 [Michael Helbling]: That’s what people should do.

00:40:52.35 [Simon Rumble]: So there’s one thing that we haven’t kind of talked about that I think might be important and that’s some of those use cases around why you might use an open source analytics tool.

00:41:03.92 [Tim Wilson]: We’re wrapping up.

00:41:05.54 [Michael Helbling]: I don’t think that’s important.

00:41:10.69 [Simon Rumble]: Do you want me to just crap on about that or do you want to ask me a question?

00:41:14.42 [Michael Helbling]: Hey, you know, Simon, another thing I think it’d be really great for people to hear about is sort of, what are some of the use cases that you might leverage an open source analytics platform for?

00:41:24.62 [Simon Rumble]: Yeah, well, so there’s a few reasons. There’s if your data just really doesn’t map to the web analytics space, if you’ve got things that are hard to shoehorn into that. So an example of that is, you know, video metrics and stream metrics. They really don’t fit very well in the web analytics space. I know Adobe is on their third iteration of trying to tackle that problem, and Google have just wiped their hands of it.

00:41:48.17 [Tim Wilson]: Adobe Heartbeat. Yeah. I just got PTSD from thinking about reading Adobe’s documentation on the number of eVars and events needed to follow one of those iterations. Yeah, that’s right. Thanks for that. I’ll be curled up in the fetal position while you guys wrap this up. What else do you get?

00:42:04.40 [Simon Rumble]: There’s just things that don’t model into the web analytics space. If you want to embed it into a product experience, you can create a custom product or custom service wrapped entirely around this open source product, and it’s really easy to do. And there are people who have built analytical pipelines that do, you know, stuff in particular verticals that just wouldn’t work anywhere else. So that’s a reasonable way to do it. There’s also a lot of places where cost can become a really big issue. And this is actually a bit of a bugbear for me. One of the reasons that some of the more expensive analytics tools are less popular with me is that they suck all the money out of the ecosystem. So I think that if an organization is going and spending a chunk of money on digital analytics, they probably should be spending a third on the tools and at least two thirds on people to actually run the thing. But in a lot of cases, there are people out there who are spending well over two thirds on the tool and therefore they only have money for a junior analyst. And they wonder why they can’t get anything useful out of it. So by flipping that and hiring smart people instead, there are organizations out there who are doing quite clever stuff. But also if you’ve got ridiculously high volumes, and there’s a few Snowplow installations in spaces like online gaming that have really, really high event volumes, Snowplow’s the only way they can do that kind of thing affordably. So those are kind of the really apparent use cases, but there’s also just, if you live and breathe this stuff and you love it and you love to get down into the guts of it, having a table that just has every event in it is a really empowering kind of thing. Power users really like a platform like Snowplow. But also, I’ve had quite a few experiences in my career where a company’s got a business intelligence or data warehouse team, and they say to the web analytics guys, hey, we want to get your data. Can you give us your data? And some years ago, I used to fob them off by sending them an Adobe Analytics dump, which came traditionally with no headers, which was really helpful of them, and had something like, I think, about 800 columns, with some of the columns repeated in weird ways. And of course, no headers, so you had no idea what they were. I’d send that to them, and I’d send them one day’s worth and go, there you go, see what you can do with that, and then I’ll turn on a feed for you. And I’d never hear from them again. Well, with a tool like Snowplow, you actually can give them a database. And for a lot of use cases, that can be really kind of useful. So, you know, if you’re in a telecommunications company, and you’ve got a bunch of guys who are building a churn risk model inside SAS, having your web analytics data in there is really useful. You know, someone who’s viewed more than three help things and looked at the call center page is a high churn risk. Those kinds of signals are actually really valuable to have. So while, you know, I used to kind of try and get in the way of those guys because I knew they did bad things with it, there are use cases where it makes sense for all of your web analytics stuff to just turn up in a database in some format that your data miners can use. So those are kind of some of the use cases. For me, a lot of the cases are embedding into products, so embedding into other products. That’s kind of exciting for me.
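The churn signal Simon mentions is easy to derive once every event sits in a table. The sketch below is an assumption-heavy illustration: the page paths, the `user_id` field, and the function name are invented, and a real churn model would treat this flag as just one input among many.

```python
from collections import defaultdict


def churn_risk_users(events, help_threshold=3):
    """Flag users who viewed more than `help_threshold` help pages
    and also looked at the call center page."""
    help_views = defaultdict(int)
    saw_call_center = set()
    for event in events:
        if event["page"].startswith("/help/"):
            help_views[event["user_id"]] += 1
        elif event["page"] == "/contact/call-center":
            saw_call_center.add(event["user_id"])
    return sorted(u for u in saw_call_center if help_views[u] > help_threshold)


# Hypothetical event rows, as they might come out of an events table.
events = (
    [{"user_id": "a", "page": f"/help/topic-{i}"} for i in range(4)]
    + [{"user_id": "a", "page": "/contact/call-center"}]
    + [{"user_id": "b", "page": f"/help/topic-{i}"} for i in range(2)]
    + [{"user_id": "b", "page": "/contact/call-center"}]
    + [{"user_id": "c", "page": f"/help/topic-{i}"} for i in range(5)]
)

flagged = churn_risk_users(events)
print(flagged)  # ['a']
```

Only user `a` is flagged: `b` viewed too few help pages and `c` never reached the call center page.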

00:45:25.64 [Michael Helbling]: Well, no, there’s certainly, there’s a, there’s other companies that do that too, like Keen IO and things like that. So it’s definitely like a kind of open ground.

00:45:33.92 [Simon Rumble]: Yeah. So there’s kind of one more really critical one. And we touched on it before, which is if you decide that you need to own your data. So there’s all sorts of reasons you might do that. You might actually want to store personally identifiable information in your data set so that it’s all in one place rather than have to go through contortions and hashing and the like. You might accidentally have circumstances where that happens, and that’s something I’ve encountered a few times, where referrer strings contain, you know, nasty stuff, and web analytics vendors get really, really antsy. Or you might have had someone who’s said, no, we have to keep all our stuff in-house. And at the moment, if you can stretch in-house to mean in our Amazon account, then Snowplow will suit you. And in the near future, hopefully there’ll actually be on-premise options, which was one of the big selling points of tools like Urchin and Webtrends back in the day, the ability to make it on-premise.

00:46:25.83 [Tim Wilson]: So it’s like having log files stored in a computer that you have on, on premise. So yeah, we have, we have fully gone full circle back to one decade.

00:46:36.40 [Michael Helbling]: Your Webtrends data collection server. dcsMultiTrack.

00:46:42.33 [Tim Wilson]: This has been fantastically informative.

00:46:44.99 [Michael Helbling]: Yeah, it’s actually been great. And actually, Simon, it’s just great to talk to you about this stuff. And you can just tell by the things you’re mentioning, you’ve certainly seen quite a bit in the analytics community. So it’s been great to kind of share that and laugh a little bit about the good old days.

00:47:02.39 [Tim Wilson]: Check off another continent on our.

00:47:04.51 [Michael Helbling]: It’s right on our world tour. Hopefully we’ll be popular in Australia after this. So one thing we love to do on the show is called Last Call. We love to go around the horn and everybody kind of talk about something that interests them right now or something that’s coming up they’re excited about. So I don’t know, Simon, if you want to kick us off, you have a last call.

00:47:24.43 [Simon Rumble]: So I’ve got a couple of events that I’m involved in. Web Analytics Wednesday runs every month in Sydney and Melbourne. I work on the Sydney one. Our next one will be the 14th of September. It’s the second Wednesday of the month. The Melbourne one should be on the 21st of September. But much more important is MeasureCamp Sydney, which is coming up on the 10th of September, which is really exciting. It’s the first time that MeasureCamp’s come to Sydney. There was a MeasureCamp in Melbourne earlier this year and it was really successful. If you’re not familiar with the MeasureCamp concept, it’s an unconference, which means that instead of paying thousands of dollars to hear some sales pitches, there are no defined sessions. Everyone is a participant and everyone is encouraged to run a session, and we turn up on the day and we put up what sessions we want to run and you can choose which sessions you want to go to. It’s also held on a Saturday, which means that those people who go solely to get a day off work don’t come.

00:48:20.28 [Tim Wilson]: I hear so many good things about them. Yeah, likewise.

00:48:24.00 [Michael Helbling]: I don’t know if they’ve cracked the United States yet or not, but I’ve heard a lot of good things.

00:48:27.81 [Tim Wilson]: I think there’s one coming up in New York. Outstanding.

00:48:30.75 [Michael Helbling]: Well, Tim, what have you got?

00:48:32.26 [Tim Wilson]: Well, so I will do a quick little log roll because, as this comes out, I will be about to dive into my fall conference mania, being one of those people who shows up to, I guess, do sales pitches. Hopefully I’m not doing sales pitches, but I will be at the Copenhagen Web Analytics Wednesday on September 7th. I’ll be at the Boston DA Symposium in mid-September. I’ll be at a senior care marketing summit in Chicago later in September. I’ll be at the Love’s Data Conference in Sydney in late October, followed two days later by the Love’s Data Conference in Melbourne. And then I will be at eMetrics Berlin in early November. So if you go to the About Us page on analyticsdemystified.com, we’ve got a speaking list. So if you’re in any of those towns, whether or not you’re coming to the event, I would happily try to grab a beer with you because I’m going to be exhausted. My actual kind of last call is just for fun and is limited to, fortunately this is R-specific, but it’s still kind of amusing. And this is credit to active Slack member Pavel Kapuchinsky: the XKCD package for R is absolutely hilarious and I cannot wait to use it. So that’s one of those cases where there are visualizations you really can’t do in Excel. It is damn difficult to make an Excel chart look like XKCD. And that is just fucking awesome. So I’m looking forward to using that.

00:50:04.61 [Michael Helbling]: That’s one R package I actually know about.

00:50:08.35 [Tim Wilson]: It’s been around probably for 15 years, but, you know, just like movie references, I’m late to the game. Well, you’ll get there, Tim, if you keep trying. Watch The Castle.

00:50:18.87 [Michael Helbling]: That’s right. Oh, yeah, what a classic movie.

00:50:21.41 [Tim Wilson]: What’s your last call?

00:50:23.18 [Michael Helbling]: All right. So my last call: I’ve just heard about this, and I was told I can share it, so I’m gonna, with everybody. So Adobe, as everyone knows, is launching their device co-op, and they’ve launched a page showing what devices you have registered in the co-op, so what devices they know you’re connected on, and which companies are connected to it so far, at least in the US and Canada on the page I’m seeing. So the URL is cross-device-privacy.adobe.com. They just had to work that privacy in there to make everybody feel better. No, I don’t know. But anyways, really cool little page, one you need to check out in any case. I’ve been trying to figure out ways that I can get my phone and my laptop and my home computer connected so I can start seeing myself connected to multiple devices, just for giggles. But you can also use that page to disconnect your devices from the co-op as well, should you want to, which I also think is kind of a nice feature for folks who maybe don’t want their devices connected. Little thing there. All right. Well, hey, as you’ve been listening, you’ve probably thought, these are all great tips, Simon Rumble, but what about question, question, question? And so that’s where you, the listener, can come in. And the great news is Simon’s very active on the Measure Slack. He’s an active member of the analytics community. So he is easy to get a hold of and ask a lot of these great questions. He and the Snowflake Analytics team are probably all around and available. So if you have questions about the episode, Tim and I are completely useless, but we will point you to Simon, who, as you’ve heard, is a wealth of information and awesome insight on this tool and similar tools of the kind. So thanks again, Simon, for being on the show. It’s been really informative. It’s helped, you know, I think, solidify in probably a lot of people’s minds where Snowplow might fit within their overall analytics scheme. There’s certainly some very interesting and desirable things about Snowplow, both from the way it collects data to its, you know, really low cost structure that could really potentially help a lot of companies. So I think that’s really great. And thank you for being on the show.

00:52:36.56 [Simon Rumble]: Thanks for having me and thanks for putting up with my loud planes going overhead.

00:52:41.99 [Michael Helbling]: No one hears that because of awesome post-editing production values that we have, I hope. Anyway, so obviously for my co-host Tim Wilson, everybody keep analyzing.

00:53:01.55 [Announcer]: Thanks for listening, and don’t forget to join the conversation on Facebook, Twitter, or measure Slack groups. We welcome your comments and questions. Facebook.com forward slash analytics hour, or at analytics hour on Twitter.

00:53:16.75 [Charles Barkley]: So smart guys want to fit in. So they made up a term called analytics. Analytics don’t work.

00:53:26.29 [Michael Helbling]: Of course, I’m sure when we put the outtakes together, Tim will not put that part in. of editing. It’s not going to be that big of a deal. So have you ever seen an Australian movie, The Castle?

00:53:47.58 [Simon Rumble]: So next time you fly into Sydney Tim, you can look down and you’ll see my house.

00:53:53.62 [Tim Wilson]: Oh, I guess I should have been unmuted when I was with my witty response today.

00:53:59.47 [Michael Helbling]: Early Eric Bana, Tim, if you’re interested.

00:54:02.54 [Tim Wilson]: Oh, I’m on Wikipedia right now.

00:54:07.63 [Michael Helbling]: We’re not dealing with that right now at all.

00:54:12.17 [Tim Wilson]: God damn it, you guys are strong now. How’s the serenity? Yes, we are recording. Oh, shit. Rock, flag, and open source.
