#043: Open Source Analytics with Simon Rumble

Somebody wants to overthink their analytics tools? Tell ’em they’re dreamin’! We wanted to talk about open source and event analytics, and Snowplow sits right at that intersection. Our guest, Simon Rumble, is the co-founder of Snowflake Analytics and one of the longest-standing users of Snowplow. We wrap up the show with all the places you can find Simon and Tim in the next few months. Fun fact: you will also learn in this episode that conversion funnels go down the opposite direction in Australia.


Episode Transcript

The following is a straight-up machine translation. It has not been human-reviewed or human-corrected. We apologize on behalf of the machines for any text that winds up being incorrect, nonsensical, or offensive. We have asked the machine to do better, but it simply responds with, “I’m sorry, Dave. I’m afraid I can’t do that.”

[00:00:27] Hi everyone. Welcome to the Digital Analytics Power Hour. This is Episode 43 in our continuing series on topics that Tim Wilson is using the podcast to find out more about. Welcome, Tim. By the way, we’re talking this episode about open source analytics. A few of these tools exist, but one that we see pop up on our collective radar more and more is Snowplow Analytics. So we’re very interested in it. We want to learn more, and since Tim and I don’t use the new and cool tools very often, except for Tim and R, which is, you know, the new love of his life, we needed a guest to help us along the way. We’re very fortunate to have what we’ll call a very special snowflake with us today, Simon Rumble. He’s the co-founder of Snowflake Analytics, a consultancy built around the Snowplow product. In his previous role he was the head of analytics at Bauer Media Group. He’s very active in the Australian analytics community. He’s also very active on the Measure Slack, and we’re very pleased to have him on the show. Welcome, Simon. Thank you. Oh wait, since your last name is Rumble, I wanted to do so.

[00:01:45] SKINNER You talk about Snowplow. If I say the whole thing, that might violate, like, that guy’s trademark, so I just went with the first part. We clipped that out because our legal department advised me to not go there.

[00:02:04] So I think a good starting point would be, Simon, if you wanted to walk people through what Snowplow is and what purpose it serves in the broader constellation of analytics tools. Let’s start with that and then we’ll go from there.

[00:02:20] Sure. So Snowplow’s an open source web analytics platform, or at least that’s kind of how it started back in 2012. It started out as these two really smart guys in London working for an ad server platform, and part of that task was to try and integrate the ad server stuff with the traditional web analytics tools, Google Analytics, Omniture and the like, that everyone else was using. What they discovered was that the way the web analytics tools were doing this stuff was really old-fashioned and really kind of clunky, and it was really hard to integrate things together. So they decided, well, let’s write our own, as you do, because, you know, it seems like a fairly simple, straightforward problem, and five years later you find yourself still working on it. They put it up on GitHub. They announced it, and I discovered it a couple of weeks later, set it up, got it running and thought, wow, this is really awesome. They’ve kind of built a business around it, and they’ve really started to get some traction around various parts of the world. They started off building this entirely in the Amazon ecosystem and have kind of got broader beyond that. But the thing that was really important about it is that they were very clever in the way they separated out the different parts of what it takes to collect and process web analytics data, and built well-identified and well-defined protocols between each of those stages.

[00:03:49] So if we kind of go through it, you’ve got the JavaScript that sits in the browser, which they call the tracker, which sends stuff to a collector, which sends stuff into something that stores that data. The collector is the... the collector is when it switches over to the server.

[00:04:03] Like, it’s basically the tracker that’s just like your Google Analytics or Adobe JavaScript. When it’s making that call, they’re literally just saying, let’s isolate the capture of that call, that collection.

[00:04:16] That’s right. Yes. So the tracker is your analytics.js or your s_code, and then the collector is the server-side stuff. So that’s the thing that collects the image pixel, and that goes into some kind of storage mechanism. Then it gets processed, or ETL’d, and then it goes into some storage system which you can model and then operate on. And all of those components are interchangeable. I got involved fairly early on because the way that they were dealing with cookies was very much the Google first-party-cookie-set-inside-the-browser approach, and I needed to track sites that spanned multiple domains. So I set up a third-party cookie collector and just slotted it in, and it fit right into the ecosystem, which kind of blew my mind because it was a really easy way of getting into this open source product. And since then lots and lots of other people have kind of jumped on and started submitting things.
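
To make the idea of interchangeable stages concrete, here is a minimal Python sketch, assuming a made-up query-string payload as the contract between stages. The function names (`tracker`, `basic_collector`, `cookie_collector`, `enrich`, `run_pipeline`) and the payload fields are hypothetical and are not Snowplow’s actual components or protocol. The point is only that one collector can be swapped for another without touching the tracker or the processing step.

```python
from typing import Callable, Dict, List
from urllib.parse import parse_qsl, urlencode

# A collector is anything that accepts a raw payload string.
Collector = Callable[[str], None]


def tracker(page_url: str) -> str:
    """Browser-side stage: encode one page view as a query-string payload."""
    return urlencode({"e": "pv", "url": page_url})


def basic_collector(raw_log: List[str]) -> Collector:
    """Server-side stage: append each raw payload to a log, unchanged."""
    def collect(payload: str) -> None:
        raw_log.append(payload)
    return collect


def cookie_collector(raw_log: List[str], visitor_id: str) -> Collector:
    """Drop-in alternative: also stamps a visitor ID held by the collector."""
    def collect(payload: str) -> None:
        raw_log.append(payload + "&" + urlencode({"vid": visitor_id}))
    return collect


def enrich(payload: str) -> Dict[str, str]:
    """Processing stage: parse a raw payload into a structured event."""
    return dict(parse_qsl(payload))


def run_pipeline(
    pages: List[str], collect: Collector, raw_log: List[str]
) -> List[Dict[str, str]]:
    """Tracker -> collector -> raw log -> processing -> structured events."""
    for page in pages:
        collect(tracker(page))
    return [enrich(raw) for raw in raw_log]
```

Calling `run_pipeline` with `basic_collector(log)` or with `cookie_collector(log, "abc123")` exercises the same tracker and the same `enrich` step; only the collector differs.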

[00:05:10] So when you put in that little piece, not only did you build that and slot it in, but then you also added it? I’m not fully versed in the open source kind of checking in and checking stuff out. But does that mean you also then contributed it back, so that now others can use the third-party collector?

[00:05:27] Yeah, that’s how it works with open source. Essentially everything is put under a license that allows other people to use whatever it is that you do. My collector was actually a standalone component, so it didn’t become part of the standard Snowplow stack. It is a component that could be kept separately, and for that matter, if I had decided to license it as a commercial product I could have done that. I could’ve put it under different terms, whatever I liked, because it just talks using the same protocols as all the rest of the stack, so it can slot into that whole environment. But yeah, the way open source works is generally you’ll post stuff up and submit back your contributions under the same license. That means it can be reshared, and there are various open source licenses available that put different kinds of conditions on what you can and can’t do.

[00:06:19] So of those different chunks,

[00:06:21] I would assume that some of those are lighter, relatively speaking, thinking in terms of the lines of code required. The tracker can have stuff added to it and can grow. Collection seems like it’s kind of dumb and simple. The modeling, to me, seems like the sort of thing where you really are having to make some fundamental decisions about how you’re stitching stuff together. Are they, in a sense, each kind of independent stacks of code? And is it, oh, this is the one that’s the real beast? Or is it chunked in a way that, when you break it down, you really have put an equal level of complexity and weight and architectural decisions into each one of those buckets? Does that question make sense?

[00:07:09] Yeah. So different bits have different levels of complexity, but then there’s different versions and flavors in each stage as well. So on the tracker side, the JavaScript, you know, that’s kind of evolved over time. If you think about the stuff that’s going into Google Analytics in this latest round of changes, their tracking code is getting more and more complex, and the Snowplow stuff has started doing some of that as well. But then there’s also apps. So you can do Android and iOS and Microsoft, and they also have libraries for a whole bunch of embedded devices and server-side stuff. So there’s a whole bunch of different things there. But at the end of the day they’re not doing a hell of a lot. Some of those do interesting things like store up beacons to send later, which is similar to what you do in embedded applications and other platforms. The collector is pretty simple, yeah. That is a pretty simple case. My initial collector was something like 50 lines of code written in Node.js, just because it really wasn’t doing very much. So that was pretty simple. The way that the whole Snowplow platform has evolved, it has morphed into a more generic events framework. So it’s not really just focused on web analytics anymore. You can start shoving in, you know, just about anything. And that means you can put in things like Internet of Things devices, really complex stuff coming out of apps, recording just about anything you like, server-to-server kind of communications. Pretty much anything.
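
For a sense of how little a collector has to do, here is a rough Python sketch of the core of a pixel collector that keeps its own visitor cookie. It is not Simon’s Node.js code and not Snowplow’s collector; the function name `handle_pixel_request`, the cookie name `vid` and the log format are all assumptions made for illustration. Wiring it into an actual HTTP server is left out.

```python
import base64
import uuid
from http.cookies import SimpleCookie
from typing import Dict, List, Optional, Tuple

# A 1x1 transparent GIF, returned for every tracking request.
PIXEL = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def handle_pixel_request(
    query_string: str,
    cookie_header: Optional[str],
    log: List[Dict[str, str]],
) -> Tuple[int, Dict[str, str], bytes]:
    """Log one beacon and return (status, response headers, body).

    The visitor ID lives in a cookie on the collector's own domain, so the
    same ID comes back no matter which site embedded the pixel.
    """
    cookies = SimpleCookie(cookie_header or "")
    if "vid" in cookies:
        visitor_id = cookies["vid"].value  # returning visitor
    else:
        visitor_id = str(uuid.uuid4())  # first time this browser is seen

    # Store the raw payload untouched; parsing happens in a later stage.
    log.append({"vid": visitor_id, "payload": query_string})

    headers = {
        "Content-Type": "image/gif",
        "Set-Cookie": f"vid={visitor_id}; Max-Age=31536000; Path=/",
    }
    return 200, headers, PIXEL
```

The first request from a browser arrives with no cookie and gets a fresh ID; later requests that send `vid=...` back are logged under the same ID.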

[00:08:37] So the place where a lot of the work has happened is in that transformation stage, and then the modeling stage as well. Turning different kinds of event models into the right shape has been where the action is. That means there’s a whole language for embedding arbitrary event types, called contexts, custom contexts. Basically you can send any little piece of JSON, any kind of object that models what your event actually is. That event might be someone loaded a web page or someone clicked on an exit link, in the web analytics kind of space, but it could equally be someone just made a purchase at the point of sale in the store, and here’s the loyalty card that they swiped with the transaction. Anything can be modeled in that. And it all ends up in this one kind of unified event stream, which is kind of exciting.
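
As an illustration of an event carrying its own description, here is a small Python sketch of the in-store purchase example. The `schema` plus `data` wrapper and the `iglu:` style of schema reference follow the self-describing JSON convention described in Snowplow’s documentation, but treat the exact shape as an assumption. The vendor `com.example`, the schema names and every field are invented for this example.

```python
import json
from typing import Any, Dict, List


def self_describing(schema: str, data: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a payload with a pointer to the schema that describes it."""
    return {"schema": schema, "data": data}


def build_event(event: Dict[str, Any], contexts: List[Dict[str, Any]]) -> str:
    """Serialize one event plus any attached contexts as a JSON string."""
    return json.dumps({"event": event, "contexts": contexts}, sort_keys=True)


# Hypothetical schemas: a point-of-sale purchase with a loyalty card attached.
purchase = self_describing(
    "iglu:com.example/pos_purchase/jsonschema/1-0-0",
    {"store_id": "syd-01", "total": 42.5, "currency": "AUD"},
)
loyalty_card = self_describing(
    "iglu:com.example/loyalty_card/jsonschema/1-0-0",
    {"card_number": "0000-1111"},
)

payload = build_event(purchase, [loyalty_card])
```

A web page view and this purchase would differ only in which schema they point at, which is what lets both land in the same event stream.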

[00:09:31] So how was... I mean, you were kind of their external user number one. When you look at the learning and adoption curve, presumably at that point there was not much of a community.

[00:09:41] There wasn’t a whole lot of “hey, we’ve headed down some rabbit holes that don’t pan out.”

[00:09:46] Well, I guess one of the benefits of using kind of a mainstream, been-around-forever tool,

[00:09:52] Google Analytics or Adobe Analytics, is, you know, that a lot of the thoughts and decisions about how you structure this stuff and what makes sense have already been made for you. The downside is that the way you structure this and how you think about it has already been decided for you. So it seems like you wouldn’t want somebody who’s never done web analytics but is really sharp, coming straight out of school, diving in and trying to update or use or model something here, because they might just not know enough. I guess it seems like you have to have kind of an architectural mind. Or is that not the case? Has it been built such that you can still be up and running with a mainstream task that somebody else has figured out, just web analytics? And then are you being drawn to it because you’ve got some oddball, quirky thing that you have to deal with, and that’s why you have the aha, oh wait, something like Snowplow is exactly what I’m looking for? Have you painted yourself into corners with it? Have you done things where you’re like, whoops, throw that out, let’s start over?

[00:10:56] One architectural component that was in the original release was that it uses Amazon Elastic MapReduce, which is their version of Hadoop that you can use on demand. The only way to do queries in the original version was using the Hive query language, which is a kind of SQL-ish query language for really, really large data sets. The upside of that is that no matter how much data you throw at it, so long as you throw more nodes at it, your query takes about the same amount of time. The downside is that, at that point, the Hive queries took about seven or eight minutes.

[00:11:30] So you need patience.

[00:11:34] So that is one particular kind of architectural rabbit hole, and there have been a bunch of places and bits of development that haven’t necessarily panned out, but that’s the nature of open source. And, you know, there’s certainly some quirky corners of Adobe Analytics and Google Analytics that you can talk about as well.

[00:11:55] For example, the semicolon-delimited category of everything. The fact that everything is a string in Adobe Analytics. Oh, the products string and the like.

[00:12:10] Every time I teach someone how to do a product string, I have to explain to them that the first semicolon delimits the category, which is deprecated.

[00:12:21] There you go. There’s your street cred. This is not a guy who didn’t play around with Adobe before he went into this. I think we’re going to talk a little bit about open source more broadly.

[00:12:32] But Snowplow I’ve heard referenced.

[00:12:36] I had heard of it when you and I were chatting about it months ago, but I had barely heard of it. I don’t come across it with anybody that I’m working with. No one’s asking about it, and that may just be because I’m not working with clients that are sophisticated to the point that they’ve hit challenges where they’ve got to find something else. Is there data on what kind of adoption it has? I mean, 2012 was yesterday.

[00:12:59] Is it taking the world by storm? If you look at the BuiltWith graphs for Snowplow rollouts (BuiltWith is a service that goes out and scans the web and finds basically what JavaScript beacons are in the pages), and if you look at that for Snowplow, it’s got a bit of a hockey stick. That could be somewhat misleading. There’s a whole bunch of different use cases for why you would use a tool like Snowplow, and one of them is when you’ve got a white-label product that you want to push out and have lots and lots of different clients using it. You want to have an idea of what’s going on on those sites, but you also want to allow them to install their own analytics on those services as well. Using it for that has actually become a bit of a common thing. So there’s a few services that have hundreds or even thousands of sites that show up in that BuiltWith graph, but it’s not necessarily that the person who owns that site one day went, “Oh, I want to try out Snowplow.” It’s actually that the service they use embeds it as part of their product offering. An example of that is SmugMug, which is a photo sharing service. Every SmugMug professional site can have a custom domain pointed at it, and you’d never know it was a SmugMug site unless you looked at the source. But all of those have a Snowplow beacon firing on them. There’s a bunch of different places that are embedding it into their service.

[00:14:21] So it’s kind of really hard to gauge how much uptake there’s been. Although, I mean, running into those services that are embedded, it’s so common right now for them to say they just run Google Analytics, and often it’s one account, and you sort of think they probably haven’t got Premium. If their service takes off, all of a sudden they’re going to have this big line item, plus they’re possibly going to get into challenges with their clients: you know, you’re running Google Analytics on your site and you also have us embedded on your site, and that all seems a little dicey. So it seems like there could be the appeal of open source: the licensing and, in this case, the scalability. Presumably, even if you are running on a site that has Snowplow as its mainstream web analytics, you’ve got a collector and it doesn’t really care about your tracker. You can have two trackers. I’m assuming you don’t have to be as careful as if you’re doing Google Analytics and trying to run two versions of the code.

[00:15:23] That’s right. It’s perfectly supported to have more than one tracker running and sending stuff to different places. That’s totally doable.

[00:15:34] Well, we won’t go too deep into it, but when we were talking about this I had thrown out Piwik, and that’s because I know a local company in Columbus that has some kind of unique needs for data collection, and they’re dropping their stuff on other sites. Part of what they needed was greater, minute-level time granularity, something they couldn’t get from GA. So those guys are kind of like, “Piwik, awesome, it gives us exactly what we need.” Now, they have much more of a very specific set of things they’re tracking. But from that perspective, and that’s maybe two data points and I’m trying to extrapolate, open source has an appeal for widgetized or componentized or service-type things, where it’s like, our whole company is built because we do something that’s weird and unique, and we provide that experience for a bunch of other sites to use. That does seem like one natural way that you would use open source type tools: they have flexibility and scalability and they’re free, so you don’t have to tack on some licensing costs for everything.

[00:16:39] Yeah, that’s certainly one of the cases, where you want to do something a bit out of the ordinary and you want to have control over it, and that means control over costs, control over exactly what’s collected in what circumstances, and then what happens with it. So there’s a whole bunch of use cases around that, and it’s showing up in embedded products all over the place. One I encountered recently: there’s an organization in the U.S. called the Alliance for Audited Media, and they’re traditionally one of these circulation audit type organizations. They put a stamp on circulation figures for magazines and newspapers and say, yes, we’ve audited this and it genuinely does have that many copies in circulation. They’re attempting to launch a product around certifying people’s web analytics data, and they’ve actually partnered with their colleagues in Brazil, who’ve built their own platform for doing exactly that, with a collection and data processing pipeline built around Snowplow. So, you know, you kind of encounter these things when you’re just looking around: you find this embedded product, and they’ve got a little web analytics thing going on, and you have a look at it and you go, wow, that looks kind of familiar. And then you look at the code and you go, oh, that’s Snowplow.

[00:17:53] But it is interesting, not knowing where to look for it. Like, I go to your snowflake data analytics dot com and Ghostery doesn’t pick up Snowplow as one of the tools running on that site. So is it still kind of under the radar? Because you can customize it, is it harder for them to pick it up and say, we can identify this?

[00:18:14] Yeah, I’ve actually talked to the Ghostery guys a couple of times to try and get them to add Snowplow. I’ve tried to explain to them how they can look so that they can detect it, but they seem to be fairly strictly only looking at URL components. So they are not able to detect it, because the tracking code, the JavaScript that you load, runs on your own server. It can run on your own domain, it could be any domain, and it doesn’t look like, you know, analytics.js. It’s just a blob of JavaScript. So unless you know what you’re looking for, it’s not so easy to find it, programmatically at least. The BuiltWith guys, however, have worked it out, so you know it’s doable.

[00:18:54] It’s just that Ghostery hasn’t managed to do it. Well, my impression is always that Ghostery is really more interested in ad pixel type stuff, and the analytics kind of gets thrown in. With some crazy new type of retargeting thing they might be more receptive.

[00:19:09] Well, there’s nothing to say that someone hasn’t already built some kind of crazy advertising retargeting thing based on Snowplow.

[00:19:16] We just don’t know yet because it’s embedded, of course. Now I’ve got the developer tools open and I’m looking at snowflake analytics dot com and trying to figure out if I can see where the actual beacon is.

[00:19:31] I was looking for the same thing.

[00:19:33] I really should be redoing my site. I put it up in a day.

[00:19:39] Trust us, there is no danger. By Michael Helbling or Tim Wilson standards, you’re safe. It looks very elegant to me. I don’t know the fonts. Yeah, looks fantastic. I’m determining that it’s not so. Give us a hint. What are we looking for?

[00:19:53] So we’ve got all our pixels deployed through Google Tag Manager, so you’re not going to find anything in the source, and we push everything into the data layer. We do best practice implementations, which that website isn’t. But it comes close.

[00:20:07] So you have Snowplow integrated through Google Tag Manager in this case.

[00:20:13] That’s right. And that’s, when I work with any clients these days, table stakes. We start with: have you got tag management? If not, that’s phase one, because it’s just too painful.

[00:20:26] That was one of my big bugbears, and I’ve worked in a lot of publishers and media companies. One of my big bugbears with traditional web analytics, which still isn’t resolved, is the way that they record time spent. Time spent is just the difference between two page view events. That’s what they do for time spent. But that means the last page of the visit gets a big fat zero: no time spent on the page. Now, the trend with publishers is that most of their traffic comes from Facebook, and the Facebook app experience has this inexorable draw back to your news feed. So most publishers get one page view in each session from people coming from Facebook, unless it’s a gallery or something like that. Now, that doesn’t mean people aren’t reading the content. When I was at Bauer, one of their mastheads did really, really high-quality journalism, and they had six and seven thousand word articles, and Google Analytics will show you that the average time spent on those is really, really small. But when we started digging deeper we discovered that people were actually reading the whole damn thing. They were spending seven minutes reading the thing, even on a mobile phone, even in the Facebook browser, still reading and engaging with this content. When you’re in publishing, what you’re selling is people’s attention, and it’s a pretty important metric. Google Analytics, Omniture, all those products, they just throw it away. They just don’t have it. Their numbers, while accepted by the industry, are basically garbage.
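
The “big fat zero” follows directly from the arithmetic. Here is a minimal Python sketch of time spent computed as the gap between consecutive page views, with an invented function name and invented timestamps in seconds:

```python
from typing import Dict, List, Tuple


def time_on_page(page_views: List[Tuple[str, int]]) -> Dict[str, int]:
    """Time spent per page within one session, by timestamp difference.

    page_views is an ordered list of (url, timestamp_in_seconds). Each page
    is credited with the gap until the next page view. The last page has no
    next page view, so it is credited with zero.
    """
    spent: Dict[str, int] = {}
    for i, (url, ts) in enumerate(page_views):
        if i + 1 < len(page_views):
            gap = page_views[i + 1][1] - ts
        else:
            gap = 0  # nothing to subtract from: the time is simply unknown
        spent[url] = spent.get(url, 0) + gap
    return spent
```

A reader who arrives from Facebook, spends seven minutes on one long article and leaves produces a single page view, so this method reports zero seconds for the entire visit.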

[00:21:55] So one of the things that Snowplow does that I’m quite keen on is the page pings. That’s configurable: you set it up so that after 10 seconds it will send a beacon every five seconds, and then ping, ping, ping.
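
A rough Python sketch of how pings turn into a time estimate, using the 10-second and 5-second figures from the conversation as defaults. The function names are hypothetical, and the sketch assumes the reader stays active the whole time; it says nothing about how the real tracker decides whether a tab is active.

```python
from typing import List


def ping_times(
    active_seconds: int, first_ping: int = 10, interval: int = 5
) -> List[int]:
    """Seconds after page load at which a ping fires while the reader stays."""
    times = []
    t = first_ping
    while t <= active_seconds:
        times.append(t)
        t += interval
    return times


def engaged_seconds(
    ping_count: int, first_ping: int = 10, interval: int = 5
) -> int:
    """Lower-bound estimate of time on page from the number of pings received."""
    if ping_count <= 0:
        return 0  # the reader left before the first ping fired
    return first_ping + (ping_count - 1) * interval
```

With these settings, a reader who stays for seven minutes (420 seconds) sends 83 pings, and `engaged_seconds(83)` recovers the full 420 seconds even when the session has only one page view. The estimate can undercount by up to one interval, since time after the last ping is not seen.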

[00:22:06] And is that limited to when it’s an active tab? I mean, I think I’ve seen something that, say, Analytics Ninja had written for Google Analytics, and granted, you’re inherently sort of hacking when you’re saying, I’m going to write this ping thing to GA, and how am I recording it, and how do I not fuck up all my other data.

[00:22:23] But the other knock with time on page is you have no idea if it’s even the active tab. I mean, about once a week I’m trying to explain to somebody to please, for the love of God, don’t put too much faith in that metric in Adobe or Google. One of those points is, well, great, you loaded it and you wanted to come back to it later, but you went to another tab. So does it handle that? And again, I’m not a developer, but can you detect when it’s an active tab?

[00:22:47] That’s right. There are some smarts in the code. There are other products that do this; Chartbeat has a very good implementation.

[00:22:54] Well that’s the one I’ve seen before and what I’m actually watching is little pings now.

[00:23:00] So he actually does it as well. And there are tools that do this, and there are hacks to make it kind of work in Google Analytics, but they have their downsides as well.

[00:23:10] But doing that in Google Analytics quite radically changes your bounce rate metric if you use an interaction ping-like event. I actually think that’s probably a good thing, and if you are setting up a brand new website with Google Analytics, I would strongly recommend that you do exactly that. But if you’ve got a site that’s got a whole history of data and you suddenly change the definition of bounce rate, that’s a big deal. I think it’s probably the right decision to make, especially in publishing. But it is a big change, and your bounce rate is not going to match any other tools as well.

[00:23:41] So it’s funny, there’s a thread in, I think, the DA forums where somebody is trying to figure that out: well, I’m comparing Google and Adobe. So that’s Google. If you’ve got, let’s say, Snowplow running on your home page, and you have the question of what is the average time on page for this page on my site. That’s just a specific example. Where I’m still a little fuzzy is that with Snowplow you’re not logging in and saying, I’ve set all this up, now let me log into my fancy web interface that has my 75 canned reports and my nice little segments, right? You need to know the data and then you’re querying it, as opposed to working in an interface? Yeah, no, the interface for systems such as these is generally a database query.

[00:24:31] So that means your data ends up in either Elasticsearch or some kind of SQL database, or, if the volumes are really, really high, it can end up in files.

[00:24:43] There are ways to query that as well. So there is no generic front end like Google Analytics out of the box if you don’t install, sorry, if you go to install it.
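
To show what “the interface is a database query” can look like for the average-time-on-page question, here is a small Python sketch using an in-memory SQLite database. The `events` table, its two columns and the event names are invented for the example and are not Snowplow’s actual table layout. The estimate simply multiplies pings by an assumed 5-second interval, which ignores the delay before the first ping.

```python
import sqlite3
from typing import List, Optional, Tuple

PING_INTERVAL_SECONDS = 5  # assumed spacing between page pings

# Average engaged seconds per page view, per page.
QUERY = """
SELECT page_url,
       SUM(CASE WHEN event = 'page_ping' THEN 1 ELSE 0 END) * ? * 1.0
         / SUM(CASE WHEN event = 'page_view' THEN 1 ELSE 0 END) AS avg_seconds
FROM events
GROUP BY page_url
ORDER BY page_url
"""


def average_time_on_page(
    rows: List[Tuple[str, str]]
) -> List[Tuple[str, Optional[float]]]:
    """Load (page_url, event) rows into SQLite and run the query above."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (page_url TEXT, event TEXT)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        return conn.execute(QUERY, (PING_INTERVAL_SECONDS,)).fetchall()
    finally:
        conn.close()
```

Because the raw rows are kept, the same table can answer a differently defined question later just by changing the query.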

[00:24:54] I mean there it is. Let me tell you something that burns out there is a front for the whole lot.

[00:24:59] Tim I’ve got this one this is out of the box.

[00:25:06] That doesn’t really have a pretty front end. You can do queries with it. It’s not like Google Analytics or Adobe Analytics where you get something for free as soon as you do it. But is that an area that somebody will be building something for? Or is that...

[00:25:19] No it just makes no sense to even try to come at it that way.

[00:25:22] There are actually a lot of opportunities in this whole space to build things that plug into it, and this is why, you know, we kind of encourage people to get involved in this stuff and start using the platform.

[00:25:35] So it’s open source, but the more people using it, the more cool stuff gets built. And there are opportunities around this stuff, so building a front end is one of those opportunities. That said, in implementations I’ve done in the past we’ve built front ends for it, but they’re not necessarily interactive analytical front ends. They’ll be things like dashboards that show aggregated stuff, or they might be exports that are sent off to some system that then goes into a dashboard, or inputs into... At Bauer we built a recommendation system, so all of the behavioral signals that went into that recommendations engine came from this Snowplow data. So, you know, it depends on your use case. We also have done lots and lots of deep, detailed analysis using tools like Tableau and the like to explore it. If you model the data right and then stick a front end on it, you certainly can give your end users a front end. But it’s not going to be an open slather, here’s your dimensions, here’s your metrics, slice-and-dice-at-will environment like Google Analytics or Adobe Analytics. But then there are, you know, other advantages, I guess. That’s an opportunity for someone, if someone wants to build a front end.

[00:26:45] We’d love to see it. Okay, cool, but you can. It’s got a limited number. You know, we talked about BigQuery a few episodes ago, and is it that somebody could replace whichever one of those little components it is, to shove it into BigQuery, but it’s not part of the core? Or is it that that would be a bad idea, or is it not...

[00:27:07] Not there yet. At the moment the whole stack is fairly tightly coupled with the Amazon platform, but there is actually some work underway to try and decouple that a little bit, and that has two possible endings that I can see. One is self-hosting. Self-hosting is one of those options that is going to be really important. There are organizations that just don’t trust the cloud; they are not willing to shove their stuff in the cloud. There’s also organizations that aren’t willing to have their data go offshore, so go international. Amazon has data centers in lots of places, but there are plenty of jurisdictions where there isn’t an Amazon data center. So, you know, if you’re in Malaysia there’s no Amazon data center there; if you’re in New Zealand the closest Amazon data center is in Australia. So there’s lots of reasons why someone would want to self-host. You might have ended up storing lots and lots of personally identifiable information. Or maybe you had your CTO stand up in a conference and big-note himself by saying no data will ever be stored offshore by our company, and so you’ve got to deal with that. So that’s one of the options, and the other is platforms like Google BigQuery. The Google stack is ridiculously cheap. BigQuery’s come a long way in the last couple of years. A few years ago it had those Hadoop-y kind of big data problems, where you would do a query and it would take about two minutes.

[00:28:28] These days it is much, much faster. So porting a lot of the Snowplow ecosystem across to BigQuery is definitely on the cards. There’s been a few people talking about doing that. We might end up doing it ourselves at Snowflake if we find the right kind of use case, because it’s kind of an exciting platform, and it would be good to be able to mix and match those components at will as well.

[00:28:49] Got it. Michael I feel like I’m dominating the what.

[00:28:52] Well, you certainly are, but I love the conversation, so I’m just sort of soaking it all in. One thing I wanted to talk about, because as we have these kinds of conversations I always like to make sure that it’s not just all about Tim, it’s about the listeners. I’m ready now. You can stop again; I’ll continue with my questions. OK. I could turn this if they can. So yeah, one of the things to think about is, hey, if I’m thinking about maybe an application for Snowplow, what kinds of things should I be thinking about? Maybe a checklist for getting started. And then there’s a concept, as I was reading through some of their documentation around their event structure and how it’s set up, and that is the concept of an immutable log: in terms of this data, it’s not going to change. We don’t want it to change over time, which makes a lot of sense. But there are things that do change about, you know, visitors over time, and so they recommend merging that with another dataset at the time you do analysis. So I want to hear from you on both topics: how people get started, and then, as things change, what your experience has been in terms of, OK, this person is switching segments or cohorts. What does that look like for that visitor? We don’t want to change those values historically.

[00:30:09] So I’m going to answer the second bit first. Okay, great. One of the core principles that the Snowplow guys kind of wanted when they first started, and what they found as a limitation in the existing web analytics tools, was, you know, you should never throw away data, because data storage is cheap these days.

[00:30:28] So back when Josh James and Colin were creating Omniture back in the 90s, data storage was really, really, really expensive, so they took a model of throwing lots of data away.

[00:30:39] Nothing compared to what WebTrends were throwing away. That’s right, WebTrends are stored in a very nice flat seems to be table limits. Would you like five or ten. It depends. You can Stedham them for different things. Let’s get it right now.

[00:30:56] So those were built in an era when data storage and data processing were expensive. And so, simple example: if you want to reclassify your user agent string to mean something new, because some new user agent has popped up but your web analytics tool hasn’t started recognizing it as such, you’re kind of stuck because the data, it’s gone. You know, that user agent string is gone. So an example of that is the Facebook in-app browser, which Google Analytics still doesn’t recognize as a distinct browser and certainly doesn’t allocate to the right acquisition channel. So, you know, those are the kinds of things you can do when you store absolutely everything. So that’s kind of part of the thought process. So you store everything, you keep it forever. That doesn’t mean that you can’t change your mind about how you analyze it later on. So you might have a model for, my definition of a session or a visit is, well, let’s say we match the standard one, which is no more than 10 minutes of inactivity and no longer than 24 hours. Let’s say we want to do that. Suddenly you work out that there’s actually some edge cases in your application where people are seemingly not doing anything for 10 minutes, but actually they are doing something, and so your sessions are getting chopped off. Hey, we want to make it 30 minutes. Well, if you’re collecting things in Google Analytics or Adobe Analytics, you’re bang out of luck.

[00:32:23] All your historical data is now wrong. In Analytics I don’t think you even get control over what your sessionization is. With the Snowplow stuff you’ve got the raw data, so you can go back and re-crunch it. It’s not necessarily going to be an easy process, but you have that possibility. I did a lot of analysis around that Facebook browser when I was in publishing land, for much the same kind of reason. There’s a whole bunch of weird bugs in the iOS Facebook app where it doesn’t pass through a referrer when it sends a user to your site. So we would have these masses of traffic and people would go, what the hell is that? And we had the data to be able to find out what it was.
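As a rough illustration of the re-crunching described above, here is a minimal Python sketch that rebuilds sessions from raw event timestamps using a configurable inactivity timeout. The function name, the event tuples, and the sample data are hypothetical and not part of Snowplow; the sketch also ignores the 24 hour cap mentioned in the conversation. It only shows why keeping the raw events lets the session definition change after the fact.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(events, timeout_minutes):
    """Group (user_id, timestamp) events into sessions.

    A new session starts whenever the gap since the user's previous
    event exceeds the inactivity timeout.
    Returns {user_id: [[timestamp, ...], ...]}.
    """
    timeout = timedelta(minutes=timeout_minutes)
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > timeout:
                user_sessions.append([cur])      # gap too long: new session
            else:
                user_sessions[-1].append(cur)    # same session continues
        sessions[user_id] = user_sessions
    return sessions


# Hypothetical raw events: one user with a 15 minute pause mid-visit.
raw_events = [
    ("u1", datetime(2016, 8, 1, 9, 0)),
    ("u1", datetime(2016, 8, 1, 9, 5)),
    ("u1", datetime(2016, 8, 1, 9, 20)),
]

# The same stored events, crunched under two different session definitions.
sessions_10 = sessionize(raw_events, timeout_minutes=10)
sessions_30 = sessionize(raw_events, timeout_minutes=30)
print(len(sessions_10["u1"]), len(sessions_30["u1"]))
```

With a 10 minute timeout the pause splits the visit in two; with 30 minutes it stays one session, and no historical data had to be thrown away to make the switch.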

[00:33:02] How is the raw data enabling you to figure that out? Because you’ve got the full user agent string?

[00:33:09] You can see that it was the Facebook browser. So it has its own marker in the user agent string data, and it’s the only user agent that has that. Okay.
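A minimal Python sketch of that kind of after-the-fact reclassification, assuming the raw user agent strings were kept. The `FBAN/` and `FBAV/` tokens are an assumption about how the Facebook in-app browser identifies itself, the classifier is deliberately crude, and the sample strings are made up for illustration.

```python
from collections import Counter

# Assumed markers for the Facebook in-app browser; not confirmed by the episode.
FACEBOOK_TOKENS = ("FBAN/", "FBAV/")


def classify_browser(user_agent):
    """Very rough browser classification from a raw user agent string."""
    if any(token in user_agent for token in FACEBOOK_TOKENS):
        return "Facebook in-app browser"
    if "Chrome/" in user_agent:
        return "Chrome"
    if "Safari/" in user_agent:
        return "Safari"
    return "Other"


# Hypothetical user agent strings kept from old raw hits.
stored_hits = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3 like Mac OS X) AppleWebKit/601.1.46 "
    "(KHTML, like Gecko) Mobile/13E233 [FBAN/FBIOS;FBAV/55.0]",
    "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/52.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh) AppleWebKit/601.7 (KHTML, like Gecko) "
    "Version/9.1 Safari/601.7",
]

# Re-run the new classification over history that was stored before the rule existed.
browser_counts = Counter(classify_browser(ua) for ua in stored_hits)
print(browser_counts)
```

Because the rule runs over stored strings, adding a new browser later only means editing the function and re-running it.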

[00:33:19] Right. I think that sometimes logfile analysis is very useful. It might have a use case. I feel like I just jumped back 12 years. Log files versus javascript deployment.

[00:33:30] What is it. When. Because on the one hand, this is kind of me not necessarily understanding exactly the line between the collection and the modeling and the storage. You say you have the raw data, but it sounded like there’s modeling and transformation happening before it goes to storage. So is there not data loss in that? When you go from tracker to collector, it goes tracker, collector, you said, then modeling and then storage. The modeling has got to have some potential loss of fidelity, or no, in between?

[00:34:02] So this is tracker, collector, and then ETL, and then out into storage, and then modeling on storage. Okay. So, and you can see the raw data at every step of the way. So that data that came raw out of the collector, you don’t get rid of that. That sticks around, because it’s ridiculously cheap to keep it.

[00:34:21] And you might as well, like, literally just push it out into a flat table with the raw hits?

[00:34:25] I mean is that just policy and straight in fact. So that’s not a database, it’s much cheaper than that, it’s just S3. And, you know, you just keep that around forever, which means you can get back to it at any point and change your assumptions, if that was something that you did. The whole process itself is actually pretty simplistic. Really all it’s doing is two lookups: one is an IP address lookup using an IP geo mapping database, and the other is looking up the user agent and classifying the user agent. That’s pretty much all that happens, apart from doing format transformations. It’s really just, yeah, that’s kind of enriching; it’s adding descriptors.

[00:35:04] I mean, it’s adding color. I mean, it sounds like, super simplistically, it’s adding a couple of columns to every hit to say what is the plainer, better name for this user agent, resolving this user agent and geo information.

[00:35:16] That’s right. And otherwise it’s just basically pulling out the query string parameters from the beacon, so the name-value pairs that are sent through, and shoving them into the appropriate fields. In some cases it might do some string operations on those things so that it splits things apart. Really, that’s all it’s doing. So, you know, that means you can do kind of all sorts of stuff. I can give an example of something I’ve done there that was really hard to model, certainly in the Google Analytics environment. So publishers of, you know, entertainment web sites get a lot of traffic from galleries, and galleries are generally a kind of AJAX-y type thing where the page doesn’t load on every slide. Instead you swipe through it and just the image changes. Now if you did a standard Google Analytics install, you would only record one page view for that whole gallery.

[00:36:08] Instead, what you want to do is record that a gallery event happened, but Google doesn’t really treat any event other than page view as a first class citizen in its UI. So you can end up recording it as a page view, which is kind of crappy. So then when someone comes to you and says, what’s the proportion of our page views that come from galleries? The answer is, I don’t know. Sorry. You could go and also record another event into it as a custom event. But then you’re getting perilously close to your limits, or you are increasing your costs if you’re on Premium, so there’s kind of no nice way to fix that. So in Snowplow implementations, when I do them, I model three different events. One is a page view, and that page view is the traditional concept of a page view, as in a page loads. The whole thing loads, that’s a page view. And then we have events for other things. So if you open a gallery, we might record that you opened a gallery. So that means a page view fires, a gallery open fires, and also, as the first slide shows, we say a gallery view happens. So there’s kind of three events to model it. That just doesn’t really map very well to Google Analytics, because everyone wants to talk about page views.
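To make the three-event model concrete, here is a small Python sketch under stated assumptions: the event type names (`page_view`, `gallery_open`, `gallery_view`), the `page_id` field, and the sample events are all hypothetical, chosen only to mirror the description above. It shows how keeping the events separate answers the "what proportion of page views come from galleries" question directly.

```python
from collections import Counter

# Hypothetical event stream: four page loads, two of which opened a gallery.
events = [
    {"page_id": "p1", "type": "page_view"},
    {"page_id": "p1", "type": "gallery_open"},
    {"page_id": "p1", "type": "gallery_view"},   # first slide
    {"page_id": "p1", "type": "gallery_view"},
    {"page_id": "p1", "type": "gallery_view"},
    {"page_id": "p2", "type": "page_view"},
    {"page_id": "p3", "type": "page_view"},
    {"page_id": "p4", "type": "page_view"},
    {"page_id": "p4", "type": "gallery_open"},
    {"page_id": "p4", "type": "gallery_view"},   # first slide
]


def gallery_summary(events):
    """Return (share of page views that opened a gallery, slides per gallery)."""
    counts = Counter(e["type"] for e in events)
    page_views = counts["page_view"]
    opens = counts["gallery_open"]
    share = opens / page_views if page_views else 0.0
    slides_per_gallery = counts["gallery_view"] / opens if opens else 0.0
    return share, slides_per_gallery


share, slides_per_gallery = gallery_summary(events)
print(share, slides_per_gallery)
```

Page views stay a clean count of real page loads, while gallery engagement is measured by its own events rather than by inflating page views.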

[00:37:17] It does actually map better to Adobe, where you’ve got, you can set up different events and different... okay, and they’re not...

[00:37:25] They’re not second class citizens. Everything’s a second class citizen in the UI, so, you know, everything’s the custom reports.

[00:37:33] Are we going to get. We haven’t gotten to the what’s the checklist for somebody starting yesterday.

[00:37:39] Yeah. So if you want to start using Snowplow, first step is to go to the Snowplow website and start reading some of the documentation. You do kind of need some background; you probably would need to know how to set up Amazon stuff, and there’s a few decisions that you need to make. The choice of tracker and collector is pretty straightforward. There’s a couple of choices at each step, but there is one really critical architectural choice that you need to make there, which is whether you want to go for batch mode or whether you want to go for real time.

[00:38:10] So there’s two kind of processing pipelines the real pipes got to be real time we got real time we were at real time. Why is he talking about.

[00:38:19] Well, yeah, see, that goes back to one of the use cases as well, though, which is that running Snowplow on really large data sets can be ridiculously cheap. Really, really, really cheap. At Bauer, where I worked, costs were a couple of thousand a year for collection and processing. The big cost was actually the database to store it all in. If you don’t need that, so for example the audited media guys, they run in batch mode, and that can handle ridiculous volumes and not cost very much, because their output is very tightly constrained. They don’t actually need to do much deep analysis because it’s an embedded product. So once that processing is done to output the few columns that they need, well, that’s it. And they can shut down the processing cluster and wait for another day and then run it again tomorrow. So batch mode definitely has its uses, and certainly if you’ve got really high volumes it’s a lot cheaper. Batch mode is also a bit easier to set up, so that’s another vote for it.

[00:39:21] It’s batch mode, but it’s still batches running. You specify the frequency of the batch?

[00:39:26] That’s right. That’s right. I mean, it’s hard to go much shorter than about an hour, because it uses Amazon’s Hadoop, Elastic MapReduce, and that cluster has about a 12 minute startup time. So you can’t go much less than an hour.

[00:39:43] Because sometimes your batches stomp on each other. And so the real-time mechanism uses Kinesis, which is Amazon’s event streaming platform. Basically you can stuff events in at one end and they persist and are available for querying at the other end by multiple consumers for 24 hours. That’s kind of cool. It’s pretty fast, it’s really real time, but it also gets expensive quite quickly as well. Oh, that’s cool. Yeah.

[00:40:12] So essentially to get started, you just need to start reading the documentation and follow some of the steps. Sometimes the documentation will be a bit crufty; it gets out of sync with the code. This is a problem not just with open source products but with everything. You know, you’ll find lots and lots of old Google Analytics documentation and forum stuff that doesn’t make sense anymore. If you have those problems, you’re welcome to ping me on Measure Slack. And there’s a very active community on Discourse that the Snowplow guys run, if you’ve got questions about getting up and running. Or you can just pay for it.

[00:40:50] That’s what that’s what people should do.

[00:40:52] So there’s one thing that we haven’t kind of talked about that might be important, and that’s some of those use cases around why you might use open source.

[00:41:02] And that’s the one thing we think that’s important.

[00:41:10] We probably crap on about that don’t you want to ask me a question.

[00:41:14] Hey, you know, Simon, the thing I think is really great for people to hear about is sort of, what are some of the use cases that you might leverage an open source analytics platform for?

[00:41:24] Yeah. So there’s a few reasons. There’s if your data just really doesn’t map to the web analytics space, if you’ve got things that are just, you know, hard to shoehorn into that.

[00:41:37] So an example of that is, you know, video metrics and stream metrics. They really don’t fit very well in the web space. I know Adobe’s on their third go at trying to tackle that problem, and Google just don’t.

[00:41:48] Yeah, I just got PTSD from thinking about reading Adobe’s documentation on the number of eVars and events needed to follow one of those iterations. Yeah, that’s right. I’ll be curled up in the fetal position while you guys wrap this.

[00:42:02] What else we get.

[00:42:03] So, you know, this is just things that don’t model into that mental space. If you want to embed it into a product experience, you can create a custom product wrapped entirely around this open source product. So custom service, custom product wrapped around it, really easy to do. And there are people who have built kind of analytical pipelines that do, you know, stuff in particular verticals that just wouldn’t work anywhere else. So that’s a reasonable way to do it. Then there’s a lot of places where costs can become a really big issue, and this is actually a bit of a bugbear for me.

[00:42:39] One of the reasons that some of the more expensive analytics tools are less popular with me is that they suck all the money out of the ecosystem. So I think that if an organization is going in and spending a chunk of money on digital analytics, they probably should be spending a third on the tools and at least two thirds on people to actually run the thing. But in a lot of cases there are people out there who are spending well over two thirds on the tool, and therefore they only have money for a junior analyst, and they wonder why they can’t get anything useful out of it. So by flipping that and hiring smart people instead, there are organizations out there who are doing quite clever stuff. But also if you’ve got ridiculously high volumes, and there’s a few Snowplow installations in spaces like online gaming that have really, really high event volumes, that’s the only way they can do that kind of thing affordably. So those are kind of the really apparent use cases, but there’s also just, if, you know, you live and breathe this stuff and you love it and you love to get down into the guts of it, having a table that just has every event in it is a really empowering kind of thing. People like that really like a platform like Snowplow. But also I’ve had quite a few experiences in my career where a company’s business intelligence or data warehouse team say to the web analytics guys, hey, we want to get your data.

[00:44:06] Can you give us your data? And some years ago I used to fob them off by sending them an Adobe Analytics dump, which came traditionally with no headers, which was really helpful of them, and had something like, I think, about 800 columns, with some of the columns repeated in weird ways. And of course no headers, so you had no idea what they were. I’d send that to them, just the one day’s worth, and go, see what you can do with that and then I’ll turn on a feed for you. And I’d never hear from them again. Well, with a tool like this you actually can give them a database, and for a lot of use cases that can be really kind of useful. So, you know, if you’re in a telecommunications company and you’ve got a bunch of guys who are building a churn risk model inside SAS, having web analytics data in there is really useful. You know, someone who’s viewed more than three help things and looked at the phone, the call center page, is a higher risk. You know, those kind of signals are actually really valuable to have. So while, you know, I used to kind of try and get in the way of those guys because I knew they’d do bad things with it, there are cases where it makes sense for all of your web analytics stuff to just turn up in a database in some format that your data miners can use. Those are kind of some of the use cases. A lot of the cases are embedding into products. So getting into other products, that’s kind of exciting for me.
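The churn signal mentioned above (more than three help views plus a look at the call center page) can be sketched in a few lines of Python once every event sits in a table. The page type labels, the threshold parameter, and the sample rows are hypothetical; a real churn model would treat this as one input feature, not a verdict.

```python
from collections import defaultdict


def churn_risk_users(events, help_threshold=3):
    """Flag users who viewed more than `help_threshold` help pages
    and also viewed the call center page.

    `events` is an iterable of (user_id, page_type) rows.
    """
    help_views = defaultdict(int)
    saw_call_center = set()
    for user_id, page_type in events:
        if page_type == "help":
            help_views[user_id] += 1
        elif page_type == "call_center":
            saw_call_center.add(user_id)
    return {
        user_id
        for user_id, views in help_views.items()
        if views > help_threshold and user_id in saw_call_center
    }


# Hypothetical event rows as they might come out of an events table.
sample_events = (
    [("u1", "help")] * 4 + [("u1", "call_center")]     # both signals
    + [("u2", "help")] * 2 + [("u2", "call_center")]   # too few help views
    + [("u3", "help")] * 5                              # never saw call center
)

at_risk = churn_risk_users(sample_events)
print(at_risk)
```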

[00:45:25] Well, there certainly are other companies that do that too, like Keen IO and things like that. So definitely kind of open ground.

[00:45:33] Yeah. So there’s kind of one more really critical one that we touched on before, which is if you decide that you need to own your data. So there’s all sorts of reasons you might do that. You might actually want to store personally identifiable information in your data set so that it’s only in one place, rather than have to go through contortions and hashing and the like. You might accidentally have circumstances where that happens, and that’s something I’ve encountered a few times, where referrer strings contain nasty stuff and web analytics can get really, really antsy. Or you might have had someone who said, no, we have to keep all our stuff in house. And at the moment, if you can stretch that to mean in our Amazon account, then Snowplow will suit you. And in the near future, hopefully, you know, there’ll actually be on-premise options, which was one of the big selling points of tools like Urchin and WebTrends back in the day: the ability to run it on premise.

[00:46:25] So it’s it’s like having log files stored in a computer that you have on premise.

[00:46:30] So yeah, we have fully gone full circle, back one decade to your WebTrends data collection server DC track.

[00:46:42] This has been fantastically informative. Yeah, it’s actually been great. Simon, it’s just great to talk to you about this stuff, and you can just tell by, you know, the things you were mentioning.

[00:46:54] You’ve certainly seen quite a bit in the analytics community, so it’s been great to kind of share that and laugh a little bit about the good old days. Check off another continent on our world tour. Hopefully we’ll be popular in Australia after this. So one thing we love to do on the show is called last call. We like to go around the horn and everybody kind of talks about something that interests them right now or something that’s coming up that they’re excited about. So I don’t know.

[00:47:21] Simon, if you want to kick it off, do you have a last call? So I’ve got a couple of events that I’m involved in. Web Analytics Wednesday runs every month in Sydney and Melbourne. I’ve worked on the Sydney one, and next month’s will be the 14th of September, it’s the second Wednesday of the month. The Melbourne one should be on the 21st of September. But much more important is MeasureCamp Sydney, which is coming up on the 10th of September, which is really exciting. It’s the first time that MeasureCamp’s come to Sydney. There was a MeasureCamp in Melbourne early this year and it was really successful. If you’re not familiar with the MeasureCamp concept, it’s an unconference, which means that instead of paying thousands of dollars to see sales pitches, there are no defined sessions. Everyone is a participant and everyone is encouraged to run a session, and we turn up on the day and we put up what sessions we want to run and you can choose which sessions you want to go to. It’s also held on a Saturday, which means that those people who go solely to get a day off work don’t come into my house.

[00:48:20] I hear so many good things about them. Yeah, likewise. I don’t know if they’ve cracked the United States yet or not, but I’ve heard a lot of good things. I see there’s one coming up in New York.

[00:48:30] Well Tim what have you got.

[00:48:32] Well, so I will do a quick little log roll, because as this comes out I will be about to dive into my fall conference mania, being one of those people who shows up to, I guess, do sales pitches. Hopefully I’m not too into sales pitches. But I will be at the Copenhagen Web Analytics Wednesday on September 7th. I’ll be at the Boston DAA Symposium in mid September. I’ll be at a senior care marketing summit in Chicago later in September. I’ll be at the data conference in Sydney in late October, followed two days later by the ABS data conference in Melbourne, and then I will be at eMetrics Berlin in early November. So if you go to the About Us page on analyticsdemystified.com, we’ve got a speaking list, so if you’re in any of those towns, whether you’re coming to the event or not, I would happily try to grab a beer with you, because I’m going to be exhausted. My actual kind of last call, just for fun, is limited, of course, and this is R-specific, but it’s still kind of amusing. And this is credit to active Slack member Pavol capuchins. The xkcd package for R is absolutely hilarious.

[00:49:49] I can’t wait to use it. So this is one of those cases of visualisations you really can’t do in Excel. It is damn difficult to make an Excel chart look like xkcd, and that is just fucking awesome.

[00:50:02] So that’s one R package I actually know about.

[00:50:08] Been around probably for 15 years but you know just like movie references. I’m late to the game.

[00:50:14] Well, you’ll get there if you keep trying to watch The Castle. That’s right. Oh yeah. What a classic movie. What’s your last call? Right. So my last call: I’ve just heard about this and I was told I can share it, so I’m gunna, with everybody.

[00:50:30] So Adobe, as everyone knows, is launching their Device Co-op, and they’ve launched a page showing what devices you have registered in the co-op, so what devices they know you’re connected on, and which companies are connected to it so far, at least in the US and Canada, on the page I’m seeing. So the URL is cross-device-privacy.adobe.com. They had to work that privacy in there to make everybody feel better. No, I don’t know. But anyways, really cool little page. Need to check it out. In any case, I’ve been trying to figure out ways that I can get my phone and my laptop and my home computer connected so I can start seeing myself connected on all my devices, just for giggles. But you can also use that page to disconnect your devices from the co-op as well, should you want to, which I think is kind of a nice feature for folks who don’t want their devices connected. So cool thing there. All right. Well, hey. As you’ve been listening you’ve probably thought, these are all great tips, Simon Rumble, but what about question, question, question? And so that’s where you, the listener, can come in, and the great news is Simon’s very active on the Measure Slack. He’s an active member of the analytics community, so he is easy to get a hold of and ask a lot of these great questions. Him and the Snowflake Analytics team are probably all around and available.

[00:51:55] So if you have questions about the episode, Tim and I are completely useless, but we’ll point you to Simon, who as you’ve heard is a wealth of information and awesome kind of insight on this tool and sort of similar tools of the kind. So thanks again, Simon, for being on the show. It’s been really informative. It’s helped, you know, I think, solidify in probably a lot of people’s minds kind of where Snowplow might fit within their overall analytics scheme. There are certainly some very interesting and desirable things about Snowplow, both from the way it collects data to its, you know, really low cost structure that could really potentially help a lot of companies, so I think it’s really great. Thank you for being on the show.

[00:52:36] Thanks for having me. Thanks for putting up with my loud planes going overhead. No one will have heard that, because of the awesome post-production editing values that we have, I hope.

[00:52:49] Anyway. So obviously for my cohost Tim Wilson and everybody keep analyzing.

[00:53:01] Thanks for listening. To join the conversation on Tuesday for half measures. We welcome your comments and questions. Which. Are. Certainly. To. Spark a made up.

[00:53:21] Word. Of. Course I’m sure when we put the outtakes together Tim will not put that part. And. That’s the power of editing. It’s not going to be that big of a deal. Woods. So have you ever seen Australian movie The Castle. The next time he finds Sydney to look down and you see my house. I guess it should have a. When I was with my Woody Woody response. Early Eric Bana Tim if you’re interested Oh I’m on Wikipedia right now. We’re not dealing with that right now at all. God damn it you got it. Oh. Now the serenity. I guess yes we are at. Oh shock. Rock and open source.

