You’re listening to this podcast, so you’re, obviously, well-attuned to the cutting edge of all things digital. But, in this episode, we’re going to discuss a couple (or countless) products/platforms (PaaS — Platforms as a Service! Who knew that was a thing?!) from a little upstart company based in California. Google wouldn’t actually return our calls (okay…we didn’t call them), so we went with an Even Better Option: Mark Edmondson — Data Insight Developer at IIH Nordic, Google Developer Expert, author of so many R packages he had to write a package just to count them, delightfully accented Brit who now calls Denmark home, and a guy who tried to solve Twitter political discussions through text mining (not kidding — it’s discussed in this episode) — joined the gang to do their First Ever three-continent simulcast.
00:00 Michael Helbling: Hi, everyone. Welcome to The Digital Analytics Power Hour. This is Episode 74. If you’re gonna do the data science with the big data, or if you’re an analyst who wants to explore some advanced tools, or you wanna build the next generation app to sequence the human genome, or… Well, the Google Cloud Platform has something for you. We wanted to talk about it even though this is specific to a set of products. All of us interact with Google, day in and day out, even if we’re not using it from Analytics there’s all kinds of tools inside the Google Cloud Platform. So we’ve assembled a crack team from three different continents to talk about it. From Australia, there’s Moe Kiss.
00:48 Moe Kiss: Hi, how you going?
00:50 MH: Hi, Moe.
00:51 MK: Hi.
00:51 MH: From the United States of America, it’s Tim Wilson.
00:56 Tim Wilson: Hey, guys.
00:57 MH: I added myself, Michael Helbling. Howdy, Tim. And from Europe we have the new British invasion, hey you, get off of my cloud, it’s Mark Edmondson. Mark is currently the Data Insight Developer at IIH Nordic. Just a little bit… I feel like that one sort of underdoes it a little bit, just title wise, just FYI.
01:22 Mark Edmondson: [laughter] Underdoes it? Note to self.
01:27 MH: Yeah. So, he’s also Google’s Developer Expert for the Cloud Platform, so he’s official. He is a well-known rock star in the R and the analytics world. He’s created a number of awesome packages for R, especially the Google Analytics R package. Many other ones, the BigQuery R package, googleComputeEngineR, searchConsoleR; and he’s the co-creator with Tim Wilson of dartistics.com, which the…
01:55 TW: So we’ve come full circle. So we’re back to the lowly level of the title of Data Insight Developer.
02:02 MH: Okay, perfect. Well, it’s certainly a pleasure, no matter how junior you are, Mark, to welcome you on the podcast.
02:10 ME: Thank you very much. Thank you for having me.
02:13 MH: Yeah, thanks so much for coming on. Any mistakes that we make on the podcast, especially Tim and I, it’s because we had to get up at a really early hour to do this so that we could accommodate all continents. So, cheers, howdy, g’day.
02:30 MH: Let’s get this show going.
02:37 MH: So Mark, maybe just to get us going, I always love for people to start with just a little more of their background, so that people listening to the show can get to know you a little better. How did you find your way into this exciting role as a Data Insight Developer at IIH Nordic?
02:53 TW: And if you wanna include how you wound up in Copenhagen, if that works its way in, that works, too.
02:57 ME: It kinda does, yeah. So, [chuckle] actually I started… I was actually in a band in Cornwall, [chuckle] and I needed some money. Yeah, I needed some money for my van which was a very expensive van to run. So I started this job as an Internet Assistant. So, if you’re thinking lowly job titles, that would be the first one. And… [laughter]
03:25 TW: Wait, were you actually helping the internet? Were you actually assisting the internet?
03:29 ME: Well, it turned out I was just doing…
03:34 MH: So it was him and Tim Berners-Lee.
03:37 ME: It was basically just looking at 200 websites a day and seeing if they would link to our clients in SEO. That was very glamorous, very glamorous. And did that for ages and ages. And then I did… Got into SEO. And then I got into SEO management. And then they needed us in Copenhagen. ‘Cause I was based in Cornwall, and Copenhagen looked like a very sophisticated place to go. So I went there, and it was, it was nice. I stayed a bit longer, and then I was gonna come back, but then I met my wife, my lovely wife Sana who we just got a kid with, Rose. Yeah, and basically I stayed in Copenhagen, stayed in Denmark. And then I got into analytics through that because the NetBooster offices were sort of the analytics place where I was working.
04:33 ME: Then I left to go to… Well, before that I started getting into R, and forecasting, and prediction, and all of this. And that’s where programming started to enter into the thing. Then I left to go to Wunderman, and eventually IIH Nordic. And then basically, through the R stuff and through the Google Developer Expert program I got on through the Google Analytics, which was through the R stuff perhaps, then I got into the Google Cloud. And it was definitely… The progression from going from Excel to R really kind of opened up my mind about what you could do with data. There’s no limitations of cell lengths or row lengths or things like that. And then I’ve experienced a similar thing going from R locally to R in the cloud and just the cloud computing, in general. Just the sort of things that you can do is so much exponentially more greater.
05:31 ME: Yeah, so then I just started doing that. And because of the GDE program I get some credits to play around on the Google Cloud Platform. And I think a big case for it was, ’cause we had lots of Google Analytics 360 or Google Analytics Premium clients at the time, and they were using that BigQuery integration that they have. And just using BigQuery and how awesome that was, was the sort of in to the rest of the Cloud Platform. And then when I was writing my R packages for Google Analytics, for that, I’ve basically written all the authentication stuff for that for Search Console and Google Analytics. And it was a very small step to start using that authentication method for the other APIs such as the Cloud Platform. That’s where ComputeEngine and BigQuery R and things like that came out of. And that’s where we are now really.
06:26 MK: So not really technical then, we would say.
06:31 ME: Now I would say I’m probably technical, yeah. But…
06:38 ME: But before that not so much, I guess. But I really like that I’ve had that transition, because I think part of the job is knowing how to speak to the clients and how to present and how to actually work on the problems that they’re wanting to be worked on. That is definitely something which is a plus compared to someone maybe who had just been coding all of their career or something like that. I feel lucky at that.
07:07 MH: We’ll get to the cloud questions here in a minute, but what kind of band were you in?
07:12 ME: It was a bit edgy, alt-rock, trying to do something different, all of that.
07:20 MH: Nice, very nice.
07:20 ME: It’s like everyone else. [chuckle]
07:21 MH: Alright no, but seriously, so if you go out to the cloud.google.com and look around, there is so much there. So maybe just, as a starting point, as digital analytics people, where should people start? What would be the first place where people should start dipping their toe in?
07:43 ME: Well, just from personal experience, definitely, I think BigQuery is a good place to start, because once you’ve got everything in there, you can do jobs that you’re already doing in Excel, perhaps, or R, but on a greater scale. So that’s definitely the first product that I came across that I was doing.
08:08 TW: What’s the actual line between the Google Cloud platform, when poking around… And so BigQuery, that was the first Google Cloud Platform tool that you were using?
08:17 ME: Yeah.
08:18 TW: Or were you…
08:19 ME: Definitely that one, yeah, definitely.
08:20 TW: Okay. ‘Cause Google Analytics is not part of the Cloud Platform. Data Studio, I think, is part of the Cloud Platform, or is there a hard line or is it kind of fuzzy like the edge of a cloud?
08:29 ME: Yeah. It’s kind of… [laughter] I get it. It is kind of evolving a little bit, but there’s a lot of services that were there before that are now being kind of integrated in. So you would actually say that App Engine was the first Google Cloud service offering. And App Engine is amazing. It was, I think, ahead of its time in a lot of ways. But the thing is that you had to be like Google to use it. You had to be using the Java or Python and take advantage of all the things. So what they’ve done recently is they’ve tried to sort of fill in that gap between what… The systems you have locally and then moving all of that into the Cloud, and then maybe evolving that into the deeper platforms, more platform as a service.
09:15 ME: So the actual Cloud Platform… Yeah, there’s Compute Engine, there’s Cloud Storage. And those are probably the two fundamental things ’cause they’re kind of… The services on top of that are all maybe built with those building blocks. So things like BigQuery, why it’s good is that there’s all these computers that are all kind of being launched underneath your query. You just see the end results, but what’s happening under the hood is that a lot of Compute Engine instances are being launched to service your query, say, and they’re pulling that data in from Cloud Storage, and then they’re giving you the results.
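The fan-out idea Mark describes here (many workers each scanning part of the data, with only the merged result coming back) can be illustrated with a toy sketch. This is not how BigQuery is implemented; the shard data, `scan_shard`, and `run_query` are hypothetical names invented for illustration, and local threads stand in for the machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "table" split into shards, standing in for files held in storage.
SHARDS = [
    [{"page": "/home", "views": 3}, {"page": "/pricing", "views": 1}],
    [{"page": "/home", "views": 2}],
    [{"page": "/pricing", "views": 4}, {"page": "/blog", "views": 5}],
]


def scan_shard(shard):
    """Compute a partial aggregate (views per page) for one shard."""
    totals = {}
    for row in shard:
        totals[row["page"]] = totals.get(row["page"], 0) + row["views"]
    return totals


def run_query(shards):
    """Fan out one worker per shard, then merge the partial results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for partial in pool.map(scan_shard, shards):
            for page, views in partial.items():
                merged[page] = merged.get(page, 0) + views
    return merged


result = run_query(SHARDS)
print(result)
```

The caller only ever sees `result`; how many workers ran underneath is hidden, which is the point being made about the managed service.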
09:51 ME: And what Google are basically doing is that they’ve been working on this infrastructure for 20 years to service their Google search, and they’re now exposing that infrastructure to us. And because they’ve been doing this for so long… A lot of the big data technologies that are used in other platforms like Hadoop and stuff, they all started as Google papers and things like that. Hadoop was originally from a Google paper, and then Yahoo went on and improved it, I think. But in the meantime, whilst everyone was sort of getting to grips with that, they were working internally on their own systems, and those systems are getting released.
10:28 ME: So BigQuery was their evolution of Hadoop, say, that they were working on internally. And then we get the results afterwards. So what they’re really looking to do now is exposing a lot of that infrastructure. And Google are responsible for 40% of the world’s internet traffic. They have been doing this stuff for a long time. They’ve got dedicated infrastructure. And then they saw Amazon take over all of the Cloud computing business, and they’re thinking, “Well, we should get a piece of that.” And… [laughter]
11:01 ME: The GCP platform, the Google Cloud Platform, they’re not dominating that industry for a change. They only have maybe 10% of the industry. So they are really looking to do that because who knows how much longer AdWords is gonna be bankrolling all of Google. Because we’ve got all of this Alexa stuff and voice search and all that where it’s gonna be hard to squeeze in an AdWords next to your search. And that’s where they make, I think, 80% of their revenue still is from that. So they’re gonna have to have an alternative to that. Now, I think the Cloud Platform is one of their big bets of where they’re gonna keep making their money.
11:47 TW: So on the one hand, my perception is there’s all this Google stuff, and I’m using it all the time in all these different areas, but you mentioned the Google Cloud Platform, whatever their market share, it’s smaller if it’s 10%, 15%, whatever. When large organizations, if they are all about Amazon, does that become a real hurdle to say, “Oh, here’s the competitor”? Or are IT organizations moving to the point where they are very comfortable with saying, “Sure, we’ve got multiple cloud things”? There might even be some benefits to having a couple of different core backbones? Or I know they occasionally run into things like Snowplow, it’s still not… But doesn’t really play super nice with Google Cloud Platform, it’s more of an AWS thing, right?
12:36 ME: Yeah, that’s coming. That’s coming. I spoke to Michael actually at the Copenhagen Measure Week? MeasureCamp?
12:44 ME: Super Week? Yeah, MeasureCamp, Copenhagen, from… Yeah, I spoke to [12:50] ____ there. And they were talking about them moving it onto the GCP soon.
12:55 MK: Really?
12:56 ME: Yeah, but… Yeah, yeah.
12:58 MK: ‘Cause we, actually, we run both GAP and the Snowplow, and we use both. So we use Redshift and BigQuery. And it just kind of… So it depends on what I wanna look at, I would go to one of the two.
13:14 ME: Yeah. So what I would say about that is definitely Google know this. I would say… So Amazon were definitely there first, and they deserve all the credit and all of that for doing that. But I think, they’re kind of generation one. And in my opinion, the Google one is easier to use, and the APIs are nicer, and things like that. So people are calling that generation two. And there’s certain features like pricing. If you spin up an Amazon instance, you pay per hour, whereas in the GCP, you pay per minute. So you can make savings on that one. But I’m definitely biased in that regard, so I’ll reserve the judgment for other people to say. But I would say that if you’re gonna pick a stack, definitely, I would try and stick to one stack, because a lot of the costs, so that’s either Amazon or Google Cloud Platform, or Azure, or even IBM or something like that, because a lot of the costs that you have is for moving data, and if you’re moving data within the stack, so from BigQuery to Google Cloud Compute, you don’t pay any money. But if you move it from ComputeEngine to Redshift, say, on the Amazon stack, then you will be paying money for that transfer.
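The per-hour versus per-minute point is simple rounding arithmetic, sketched below. The 70-minute job and the $0.60/hour rate are made-up numbers for illustration only, not real prices from any provider, and `billed_cost` is a hypothetical helper.

```python
import math


def billed_cost(runtime_minutes, hourly_rate, granularity_minutes):
    """Cost when usage is rounded up to whole billing increments."""
    increments = math.ceil(runtime_minutes / granularity_minutes)
    billed_minutes = increments * granularity_minutes
    return billed_minutes * hourly_rate / 60


# Hypothetical job: 70 minutes on a machine priced at $0.60 per hour.
per_hour = billed_cost(70, 0.60, granularity_minutes=60)    # billed as 120 minutes
per_minute = billed_cost(70, 0.60, granularity_minutes=1)   # billed as 70 minutes

print(f"per-hour billing:   ${per_hour:.2f}")
print(f"per-minute billing: ${per_minute:.2f}")
```

The gap matters most for short, bursty jobs; for machines that run around the clock, the two schemes converge.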
14:27 ME: So if you’ve got this integration of systems on lots of different stacks, that’s probably more complication and expense than you need. Definitely keep it simple. So pick one that you mean, that your developers like. But I think a lot of the motivation behind Google going open-source and a lot of the stuff they’re doing, so Kubernetes and Docker and things like that, is what they’re doing is they’re making it much easier for people to move between stacks. So it is now possible to have all of your stuff in a Kubernetes platform with lots of Docker containers, maybe we’ll get into that later on, and then move that wholesale completely into AWS or GCP, and have the same services more quickly than… It means that you’re not locked into one of those stacks as much as you were before, basically. And the reason Google is pushing this so much is, obviously, because they wanna make the pain of switching from Amazon to them as smooth as possible, rather than, yeah, it being costly.
15:29 TW: That has it seems like little shades of the part of the tag management pitch of, separate and then you can switch out your Analytics platform, even though you aren’t really going to do that.
15:40 ME: Yeah.
15:43 TW: Because that does require… And I know my own experiences, and by my own experiences, I mean hitting up Mark and Jason Packer constantly, trying to understand the Docker and spinning stuff up, to make that portability, there is incremental upfront effort to actually include that, and another layer of complexity, even if it’s not a ton of complexity, right? It’s not…
16:08 ME: Absolutely, it’s another layer of virtualization. So you’re spinning up a virtual machine, then putting a virtual container within that machine that’s running code, that may be spinning up other virtual machines, [chuckle] and things like that. So it’s definitely another layer of abstraction. But it’s definitely very popular. Kubernetes is just the kind of thing that makes it possible. I think it’s the most starred repo on GitHub, and with TensorFlow being second. So it’s very popular, and all the cloud people are embracing it ’cause that’s the future, that’s what people are realizing, I think.
16:49 TW: Again, the visual energy of embracing a cloud, it feels very empty, and sad, and unfulfilled.
16:58 ME: Yeah, yeah. But…
17:01 ME: It’s definitely the way it’s all going, if you read any of the more DevOps blogs, the cloud is… Because simply, you can’t invest as much money unless you’re in a massive, massive company, you can’t invest as much money into your own infrastructure than these guys can. And they’ve got it all kind of… They are ready waiting for you per minute. To run your own BigQuery locally would cost you millions to do. So if you can make that money back… And it’s much nicer to use as well. And I definitely wouldn’t be doing stuff that I’m doing if it was all locally on servers and stuff like that. The bar has come so far down now, that one little analyst here can be sort of messing around with this stuff, whereas… And if it still happens, let’s say, you’re in institutions that don’t embrace the clouds, you’ve got a whole IT team, DevOps team with their opinions and all of that, that you have to negotiate as well to do the same jobs. And when I’ve been speaking to IT teams as well, they’ve been like, “Yeah. You go and do this rather than me having to do all of this stuff for you.” They’re embracing it as well. So it’s definitely a rising tide, I’d say, of where things are going at the moment.
18:25 MK: I’m just really curious to hear a little bit about… ‘Cause you kind of talk about how Google’s growing its share of the market, or will in time, and some of the moves they’re making. I’d be interested to hear a little about your own journey. And particularly I feel like I probably bother you enough about R stuff, and I constantly am talking to Tim about it, but it seems like the Google stack isn’t really compatible with R, and that seems to be the path that you were kind of going on. So how did that all kind of fit together?
18:57 ME: I think it is compatible with R now, [chuckle] now I’ve made these packages, so [chuckle] I did have a need to do that, to make those, yeah.
19:05 MH: Great job, Google.
19:14 ME: But definitely it’s not just me; bigrquery by Hadley Wickham, has been around for awhile. I think that’s the one you were using from… You mentioned earlier?
19:15 MK: Yeah.
19:17 ME: And yeah, that’s got sort of first class support. And the thing is, the reason I’m able to make these packages is ’cause they’ve got this JSON API for all of their packages. All of their services, basically. Whenever they make a service, there’s an API that comes associated with it. You can even generate all of your R code now using that API. It writes all the functions for you. So first they support Python and Java, I think are the major two, but then they also released this JSON API which you can use to build your own libraries in any language. And that’s what I’ve done. That’s what I’ve done in mine.
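The idea of generating a client library from a machine-readable API description can be sketched in a few lines. Everything here is a simplified assumption: `API_DESCRIPTION` is an invented, cut-down description pointing at a placeholder `example.com` URL, not a real Google document, and `build_client` is a hypothetical helper that only builds request descriptions rather than sending anything.

```python
# Hypothetical, heavily simplified API description (not a real Google document).
API_DESCRIPTION = {
    "baseUrl": "https://api.example.com/v1/",
    "methods": {
        "datasets_list": {
            "httpMethod": "GET",
            "path": "projects/{projectId}/datasets",
        },
        "datasets_get": {
            "httpMethod": "GET",
            "path": "projects/{projectId}/datasets/{datasetId}",
        },
    },
}


def build_client(description):
    """Turn every described method into a function that builds a request."""
    client = {}
    for name, spec in description["methods"].items():
        # Bind spec as a default argument so each function keeps its own spec.
        def make_request(_spec=spec, **params):
            url = description["baseUrl"] + _spec["path"].format(**params)
            return {"method": _spec["httpMethod"], "url": url}

        client[name] = make_request
    return client


client = build_client(API_DESCRIPTION)
request = client["datasets_get"](projectId="my-project", datasetId="web_logs")
print(request)
```

Because the functions are derived from the description rather than written by hand, adding a method to the description adds a function to the client with no new code, which is what makes generating a package per API feasible.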
19:55 TW: I feel like you’re not representing this as well as I had the memory. I think I might have actually been in Denmark. It was overnight, and you just casually came in and said, “Oh yeah. I just generated a package for every Google API last night.”
20:29 ME: Yeah. [chuckle]
20:29 TW: So, right. That’s what you’re saying, is that Google has gotten… We can always complain about documentation and support, and the robustness of things, but they seem to I guess, correct me if I’m wrong, they seem to have… Because they’ve been doing this for so long, they get the infrastructure beyond just the extreme, back end to the point that you had a light bulb went off and said, “Oh. Well, in theory, if I write this one bit of code, it will generate these packages.” Not super well-documented, not super usable, that’s not what you actually published out, that’s not what all the packages are. And that’s partly to their credit, that they literally said, “We have sufficient, good practices, and an uber architecture in place, that we can do that.”
21:04 ME: Yeah.
21:05 TW: Is that fair?
21:05 ME: Yeah, exactly. For instance, all of their APIs, cloud APIs are published through a similar system. And even that’s been exposed now. Cloud Endpoints. So you can actually use that system yourself for all of your stuff. But then…
21:19 TW: So how do you feel about the movie Inception? ‘Cause I feel like you keep relaying the plot of Inception by virtual within [21:26] ____… Virtual, and now you’re…
21:29 ME: That’s alright. Yeah, that’s alright. It’s not as fast, but…
21:29 TW: Yep. Sorry. Okay.
21:32 ME: Yeah, there’s a lot of iteration. Yeah, recursion, recursion, recursion. But yeah, I get the impression that the Amazon stuff is a lot more segregated, a lot more different teams working on different stuff, and it’s not all built together. Could someone correct me if I’m wrong about that, whereas the Google stuff, they always have this standard. And that’s why I could write a standard library for it. ‘Cause it usually always comes out in the same manner. So there’s not too much reunderstanding the stuff that’s coming out. Yeah, so it feels more evolved in that respect. Definitely.
22:07 TW: And does that also mean backward compatibility from… I know definitely within R, anytime I go in and say, “Update packages,” there are a ton of my packages that need to be updated. So, in the open-source world there are lots of little updates that are happening.
22:25 ME: Yeah. That is the danger of open source. Yeah.
22:25 TW: Yeah, but the Google infrastructure, are they really pretty good about backward… Is it?
23:19 MK: I wanna ask a little bit about the Google stack. I looked at it yesterday, and the first thought that I had, was that I was absolutely a little bit intimidated. There’s a huge list of products. I’ve mentioned that I dabble with BigQuery, which to be honest, I kind of forgot existed after I used it at my last employer. And it wasn’t ’til someone joined the team that was like, “Why aren’t you using this?” And I was like, “Oh, yeah. I can do that.” But you mentioned that you had some credits, and you started playing around. Have you had a look at any of their machine learning packages or anything like that? And I guess for someone that wants to yeah, kinda test the waters a bit, other than BigQuery, what are your suggestions? Because my understanding was that a lot of the stuff, you need Python to be able to access it, other than… You’re a very technical person so…
24:11 ME: Yeah. Well, Python is definitely the first supported language for it. And Java, but that’s… I don’t touch Java, it’s too verbose for me. But…
24:23 ME: But the machine learning APIs that are just coming out are really, really cool, and you can use those in the browser, actually, if you go to their page, about them. And that is one I’ve just put together into an R package as well, so you’ve got it in R as well if it’s needed, but…
24:44 ME: I’ve just been concentrating on the language ones, so it’s like speech-to-text, and then translation, and then entity analysis. So feasibly you can talk into a microphone in Danish, get it translated to English, and then use entity analysis to see what the person’s talking about, and then give it back to something. But from going back to your previous [25:08] ____, I think it’s very much you need to have the right, like all analytics, you need to have the right question to answer. You need to have the right… Yeah, you need to have a problem, and that is definitely how all of what I’ve been doing has come from, I’ve had a problem, and I’ve needed to solve it, and then I’ve looked around for what to do.
25:30 ME: So I think it helps if you know roughly what’s available, the capabilities, ’cause yeah, I think if you’re just browsing around on the things like Pub/Sub versus BigQuery versus Cloud SQL, what is actually the use of these? I’ve just done the Coursera course, in data engineering, just the fundamentals one, and that was really cool to just sort of give you an overview about all the services. ‘Cause I touched on probably about 25%, I would say, of all of the stuff that’s on there. But just knowing the potential of the rest of it, it means that when someone comes to me with a problem and says, “I need a globally distributed database that will handle credit card transactions in real time,” I can go, “Okay, Spanner, that’s what Spanner does.” And, yeah, I’ve not had that yet, I’d love to have a problem like that come in. But [chuckle] yeah…
26:25 ME: For instance…
26:28 TW: How strictly do you adhere to that… I feel like it’s always gonna be more productive to have a gun to your head with “I have to solve this problem and I have to go find the way to solve it,” but do you… The Google language, was that really one where you guys had a client who just came and asked… I guess sometimes I feel like maybe my clients, I’m not managing to tease the bigger, sexier questions out of them.
26:57 ME: Maybe I should mention my working arrangements, because I didn’t mention that. But I do have… I’ve consciously chose when I moved to Denmark to give myself “20% time,” the Google stuff? So I worked four days a week at NetBooster, and at IIH Nordic I work three days a week. And the rest of the time is spent on me working on longer term projects, and working on this. And this is definitely the reason I’m speaking to you today and why I’m doing well in my career, is that I’ve had time to sit there. So maybe what you were saying you just don’t have the time and bandwidth to grok all this stuff and get all of that… I have had like 20%, almost 40% of my time sometimes, dedicated to doing it, which sounds crazy when you think about… [chuckle]
27:47 TW: But I don’t want anybody to use that as a cop out to say they can’t do that.
27:50 ME: No, no, but I think it does…
27:51 TW: That’s awesome that you had…
27:52 ME: It requires dedicated resources, yeah. And…
27:55 MH: Right.
27:55 ME: But I think that is like, yeah, that’s definitely helped me a lot in doing… And it… I think I got a better job and better situation because of that time. Yeah, so thinking about it, and that’s half the reason I went to IIH Nordic in the first place ’cause they agree with that philosophy. Some companies when an employee, if you say you want a day off a week, obviously… But in IIH, totally, they’ve given everyone Friday off, so everyone here works four days a week, and I’m the exception working three days a week so it fits with the culture a little bit.
28:30 MK: Well, I think there’s a little bit of when you’re an analyst and you’ve got a tricky problem…
28:35 ME: Yeah, yeah.
28:36 MK: You do have a bias towards going towards something you’re comfortable with, and it sounds like a lot of your career has been exploring the things that you don’t know and it’s worked really well for you, but for lots of people out there, that’s a really scary thing to do and…
28:52 ME: Yeah.
28:53 MK: Yeah.
28:54 ME: So yeah, so going back to the Google language thing, the reason I did that was because of the whole fake news Trump stuff that happened on Twitter and all of that. And…
29:07 TW: You’re welcome…
29:07 ME: And that was the problem, and I was just like… And it almost ruined Twitter for me. I adore Twitter. But I couldn’t go on there because it was just too much… I think Twitter is completely the wrong medium to discuss politics ’cause it’s too clickbait, reactionary, all of that. So I wanted a way of giving people more knowledge about how much they can trust other people on Twitter, and how much are you regurgitating what other people in your sphere think, versus an actual original balanced opinion. And I didn’t want it to be biased. I have one… I think one of the major problems is that the camps are getting further and further apart, and it’s completely vilified over there and they’re just looney crazies as well, so it’s just a lot of… There’s less compromise and there’s less agreement and all this, and this is what politics needs. So that’s why I did the Google language thing is that it gives me the tools to do that, and I’m working on that on the back burner.
30:15 TW: So you’re not really a kind of person who tackles big problems.
30:18 TW: You just kinda go for these little niche little small.
30:20 ME: Well, what happens…
30:22 TW: Easily solved…
30:22 ME: What happens I go for the big problems; it fails, but then I’ve got the skills to handle smaller problems. I think that it is probably more accurate what happens. Yeah.
30:30 TW: That’s actually… I like it.
30:33 MH: Well, and I think actually that brings up something I think is really helpful to consider which is the best way to get in and after these things ’cause there’s so many tools you can potentially use here, and AWS, any of these platforms, is to think or have a problem you’re trying to solve first.
30:51 TW: Yeah.
30:51 MH: So establish… And this is how I talk to my team ’cause actually there’s a bunch of people in our group as well are using some of these, and we’re doing projects on Google Cloud Platform and it’s really exciting, but I always say, “What’s our business problem that we’re solving? What are we doing to drive something forward with this?” ‘Cause that will give us the guidance we need to go use whatever tools we’re trying to do. ‘Cause a lot of times I see people be like, “Hey, look what I can do with this cool tool.” Awesome, that’s a great start. Now, how do we apply that to the set of problems which are immutable or always the same?
31:28 ME: And that is the most difficult bit, definitely. But that’s… The tools are just like… Yeah.
31:32 TW: But there is… I feel like… And honestly, this podcast is part of the tool for me on that front is that it can be very, very hard to actually broaden the definition of what sorts of questions could be answered or what…
31:51 MH: Yeah.
31:51 TW: And so I think that…
31:52 MH: Like any tool, Tim. I agree with you; it will expand but you have to start, and then as you learn all these platforms and tools, I think it expands your role again, and you have a new set of questions and objectives you could go after.
32:07 TW: Well, but I think that’s where the data engineering course… I think there’s something to be said because of OpenCourseWare that there are reasons to say, “I don’t know exactly… I don’t have a problem, I’m not gonna go learn, take the data engineering course, Google Cloud Platform Coursera course. I don’t have a problem I’m trying to solve right now on that, but that is just a way to broaden my horizon so that then I can start to maybe up level…” I just said “up level,” oh my God.
32:40 ME: You can make up new words, that’s fine. [chuckle]
32:42 TW: Hold on, a moment, yeah. That I can start actually realizing there are some bigger questions that I could solve. There is I guess a chicken and an egg sort of process where you do need to do some dabbling and exploration to then come and ask a hard question that then forces you to really build that out.
33:02 ME: But I’ve got an example of that if you… So yeah, recently I’ve been looking at the BigQuery streaming and things like that, and App Engine, all of that, and then we had an inquiry come in about Internet of Things analytics. So from that they were saying, “You’re doing digital analytics on a website. Can’t we just move that data source to being a stream of data coming from bins and light bulbs, and things like that?” And because I’ve had that experience of the BigQuery streaming and knowing the capabilities of that, I was like, “Yeah, yeah, that is a way of doing it. We could do that, it’s feasible.” You’ve got a solution to the problem that is presented. So that’s one example.
33:48 TW: Can we define App Engine real quick for me? That was a little bit… Even though I’ve heard it I’m not sure I fully understand what App Engine…
33:54 ME: Yeah, and this is the problem.
33:56 TW: Early on you said that was the core thing.
33:58 ME: Absolutely, it was the problem. It’s the first thing they released, and basically what happened is that you had to upload Python or Java and this is one of the problems, you had to use Python or Java. You upload Python to App Engine, and then there’s a free tier, so if you only use that code a little bit then it’s free, and it gives you a website basically that will run that Python code every time you hit a certain URL, say. Okay?
34:29 TW: Okay.
34:29 ME: So that Python code can do anything you like. So you could connect it to other Cloud Platform stuff like launching VMs or downloading from Google Analytics, all of that stuff, and then also create a website on top. And it scales as well, so things like I think Snapchat is built on top of it. So you’ve got this free tier where it’s free for you to play around with, but then if you ever wanna scale it up and reach a billion users, all you have to do is a bit of configuration and then pay the money, and then the same code can be all of that.
35:03 ME: So it’s scaling underneath, and what’s happening is that you are just uploading the code, you’ve got no knowledge of the computers that are launching to satisfy the demands of that, and that’s where you’re getting into a truly Platform as a Service cloud thing ’cause you don’t have to worry about the service, you just let them take care of it. And that’s kind of what BigQuery does, ’cause it’s actually launching, say, 50 computers to service your SQL queries. And then why it’s so fast is they’re basically running a lot of computers at the same time. So that…
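A minimal sketch of the model Mark describes, where uploaded Python code runs whenever a certain URL is hit. This is not App Engine's actual API; it uses only Python's standard WSGI tooling, and the `/report` route, the `report` function, and the `call` helper are all hypothetical names chosen for illustration.

```python
from wsgiref.util import setup_testing_defaults


def report():
    # Hypothetical job: in practice this could launch a VM or pull analytics data.
    return "report generated"


# Map URL paths to the Python functions that should run when they are requested.
ROUTES = {"/report": report}


def app(environ, start_response):
    """WSGI application: look up the requested path and run the matching function."""
    handler = ROUTES.get(environ.get("PATH_INFO", "/"))
    if handler is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    body = handler().encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]


def call(path):
    """Simulate one request to `path` without starting a real server."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body.decode("utf-8")


if __name__ == "__main__":
    print(call("/report"))
```

The point of the platform, as described above, is that the same `app` would be served to one user or to very many without the author managing the machines underneath.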
35:33 TW: But literally Snapchat is or was, all of the actual stuff was actually…
35:40 ME: Let me just check that. Yeah. It was one of those too, but a billion users is no problem on it. Yeah, it is Snapchat, yeah.
35:48 TW: So that gets you out of the world of, “We’re configuring our web server that we’ve got.” It is your web server but it’s inherently built to be running…
35:58 ME: And it’s built on Google’s infrastructure, yeah. Now, the problem with that, it was too specific to Google, and there’s still lots of stuff which is good, but yeah. So what they’ve done is filled it all in. So say, you’ve got your SQL database locally. If you wanted to move that to an App Engine thing, you’d have to reconfigure it into the App Engine’s Datastore thing, which is much too much bother [chuckle] to see the gains. So what they’ve done now is introduced intermediate services. So Cloud SQL basically is you take your SQL database that’s locally and you just move it into the cloud, and then you get exactly the same as your local one, but it’s in the clouds, and we manage stuff for you, a little bit. So that’s kind of the first step, that’s it, that’s all it is. And you can do that with Compute Engine, you can move all your laptops that you’ve got in the corner of your room, start up a GCE instance instead, all of that. And that is the first step, and that’s what they’re doing now.
36:58 ME: Another thing is Hadoop and Spark. You’ve got a huge Hadoop cluster running in your bank or something like that. You can fire up Dataproc, and you can move your Hadoop cluster straight into that. So that’s what they’re trying to do: They’re giving you that kind of first step in. And then from there you can start to separate CPU from storage and things like that, and start to move on. Because I was wondering, should I be using Dataproc, which is sort of Hadoop on the cloud. If you’re using BigQuery already, there isn’t actually much call to use Dataproc. And maybe you wanna use some of the Spark stuff, machine learning stuff, but if you’re just… Actually BigQuery is a kind of a superior product for a lot of cases. Say, you wouldn’t start with that but if you’ve already got legacy systems, that’s the thing, you need to… It’s a lot of pain and hassle to move it all into a new system. So they’re giving you more of a route in to moving that into the cloud.
37:56 TW: So how, for IT organizations that are already managing a big budget for their IT infrastructure, presumably they’re working with Google, and the way things are being charged, they can try to estimate… I know when I had some very, very small things on the Google Cloud Platform and suddenly it was like $37 a month. I was doing nothing with it; it was just sitting there. And again, went back and forth with a few people, and they’re like, “You can do this, you can do that.” I think now I have this between $8 and $9 monthly charge, which I’m not really doing anything with it, but how does the… At times it seems like the cloud stuff is insanely free, like when I’ve had clients who go to GA 360 and they get their BigQuery credits every month, and generally what we’re doing actually totally covers that, so you’re like, “Oh, BigQuery is essentially free.” But that may also be like it’s the free sample of the crack cocaine.
38:58 TW: ‘Cause you mentioned a few times, you have these credits. Well, if you didn’t have these credits and all the stuff that you’re dabbling with, would you be spending $20 US a month or $200 US?
39:11 ME: Well, for instance, we don’t have credits at work, and we’re using $100 a month on that at the moment. And I think what… I’ve had a recent blog post about this, because yeah, if you just move your laptop into a Cloud Compute Engine Instance then basically you’re gonna be overpaying. It’s more expensive than just actually buying a laptop every two years, so what’s the point? Why would you do that? Yeah, so that’s why it’s only the first step of moving all this stuff straight in. And a lot of the time the cost is associated with keeping that instance running all the time, and just treating it like your laptop, but you have to have a sort of a different philosophy as well, a change of how you treat computing in your mind as well. And that change is to separate out the hard drive and the compute, and so to move all of your data into cloud storage, which is dirt cheap, that’s very, very cheap to store stuff, but then only turn on, say, the compute stuff when you need it.
40:19 ME: So when you’re charged per minute, when you’re logging into RStudio, say, and you only need it for between the hours of nine and five say, only have it on during that time and then turn it off, and then know that your data is sitting safely in a cloud storage which is cheap. And similarly if you’re… Yeah, and BigQuery is an example of that because it’s actually storing all of that data in cloud storage say, and so you’re getting charged a very minimal amount of money for when you have your data in there, but when you do queries when you’re actually using compute that’s when the charges come. And they can be quite substantial if you’re doing it, but it’s on demand, yeah.
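The arithmetic behind "keep the data in cheap storage and only turn compute on when you need it" can be sketched as below. The prices are made up for illustration (5 cents per compute hour billed by the minute, 2 cents per GB-month of storage); they are not Google's actual rates, and the function names are hypothetical.

```python
# Hypothetical prices in US cents; real cloud pricing differs and changes over time.
COMPUTE_CENTS_PER_HOUR = 5
STORAGE_CENTS_PER_GB_MONTH = 2


def compute_cost_cents(minutes_on):
    """Cost of a VM billed per minute for the time it is actually running."""
    return minutes_on * COMPUTE_CENTS_PER_HOUR / 60


def storage_cost_cents(gigabytes):
    """Monthly cost of keeping data in storage, whether or not compute is on."""
    return gigabytes * STORAGE_CENTS_PER_GB_MONTH


def monthly_total_cents(hours_on, gigabytes):
    """Compute and storage are charged separately, so they are simply added."""
    return compute_cost_cents(hours_on * 60) + storage_cost_cents(gigabytes)


if __name__ == "__main__":
    always_on = monthly_total_cents(hours_on=24 * 30, gigabytes=100)
    nine_to_five = monthly_total_cents(hours_on=8 * 22, gigabytes=100)
    print(f"Always on:    ${always_on / 100:.2f} per month")
    print(f"Nine to five: ${nine_to_five / 100:.2f} per month")
```

Under these assumed prices the storage charge is the same in both cases; the whole difference comes from how many minutes the compute instance is left running.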
41:00 MH: They store your punch cards for you, Tim, they just charge you to process them.
41:04 ME: Yeah, that’s kind of it.
41:08 TW: Well, so there are two challenges. One is, I think, that many if not most analysts are working in an environment where even if for them to incur a very small charge… The larger the company can be even that much harder, even if your manager says, “Sure, go do this.” But I’m just going to get a slightly varying invoice from Google. I’m gonna have to… There is a challenge with even paying a small amount on a monthly basis that varies that all of a sudden, forget IT, now you’re running through whatever your accounting infrastructure is to get a credit card, or you just say, “You know what? I’m gonna do this personally, and it’s not enough, I’m growing my career.” But that’s one weird area, and the other is, you started to touch on it, of saying, “Well, if you set this up smartly… ” And that was definitely my experience was I did not have the skills to say, “What’s the smart way to set this up so that I’m not paying for this stuff that I’m not actually using?” So on the one hand, their pricing seems awesome ’cause it’s really charging you minute-level computing and very cheap storage, but on the other hand it’s also very easy to say, “Well, the easiest thing is just to be very inefficient ’cause I’m not a DevOps guy, and we gotta find those people.”
42:27 ME: I think it’s like most things. If you go for the real Platform as a Service stuff which is BigQuery, Google App Engine then the free tiers apply ’cause that’s what they want you to be doing. Whereas if you’re getting a bit more fundamental and you’re using Compute Engine and things like that, then it starts to get more… They start presuming you’re more of a DevOps person. So they’ve got this sort of sliding scale of technical chops needed as you go. And yeah, I don’t think you’re expected to know the full stack for anyone.
43:03 MK: But I have to say that’s… And Tim completely stole my thunder as per usual.
43:10 MK: That’s probably a lesson that I learned the hard way when I started playing around with BigQuery was like, “Oh,” I wanted to do more, I wanted to pull more data, I wanted to do things that are really big scale and then roll it up right at the end. And I normally save all my embarrassing questions until after the podcast finished, like I said earlier.
43:30 MK: But tonight, I’m just going for it right now. There is something a little bit dangerous about when you’re an analyst and you’re still playing around, you don’t kind of know what you’re doing, and then all of a sudden like, “Oh, yeah. You hit your BigQuery cap.” And everyone’s like, “How did you run a query that did that?” And it’s this really tough lesson.
43:50 ME: How did you run a query…
43:51 MK: I don’t… I still… Anyway. I’ll send you the code.
43:55 ME: Yeah.
43:56 MK: I’ll send you the R code, but I did it.
44:00 ME: Yeah. From my experience, we’ve never hit that limit. So it must have been a monster query.
44:07 MH: Hey Moe, that means you’ve done something that Mark has never done.
44:12 ME: Yes.
44:14 MK: I’m not sure that’s something to be proud of.
44:17 MK: I guess what I’m saying is the room for error for an analyst that’s trying to learn, that doesn’t have those skills yet. It’s a bit higher than when you’re just playing around, say, in Google Analytics.
44:31 ME: Absolutely, and it can get… I’ve had expensive mistakes as well. That has definitely hit me as well. What I would say is definitely set up the custom alerts on the billing.
44:40 MK: Yeah.
44:41 ME: That’ll send you an email when it says suddenly you’ve spent 50%. We’ve got one for 100%. Say, we typically spend $100 a month at work, so I’ve got a billing alert that should come at the end of the month saying, “You’ve spent $100 a month.” But then I’ve also got one set at 300 and 500% just in case something goes crazily wrong and something like that, and you come in the morning, and all that. Yeah, that is definitely something to consider when you’re going onto the cloud because you do have that scale, and it is all just sort of pay more money, get more stuff. You can potentially scale before you’re [chuckle] ready to do so.
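The tiered alerts Mark describes (50%, 100%, 300% and 500% of a $100 monthly budget) boil down to a simple threshold check. This is only a sketch of the logic, not how Google Cloud billing alerts are actually configured; the function name and its parameters are hypothetical.

```python
MONTHLY_BUDGET_USD = 100
# Alert levels as fractions of the budget, as described in the episode.
ALERT_LEVELS = [0.5, 1.0, 3.0, 5.0]


def newly_crossed_alerts(previous_spend, current_spend,
                         budget=MONTHLY_BUDGET_USD, levels=ALERT_LEVELS):
    """Return the alert levels passed since the last check.

    A level fires once: when spend moves from below the threshold
    to at or above it between two checks.
    """
    return [
        level for level in levels
        if previous_spend < level * budget <= current_spend
    ]


if __name__ == "__main__":
    # A normal month: spend creeps past half the budget.
    print(newly_crossed_alerts(previous_spend=40, current_spend=55))
    # Something went wrong overnight: spend jumps from $90 to $350.
    print(newly_crossed_alerts(previous_spend=90, current_spend=350))
```

The higher levels are the useful ones for the "come in the morning" scenario: they only fire when spend has run far past what a normal month looks like.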
45:24 MK: Before you’re ready. Yup. Like to share my lessons learnt.
45:28 MH: Well, and then there’s also…
45:30 MH: You look at the different tools and I think it’s also sometimes confusing to figure out, “Well, what’s the best one for the thing that I’m trying to do?” ‘Cause some of them really kinda seem to crossover each other a little bit too.
45:42 ME: Yeah. I would say that’s more of the case for AWS, actually, from what I’ve seen. But yeah, sorry, carry on with what you were saying.
45:49 MH: No, no, it’s fine. I’m looking, it’s like, “Okay. Well, Cloud Datalab, alright. Then BigQuery, alright. Which one would I use for what thing?” It’s like, okay, I kinda understand what BigQuery does but Cloud Datalab says you can visualize large data sets, but I thought that’s what Google Data Studio was for. So it’s like…
46:08 ME: Yeah.
46:10 MH: Those kinds of things are where I think it’s hard for people to know like, “Oh, here’s what… ” And so I think honestly as a community we probably depend on people like you, Mark, to help guide us through some of this stuff a lot of times.
46:22 ME: I think that’s what this program, the GDE program, is about. It’s about putting blog posts out there, putting out all the use cases for X and Y and things like that. ‘Cause they’re giving you the tools, and what you do on it isn’t defined. So yeah, there’s definitely a lot of blog posts that you can do. There are on the Google Developer Expert website there’s quite a lot of solutions-based articles and things like that, which kind of describe things like that. Yeah, I would definitely look out for those, and I could maybe do a few links to help in that regard. But yeah, that’s a lot of reading [chuckle] to keep informed.
47:02 MK: I wanna ask quickly. If you could go back in time before you started playing around with the Google Cloud stack, and you just wished that you knew one weird little thing about how something worked, what do you think… Yeah, what’s that one thing that you wish you’d known when you started out that after all this time?
47:23 TW: OAuth 2? [laughter]
47:26 MK: Don’t start your authentication this week, Tim.
47:29 TW: How to become a famous musician. [chuckle]
47:31 ME: Yeah. Yeah. That’s a good question. I’m gonna have to think about that one.
47:38 MK: Okay. There’s plenty of time, and if it takes some time, we can always tweet it out.
47:43 MH: To what extent are you using the Cloud machine learning engine today to develop the best and most popular rock songs?
47:51 ME: Yeah, but it wasn’t out then. Assuming that… It’s such an evolving thing as well ’cause there’s all the stuff that comes out new. And yeah, so the machine learning stuff I think is… That is their big bet. ’Cause they are playing catch up in this, they are trying to be the machine learning cloud and all of this. So TensorFlow, I don’t know if you know, TensorFlow is a Google project that they… Like the Hadoop paper, they just released the paper; for TensorFlow they’ve actually released the GitHub repo, and it’s all open-source. So they’re kind of guiding it, but it’s completely open-source; you can go and see the chat, and that was from their DeepMind acquisition, those guys have helped do that.
48:33 ME: And then you can deploy TensorFlow on Cloud ML, which is a sort of more Platform as a Service stuff. So that’s more like your BigQuery App Engine thing. So you have machine learning trains APIs in the cloud, and that’s something that we’re using at the moment here and it’s very exciting to do that. So you got these kind of pre-trained APIs, text-to-speech, videos, and things like that, they’re taken care of; but this is where you’ve got much more customization about what you wanna be predicting or using machine learning for. So that is, I would say get into that as soon as possible, [chuckle] perhaps, if it was around, back in the day.
49:18 TW: And now we reveal that we don’t actually have Mark. This is a Google Cloud singularity, and Mark wrote a little script that just…
49:25 ME: Yeah, I preprogrammed this talk earlier.
49:29 TW: Say something in that language you made up. No, I’m just kidding.
49:34 ME: Yeah, I can tale dansk with you, yeah. I’ve got a Danish version, yeah.
49:39 MH: Anyway, Tim is really eager to wrap up for some reason, I don’t know why. No, we do have to start to wrap up. This is really exciting. Mark, thank you so much for coming on the show.
49:51 ME: Thanks for having me.
49:52 MH: For everyone who was gonna email us and tell us we were too salesy, deal with it. I think everybody needs to know about this stuff, and I think it was fine. We just get… Mark, you did a fine job, although I thought you were a little heavy on IIH Nordic. I don’t know. No, I’m just kidding.
50:08 ME: Oh really?
50:09 MH: No, not at all.
50:12 ME: I’m literally sitting at work, so I had to mention them at least.
50:16 MH: No, of course not. I am so kidding.
50:19 TW: All those people who were gonna send their resumes to Michael are now gonna send them to IIH Nordic ’cause…
50:22 MH: Exactly.
50:25 TW: Did you say four days a week?
50:26 MH: Well, but I hope you’re using the Cloud Jobs API from Google Cloud platform.
50:33 ME: Not yet, no. Yeah, good point.
50:38 TW: No, anyways. One thing we like to do on the show is called “Do the Last Call.” We go around the horn, just talk about something we’ve seen that’s interesting that we think might be interesting to the audience. Mark, why don’t we kick it off with you?
50:49 ME: Oh, on the spot. Oh dear, this might be… It’s gonna be very nerdy, obviously.
50:56 TW: Geez. We just spent an hour talking about a Google Cloud Platform.
51:01 ME: More nerdy than that, yeah.
51:05 TW: Please, Mark, think of our audience.
51:10 ME: But I’m gonna go with it anyway. But I’ve been using this new MailMate email thing and it’s amazing. And you can write your emails in Markdown. Which if you don’t know, Markdown is a way of styling… Dartistics is written in Markdown, for instance. So you can write emails in Markdown. So, it’s brilliant. Everyone thinks I’ve labored hours and hours on these emails, and actually it’s just Markdown, and it’s got lots of nice filters and things like that. So MailMate, a Mac Mail [51:47] ____ thing. That’s it.
51:49 TW: So it actually takes your Markdown and then will convert that to…
51:52 ME: Yeah.
51:53 TW: HTML and send that out? Or what’s the…
51:54 ME: Yeah.
51:55 TW: Okay, so it’s a…
51:56 ME: It’s a Mac mail thing.
51:58 TW: It’s Mac app or it’s a…
52:00 ME: Yeah, it’s a Mac Mail program.
52:00 TW: Mac Mail.
52:01 ME: Yeah. Replacing the other one, MailMate.
52:05 TW: Nice. I may actually get out of the browser, try that out. Michael, you wanna go second?
52:12 MK: Yeah, throw things out of order.
52:13 MH: What? Who, me? I would love to.
52:18 MH: Actually, this is right on the spot for this kind of episode, because certainly, a lot of people wanna get into this but they run into problems or they hit their BigQuery cap somehow and… Not anyone we know, I just… But recently I ran across a website called datahelpers.org. It was started by Angela Bassa, who’s the Director of Data Science at iRobot, I think, and she just gathered a listing from Twitter of people who are willing to help answer questions in different areas of specialty in the data science and data engineering community. So, there’s people on there, you can basically, they’ve got their Twitter profiles. I didn’t see yours on there, Mark, but maybe you should get on there. But there’s…
53:09 ME: That’s a nice idea, yeah.
53:10 MH: There’s a lot of really cool people, and so as you’re exploring some of these things, there may be people who’d be willing to help out, answer questions, things like that. So that was my last call, datahelpers.org.
53:23 TW: Oh, that is cool. Well, now I’m really confused. I don’t know who to go…
53:28 MK: You go, Tim.
53:28 MH: Yeah, Tim?
53:29 MK: You go, Tim. [chuckle]
53:30 TW: Tim?
53:31 MK: Go Tim. We’re saving the best for last today.
53:34 TW: I will chime in because it is also a website that also starts with data. So this came by way of… Adam Greco pointed me to this a while back and it also relates to this episode. If you’re looking to dive in and you’re thinking, “But I need a meaty problem to try to solve.” So datakind.org, you guys familiar with that? It is basically… It may sound very similar to a little web analytics thing that used to exist. But it’s basically solving problems through data. So there will be problems thrown out or non-profits that are saying, “We’re trying to uncover potential corruption in various countries, and there’s open data, and we want to put together a team.” So it’s an opportunity to volunteer doing stuff with data to support pretty ambitious and nifty large-scale problems.
54:30 MH: That’s awesome. Bingo.
54:31 TW: That’s a way to spend some time.
54:33 ME: Well, if you’re looking for a problem to solve, yeah.
54:36 TW: Yeah.
54:38 MK: I really feel great about following that up.
54:40 TW: Moe, what’s your last call?
54:42 MK: Come on.
54:48 MK: Okay.
54:49 TW: So there’s this new episode of Bojack Horseman. No one wants to talk about data greedy.
54:50 MK: Well, when I explain how…
54:55 TW: A website where you just grab everybody else’s data.
54:58 MK: Geez.
55:02 TW: All right, sorry Moe, go ahead.
55:08 MK: So my sister got me onto this website, and now it makes my world look so small, because I spent much of yesterday plugging in different colors and seeing what other palettes you could use for data visualizations. And like I said, it’s really hard to follow up. But it’s coolors.co but it’s spelled with C-O-O-L-O-R-S.co, and I’m assuming lots of analysts have heard of it, but it’s the first time I’ve ever found it, and I was absolutely gobsmacked because it kinda makes it hard to get the color scheme wrong for a data vis if you’ve got that handy tool. So that was my really fun find for the week.
55:47 TW: I’ve looked for that, which also over the weekend, I discovered that there’s a Wes Anderson palette brewer for R, are you guys familiar with that?
55:55 MK: No.
55:55 TW: So it’s basically, you can name palettes that are all different Wes Anderson movies.
56:02 MK: Oh geez.
56:03 TW: So I was going down those R color brewer or whatever the main brewer one; I was looking for a good palette for some visualizations for a presentation, so I will be like that.
56:14 MH: Nice.
56:16 TW: Coolors.co.
56:17 MK: Oh, is that how you say it? Ah. Got it.
56:21 TW: Coolors? I don’t know. I’m not sure. I managed to type it in while you were saying it. I’m on it, I think.
56:27 MK: Can I also just jump in and say, Mark, I do often totally nerd out after the episodes, but it has been so awesome having you on. I use so many of the packages that you’ve written, and you’ve literally hacked them to help you get your job done, and now I use those things to help me get my job done. And I just wanna say, absolutely huge thank you, and thank you for the community because some of the stuff you’re building has just been amazing, and I really appreciate it.
56:55 ME: Thank you very much. That’s what keeps me making them, [chuckle] having appreciation like that.
57:00 MH: He’s okay. He’s a pretty nice guy.
57:02 ME: He’s alright.
57:03 TW: Come on.
57:04 MH: And remember, if you wanna meet Mark in person, come to Measure Week or Super Camp, either one.
57:15 MH: No.
57:19 ME: I think it’s Super next time, yeah.
57:21 TW: Wait a minute, I wanted to say that I pitched my session for Super Week as saying that I’m gonna talk about something for those people who feel like they can’t quite get to the point of Mark Edmondson, but I’m gonna try to get them. And Zoe responded with, he was like, “Oh, that’s a perfect description.”
57:37 ME: Not last night.
57:39 MH: I’ll be doing Mark Edmondson lite, and then my pitch was, “I’ll be doing Tim Wilson lite.” No, I’m just kidding.
57:49 MH: So Mark, you mentioned that you created a Slack channel, is that something that you can share with the community?
57:54 ME: Yeah. Just if you are using any of the R packages that are based on googleAuthR, which is pretty much all of them, then I’ve opened up a Slack channel which is called googleAuthRverse. I’m not very good at naming things, [chuckle] as we know. Yeah, so basically it’s a place to have questions and feedback, and I basically wanna try and build a community up around the packages, because as I mentioned earlier, it is actually quite a lot of work to keep on top of all of the issues and stuff like that, so if people can help one another in that, that would be awesome. I don’t know, you’re in there, Tim.
58:35 TW: To clarify, it’s a separate Slack team.
58:37 ME: Yeah. Yeah.
58:37 TW: But if you message any of us, Moe and I are both in it, and the way Mark set it up, anybody can invite other people, so I regularly tell people who are asking me a question, I’m like, “That would be perfect, tell me what email you want,” and it takes two seconds to shoot an invite. So it is a separate team, not a channel within the Measure Slack team.
59:00 ME: Yeah, yeah, it’s just…
59:02 TW: Just to clarify. Yeah.
59:04 ME: Just so it didn’t hit the message limit a lot, ’cause that’s so popular, Measure Slack now that you can’t find stuff.
59:12 TW: Yeah. If you’re only in the Measure Slack, and you’ve wondered what that little left sidebar that is all blank, this is your opportunity to have a second Slack team over in your sidebar.
59:21 ME: I’ve got it on at least once a day.
59:23 MH: I’m about to find out what happens when mine starts to scroll, so I don’t know.
59:28 TW: You get a bigger monitor.
59:29 MH: Anyway. I like that idea. Mark, thank you so much for coming on the show, it’s been delightful.
59:36 ME: Thank you all family. Yeah, it’s been great, thank you very much.
59:39 MH: And as always, for my two co-hosts, Moe and Tim, if you’re out there listening, we’d love to hear from you on our Facebook page, on our Twitter, or on our website. And remember, keep analyzing.