#106: SQL and the Digital Analyst with Pawel Kapuscinski

WHERE were you the first time you listened to this podcast? Did you feel like you were JOINing a SELECT GROUP BY doing so? Can you COUNT the times you’ve thought to yourself, “Wow. These guys are sometimes really unFILTERed?” On this episode, Pawel Kapuscinski from Analytics Pros (and the Burnley Football Club) sits down with the group to shout at them in all caps. Or, at least, to talk about SQL: where it fits in the analyst’s toolbox, how it is a powerful and necessary complement to Python and R, and who’s to blame for the existence of so many different flavors of the language. Give it a listen. That’s an ORDER (BY?)!

Episode Links

Interested in our newly posted Producer job that pays in friendship and whisky recommendations? Here is what you need to know.

Episode Transcript

[music]

00:04 Announcer: Welcome to the Digital Analytics Power Hour. Tim, Michael, Moe, and the occasional guest discussing digital analytics issues of the day. Find them on Facebook at facebook.com/analyticshour and their website analyticshour.io. And now, the Digital Analytics Power Hour.

[music]

00:27 Michael: Hi, everyone. Welcome to the Digital Analytics Power Hour. This is episode 106. Do you ever get the feeling that you’re missing something in your digital analyst’s skill set? With all the required skills, it’s like you need a database to store it all in. But how do you look around the database quickly and sort it all out? Well, a listener, Christine Tzeng, requested a show about SQL, and by golly, we’re gonna select the second star from the left and outer left join until morning. So let’s sprinkle some fairy dust on this topic and fly. Tim, you excited about SQL?

01:09 Tim Wilson: Is it pronounced SQL? I thought it was S-Q-L.

01:11 Michael: No, I…

01:12 Moe Kiss: Seriously, you stole my shit. Honestly.

[laughter]

01:14 MK: For once, I had like a witty opening.

[laughter]

01:22 Michael: That might be the best joke of the day. [laughter] Alright, Moe, this is a great topic and close to your heart as well. You’re pretty excited about it?

01:32 MK: Absolutely. I’ve been SQL-ing all day.

01:35 Michael: It’s weird that it’s not called SQL, S-Q-L, because…

01:39 TW: You said squealing. I wanna call it squeal. [chuckle]

01:41 Michael: Because it’s not called no S-Q-L. It’s called NoSQL, so like when you’re doing a NoSQL database. Anyways, alright. Now, we wanted to bring a guest on the show and we all like this guy so much and he happens to be an ace at SQL. He’s always the person Moe reaches out to for help the most. So today, he is an analyst at Analytics Pros. He also has a really cool job as a scouting analyst for the Burnley Football Club, and he’s held a number of other data science and analyst roles in our industry. But today, he is our guest, Pawel Kapuscinski. Welcome, Pawel.

02:22 Pawel Kapuscinski: Hi, great to be here.

02:24 Michael: How do you pronounce it, S-Q-L or SQL? [chuckle]

02:27 PK: Oh, I use both to confuse people.

02:29 Michael: Well, okay. Good. But more importantly, gif or jif? That’s actually a good perspective. How do they say it over where you’re at over in Europe. Is it gif or jif?

02:41 PK: Gif.

02:42 MK: Who says jif?

02:43 Michael: Gif, good. Thank you for settling that. Everyone, Moe, everyone. Alright, let’s get started. [chuckle] Talk a little bit about… I think, Pawel, maybe just a little bit about your background, how you got into SQL and where you’ve seen it used in your career as an analyst. And I think from there we can jump into just talking about how we use it in digital analytics or data analytics more generally.

03:09 PK: So I started with SQL because I had to, so we had some project of implementation, some transactions into GA, and you are looking at, “Okay, are we correct with our implementation? What’s missing there?” And the only way to actually know that was to check in transactional database if we have all we need in GA. So, yeah, I had to learn SQL to do that, to do this cross validation of tracking.

03:37 TW: So that was querying a backend database for transaction counts to match to transactions in Google Analytics?

03:44 PK: Yes, exactly. We wanted to know how off we are and how much we can trust our GA data.

03:49 MK: That’s pretty similar to what we had at Iconic where we had, like all of our transactional data sits in a database and then you would be using analytics data as well, and often you’re trying to compare or pull one source of truth which is gonna be your own database.

04:05 PK: Yep, exactly.

04:07 TW: Let me ask, did you know R or Python before SQL?

04:12 PK: I knew R. I started with R. R was my first programming language other than some tries to use JavaScript for some implementation. But for data, R was first. And then I started using SQL just because that was the only way to connect to database. So even if you wanted to connect with the database with R, you still needed to know SQL, or S-Q-L.

04:36 TW: I mean if the database had an application that was running the database that had an API, that had a package from R, do you think you would have… I mean it seems like inevitably you would have gotten to a point where you’re in a database where you couldn’t connect to it with R but… I mean you could have connected to it with R, I guess. You just would have been writing SQL through R and just adding a layer of coding.

05:01 PK: Yeah, yeah, yeah. Yeah, it’s basically is. Like most of the packages you mentioned that connect to the databases, it’s just you applying SQL code as a string and they translate this string and send this to us as some request API. So even though there is a package, it’s usually just accepting the string and the string is SQL that you’re going to pass to the database to get the data back.

05:24 MK: Okay. You and now across three languages, has everyone just been “I need to do this to do my job” kind of learning thing or have any of them been like a conscious choice, “Hey, I just wanna pick this up.”

05:36 PK: I think SQL was definitely the one that I needed to perform my job just because, yeah, as I said, getting some information from databases that are not available in other ways. This is the need. You cannot replace that by anything else. Some companies I know that there’s many financial industries that they actually provide some broke down dashboard when you can just build whatever you want and it returns SQL and queries the data that way. But that wasn’t the case for any company I worked for. So I never would say. But I actually had to learn a skill to start using it to do my job, to do my job. Yeah.

06:14 MK: I feel like it’s becoming an absolute necessity, like you can’t be an analyst in any capacity without knowing SQL. And I don’t know what the jobs are like in the States. But from our discussions, I imagine Europe is pretty similar where every company has some type of database. And yeah, I kinda did the same thing. I learned it on the fly just because you had to. You couldn’t access the company’s data without it. But Tim at SDI, is that a skill that most analysts would have or that they even know that they need to have?

[chuckle]

06:48 TW: Well, it’s interesting that Pawel if you came in through, or and Moe as well, that you started through a, call it a “traditional transactional database.” And that’s back when I did SQL more and I have not kind of really gotten the rust kicked off. It was kind of out of necessity, but it was what I would think of as a more traditional relational database. I’ve got these handful of tables. I need to join and it’s kind of a… To me, it was kind of an easy thing. You get with a pretty simple select statement from one table. Then, all of a sudden, you’re like, “Oh, I need data from another table. Let me do a join. Okay, what’s the difference between an inner join and an outer join?” What’s kind of, I grapple with, is when BigQuery comes along with GA data, it’s kind of just one big table. But the stuff that you need to do to sessionize and do other work with it, we definitely, the people we have who are doing Python and R are also doing SQL. I think we have plenty of analysts who really aren’t doing any of those.*

*Editor’s Note: Drew Goettemoeller of Search Discovery would like everyone to know that we have a lot of analysts who use SQL every day

07:52 TW: For me, as I’ve tried to sort of ramp back up, it’s a much bigger jump to get to the… Taking, diving into BigQuery with SQL and trying to basically replicate stuff that could be done, even just in the web interface for GA, requires a lot. Like that’s a bigger step than where I think diving into a database where we have a transactions table and a customer’s table and we’re joining those together. I don’t know if that’s answering the question.

08:25 MK: Do you think that’s just because of how Google stored data in BigQuery, which is freaking unique and weird, do you think it’s just… I feel like the learning curve is learning about how it’s structured and saved rather than necessarily the skill set being any more difficult.

08:44 TW: I mean I think it’s kind of awesome in that it’s basically one table, which is really hard to pull off, I think, and you guys can correct me if that’s a gross oversimplification of it. I think it’s more the nature of web data and the fact that it’s time-series data in the concept of a session, right? To me, fundamentally, if you look at an orders database or a products database or even a CRM database, not that there’s not some element of time in it, but you’re fundamentally looking at this entity and the attributes of that entity, whereas with web data, you’re looking…

09:24 TW: The data is fundamentally structured as, well, you could say it’s either content or actions, or I guess it’s users and there’s this time element. So having looked at Adobe’s data feed and what’s involved there, you basically have to build yourself a database and figure out a schema for it, and you have to do a lot more work to even get to the start line, and then you’re off to the races with SQL. I don’t know. I literally know the least about this.

09:56 MK: I have no idea what the point was there, like you lost me.

10:00 TW: Well, I think Google’s design of the BigQuery tables for Google Analytics data is actually pretty clean. Like having worked with databases where to do a simple thing, you really have to join five different tables, so it takes that… You’re looking at an entity relationship diagram. You’re trying to figure out where the joins are and what you can do and you have to do a bunch of joining.

10:25 MK: Yeah, but I actually feel like that’s easier. I found, and Pawel knows this, because I would be heckling him being like, “I don’t understand what the heck this unnesting is in BigQuery. Why do I need to unnest something?” And like it really, and I don’t know if it’s because I was learning… I learned in a traditional database first and then went to BigQuery that I found it harder. Pawel, which did you learn first? You learned in like a more traditional transactional database?

10:53 PK: Yes, I learned the traditional database first. And actually I created that BigQuery. So the structure, it makes sense, especially from processing perspective and like, you know, Google, they created it that way because they want to have it cheaper and more efficient and because you could create that. You could create a BigQuery like GA data created in BigQuery the same way as any transactional database is created having each hit a separate row or each information for each in a separate row. But it will just, the processing time and the size of the table and everything would be much more expensive and less efficient for Google.

11:28 PK: So the whole nesting thing they did was just for them to save the money, to save the storage. But yeah, actually, I agree that the unnesting part in BigQuery, it’s pretty complicated and it’s like you don’t actually need to understand what’s happening, but you need to know what to do to get what I want.

11:48 Michael: So just for those who maybe haven’t actually seen the data, the unnesting, is that… That’s happening because each row of the table is an entire session and that means if we think about web data in one session, you may view, take a simple one multiple pages in the way that the BigQuery schema is for the GA data, is that you’ve got basically in one, think of it as a cell, I’m sure that’s the wrong way to think of it, it actually has multiple page views, multiple events, and that’s what the unnesting is saying. If I have one row but I actually have multiple things that happened, I have to unnest it? Is that the… Did I just butcher that or…

12:28 PK: Yeah, exactly. So basically, after unnesting, each event or custom dimension, whatever you have, becomes the one row. Instead of one row being session, one row is the part of the hit that you’re trying to get information about. So for example, if within the one hit, you have 20 custom dimensions, then this one hit becomes 20 rows.

12:50 MK: And that’s where I kept getting totally stuffed up and I actually had this epiphany the other day, which, yeah, I should have realized months ago, the unnesting, ’cause I always was trying to like call custom, hit level custom dimensions with hits, and I couldn’t understand and I was doing it separately and creating all this really complex stuff, and also had to do with the order in which you unnest which I didn’t actually realize because you need to be like unnest at this level and then at this level and then it all works fine. But it really took me… Like me and my sister sometimes get on Skype and try and draw it out for each other, which is so stupid, but it, yeah, I feel like I had to understand it to be able to do it.

13:32 TW: If you guys both, ’cause you guys both then started with kind of a, call it a traditional flat tables, relational database learning SQL. If you wound the clock back and said you could start over, do you think it would have been faster or to your benefit to have had your first forays into SQL be with GA data and BigQuery or… To me it seems like most of the database that you’re gonna run along are kind of more similar to the databases you guys started with, and then it’s been the mind kind of twist to work with the GA data. What do you guys think?

14:10 PK: I think that it wouldn’t help that much. I just think because if you start, if you actually learn this unnesting part of the BigQuery GA feed, then actually, if you start adding some extra information from other tables, then this whole table, it becomes the relational database that I’d been using before. So it’s just, this is just extra layer of information that’s coming with GA fifth. So unnest, this is the one thing you need to do for some fields, not for all fields that you have. So if you want just session level information, then you won’t need to do any unnesting, right?

14:45 PK: But as soon as you have some more granular information about what’s happening in your sessions, then you need to add that part. So this is… I think this is just one small part of the whole process of learning SQL and it’s hard because it’s hard to imagine what’s happening there, and this is very like, if you draw actually through that piece of paper and draw a BigQuery database, then it starts making sense why we need to do that. So it’s like the best… I think Donal Phipps created some visualization of how the BigQuery data, how the GA feed to BigQuery is structured and it really helps to see the whole picture. And actually after that, it will come more natural to you to do some unnesting.

15:28 TW: Has somebody done that or you’re just saying once you start bumbling around, if you personally start doing it, like is there documentation out there where somebody has said, “Oh, this is useful.”

15:37 PK: Yeah, I think Donal Phipps did that.

15:40 TW: Of course he did.

15:42 MK: Now I feel like you have to know both, and I actually feel like maybe it was better learning with a more simple, yeah, I guess traditional, simple, whatever you wanna call it, like transactional database and understanding what was going on. And also I was using Snowplow, and Snowplow, it’s saved similarly where you have hits in one table and different, like 10 levels of data saved in separate tables. And like learning how to join them, I think, is really important. And then you just have to deal with the BigQuery nuances once you start. I don’t know. I feel like either way, for me I need to understand what’s under the hood in both circumstances. Because once I actually understand, “Okay, this is what it looks like and this is what this line means and that, and when I join, I do this,” then I understand it. If someone’s like, “You just need to do this every time,” I’m like, “I don’t get it,” and then I’ll really struggle. So either way, you’ve got to learn it, right?

16:39 TW: So this, now I’m curious because I feel like from the first time that I was exposed to sort of a relational database, it sort of clicked to me how the tables, I was like, “Oh, that’s really cool,” when I started understanding how you would join tables together. I sometimes feel like there are people, and I don’t know if I’ll say they’re necessarily analysts, that the intuition behind a relational database may be a struggle, right? Because you could come into Google Analytics or Adobe Analytics and rely on the web interface and never have that idea of joining.

17:19 TW: If somebody says, “Pivot tables confuse me or VLOOKUP,” those are to me kind of two SQL-like functions in Excel. A VLOOKUP’s kind of a join and a pivot table is definitely kind of a group-by-type functionality. Have you run into, I guess to analysts, that don’t seem to have, be able to develop the intuition behind a relational database? And if so, is there a way to get over that?

17:50 PK: I think if someone is analyst and it means that he’s actually analyzing data, not doing some kind of implementation, I’ve never met a person that actually struggled with that. And I think it’s just because it’s so broad like digital analytics is so broad so you have implementation there, and those people who are more worried about implementation, they probably don’t know the SQL and they don’t try to learn SQL stuff. And I think that maybe is this difference. But people who actually analyze data, I think it’s becoming quite common. As soon as you need some richer data and get some extra stuff from it, then it’s becoming more natural and natural. And if you think about this, even adding custom dimension is kind of like joining stuff, just simplified. But you don’t think about this that way.

18:41 MK: Yeah, it took me a while to get my head around the whole custom dimension hit level, like it wasn’t something that just naturally I understood.

18:50 Michael: What do you mean?

18:53 MK: I don’t feel like it’s another way of joining. I feel like it’s a different, in my mind, it’s just saved in a completely different way. So I just need to work with whatever way it’s saved in. I don’t know.

19:07 PK: But more, I mean is more about if you have some customer ID that you’re pulling back from your backend, you basically join this to the data that you already have, so giving this extra information to your GA.

19:21 TW: That’s why you would say I want a custom dimension so that I can join it to… I want this custom dimension so that I can join it to other data that you’re intuitively, one of the reasons you’d capture a customer dimension is specifically for a joining operation. Is that what you’re…

19:38 PK: Yeah, yeah. Yeah, exactly.

19:40 TW: Okay.

19:41 PK: Because why would they need that? Right? You want to have more information about this. And then, thanks to that, you have this custom… Yeah, you have this key, the one key to see through the sessions and I think that’s…

19:52 MK: But yeah, Tim, to your point though, or question, and my sister is gonna kill me for continuing to talk about her, she never ever worked with relational databases and went straight to BigQuery. And I think, I know from my own learning experience the reason it was faster is because I did have a bunch of data engineers sitting around that would talk me through, “This is how it’s structured. This is what it looks like.” I had that conversation and that support, whereas she just went kind of straight into it and had to figure it out for herself and didn’t have any experience with relational databases. So I actually think it’s more difficult maybe to grasp if you, one, don’t have the support, but two, don’t have that experience with a more traditional database.

20:33 TW: Well, let me call out. We’ll put this in the show notes that you did that kind of little bonus episode with Michelle and that was kind of the, “Hey, dive in or not dive in,” which interestingly, all, both Michelle as well as you and Pawel both did say it was out of necessity that you dove into it. It’s been interesting. This is like for me, for Python and SQL, I don’t have the burning use case, “I have to figure this out and I need SQL or Python to do it,” and it’s… I feel like it’s back to that struggle to, “Necessity is the, I don’t know, mother of education maybe.”

21:18 TW: What is the approach? Like talk to me. I wanna get back to being fluent with SQL. I have access to both free data that’s been captured through the App Engine pushed in GA-free into BigQuery. I’ve got access to customer data. So I have BigQuery data, web analytics data. I don’t know that I have any transactional databases that I have access to. What would you guys say?

21:46 MK: So how do you query BigQuery at the moment?

21:49 TW: I don’t do a ton of it. I just don’t have…

21:52 Michael: Tim has people for that.

[chuckle]

21:55 Michael: He just has other analysts who are just doing whatever he tells them to.

22:00 TW: I mean it’s weird. I feel like there’s probably a big blind spot just like when I learned R and could then hook into the reporting APIs for Adobe and Google Analytics. And once I understood what I could do, then I started seeing opportunities and doing things that were of value. So there’s part of me that feels like, “Is this just a big blind spot?” because I know the data’s there. I’ve actually gone through the schema and done simple queries enough to generally know what’s available in the BigQuery data. But for the BigQuery data, for GA, or actually we’ve got Adobe data, data feed data going into BigQuery as well, but I still haven’t had the absolute need. And is that just because I’m not identifying the opportunity? I don’t know. I can’t figure out how to get over that hurdle.

22:44 MK: I know how.

22:45 TW: Have I’ve lost you again? How?

22:46 MK: I can send you more code for you to review which has heaps of SQL in it, which does also lead me to my next question actually, Pawel. Someone was reviewing my code not too long ago, and one of the pieces of feedback I had… So this is in my quest to get less sloppy with what I’m writing. I’m learning all of these things. So what I have a tendency to do, which actually a lot of people at the Iconic do, is that they get code from either Redshift or BigQuery. They pull it into R or Python, do a bunch of stuff and then spit out stuff somewhere else.

23:24 MK: That’s like a pretty common practice. But one of the pieces of feedback I got was that I should be doing more in my SQL rather than just pulling like… Like my SQL often can be pretty bare bones and then I do all of the manipulation in R. And the feedback I had was that I should be doing more of the manipulation in SQL. So I’m really curious, like where do you think you fit on the scale? Do you reckon you do more of that data manipulation in R or Python or do you reckon you’re doing in SQL or like, and why?

23:53 PK: So I try to do as much in SQL and the why is just efficiency, I would say. And especially with BigQuery, this is so powerful tool that imagine if you try to download two gigabytes of data and trying then to run queries on your R, then it’s going to be madness. It’s going to be hectic. So the thing is that if you have this tool like BigQuery, another example, that can do this all processing for you, just you need to try more and find some way to do this in SQL than do it then, and then so as much, as many operations as you can do in SQL you should do there, and then export the data like as small a chunk of data as you can to R. Just make it more efficient and not to clog your machine with.

24:41 MK: But okay. So like, okay, what about the case where you have one column and you’re dividing it by another column? So that’s not actually an aggregated, make it a smaller data set. You’re just doing a calculation. Which would you lean towards in that case?

24:55 PK: Yeah, but if you… So it depends what you’re going to use in R. You want to use both of those columns in R or if you want to just use the final thing or their columns in R. Because if you want just to have for the final column, then you can just do operation SQL, and then your export will be much smaller just because it can be one column less in R. So you download your data faster and then your machine will less use memory to actually process that.

25:20 Michael: Well, but that’s… So that’s, you’re hitting on the facts. So there are things that could be done in either, and at some point if you’re using both for some task, you’re having to figure out what’s the point where I’m shifting from having done stuff in SQL to doing stuff in Python or R, and do you ever find your… I would think that somebody who is way more comfortable with SQL is gonna maybe push a little farther into the stuff that writing more convoluted SQL that would have been simple in R or Python, how soft is that line of when you say, “And here’s where it’s gonna stop with the SQL and start with the other.” Is that, you get 10 people who are fluent in both, they would all agree basically on the same spot? Or does it really… How much is personal preference?

26:11 PK: I think it’s coming from… I think it’s personal preference and it’s coming from experience that some manipulation you try to do would be just much faster in SQL. So imagine that you want to do a CASE WHEN statement, R is super slow with that. And then, if you try to replace that with If-Else function, then good luck with all the nesting you’re trying to do, because If-Else function is much faster than CASE WHEN.

26:37 MK: Did not know that. Did not know that. I prefer If-Else because I don’t like the squiggles in the CASE WHEN. But that’s good to know.

[laughter]

26:49 TW: For aesthetic reasons!

[laughter]

26:53 Michael: That’s good information.

26:56 PK: So good luck with nesting this If-Else, If-Else inside If-Else and then reading that or giving someone to read in R. It’s just madness. So instead of doing that, you can just write a simple CASE WHEN function in SQL that is pretty fast and, yeah, will give you exactly the results you’re looking for.

27:14 MK: So…

27:15 PK: So, it’s…

27:15 MK: You’re… Okay.

27:16 PK: Yeah.

27:17 MK: Basically, I’m on this quest at the moment to be less sloppy with code. And one of the other tips that I had, and I still have… I understand from an engineering perspective why this would be good practice, but from an actual, I don’t know, personal perspective, I find it really tough. So one of the things that was recommended to me, was that in a case… So let’s say you have a chunk of SQL that’s sitting in R or Python that instead of pasting in the SQL, you should actually be calling in a separate SQL file that has the code in it. Does that make sense?

27:48 MK: So instead of just having it as a string in your R code or your Python code, you would actually be referencing a separate file which has your SQL code just saved as a SQL file and that for like code reviews and all that sort of stuff, that’s actually better practice. But then I’m like, “Oh, man.” You could be reading one chunk of code in R that references five different SQL queries. So to actually get to the bottom of understanding what’s going on, you would have to go back and reference five other files. What are your thoughts on this? What do you think about it?

28:18 PK: I’m pretty sure that this idea came from some programmers, some software developers, because if you think about some Python code, it’s basically whatever you do in Python, you’re trying to save as some functions and then save as files and then access those files or packages or some files saved. So I think this is a similar idea. I would say that my code is not as clean as that person would like to. I’m trying to make it cleaner and I’m trying to get these things, yeah, these best practices. So I’m pretty sure that it’s just easier than for a person, ’cause imagine that you’re getting your file where you have this one big 1,000 rows, and inside these 1,000 rows, you have 500 rows of SQL. And then going through that would be so much harder inside the R file, especially, they have no formatting for SQL in R. So then, it would be so much easier to go to that file and then look at this file with formatting of SQL and then reading that through there.

29:22 PK: So I’m pretty sure that I think we analysts don’t have this QA experience, so we’re not actually doing, like keeping something because someone is going to read after we’ve finished our work. So we don’t have these things, and we think, “Okay, I did that. So this is easier for me and I will save these five minutes of not creating extra file for that,” when actually, if more people are using the same code, it will be easier for them to have this separate, in separate files.

29:49 TW: Is there the same concept, ’cause I’ve run into that. I mean, the scrolling, like it’s a funny thing with coding that’s come back to me that when you wind up with 1,000 rows even if it’s just R, it can start to be unwieldy, and definitely that’s one of the reasons of doing, like including files, saying, “Can I modularize this and put functions in a file and include those? And now I have more files, but none of them are that long and therefore I’ve got kind of a structure that makes sense.” Is there that concept in SQL where you can have nested SQL in separate files? Like I’ve never… Can you modularize SQL such that you don’t have 500 rows of SQL that you’ve got, you’re referencing multiple SQL files or is there not that concept?

30:35 PK: I don’t think you can actually have it in SQL. I think one file is one execution usually.

30:40 TW: Okay.

30:41 MK: Yeah, I have seen SQL that is like 5,000 rows.

30:45 PK: Yeah, yeah, they’re usually as big. You have some… If you have some extra things like insert a table, create a table, then usually it looks like you have five different codes to execute, but it’s still treated as a one file.

31:00 TW: So that’s the other thing. We’re opening up SQL, and I guess maybe that just comes along, understanding somebody else’s SQL, is that… If you’re fluent and they’re writing clean SQL, is it analogous to reading somebody’s R or Python code that’s been well commented and reasonably well-structured that you should be able to go through and fairly easily pick up what they’re doing and finding the part of that 5000 rows that you’re interested in? Or is that more challenging?

31:33 PK: I think it’s easier just because you have less functions and less things to actually know for SQL.

31:38 TW: Oh, you don’t have to say, “What does this do?” Okay.

31:41 PK: Yeah, yeah, yeah. So it’s rarely. Like SQL, I think, for analysts, SQL is mostly like some aggregating joining stuff. So, if you are aware, and you know the schema of the table that this query is going to work on, then you actually will be able to quite easily read that and understand that. So instead of like, you don’t have this many 1,000 libraries and then worrying to go to the computation and figure out what’s the input, what’s the output at every function of that library. In SQL, it’s usually much easier. And even if you have some advanced functions like some Windows functions, they are usually named the way that you actually know what to expect, like function is called “first value,” which means that you’ll get first value. Or function is called “count,” which you know that it will be some counting there.

32:26 MK: Yeah, I think SQL is easier to read. Like even just in the workplace, you see people, they don’t… In R and Python, people leave comments and that sort of thing. In SQL, it’s like you don’t need that ’cause it’s almost so logical and…

32:43 PK: Yeah, yeah, yeah.

32:44 MK: And I don’t know. It’s almost like you would speak, like, “Sum this and count this and join it to that,” like it actually, it seems more logical to me.

32:53 Michael: So that’s weird. So then what actually drives the different flavors of SQL? And in the BigQuery world, there’s definitely always the standard versus legacy. But my understanding is there are other… I’ve never really understood if it’s fundamentally just, select from where, group by count. What are the actual differences of the different flavors of SQL and how much do those actually come into… How quickly do you run into, “Oh, crap, I’m using a different flavor of SQL. I need to change the actual query.”

33:30 PK: So it’s obviously a vendor’s fault as always. Same as we cannot find vendors that will agree what session is, the same way vendors of databases they couldn’t agree how to structure their languages. So yeah, different companies started developing different things and that’s why there are some small differences between some languages. Sometimes how you call the table, but sometimes, like for MySQL, we actually don’t have Full Join, for example. So some function is missing. Then some…

34:01 MK: Full Join?

34:02 TW: You don’t have… You don’t have what?

34:03 PK: Full Join.

34:03 TW: You don’t have a Full Join.

34:04 PK: Full Join, yeah.

34:04 Michael: Full Join.

34:05 PK: You have just the Join.

34:07 MK: What?

34:07 Michael: Wow. Okay.

34:09 PK: So, in MySQL, instead of doing Full Join, you’re doing the Left Join, then you’re creating the second table where you have Right Join, and then you just unionize the table so you create a union and you’re just putting one table on top of the other.

34:24 MK: That sounds really painful!

[laughter]

34:26 PK: Yeah, I can imagine. That could be a problem for many people.

34:30 MK: Oh, man!

34:31 PK: So I think, coming back to your question, you’re learning that by trying to do some stuff and you see, “Okay, it doesn’t work. Why it doesn’t work? Okay, because I try to do this in different SQL version and now I need to switch this.”

34:43 Michael: So what’s the story behind standard and legacy in BigQuery specifically?

34:49 PK: So, that’s a sweet story because Google just created something that’s supposed to look like SQL but never was ever a SQL. And then, I think that was just because it was some internal tool and they had internal people that not always were some analysts and stuff. They tried to make it as easy as it could be for a final user, but then, over time, they learned that it’s actually not that efficient as it could be and it could actually, you can use standardized version of some SQL and actually combine that.

35:17 MK: But there’s also a lot of limitations, like there’s a lot of things like I started learning on Legacy SQL and then had to migrate, which was painful in and of itself. But now that I’m on normal or standard SQL in BigQuery, there’s so many functions that you couldn’t do, especially as it pertained to date functions and things like that in Legacy SQL that you would just be like banging your head against a wall because there was just no way to do it ’cause it just hadn’t been built.

35:44 Michael: So, can either of you actually think of one specific example of something that was in… Like when you had to do that migration, because I don’t know anything besides that. You just kind of referenced the date functions. Did it change the way that you’re working with dates in Legacy, then you had to change to something more powerful but different in standard SQL? Like what’s one example of, as you were having to do that cut over that you had to actually go and change the query?

36:12 MK: I know the tables, and Pawel will know this more than me, but there was an issue because they separate them by day. And so you had to change actually how you queried them by day. When I swapped over, I remember that being a pain in the butt.

36:25 PK: Yeah, the other thing, some cloud functions don’t work with Legacy SQL. So if you, for example… It’s not actually trying to get the data, but if you try to schedule your query, then it doesn’t support the… It’s not Legacy SQL anymore. So you need to use standard SQL for that. So there’s some functions of this whole cloud, Google Cloud’s infrastructure, that just limits the Legacy, something that they actually build. But it’s…

36:51 MK: I don’t think I actually… I don’t think I knew that that there were some things in cloud that didn’t support Legacy, which is actually kind of funny.

37:00 PK: This is because the scheduling course is actually pretty new. I think it’s still in, not fully released. So it’s still some kind of beta. But, yeah, this is the one thing that you actually, you cannot actually use the Legacy. The other thing is that if you want to create user-defined functions for stuff, also there’s no… You cannot do this in Legacy, you need to use standard SQL.

37:20 TW: Wait, so what’s a user-defined function in a SQL context?

37:24 PK: So, for example, let’s say that… Okay, so the way that Google stores the data, the money information, it’s one million just to not make it flow. So the way they want to store the data is if there’s any kind of money, instead of $2, it would be $2 million. This is how they store the data. And then if you want to write a function that every time you call the function on any of the money fields, it will just do this, just divide it for yourself instead of doing this whole calculation. So, yeah, by yourself, then we can use this defined function that will do that for you if you call it on that field.

38:00 TW: But you actually define the function?

38:02 PK: Yeah.

38:02 TW: There’s SQL that you actually define it within the query, kind of like you define a function in R or Python, and then you can call the…

38:10 PK: Yeah.

38:10 TW: Function. Okay.

38:11 PK: Yes. So actually, on top of the query, you can define the function. So you cannot reuse the function, so it won’t be… You actually need to have a function to query you, to try to run. So it needs to be on top of your query. And what’s actually pretty useful, that you can also have a function in JavaScript for standard SQL in BigQuery. So if you want to do some really complex stuff with your data. So for example, you want some loop over your data that you cannot loop in SQL, then you can loop in JavaScript, and then it will return with whatever you’re looking for there.

38:43 MK: Yeah, but Pawel and I were having a conversation about user-defined functions awhile ago, and I’m like, “When will I ever need to know this? This is like next level.” And then, of course, you come out with actually really a good use case for why I would need to know it. So…

38:56 Michael: So if someone wanted to start learning SQL, where do you suggest they start? Where’s a good place to get a sort of a basic understanding or some basic experience?

39:09 PK: So my favorite place that I always reference when someone asks me that question is W3Schools. They have a pretty good explanation of SQL. There is Mode Analytics tutorial that also covers all the material. You can test all your queries there. Then I would go to your BI and ask for access to your transactional database or whatever your company is doing and just start asking the business question that you can try to ask for SQL and then start thinking about the cases, how I can enrich my data, some extra information from outside of GA with that information that’s coming from different databases.

39:48 PK: So when I was working for a betting company, we couldn’t track all the information about the bets that were placed. We had transactional information, but actually having all information about the bets was impossible to keep in GA just because the infrastructure of GA couldn’t support that. So what we were doing… So what I had to do every time, I could just go to BigQuery, get information about transaction ID. And then if I wanted to know what exactly bets were part of this transaction, then you had to go to the transactional database.

40:17 PK: So if you think about your company, if you have some inclination that they don’t have in GA, and you want to start using it for some analysis, then I would recommend that. And also there is a very cool website. I think it’s called HackerRank, when you have actually tasks of SQL and just going there and it gives you answer if it’s wrong or right. So it’s a pretty good way to learn after you take all those tutorials and read the other information. I wouldn’t recommend going to books just because I think you can just bounce off them and not actually learn just because you’re going to read, not actually try stuff.

[laughter]

40:56 PK: And for SQL, I think the basics of SQL that analysts need, so not actually administrating databases, it’s just easier to try stuff, but not get this whole background and history.

41:10 MK: Damn, you are full of the resources.

41:12 Michael: Now they have this podcast. So there you go.

[laughter]

41:17 Michael: Well, this conversation has been so awesome, and it’s so great to have you on the show, finally, Pawel. All of us have been getting to know you over the years. You’re such a great guy. And one of the things we like to do on the show is go around and do a last call, something that we found that’s interesting and fascinating that we wanted to share with our audience. So, Pawel, you’re our guest. Do you wanna go first?

41:41 PK: Yeah, I can do this first.

41:43 Michael: Okay.

41:43 PK: I would like to recommend a book, which is pretty close to what I do a lot in my free time. The book is called Edge: What Business Can Learn From Football. It’s written by an English football writer, Ben Lyttleton. But basically, the book covers how some businesses get inspiration from football teams and how they translate the problems of football teams to the business world. I think that the problem that digital analytics community has is that we not try to learn from different communities and from what was actually achieved somewhere else, and we’re missing that exchange of information. So I think we should try to find as many inspirations from outside of our industry. So even for that, I think that just gives you a different point of view on some stuff would be something.

42:33 Michael: Pawel, you know that we don’t support that on this show.

[laughter]

42:37 Michael: We can not… No, I’m just kidding. [laughter]

42:40 TW: How much do you have to know and understand football to, you think, to get out… I mean, you’re kind of a sports, and you’ve been, even in this football data, I’m not really a…

42:49 PK: It’s not actually about… It’s not only about the football. So you don’t need to understand the game and what’s happening on a field.

42:55 TW: Okay.

42:56 PK: They’re also about the culture of the teams, and you can actually get, what we can learn from culture of the teams. So there is a team in Spain that they only hire players that are from the region that they are and they are like pretty successful over the years and there’s like explain how the business can learn from that. So many different examples of the different cultures of the teams and what we can get from that. So you don’t need to be actually a fan of the game to enjoy the book.

43:24 Michael: Yeah. And just for our US audience, we’re talking about not American football, but football the way that the rest of the world understands it.

[laughter]

43:34 Michael: Yeah, okay. Moe, do you have a last call?

43:37 TW: Do you have an anti-last call?

43:38 MK: I have an anti-last call. I am…

43:41 TW: Anti…

43:42 MK: Being a bit of a rebel. So I had this really interesting conversation towards… Yeah, oh man. I had this really interesting conversation towards the tail end of last year where I was talking to someone about how much we should all tinker in our spare time. And we’ve actually talked about it on the show before, like can you actually be successful in this industry if you’re a 9-5 person who finishes at 5 o’clock and walks out the door. And anyway, we had this really great discussion about it and one of the kind of, I don’t know, “aha moments” that we had chatting was that you also need time away. And the reason that you need time away from whatever it is that you’re working on is because often you need to think through and you need to process. So being the start, well, start of the year, I have actually spent the last couple of weeks taking a bit of a break.

44:31 MK: I have not picked up work books. I have done lots of cooking. I have done lots of hanging out with my dog. I have been to the beach. And I think sometimes putting everything down, putting the tools down and spending a bit of time away, you actually get a lot of clarity about some of the tricky problems you’re trying to work on. So the reason that it’s an anti-last call is also because when you actually pick up a book that you just wanna read for fun, which Tim never does, ’cause he’s an overachiever…

44:57 TW: That is not true.

44:58 MK: Notice like you just suddenly are excited to go to bed and read your book, because it’s not… It’s something to do with stats and R, which Tim just does for fun.

45:08 TW: No, I have… You cannot paint me. I have to… I have read more John Scalse I think than Helbling has at this point, and he’s the one who introduced him to me.

45:19 Michael: Yeah.

45:20 TW: So…

45:20 MK: Yeah.

45:21 Michael: Well, I guess, is that gonna change your last call then, Tim?

45:24 TW: No. [laughter]

45:25 MK: He’s gonna be like, “So I’ve read three books over the holidays. It was great. They were all about R.”

45:29 TW: So… [laughter] Fuck. So I’m gonna have an R one.

[laughter]

45:39 TW: Well, so I was gonna relate it to this episode and as I understand it, so one final little… Am I right, in SQL, there is really no visualization capabilities within SQL? Is that right?

45:50 PK: Yeah, I don’t think there is. If there is…

45:52 TW: Okay.

45:52 PK: There’s some tool that pretends that it’s in SQL but actually, there is some other things.

45:55 TW: It’s not, so…

45:56 PK: Yeah.

[laughter]

45:57 TW: So there is some code that’s pretty limited code that’s generative art with R, where you basically… So it’s kind of… It’s frivolous, but it’s using R to actually generate different patterns and many of them are kind of cool and mesmerizing patterns. So that’s my definitely not an idle, not industry-related thing, but the generative art code with R to make fun and interesting patterns.

46:28 MK: Bedtime reading.

46:29 TW: What do you have?

[laughter]

46:32 Michael: Fascinating.

46:33 TW: Bed time coding.

46:33 Michael: Bed time coding.

46:35 TW: What do have, Michael? We can’t wrap up fast enough in mine.

46:38 Michael: Well, actually, I’ve got a couple of things. Well, first off, I can’t help but mention, even though it might even be too late to get tickets, but go check it out. So just a couple of weeks, the Super Week conference is gonna be held in Budapest. It is one of the best analytics conferences ever of all time and our own Tim Wilson will be there doing all kinds of shenanigans, apparently. And so, if you can still get tickets, go check it out. There’s not enough time, but definitely go check it out and see if you can still go.

47:14 TW: I’ll be having an idle casual fiction book group that I’ll be conducting.

47:18 MK: That’s right.

47:21 Michael: Casual fiction corner.

47:23 TW: Yeah.

47:24 Michael: So, and then the other one is actually more about our show, which is we are hiring and we are trying to find someone to potentially join our show as a producer. It’s… We’ve actually got a job description and everything. So if you’re on the Measure Slack, check out job postings. By saying this right now, I’m now making sure that that’ll be there by the time this show comes out.

[laughter]

47:50 TW: And there’s a link in the show notes if you go to analyticshour.io.

47:53 Michael: There’s… Yeah, and it’ll be on our site as well.

47:56 TW: What else would we like to commit to? [chuckle]

47:58 Michael: Yeah, there you go. Well, but it’s something where we’ve talked about. It’s a very lucrative position, obviously, and so if you are someone who is interested in analytics and the podcasting thereof, check it out. Okay. Well anyways, thanks for doing last calls everybody. And this has been a really good conversation because unlike how I feel sometimes these conversations go where they’re very philosophical about these things, I felt like this was very accessible and had real tactical knowledge that was being shared. And so that I think makes this kinda nice. So, Pawel, thank you so much for coming on the show.

48:44 PK: Thank you for having me.

48:45 Michael: If you have more questions or you’re like, “Who’s this Pawel guy and how do I contact him?” Well, I’m glad you asked. You can easily reach all of us through the Measure Slack and also our Twitter accounts, and also on our website analyticshour.io. So we’d love to hear from you. Remember, it’s a new year, which means a new chance to just go over, log into iTunes, and submit a rating and review of the show. We would appreciate it. Apparently, people doing that does something to something else, which then makes us keep getting more exposure, which is probably good ’cause analysts need more exposure. Alright. Well, for my two co-hosts and myself, I want us to kick off this new year, and tell everybody, remember, even though it’s 2019, you still gotta keep analyzing.

[music]

49:50 S1: Thanks for listening, and don’t forget to join the conversation on Facebook, Twitter, or Measure Slack group. We welcome your comments and questions. Visit us on the web at analyticshour.io, Facebook.com/analyticshour or at Analytics Hour on Twitter.

[music]

50:09 Speaker 6: So smart guys want to fit in. So they made up a term called analytics. Analytics don’t work.

[music]

50:17 Speaker 7: Analytics. Oh my God, what the fuck does that even mean?

[music]

50:25 S8: Well, well, well.

[music]

50:27 MK: Okay. Pawel, this time, I’m gonna try not to forget your advice. And I feel like it’s gonna be more memorable because you said it instead of Tim. I don’t wanna mess with your creative juju or whatever.

50:40 S9: No, please do not mess with my creative juices.

50:44 S10: Know your fucking role.

50:46 S11: That’s alright.

50:46 MK: Jeez, he needs a coffee. Oh Jesus. We’re getting off to a bumpy start here, kids.

50:54 S12: Yeah. You’re the first person to record in one of those, I think.

50:58 S13: I was hoping for that, actually.

51:00 S12: I think it will likely be that you’ll be probably the best guest ever because of this.

51:05 S13: Wait, what did I miss?

51:06 S12: He’s wearing his Power Hour t-shirt.

[laughter]

51:12 S12: So right there, yeah.

[laughter]

51:14 PK: You can take a screenshot instead of making a photo with your phone.

51:19 TW: Pawel, you’re such a… You’re such a Tim.

51:21 Michael: Don’t.

51:22 TW: That was… Could you read the thought bubble over my head?

51:24 MK: Oh, shit. Oh I love him.

51:28 TW: Tim hates that guy. [chuckle]

51:30 MK: I’m gonna start sending you motivational posters, Tim.

51:32 TW: Yeah. Yeah. Can you tell me who that is again so I can just… Can I block people on LinkedIn, just to make sure that doesn’t crop up in my…

51:41 MK: See? Overachiever. I need to get people like in a good head space.

[music]

51:48 TW: Rock flag and select from where.

[music]

2 Responses

  1. […] If you are at the start of your journey on using SQL try to listen to this Digital Analytics podcast episode 106 dedicated to SQL. […]

Leave a Reply



This site uses Akismet to reduce spam. Learn how your comment data is processed.

Have an Idea for an Upcoming Episode?

Recent Episodes

#243: Being Data-Driven: a Statistical Process Control Perspective with Cedric Chin

#243: Being Data-Driven: a Statistical Process Control Perspective with Cedric Chin

https://media.blubrry.com/the_digital_analytics_power/traffic.libsyn.com/analyticshour/APH_-_Episode_243_-_Being_Data-Driven__a_Statistical_Process_Control_Perspective_with_Cedric_Chin.mp3Podcast: Download | EmbedSubscribe: RSSTweetShareShareEmail0 Shares