43 minutes
SDS 491: R in Production
In this episode, we bonded over powerlifting, discussed R as the production option for data models, Veerle’s favorite R tools, and more!

About Veerle van Leemput
Veerle is an entrepreneur, full-stack developer and data scientist who gets excited about data and programming. She is Managing Director and Head of Data Science at Analytic Health, a UK-based company that develops intelligent and accessible technology which gives organisations the tools they need to accelerate innovation in healthcare. At Analytic Health, they run all processes in R, from data gathering to deployment, showing that you can run R in production.
Overview
Veerle first came to my attention not because of data, but because of our shared love of powerlifting. After I posted an update on my deadlifting, Veerle commented, and from there we got into Veerle’s own work as a competitive powerlifter in the Netherlands. We geeked out for a bit about powerlifting, then moved on to how Veerle is running an entire business on R, which is often stigmatized as a tool only for academics and not for actual production.
At Analytic Health, they aim to accelerate progress in healthcare by collecting and utilizing data. The data gathering, modeling process, and web applications are all done in R. One part of the job is juggling data pipelines, which draw primarily on the NHS but can involve dozens of separate sources. They use R in automated processes that gather, clean, and export data from these pipelines. The processes are scheduled with cronR and monitored through daily email reports, also built in R with blastula, which the team discusses in its daily standup. For development, they use RStudio Server, where apps, APIs, and reports are built; for deployment, they use RStudio Connect. An important part of this is separating development work from production work, and their solution is to run the two on separate servers.
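As a rough illustration of the scheduling and monitoring steps in the paragraph above, here is a minimal sketch in R. It assumes the cronR and blastula packages are installed and that the machine has crontab available. The job id, schedule, process names, and run log are made up for illustration and are not Analytic Health's actual setup. The cronR call uses `dry_run = TRUE`, so nothing is written to the crontab.

```r
# Sketch only: job id, schedule, and run log are illustrative assumptions.

# --- 1. Schedule an R script with cronR (Linux/macOS, needs crontab) ---
if (requireNamespace("cronR", quietly = TRUE)) {
  # cronR ships a small demo script; it stands in for a real ETL entry point.
  etl_script <- system.file("extdata", "helloworld.R", package = "cronR")
  # try() keeps the sketch running on machines where crontab is unavailable.
  try({
    cmd <- cronR::cron_rscript(etl_script)
    cronR::cron_add(
      command     = cmd,
      frequency   = "daily",
      at          = "06:00",
      id          = "etl_demo",          # hypothetical job id
      description = "Demo ETL job",
      dry_run     = TRUE,                # do not modify the crontab
      ask         = FALSE
    )
  }, silent = TRUE)
}

# --- 2. Summarise a made-up run log for the daily report ---
run_log <- data.frame(
  process = c("nhs_prescribing", "sales_import", "pdf_extract"),
  started = c("06:00", "06:15", "06:30"),
  status  = c("ok", "ok", "error"),
  stringsAsFactors = FALSE
)
n_failed <- sum(run_log$status == "error")
subject  <- sprintf("Data pipeline report: %d of %d processes failed",
                    n_failed, nrow(run_log))

# One Markdown bullet per process, joined into a single string.
body_text <- paste(
  c("## Daily data pipeline report", "",
    sprintf("- **%s** started %s: %s",
            run_log$process, run_log$started, run_log$status)),
  collapse = "\n"
)

# --- 3. Turn the summary into an HTML email with blastula ---
if (requireNamespace("blastula", quietly = TRUE)) {
  report <- blastula::compose_email(body = blastula::md(body_text))
  # Sending is left out on purpose: it needs SMTP credentials
  # (blastula::smtp_send()) or a scheduled document on RStudio Connect.
}
```

The summary logic is kept in plain base R so it can be checked without any mail server; only the last step depends on blastula.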
We talked a bit about their use of R Shiny, where you develop a user interface (the front end) and a server (the back end) to spin up applications. Veerle is an advocate of using R Shiny for more than just dashboards; she uses it for full application development as well. Your end users don’t care what tech stack you used to develop a product, and R Shiny works perfectly well. I’m guilty of thinking of R Shiny as only a dashboard tool. A really cool opportunity here is that I, personally, am not an expert in JavaScript or HTML but can make fully functional dashboards in R Shiny, so why not use it for an app if you’re in a similar boat? If you come from Python, writing R Shiny code will feel much closer to what you already do than learning those other languages.
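To make the UI/server split concrete, here is a minimal Shiny sketch. It assumes the shiny package is installed; the data frame, input id, and output id are invented for illustration and have nothing to do with Analytic Health's real applications.

```r
library(shiny)

# Made-up data standing in for whatever the application would really show.
demo_data <- data.frame(
  product = c("A", "B", "C", "D"),
  units   = c(120, 85, 42, 10)
)

# Front end: what the user sees and interacts with.
ui <- fluidPage(
  titlePanel("Units by product (illustrative)"),
  sliderInput("n_rows", "Rows to show", min = 1, max = 4, value = 2),
  tableOutput("preview")
)

# Back end: reacts to inputs and fills the outputs declared in the UI.
server <- function(input, output, session) {
  output$preview <- renderTable(head(demo_data, input$n_rows))
}

# Bundle both halves into one app object.
app <- shinyApp(ui, server)

# In an interactive session, start it with:
# runApp(app)
```

Everything here is plain R; Shiny generates the HTML and JavaScript, which is why no front-end expertise is needed to get started.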
In this episode you will learn:
- Our shared powerlifting passion [2:47]
- The stigma of using R [12:02]
- What does Analytic Health do? [13:55]
- How Analytic Health uses R [19:08]
- Tidyverse [34:44]
- Tools for API creation [37:09]

Items mentioned in this podcast:
- Analytic Health
- SDS Challenge - 99 Days to your first Data Science Job
- SDS 485: Financial Data Engineering
- cronR
- blastula
- RStudio Connect
- R Shiny
- RStudio Server
- golem
- Tidyverse
- data.table
- dplyr
- dtplyr
- R Plumber
- SDS 337: Hadley Wickham Talks Integration and Future of R and Python
- The Art of Thinking Clearly by Rolf Dobelli
- JavaScript for R by John Coene
- Mathematical Foundations of Machine Learning
- Deep Learning Illustrated by Jon Krohn
Episode Transcript
Jon: 00:00
This is episode number 491 with Veerle van Leemput, Managing Director and Head of Data Science at Analytic Health.
Jon: 00:12
Welcome to the SuperDataScience podcast. My name is Jon Krohn, a chief data scientist and bestselling author on deep learning. Each week we bring you inspiring people and ideas to help you build a successful career in data science. Thanks for being here today and now let's make the complex simple.
Jon: 00:42
Welcome back to the SuperDataScience podcast. I am absolutely delighted to have Veerle as my guest on the program today. Hailing from the Netherlands, Veerle has held a number of data science leadership roles at Dutch companies. She now serves as Managing Director and Head of Data Science at Analytic Health, a London-based firm that builds data-centric software for the healthcare industry. On the side, Veerle is an impressive podium level weightlifter on the Dutch national stage.
Jon: 01:13
Beyond bonding over powerlifting, in today's episode Veerle details for me how R is not only an option for production software, but may in fact be the best production option for you if data or data models are central to your application. Specifically, Veerle runs down for us her favorite R tools for data gathering, model development and deployment into production systems. Today's episode will primarily be of interest to technical professionals like data scientists and software developers, but we did our best to break down the technical concepts. And we do have a lot of laughs in the episode, which could make it appealing to anyone who enjoys a good giggle. All right, you're ready for another awesome episode? Let's go.
Jon: 02:00
Veerle, welcome to the program, I'm so excited to have you on. Where in the world are you calling in from?
Veerle: 02:07
I'm calling in from Leiden in the Netherlands, it's just below Amsterdam.
Jon: 02:12
Yeah, we need to know. Everyone wants to know how is that in relation to Amsterdam?
Veerle: 02:18
Amsterdam is the only place in the Netherlands apparently.
Jon: 02:23
Is there a really good football team in Leiden? I think there is. If I heard better in that context.
Veerle: 02:28
I wouldn't know, they most of the time [crosstalk 00:02:30].
Jon: 02:32
Oh, no kidding. Did you ever play hockey?
Veerle: 02:34
Yeah. It is styling at least. No, I don't. I don't play hockey.
Jon: 02:37
Ah, you don't. I grew up in Canada, so I must play hockey. It's a part of growing up there.
Veerle: 02:43
Oh, really?
Jon: 02:46
Yeah. So, well, I guess we won't have ice hockey to talk about, but we do have powerlifting. So that's how you originally came to my attention. So because we're both data scientists, we're in each other's LinkedIn network, however that happened. And then I think you commented on a post. So we're recording at the beginning of July and about a month ago... Actually, I think it was exactly six weeks ago because I have six week weightlifting cycles. And yesterday I again had a [inaudible 00:03:16] for my deadlift.
Jon: 03:18
So I think six weeks ago, I posted my all-time deadlift PR, which was 405 pounds. And probably after some initial confusion about kilos and pounds, you commented on the LinkedIn post. I can't remember what you said, but then it... Oh yeah, you said something about, "Have you ever thought about competing?" And I said, "Well, I've done an Olympic weightlifting competition. I've never done powerlifting and I'm definitely interested." Then I was like, "Well, why would somebody ask this? Veerle, do you do powerlifting?" And you said...
Veerle: 03:52
Of course. Yes, I am a powerlifter. And in fact, a couple of weeks ago I participated in the Dutch Nationals Powerlifting.
Jon: 04:00
No way.
Veerle: 04:01
Yes. And I came in second. So I'm vice champion [inaudible 00:04:05]. Yeah.
Jon: 04:07
Wow. That's incredible. This is really exciting. I didn't know that you were that into it. So, okay. So we should let the audience know exactly what powerlifting is. So I think there's always three movements in a traditional powerlifting competition, right?
Veerle: 04:21
Yeah. There's three movements, there's the squats, there's the bench press and there's the deadlift. And those three you need to get the highest weight, and combined it's your total and the total determines your ranking basically.
Jon: 04:33
And so you just add up across the three, back squat so you've got a barbell on your back, you squat to below parallel and then back up-
Veerle: 04:42
Below parallel. Yeah. Yes.
Jon: 04:46
Exactly. Deadlift is the [crosstalk 00:04:47] one of those. Oh yeah, bench press. So is that the order? Is always in the same order in competition? You always do...
Veerle: 04:52
Yes.
Jon: 04:52
... back squat, then bench press. Bench press, I think a lot of people know that one you're laying on a bench horizontally.
Veerle: 05:00
And you press the weight. Yeah.
Jon: 05:01
Yeah. Absolutely.
Veerle: 05:02
The important thing though, in powerlifting you need to pause at the chest. So it's not like touch and go which you see normally in the gym [crosstalk 00:05:08] you have to wait for the judges to say, "Okay, press."
Jon: 05:13
Oh, really?
Veerle: 05:13
And then you can go.
Jon: 05:16
Oh, geez. That makes it a lot tougher.
Veerle: 05:18
Definitely.
Jon: 05:20
I've got a really bouncy rib cage. So that's my key to bench press success.
Veerle: 05:23
Oh, really?
Jon: 05:25
Really bounce it off of there [inaudible 00:05:27].
Veerle: 05:28
[inaudible 00:05:28] really pausing.
Jon: 05:31
Just drop it, and catch it on the way back up. And then the third movement is the deadlift, so that's the video that I posted six weeks ago at time of recording. And so that is... It's kind of the simplest idea. You've got a barbell on the ground and you need to lift it up. You need to stand up straight shoulders back. And of the three movements that's the one that typically people can lift the most of.
Veerle: 05:58
Same for me.
Jon: 05:59
Yeah. It would be surprising if it was otherwise, if you benched more than you...
Veerle: 06:02
It would be epic. If I benched more than my deadlift I would be very, very good.
Jon: 06:11
And that's how you become the second most powerful powerlifter in the Netherlands. So would you mind telling us what was your combined score? What was your combined weight across the three?
Veerle: 06:22
My combined score... Geez then I'd have to do the math really because it's 115 kilos squats, then we had 67.5 bench and 152.5 deadlifts. And combined I think that's 324?
Jon: 06:42
I don't know, I don't have a calculator out. But...
Veerle: 06:47
A lot.
Jon: 06:47
It is a lot. And then so for people that want to do it in pounds, you need to multiply it by 2.2 and that'll give you the weight in pounds. And I think we can conclude Veerle can lift a lot of weight. And so this is really cool. I didn't know that you were so actively into it. And so what are you doing now? So you had the big national competition two weeks ago. Are you back in a training cycle now training for something else?
Veerle: 07:10
Well, actually I started to focus a bit more on Olympic weightlifting now because I just [crosstalk 00:07:15] that is. So, yeah, I'm now into snatches and clean and jerk. But that's just the... Yeah, I don't know, I really liked it. So I don't know if it's temporary yet, but I'm still a powerlifter but now doing a sidetrack into weightlifting. And I'm the proud owner of a fully equipped gym at home as well, both powerlifting [inaudible 00:07:37] weightlifting up. It's like a giant hobby. It's like getting out of control a bit.
Jon: 07:43
I understand. I'm super fortunate to have a very well-equipped gym across the street. It's basically unheard of, I'd have to be absurdly wealthy to have a fully equipped gym in my apartment in New York. That would be incredible. Maybe that's something to aspire to. [crosstalk 00:07:58].
Veerle: 07:58
That is expensive hobby to have your own gym. Yeah. [crosstalk 00:08:04].
Jon: 08:06
And so, yeah, so Olympic weightlifting, that's the one I'm much more familiar with. And I've only done it once. I competed once, and I was okay for my... I've lift a fair bit if you don't consider weight or gender.
Veerle: 08:23
Okay. That's important [crosstalk 00:08:28].
Jon: 08:30
I know. So once you put me in my weight class, it's not so impressive, but for the audience, there's only two movements in Olympic weightlifting. There's Veerle already mentioned them, the clean and the jerk and the snatch. But they are more technical, I hope you don't mind me saying [crosstalk 00:08:48].
Veerle: 08:49
Yes. It's very true. It's much more difficult really. Because powerlifting is more brute force and weightlifting is like... If you're one inch off, then you miss your attempt. It's very different road. But that makes it kind of a challenge.
Jon: 09:05
It is. It's nice. It's like there's so much more on positioning and timing, accuracy. I enjoy it a lot.
Veerle: 09:13
Yeah, me too.
Jon: 09:16
And yes, people can probably look up videos to get a sense of how a clean and jerk and a snatch works, but it's the same. So it's the same barbell that you use for all of the powerlifting movements. Although, I guess technically speaking you could have it different. You might even given that you've such a well-equipped gym, you probably have two different barbells.
Veerle: 09:34
Yes. I have two, women's weightlifting bar and a powerlifting bar. Because the weight differs between bars [inaudible 00:09:40].
Jon: 09:41
Wait, what?
Veerle: 09:43
Women do have a lighter bar than men. So the standard bar is 20 kilos, but for women it's 15. So I have two bars. Yeah.
Jon: 09:52
But you also have two 15 kilo bars. You have one for powerlifting and one for Olympic.
Veerle: 09:57
No powerlifting is 20. It's always 20.
Jon: 10:00
Oh, it's always 20. Oh, interesting. I didn't know that. Wow. Okay. Oh, yeah, so in terms of the idea though the barbell it looks... It's a barbell, at a distance it would look the same whether it's powerlifting or Olympic weightlifting. And so with a snatch, you have to in a single movement get the bar overhead.
Jon: 10:25
So it's on the ground like with a deadlift, but then you in a single movement, boom, it's over your head and you have to stand up with it over your head and show control. And the clean and jerk, you get to do it in two, so up to your chest and then overhead. And so you can do more weight that way. Anyway, so do you have a specific date in mind for that, for your first [crosstalk 00:10:46]?
Veerle: 10:46
The competition? Well, perhaps in October, but yeah, it depends. I really want to get a good snatch and then I might be ready to get to the platform, but I'm not there yet. I can tell you, my snatch is like basically a very segmented lift at this point in time. It's not smooth at all, but yeah, we learn every day. Right? You can only get better.
Jon: 11:14
Well, very exciting. I look forward to watching the journey. I'm going to say that again, because I just hit my mic with my hand. I don't know if that worked. Well, very exciting I'm looking forward to seeing how this journey unfolds for you. And I do hope that you stay in touch, not just about data science, but about this as well. It's really cool. And I hope... Do you ever post about your weightlifting stuff on LinkedIn?
Veerle: 11:37
Not on LinkedIn. No, I keep that separate. But I do have a very dedicated Instagram account to all my lifting [crosstalk 00:11:44].
Jon: 11:45
Oh, okay. Well maybe you can share that at the end as well in addition to your LinkedIn details. All right. But let's get away from Instagram style chat to LinkedIn style chats. So yeah, so we know each other from LinkedIn. And so yeah, you came to my attention because of this powerlifting post that I made. And then shortly thereafter, I noticed that on June 22nd, you gave a talk at R-Ladies Amsterdam and the talk was on R in production.
Jon: 12:21
And my mind was blown. I'm constantly on this show and in life in general saying I like R and I genuinely do. I was a R user for years before I started using Python. And so a lot of my statistical programming knowledge came from using R, and I really like it. But in the last five, six years, I don't use it very much because I started working at a startup where we're putting models into production. And I've always had this idea in my head, and I don't think I'm the only one who goes around spreading this propaganda. But there's this idea in the data science community that especially if you're looking at putting models into production, you need to be using Python. And well, today you're going to tell us why I'm wrong.
Veerle: 13:18
Yes, definitely. Because you said that you worked at a startup and needed to put models into production and that's why you didn't use R. But let me tell you, I'm working in a startup as well, and we have models in production but we are using R. So it definitely is possible. We're running a whole business on R basically. So the stigma around, R is only for academics, and it's only for statistical programming and it's only good enough for quick prototyping it's not true. And that's what I told them.
Jon: 13:52
So let's dig into that in a second, but first give us a little bit of context about your startup. So it's called Analytic Health. It looks to me like it's headquartered in London.
Veerle: 14:02
Yes. Correct.
Jon: 14:04
But yeah. So tell us a bit about what the company does, what you do there.
Veerle: 14:09
Yeah. So I'm Managing Director and Head of Data Science at Analytic Health and together with my business partner, Greg Mills, and our Head of Operations Jana, we develop web applications for the healthcare sector. And what we do is we gather and we analyze healthcare data in order to retrieve value from the data. Because we believe that we can accelerate innovation in healthcare by getting insights from this data. And we gather data from the United Kingdom, health care data on a daily basis. And we also use internal sales data from pharmaceutical companies. And we build web apps around it, and those web apps are made in R, as well as the data gathering process and the modeling process.
Jon: 14:58
Wow. This episode is brought to you by SuperDataScience. Yes, our online membership platform for transitioning into data science and the namesake of the podcast itself. In the SuperDataScience platform, we recently launched our new 99 day data scientist study plan, a cheat sheet with a week-by-week instructions to get you started as a data scientist in as few as 15 weeks. Each week, you complete tasks in four categories. The first is, SuperDataScience courses to become familiar with the technical foundations of data science. The second is, hands-on projects to fill up your portfolio and showcase your knowledge in your job applications. The third is, a career toolkit with actions to help you stand out in your job hunting. And the fourth is, additional curated resources, such as articles, books, and podcasts to expand your learning and stay up to date.
Jon: 15:54
To devise this curriculum we sat down with some of the best data scientists as well as many of our most successful students and came up with the ideal 99 day data scientist study plan to teach you everything you need to succeed. So you can skip the planning and simply focus on learning. We believe the program can be completed in 99 days, and we challenge you to do it. Are you ready? Go to superdatascience.com/challenge, download the 99 day study plan and use it with your SuperDataScience subscription to get started as a data scientist in under 100 days. And now let's get back to this amazing episode.
Jon: 16:32
Okay. So it sounds like a really cool company and it sounds like you have an amazing job there.
Veerle: 16:36
Yes.
Jon: 16:38
So what kinds of products do you have? I like this idea of retrieving value from healthcare data, and it's cool that you have these different kinds of sources. So I guess if it's public data coming from the UK, is it from the National Health Service from the NHS?
Veerle: 16:54
Yes. It's from the NHS, mainly. Yeah. Also from other sources, but mainly from the NHS.
Jon: 17:01
And then also-
Veerle: 17:02
But imagine you have like 10 different data sources from the NHS, which is great, but you need to access them all separately. And that's the issue, getting that data from all these different data sources and maintaining that is a full-time job. And what we do is we basically do that. We gather it, we combine it, we clean it, we validate it, bring it all together and then put it into a web application to easily access and to analyze. So we're basically doing all this pre-work so that other people don't have to, and that they can focus on what really matters, namely doing enough of these things with those data instead of gathering it.
Jon: 17:44
Very cool. So this reminds me a little bit in a recent episode, in episode 485 with Doug Eisenstein, Doug was on the show talking about engineering data pipelines for the financial sector. So there he was talking about... In that case, it was people, investment managers working at big financial companies in order to be able to make good investment decisions, they need to have many different data sources. It could be dozens of data sources that need to be integrated together so that you have this one big perspective of the situation. In that case, like the economic situation so that you can make the right trading decisions. So it sounds like what Analytic Health provides is an analogous kind of system in healthcare, where you have many data sources together. You engineer systems so that those data sources become integrated and then you create, I guess a user interface or an API, so that users... Who's a typical end user of this product can make better decisions?
Veerle: 18:51
Typical end users work in pharmaceutical companies and are sales managers or brand managers who are trying to optimize their supply chain processes, for example.
Jon: 19:02
Nice. That is super cool. So yeah. So tell us about how R can do all of these aspects for us? So I guess the first piece is going to be data gathering. So ETL processes, for example, extraction, transformation, loading of data. How can we do that at a production type scale in R?
Veerle: 19:29
Yeah, so imagine that we have all these data sources and they are coming from all kinds of different things; it's not Excel files that you're getting everywhere. So some data come from PDF documents, other data comes from emails, other data comes from API endpoints. And all these data become available at different times. For example, some data is released on a monthly basis, and other data is released on a weekly basis, or biweekly or daily. So you need an automated process to basically check all those data sources. And whenever there's new data, the process needs to kick off and then start gathering it, cleaning it, merging it and sending it back to the database, our data warehouse where we then can well look into the data warehouse to gather data. And a key point here is that you need to have these processes scheduled.
Veerle: 20:23
So you need a way where you can schedule all your ETL processes to kick off at appropriate times. And we do that with an R extension which is called cronR. And it's basically a very simple R native tool that allows you to schedule scripts or in our case whole projects to kick off automatically. And so that it can start the data gathering process when it's time to do that. And this cronR tool is very simple to use because it even has an RStudio interface. So it's basically your click and play solution. And for the people who know Linux, it's obviously working on Linux because it makes use of the crontab functionality there. So that's a great tool where you can actually automatically, especially if you have a server which is always on, schedule all your R jobs and also manage them.
Jon: 21:20
Veerle, the tool that you've mentioned, this sounds really interesting. I haven't heard about it before. So it sounds... Yeah, you mentioned how it builds upon crontab, which is a familiar tool for a lot of people who are scheduling processes on Linux systems. But this tool is called Chrome. Is that right? Like Google, it's the same as like Google Chrome, like the kind of metal?
Veerle: 21:40
No. Cron.
Jon: 21:42
Oh, it is Cron.
Veerle: 21:43
C-R-O-N. Yeah.
Jon: 21:44
So, it's cronR? Oh, I see just like crontab? Okay. Okay. Okay.
Veerle: 21:48
Yes.
Jon: 21:48
Very good.
Veerle: 21:49
Just to give you an idea about how many processes we're talking, we have 32 ETL processes running daily. So how are you going then to manage all these processes? Because obviously we can't look at R all day making sure that everything kicked off because life happens, errors prevail, stuff goes wrong. So one thing that we also do to manage the process is to monitor it. And we have a beautiful data pipeline report coming into our mailbox every day, telling us exactly what processes did kick off, at what time, if there were errors, if there are other noteworthy messages, also done with R. So what we use for this is blastula, which is a package which can help you send emails. But you can schedule those. And we use RStudio Connect here to deploy it on. So what we do here is we schedule this data pipeline report.
Jon: 22:54
Can I interrupt you for one second again, just to get the name of the [inaudible 00:22:57]?
Veerle: 22:57
Blastula. So it's B-L-A-S-T-U-L-A.
Jon: 23:08
Blastula? Okay. I gotcha. And so that's... Yeah, it's kind of this idea of maybe like an email blast. I don't know. Whatever, we don't need you to figure out-
Veerle: 23:16
Yes.
Jon: 23:16
... where this name came from.
Veerle: 23:17
It is. It is. And it makes beautiful emails. So these are not the emails that you would expect from R. Like these techy emails with only texts which look awful. No, these are beautiful HTML emails with beautiful tables and graphs. So this is coming into our mailbox every day at eight o'clock, which we discuss then at our daily standup to see, okay, all of these processes did they run accordingly. And I think that is key when you are managing in whatever language, even things in production you need to monitor it. Because you can set up all these processes, do it on a production scale, have many processes running on servers, 1, 2, 3, 4, 5 servers. But you need a way to actually make sure that everything runs accordingly. So monitoring is key here. And I think what a lot of people don't realize is that that stuff is also something that you can do with R, for example, with these email reports and with organized projects.
Jon: 24:15
Yeah. I didn't know that. Okay. So before I interrupted you, after you finished talking about blastula, you were then talking about another tool that also sounded really cool for the same kind of data quality process checking kind of thing.
Veerle: 24:30
Yeah. So the question is how are you going to schedule these email reports? And there we already touched a bit on how you're going to deploy these things. And what we use for deployment is RStudio Connect which is basically an enterprise level tool, which you can purchase from RStudio, which helps you to deploy your Shiny apps, your email reports, markdown reports and APIs even. And that's what we use for deploying everything that we have.
Jon: 25:02
Nice. So let's talk a bit more about that. So I guess I'll... Maybe model deployment is the last step. Maybe before we get to model deployment, we need to be talking about model development in general. I guess you use RStudio as your main tool?
Veerle: 25:21
Yes. We use RStudio server actually it's a Pro version which we have running on a server as well. And where we have multiple accounts on so that we can work together on the same projects. And yeah, we develop there, so we develop our apps there, we develop our APIs there and we develop our markdown reports there. And one important thing that might be noteworthy to mention is that, in production it's very important that you separate your development processes from your production processes, right? You don't want to do development work where your production is and the other way round. So how we solve that at a relatively small company is just to set up two servers, on one server we have RStudio server running on the other one we have RStudio Connect and those are basically our development and production servers. So whenever we develop something on our development server with RStudio server, we then push it to RStudio Connect and that's our live environment. And on RStudio Connect is also the place where our customers go to, to access our web applications.
Jon: 26:36
Nice. So I guess RStudio Connect might also make it easy then it sounds like if RStudio Connect allows you to deploy a Shiny user interfaces. So maybe I could do it a little bit, but you can do it better than me. Tell us a bit about R Shiny.
Veerle: 26:55
So R Shiny is the tool to make web applications of your R code. And R Shiny is basically very simple, you create a user interface and you create a server part. And with a user interface you obviously make the front-end of your application. And with the server part, you will do the back-end loading. And it's very easy to spin up applications, but I also would like to correct something or talk in favor of Shiny. Because Shiny's often said, "Okay, Shiny is like this dashboarding tool. You can make great dashboards with it." Yeah, sure. You can make dashboards, but you can make fully professional applications as well. And I think that if you're saying that Shiny is only for dashboards, then you didn't understand it correctly. Because you can do so much more with Shiny than just the dashboard tool.
Veerle: 27:50
You can really make applications that you can't distinguish from Vue and Node applications, for example. So I develop in other languages as well. We also have Vue applications, which is a JavaScript framework. It's not different. So I think it's definitely, well, I can safely say that you can have an app in production that is running purely on R. Because your clients won't notice, in fact, your end users don't care which tech stack something was developed in as long as they [crosstalk 00:28:23].
Jon: 28:23
No, they really don't. And yeah, I'm probably the same kind of person who goes around saying not only that we should only be using Python for production, but also that if we're going to use R, Shiny is for dashboards. That's definitely something that [inaudible 00:28:42].
Veerle: 28:41
Yes.
Jon: 28:43
So [crosstalk 00:28:45].
Veerle: 28:44
You should never do that again.
Jon: 28:46
No, I won't. That's why I wanted to have you on as a guest. And yeah, you're changing my life here. Now I can go back to R which is all I wanted to be doing all along. And I think a really cool idea here, based on my experience I've built R Shiny apps, mostly for dashboards. But I've done that and I don't have expertise in HTML, CSS, JavaScript; really, I'm quite bad at those things. And I can make a fully functional application in R Shiny. So I think that there's probably a really cool opportunity here for a lot of listeners now, with the conversation that we've had today. If they have some experience with R or even if they don't. So if they are doing their data science with another tool like Python but they want to be making apps, now it seems like the most obvious thing to be doing is learning R and using R Shiny to develop and deploy those apps. Because the way that you are thinking about your code is going to be a lot more similar to how you do it in Python than, say, learning JavaScript or HTML and CSS. So that's really cool. There's a huge opportunity there for...
Veerle: 30:09
And the beauty of... Because a web application is exactly those three things, HTML, CSS, JavaScript. The beauty about Shiny is it has all these amazing wrappers around JavaScript libraries, which make all the cool JavaScript stuff easily available for you as an R user. And that's the thing that I love about Shiny, the development is so amazing there. So every day new stuff comes out, which allows you to make these awesome applications. And in order to use the fundamentals of Shiny, you don't need to know JavaScript yourself because you have these nice wrappers around it. And obviously if you want to do more customization later on, it would be handy if you know JavaScript. Because then you can do even more amazing stuff, but for the basics you don't need it to get started. I started somewhere as well, like just building a Shiny app while I had a bit of R experience. And yeah, you learn along the way, but that's with everything, with every tool that you choose.
Jon: 31:14
Nice. Super cool, Veerle. This is really exciting. I think that, yeah, not only for me, but for a lot of listeners, this has been a podcast that has potentially changed their life. Not only will people be doing more powerlifting and more Olympic lifting, but they'll also be using R in production. So [crosstalk 00:31:35] are there any other particular tools in your ecosystem that you recommend we check out, either for model development or deployment? I think we've talked a little bit more about ETL already, but for development we've focused mostly on RStudio Server. Maybe there are particular packages that you use a lot that you highly recommend in R?
Veerle: 31:55
Yeah. What I would recommend for Shiny apps is checking out the golem package. The golem package is basically a nice [inaudible 00:32:03] framework even, or way to organize your project in order to make it a production-grade Shiny application. So it provides you basically with the infrastructure you need to set up a professional application, which is also very scalable. So I would definitely recommend checking out the golem package here.
Jon: 32:24
How do you spell that? Is that like the Lord of the Rings Gollum?
Veerle: 32:28
Yes.
Jon: 32:29
G-O-L-U-M?
Veerle: 32:32
Yeah, only E-M at the end, but yeah.
Jon: 32:37
Oh, yeah, yeah.
Veerle: 32:41
Yeah, but that's a great tool. And yeah, as I said, check out blastula for email reports so that you can keep yourself updated about what's happening in your processes. And one other tip I can give you when you are setting up R in production is to make sure that when you set up a new project, you choose a structure and make it the same across all projects. You can imagine, with us having 32 processes running, it would be a pain if we went from one project to the other and the project looked totally different every time. So make sure that you have a base structure, and you can even make a package of your own that can easily create the structure for you, but make sure that you standardize it.
Jon: 33:26
Cool. How about the Tidyverse? Are you a Tidyverse fan or an [crosstalk 00:33:32].
Veerle: 33:32
I'm a Tidyverse lover. I'm a Tidyverse fan, yeah. But the thing is my business partner with whom I work very closely obviously is not.
Jon: 33:41
No.
Veerle: 33:41
He's a [inaudible 00:33:42]. Oh, it's awful. He's a data.table lover basically. So we have this clash of data.table and...
Jon: 33:52
And [inaudible 00:33:53]?
Veerle: 33:53
Dplyr. Yeah. [crosstalk 00:33:56]. But did you know there is a great package as well, which is dtplyr, which basically combines dplyr and data.table together?
Jon: 34:04
I did not know that.
Veerle: 34:07
So there you can make use of the speed advantage of data.table. Because that's why I have to be [inaudible 00:34:12]. In most of our Shiny apps, we use data.table because of its speed, and we want our apps to be as fast as possible. So we prefer data.table over dplyr. But the dplyr syntax is just beautiful. So what you can do now with this package, dtplyr, is actually use the dplyr syntax, but under the hood it uses data.table. And you have a bit of overhead there, so it's not as fast as using pure data.table, but it's considerably faster than dplyr [inaudible 00:34:42].
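To make the dtplyr idea concrete, here is a minimal sketch in R. The data set and column names are invented for illustration and do not come from the episode; the only package functions used are `lazy_dt()` from dtplyr and standard dplyr verbs. It assumes the data.table, dplyr, and dtplyr packages are installed.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Hypothetical toy data, invented for this example
prescriptions <- data.table(
  region = c("North", "North", "South", "South"),
  items  = c(10, 20, 5, 15)
)

# lazy_dt() wraps the data.table so that dplyr verbs are recorded
# and translated to data.table code instead of being run by dplyr
summary_tbl <- prescriptions %>%
  lazy_dt() %>%
  group_by(region) %>%
  summarise(total_items = sum(items)) %>%
  as_tibble()  # forces the translated data.table code to run

print(summary_tbl)
```

Nothing is computed until the result is collected with `as_tibble()`. Calling `show_query()` on the pipeline before that last step should print the data.table expression that dtplyr generated, which is a handy way to see the translation overhead Veerle describes.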
Jon: 34:44
Yeah. And so I guess we should back up a little bit and let listeners know a little bit more about the Tidyverse. So the Tidyverse, I think it was created by Hadley Wickham. He's definitely the biggest figure in the space. Hadley Wickham was actually on the SuperDataScience show last year, in episode 337. And Hadley Wickham works at RStudio, which is the biggest player in commercial development of R software. But they open source tons and tons and tons of things. And so we've been talking about RStudio Server and RStudio Connect, which are tools that you can purchase from RStudio. But there's a free IDE, which I've used for over a decade, that is, I think, the leading IDE for developing within an R environment. Same thing with R Shiny, which we've talked about a lot: it's free, it's completely open source. And dplyr... So dplyr is a part of this suite of R packages in the Tidyverse. And the reason why it's called the Tidyverse is because all of these packages are based on the idea of having tidy data, which is a particular way of structuring your data. And then it allows you to pipe functions. So you can take a given data frame, what's called a tibble in the Tidyverse, right?
Veerle: 36:29
Yeah. Yeah.
Jon: 36:31
And then you see, you take like a noun, you take an object, and then you can pass it through a series of functions, a series of verbs. And so you can form a pipeline, basically, of operations, and it is such an intuitive and easy way to write code. I absolutely love it. And I must admit, Veerle, that it is something that I miss about R.
Veerle: 36:56
Yeah, I can imagine. It's awesome.
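The noun-and-verbs pipeline Jon describes can be sketched in a few lines of R. The data below is invented for illustration (it is not from the episode), and the example assumes only that the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data: the "noun" that flows through the pipeline
lifts <- tibble(
  lifter = c("A", "A", "B", "B"),
  lift   = c("squat", "deadlift", "squat", "deadlift"),
  kg     = c(100, 140, 120, 160)
)

# Each step is a "verb"; %>% passes the result of one step to the next
heaviest_deadlifts <- lifts %>%
  filter(lift == "deadlift") %>%  # keep only deadlift rows
  arrange(desc(kg)) %>%           # sort from heaviest to lightest
  select(lifter, kg)              # keep the columns of interest

print(heaviest_deadlifts)
```

Read top to bottom, the code says: take `lifts`, then filter, then arrange, then select, which is the sequential reading that makes the style feel intuitive.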
Jon: 37:01
Okay, Veerle, this is all super cool if we are building a user interface, something that somebody can click and point around with. But what if our production application needs to be an API that people can program against and make calls against? Is there a tool that we can use in that case?
Veerle: 37:20
Yes, definitely. You have the plumber package in R, and the plumber package allows you to turn your R code into an API endpoint. So your models can go directly in there. And the nice thing about that is that it's taking like three extra lines of code to turn your script into an API, and then you can deploy it basically anywhere: you can dockerize it and put it somewhere in AWS or Azure, or you can deploy it to RStudio Connect. And that's a great way to actually productionize your R code.
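As a rough illustration of those "extra lines", here is a minimal sketch of a plumber file. The file name `plumber.R`, the `/predict` route, the function name, and the toy formula are all hypothetical stand-ins for a real model; the special `#*` comments are plumber's annotation syntax for exposing an ordinary R function as an endpoint.

```r
# plumber.R (hypothetical file name)

#* Return a toy prediction; a stand-in for a real model
#* @param x A number supplied as a query parameter
#* @get /predict
predict_toy <- function(x = 0) {
  x <- as.numeric(x)  # query parameters arrive as strings
  list(input = x, prediction = 2 * x + 1)
}

# To serve the API locally (this call blocks the R session),
# run from the console or another script:
# plumber::plumb("plumber.R")$run(port = 8000)
```

With the server running, a request such as `GET /predict?x=3` would be routed to `predict_toy()` and the returned list serialized as JSON by default. Because the handler is a plain R function, it can also be called and unit-tested directly without starting a server.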
Jon: 37:55
Cool. Well, we've learned a lot. You've completely changed my perspective on R as a tool we could be using in production. I can't wait to check out blastula, cronR, and dtplyr. And I'm sure there are lots of audience members out there who can't wait to get started as well. So do you happen to have a book recommendation that might be related to R? It might be unrelated to R, it doesn't matter either way, but...
Veerle: 38:27
I can give a recommendation for R and not for R. For R, I would definitely recommend JavaScript for R by John Coene. It's a great book on how you can use JavaScript with R, and you can access it online, so it's pretty cool. And as a non-R-related book, I would say The Art of Thinking Clearly by Rolf Dobelli. It has these fairly short chapters which you can read in like two minutes, and it'll give you fresh insights on life in general, which is also very useful in a business, for example.
Jon: 39:05
That sounds cool. I want to check that out. That sounds like... So something with such short chapters could be ideal for a commute or just waiting in line for something briefly. It sounds like the perfect way to get some extra philosophy in life.
Veerle: 39:25
Yeah. At our company we have a weekly meeting where we discuss each of us one chapter of this book and talk about it then what it can mean for business. Just to do something else than programming and getting involved in day-to-day business. We step outside once per week and talk about these general things.
Jon: 39:43
Beautiful. All right. I'm looking forward to checking that out. And so I've learned so much from you in this episode. I'm sure lots of people have, and I'm sure lots of listeners are wondering how they can keep up with your latest thoughts, perhaps on The Art of Thinking Clearly, but maybe also on R. So how should people follow you?
Veerle: 40:03
They can follow me on LinkedIn. I share regularly tips about R and also things that we're doing in the business. So give me a follow if you want to know more.
Jon: 40:14
Nice. All right. So we will definitely include that in the show notes, making it easy for people to follow you. Veerle, this has been such a fun episode. I've really loved having you on, and hopefully we can have you on again sometime soon.
Veerle: 40:26
Thank you, Jon.
Jon: 40:34
That was a lot of fun and wow, did Veerle ever blow my narrow Python-centric mind. In today's episode, Veerle filled us in on lots of specific tools for using R as your production software language. Specifically, she mentioned the R Tidyverse as a tidy way to manage your data and data operations, particularly dplyr for piping operations into each other sequentially and intuitively. She told us about dtplyr for obtaining dplyr-style piping with computational efficiency that is near that of data.table. She talked about cronR for scheduling processes to run automatically, blastula for beautiful automated emails, RStudio Server for model development in the cloud, R Shiny for designing user interfaces of any ELT, RStudio Connect for deploying R Shiny apps, and golem for professional-grade scalability of those Shiny apps. And finally, she told us about R plumber for creating API endpoints.
Jon: 41:38
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, such as the list of R packages that I just rifled through, and the URL for Veerle's LinkedIn profile, as well as my own social media profiles, at superdatascience.com/491. That's superdatascience.com/491. If you enjoyed this episode, I'd of course greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel, where we have a video version of this episode. To let me know your thoughts on the episode, please do feel welcome to add me on LinkedIn or on Twitter, and then tag me in a post. Your feedback is invaluable for figuring out what topics we should cover next.
Jon: 42:24
Since this is a free podcast, if you're looking for a free way to help me out, I'd be very grateful if you left a rating of my book, Deep Learning Illustrated, on Amazon or Goodreads, gave some videos on my YouTube channel a thumbs up, or subscribed to my free, content-rich newsletter on jonkrohn.com. To support the SuperDataScience company that kindly funds the management, editing and production of this podcast without any annoying third-party ads, you could create a free login to their learning platform at superdatascience.com. You can check out the 99 days to your first data science job challenge at superdatascience.com/challenge, or you could consider buying a usually pretty damn cheap Udemy course published by Ligency, an affiliate of SuperDataScience, such as my Mathematical Foundations of Machine Learning course.
Jon: 43:11
All right, thanks to Ivana, Jaime, Mario and JP on the SuperDataScience team for managing and producing another amazing episode today. Keep on rocking it out there, folks, and I'm looking forward to enjoying another round of the SuperDataScience podcast with you very soon.