63 minutes
SDS 671: Cloud Machine Learning
Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn
This week’s guests are mainstays of the data science online community, but where have they been this last year? In an exclusive for SuperDataScience, we can confirm that Kirill Eremenko and Hadelin de Ponteves took their sabbatical to launch CloudWolf, a cloud computing education platform that prepares students for certification in AWS (Amazon Web Services). Jon Krohn speaks with his guests all about CloudWolf and why accreditation in cloud computing could be the safest investment for your data science career.
Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Kirill Eremenko
Kirill is the founder and former host of the SuperDataScience Podcast. He is the founder and CEO of SuperDataScience, the platform hosting this podcast, which also features many data science and analytics courses, ranging from tool-based courses such as R Programming, Python, and Tableau to overarching courses like Machine Learning A-Z and Intro to Data Science. He is also a well-known instructor on Udemy, where his courses have been taken by over 1.7M students worldwide. Kirill is also the founder of the DataScienceGO Conference, created for data-powered minds, where experts, mentors, and friends come to enlighten, click and inspire each other and skyrocket their careers. Kirill is absolutely and utterly passionate about data science and delivering high-quality, accessible education to every human on this planet!
About Hadelin de Ponteves
Hadelin is the co-founder and CEO at BlueLife AI, which leverages the power of cutting-edge artificial intelligence to empower businesses to make massive profits by innovating, automating processes, and maximizing efficiency. He is passionate about helping businesses harness the power of AI. He is also an online entrepreneur who has created over 70 top-rated educational e-courses on topics such as machine learning, deep learning, artificial intelligence, and blockchain, which have already made 2M+ sales across 210 countries. Hadelin is an ex-Google artificial intelligence expert and holds a Master's degree in engineering from École Centrale Paris with a specialization in machine learning.
Overview
When a company has an IT infrastructure on its premises, it must invest a great deal of time and money in purchasing the equipment, providing adequate space, tightening cybersecurity, and updating servers every few years. Such capital expenditure can put a real dent in a company's budget. When companies use the cloud, however, they become much more flexible to operational needs, and they also have access to considerably more models for training and analyzing data. Among the companies that have, in Kirill's words, "outsourced the headache" are Airbnb, Netflix, Coca-Cola and McDonald's.
With that in mind, Kirill and Hadelin’s estimations that well over a third of data science and machine learning jobs require cloud skills shouldn’t come as any surprise. Kirill and Hadelin aim to give CloudWolf students the confidence to get AWS accreditation in just 21 days.
Listen to the episode to find out how CloudWolf came to have such a cool name, why it makes sense to learn AWS as opposed to other cloud providers, and AWS essentials that every data scientist needs to know, from databases to storage.
In this episode you will learn:
- About CloudWolf [07:04]
- Why learning the cloud is important for data scientists [09:12]
- Is learning cloud computing complex? [22:30]
- Essential AWS services [28:31]
- Database options on AWS [33:47]
- How to run analytics on AWS [40:58]
- Why an AWS certification is so helpful [56:35]
Podcast Transcript
Jon: 00:00:00
This is episode number 671 with the renowned data science educators, Kirill Eremenko and Hadelin de Ponteves. Today's episode is brought to you by Posit, the open-source data science company, and by AWS Cloud Computing Services.
00:00:17
Welcome to the SuperDataScience Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now let's make the complex simple.
00:00:48
Welcome back to the SuperDataScience podcast. Today we've got not one, but two data science rock stars back on the show. Kirill Eremenko is one of our two guests. He's the founder and CEO of SuperDataScience, an e-learning platform, and he founded the SuperDataScience podcast in 2016 and hosted this show until he passed me the reins two and a bit years ago. Our second guest is Hadelin de Ponteves. He was a data engineer at Google before becoming a content creator. In 2020, he took a break from data science content to produce and star in a Bollywood film featuring Miss Universe Harnaaz Sandhu. Together, Kirill and Hadelin have created dozens of data science courses, and they're the most popular data science instructors on the Udemy platform, with over 2 million students between them. They recently returned from a multi-year course-creation hiatus to publish their Machine Learning in Python Level 1 course, as well as their brand-new course on cloud computing.
00:01:42
Today's episode is all about the latter, so we'll appeal primarily to hands-on practitioners like data scientists who are keen to be introduced to, or brush up on, analytics and machine learning in the cloud. In this episode, Kirill and Hadelin detail what cloud computing is and why data scientists increasingly need to know how to use the key cloud computing platforms such as AWS, Azure, and the Google Cloud Platform. And they dig into the key services that AWS, the most popular cloud platform, offers, particularly with respect to databases and machine learning. All right, you ready for this super useful episode? Let's go.
00:02:21
Kirill, Hadelin. You guys were just here. What brings you back? Where are you guys calling in from?
Kirill: 00:02:26
Same places I think. Oh, no, I'm still in Australia. Hadelin is in France now.
Hadelin: 00:02:32
Yes, I'm in France. I'm in Paris.
Jon: 00:02:35
Nice. Paris. Hopefully everyone's heard of it.
Kirill: 00:02:40
It's impossible to predict where Hadelin is. Like, he's been between Paris, Mumbai, and Dubai probably like 50 times in the past year. Like, every time we get on a call, I'm like, where are you today? And it's a surprise.
Jon: 00:02:53
And it's critical to ask. If listeners don't know where you're calling in from, they can't, you know, they can't enjoy the podcast episode properly.
Hadelin: 00:03:01
For sure.
Jon: 00:03:02
So awesome. Welcome back to the show. So, last time you were on the show, we did an episode, it was episode number 649, and it was focused on your Machine Learning Level 1 course. And so we did like a Machine Learning 101 episode that introduced the key concepts in machine learning. And we talked about how in the future you would have a Machine Learning Level 2 course, but that's not why you're here today. So you have other things going on. For our listeners that aren't already familiar with you, you're two of the most productive data science education content creators out there. So of course you have more than one iron in the fire. So tell us about what you've been working on besides your machine learning course.
Hadelin: 00:03:49
Okay, so basically last year we made a big, important decision, which is to extend our teaching to another industry, which is the cloud, cloud computing. Basically, that doesn't mean we're going to, you know, move away from data science and machine learning. We are going to continue teaching machine learning and data science. But you will see in this podcast episode, we will talk about this, that there is this serious convergence between data science and cloud computing. And so it's not only a plus to teach cloud computing, it's going to become a necessity. And that's why it makes total sense for us to make that decision and teach cloud computing along with data science. So yes, we made that decision last year. We've spent a whole year learning, working super hard on, you know, becoming experts in the cloud. And now we're very happy and ready to, you know, extend our offerings and teach about the cloud.
Jon: 00:04:54
Nice. For sure. I, I'm not, I don't want to step on your toes too much cause I know we're going to have lots in this episode on why this is so relevant to data scientists, but just really quickly upfront that with data science data sets getting larger and larger and larger and the models getting exponentially larger, cloud is a no-brainer for people to be learning about as well, because you need to be able to scale up your infrastructure to be handling all these data and these enormous models. So I, it makes a lot of sense to me.
Kirill: 00:05:21
Absolutely, Jon. And I also wanted to say, that's why we got curious about it. Like, you know, it's a technology, we knew it kind of happens, but we wanted to learn more, and as you said, we'll talk more about why data scientists should learn the cloud. But I also wanted to add, on what Hadelin's saying, that I want to be very clear and upfront with the audience: we're not here as data scientists who are dabbling in the cloud or who want to, you know, be advocates for data science in the cloud. For the past year, we've actually shifted, we've pivoted, while still, you know, being here for our data science audience, and we are releasing courses in data science from time to time, like the Machine Learning Level 1, but we've actually pivoted and we're completely immersing ourselves in the cloud.
00:06:04
So what we are doing now, this new project is relevant not just to people in data science who want to learn cloud, but actually anybody who wants to master cloud, who wants to get AWS certified and go through that. So in this episode, our goal, we are not here to, you know, make sales pitches or anything like that. Our goal is to educate and show, like, give a preview of what we've learned in the past year to the SuperDataScience audience. And, you know, hopefully people can walk away and be able to have some level of beginning level of conversations about the cloud with their peers and colleagues. And if anybody at the end of the episode is curious about what we're doing now with the project we're working on and wants to join us in this journey of learning the cloud, then we would be very happy for that. And there'll be some great exciting things we'll be sharing towards the end of the episode about this as well.
Jon: 00:06:55
Yeah. So I know you're not here primarily to be pushing anything commercially, that this is an educational episode on cloud technologies and why they're relevant to data scientists, but you guys have also just launched a new platform, right?
Kirill: 00:07:08
That's right. Yeah. So exciting. It's launching this week as the podcast is going live. It's called cloudwolf.com; you can find it at www.cloudwolf.com. And yeah, super, super pumped about it. Hadelin can tell us a bit more about why we chose the name Wolf; it was his idea.
Hadelin: 00:07:28
Yeah, yeah, absolutely. So first, yes, we think that the wolf is a fascinating animal, but it also has some, you know, symbolism around it that can be described with a few words: intelligence, strong family ties, loyalty, what else? Education, communication, community, you know, with the wolf packs. So we thought it's a perfect description of the values and principles we want to have for CloudWolf, you know, the wolves of the cloud. And yes, that's exactly how we see CloudWolf and how we see the community that we're going to build. We see, you know, a lot of education, a lot of intelligence, because indeed the cloud is very technical, so you need to have the right understanding of the cloud, which we will teach, of course. And also, you know, the strong family ties, loyalty, the community, because we see in cloud wolves, you know, people helping each other, people, you know, yes, giving advice, supporting each other, so, you know, so that they all get to a great level in the cloud.
Jon: 00:08:42
Nice. That's really cool imagery. Nice. So really exciting platform. Why should data scientists be learning cloud? I gave a couple of examples upfront that, you know, we have lots more data than ever before, and those data sets are getting exponentially larger over time. Model sizes are getting exponentially larger over time, but why can't we just entrust that to other people? I guess like other kinds of practitioners, why should data scientists themselves be capable of handling their own cloud infrastructure?
Kirill: 00:09:18
That's a good question. To paint a picture, let's start with what the cloud is, you know, as a first stepping stone, just to make sure everybody's on the same page about the benefits of using the cloud. So basically, a company can have IT infrastructure on-premises, but then it has to worry about buying the physical servers, maintaining a physical space for them, maintaining security, managing those servers, servicing those servers, and investing a lot of money upfront. And that's called capital expenditure. Whereas with the cloud, you don't need to do any of that. You basically rent servers, or rent storage or databases, whatever you need, from a cloud provider such as AWS (Amazon Web Services), Microsoft Azure, Google Cloud Platform, or several others, but those are the three main big ones.
00:10:07
You rent those things on an as-needed basis. So if you need some servers today, you rent the servers today, and if you don't need them tomorrow, you release them tomorrow; you decommission them or stop using them. And you're paying, basically, an operational expense. You're no longer investing capital expenditure upfront, you're paying an operational expense, and it's very flexible. It's very agile. You have access to way more different options, and you don't have to guess your capacity or your needs in advance. You need to train a big machine learning model this week? You have big servers this week, and then when you don't need to train, you don't pay for them. So it's a much better cost model. The infrastructure is shared, but at the same time, it is very secure, so your data is not seen by the other companies using the same infrastructure.
00:10:54
And basically, you outsource a lot of your headache. And the other big part is that there are economies of scale, because other companies such as Airbnb or Netflix or Johnson & Johnson, huge companies, Coca-Cola, McDonald's, are all using the cloud. And because there are so many companies using the same infrastructure, the cost goes down. So the prices for cloud are very low, and that's the attraction for business. So that's the cloud in a nutshell, you know, the benefits that you get there. And in terms of data scientists, it's mostly about where data scientists are going, and maybe Hadelin can talk about this a bit.
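Kirill's capex-versus-opex point can be sketched with some back-of-the-envelope arithmetic. All prices and lifetimes below are hypothetical placeholders for illustration, not real AWS or hardware rates:

```python
# Hypothetical numbers for illustration only -- real prices vary by
# region, instance type, hardware vendor, and contract.
ONPREM_SERVER_COST = 40_000      # upfront capital expenditure per server ($)
ONPREM_LIFETIME_YEARS = 3        # servers are typically refreshed every few years
CLOUD_HOURLY_RATE = 1.50         # on-demand rate for a comparable instance ($/hour)

def onprem_annual_cost(n_servers: int) -> float:
    """Amortized capital expenditure: paid whether or not the servers are busy."""
    return n_servers * ONPREM_SERVER_COST / ONPREM_LIFETIME_YEARS

def cloud_annual_cost(hours_used_per_year: float) -> float:
    """Operational expenditure: you pay only for the hours you actually use."""
    return hours_used_per_year * CLOUD_HOURLY_RATE

# A team that trains models ~10 hours a week (about 520 hours/year):
print(f"on-prem: ${onprem_annual_cost(2):,.0f}/year")   # fixed cost, idle or not
print(f"cloud:   ${cloud_annual_cost(520):,.0f}/year")  # scales with usage
```

The crossover point depends entirely on utilization, which is Jon's point later in the episode: near-constant use can favor owned hardware, while bursty training workloads favor renting.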
Hadelin: 00:11:34
Yeah. So basically, today, to build machine learning models and data science models, you need more and more compute-intensive resources to train them, simply because the models are, you know, on the one hand, more advanced, but also sometimes you have a much bigger amount of data. And so, you know, cloud resources are not only a plus to train your machine learning models; they have become a necessity. And that is why there is this very strong convergence happening between data science and the cloud. Because indeed, now, data scientists, in order to train their machine learning models, will need cloud resources, which include, you know, two main types of things. First, the compute resources, which are the virtual servers with, you know, strong and powerful GPUs. And also storage, because, as we will talk about later in this episode, you will see that in order to build machine learning models with the cloud, you can connect the storage services of, for example, AWS, with the compute services to build your machine learning models. And we will talk about that in a few moments.
Kirill: 00:12:49
And in addition to that, if we even think, for example, of ChatGPT, which is completely, entirely cloud-based, it's using Microsoft Azure. It's got 175 billion parameters and gets a hundred million users per month. That is a massive scale, and it is just very hard to maintain something like that on on-premises infrastructure. You'd have to buy lots of servers, and you'd have to be scaling all the time, buying new servers. And then what if the demand drops? They're paying a lot for this, I think it was like a hundred thousand dollars per day to maintain ChatGPT, and that's thankfully because they're using the cloud; if it was fixed infrastructure, it would be much more expensive. So we're just seeing this, you know, one of the biggest examples, but that's where the world is going. We're going towards much heavier, compute-intensive models and many more users that you will need to be servicing with these models and products that are being created. And that's why data scientists have to learn it. Like, I think Hadelin already mentioned this: learning the cloud for data science is not just an advantage. It may look like an advantage today, but it is actually becoming a necessity. And the first people that jump on this train will be prepared for the future of tomorrow, which is coming really fast.
Jon: 00:14:03
Nice. Every company wants to become more data-driven, especially with languages like R and Python. Unfortunately, traditional data science training is broken. The material is generic. You’re learning in isolation. You never end up applying anything you’ve learned. Posit Academy fixes this with collaborative, expert-led training that’s actually relevant to your job. Do you work in finance? Learn R and Python within the context of investment analysis. Are you a biostatistician? Learn while working through clinical analysis projects. Posit Academy is the ultimate learning experience for professional teams in any industry that want to learn R and Python for data science. 94% of learners are still coding 6 months later. Learn more at Posit.co/Academy.
00:14:46
That makes perfect sense to me. And this explanation of the size of the ChatGPT model, and that's now orders of magnitude larger with the GPT-4 release that happened recently as well. And so this is something, these, these models, these large language models are getting bigger and bigger and bigger. And while there are also research efforts to try to make ways of pruning away aspects of the model that aren't contributing overall, or aren't contributing to specific tasks we nevertheless have this ongoing trend of bigger and bigger models. So how prevalent is it out there that data scientists need to have cloud skills?
Hadelin: 00:15:34
Yeah. Yeah, Jon, it's very good that you asked this question, because there must be a realization from people that cloud skills have become a necessity. It's not just, you know, a plus, an advantage. It has become a necessity, for the very reason that we just mentioned, which is that machine learning models, data science models, are becoming more and more complex, and therefore, now, to train them well, the only solution is cloud resources. So yes, it's a necessity, and I think that, you know, this convergence between data science and the cloud is going to become narrower and narrower. It's going to converge more. And the more it converges, the more cloud skills will become absolutely necessary to train the models. So yes, I think every data scientist now should at least know how to, for example, train a machine learning model with some AWS services, which we'll talk about in a few moments.
Jon: 00:16:37
Yeah. And so AWS recently became a sponsor of the show. They actually had no influence on the content in this episode; it's purely a coincidence. But I understand it, because AWS is the leader in cloud. So the three main providers are AWS, Microsoft Azure, and Google's GCP, and you guys dug up the stats on this from statista.com: AWS has 34% of the cloud market, Azure 21%, and GCP 11%. So I guess it makes sense, maybe this is your recommendation, that given AWS has more market share than the other two combined, if you're going to start with one of these platforms, you should start with AWS.
Kirill: 00:17:27
Yeah, that's about the right recommendation. And I also wanted to add, in terms of the number of data science jobs that mention cloud skills: we did our own research, and we made sure to check it for statistical significance, to do proper statistical research. We wanted to come into this podcast saying that if you learn cloud skills, your salary is going to grow by, you know, X percent. And we did find a slight increase in the average, or the median, salary of a data scientist who's learning cloud versus one who's not, but unfortunately it wasn't statistically significant. So we wouldn't be comfortable sharing that. But what definitely is a fact is the number of jobs. So right now, we looked at a sample size of about 190 jobs on Glassdoor, literally yesterday.
00:18:21
And we found that 37% of them, these are data science and machine learning jobs, 37% of them mention cloud skills, whether as a requirement or as, you know, a preference. And based on the sample size and doing a statistical test, we can say that overall in the world, not just in our sample of 190, this number is somewhere between 34% and 40%. So between 34% and 40% of data science and machine learning jobs already now, already today, in the first quarter of 2023, mention cloud skills as either a requirement or a nice-to-have, a preference. And that's, you know, a big number already, it's over a third, and it's growing. And another thing to mention about AWS, [inaudible 00:19:12] Jon, of those jobs, of all data science and machine learning jobs, about 20% of them mention AWS skills specifically.
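The kind of interval Kirill describes can be reproduced with a standard proportion confidence interval. The hosts don't state their exact counts, test, or confidence level, so the numbers below are assumed round figures (70 of 190 postings, a 95% normal-approximation interval), and the result will not match their 34-40% band exactly:

```python
import math

# Assumed counts: roughly 37% of a sample of 190 job postings
# mentioned cloud skills (70 / 190 ≈ 0.368).
count, n = 70, 190
p_hat = count / n

# Normal-approximation (Wald) 95% interval for the true proportion.
z = 1.96
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - margin, p_hat + margin

print(f"point estimate {p_hat:.1%}, 95% CI ({low:.1%}, {high:.1%})")
```

A narrower band like the hosts' 34-40% would follow from a lower confidence level or a different interval method (e.g. Wilson); the mechanics are the same.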
00:19:19
So not only does AWS have the highest share of the market, but it's also a safe bet. Like, unless you know that a company that you want to work for, or that you are working for, is using Microsoft Azure or Google Cloud, then sure, learn those. Those are very good tools as well, and we are planning on learning them too, you know, once we're confident, like, we've done our first few certifications with AWS, we want to move on to those as well. But if you don't know which cloud platform to learn, start with AWS; it's a safe bet. There's lots of companies using it, from large businesses, like mentioned before, Nike, Coca-Cola, Disney, all of Netflix. People might not know this, but all of Netflix is on AWS. They don't even have their own servers at all. All of their videos, everything, their website is running on AWS. Airbnb is running on AWS as well. So big companies are, and also startups, because it's so agile to spin things up. So whether you are looking to work for a big company or a startup, AWS is a safe bet to add to your skill set.
Jon: 00:20:18
Yeah, from my perspective, getting startups going for the last eight, nine years, it's a no-brainer. I have built a couple of servers for doing a lot of computation prototyping for my data science team, where I knew I was going to need really big GPUs and that those cloud instances were particularly expensive. Like, I did the math and was like, okay, to have a couple of these on-premises, we can save a little bit of money, cuz we assume that we're going to be using one or two of these all the time. But for the most part, like any production infrastructure for your machine learning models, you don't know how many users you're going to have at any point. So you need to have your production infrastructure scale up. And then even with this example of us having these data science servers on-prem, well, because models are getting larger and larger all the time, when we're doing intensive computation, we need to be turning on cloud instances anyway.
00:21:13
So we might do a little prototyping on like small amounts of data on our local instances, on our on-premises instances. And then use the cloud when we're like, okay, we're going to use this full Google T5 model now on a huge amount of data. And this T5 model, I couldn't possibly fit it onto the one or two GPUs that I have on this one server on-premises. We're going to need lots of compute in the cloud. So yeah, it's a no-brainer to me that data scientists need to be making use of cloud infrastructure for training models today, and especially for production inference, because you don't want to buy all of the infrastructure yourself just to be able to handle the maximum number of users that you want to have using your application at any given time point.
Kirill: 00:21:59
Exactly. Exactly. And also, what you mentioned about these servers: they get outdated, you know, new technology is coming out all the time, and you buy these servers, you put them on-premises, and then what, two years later, you need to decommission and sell them. Whereas on the cloud, the cloud provider just automatically updates, releases new versions; you can switch to those with the click of a button, and you're not having these kinds of sunk costs to deal with all the time.
Jon: 00:22:28
Yeah, for sure. So our listeners might be out there now convinced, if they weren't already, that they need to have cloud skills. Should they be intimidated? Is this really complex? It sounds like there's lots of language that we might not be familiar with. You know, as data scientists or as machine learning experts, this cloud realm has all kinds of new vocabulary, new concepts. Is it complex to learn, or is it relatively straightforward?
Hadelin: 00:22:59
I think Kirill and I can agree that, yes it is a bit complex at first, but Jon, you just said that it's a no-brainer. I agree with you. You know, at some point it becomes a no-brainer because we get the intuition of how things work. But even if at the beginning you know, it can seem quite complex, well Kirill and I agree that we make the complex simple. We've done that with artificial intelligence. So there is no reason why we shouldn't do it, shouldn't be able to do it for the cloud.
Jon: 00:23:28
What a great answer. Makes perfect sense.
Kirill: 00:23:30
Yeah. We've spent like a whole year learning cloud, like, in and out, like we normally do. And that's one of the reasons why you haven't heard from us, why our listeners, our students haven't heard from us in quite a while. We wanted to do this incognito. So, you know, we would only talk about it once we were confident, and after spending a whole year doing it, we believe we can teach it very effectively. We've already created our first course, and our estimate is that with this educational content, we can help people get their first AWS certification, which is the Certified Cloud Practitioner, which means you've learned the basics of cloud, you understand all the vocabulary, you understand how to use it, and it's actually a badge, a certification badge, you can add to your CV and LinkedIn. We estimate that we can help people learn it in 21 days at two hours of study per day. Within basically three weeks, you can be AWS certified with the very first certification and add that to your skill set.
Jon: 00:24:41
Nice. Yeah, that doesn't sound too tough at all. And yeah, so I guess another big advantage: you've talked a lot in this episode about how these cloud skills are really in demand, that a large portion of data science jobs specifically mention cloud technologies. But I guess another reason why data scientists should be interested in this is that adding it to their toolkit allows them to broaden the impact that they can make.
Hadelin: 00:25:12
Absolutely. You know, I'll make a comment on that from my, you know, personal experience with AWS for data science. To be honest, now I practically only use AWS when I want to do data science or build machine learning models. And it's not because the data that I, you know, train the models with is much bigger, which, as we said before, is one of the reasons why we should have cloud skills to train a machine learning model. It's not because of that. It's because even with small data, well, the machine learning model scores, you know, performance scores, that I get are better and higher with AWS, and SageMaker in particular, we'll talk about that. And that's because it has the ability to, you know, combine an ensemble of models, including gradient boosting models and neural net models, and do a lot of hyperparameter tuning at the same time, very efficiently, because, you know, it uses good resources to do that. And I get the best score at the end.
00:26:21
There is a good example. There is this dataset that I use as a benchmark in many of our machine learning courses. And for this dataset, when we hand-code the models, or, you know, when we use the classic libraries with our own resources, we obtain an accuracy of 94-95%. And with SageMaker on AWS, I obtained 97%, simply because it was able to, you know, combine this ensemble of models while doing a bit of hyperparameter tuning, very efficiently. So yes, that's my personal experience, and that's why now I practically only use AWS for machine learning.
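The two ingredients Hadelin credits, hyperparameter tuning plus ensembling of candidate models, can be illustrated with a deliberately tiny, dependency-free sketch. Everything here is a toy stand-in: the "models" are one-parameter threshold classifiers on synthetic 1-D data, whereas SageMaker trains real gradient-boosting and neural-net candidates; only the tune-then-ensemble pattern is the same:

```python
import random
random.seed(0)

# Toy 1-D binary classification data: class 1 tends to have larger x.
data = [(random.gauss(0, 1), 0) for _ in range(200)] + \
       [(random.gauss(2, 1), 1) for _ in range(200)]
random.shuffle(data)
train, holdout = data[:300], data[300:]

def threshold_model(t):
    """A 'model' that predicts class 1 when x > t; t is its hyperparameter."""
    return lambda x: 1 if x > t else 0

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

# Hyperparameter tuning: grid-search the threshold on the training split.
best_t = max((t / 10 for t in range(-10, 31)),
             key=lambda t: accuracy(threshold_model(t), train))

# Ensembling: majority vote over several perturbed candidates, standing in
# for combining gradient-boosting and neural-net models.
members = [threshold_model(best_t + d) for d in (-0.2, 0.0, 0.2)]
ensemble = lambda x: 1 if sum(m(x) for m in members) >= 2 else 0

print(f"single model holdout accuracy: {accuracy(threshold_model(best_t), holdout):.2f}")
print(f"ensemble holdout accuracy:     {accuracy(ensemble, holdout):.2f}")
```

With well-separated classes like these, both scores land around the Bayes-optimal mid-80s; on harder real datasets, the ensemble's averaging-out of individual models' mistakes is where the extra points Hadelin describes come from.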
Jon: 00:26:57
Nice. Well, as you get deep into the technical bits, Hadelin, you start to sound like somebody that I should trust on cloud technology. You start to sound like you know what you're talking about, but you guys have only been diving into this for like a year. Why should we trust your opinions?
Kirill: 00:27:12
Well, there's a couple of things. First of all, we love teaching, and we've been teaching different topics for seven, eight years now. We bring our teaching skills to the table, and that, we find, is very portable. We can port that to the cloud and use the same teaching methodologies, which is really cool. And our research skills, our understanding. In addition to that, following this learning, we've also become certified ourselves. And the final thing is that sometimes when you're a beginner, when you're learning something for the first time, you're better positioned to explain it to other people, because as an expert, you kind of get lost. You forget what it is like to be a beginner. And especially coming from the data science field, we know what pain points data scientists have, what kind of pitfalls are to be expected along the way. And so, for us, it's a no-brainer that the content we've created is by far superior to everything else that we've seen out there.
Jon: 00:28:18
All right. All right. So maybe you've convinced me, but let's do a quiz. So AWS is the biggest cloud platform out there. So tell us about the basics of AWS. What are the essential things that data scientists need to know about AWS?
Kirill: 00:28:35
Okay, sounds good. So we're going to talk about four different types of services. AWS has a total of over 200 services — we're not going to talk about all of them. We'll talk about four main types that are relevant to data science, and we'll mention a couple in each. First one will be compute. Then we'll talk about storage, we'll talk about databases, and we'll talk about machine learning. So the first one is EC2, which stands for Elastic Compute Cloud — and because there are two Cs there, Compute Cloud, that's where the "2" comes from. Elastic Compute Cloud is basically a way for you to rent a server. Imagine on-premises it would be a server rack or a computer that's doing the processing. It can have a certain number of CPUs — 2, 8, 16 CPUs, however many you want — and whatever amount of memory you want, 8GB, 16GB and so on.
00:29:28
Same thing in the cloud. So there are these big server racks that they have, and they're split into virtual machines — virtual instances, they're called. You don't need to worry about the fact that they're virtual; you just say what you need, and it's completely isolated from other clients. And then you'll get your 16 CPUs and, I don't know, 80GB of RAM, whatever you need to be running a model. And the benefits of doing this through the cloud — we've mentioned a few of them — are that you can select the right size for the right job. You can use it today and not use it tomorrow. You only pay for the compute resources that you use; you don't pay for it if it's sitting there idle and not doing anything. It's very agile. You don't have to buy anything and install it on your own servers, and you get access to the latest technologies out there, because they're constantly getting updated. So that's a service to know. Whenever you hear EC2, that is the Elastic Compute Cloud service of Amazon Web Services, and that's where in most cases you will be doing your compute, even though we'll talk a bit about other ones when we talk about machine learning.
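The pay-for-what-you-use point is easy to see with a back-of-envelope calculation. The hourly rate below is a hypothetical placeholder, not a real AWS price — the point is only the shape of the math.

```python
# Back-of-envelope sketch of EC2's pay-for-what-you-use model.
# HOURLY_RATE is a made-up placeholder, not an actual AWS price.

HOURLY_RATE = 0.50          # $/hour for a hypothetical 16-CPU instance

def monthly_cost(hours_used_per_day, days=30, rate=HOURLY_RATE):
    """Cost when you only pay while the instance is running."""
    return hours_used_per_day * days * rate

always_on = monthly_cost(24)      # instance left running 24/7
training_only = monthly_cost(2)   # spun up 2 hours/day just for training

print(always_on, training_only)   # 360.0 30.0 -- 12x cheaper
```

That 12x gap is exactly why "spin it up, use it, shut it down" is the cloud habit Kirill describes.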
Hadelin: 00:30:33
Yes. And now I would like to talk about S3, which is another very popular AWS service. S3 stands for Simple Storage Service, and as the name suggests, it's a storage service offered by AWS. And it's completely insane, because it offers virtually unlimited storage. You can store a virtually unlimited amount of data within S3. It can be any of your files — CSV files for machine learning, even images, videos, anything you want — in what we call buckets. And the very powerful thing about S3 is that it has eleven-nines data durability — 99.999999999% — which means that even over a billion years you would essentially not lose any item within your S3 storage. Plus it is super cheap: only about 2 cents per GB per month of storage. And you even have a free tier option, which gives you 5GB of storage for free. So yes, it's mind-blowing, all the things you can do with such a powerful storage service. By the way I'm talking, it sounds like I'm selling AWS — but not at all. I'm just in awe, just very impressed by how powerful this storage service is and how you can use it in an unlimited fashion.
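It's worth unpacking what eleven nines actually implies. AWS's own illustration is that if you store ten million objects, you can expect to lose a single object about once every 10,000 years — a few lines of arithmetic reproduce that figure:

```python
# What "eleven nines" (99.999999999%) annual durability implies,
# following AWS's own illustration for S3.

durability = 0.99999999999             # eleven nines, per object per year
annual_loss_prob = 1 - durability      # ~1e-11

objects_stored = 10_000_000            # AWS's example: 10 million objects
expected_losses_per_year = objects_stored * annual_loss_prob   # ~1e-4

years_per_lost_object = 1 / expected_losses_per_year
print(round(years_per_lost_object))    # ~10000 years per lost object
```

So "you won't lose anything in your lifetime" is, statistically, a fair reading of the durability guarantee.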
Jon: 00:32:08
This episode of SuperDataScience is brought to you by AWS Trainium and Inferentia, the ideal accelerators for generative AI. AWS Trainium and Inferentia chips are purpose-built by AWS to train and deploy large-scale models. Whether you are building with large language models or latent diffusion models, you no longer have to choose between optimizing performance or lowering costs. Learn more about how you can save up to 50% on training costs and up to 40% on inference costs with these high performance accelerators. We have all the links for getting started right away in the show notes. Awesome, now back to our show.
00:32:49
Yeah, it's a good case in point of what you can do with cloud storage — on-premises you can't possibly have that kind of reliability. I wasn't aware of that eleven nines of data durability. So 99.999999999 percent, where you wouldn't lose a bit of information in a billion years — that's wild to think about.
Kirill: 00:33:11
Yes, indeed. And like Hadelin said, awe is the feeling we've been getting throughout this learning. Every time we learn about a new service, we're like, wow, that's possible, that can be done on AWS. You can even control satellites if you really wanted to. Of course, most of us will never need that, but [crosstalk 00:33:29] that's crazy, right? Satellites, blockchain, whatever — it's very interesting how they're updating these technologies all the time. And speaking of a variety of services, let's talk a bit about databases. So we've talked about compute, we've talked about storage. Well, another important kind of storage for data scientists is databases. And we've got quite a few interesting points here. First of all, whatever flavor of database you like — whether it's Microsoft SQL Server, Oracle, PostgreSQL, MySQL, MariaDB — all those databases are available in Amazon Web Services through a service called RDS. Whenever you hear RDS, that stands for Relational Database Service. And because all of these databases I mentioned are relational databases, you can spin up any one of those. And of course there are more — I'm just saying that the ones you're used to working with are available in AWS. You don't have to learn something new if you don't want to.
Jon: 00:34:23
Actually, just really quickly — there's a verb you used there that I'm not sure we've clarified. You said it's very easy to "spin up" an instance. What does that phrase mean for our listeners?
Kirill: 00:34:38
What is an instance, what is spin up — all of those things. Yeah. So instances — before, we spoke about EC2 instances; here we're talking about databases. They also need underlying resources. When I say spin up an instance — I don't know if that's a term generally used in the industry, but that's how I use it — you're basically launching an instance of a database, and you're able to put things in there, you're able to store it. And whenever you don't need it, you just spin it down, spin it off, I don't know, turn it off, so you [crosstalk 00:35:10].
Jon: 00:35:12
I've never heard anyone say spin down, but yeah, definitely spin up. And I think it's something to do with this idea of like, I don't, I'm guessing-
Kirill: 00:35:21
The disc spinning, yeah?
Jon: 00:35:22
A disc actually spinning that you're like, when you press an on button, it's like, "woo".
Kirill: 00:35:25
Yeah, yeah.
Jon: 00:35:26
It's spinning. But yeah, I don't think anyone says spin down. You just a-
Kirill: 00:35:29
Okay. And that's a really cool transition — close down, yeah — a really cool transition to what I wanted to talk about next. There's also a really cool database on Amazon Web Services, which is Redshift, and this is a data warehouse. In order to understand the beauty of Redshift, we have to talk about something called OLTP versus OLAP data storage. OLTP stands for Online Transaction Processing; OLAP stands for Online Analytics Processing. All of those databases mentioned before — Microsoft SQL Server, Oracle, PostgreSQL, MySQL — are designed for OLTP. You can use them for analytics, sure. You can go in and calculate averages of your columns, you can find out the medians, the standard deviations, whatever, build your visualizations and so on, run machine learning models.
00:36:19
But that's not what they're designed for. And this comes down to the disk, right? So on disk — this is a very simplified explanation, and this is how I understand it — these databases store the data from a row together. You might have 15 columns in a row, and all of the values of that row are stored together, then the next row, then the next row, because that's how they're written into the database: you write them row by row, and that's how they're stored on disk. Whereas Redshift — or online analytics processing data storage, data warehouses such as Redshift — what they do is they change, they shift, hence the name, how data is stored on disk. Now, all of a sudden, it's not the elements of each row that are stored close together, but the elements of each column.
00:37:08
So all of the values in one column are going to be stored close together, then all the values of the next column are stored close together. And why is this important? Well, because when we're doing analytics, when we're doing any machine learning model or any visualization, we're interested in things like features, we're interested in averages of those features — which are columns — we're interested in performing operations on columns, not on rows. And because they're stored close together on disk, that means your analytics and your machine learning are much faster. That's what Redshift is all about. Basically, it's a data warehouse which you can easily move your data into and use for exactly that. So if you have terabytes and terabytes of data, that will really speed up your machine learning and other analytics.
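Kirill's row-versus-column contrast can be sketched in a few lines of Python. This is a toy illustration of the two layouts, not Redshift itself: the same records stored row-wise and column-wise, with an analytic query ("average price") that only needs one contiguous column in the columnar layout.

```python
# Toy contrast of row-oriented (OLTP) vs column-oriented (OLAP) storage.
# Same data, two layouts; the analytic query touches very different amounts
# of "disk" in each.

rows = [                                   # row store: each record kept together
    {"id": 1, "product": "book", "price": 12.0},
    {"id": 2, "product": "pen",  "price": 2.0},
    {"id": 3, "product": "lamp", "price": 40.0},
]

columns = {                                # column store: each column kept together
    "id": [1, 2, 3],
    "product": ["book", "pen", "lamp"],
    "price": [12.0, 2.0, 40.0],
}

# Row store: the query must walk every full record and pick out one field.
avg_row = sum(r["price"] for r in rows) / len(rows)

# Column store: the query reads one already-contiguous list and nothing else.
prices = columns["price"]
avg_col = sum(prices) / len(prices)

print(avg_row, avg_col)                    # 18.0 18.0 -- same answer either way
```

Both layouts give the same answer, but on real hardware the columnar read is one contiguous scan of just the bytes you need, which is the whole point of a warehouse like Redshift.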
00:37:51
So that's Redshift. Another really cool database is ElastiCache, which is an in-memory type of database. Really fast, non-durable — so you can't really store things there forever. But it's not designed for that; it's more designed for real-time analytics, or it might be some gaming analytics, some really fast things that you need to do and don't really need to store. Another cool database is DynamoDB — a completely different type of database. It's a non-relational database. It's schemaless, it's serverless — we're going to go into detail about all those things. But basically, if you're looking for a NoSQL type of database where you can have your own schema for every row, then DynamoDB is your go-to. And it basically scales automatically as you put more data into it.
00:38:41
A very powerful type of database, and that's for our listeners who might be interested in NoSQL. And of course, you can store JSON documents and other types of data. If you're specifically interested in something for storing JSON documents in the NoSQL space — normally we would use MongoDB outside of AWS. In AWS you have DocumentDB, with MongoDB compatibility, which is their version of MongoDB. Really powerful. So if you're used to using MongoDB and JSON document storage, then DocumentDB is your go-to in AWS. And those are just some of the databases — there are many more database services and related services we could be talking about.
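The "schemaless" idea behind DynamoDB is easy to show with a toy in-memory stand-in. This is not the real boto3 API — the function names `put_item` and `get_item` mirror DynamoDB's operation names, but the table here is just a Python dict — the point is that items share only a primary key, and everything else can differ item by item.

```python
# Toy in-memory sketch of a schemaless, key-value table in the style of
# DynamoDB. Not the real AWS API -- just the core idea.

table = {}                                 # primary key -> item

def put_item(item):
    """Store an item under its partition key ('user_id' here)."""
    table[item["user_id"]] = item

def get_item(user_id):
    """Fetch one item by key."""
    return table[user_id]

# Two items, two completely different sets of attributes -- no shared schema.
put_item({"user_id": "u1", "name": "Ada", "scores": [97, 88]})
put_item({"user_id": "u2", "email": "x@example.com", "premium": True})

print(get_item("u1")["scores"])            # [97, 88]
print("name" in get_item("u2"))            # False -- that attribute only exists on u1
```

In a relational table, both rows would have to share one column set; here, each item carries whatever attributes it likes, which is what "create your own schema per row" means.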
00:39:21
But one last one I want to mention is AWS Glue, and that's an ETL service — extract, transform, load. You have all these sources, whether it's the databases we mentioned, or your S3 data — like you have CSV files in S3 or something like that — and you want to combine all of that, put it together, do an ETL process. Then AWS Glue is your service to do that. So as you can see, even for the usual tools we use in data science and machine learning, they have everything covered — everything and much, much more; the possibilities are endless. So yeah, once you transition to AWS, you can still use the skills you already know. They support that, and you can learn additional ones, but basically your skills are very portable into AWS services.
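The extract-transform-load pattern that a service like AWS Glue runs at scale looks like this in miniature. Everything here is a stand-in: the "S3 file" is an in-memory CSV string, the "RDS rows" are a list of dicts, and the "warehouse" is a plain list — only the E/T/L shape is the real idea.

```python
# Minimal illustration of the extract-transform-load (ETL) pattern.
# The sources and target are in-memory stand-ins, not real AWS services.

import csv
import io

s3_csv = "id,amount\n1,10.5\n2,20.0"       # pretend: a CSV file sitting in S3
db_rows = [{"id": "3", "amount": "5.25"}]  # pretend: rows pulled from an RDS table

def extract(csv_text):
    """Parse CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(records):
    """Normalize string fields into proper numeric types."""
    return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in records]

def load(records, target):
    """Write the cleaned records into the 'warehouse'."""
    target.extend(records)

warehouse = []
load(transform(extract(s3_csv) + db_rows), warehouse)
print(warehouse)
# [{'id': 1, 'amount': 10.5}, {'id': 2, 'amount': 20.0}, {'id': 3, 'amount': 5.25}]
```

Glue's job is exactly this combining of heterogeneous sources into one consistent target, just with crawlers, catalogs, and distributed compute doing the heavy lifting.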
Jon: 00:40:10
Nice. Really cool. That was a really detailed introduction to the different kinds of database options there are in AWS. I learned a couple of new things in there — like I didn't know, because for years now we've been using GCP and Mongo, that there was a separate DocumentDB that's compatible with Mongo. That's cool. And I didn't know about AWS Glue. That sounds super useful for people who are engineering data pipelines. For those ETL — extract, transform, load — pipelines, AWS Glue sounds like a great tool for stringing all those different operations together.
Kirill: 00:40:49
Absolutely, absolutely. There are lots more services we could talk about, but we don't want to overload our listeners.
Jon: 00:40:56
Yeah, no, that's a great start. So once we've got our data into whatever format we want, like a structured database or a NoSQL database when, and then we want to actually do some analytics, or we want to do some machine learning, how can we do that in AWS?
Hadelin: 00:41:11
Okay. So you have different services for machine learning in AWS. I'll talk about maybe the most popular one, which is SageMaker — that's typically the AWS service to build machine learning models. And you have two ways of doing so inside SageMaker. The first way is with the SDK — I'll explain what that is in a second — and the second way is with AutoML. So let's start with the SDK. SDK stands for Software Development Kit, and it basically allows you to build a machine learning model classically in a Jupyter Notebook, while at the same time being able to call on AWS resources such as EC2, if you want to get a compute instance, or storage like S3, if you want to store your datasets in the storage service and then load the dataset for your machine learning model.
00:42:03
So you have your Jupyter Notebook, you can call the SDK to get the AWS resources, while at the same time using the classic deep learning or machine learning libraries like TensorFlow, and you combine everything and build and train your machine learning model. So that's the first way. Now the second way, which is also amazing, is with AutoML. AutoML allows you to build not only a machine learning model but a whole machine learning pipeline without typing a single line of code. It's basically plug-and-play: with just a few clicks you build this whole pipeline. It goes through several steps. First, you input the data — you store your data in S3, and then you load the data in the first step of this AutoML pipeline.
00:42:53
Then AutoML will automatically recognize what type of problem it is, like regression or classification. You just have to specify the target, and then AutoML will automatically recognize the features, the input variables. Then in the next step — so you just have to answer a few questions about your dataset — generally AutoML will detect if there are any input variables to pre-process, like categorical variables. Then in the next step, that's where you're going to choose your training process, and you have three options. The first one is auto. That's the option that will test an ensemble of models that AutoML detects as the best potential candidates for your dataset. It identifies the fit between your dataset and a potential good selection of models, and will test those models together as an ensemble.
00:43:54
So that's the first option. The second option is basically to train and test everything. Here you will have everything in the pipeline — all the TensorFlow models, all the gradient boosting models, deep learning models, neural-network-based models and others. It'll test all of them, and at the end it'll return the combination of models with the combination of hyper-parameters that leads to the best score. So that's probably the option that will give you the best score, but at the same time it's the most compute-intensive, and it's the one that's going to take a lot of time. I actually waited a few tens of minutes with this option.
00:44:36
And the third option is focused on hyper-parameter tuning. It won't test all the models that are in SageMaker, but it'll test, again, a good fit with your dataset while doing a lot of hyper-parameter tuning to make sure to find the best hyper-parameters. And among these three options, the one that I recommend is auto, because it'll generally be the one that leads to the best compromise between having the best score and not taking too much time. And at the end, yes, without having written a single line of code, you get your model — the whole code of the model — and you get the training score: accuracy for classification, mean squared error for regression. And then, yes, you get amazing results.
00:45:21
I tested it several times, and as I said before, you get the best scores with SageMaker. And then finally you can deploy this model, because you get it as an endpoint, so you can use it to make new predictions on new observations. And that's really amazing, because it's also a way to democratize machine learning and data science: now people without coding skills can train machine learning models super easily. They just need to understand how data should be pre-processed and how the machine learning pipeline works, and then they're ready to do machine learning. So that's pretty amazing.
Jon: 00:46:04
Yeah. So I've, I have two questions for you. One is probably a short one, and then one is maybe a longer one. So the first one, the short one is you mentioned the idea of being able to transform your model into an endpoint. What's an endpoint?
Hadelin: 00:46:18
So basically in SageMaker — well, in AWS in general — you have endpoints, which are basically links to your model, or to your instance, or to anything. Any resource in AWS has an endpoint which you can access, and which allows you to connect it to other resources. And here, this is for making predictions. So you have your model, which you can access through the endpoint, and then you can deploy it to make predictions.
Jon: 00:46:51
Nice. Yeah, I think like the, maybe the kind of general idea there is, it's like an API endpoint.
Hadelin: 00:46:55
Yes.
Jon: 00:46:57
So yeah, it's like this idea of microservices architecture, where you have all these discrete services, each programmed in its own way, and then you have an endpoint for accessing whatever that service is. So it could be a model, it could be a data resource.
Hadelin: 00:47:14
Exactly.
Jon: 00:47:15
It could be any part of some production application. And so yeah, very easy to work with, and it's like the standard way of building applications today.
Hadelin: 00:47:24
That's right.
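Jon's "it's like an API endpoint" framing can be made concrete with a toy, purely local stand-in: a tiny HTTP service that accepts a feature payload and returns a prediction, queried the way you would call any API endpoint. This is not the SageMaker API — a real deployed endpoint is invoked through AWS's runtime (for example, boto3's `invoke_endpoint`) rather than a local server — but the request/response pattern is the same idea.

```python
# Toy local stand-in for a model endpoint: a tiny HTTP service that takes
# a feature value and returns a prediction. Not the real SageMaker API --
# just the request/response pattern an endpoint exposes.

import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # "Model": predict positive when the feature exceeds a threshold.
        prediction = {"label": body["x"] >= 0.5}
        payload = json.dumps(prediction).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):          # keep the demo quiet
        pass

# Port 0 asks the OS for any free port; serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# "Invoke the endpoint": POST a feature payload, read back a prediction.
req = Request(f"http://127.0.0.1:{server.server_port}",
              data=json.dumps({"x": 0.9}).encode(),
              headers={"Content-Type": "application/json"})
response = json.loads(urlopen(req).read())
server.shutdown()
print(response)                            # {'label': True}
```

Swap the local URL for a deployed endpoint's address and the calling code barely changes — which is what makes endpoints the glue of a microservices architecture.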
Jon: 00:47:24
And then the other question that I have for you it's funny, I thought that was going to be the short one. And so, but maybe this one will end up being the short one. I don't know. I think that this is kind of a more open-ended question. So you described how with SageMaker there, we can do all these incredible things. It can select our features for us automatically from all the data. It can pre-process those features. It can select from myriad different models and tune the hyper-parameters on all of those different models, and then ensemble them together into a super amazing model, and then provide us with all of the summary statistics and allow us to quickly create an endpoint on this super amazing model. So why isn't everyone doing that all the time? Are there situations where we're better off having more control? Or why, like, why does anybody learn how to make ensembles themselves or tune hyper-parameters themselves if they can be doing it in SageMaker?
Hadelin: 00:48:20
Yes, that's a very good question. I would say it's because people probably think it's quite complex to use SageMaker and AWS. There hasn't been that much training to democratize this. But indeed, as I said before, now I practically only use SageMaker, because it's with SageMaker that I get these best models — even by going with the easy option of selecting AutoML and just putting the whole pack of models in an ensemble with some hyper-parameter tuning. But seriously, I would say the reason not to use SageMaker is if you do want more control. That's maybe in a research context, where you have to explore more avenues on how you want to combine your models with the data to reach your objectives. But if you want to do a classic data science project with some classic machine learning predictions, yes, I would definitely recommend going for SageMaker and AutoML. I don't think more control is necessary to get the amazing score that you need.
Jon: 00:49:42
Cool. Great answer. Other than SageMaker Hadelin, any other ML tools that we should know about in AWS?
Hadelin: 00:49:50
Absolutely. So there are a lot of ML tools. I'll tell you about one of my favorites, which is DeepRacer, because I have a particular preference for reinforcement learning. So yeah, you have AWS DeepRacer to test your reinforcement learning models for car racing. So that's pretty cool. You have also-
Jon: 00:50:10
DeepRacer?
Hadelin: 00:50:10
Yeah, DeepRacer, yeah.
Jon: 00:50:14
Cool.
Hadelin: 00:50:15
Yeah. You have Augmented AI, which allows you to implement a human review of machine learning predictions. You have Amazon Forecast to make forecasts, for example on time series. You have Amazon Translate to do machine translation. You have Amazon Comprehend to do NLP. You have Amazon Rekognition to do object detection and object recognition. So you see, you have many of them — you have basically an AWS service for every branch of machine learning, whether it's computer vision, NLP or machine translation. And you have Amazon Transcribe for speech recognition, and you have Amazon Polly, which turns text into speech, yes.
Jon: 00:51:04
Wow. That is a lot of services.
Hadelin: 00:51:07
Yes. Yes.
Jon: 00:51:09
And most of those I didn't know about. All right. So with all that, with everything that you've covered around EC2 instances and S3 buckets and databases and machine learning in the cloud you've passed my test. You've passed my quiz.
Hadelin: 00:51:22
Oh, I'm really [inaudible 00:51:23].
Jon: 00:51:23
Congratulations.
Kirill: 00:51:24
Awesome.
Jon: 00:51:25
So I'm now happy to officially recommend your course to our listeners. So in lieu of a book recommendation, how about you guys fill us in on your upcoming course, the launch, and any other details that our listeners need to know so that they can learn about the Cloud for Data Science from you two.
Kirill: 00:51:45
Thanks, Jon. It's been an honor to come on the show. Again, thank you for tolerating us and inviting us. As I mentioned at the beginning, we're not here to make any kind of sales pitch — our goal was to share some information with the SuperDataScience podcast audience so that they get some general knowledge about the cloud, and hopefully we were able to accomplish that objective. And speaking of the project, as mentioned before, it's CloudWolf.com — because clouds are cool and wolves are cool. Anyway, both are cool. It'll be a membership, and there'll be lots of really cool stuff. We're building it up.
00:52:31
We're going to be adding more and more things. This is this, like when the podcast is live, this is literally the first week that it is available. So we have some very special things for the early wolves, not early birds, but early wolves. Most importantly, that you can lock in a very attractive price that is likely never going to be available ever again, because you will be part of the very first group of people. And of course there'll be some things that, you know, we'll need to fix in the platform, some things you'll need to be patient with us with, and we'll be adding more content. But that means, you know, you can lock in this great price for your membership.
00:53:08
Basically, we're starting off — and eventually, with time, we're going to be adding lots of labs on the cloud. We'll have a community around the cloud, of people who want to get AWS certified and want to grow together and learn these things together. We'll have study guides, exam samples and so on. To start with, we're launching with some exam samples — we already have three practice exams for the AWS certification. Don't forget, this is focused on learning AWS and getting certified in AWS. And we'll have our first course, on the AWS Certified Cloud Practitioner exam, which is about 14 hours long. It's quite a big course — there are lots of services — but as mentioned before, we expect that at about two hours per day, it can be done in 21 days. It covers lots of things. And what we're proud of is that we really focus on our educational style, because with so many services, we couldn't afford to just talk and talk and talk in our videos.
00:54:09
So we made our videos very concise. Our average video length is about three and a half minutes per service, or per part of a service. So it's very sharp, to the point — kind of like what you experienced in the podcast today: straight to the point about what it is, what you need to know, how to use it. As usual, I do the theory, Hadelin does the practical. So that's our main offering. But in addition to that, because we come from data science and have this data science legacy — we love data science and machine learning — we couldn't help ourselves but include a special bonus mini-course, which Hadelin will tell us about just now.
Hadelin: 00:54:46
Yes, that special bonus mini-course is Data Science in the Cloud, where we teach you how to use, and mostly leverage, AWS resources to build and train your machine learning models and eventually get to that best score we were talking about during this podcast episode. So we use SageMaker, of course — particularly AutoML. I'll teach you how to very easily build that whole machine learning pipeline: first with your dataset that you load, then clicking a few options to pre-process your dataset, then choosing among the three options for how you want to train your model — with the ensemble option or the hyper-parameter tuning option — and then I show you how to get the final trained model, the best model, with the scores and the different features.
00:55:38
And then I show you how to deploy your model. And also, by doing this course, you will already understand a lot about AWS resources, because you'll see that we actually use a couple of AWS services to work with SageMaker — this includes S3, but also IAM to manage your permissions within SageMaker, and EC2. So you'll see that even though it's a course for data science, it's already a good introduction to AWS services, because we use the main ones. And you'll also see that I show you a couple more services, like the ones we spoke about — for example, Amazon Rekognition. I'll show you where they are and how to use them. Not all of them, but the coolest ones.
Jon: 00:56:32
Nice. Sounds really cool. So, you guys talked a fair bit there about the AWS certification. Why does that matter? Why should somebody care to get a special certificate as opposed to just learning the relevant skills?
Kirill: 00:56:46
Well, the good thing about cloud, we found, as different from data science, is that — because it comes from this on-premises background, where servers and things have existed for decades and decades — it's very established in terms of the different career pathways, in terms of the different skills that people need to know. In data science, for now, we don't have a generally worldwide-recognized certification source, but I'm sure listeners would agree that if there was one, it'd be a good idea. If you already know data science, you already know machine learning, it would be a good idea to get certified and put that on your resume, because it's just like a stamp of approval. Same thing here. You're going to be learning the cloud — if you make this choice, if you choose to join us on this journey and learn the cloud and get upskilled in this area and add that to your toolkit — so why not get certified as well, and put that on your resume?
00:57:38
Imagine an employer says AWS skills are required, because they're planning on using the cloud or are already using it. Who are they going to go for: somebody who says they know the cloud, or somebody who has a certified badge from AWS saying that they do? Of course, the latter option is much more preferable. And so, for just a little bit of extra effort, why not get the certification? And they're valid for three years. So you get it now, you keep it for three years, and that's going to be very useful for your career as well.
Jon: 00:58:10
Nice. That's crystal clear. That makes a lot of sense to me. And so just a bit of clarification here. So between the two of you, you've had many millions of students on Udemy. Is this course going to be on Udemy too?
Hadelin: 00:58:23
Yeah. Thank you Jon. So first we don't have many millions of students. We have two, but that's the start, I guess.
Jon: 00:58:29
Two students?
Hadelin: 00:58:31
2 million, 2 million students. But no, we're not going to be on Udemy, and that's for a simple reason: we really want to build this new community, this big community, and inside this community we want to have more flexibility. So while Udemy offers a lot of great features to communicate with students, we'll have a lot more flexibility if we host these new courses and build this community on our own platform. It'll be much more interactive; there will be a better sense of community building with the wolf pack, on CloudWolf. So there's a lot more we can do by building this cloud educational platform with our own community and our own website.
Jon: 00:59:22
Nice, crystal clear. Well, I wish you guys the best of luck. In fact, I feel like I don't even need to do that — I know this is going to be tremendously successful, like everything else that you two have ever touched. So I'm really excited to see this course launch and have people using it. I know the feedback's going to be great. Thank you so much for coming on the show, both of you, Kirill and Hadelin, and giving us a kind of 101 episode for data science in the cloud. Thank you so much, and I'm sure it won't be long before we have you on again, giving us an invaluable introduction to another critical data science concept.
Hadelin: 01:00:01
We hope so.
Kirill: 01:00:01
Thanks a lot Jon.
Hadelin: 01:00:03
Thanks a lot.
Jon: 01:00:10
Always great to have Kirill and Hadelin on the show, as they're lots of fun and certainly do all their homework, coming to their episodes terrifically well prepared. In today's episode, Kirill and Hadelin filled us in on how cloud platforms enable us to rapidly scale compute infrastructure up and down as needed, minimizing our costs in many common circumstances. They talked about how, as data sets and machine learning models get bigger and bigger, data scientists will benefit from being proficient at using cloud platforms themselves, and how AWS, Azure, and GCP are all solid options for most data science use cases. And given that AWS is the most popular of these services, they highlighted particular aspects of it, such as EC2 instances for compute, S3 buckets for storage, and Redshift as an OLAP (online analytical processing) database designed for efficient data analysis operations. And Hadelin talked a fair bit about SageMaker, which is a really powerful tool for automated machine learning.
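To make the cost point in that recap concrete, here is a back-of-the-envelope sketch comparing an always-on server to elastic, pay-per-hour cloud compute for a bursty workload. The hourly rates and hours here are hypothetical placeholders for illustration, not actual AWS prices.

```python
# Back-of-the-envelope comparison of a fixed, always-on server versus
# elastic cloud compute that you spin up only when needed.
# All prices below are hypothetical, not real AWS rates.

HOURS_PER_MONTH = 730  # average hours in a month

def fixed_server_cost(hourly_rate):
    """Cost of hardware you pay for 24/7, whether or not it is busy."""
    return hourly_rate * HOURS_PER_MONTH

def elastic_cost(hourly_rate, busy_hours):
    """Cost when instances run only for the hours actually used."""
    return hourly_rate * busy_hours

# Example: heavy compute is only needed for 80 hours a month.
# Even at a higher per-hour rate, the elastic option is far cheaper.
fixed = fixed_server_cost(hourly_rate=1.00)
elastic = elastic_cost(hourly_rate=1.20, busy_hours=80)

print(f"fixed: ${fixed:.2f}/month, elastic: ${elastic:.2f}/month")
```

The crossover works the other way too: if the workload ran near 24/7, the always-on option would win, which is why the episode's framing is "minimizing costs in many common circumstances" rather than always.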
01:01:07
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Kirill and Hadelin's social media profiles, as well as my own, at superdatascience.com/671. That's superdatascience.com/671. I encourage you to let me know your thoughts on this episode directly by tagging me in public posts or comments on LinkedIn, Twitter, or YouTube. Your feedback is invaluable for helping us shape future episodes of the show. And if you'd like to engage with me in person, as opposed to just through social media, I'd love to meet you in real life at the Open Data Science Conference East, ODSC East, which will be coming up soon in Boston, from May 9th to May 11th. I'll be doing two half-day tutorials. The first tutorial will introduce deep learning with hands-on demos in PyTorch and TensorFlow, while the other half-day tutorial, which is brand new, will be on fine-tuning, deploying, and commercializing large language models, including GPT-4. In addition to these formal events, I'll also just be hanging around, grabbing beers, and chatting with folks. It'd be so fun to see you there.
01:02:13
All right, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks, of course, to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another super useful episode for us today. For enabling this super team to create this free podcast for you, we are deeply grateful to our sponsors, whom I've hand-selected as partners because I expect their products to be genuinely of interest to you. Please consider supporting this free show by checking out our sponsors' links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode of the show, you can get the details on how by making your way to jonkrohn.com/podcast. And of course, thanks to you for listening. It's because you listen that I'm here. Until next time, my friend, keep on rocking it out there, and I'm looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.