70 minutes
Career Tips, Data Science, Machine Learning

SDS 657: How to Learn Data Engineering

We talk a lot about data engineering on the podcast, and this week, our favorite topic is getting the attention it deserves. Data engineering educator Andreas Kretz joins Jon Krohn to cover everything you need to know about the field and how its must-know skills can help improve your role as a data analyst or data scientist. Tune in if you’re ready to future-proof your data career.

About Andreas Kretz

Andreas is a Senior Data Engineer and Trainer, tech enthusiast, and a father. He has been passionate about Data Engineering for over a decade, first as a self-taught Data Engineer and then as a Data Engineering team leader in a large company. When he realized how much need there was for training in this area, Andreas followed his passion and started his own Data Engineering Academy in 2019. Since then, he has helped over 1,000 students achieve their goals. He loves to exchange ideas and discuss them with like-minded people, and he enjoys sharing his knowledge via his YouTube channel.

Overview

When we think or speak of data science, it’s often the algorithms that attract much of our focus. But as data engineering educator Andreas Kretz reminds us this week, data engineering processes are critical aspects of the data lifecycle, and it’s becoming increasingly important for data scientists to learn these skills.

In smaller organizations, the data scientist may take on the role of data engineering, making it all the more pressing to develop the independence one needs to work with large amounts of data: to clean, structure, and funnel it to build datasets that are more effective. This process is what Andreas calls the “plumbing of data science.”

With more data being recorded than ever before, these skills have never been more important to learn. If you’re looking to get started, Andreas highlights four core areas that anyone learning data engineering needs to master to be effective: relational databases, APIs, ETL tools, and data monitoring tools.

What’s more, Andreas points out that people often think of data engineering as a one-time service that simply involves building a pipeline. But it’s often much more complicated than that. With some aspects of APIs changing without any warning, it’s one of the major cases for why organizations need data engineering “constantly.” This is where monitoring systems come in, says Andreas. Whether it’s a simple notification service, a dashboard integration, or a Slack alert, monitoring is an essential part of the role.

Tune in for more from Andreas, including how junior engineering roles differ from senior ones, and whether data engineering certifications are worth it.

In this episode you will learn:

Why learn data engineering? [06:55]
What is data engineering? [08:08]
What sets Senior Data Engineers apart from junior ones? [13:57]
The must-know data-engineering tools [20:26]
The right path to learn data engineering [44:24]
Are certifications worth it? [51:46]
The future of data engineering [55:24]
Andreas’s career challenges [58:48]

Items mentioned in this podcast:

Glean.io
Pandata
Keith McCormick’s ROI Course offer (follow #SDSKeith)
Learn Data Engineering Academy
Learn Data Engineering YouTube Channel
Plumbers of Data Science Podcast
Relational Databases
Apache Spark
Databricks
NoSQL
Kafka
Amazon Kinesis
Elasticsearch
JSON
Snowflake
DBT
ML on AWS
Azure
SDS 653: Efficiently Glean-ing Insights from Vast Data Warehouses
SDS 623: Data Analyst, Data Scientist, and Data Engineer Career Paths
Zero to One by Peter Thiel and Blake Masters
Jon’s Podcast Page

Podcast Transcript

Jon Krohn: 00:00:00

This is episode number 657 with Andreas Kretz, founder of Learn Data Engineering. Today’s episode is brought to you by Glean.io, the platform for data insights, fast. And by epic LinkedIn Learning instructor Keith McCormick.

00:00:18

Welcome to the SuperDataScience Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple.

00:00:37

Welcome back to the SuperDataScience Podcast. Today, I’m honored to be joined by the brilliant data engineering educator, Andreas Kretz. Andreas is the founder of Learn Data Engineering, an online platform through which he’s taught over a thousand students the theory and practice of data engineering. Through his YouTube channel, Andreas has over 10,000 subscribers, and so he’s provided countless more folks with data engineering tips and tricks. Prior to becoming a data engineering content creator, he worked for 10 years at the German industrial giant Bosch, including as a data engineering team lead and data lab team lead. He holds a computer science degree from the Technical University of Applied Sciences Würzburg-Schweinfurt, and with over a hundred thousand followers on LinkedIn, he’s twice been recognized as a top voice for data science and analytics on the platform. Today’s episode will appeal primarily to technical listeners, particularly to data scientists that are keen to develop ever more critical data engineering skills.

00:01:44

In this episode, Andreas details what data engineering is and how it relates to adjacent fields like data science, software engineering, and machine learning engineering. He talks about why data engineering skills become increasingly essential to data scientists and data analysts with each passing year, what sets Senior Data Engineers apart from junior ones, his general process for tackling data engineering problems, and the must-know data engineering tools of today, as well as the emerging ones you should not miss. All right, you ready for this excellent episode? Let’s go.

00:02:21

Andreas, willkommen, welcome to the SuperDataScience Podcast. It’s an honor to have you here. Where in the world are you calling in from?

Andreas Kretz: 00:02:29

Hi, Jon. I’m calling from Germany from Northern Bavaria where I live with my family, wife, and the two kids, in a small town.

Jon Krohn: 00:02:40

Yeah. Leading up to recording with you, we almost had this amazing coincidence that I was going to be in Bavaria and able to record with you in person. But we just narrowly missed: the time that I’m there, you’re not available for recording, and so we’re doing a remote session as usual. But yeah, it’s really interesting. I mean, I haven’t traveled off of North America since the pandemic started. And so, you know, these incredibly unlikely odds, and it didn’t work out anyway.

Andreas Kretz: 00:03:16

Yeah, that happens sometimes.

Jon Krohn: 00:03:17

Well, I love that part of the world. I’ve, I’ve visited Bavaria many times and yeah, I love the food. I love the beer. I love the mountains. I haven’t yet bought a pair of lederhosen, do you have a pair of lederhosen?

Andreas Kretz: 00:03:32

I actually don’t. I actually don’t, I had one pair for going to the,

Jon Krohn: 00:03:41

To Oktoberfest.

Andreas Kretz: 00:03:42

No, well, not Oktoberfest, but we have these smaller fests here in our area.

Jon Krohn: 00:03:48

Septemberfest, the quiet run-up to Oktoberfest. Yeah, I’m joking. I’m joking.

Andreas Kretz: 00:03:56

Yeah, yeah.

Jon Krohn: 00:03:57

Nice. So we know each other through Kate Strachnyi, who’s been on the show many times. Most recently she was in episode number 651, which aired earlier this year, and she’s become a friend of mine, initially through having her on the show. But I see her in person at conferences and sometimes in New York. And I understand that you’ve known her for a while as well.

Andreas Kretz: 00:04:20

Yeah. I’ve known Kate for, I don’t know, four or five years. We actually got to know each other through a group when we did a YouTube channel that was called Data Science Office Hours. And there were just people on LinkedIn. At that time, LinkedIn wasn’t that big, and there were a few influencers there, and I was one of the small ones. And we met there. Kate wasn’t a data scientist. I wasn’t a data scientist, but we had fun. And at some point that ended and we stayed in touch. And actually I’m usually meeting her Monday, three o’clock my time. And then we…

Jon Krohn: 00:05:03

Oh, really?

Andreas Kretz: 00:05:03

We talk about…

Jon Krohn: 00:05:04

You guys meet weekly?

Andreas Kretz: 00:05:06

Yeah. We have a weekly…

Jon Krohn: 00:05:07

For, how long? Oh my goodness.

Andreas Kretz: 00:05:10

It’s fun, so.

Jon Krohn: 00:05:11

Wow. That’s so cool. I had no idea. That is a close relationship. There are not that many people I have weekly calls with. Yeah. You know, the people that I do, I feel extremely close to. Wow.

Andreas Kretz: 00:05:23

I think we never missed one since, since we started this. Like it’s, I don’t, it always works out and it’s, it’s really nice.

Jon Krohn: 00:05:32

And it makes sense to me that you two would complement each other at Data Science Office Hours because you both have expertise on the, like, edge of data science. So for her, data visualization. I mean, data visualization and data presentation are absolutely critical to being a great data scientist. But I guess it isn’t like a core skill like Scikit-learn or something. I don’t know. It should be a core skill.

Andreas Kretz: 00:06:00

Yeah, it should be a core skill because I think most of engineers, most of scientists need to present stuff and do slides and stuff. So yeah, visualization is great, is very important.

Jon Krohn: 00:06:13

Get buy-in. It’s important. And then you sit on the edge of it as a data engineering expert. And data engineering, as we’re gonna talk about a lot in this episode, is more and more critical for data scientists to have as a skill. And we’ll get into why shortly. So you teach data engineering to thousands of students through your Learn Data Engineering Academy, and then your YouTube channel has tens of thousands more people on it as well. And there’s a Discord channel as well, a Discord server. Sorry, I’m not that big into Discord, so I sometimes get the jargon wrong. So why is data engineering worth teaching and worth learning? Maybe particularly if you’re a data scientist?

Andreas Kretz: 00:07:02

Yeah. The thing is, and I get asked that a lot, because from a standpoint where you look at data science from the outside, it always seems like machine learning and all the algorithms and stuff, that’s the only thing that you need. But actually, when you talk to a lot of scientists, they also need to do the engineering. They not only analyze the data, they need to come up with some kind of automation for processing the data that leads up to actually doing the science, or automating the science, which is not perfect. For me, the scientists I know, I think they’re brilliant scientists, but the engineering is not their expertise. And so for a proof of concept it’s very important, especially for a data scientist, but at some point you need engineers who take over and basically, yeah.

Jon Krohn: 00:08:07

A way that I often think about data engineering is, and I would love to be corrected on this by you or get your opinion on this, but a way that I often describe it is that it allows data scientists to be getting refined data. So we have very, very large amounts of data being stored, more data than ever being recorded. Every 18 to 24 months or so, the amount of data being recorded at any given time point doubles. So there’s this crazy exponential amount of data, but the data are often unstructured or so vast that it’s difficult to work with. Or there could be a lot of noise amongst valuable data. And so I think of data engineers as being able to work with very large amounts of data and process it, maybe clean it, structure it so that it can be, often, in a tabular format, you know, with well-labeled columns, and maybe a smaller amount of the data: data that have been selected from the vast amount of data because, you know, we think, based on the model that we’re trying to develop, these particular rows are more likely to be valuable to the model.

00:09:32

So from potentially very vast amounts of unstructured data, or very large amounts of structured data, we’re getting down to well-defined columns that maybe have been cleaned up. And the rows are cases that we believe will have a large amount of value to the data science model that we’re training. What do you think about that explanation?

Andreas Kretz: 00:09:56

Well, yeah, that’s basically the essence. I think you hit a lot of really good points to explain what data engineering is about. How I usually explain this is you need to look at it as a journey that goes from left to right. On the left is where your data sources are. That might be an API that you have from another system, or a database, or a data source that is sending you data in large amounts. And on the right side is where your data scientist or your data analyst is. And somehow you now need to connect the scientist, the analyst, with the data that is coming in. Everything that happens in between, that’s a lot of what you mentioned: making sure the data is clean, making sure the process runs uninterrupted, making sure everything scales, transforming the data. It’s coming in in a JSON format, for instance. And then you take it and you make sure everything is right, and then you put it into a destination in a tabular form or something.
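
To make the left-to-right journey Andreas describes a little more concrete, here is a minimal Python sketch of the middle part: JSON comes in, gets cleaned and flattened, and leaves in tabular form. It is an illustration under stated assumptions, not anything shown in the episode; the event fields, the cleaning rule, and the function names are all hypothetical.

```python
import csv
import io
import json

# Hypothetical raw events, as they might arrive from a source API.
RAW_EVENTS = [
    '{"order_id": 1, "customer": {"id": "c-17", "country": "DE"}, "amount": "19.90"}',
    '{"order_id": 2, "customer": {"id": "c-42"}, "amount": "5.00"}',
    '{"order_id": null, "customer": {"id": "c-99", "country": "US"}, "amount": "7.50"}',
]

COLUMNS = ["order_id", "customer_id", "country", "amount"]


def transform(raw):
    """Parse one JSON event and flatten it into a row; return None if unusable."""
    event = json.loads(raw)
    if event.get("order_id") is None:  # assumed cleaning rule: drop events without a key
        return None
    customer = event.get("customer", {})
    return {
        "order_id": event["order_id"],
        "customer_id": customer.get("id"),
        "country": customer.get("country", "unknown"),  # fill gaps with a default
        "amount": float(event["amount"]),  # amounts arrive as strings in this example
    }


def run_pipeline(raw_events):
    """Source on the left, clean rows on the right."""
    return [row for row in map(transform, raw_events) if row is not None]


def to_csv(rows):
    """Write the rows to a tabular destination (here: CSV text in memory)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `to_csv(run_pipeline(RAW_EVENTS))` yields a header line plus one line per valid event. In a real pipeline the destination would more likely be a database or warehouse table than a CSV string.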

Jon Krohn: 00:11:15

Right, right.

Andreas Kretz: 00:11:16

And that is a large part of what an engineer needs to do. And for me, that’s also as said, that’s where data scientists can learn a lot, because if there is no engineer, then the data scientist must usually take that place and start doing that.

Jon Krohn: 00:11:33

Right. Exactly. So in smaller organizations, the data scientists have to learn some data engineering skills in order to be creating the data for themselves. You know, this kind of structured data, I guess another kind of application area. So you mentioned that the data engineer could be preparing data for a data analyst or a data scientist. I suppose another application area could be for automated dashboards even. So it isn’t necessarily going to some, to an analyst to be creating bespoke charts with bespoke analysis or a data scientist to be creating a model with, but in addition, the data engineer could be preparing data to flow fully automatically into some dashboards or reports that non-technical people will be viewing.

Andreas Kretz: 00:12:20

Absolutely. Yeah, that’s absolutely true. It could also be that you’re working on a transactional system, right? Where it’s actually the end customer. Think about Amazon. Somebody makes a transaction and then something happens within Amazon, and processes are getting started, and then the updates need to somehow get into the front end again so the user can actually access the data. And this most likely is not going to stay within one database. It’s a larger process, and multiple systems are included. And to actually then get that into the transactional database in the end, that’s where the engineers need to do a lot of stuff.

Jon Krohn: 00:13:09

Business intelligence tools are too complicated and take too much time to manage, and your team then still ends up frustrated that they don’t have access to data. It’s 2023 people, and it’s time to rethink business intelligence. Glean.io is the lightweight visualization tool that lets you define your metrics just once and then empowers everyone in your organization to explore your data visually. In episode number 653, we caught up with Carlos, the founder of Glean.io, and the super technical entrepreneur detailed how he’s built a platform to democratize data analytics. Sign up and get started for three months free at glean.io, when you tell them you heard about Glean from the SuperDataScience podcast.

00:13:56

Nice. So within the field of data engineering, surely there are kinds of entry level data engineers, junior data engineers, senior data engineers. How do the, like, what kinds of competencies do these various levels have? Yeah, what’s the difference between entry level and senior data engineering?

Andreas Kretz: 00:14:14

For juniors, we have to think: as a junior, you are most likely very limited in what you can do and what overview you have of all the tools that are required, and the experience of working with them. So usually as a junior, you come in and you have a few core tools that you are working with, and the jobs that you are doing are very narrow. Like, okay, we have this bigger project. Now you as a junior come in and you take the first transformation step and you do this, and then the next one, somebody else does the other stuff. And so as a junior, you are most likely very limited in what you’re doing from the whole pipeline. And as you progress to having more experience, knowing more tools, being more hands-on with the actual data that is coming in, that’s where you then make that transition into an associate or full professional role, right?

00:15:25

Where you maybe have a full pipeline that you manage, or multiple ones, where you are doing more of the conceptual phase of everything. Okay, we have a project, this needs to be done. The data looks like this in the beginning, this is at the end. How can you model the data? Or you don’t even know how it’s going to look at the end. Then the professional is going to know, okay, we’re going to model it in that way and put it there, where the junior usually can’t do that. Right, right. You still need to learn. And it’s good if you have somebody who has experience and who you can learn from.

Jon Krohn: 00:16:09

So I guess it’s similar to almost any kind of role where these senior people are able to take an abstract problem or something that’s described to them by a data scientist or management or something, or maybe even identify proactively some kind of solution, and they can be architecting it, and they know from their experience how to break this into different parts and roughly how long each part will take, what the complexity is, how to approach it. Whereas the junior data engineer is more getting handed those pieces of work and being told, okay, junior data engineer, here’s your part of the problem for today. Let’s see how you get on with that today and let me know tomorrow.

Andreas Kretz: 00:16:46

Exactly. Yeah. Yeah. Also the, as you mentioned, the communication with other stakeholders in the whole process. That’s where the professional then needs to do a lot of work.

Jon Krohn: 00:17:00

Yeah. The more senior one.

Andreas Kretz: 00:17:01

Yeah. Or the more senior one. Yeah. Yeah. And then when you look at the senior, senior data engineer, that’s usually where you then get into the role of, okay, it’s not just you. We have a team. You basically help the whole team work on your projects, work on your goals.

Jon Krohn: 00:17:20

Nice. So what kind of person, becomes a data engineer instead of say a data scientist or a data analyst? Are there, like, in your experience, are there innate skills or interests that make somebody more likely to do this kind of data engineering role than other options in the field?

Andreas Kretz: 00:17:38

There’s a question I get asked often, like, should I become a data scientist or should I become a data engineer? And I think that fits very well to that question. And I usually try to answer this by pointing people towards, what’s your interest? Are you more interested in computer science, coming from the software development side, or are you more into analytics, statistics, the mathematical stuff? You know, then most likely you’re going to want to become a scientist. I’m coming from a classical computer science background, and that’s usually the role where the engineer is working a lot with tools, a lot with configuration, a lot of software development. And so that’s where you know which direction you should go.

Jon Krohn: 00:18:37

Okay. All right. Then here’s a twist on that question. So if it’s these kind of people with computer science backgrounds that make data engineers makes perfect sense to me, but then why choose say, data engineering over machine learning engineering? Is there a difference between these two kind of people? Cuz both of those cases, they’re, that’s more like the computer science. So where the data engineer, and again, feel free to correct me if you don’t think these definitions are appropriate, but in kinds of broad strokes, the data engineer can be working on the data pipelines to clean up the data and provide those to the data scientists. But then once the data scientist has created the model, then these model weights are passed off to the machine learning engineer to productionize them and make sure that they’re performing in production. So yeah.

Andreas Kretz: 00:19:24

So yeah, for this, and I did this, when was that, end of last year, I did this sheet where I actually put in the roles from junior to senior, and I also added machine learning engineer there. And how I see this is at some point you most likely are going to make that shift. So let’s say you are a professional data engineer, you have learned data engineering, you have some experience, then maybe you wanna make that shift towards ML engineer or towards architect, data architect or platform architect. Right? So it’s more of a specific, is it called specification? No, it’s..

Jon Krohn: 00:20:09

Yeah, a specialization.

Andreas Kretz: 00:20:11

A specialization, yeah. It’s more a specialization, I think, than immediately going out wanting to become an ML engineer.

Jon Krohn: 00:20:21

That makes a lot of sense. That’s a nice definition. All right. So back to data engineering proper, what kinds of tools should data engineers be using regularly?

Andreas Kretz: 00:20:32

So when you look at the pipeline of how the data is going to be processed, where is it coming from, where is it going, a lot of what a data engineer needs to know is relational databases. I know it sounds very boring, man, these relational databases have been here for 30 years, but when you look at where the data is coming from, very often it’s still coming from relational databases. APIs are something that an engineer needs to know and needs to get comfortable with, because either you are going to use an API from an external source, or you need to create an API to actually serve the data internally. That’s for the source and, basically, the data ingestion. Sometimes people work a lot with data ingestion tools or ETL tools like Talend or Hevo or stuff, or also tools on cloud platforms.

00:21:41

And then coming to the processing frameworks, very often you find something where you need to process the data. Either this is something serverless, which you find very often for simple jobs. You find something like Lambda functions on AWS, where you just create a small function and the data that is coming in at once isn’t that big, so that’s absolutely fine. Or it’s something bigger, where you have a framework that allows you to process data in parallel, like Apache Spark, or you are going on to a platform like Databricks not… Yeah, Databricks, sorry. Where you can then actually leverage that, and at the end maybe a NoSQL database or a data warehouse, depending on your goal. But these are usually the categories: something for data ingestion, maybe a streaming tool like Kafka, I don’t know, or Kinesis, then something for processing, and something for storage. And that’s usually what you see.
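
As a rough illustration of the "small function for a simple job" idea, here is a sketch in the shape of an AWS Lambda handler in Python, which receives an event and a context object. Everything about the event is an assumption made for this example: the "records" key, the sensor fields, and the conversion are hypothetical, and a real trigger (S3, Kinesis, API Gateway and so on) delivers its own event structure.

```python
import json


def lambda_handler(event, context):
    """Entry point in the (event, context) shape used by Python Lambda functions."""
    # Assumption: the trigger hands us a dict with a "records" list of JSON strings.
    cleaned = []
    for raw in event.get("records", []):
        record = json.loads(raw)
        if record.get("temperature_c") is None:
            continue  # skip readings without a value
        cleaned.append({
            "sensor": record["sensor"],
            "temperature_f": record["temperature_c"] * 9 / 5 + 32,
        })
    # A real function would write the rows to storage; here we just return them.
    return {"processed": len(cleaned), "rows": cleaned}
```

Because the handler is a plain function, it can be called locally with a hand-built event and `None` as the context, which is a convenient way to test the transformation before deploying it.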

Jon Krohn: 00:22:46

Nice. So relational databases, of course. And that means things like SQL?

Andreas Kretz: 00:22:50

Yeah, yeah, yeah. Well, I always say relational databases cuz whenever I say SQL, people come back to me and say, it’s just a query language, that’s not a database. Yes, I learned my lesson.

Jon Krohn: 00:23:07

But yes. We’re interacting often with the relational database using SQL. So yes. So that makes sense. Storing the data in relational databases, and this kind of goes back to the point that I was making near the beginning of this episode, where ideally, for a data scientist to be most easily able to work with data, or to be creating data dashboards, or for a data analyst to be able to work with the data, having the data structured into clear columns in this kind of relational database is key. And yeah, we’re, you know, like you’re saying.

Andreas Kretz: 00:23:37

Well, it doesn’t necessarily need to be a relational database. When you look at the tools that are out there right now, let’s say your data is coming in as JSON, and then you put it into files and you make these files bigger and bigger and bigger. And then you take a tool like Snowflake and basically turn these files into tables on the platform. So somebody who’s coming in could actually work with SQL on top of unstructured or non-tabular data. So that’s what you see a lot.
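
The idea of querying JSON files with SQL can be tried on a small scale. The sketch below is only a stand-in: it uses SQLite from the Python standard library rather than Snowflake, and the file contents, table name, and column names are hypothetical. It shows the general pattern of loading JSON lines into a table so that someone can work with SQL on top of them.

```python
import json
import sqlite3

# Hypothetical contents of a JSON-lines file that has been growing over time.
JSON_LINES = """\
{"user_name": "ana", "event": "login", "duration_ms": 120}
{"user_name": "ben", "event": "login", "duration_ms": 340}
{"user_name": "ana", "event": "purchase", "duration_ms": 910}
"""


def load_json_lines(conn, text):
    """Turn each JSON line into one row of an 'events' table."""
    conn.execute(
        "CREATE TABLE events (user_name TEXT, event TEXT, duration_ms INTEGER)"
    )
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Named placeholders are filled from the keys of each parsed JSON object.
    conn.executemany(
        "INSERT INTO events VALUES (:user_name, :event, :duration_ms)", rows
    )
    conn.commit()


def events_per_user(conn):
    """Plain SQL on top of data that started life as JSON."""
    cursor = conn.execute(
        "SELECT user_name, COUNT(*) FROM events GROUP BY user_name ORDER BY user_name"
    )
    return cursor.fetchall()
```

With an in-memory connection from `sqlite3.connect(":memory:")`, loading the sample and calling `events_per_user` gives one count per user. Warehouse platforms do this kind of mapping at far larger scale and with their own mechanisms, which this sketch does not try to reproduce.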

Jon Krohn: 00:24:19

Nice. Excellent. And so yeah, so relational databases makes perfect sense to me. Obviously seems like a core skill in data engineering. APIs, we can talk about that a little bit more for our listeners in case they aren’t aware of what they are. So application programming interfaces, this is like a way of interacting with computers in the same way that a point-and-click user interface is, but with an API, instead of it being point and click in your browser, it’s at the command line. And so you can make requests to an API. The API is often online and it’s just kind of sitting there waiting for you to make some request. And so you provide a request to it in a very specific way and it can return some information to you.

00:25:08

And it’s interesting you highlighted there for me something that I don’t think about myself often enough, which is that data engineers are often tasked with creating these APIs for internal use. So APIs for external use, these have to have like a lot of documentation, be very robust. So for example, somebody might want to take advantage of the GPT-3 model, and so they can go and use the OpenAI API to send some query to GPT-3, ask it to do some language task, and then the OpenAI API returns back to you the results of that task that you asked GPT-3 to perform. And so similarly data engineers, like you mentioned, are often tasked with internally building these same kinds of ways of interacting in the code so that the internal data analyst or data scientist or dashboard or whatever can write this standardized request to the data engineer’s API and get back some results.

Andreas Kretz: 00:26:11

Yeah, yeah. That’s exactly what happens a lot. The first case, where you’re using an external API, is very convenient, but a lot of people know that, especially if you use Apple APIs. I’ve had a few students who, at work, were requesting data from Apple APIs, and Apple is just like, okay, we’re changing this today, we’re telling nobody that stuff changed. And then, oh my goodness, you’re at the wrong end. But yeah, sometimes you maybe need some management tools, like just a database for the data scientist for some model statistics or stuff, right? So they want to use that from their code. So let’s spin up a database, let’s build up a simple data model and put an API in front of it. So the data scientist does not need to work with files and stuff, but the code is just automatically sending the data via HTTP request into your database, right? And then on the other side, maybe a user interface that uses other APIs to query that data again from that store and visualize it.
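
The internal-API scenario described here (a small database for model statistics with an HTTP API in front of it) can be sketched with the Python standard library alone. This is a toy under stated assumptions, not a production design: the table layout, the metric names, and the helper functions are hypothetical, there is no authentication or validation, and a real service would normally use a proper web framework and database server.

```python
import json
import sqlite3
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# One shared in-memory database. check_same_thread=False because the server
# thread, not the thread that created the connection, will use it.
DB = sqlite3.connect(":memory:", check_same_thread=False)
DB.execute("CREATE TABLE model_stats (model TEXT, metric TEXT, value REAL)")

# Talk to localhost directly, ignoring any proxy settings in the environment.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class StatsHandler(BaseHTTPRequestHandler):
    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        """The data scientist's code sends one statistic as JSON."""
        length = int(self.headers.get("Content-Length", 0))
        stat = json.loads(self.rfile.read(length))
        DB.execute(
            "INSERT INTO model_stats VALUES (?, ?, ?)",
            (stat["model"], stat["metric"], float(stat["value"])),
        )
        DB.commit()
        self._send_json(201, {"stored": True})

    def do_GET(self):
        """A dashboard or UI reads everything back."""
        rows = DB.execute("SELECT model, metric, value FROM model_stats").fetchall()
        keys = ("model", "metric", "value")
        self._send_json(200, [dict(zip(keys, row)) for row in rows])

    def log_message(self, *args):  # keep the example quiet
        pass


def start_server():
    server = HTTPServer(("127.0.0.1", 0), StatsHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_stat(base_url, stat):
    request = urllib.request.Request(
        base_url,
        data=json.dumps(stat).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with OPENER.open(request) as response:
        return json.load(response)


def get_stats(base_url):
    with OPENER.open(base_url) as response:
        return json.load(response)
```

After `server = start_server()`, the base URL is `http://127.0.0.1:<port>/` with the port taken from `server.server_address[1]`; `post_stat` plays the role of the data scientist's code and `get_stats` the role of the user interface on the other side.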

Jon Krohn: 00:27:25

You mentioned there how APIs can change on you, like the Apple API could change without them giving you any notifications. So it is another part of the data engineer’s job to have monitoring systems in place.

Andreas Kretz: 00:27:38

Absolutely. That’s one of the big things that you need to do. And that’s also why you need data engineering constantly, because sometimes people think this is just a one-time thing. You built that pipeline and then that’s it. Especially…

Jon Krohn: 00:28:00

Management forever.

Andreas Kretz: 00:28:01

Yeah. And especially in management. Well, 20 years ago you had that software development and you built that software, and then that’s it, right? That was it. It worked for years.

Jon Krohn: 00:28:13

You ship the CD.

Andreas Kretz: 00:28:14

You shipped the CD. Exactly. And then that was it. But nowadays, a big, big problem is actually that the data is going to change, like I mentioned with the API, but it could also be at the source, within a relational database. Somebody does some model change and tells nobody, and then you have pipelines that are attached to that data, and then something’s going wrong. You have no idea what is going wrong, and you need to start debugging and figuring it out. And for that, you need the monitoring systems. So monitoring metrics, good metrics. And also stuff like Elasticsearch or something, where you send in errors or warnings, send the JSON into Elasticsearch and then do some research, do simple search queries, and then get to that log very quickly. And that’s,
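
One simple form of the monitoring discussed here is to compare incoming records against the schema a pipeline expects and to write any mismatch as a JSON log line that a search tool could index. The sketch below assumes a hypothetical three-field schema and a hypothetical logger name; shipping the lines to Elasticsearch or any other system is left out.

```python
import json
import logging

# Hypothetical schema the pipeline was built against.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}

logger = logging.getLogger("pipeline.monitoring")


def check_schema(record, expected=EXPECTED_SCHEMA):
    """Return a list of readable problems; an empty list means the record matches."""
    problems = []
    for field, expected_type in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Fields nobody told us about are often the first sign of an upstream change.
    for field in sorted(set(record) - set(expected)):
        problems.append(f"unexpected field: {field}")
    return problems


def log_problems(source, record, problems):
    """Emit one JSON log line per bad record so it is easy to search later."""
    entry = {"level": "ERROR", "source": source, "problems": problems, "record": record}
    line = json.dumps(entry, sort_keys=True)
    logger.error(line)
    return line
```

If an upstream team renames `amount` to `total` without telling anyone, `check_schema` reports both the missing and the unexpected field, which points directly at the change instead of leaving you to debug a failing transformation further down the pipeline.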

Jon Krohn: 00:29:15

So then, so then the data engineer might have something like Slack notifications or something that comes up? If there’s some kind of issues, yeah.

Andreas Kretz: 00:29:25

Could be Slack, or it could be that you’re firing off emails. You create a, man, what’s it called on AWS? I think SNS, Simple Notification Service, where once something critical is happening, you send that to the notification service and that sends out a high-priority email to a few people. And it could also be simple dashboards, where you build dashboards and look at the data and keep that on track.
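
The notification pattern Andreas mentions can be sketched as a small function that publishes an alert to a topic. The sketch assumes a client object with a `publish` method that accepts `TopicArn`, `Subject`, and `Message`, which is how the SNS client in boto3 is called; the topic ARN, pipeline name, and message wording are hypothetical, and a fake client stands in for AWS so the example runs anywhere.

```python
def build_alert(pipeline, error, severity="critical"):
    """Build the subject and body for one alert about a failed pipeline run."""
    return {
        "Subject": f"[{severity.upper()}] pipeline {pipeline} failed",
        "Message": f"Pipeline: {pipeline}\nError: {error}",
    }


def send_alert(sns_client, topic_arn, pipeline, error):
    """Publish the alert to a topic; subscribers such as email addresses get notified."""
    alert = build_alert(pipeline, error)
    return sns_client.publish(TopicArn=topic_arn, **alert)


class FakeSnsClient:
    """Stand-in for boto3.client("sns") so the sketch runs without AWS access."""

    def __init__(self):
        self.published = []

    def publish(self, **kwargs):
        self.published.append(kwargs)
        return {"MessageId": "fake-message-id"}
```

In a real deployment the first argument would be a boto3 SNS client and the ARN would point at a topic with the on-call people subscribed. Keeping the client as a parameter also makes it easy to swap in a Slack webhook sender or a test double.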

Jon Krohn: 00:29:56

Nice. Recently, in Episode #655, Keith McCormick and I discussed how to get a profitable return on an AI project investment. To allow you to learn about Keith’s profitable project process in detail, he’s kindly providing listeners of this podcast with free access to his LinkedIn Learning course on ensuring ROI. All you have to do is follow Keith McCormick on LinkedIn and follow the special hashtag #SDSKeith. The link gives you temporary course access but with plenty of time to finish it. Getting a profitable return on your A.I. projects is the very definition of success. Check out the hashtag #SDSKeith on LinkedIn to get started right away.

00:30:39

So that was a great, yeah, introduction to APIs. And, and then, yeah, so altogether, the kinds of data engineering tool areas, this kind of, the specializations that a data engineer needs to have related to relational databases, APIs including use and creation of APIs, ETL tools, data streaming tools like Kafka you mentioned, and data monitoring really important as well. What is this? I hear a lot about Kafka. It seems like a very popular tool. What is data streaming? Like, what does that really mean? It’s like, I had this idea in my head, you’re like, okay, well it’s this, it’s data that’s constantly flowing in a stream. But how is this, when you’re talking about data streaming, how is that different from other kinds of data flows? Or is it something to do with the volume of it or the speed?

Andreas Kretz: 00:31:32

No. When you think of how data has been processed for a long, long time, it’s been processed in the form of jobs, where you created some kind of transformation or some kind of processing, and then you scheduled that processing: okay, this is going to run once every minute, once every hour, every two hours, or once a day. It does a backup job and stuff. So that was what you had for a long, long time. And what tools like Kafka, or Kinesis on AWS, or Event Hubs on Azure are bringing you is that you are starting to get event-driven. That means once something comes in, once an event gets fired off or sent off to your platform, the system is automatically going to react to that event and is going to basically feed that through your whole pipeline.

00:32:27

So there is no scheduling. Whenever something new comes in, it’s going to go through the whole thing, and that’s it. For me, that was a big shift to actually understand. That is a very strong concept, a very powerful concept. It’s not as simple, though, because it’s usually a bit difficult to debug. With a simple job, you can schedule that and very easily check the logs, but if something is running constantly and very fast and there’s a lot of data, it’s getting difficult. Another thing is that people tend to go for streaming when they don’t need it, just because it’s cool. Yeah, we have streaming, but,
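The contrast between a scheduled job and an event-driven flow can be illustrated with a small in-memory sketch. It is only an analogy: `Reading`, `run_batch_job`, and `EventBus` are hypothetical names, and a plain Python class stands in for a real message queue such as Kafka.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Reading:
    sensor: str
    fahrenheit: float


def transform(reading: Reading) -> Dict[str, object]:
    """The processing step, identical in both approaches."""
    celsius = round((reading.fahrenheit - 32) * 5 / 9, 1)
    return {"sensor": reading.sensor, "celsius": celsius}


# Batch: events pile up in a buffer, and a scheduler (cron, an orchestrator)
# would call this function once per interval.
def run_batch_job(buffer: List[Reading], sink: List[dict]) -> int:
    processed = [transform(r) for r in buffer]
    sink.extend(processed)
    buffer.clear()
    return len(processed)


# Event-driven: no schedule. Each published event immediately triggers
# every subscribed handler.
class EventBus:
    def __init__(self) -> None:
        self._handlers: List[Callable[[Reading], None]] = []

    def subscribe(self, handler: Callable[[Reading], None]) -> None:
        self._handlers.append(handler)

    def publish(self, event: Reading) -> None:
        for handler in self._handlers:
            handler(event)


readings = [Reading("a", 32.0), Reading("b", 212.0)]

# Batch path: nothing happens until the job is triggered.
batch_sink: List[dict] = []
buffer = list(readings)
batch_count = run_batch_job(buffer, batch_sink)

# Streaming path: each reading is processed the moment it arrives.
stream_sink: List[dict] = []
bus = EventBus()
bus.subscribe(lambda r: stream_sink.append(transform(r)))
for r in readings:
    bus.publish(r)
```

Both paths produce the same output here; the difference is when the work happens, which is also why the streaming version is harder to pause and inspect.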

Jon Krohn: 00:33:10

Right.

Andreas Kretz: 00:33:11

It’s on another…

Jon Krohn: 00:33:12

Yeah. Dig into that a little bit more. What are the kinds of circumstances where, you know, there’s just not enough data? There’s a hassle, there’s an overhead associated with getting a data streaming system set up. But then if there’s not a lot of data flowing through, all of that overhead was wasted.

Andreas Kretz: 00:33:30

Yeah. Exactly. That’s the thing. People set up these streaming systems because they think it’s cool. Maybe they want to communicate something to the management, that they have something new. But when you look at the consumer, how is the consumer reacting? That might not be a live dashboard or something that really needs a reaction in seconds. Maybe they run a report once a day, or once an hour. Let it be once an hour. If you simply create a job for it, it’s the same thing. But on the other side, they create the streaming pipelines. They set up the tools, they set up the monitoring for it. And because you bring in a message queue like Kafka, you then also need a processing tool that can process the data. And so it gets more complicated, although you don’t need it. Which should be the job of a data engineer, right? To say, all right, listen, we don’t need this. Don’t do this. Keep it as simple as possible. Let’s stick with a simple job for now. If we see from the scaling that it’s not going to work in half a year, then let’s figure out a solution for that later.

Jon Krohn: 00:34:53

Cool. Yeah. So it’s starting to become very clear from everything that you’ve been saying about data engineering, that it is kind of like plumbing of data science. So your podcast is called Plumbers of Data Science. And from everything that you’ve described so far including things like the Kafka stream you’ve been talking about just now, you know, in imagining these data flows the plumber of data science, the data engineer is yes, building these pipes. And many of the pipes are doing processing or cleaning. There’s metrics, reporting, monitoring happening on these pipelines to ensure that they’re healthy, the right data are flowing through in good quality. Yeah. Is there anything else that you’d like to say about that name, Plumbers of Data Science or anything about your show you’d like to talk about?

Andreas Kretz: 00:35:39

When I created this, I would need to look up when, it’s years ago. I just felt like this is the plumbing, because it’s super important. Nobody sees it. It’s usually super underrated, “eh, we don’t need this”, but it’s like literal plumbing: it’s a huge mess if you do it wrong or something messes up, right? So that’s how I got to that name. I actually started this just out of fun because I was bored driving for work every day. It was 20 minutes to work and back, so 40 minutes, and I just got a recorder and I just recorded in the car.

Jon Krohn: 00:36:23

You recorded in the car?

Andreas Kretz: 00:36:24

Yeah. I was just recording my thoughts while driving home.

Jon Krohn: 00:36:30

Wow. I’ve never heard of a show like that. I thought you were going a completely different way. You were like, oh, I have this 20 minutes in the car. And I was looking for a good data engineering podcast and nothing existed, so I made one. But no, you’re using the commute to create the episodes.

Andreas Kretz: 00:36:46

There also wasn’t, I think, a really good podcast about it, but I had a lot of stuff in my mind that I saw was missing. And so I just said, okay, today I thought a lot about key-value stores, then okay, let’s just chat about it and what I thought about it, and that’s how it started. And then I started the YouTube channel. And then,

Jon Krohn: 00:37:13

You, do you do the YouTube channel while you’re driving?

Andreas Kretz: 00:37:19

No. That’s usually this setup here, where I am most of the time. I do live streams helping people, doing Q&As. Sometimes I do a debugging session if I have a problem. Or last week we were analyzing, basically I was analyzing, but the viewers asked questions, then I was analyzing platforms. AWS has a nice YouTube channel, My Architecture. And so I’m spinning up a video and going through the video and my thoughts, and people have questions. Why did they do that? Why did they do that? And then I’m trying to tell them what my thoughts are and why they might have done this. And it’s really fun.

Jon Krohn: 00:38:06

Do you know who Mr. Beast is?

Andreas Kretz: 00:38:08

Yes.

Jon Krohn: 00:38:10

Oh, why? Seems like the most like successful YouTuber of all time. He recently valued his, he wants to sell parts of his company or raise funds or something, and was valuing his company at 1.5 billion. You reacting to those videos, it reminds me of how he has this separate channel, Mr. Beast reacts. Actually, it’s like Andreas Kretz reacts to, to AWS engineering.

Andreas Kretz: 00:38:35

Maybe I should bring on that channel, “Data Engineer Reacts” or something. Hey, let me write that down.

Jon Krohn: 00:38:43

There we go. You heard it here first on SuperDataScience.

Andreas Kretz: 00:38:47

I actually, before, before we were chatting, I watched the video, I watched a Mr. Beast video, the one where he cured a thousand people that they could see again, the newest one. I was actually watching this before, before we met.

Jon Krohn: 00:39:02

I’ve never actually watched it.

Andreas Kretz: 00:39:05

I’ve maybe I watched two, three videos, but this, yeah,

Jon Krohn: 00:39:08

I just, I read about it and lots of people have said, so I know that he kind of targets, or maybe not targets deliberately, I know he’s very popular with younger people, but apparently his videos still appeal to anyone of any age. And yeah. Anyway.

Andreas Kretz: 00:39:23

That’s very difficult for data engineering.

Jon Krohn: 00:39:26

Exactly. Data engineering for kids.

Andreas Kretz: 00:39:30

It’s, it’s a, it’s a very tight niche. Yeah, but it’s, it’s,

Jon Krohn: 00:39:37

You’ve managed to make the most of it yeah.

Andreas Kretz: 00:39:39

Yeah, it’s fun.

Jon Krohn: 00:39:39

More than 10,000 subscribers on YouTube. More than a hundred thousand followers on LinkedIn. So a lot of the data engineers in the world are following you.

Andreas Kretz: 00:39:49

I think it’s also because this is so close to software engineering and computer science that people who are bored with doing software development day in and day out, in Python or in Java or stuff, are looking for a next step, which this can absolutely be, the same as we talked about before.

Jon Krohn: 00:40:13

Yeah. So exactly as you’re saying, if somebody has this strong computer science background, but they’re like, “Oh, man, these amazing things happening with ChatGPT, I wanna learn more about AI. What’s a way that I can make a move in that direction, be useful to machine learning?”, then data engineering pops up as a really obvious choice.

Andreas Kretz: 00:40:32

But also the interesting part is, in my academy, I also have a lot of data analysts who wanna make the next step towards a role that is, that is wider, not just sitting at the end and…

Jon Krohn: 00:40:46

More lucrative.

Andreas Kretz: 00:40:48

I don’t know if it’s more lucrative. Well, it’s

Jon Krohn: 00:40:50

Yes. Data engineering relative to data analyst, probably most of that.

Andreas Kretz: 00:40:55

Yeah. It’s a bit better paid, but it’s not as well paid as data scientists or other roles. For me it’s a passion. So even if I would make more money as a scientist, I would never want to go into science. I always said that. But there are a lot of people actually coming from the outside, from a non-strictly computer science field, and they know, okay, I need to at least work on understanding how to code in Python and knowing how to work with SQL. A lot of people do that as an analyst, for instance. But these are the two things, and if you know that, you can start, then you can learn. But these are the basics. I also say that in my academy: if you don’t know how to code, it’s going to be very rough.

Jon Krohn: 00:41:48

Right.

Andreas Kretz: 00:41:48

But there are a lot of people who actually go through this. Yes, it’s like everything where you learn, it’s not easy, but people make a good career out of it. So one thing I wrote down, I’m sorry, one thing, because I don’t wanna forget it again. When we were talking about ML engineers and so on, it sounded a bit like the data engineer does not need to know about machine learning. But from my experience working as an engineer, working as a team lead for a data lab, the engineers should know at least the basics. Like, how does the actual machine learning process work, right? That first you have the training phase, where you need to make a lot of data accessible, and then later you have your application phase, where the model is trained, and the data is coming in, maybe as a stream, and then you apply the new data to your model, and that creates outputs, which go somewhere. I think that is something every engineer needs to know, needs to understand, to be able to actually work together with scientists and understand the language a bit, what they’re talking about.
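The two phases mentioned here, training on a large batch of historical data and then applying the trained model to data arriving as a stream, can be sketched as below. The “model” is a deliberately trivial stand-in (a mean and standard deviation used for a z-score check), not a real machine learning model, and `train`, `apply_model`, and the threshold are hypothetical names chosen for illustration.

```python
from statistics import mean, pstdev
from typing import Iterable, Iterator, List, Tuple

Model = Tuple[float, float]  # (mean, standard deviation)


def train(history: List[float]) -> Model:
    """Training phase: needs the whole historical dataset at once."""
    return mean(history), pstdev(history)


def apply_model(model: Model,
                stream: Iterable[float],
                z_threshold: float = 3.0) -> Iterator[Tuple[float, bool]]:
    """Application phase: score each new value as it arrives."""
    mu, sigma = model
    for value in stream:
        is_anomaly = sigma > 0 and abs(value - mu) / sigma > z_threshold
        yield value, is_anomaly


# Training: the data engineer has to make this history accessible in bulk.
history = [10.0, 11.0, 9.0, 10.0, 10.0, 11.0, 9.0, 10.0]
model = train(history)

# Application: new values come in one by one and outputs go downstream.
results = list(apply_model(model, [10.5, 25.0]))
```

The point for an engineer is the difference in data access: `train` wants everything at once, while `apply_model` only ever sees one value at a time.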

Jon Krohn: 00:43:11

Yeah. Shashank Kalanithi who is in episode number 623, he made the argument that the best data engineers that he knows are people who were previously data analysts, because they have this understanding of what the downstream user of these data pipelines is going to be interested in. So I think that kind of ties into what you’re saying here, that if you don’t know how the data are going to be used in a data science model, then you might, there could be some easy opportunities that you’re missing in the way that you create your pipelines.

Andreas Kretz: 00:43:46

Yeah. Like with a lot of jobs, it comes down to communication. And if yes, you can be maybe the, we were talking about two year olds, they’re maybe a bit more narrow from the understanding of the whole process. But as a data engineer, you should understand, okay, very often you need to understand what’s the goal and what are people doing with it? And yeah, as an analyst, you have that already.

Jon Krohn: 00:44:19

All right. So now we’ve been talking about lots of ways that you can end up in a data engineering career, but Andreas, one of your most popular videos on YouTube of all time is called The Right Path to Becoming a Data Engineer. So, Andreas, what is that one and only path? What is the right one?

Andreas Kretz: 00:44:37

Well, there are actually multiple paths when you think of tools, right? You might be on AWS, you might be on Azure, or there might be some specific tools that companies need. But in general, if you understand the process, and I think we were already hinting at that, if you understand how a platform is usually structured, what kind of tasks you need to do, and what kind of tools you need to work with, then that’s where you know what to learn and how to get to it. That you understand, okay, there are two layers. There’s your transactional layer and there’s your analytical layer. On the top, you’ll most likely find some transactional databases. And on the bottom, in the analytical layer, is usually where you have your data warehouses and stuff.

00:45:36

That you understand that, and that you know the difference between these tools. And again, going from the left, as I talked about earlier: left is where the data originates, right is where the data is used. That you understand, okay, there is an integration phase. The data is coming in, somehow it needs to be integrated. There are tools available, learn one of them. Most of the time, how to do it is the same as in a lot of other tools. Then it comes in, you maybe have something like a message queue, then you have a processing framework, and then you have a storage layer and a visualization or a connect layer on the other side. And then from these layers, pick out specific tools. Maybe if you want to go on AWS or if you want to be on Azure, then you would select the fitting tools for these platforms.

00:46:38

Or if you wanna stay open source, okay, then for an API you might use FastAPI, or Postgres for the database before that. Then you would get into Kafka and Spark for a message queue and a processing framework. And on top you might put a MongoDB, right? So this way you already have a lot of these topics covered. And then you could even say, okay, let’s do a visualization. Let’s play around with Power BI and see how Power BI would actually connect to that data, or how a dashboard would connect to the data, and how this would get shown. And then you have these important areas covered. From there, you can then say, okay, I understand how that works. Let’s see how another tool works. Let’s see maybe how this all works on AWS. And then, yeah, go from there.
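The “one tool per layer” template described across this answer can be written down as a small data structure. This is only a sketch: the `Layer` class and the purpose descriptions are illustrative assumptions, while the tool names are the ones mentioned in the conversation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layer:
    name: str
    purpose: str        # paraphrased for illustration
    example_tool: str   # tool named in the conversation


# Ordered from where data originates (left) to where it is used (right).
EXAMPLE_STACK: List[Layer] = [
    Layer("api", "accept data from sources", "FastAPI"),
    Layer("transactional_db", "store operational records", "Postgres"),
    Layer("message_queue", "buffer incoming events", "Kafka"),
    Layer("processing", "transform and clean data", "Spark"),
    Layer("storage", "hold processed data for consumers", "MongoDB"),
    Layer("visualization", "dashboards and reports", "Power BI"),
]


def tools_to_learn(stack: List[Layer]) -> List[str]:
    """One tool per layer is enough to understand the overall template."""
    return [layer.example_tool for layer in stack]


learning_plan = tools_to_learn(EXAMPLE_STACK)
```

Swapping to a cloud platform would then mean replacing the `example_tool` entries while the list of layers stays the same.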

Jon Krohn: 00:47:39

Cool. That was a great explanation.

Andreas Kretz: 00:47:42

Thanks. Because that’s also something that a lot of people do wrong. They see all these tools out there and they think they need to learn everything to become a good data engineer. No, no. You need to understand, okay, what’s the usual template? How is this usually going to work? And then once you know this, you see this everywhere.

Jon Krohn: 00:48:05

So in that explanation, you mentioned lots of different tools like FastAPI, MongoDB, and some others that we talked about earlier, like Kafka. What are your favorite data engineering tools? Or maybe an even more interesting question, what are the tools that you think our listeners should be checking out that maybe aren’t obvious?

Andreas Kretz: 00:48:30

“Maybe aren’t obvious” is actually difficult, because you’re getting very,

Jon Krohn: 00:48:36

Something up and coming, or excited about.

Andreas Kretz: 00:48:37

I’m getting very quickly into the niche market. So especially if you want to learn, if you wanna apply, I would start looking into the big platforms, right? I’m like, I’m an AWS guy, so I would go towards AWS. It’s also the platform that is used the most out there. But lately because I, back in my old work, I was working a lot with Spark, and I did trainings for, in my academy, I have trainings for Snowflake and for Databricks. And these two are, can you call them upcoming? I mean, they’re already there, right? They’re everywhere. They’re everywhere for a reason. You see them on LinkedIn and on YouTube, in YouTube videos for a reason, because they are so strong and they give you these, these options that weren’t there before in a very tight package.

Jon Krohn: 00:49:40

What’s up with DBT? We’re seeing that a lot lately. DBT seems to be kind of joining in, in a way. It seems to me like the way that Snowflake and Databricks seem to have become ubiquitous, I’m hearing DBT more and more and more.

Andreas Kretz: 00:49:54

Well, with DBT, we are actually preparing something for the academy right now. There you go, because it’s so popular… The thing is, with DBT, or let’s say it from the other side, you have your data, let’s say in a data warehouse, it might be in Snowflake. And you want to start transforming that data, but not directly in Snowflake, where you’d say, okay, I want to create code in Snowflake and then do everything there. Instead, you have something from the outside that actually executes the statements within your platform. It triggers the transformation within Snowflake, and then Snowflake does the rest. Or you don’t need to write Spark code in Databricks, but actually you write your statements in DBT, and that then does the whole processing, or kicks off the processing in Databricks, where you of course have that upside that you don’t need to learn Spark or Python, PySpark, for it. So I think that’s coming a lot also from the analysts, right? Because now there are these tools and you wanna make them as easily accessible as possible to analysts, to scientists. And that’s where tools like DBT are coming in and are very strong.
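The idea of keeping transformations outside the warehouse and only triggering them inside it can be illustrated with a toy runner. This is not DBT’s API: `MODELS` and `run_models` are hypothetical, and an in-memory SQLite database stands in for a warehouse such as Snowflake purely so the sketch is runnable.

```python
import sqlite3

# Stand-in "warehouse" (assumption: SQLite instead of Snowflake or Databricks).
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL);
    INSERT INTO raw_orders VALUES (1, 'ada', 20.0), (2, 'ada', 5.0), (3, 'bob', 7.5);
""")

# Transformations live outside the warehouse as named SELECT statements.
MODELS = {
    "customer_totals": """
        SELECT customer, SUM(amount) AS total_amount
        FROM raw_orders
        GROUP BY customer
    """,
}


def run_models(conn: sqlite3.Connection, models: dict) -> None:
    """Materialize each model as a table; the database engine does the actual work."""
    for name, select_sql in models.items():
        conn.execute(f"DROP TABLE IF EXISTS {name}")
        conn.execute(f"CREATE TABLE {name} AS {select_sql}")
    conn.commit()


run_models(warehouse, MODELS)
totals = warehouse.execute(
    "SELECT customer, total_amount FROM customer_totals ORDER BY customer"
).fetchall()
```

The runner never touches the rows itself; it only sends statements, which mirrors the point that the warehouse “does the rest” and that someone who knows SQL can define transformations without writing Spark code.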

Jon Krohn: 00:51:27

So it’s like an abstraction layer on top of a data warehouse that allows you to kind of have this standardized syntax for doing work across lots of different data warehouse types.

Andreas Kretz: 00:51:39

I would, yeah, I would… You could call it that. Yeah. It’s like, yeah, an abstraction layer.

Jon Krohn: 00:51:47

In recent conversation, you’ve mentioned things like AWS being the cloud platform that you use the most. The other big ones out there are Azure, GCP. Is there any reason why a data engineer should choose one or the other? I guess if they’re getting started, AWS might be the most obvious choice since it’s the most broadly applicable. Yeah. Maybe just give us some context on your thoughts about these different cloud platforms. You know, should people be getting certificates in any of them or all of them?

Andreas Kretz: 00:52:20

So yes, you can say that AWS is the one that is most used out there. So it’s a fairly safe bet to go with AWS. On the other hand, GCP or Azure are all the same safe bets, because once you understand how it works on GCP, you are going to know how to get into AWS. For instance, how I did that with my students before is, I’m telling them, do some research, because it’s not always that AWS is the one to go for. For instance, I had a student from Scandinavia, and he did research, and he actually found out that in the industries that he wanted to go into, in Scandinavia, they are using Azure the most. So then he started actually targeting Azure. Also, very often if you go into larger companies, into large corporations, they might be interested in Azure because they’re already in the Microsoft ecosystem.

00:53:26

They already have their Office, they have their SharePoint, and everything. And for them, it’s very tightly knit in, with single sign-on and everything. So it depends a bit on where you want to go. If you’re in the startup sector, then you most likely wanna look at AWS or maybe at GCP, but the larger companies, yeah. So that’s one thing. It’s not as easy as saying, okay, go with AWS and that’s it. It needs a bit of research. And certificates. I actually did a poll on LinkedIn a few weeks back. Let me quickly remember. I asked why you should do certifications: for the job, for getting a job, for actually learning, or something else. And the interesting part was that most people actually said for learning, for education, because you train yourself up to that certification, and then you have a very specific or proven knowledge in these areas. So that’s where people use certifications the most. Of course, there are some companies that need a certification for Azure. If you are doing client work and you’re charging a lot, maybe you actually need to have people certified. But yeah, people tend to see this more as an educational tool.

Jon Krohn: 00:55:12

Cool. Yeah, that makes a lot of sense to me. Nice. So we now have a great sense of what data engineering is and what the favorite tools are and how we can be doing it in cloud platforms. Where do you think the future of data engineering is going, Andreas? Are dataset sizes going to get smaller?

Andreas Kretz: 00:55:34

No, and there are also going to be more uses of data. So it’s here to stay. It might change here and there a bit. As you said, there are also ML engineers, and there’s this work towards taking care of data that is in the data warehouse. Some people call this analytics engineer. So that’s where the data engineer is moving. The data engineer is moving more towards what I mentioned before, the right side, towards the destination or the consumers. But it’s also on the left side, the data integration. That’s where a lot of stuff is changing right now, where you have tools that are actually making it very easy to ingest data from sources. So as an engineer, you don’t have to configure something and then figure something out for the 20th time to integrate a simple database or a simple API. And this kind of automation, this kind of helper work, that’s what’s coming in the next years as well. So these two areas, I think they’re very important.

Jon Krohn: 00:56:49

I like that. Analytics engineer is not a specific career term that I’ve heard before, but it makes a lot of sense.

Andreas Kretz: 00:56:56

Yeah. I’m always a bit cautious with, with career names, right? But there’s a lot of shift or, or moving right now. But that’s the general area where we as engineers need to be a bit more, more into,

Jon Krohn: 00:57:16

In a recent episode, number 653 with Carlos Aguilar, he described effectively that role, I think he was kind of doing for many years. He was at a cancer data startup managing the data insights engineering team. And I think this sounds very similar to analytics engineering. It, yeah, maybe that’s kind of like another way of describing it. Data insights engineering. It’s like, yeah. On the far right of your pipeline. Yeah. Helping with getting the data be easily digestible. Analyzable.

Andreas Kretz: 00:57:53

Yeah. Absolutely. That sounds like it, because we’re already past that stage where as an engineer you say, okay, I’m going to take the data, I’m going to do some processing, do the transformation, and I’m going to drop it into a staging table, and then somebody else is going to do that. Okay, job done, next. I think we are past that. We need to move towards modeling the data in the warehouse or in the destination, and actually making it easy to process or to use in that later stage.

Jon Krohn: 00:58:33

Nice. Yeah. And you mentioned the past there. Let’s go into your past a little bit, Andreas. So before your career as an educator, you worked as a data engineer at well-known German companies like Bosch. So what can you tell us about your early career, maybe some challenges that you encountered or aspirations you had, and in particular, I’m interested in what prompted you to transition to running your Learn Data Engineering Academy full-time?

Andreas Kretz: 00:59:05

This history actually leads a bit into how I got into engineering, because I didn’t start out saying, okay, I wanna become a data engineer. I actually was working on a project where a lot of machine data was coming in. And the problem was that with the tools that I learned back at university and before, I couldn’t process it. It was very clear that once this goes live, nothing’s going to work. Everything’s going to break down. And then I was starting to actually look at what the alternatives are, and big data was the big thing back then. And that’s how I stumbled into that role of a data engineer. Okay, let’s figure out a solution for this. Which type of tools can we actually use there? What are the upsides of NoSQL databases?

01:00:02

How can we actually manage that? Because a lot of machine learning was also within that project, or was part of the goal of that project. How can we incorporate that? How can we make this data useful? And that’s how I got into that and how I started with this. And from there, stuff grew and I did a lot of things. Then I was the team lead of an engineering team, and then for some time had the data lab with data scientists under me as well. But throughout the one and a half years before I jumped off, I actually was already teaching people through my academy. It wasn’t called Learn Data Engineering back then, but I was doing coaching. I helped people become data engineers. And yeah, when Covid hit, I started the academy on the side with the additional time that I had, because I didn’t need to travel.

01:01:06

And yeah, for me, I liked the coaching. I liked the teaching. I liked recording stuff for people that actually helped them. And then I made the jump. It wasn’t a big decision, or it wasn’t a very difficult decision for me, because in the years before that, I always had health problems. I usually don’t talk about this, but I have this chronic disease. You maybe know this, it’s called colitis. There’s colitis and there’s Crohn’s, which are, how are they called?

Jon Krohn: 01:01:49

Yeah, it’s my, I have the same name. Crohn’s Disease.

Andreas Kretz: 01:01:52

Okay.

Jon Krohn: 01:01:53

That’s the, so that’s, yeah. It’s like the, yeah. So that’s when people like, or how do I pronounce your name? Well, I say, you know, like colitis.

Andreas Kretz: 01:02:05

So that’s basically it, and it was getting worse and worse. And so I said, okay, I need to do something. And then I was fortunate enough that my employer said, okay, then let’s try to get you healthy again. You can do a year of sabbatical and see what you want to do after that. And I worked on the academy, had fun teaching people, and this led somehow into, okay, I want to do this full-time. I wanna help people. It feels good. It feels right. And that’s how I made this transition. And I don’t regret it. It’s a great job, actually being able to help people and getting the feedback, “Andreas, I made that switch to data engineer”. Or I had a student in the coaching who was actually a data scientist who had problems with the engineering at work. So we figured that out back then. I love that kind of work.

Jon Krohn: 01:03:12

The Learn Data Engineering Academy is doing exceptionally well, and yeah, your audience is enormous and growing quickly, so it does really seem to me, Andreas, like the right fit as well. All right, so Andreas, we’ve learned a lot from you today. As we reach the end of an episode, I always ask our guests for a book recommendation. Do you have one for us?

Andreas Kretz: 01:03:35

I actually have one that’s Zero to One from Peter Thiel. And we’ve, yeah, I actually, I have that here. So we were talking about that before because the title, that was something that was very influential for me. I actually, I read that years back. I can’t even remember exactly what was in that book, but back in the day, I was working basically on innovation. I was working on from zero to one, which that is something that a lot of people need to understand. That’s a big step. And I did that with my academy. That was, first there was no academy, and then I figured out what could I do? How could I structure this for people to actually learn this? How could I structure this for beginners, for scientists, for analysts, and so on?

01:04:34

And that, that process is very, very important. And that’s why I like this book, and especially the title, because people should try to incorporate that into their life. If it’s work, if it’s private, like do something, something new. Try to come up with something that is innovative. And that’s, yeah, that’s what a lot of people within companies think they’re doing, they might not be doing. And because of that, you, that idea zero to one, am I really doing zero to one, or am I just doing another iteration? So that’s something I highly recommend to read, yeah.

Jon Krohn: 01:05:18

It’s a great book. I liked it a lot. It’s been many years since I’ve read it as well, but it’s content rich. A lot of non-fiction works, have a lot of fluff.

Andreas Kretz: 01:05:31

I actually, I’m going to keep that here. I’m going to read that again. So nice. Maybe I learn something new again.

Jon Krohn: 01:05:40

I bet. Yeah. Completely new perspective, I imagine many years later.

Andreas Kretz: 01:05:44

Yeah.

Jon Krohn: 01:05:44

Well, Andreas, clearly a lot of wonderful insights can be gleaned from you. We would love to know how we can be following you after the show to get more of your insights going forward.

Andreas Kretz: 01:06:02

The biggest platform, where I’m at the most, is LinkedIn. That’s where you find me the most. Also, you can comment under one of my YouTube videos, I always answer the comments. And I have an Instagram. That’s also something I started, a Learn Data Engineering Instagram, where I’m usually doing Q&As. I just post “Ask me a question”, and then people have very, very interesting questions and I’m getting back to them and helping them. I think that has over 10,000 followers now. So…

Jon Krohn: 01:06:35

Yes, you do.

Andreas Kretz: 01:06:37

Yeah, these three things, those are the platforms where I’m at the most.

Jon Krohn: 01:06:44

All right, Andreas, well, thank you very much for joining me on the show. It’s incredible to have somebody that Kate knows so well and for so many years, I can’t believe that we hadn’t had a conversation earlier. It’s been great to get to know you on air and I look forward to doing it again in the future.

Andreas Kretz: 01:07:01

Yeah, thanks, it was a really great time talking to you, and yeah, thanks for having me on.

Jon Krohn: 01:07:12

I hope you loved today’s deep dive into data engineering with the leading data engineering educator, Andreas Kretz. In today’s episode, Andreas filled us in on how data scientists depend on data engineers for their model training data, while data analysts depend on data engineers producing clean, structured data for them to work with. He talked about how data engineers are closer to computer science while data scientists are closer to math and statistics, and how data scientists can become more capable and independent by developing data engineering skills such as expertise with relational databases, the use and creation of APIs, ETL tools, data streaming tools like Kafka, and data monitoring. And finally, Andreas told us about how Snowflake, Databricks, and DBT are becoming essential tools for data engineers to know. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Andreas’ social media profiles, as well as my own social media profiles at www.superdatascience.com/657. That’s www.superdatascience.com/657.

01:08:10

If you enjoyed this episode, I’d greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel, and of course, subscribe if you haven’t already. I also encourage you to let me know your thoughts on this episode directly by following me on LinkedIn or Twitter and then tagging me in a post about it. Your feedback is invaluable for helping us shape future episodes of the show. Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience podcast episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another excellent episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors, whom I’ve hand-selected as partners because I expect their products to be genuinely of interest to you. Please consider supporting this free show by checking out our sponsors’ links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast.

01:09:11

All right, and thanks of course to you for listening. It’s literally why I’m here. Until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.

