79 minutes
SDS 595: Data Engineering 101
Joe Reis and Matt Housley, co-founders of Ternary Data and co-authors of the book “Fundamentals of Data Engineering” join us for an in-depth episode that dives into the major undercurrents across the data engineering lifecycle, their top tools and techniques, and the key components of their new book.
About Joe Reis
Joe Reis is a “recovering data scientist” and the co-founder & CEO of Ternary Data. He’s a business-minded data nerd who’s worked in the data industry for 20 years, with responsibilities ranging from statistical modeling, forecasting, machine learning, data engineering, data architecture, and everything in between. He hosts the popular data show, The Monday Morning Data Chat, interviews top professionals on The Data Nerd Herd podcast, and runs several meetups. Joe also teaches at the University of Utah and is the co-author of the O’Reilly book, Fundamentals of Data Engineering. When he’s not busy running a company, teaching, or creating content, Joe often finds himself rock climbing or trail running in the mountains around Salt Lake City, Utah.
About Matt Housley
Matt Housley is the CTO and co-founder of Ternary Data. He's both a “Recovering Data Scientist” as well as a “Reformed Academic,” holding a PhD in Math and dual Masters degrees in Math and Physics. His STEM background in combination with his knack for teaching makes him a mastermind at overhauling processes, improving teamwork, and incorporating engineering best practices so that real value is delivered to companies. While making the journey from data scientist to data engineer, Matt began to focus more on data & cloud engineering, working extensively with Amazon Web Services, Google Cloud Platform, Containers, Apache Airflow and GPUs, among other technologies. Matt is an adjunct faculty member in the Math Department at The University of Utah. He is the co-author of the O’Reilly book, Fundamentals of Data Engineering.
Overview
As the authors of the new O'Reilly book 'Fundamentals of Data Engineering,' Matt and Joe set out to bring greater definition to the data engineering role and, in turn, amplify the success not only of data engineers but also of complementary roles like data scientists.
Ahead of a deep dive into the field, it was only natural to kick off the episode by defining the term 'data engineering' and all it entails. According to the authors, it involves "the development, implementation, and maintenance of systems and processes that take in raw data and produce high quality, consistent information, that supports downstream use-cases."
But if you're new to data science, how do you choose between data scientist and data engineer roles? Matt has generally discovered that pure mathematicians are more drawn to data engineering, while applied mathematicians and statisticians gravitate toward data science.
While diving deeper into their book, and the responsibilities related to data engineering, Joe and Matt explored the major undercurrents of the role, which involve security, data management, data ops, data architecture, orchestration, and software engineering. And when it comes to the most under-utilized tool in a data engineer's toolbox, Joe and Matt agreed that communication is absolutely vital for success. "Communicating with upstream stakeholders and downstream stakeholders are both insanely important," says Joe.
Tune in to hear Matt and Joe tackle many more essential topics in the field of data engineering, including latency tradeoffs and their top tools and techniques.
In this episode you will learn:
- What is data engineering? [3:55]
- Why Joe and Matt identify as “recovering data scientists” [6:12]
- What kinds of people tend to become data scientists vs. data engineers [10:38]
- Key components of Joe and Matt’s book [26:31]
- Major undercurrents across the data engineering lifecycle [28:26]
- The most under-utilized tool in a data engineer's toolbox [34:39]
- How there are tradeoffs in any data pipeline latency considerations, but faster is typically the default assumption [38:55]
- Joe and Matt’s favorite data engineering tools and techniques [43:39]
Items mentioned in this podcast:
- Fundamentals of Data Engineering by Joe Reis and Matt Housley
- Ternary Data
- Data Science Insider
- How To Move From Barely Doing BI to Doing AI // Joe Reis // MLOps Meetup #45
- Snowflake
- Databricks
- Azure
- AWS Sagemaker
- GCP Vertex AI
- Keras
- Scikit-Learn
- Designing Data-Intensive Applications by Martin Kleppmann
- The Gray Rhino by Michele Wucker
- Poor Charlie's Almanack by Charles T. Munger
Podcast Transcript
Jon Krohn: 00:00:00
This is episode number 595 with Joe Reis and Matt Housley co-founders of Ternary Data and co-authors of the book, Fundamentals of Data Engineering.
Welcome to the SuperDataScience Podcast, the most listened to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn, thanks for joining me today. And now, let's make the complex, simple.
Welcome back to the SuperDataScience Podcast. For today's episode, we have not one guest, but for the first time ever, two guests, they are Joe Reis and Matt Housley, two peas in a pod. They co-authored the brand spanking new book, Fundamentals of Data Engineering, a tremendously ambitious book that was published by O'Reilly and is poised to be a best seller. They are also co-founders of the data architecture and data engineering consultancy Ternary Data.
Joe is CEO of the firm while Matt is CTO. In addition, they do also have their own separate lives. So Joe is an adjunct professor at the University of Utah and chairman of Utah Python. He previously founded several tech companies and has held both software engineering and data science roles. He holds a math degree from the University of Utah. Matt also holds a degree in math from the University of Utah, but in his case, it's a PhD. He worked as a professor before becoming a data scientist in industry.
Today's episode will appeal primarily to technical experts like data scientists and data engineers, but will also be of interest to anyone who manages technology projects that involve data flows.
In this episode, Matt and Joe detail why they identify as recovering data scientists. What kinds of people tend to become data scientists versus what kinds tend to become data engineers. Key components of their book, such as latency trade-offs and the six data engineering undercurrents. Their favorite data engineering tools and techniques. What the live data stack is and how it's putting various data professional titles on a collision course. And they talk about the biggest data engineering problems firms face and how to fix them. All right, you ready for this fun yet practical episode? Let's go.
Joe and Matt, welcome to the SuperDataScience Podcast for the first ever three-party, three-way SuperDataScience experience. So welcome to the program. It's an honor to be having this first with you two. So we kind of broke the experience then. Yep, yep. So for listeners who are listening to the audio-only version, the really deep sultry sound is Matt.
Matt Housley: 00:03:06
My voice is kind of like that anyway, I guess, but I was hanging out at a bar last night at a data engineering event actually, and I had lost my voice. So this is the aftermath of that, I guess. So it sounds like I've been smoking for 20 years or something.
Jon Krohn: 00:03:19
Yeah. Alcohol-
Joe Reis: 00:03:20
He's such a party animal.
Jon Krohn: 00:03:21
Yeah. If you became an alcoholic, it would do wonders for your radio voice. That would be-
Matt Housley: 00:03:25
Oh, fantastic, so take up smoking and alcohol. Yeah. Perfect.
Jon Krohn: 00:03:30
And then yeah, Joe also has a very nice voice. It is distinctly not as deep as Matt, so it should be easy for listeners to pick up on who is speaking.
Joe Reis: 00:03:42
I'm still developing. So just got a few years.
Jon Krohn: 00:03:48
That explains-
Matt Housley: 00:03:48
Joe's 18.
Jon Krohn: 00:03:48
There's facial hair differences as well.
Joe Reis: 00:03:48
Yeah. My voice will crack once in a while. So yeah, don't worry about it.
Jon Krohn: 00:03:53
Okay guys, so this episode is all about data engineering. What is data engineering?
Matt Housley: 00:04:00
So data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases. For example, analysis, machine learning, data science, reporting, operational analytics, very typical use cases inside of business, in fact.
Jon Krohn: 00:04:20
Nice. Joe, you got anything to add to that?
Joe Reis: 00:04:24
Yeah, I mean, I think to kind of summarize that, a data engineer takes data from source systems and makes it useful for downstream use cases and users like data scientists.
Jon Krohn: 00:04:36
So we're taking relatively raw data and then having it automatically processed in a pipeline. And so that it can be used downstream for analytics, business intelligence, for machine learning anywhere downstream, where we need to have the data cleaned up in some way, maybe merged together. You know, you're going to often have lots of different raw data sources that need to be merged together. Does that sound like a reasonable paraphrase?
Matt Housley: 00:05:03
[inaudible 00:05:03] That's right.
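As an illustration of the paraphrase above (cleaning raw sources and merging them for downstream use), here is a minimal Python sketch. It is not from the episode or the book: the two "raw sources", their field names, and the cleaning rules are all assumptions invented for the example.

```python
from collections import defaultdict

# Hypothetical raw inputs; field names and values are invented for illustration.
RAW_CUSTOMERS = [
    {"email": " Ana@Example.com ", "name": "Ana"},
    {"email": "bo@example.com", "name": "Bo"},
    {"email": "", "name": "No Email"},
]
RAW_ORDERS = [
    {"customer_email": "ana@example.com", "amount": "19.99"},
    {"customer_email": "ANA@example.com", "amount": "5.01"},
    {"customer_email": "bo@example.com", "amount": "not-a-number"},
]


def clean_email(value):
    """Normalize an email so the two sources can be joined on it."""
    return value.strip().lower()


def build_customer_totals(customers, orders):
    """Clean both sources and merge them into one row per customer."""
    totals = defaultdict(float)
    for order in orders:
        try:
            amount = float(order["amount"])
        except ValueError:
            continue  # skip orders whose amount cannot be parsed
        totals[clean_email(order["customer_email"])] += amount

    merged = []
    for customer in customers:
        email = clean_email(customer["email"])
        if not email:
            continue  # drop customer records with no usable join key
        merged.append(
            {
                "email": email,
                "name": customer["name"],
                "total_spent": round(totals.get(email, 0.0), 2),
            }
        )
    return merged


if __name__ == "__main__":
    for row in build_customer_totals(RAW_CUSTOMERS, RAW_ORDERS):
        print(row)
```

Silently skipping bad rows keeps the sketch short; a real pipeline would more likely log or quarantine them so that upstream owners can fix the source.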
Joe Reis: 00:05:04
And I'll add to that too. So data engineers really serve the purpose of flipping the funnel on its head of what a data scientist is expected to do. So there's kind of this old trope about how data scientists spend 80, 90% of their time getting data, cleaning data, munging data, all this kind of stuff. Really, that should be the job of a data engineer, which means that the data scientists should in theory spend 80, 90% of their job doing the things that they're trained to do, which are data modeling, algorithms, analysis, machine learning and so forth. So it's really serving the interests of the data scientists, if data engineering is done correctly.
Jon Krohn: 00:05:48
Yeah. And data scientists would love to be in that position.
Matt Housley: 00:05:52
Exactly. That's the ideal. Yeah. And that's part of what we're advocating for with our book. Like we need to clearly define this role and what the purpose of this role is, so that data scientists can be more successful in their jobs, and machine learning engineers and analysts and so on.
Jon Krohn: 00:06:07
Nice. Well, let's get into that a little bit. So we're talking about data engineers and data scientists. You two both identify as recovering data scientists. So what does that mean? Where did that come from? And is there anything that you guys miss from being data scientists?
Joe Reis: 00:06:28
Yeah. I'll kick it off. And I think Matt's, he's got his own take on this, but so I've been in data for a long time. And I would say kind of when the term data science started getting popular in the early 2010s, what I'd noticed and what I experienced was getting hired for data science type of jobs, or seeing friends get hired for data science jobs, and then spending most of their time doing non-data science things. So back then data science, I think, was much more closely aligned with machine learning. You know, maybe today it's broadened its scope a bit, but the point remains that there just... What I noticed is there wasn't a lot of data upon which to do science, right?
So the foundations weren't there, i.e. data wasn't either collected or stored in a reliable manner or wasn't collected or stored at all. If it was most often the data was kind of crap and you'd have to kind of make do with that.
And so recovering data scientists, I think, was more of a reaction. It was sort of, so the etymology of it was, I think it was Dave Gonzalez and Ben Taylor, who live in Salt Lake with me, were calling each other kind of reformed or recovering data scientists back then, and the name just kind of stuck and we sort of went with that. But it was really the recognition that I felt like data science was just a very premature, but overhyped field that somehow was both the sexiest job of the 21st century, but perhaps the most idle job of the 2010s. So Matt, I don't know what your take on the recovering data scientist is, but...
Matt Housley: 00:08:13
Yeah, so when I met Joe, we started talking about our experiences and I had experiences very similar to Joe, perhaps even amplified a bit, in the sense that early, so I came from a background in mathematics, PhD in math, taught for a long time. And I was hired into this data science team by an executive that frankly was really struggling. And they were like, all right, we're creating this data science team. It's going to be magic. We're going to fix all this company's problems and we're just going to kill it because data science is amazing and you can do really cool stuff that we could never do before. And then as I got further into the job, I discovered that even though we had a lot of systems and we had a lot of data, we had Hadoop, we had a data warehouse, things just really weren't in place to make data science work.
And there were no systems to attempt to get data off of laptops, into production processes and to actually score models in real time, and that kind of thing. And the data often wasn't in the correct shape to even do any interesting data science on it. And so I gradually just kind of migrated and learned. At the time the whole world was going through a transition to cloud, which is still happening. So I just kind of migrated and learned cloud tools and learned data engineering and began to serve those use cases for other data scientists within that company to make their jobs easier. And so when Joe started talking about this notion of recovering data scientists, I'm like, "Yeah, that's very much me." And of course there's a bit of joking and trolling going on there. But the point is to just call this fact out, that we're not meaning to insult data scientists. It's more like you have to have good data engineering to create.
Joe Reis: 00:09:45
Started a group like data scientists anonymous. So but in all seriousness, it's one of those things, and increasingly we see this where we run into people who come from data science backgrounds. And inevitably it is very similar experiences. So at first we were kind of like, we felt like there were just two crazy guys, like screaming at the sky, which I suppose is still true. But we realized there were other crazy people also screaming at the sky. And so at least we know we're not alone.
Jon Krohn: 00:10:19
Nice. Yeah. So you've got your little DSA, your data scientists anonymous group. You guys can be shouting at the sky together and acknowledging that... luckily there is a cure. All right. And so data engineering is part of the cure. And so what kinds of skills or personalities distinguish a data engineer from a data scientist? Like why would somebody become a recovering data scientist and a data engineer, as opposed to just staying as a data scientist? And what do you think the value is in data scientists who still want to be data scientists, who will have that be their primary title, developing data engineering skills?
Joe Reis: 00:11:06
You want to take that Matt?
Matt Housley: 00:11:09
That's a very good question. And I'll try to answer all the parts of that the best I can and Joe can jump in as well. I think part of it is you have to identify what's going on in your company and what you're going to do about it. So in other words, if you find yourself in a situation like Joe and I have been in the past where you're hired as a data scientist to do magic, and these other pieces aren't in place, excuse me. You either have to advocate for creation of a data engineering org to deliver the data that you need to do your job, or you need to take on the job yourself. What I've generally found is that people who really love statistical approaches to problems maybe do better on the data science side.
And so from my academic world, my background is more in pure math and what's called representation theory, where you're writing a lot of proofs and things like that.
And many of the people in my cohort were working more in applied math and studying more like statistical problems or differential equation simulations and such. And so in general, I found that, just from that narrow slice of humanity that exists in math, a lot of my friends who came from the pure math side, who really liked writing proofs and like drilling into hard logic and such, have done well moving into data engineering, because it's kind of similar to that. It's not just building systems, but it's like problem solving and troubleshooting, and then coming up with concepts for logical flows that will fix problems. Whereas my friends more on the applied math and statistical side have really enjoyed data science and machine learning and such. It's not to say that there's no crossover, but that's kind of the personality profile I've seen.
Jon Krohn: 00:12:53
Eliminating unnecessary distractions is one of the central principles of my lifestyle. As such, I only subscribe to a handful of email newsletters, those that provide a massive signal-to-noise ratio. One of the very few that meet my strict criterion is the Data Science Insider. If you weren't aware of it already, the Data Science Insider is a 100% free newsletter that the SuperDataScience team creates and sends out every Friday. We pore over all of the news and identify the most important breakthroughs in the fields of data science, machine learning and artificial intelligence. The top five, simply five news items. The top five items are handpicked, the items that we're confident will be most relevant to your personal and professional growth. Each of the five articles is summarized into a standardized, easy-to-read format and then packed gently into a single email. This means that you don't have to go and read the whole article. You can read our summary and be up to speed on the latest and greatest data innovations in no time at all.
That said, if any items do particularly tickle your fancy, then you can click through and read the full article. This is what I do. I skim the Data Science Insider newsletter every week. Those items that are relevant to me, I read the summary in full. And if that signals to me that I should be digging into the full original piece, for example, to pore over figures, equations, code, or experimental methodology, I click through and dig deep. So if you'd like to get the best signal-to-noise ratio out there in data science, machine learning and AI news, subscribe to the Data Science Insider, which is completely free, no strings attached, at SuperDataScience.com/dsi. That's SuperDataScience.com/dsi. And now let's return to our amazing episode.
That's cool. I have not heard that before, but I can kind of imagine how that makes a lot of sense. Not having done the thought experiment very thoroughly on my side, but being one of these people who loves applications of math and statistics, and I gravitate towards data science. So I feel like we're on the money here.
Matt Housley: 00:15:14
I mean, there are a bunch of math people in this field. Go ahead, Joe.
Joe Reis: 00:15:18
I don't know what you saying?
Matt Housley: 00:15:20
Oh, I was just going to say there, there are tons and tons of math people in both of these fields I've found. I mean, at one point how many, we had like five math people at our company at one point, Joe, is that right?
Joe Reis: 00:15:29
Yeah. I think it was almost like an implicit requirement say so, but they just happen to grab it. I think nerds tend to attract nerds though. So it's this like math degree tractor beam that we sort of have, but so I'll follow up on what Matt said. I think that there's also, it depends... So for a data scientist to become a data engineer, I think you need to understand where you are in your career too, and the type of company you're in and their data maturity, right? So if you're a data scientist who's been hired by a company that's pretty low in the data maturity, and you're the only data person there either it's going to be you or the software engineer that ends up building the systems that will support data science.
Jon Krohn: 00:16:08
Right.
Joe Reis: 00:16:09
So if you're fortunate enough to have an engineer, the challenge with the software engineer is going to be, they have the engineering chops, but they don't understand data. You understand data, but you don't have the engineering chops. So you're going to have to figure out how you're going to split those roles and responsibilities, but ultimately for data science to level up, you're going to have to have some semblance of data engineering. And so that's where we see data scientists, I guess, making a decision of whether or not they like data engineering. I'd say it's a pretty 50/50 split, honestly. Like I think it was Matt pointed out, it just depends on your temperament and your background. If you tend to be the type of person who likes to tinker and you like deterministic outcomes, engineering is probably more suited for you.
If you come from an applied math background, like I have an applied math background. So stats is kind of my jam, used to be, I don't really do it much anymore. I understand it and machine learning and so forth, done that, and I like that aspect of it. If I had my choice, I mean, if I didn't get into data engineering, I'd probably still be doing it. I really love data. I love solving these types of problems, but I guess I like solving other types of problems too.
Jon Krohn: 00:17:23
Yeah. And it's cool that you guys mentioned earlier on, when you were explaining this recovering data scientist thing, you mentioned Ben Taylor, who's also out in Utah with you guys. And at this time we are running a series of Five-Minute Friday episodes featuring Ben Taylor.
Joe Reis: 00:17:40
Ooh.
Jon Krohn: 00:17:41
The last two Five-Minute Friday episodes, as well as the next two Five-Minute Friday episodes, will feature Ben Taylor answering specific targeted questions.
Joe Reis: 00:17:51
Very cool.
Jon Krohn: 00:17:52
Dig into his expertise. We absolutely love Ben on the show. He's been a proper guest on guest episodes about as much as anybody else has in the history of this program, really fascinating data scientist. Anyway, enough about Ben, you'll get enough of him in the Five-Minute Fridays. You guys have a book that has just come out. So the eBook version has been available for about a month or two. And the physical copies are starting to ship right about now, right about at the air date of this episode. So you should be able to grab it on Amazon or whatever your preferred retailer is. And O'Reilly, who's the publisher of the book, has very kindly given me three copies to give away. So look out on the morning that this episode comes out, Tuesday morning, I will post on LinkedIn that the episode is out. And the first three people that comment on my post about the episode and ask for a copy of the eBook will get one for free from O'Reilly.
So thanks very much for that. The book is called Fundamentals of Data Engineering and you two are the two co-authors of the book, Joe and Matt. Joe is holding up a copy of the coveted book right now. Somehow he's managed to get a physical copy before it's even available.
Joe Reis: 00:19:17
Don't know how that happened.
Jon Krohn: 00:19:21
So to introduce this book, we've got a great question from Leandro Mora, who asked me a question on Twitter. I mentioned on Twitter, as well as on LinkedIn, that we were going to be having you guys on the show and that I'd be interviewing you about the book. And Leandro asked this question. He said, "There are many books written about best practices for data engineering." One of his favorites is Designing Data Intensive Applications, which also is an O'Reilly book. So Leandro wants to know, "What new ideas are you guys going to expose in this new book?"
Joe Reis: 00:20:00
So I'd say my favorite book on data engineering up to this point has been Designing Data-Intensive Applications. I think it's a fantastic book. Martin Kleppmann, the author, was actually one of our technical reviewers. So shout-out to Martin for reviewing our book. I would say when we wrote the book, we did survey the entire landscape of data engineering books and what we found was, whereas I think there were some gems like Martin's book, which I think was very classic and stood the test of time, a lot of the books felt very ephemeral. Like it was data engineering on platform X, Y, or Z, or tech using language ABC or whatever. And the big question that Matt and I always had was, "Well, I think these books definitely serve a purpose in terms of teaching you tactics on particular platforms or technologies," but there really wasn't a book that we had found that taught you the thought processes and a strategy behind data engineering in a way that the book would remain fairly timeless over the years.
So the challenge was how do you write a book on a fast-moving field that will be relevant several years from now, right? And when we pitched the book idea to O'Reilly they said, "You guys are nuts. Like, why would you write a book like this? This is hard. Nobody's written this for a reason. It's not because it's a bad idea. It's because it's a very hard book to write." So Matt and I, being the gluttons for punishment that we are, said, "Okay, that's cool. So we'll come back to you with a book proposal, and flesh that out." And as far as we heard, once the book proposal was in, it was green-lit, I think in some of the fastest time that our editor had heard of at O'Reilly, because it was like, we really do need this book.
Jon Krohn: 00:21:51
That was Jess Haberman. The acquisition [inaudible 00:21:53].
Joe Reis: 00:21:53
Yeah, exactly. Yep. Yeah. Shout-out to Jess, awesome person. And so that's what I would say. This book is different, is that it, "The new ideas," Leandro asks, well, the big ideas behind the book are twofold. So the first big idea is a data engineering life cycle. So the path that data takes through a data engineer's hands and capacity and so forth, but you also have to think of what undercuts that life cycle. And so we're talking about things like data management, ops, orchestration, architecture, security and software engineering. So it's really, I think the big ideas are actually the most simple ideas, which have been hiding in plain sight the entire time but, I would say, haven't been clearly articulated up until this point. Matt, what do you think about Leandro's question?
Matt Housley: 00:22:43
Yeah. Yeah. I agree. And for this question, I'm actually going to loop back to the discussion earlier of the definition of data engineering. I think in our data cultural milieu, there was this weird, there were a lot of bad definitions of data engineering floating around for years and years and-
Joe Reis: 00:23:02
Oh geez, Matt, come on.
Matt Housley: 00:23:03
Yeah. So for example, I think when I got started in this world, the definition of data engineering is, "Oh, a data engineer is someone who works with Hadoop, and Pig and MapReduce and maybe this cool new thing called Spark." And then a few years later it was like, "Oh, a data engineer is someone who works with Spark and maybe Databricks or something like this." And none of that actually captures what data engineering really is about, which is about managing data flows and serving end users and end use cases. Those are technologies. Hadoop was useful in its day. Spark is pretty useful now, but those are tools. It's like saying a car is a clutch plus some pistons plus maybe an electrical system. That doesn't get to what a car is.
Joe Reis: 00:23:46
It gets even worse because it's talking about brand names. So a car is a set of Michelin tires or-
Matt Housley: 00:23:50
Right, right.
Joe Reis: 00:23:51
... or bring it back to data science, like data science. How would you feel, Jon, if we said data science is the use of TensorFlow with Keras and some Scikit-learn mixed in, on Nvidia GPUs? I mean, would that be a fitting description of data science?
Jon Krohn: 00:24:07
That's, that's a timeless definition, Joe.
Joe Reis: 00:24:10
Yeah. So that'll be our next book. Just kidding. But-
Matt Housley: 00:24:16
I'll loop back... Go ahead.
Joe Reis: 00:24:18
Oh, no, but it felt like the definitions, I think what Matt was saying, the definitions felt very naive. Like it was very superficial. And that's why I felt like we needed to address because the easy book to write would've been like data engineering with platform X, Y, or Z or language A, B and C.
Jon Krohn: 00:24:35
Right, yeah.
Matt Housley: 00:24:36
The Martin Kleppmann book, that's a fantastic book and I want to give him a shout-out right here. But what I'd say about that book is that it's intended for two audiences. So audience number one is FAANG engineers who are working on a very specific, at one point we call this primordial data engineering, which is they're actually working way down in the guts of tools. So you might have an engineer at Google who works on Colossus, their data storage system, and they need to understand clocks and synchronization and inconsistencies and such. And the second target audience for this book is people who maybe are working with Spark or Snowflake and don't need to worry about those issues day to day, but need to kind of understand them in the back of their mind.
And that doesn't, it's a... I think any data engineer should read it, but it doesn't really explain what the whole job is about. And so our goal with this book was to write a book that would kind of complement all the technical books that are out there, to say, take what you learn from Martin, take what you learn from Spark Fundamentals and then bring it into this bigger picture of how you can actually serve data science, machine learning, et cetera.
Joe Reis: 00:25:39
Yeah. I mean, Jon, to kind of bring it back to something you're probably familiar with. So, we both lift. And so I would say Designing Data-Intensive Applications is sort of the book about, like, how do muscles work? And like, how do energy systems work? ATP fires and does stuff and there's different energy systems, and that's, I think, really the equivalent of Designing Data-Intensive Applications. But it doesn't really get into, okay, so like lifting, how do you just go about being a lifter and getting better at that? That, I would say, is a really good analogy that maybe you might identify with personally because we're both meatheads.
Jon Krohn: 00:26:17
That's a lovely analogy, all of my meat loves it. So thank you so much for telling us about the audience because that was exactly what I was going to ask you next. Perfect. Do you want to give us a bit of a rundown of the topics that are covered? So we've talked about why this book exists. We've talked about the audience and then give us a rundown of the topics. And in particular, maybe give us a rundown of the topics that might be particularly interesting to people who want to be data scientists or want to stay data scientists. They're not thinking about becoming strictly data engineers, but they want to have some more data engineering under their belt.
Matt Housley: 00:26:58
Yeah. So I would say again, our core audience for this book is data engineers. But the point is that as a data scientist, if you read this book, it's kind of like understanding where your food is coming from or something like that. And then of course not just understanding, oh, my food is farm to table and this restaurant tells me where it comes from, but also how to communicate back to your data engineering stakeholders about your needs. Or in some cases we see this often in a small company you're hired as a data scientist and you actually kind of have to build the data engineering team. In other words, you have to help the company make that first hire to say, "Hey, we don't have people to build these pipelines so I can do my job. So this is the kind of person we need to look for to accomplish that task."
And so we do feel like maybe a data scientist isn't going to be reading the book as deeply in certain technical areas, but they'll get a really nice picture of what data engineering is all about and why they should care and how it's complementary and how they can work together with data engineers.
Jon Krohn: 00:27:53
Super cool. Sounds like an invaluable book for a lot of data science listeners out there. So yeah, check it out: Fundamentals of Data Engineering. To get into some specific questions related to content in your book, hat tip to Serg Masis, our researcher on the show, whom you guys both know. So one of the questions that he brought up is that in your book, you guys discuss the fundamental technologies and methodologies used in the data engineering life cycle. Can you walk us through the major undercurrents, what you guys call the undercurrents, across this data engineering life cycle?
Joe Reis: 00:28:33
Sure. I'll take half and then Matt can take the other half.
Matt Housley: 00:28:36
Okay.
Joe Reis: 00:28:36
So I think there's what six or seven of them or something like that. But yeah, I mean the first one is security. So security is one of these things where we feel like if you can't get security right, nothing else really matters. If you have a data breach, this kneecaps you in such a way that it's going to be hard for you to get trust or recover from it. And so security is the number one undercurrent and a lot of security ends up being behavioral, as you find out, when you read about hacks, for example, right. Hacks occur for a lot of reasons or breaches occur, but it's not because you have awesome hackers it's because people get sloppy.
They might unintentionally unveil passwords in a phishing scheme or leave servers or object storage buckets open to the public among a lot of other things. But these are just really simple things that we feel like people, you know, data engineers, data scientists should just keep top of mind. Security is the first undercurrent. Again, it's not really a technology issue so much as it is a behavioral and people issue, but we do think that's the first one.
The second one would be data management. So this is a giant topic actually. So we got the idea of data management as being an undercurrent as we were going through the book, we're kind of like, "Okay, so where would data governance fit in?" And data stewardship and all this other stuff. And like all these topics that data management encompasses, if you want a really good understanding of data management, you should just go read the DAMA Data Management Body of Knowledge.
It's a giant tome of every conceivable aspect of data management, which somehow ends up encompassing every conceivable aspect of data, but we took the best parts. So data management is just one of those things where we feel like that's definitely an undercurrent; governing your data should be front and center, among a lot of other aspects of it. But then this brings us to another undercurrent, DataOps, right? So DataOps is observing, monitoring and being able to respond to incidents in your data. So this is the third undercurrent. Matt, do you want to go over the other ones?
Matt Housley: 00:30:54
Yeah. So the next one is data architecture. And even if you're coming from a data science background that one's probably pretty obvious that it should be there somewhere. And the thing I'll say about data architecture is that it's not just technology. It's about a lot of other things. And so we obviously go deeper on that in the book and orchestration is next. And the interesting thing about orchestration is that not only does it cut across all stages of the data engineering life cycle, but it kind of cuts into the data science side too. So in other words, what does orchestration-
Joe Reis: 00:31:26
Yeah,
Matt Housley: 00:31:26
Go ahead.
Joe Reis: 00:31:26
What does it mean?
Matt Housley: 00:31:27
Yeah. Yeah. So basically it means, think of being on the New York subway system and you have all these switches that have to route trains the right way, and you have to make sure that they don't collide. And there's all kinds of management of asynchronous processes going on to make sure that trains can get where they need to go and that they don't run into each other. People don't get hurt, people don't get killed. And orchestration for data is kind of the same thing. Like it's basically a switchyard manager that says, "Okay, first I need to ingest the data. Then I need to post-process it. Then actually I have this other data over here I'm ingesting. Once data is ingested from those two sources plus another source, then I'm going to join those sources together." And then, cutting over kind of to the data science side, "Okay, once those three data sets have been joined together, I'm going to trigger the training of a model."
And to understand the importance of orchestration, I mean, orchestration systems have been around for a long time, but you've got like Airflow, Prefect, Dagster, Coalesce, dbt, and so on. You have so many different orchestration systems in the marketplace right now. And that's just been driven by the fact that no matter how good your data processing technology is, it can't really be very effective or support data science very well unless you keep all these different systems in sync.
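[Editor's illustration] The switchyard idea Matt describes can be sketched as a dependency graph: each task lists the tasks that must finish before it can start, and the orchestrator derives a safe run order. This is only a minimal sketch using Python's standard-library graphlib, not how Airflow, Prefect, or Dagster are actually configured. The task names are hypothetical and simply mirror the ingest, post-process, join, and train steps from the conversation.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each task maps to the set of tasks it depends on.
pipeline = {
    "ingest_source_a": set(),
    "ingest_source_b": set(),
    "ingest_source_c": set(),
    "post_process_a": {"ingest_source_a"},
    "join_sources": {"post_process_a", "ingest_source_b", "ingest_source_c"},
    "train_model": {"join_sources"},
}


def run_order(dag):
    """Return one valid execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
for task in order:
    # A real orchestrator would launch the job here and track its state.
    print("running", task)
```

Real orchestration systems add scheduling, retries, and monitoring on top of this ordering, but the core contract is the same: the model training step never starts until the joined data set exists.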
Joe Reis: 00:32:45
Cool. And it's also software engineering. Right.
Matt Housley: 00:32:47
And then software engineering is the last one. Yeah. So the point is on that undercurrent as a data engineer, no matter what tools you're using, you're going to have to do a lot of software engineering. Cutting across many different systems and many different capabilities.
Joe Reis: 00:33:01
So you notice that we actually kind of sidestep Serg's answer or question as well. He asked for technologies. But I think the whole point behind the book is that it's meant to be technology agnostic. Like we feel like these are very much the things that do underlie the data engineering life cycle, but at the same time, technologies come and go, right?
Jon Krohn: 00:33:23
Yeah, definitely.
Joe Reis: 00:33:23
This is a fact.
Jon Krohn: 00:33:24
Yeah. His question was just to go through these six undercurrents.
Joe Reis: 00:33:30
Okay.
Jon Krohn: 00:33:31
And he mentions that in the book, you guys do cover the fundamental technologies and methodologies that are used in it today.
Joe Reis: 00:33:37
Yeah. I mean, there are types of technologies I would say, so data warehouses, for example, are very common for storing data and being able to query, same with data lakes and data lakehouses and so forth, various data pipelines, orchestration frameworks, and so forth. But yeah, it's interesting because there's fundamental technologies that I would say are sort of intertwined with fundamental practices.
Jon Krohn: 00:34:04
Cool.
Matt Housley: 00:34:05
And adding to that, the one thing we emphasize is that we kind of give a snapshot of the technology picture today, but new technologies come out every month. And so you kind of have to slot those into the framework as they arrive and assess accordingly.
Jon Krohn: 00:34:20
Yeah. Very cool. All right. So another one related to content in your book and also kind of related to the idea of a current or a stream, I guess that is a kind of an easy analogy to understand with lots of data flows and data engineering. So in the book you mentioned, "How a data engineer must constantly communicate with downstream stakeholders." So why is that so important? Can you elaborate on that?
Matt Housley: 00:34:50
I think a lot of what we've seen is that actually this goes against the grain of how most companies are set up. And so typically what happens is application developers create data schemas in transactional databases to support their application. And they might use an ORM or something, so that they're pretty hands off on what the schema even looks like. And then that kind of gets shipped downstream to say, "Hey, data engineers, take care of this mess that I created and make some sense of it." And I think increasingly what we're seeing now is that there's an interest in integrating the application development layer, the application development kind of vertical, with the data engineering side of things. And ideally there's bidirectional communication for data engineers to say, this is what I need from you guys on the application side.
And then the data engineers can then feed that data back to the application developers to provide capabilities like embedded analytics, to have realtime dashboards for their users, for example, to have dashboards that show what's going on for that company in this SaaS platform.
And so that's an organizational evolution that we really want to push. Like we just feel like those teams should not be isolated from each other. And that data engineering is actually an application developer's problem too, even though it's not their role. That bidirectional communication.
Joe Reis: 00:36:10
But wasn't the question about downstream stakeholders with data [inaudible 00:36:13]
Matt Housley: 00:36:13
Was about downstream, yeah-
Joe Reis: 00:36:14
Yeah.
Matt Housley: 00:36:14
... definitely, but I'm thinking of it as bidirectional.
Joe Reis: 00:36:16
Yeah. So I think what Matt does highlight, though, is something that has traditionally been the case with... Engineers do lob things over the wall, but then data engineers may take that same tendency and lob it over the wall to data scientists and so forth.
Matt Housley: 00:36:29
That's also true. Yep.
Joe Reis: 00:36:31
And so what I do think is data scientists and analysts could make really good data engineers because they understand what the outcome should be. Right.
Matt Housley: 00:36:42
Right.
Joe Reis: 00:36:42
So if you know what a machine learning model should look like, or a report should look like, this is very advantageous. Same way that Matt described with the software engineer, for example, if you don't know what the outcome is with the data that you're providing, you have no empathy. And so you're just throwing things over the wall and saying, "Deal with it, it's your problem." And this is super, super common. In much the same way, if a data engineer doesn't understand what the data scientist or analyst needs at the end of the day, these downstream stakeholders, you're simply doing what you think is your job, but I would really question, are you doing your job or are you simply going through the motions of what feels convenient or might be sort of the cultural norm of your company?
If a cultural norm is to throw things over the wall, that's what you're going to do. This happens a lot of places. This is sort of, I would say the default, because it's just like, "Not my problem, not my job."
So that's why I think communicating with downstream stakeholders and upstream stakeholders, to Matt's point, these are both insanely important. And we cover this in the book. In each stage of the data engineering life cycle, we talk about who you work with, right? And we give advice for how you should talk to people. Talking to people and communication, these are two of the areas where we think this is like the most underutilized tool in a data engineer's toolbox, right? It's easy to stand up infrastructure. It's easy to read books like ours. The hard part is to go and actually like talk to somebody, understand what they need, understand how they can help you, let them understand how you can help them as well, and so forth.
That sort of communication I think, is like super underrated. And I would say like 90 plus percent of the problems I see with data teams and their ability to get stuff done comes down to communication or lack thereof with other stakeholders, whether they're upstream or downstream.
Jon Krohn: 00:38:43
Beautiful. That was a really great answer. And we've got one last question about specific content in the book. And so this is about something that is throughout the book. So throughout Fundamentals of Data Engineering, you guys hammer home the importance of latency of focusing on latency. So why is latency such a central concern for a data engineer?
Matt Housley: 00:39:09
Yeah. It's not just a concern. It's also, I think the big theme would be understanding trade-offs around latency. And one of the mistakes we see rookie data engineers make when they're migrating from software development is using a transactional database to try to solve data engineering problems. It can work okay, but the problem is that a transactional database is designed to deliver extremely low latency on transactions. It's very, very good at that. It's not as good at scanning large amounts of data. And so the compromise you make is you move to like a columnar database, for example, where updates of data are slow, and typically queries take a while to run. They might take like half a second or a second to run, but now you can efficiently scan like a petabyte of data instead of looking efficiently at just a small amount of data. And that's just one example. I think the point is that as a data engineer, you're always thinking about what your latency compromises are so that you can essentially optimize around the CAP theorem to get the kind of result that you need.
And the same would apply if you're working with streaming technologies, the same would apply if you need true near real time. What are your latency trade-offs in the different parts of the pipeline? What do you think, Joe?
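[Editor's illustration] The row-versus-column trade-off Matt describes can be shown with a toy in-memory example. This is only a sketch under simplifying assumptions: the table, its field names, and the helper functions are all hypothetical, and real transactional and columnar databases add indexing, compression, and disk layout concerns that are not modeled here.

```python
# Row-oriented layout: each record is stored together,
# which suits reading or updating one record at a time.
rows = [
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "b", "amount": 25.5},
    {"order_id": 3, "customer": "a", "amount": 4.5},
]

# Column-oriented layout: each column is stored together,
# which suits scanning one field across many records.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [10.0, 25.5, 4.5],
}


def total_amount_row_store(table):
    # Every full record is visited even though only one field is needed.
    return sum(record["amount"] for record in table)


def total_amount_column_store(table):
    # Only the "amount" column is touched.
    return sum(table["amount"])


def update_amount_row_store(table, index, value):
    # A single-record update is one localized write.
    table[index]["amount"] = value


def append_record_column_store(table, record):
    # One logical insert fans out into a separate write per column.
    for name, value in record.items():
        table[name].append(value)
```

Both layouts return the same total, but the column layout reads far less data for the aggregate, while the row layout keeps single-record writes cheap. That is the latency compromise in miniature.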
Joe Reis: 00:40:28
Yeah, I think it's good. I mean, there's definitely a lot of, I would say, mistakes of omission as opposed to mistakes of commission. Just simply not knowing how to use proper tools, and that contributes to latency. But the broader question, okay, so why is latency important? Why is low latency important across the data engineering life cycle? It's because you want to reduce time to value, right? You always want to assess the trade-offs of, am I able to make... is the downstream stakeholder able to either make an automation or an analytical decision faster and better than they were if I did something else. And so it really starts at the place where data is generated and goes all the way to the time data is delivered. That should be as low of latency as possible, unless there's some compelling explanation as to why that shouldn't be, and there might be. There might be a reason why you want to actually have larger delays, but the default assumption should be, you want to have things happen as quickly as possible.
This will tie into some other things we'll talk about in a bit, but you know it's just data. Data is one of these things where if you can capitalize on it from the point of conception to use in a much faster way, why would you not do that? That would seem like kind of a waste of time, no pun intended.
Jon Krohn: 00:41:57
Excellent. Yeah. Yeah. That makes sense. Makes perfect sense. So we need to be aware of latency trade-offs for the given pipeline that we're building, but generally speaking faster is going to be better as a default assumption.
Joe Reis: 00:42:11
That doesn't mean you have to do dumber things faster, though, which does happen as well.
Jon Krohn: 00:42:17
Right.
Matt Housley: 00:42:18
Yeah. And fast depends on the context. So like for an analytic stakeholder, 10 seconds, 3 seconds might be really, really fast, whereas in an application where you need to do transactional updates over a hundred milliseconds might be really, really slow.
Joe Reis: 00:42:33
Yeah. I guess it also depends on like the question you should always ask yourself is what action am I going to take if I get data faster? So like Matt points out sometimes it's 10 seconds. Sometimes, maybe it might be 10 days frankly, if it's a report, it's just a historical report, but it really comes down to the use case. But again, our default assumption is it should be speed, fast is good, slow kills.
Matt Housley: 00:42:57
So we'll sometimes have clients, though, come to us saying, "Hey, we need real-time." And it's like, "Okay. It seems kind of weird for your business. Like, what do you mean by real-time?" They're like, "Oh, two days," is real-time in some cases. You have to understand the requirements, and like Joe said, optimize to go fast, if you can.
Jon Krohn: 00:43:14
Nice. So we will get to your consulting practice in just a moment. But before we get there, I know that the focus of this book, as we've hammered on a number of times in this episode, is not on specific tools or techniques, it's about general data engineering principles. And that's what made writing Fundamentals of Data Engineering such a tricky task. That said, at this moment, are there particular data engineering tools or techniques that you recommend our data science listeners check out first or focus on?
Joe Reis: 00:43:48
I would say, I'll cover tools and Matt, you can cover techniques. You going to rochambeau on this one?
Matt Housley: 00:43:52
Yeah.
Joe Reis: 00:43:53
So tools, there's a lot of great tools out there, fairly mainstream at this point. Snowflake is one, I would say, data scientists would have a heyday on. Databricks is another great platform you can get a lot of leverage on. And then there's obviously, each of the clouds has their own ecosystems that they're trying to get data scientists hooked on. So AWS has SageMaker, it's a fantastic, fantastic ecosystem. Same with GCP's Vertex AI and Azure ML. I mean, all these are great, great ecosystems. And so I think it really comes down to, if you're working at a company, what have they adopted? That's the first thing. If it's an Azure shop, I think your fate's already been sealed. You're going to be doing Azure.
So you could try and lobby for something else, but you've got to have a good reason. But really at the end of the day, a lot of these platforms are fairly similar. They have their own sort of data science life cycle, for example, and so I would say get familiar with those types of tools. I mean, those are the five big ones, I would say, again Snowflake, Databricks, Azure, GCP, AWS. And all of them have their trade-offs. I would say, just get familiar with whatever ecosystem you're in and get to know the tools.
And it's not to say you have to stick with like the cloud's managed tools or open source, just pick what works best. I mentioned open source, there's no shortage of that. The Apache ecosystem, for example, has like exploded in the past several years. So I don't even know how many Apache projects there are right now. There's probably more projects on Apache than there are like atoms in the universe at this point.
But I could say the same thing about data startups, right? So if you look at Matt Turck's Data Landscape slide, which we intentionally put in the book in sort of humorous fashion, because it's just like a blur of a box, because there's so many micro logos or nano logos at this point. But the whole point of that is like, there's again, more data startups in the universe than there are probably like atoms in, you know, a couple multiverses. And so it's like... so there's no shortage of tools.
And this is one of the chapters we talk about actually is chapter four talks about how to choose technologies. I would think this is actually one of my favorite chapters because it goes through a lot of the ways you should think about assessing what tools you're going to pick, whether it's for data engineering or for data science or even software applications. So as far as practices, Matt, what do you think?
Matt Housley: 00:46:28
Well, let me follow up what Joe was saying about like trying to orient yourself towards cloud-based technologies. And one of the things we talk a lot about more broadly is the utility of having data scientists, migrate things they do on their laptops, into a cloud environment. So there are a couple different reasons for that. So if you use a common notebook server, that means that you are way, way more streamlined in terms of working together as a team, we see data scientists work in isolation so much because their work just sits on a laptop. And it's very, very hard to share.
Joe Reis: 00:46:59
What are the dangers of working on the laptop, Matt?
Matt Housley: 00:47:01
I think we've seen this many times, like code getting deleted or data that's been transformed. You know, someone who's done a bunch of hand transformations and that gets deleted.
Joe Reis: 00:47:12
And dependency management too.
Matt Housley: 00:47:13
Dependency Management is a huge [inaudible 00:47:15]
Joe Reis: 00:47:15
Just because it works on your laptop doesn't mean it's going to work in, you know, the cloud, so.
Matt Housley: 00:47:19
Yeah. And yeah, exactly. And so there's this collaboration aspect. And then also it just puts you closer to the data engineers and the ML engineers too. It's just easier to move things into production if you're already doing your work in a cloud notebook environment. It doesn't mean they won't have to do work to transform what you're doing into something that's appropriate, but it's a lot closer. It's an easier process. Another, not really a technology, but a language. And I'd like your opinion on this as well, Jon, but like SQL, we see a lot of weakness in people's SQL chops. And there was kind of, during the big data Hadoop era, a certain disdain for SQL that developed. And we advocate a lot for the utility of SQL even for data engineers. And it's not to say that it's the end-all be-all technology, but it lets both data engineers and data scientists move really, really fast on certain types of problems.
So for example, simple things like just filtering down data. So you can drill down on particular keys. SQL can do that really, really fast if you're good at it. In terms of practices, Joe, let's talk about the organizational practices and go back to that. Just better communication between data scientists-
Joe Reis: 00:48:31
Right.
Matt Housley: 00:48:31
... and data engineers, ML engineers as well, to streamline processes. I think sometimes in our industry there's not enough emphasis on organizational aspects of the job and how it can make you a lot better at the job and better actually able to leverage the technologies to do something interesting.
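[Editor's illustration] Matt's earlier point about using SQL to drill down on particular keys can be sketched with Python's built-in sqlite3 module. The events table, its columns, and the events_for_users helper are hypothetical and chosen only for illustration. The same WHERE ... IN pattern applies on a cloud warehouse, though connection details differ.

```python
import sqlite3

# In-memory database with a small hypothetical events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "purchase", 20.0),
        (2, "view", 0.0),
        (1, "view", 0.0),
        (3, "purchase", 5.0),
    ],
)


def events_for_users(connection, user_ids):
    """Filter the table down to the given keys using a parameterized IN clause."""
    placeholders = ",".join("?" for _ in user_ids)
    query = (
        "SELECT user_id, event_type, amount FROM events "
        f"WHERE user_id IN ({placeholders}) "
        "ORDER BY user_id, event_type"
    )
    return connection.execute(query, list(user_ids)).fetchall()


result = events_for_users(conn, [1, 3])
print(result)
```

The filter runs inside the database engine, so only the matching rows come back to the notebook instead of the whole table.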
Jon Krohn: 00:48:51
Nice, super cool. I love that rundown of tools and techniques for us. So we've got cloud-based tools like Snowflake, Databricks, AWS's SageMaker, GCP's Vertex AI. SQL was emphasized there near the end. And then some key techniques would be things like making sure that you are using cloud as opposed to local development. Even if you are doing model development in your Jupyter notebook, you can be doing that in the cloud instead of doing it locally. And then there's the emphasis, as earlier in the episode, on cross-functional communication.
Matt Housley: 00:49:26
Yeah.
Love it.
Joe Reis: 00:49:27
And I'll add just a quick minute too, with the tools, I mean. There's a whole other like parallel universe with machine learning tools. So MLOps, ML engineering, that's like this parallel thread going on right now with data engineering. So in that realm, I mean, the sky's the limit in terms of whatever technologies you want to do. So what I tried to do in that point was just illustrate, okay, the commonalities that a data engineer might overlap with, but it's very possible you might be going off and doing like Metaflow or something like that too. So I would say for data scientists, pay attention to what's happening in ML engineering as well. Like that's a very exciting field that I would say is even less fully baked than data engineering.
Jon Krohn: 00:50:08
Yeah. So this is something that we have talked about on the show before I've asked various people to define data engineering versus machine learning engineering. Do you guys want to take a quick crack at that?
Joe Reis: 00:50:18
Yeah, and I think really, as we describe data engineering, it has its own life cycle really from getting data from source systems, making it available to scientists and probably machine learning engineers. My version of this and I'd love Matt's version of this too, but machine learning engineers really pick up where the data engineers leave off, right? So data engineering-
Jon Krohn: 00:50:37
Exactly.
Joe Reis: 00:50:38
... has a very similar life cycle, but it's a different life cycle. And it has a different set of requirements. Everything from training models to storing features, observing models, retraining models, and so forth. I mean, that's its own workflow. I would say separate, but we do make an argument in the final chapter of the book that these practices of data engineering and machine learning engineering and software engineering in fact may actually be on a collision course, which we can talk about in a bit. But Matt, what do you think about ML engineering?
Matt Housley: 00:51:11
One comment I have is that I think there's kind of some competition right now going on between data engineers and ML engineers. And sometimes we see in companies a lot of repeated work because there's inadequate integration between these teams. So kind of like Joe was saying, I think our vision is that the ML engineers sort of take over where this core data engineering leaves off, but by working together, these teams can be much, much more efficient. Because there's this big area of overlap where it's like, okay, data processing to featurize my data. Well, that's something that either data engineers or ML engineers can do, so they should be coordinating on that kind of process.
Jon Krohn: 00:51:50
Right. Cool, great answer. What's this collision course? I want to hear about that.
Joe Reis: 00:51:59
So this comes back to, it's something we talk about in the book called the live data stack, which is so we kind of, we talk... We took a stab at what's next, really after the modern data stack and the postmodern data stack and whatever else, but it's really the recognition that applications and third party data sources are obviously where data originates. But coupling applications with more realtime data generation processing and serving, the feedback loop is just going to get a lot tighter. And so the question really is... Okay, so for questions that are of a what and when nature, for example, so like what happened maybe on this date, and in large part, these types of questions can be automated.
If I ask you what type of an action are you going to take on a, what type of a question?
Well, and now let's say that's happening in real time. I could make a very strong argument that these should be responded to either by code or machine learning, not necessarily in a report where you take a manual action, especially when they're happening in large volume. Same thing with when questions. And this really frees up analysts to focus on why type of questions, like causal type things. That's where your domain expertise comes in. But so the whole notion is getting rid of this tedium. But then, okay, so now automations can either happen heuristically or they could happen through machine learning models. And so this is where... Because there's such a tight overlap and such a tight feedback loop, the question really is, where does the boundary between a software engineer and the application end and the machine learning and data engineers, like where does that begin and end?
I think Matt sort of alluded to this when he is talking about these separation of concerns and this competition, but when things happen much, much faster in greater volume, it amplifies that question to a much greater degree.
And so I really do... we actually question, is it possible data engineering is actually called something else or turns into something else? I think we're very open to this question. We're not tied to any religious argument about much of anything. It's entirely... we've seen this happen repeatedly in technology, and so I wouldn't say this be the... There's a lot of precedent for this happening, but if you look at the trends of where things are going, and if you kind of zoom out 10 years, I can make a pretty strong argument that the lines between ML and application software engineers are incredibly blurred. Like ML is part of the application. There is no artificial distinction, and same with analytics, these are all just part of the same thing.
Matt Housley: 00:54:38
Yeah. Yeah. It's not clear that this is necessarily going to happen everywhere, but it's going to become much, much more common, I think. In going back to the integration discussion where, like Joe said, the application is just very much integrated with its analytics downstream. And this schema is designed to support both an application and the embedded analytics. And you could probably argue that this has already happened at places like Uber, where those analytics directly feed the application experience. So they can tell you, for example, how long it's going to take your car to get there, or how many cars are around you.
Joe Reis: 00:55:11
Right. [inaudible 00:55:12] You're hearing this the whole... Yeah exactly, like you're hearing about data apps, right? This is-
Matt Housley: 00:55:15
Exactly.
Joe Reis: 00:55:16
... new hot buzzword that's, in reality, been around for a long time, but data apps are going to be front and center. And so what I think it's going to do, though, is like completely shape or completely like shake up sort of how we view everything right now. Data science, data engineering, I think it's all up for speculation, say, in the next five, 10 years. Because again, the world is moving towards a real-time paradigm and like fast, low latency analytics. And that's just the nature of the beast.
Jon Krohn: 00:55:44
Very cool. All right. Well that ended up digging into, I didn't know why you were so sure that the collision course question was going to come up later, but now I see it's because you knew I was going to ask you about the live data stack, but fine to have had the answer now on both the live data stack and the future of engineering. Let us now turn to your consulting, which has been alluded to throughout this episode. So a number of times you guys have talked about things that you've seen at companies, as you try to help them with data engineering, and that experience comes from your consulting firm, Ternary Data. So that name there, for listeners, it's like primary, secondary, ternary. It's amazing how often primary and secondary get used. And you don't hear ternary that often, but it's just the 1, 2, 3.
Joe Reis: 00:56:33
Yeah. Yeah. We, we have some, some people like partners will introduce us to customers as Tyranny Data.
Matt Housley: 00:56:38
That was a danger of that name we didn't realize at first.
Joe Reis: 00:56:41
Yeah. We didn't foresee-
Matt Housley: 00:56:42
Yeah.
Joe Reis: 00:56:44
... that part. But anyway, yeah, we are not Tyranny Data, we are called Ternary Data. So yes. Thank you. Yep.
Jon Krohn: 00:56:51
And so Joe is the CEO. Matt is CTO, and you guys are co-founders together and you guys are consulting on data architecture as well as data engineering. Do you have a couple of interesting case studies? We've kind of had general allusions in this episode to specific kinds of, sorry, to general kinds of things that you see across clients kind of recurringly, but are there any kind of illuminating case studies that describe why working with a data architecture and data engineering consulting firm can make a big difference to a company?
Joe Reis: 00:57:29
I mean, I think there's a couple threads to this. One is, it's amazing how many companies, I think, fall into the same habits, whether or not they even realize it. And so it's interesting, so architectures tend to be very much copy and paste, it seems. Like they may read like a couple articles and say, "Oh yeah, that's the architecture we need." And so I think it's interesting in that regard where people, I think, have taken the same playbook from the first page of Google and sort of implemented that as their architecture. The other piece-
Jon Krohn: 00:58:11
First page of Google search results, you mean?
Joe Reis: 00:58:13
Search with Google, yeah. Google.com, right.
Jon Krohn: 00:58:17
Page one of Google.
Joe Reis: 00:58:18
Page one of Google out of 50 billion pages. But so that's interesting to see, we see a lot of commonalities there. The other area where we see a lot of commonalities is, and this is interesting. So data teams are fascinating to watch. What we see is either data science or data engineering, for example, there is not really, there's some efforts to correct this, but there's not really like a playbook for a good data team, for example, or what a data team needs to know. Right. So a data science team, how do you pick the players on there and how do they get up to speed in a common universal way where they can be effective. Now I kind of use the analogy of, say that we wanted to go, I don't know, start SEAL Team Six or something like that.
We're going to go on a mission, but the question is, we just pick a bunch of like random people off the street, like you could, "You know how to use a grenade launcher maybe," or something like that.
But that's kind of like how I think a lot of data science teams and data engineering teams are formed right now. It's like, oh you know databases, that's cool, you can join the team over here. And what we see over and over is that it's a lack of a really good knowledge foundation and a skills foundation. So that's one area where we... So what we specialize in is we come in and we help coach and train and advise data teams. So we're not a body shop. We don't send in like full people to camp out in your office for the next five years. And that kind of stuff. We really want to take the approach of training and enabling your team to be like the best version of itself.
Jon Krohn: 00:59:57
Very cool.
Matt Housley: 00:59:58
And I'll add, I think where we're at our best is where we can assist with some technology changes with the technology migration and then some team reorganization as well. And so inside of some fairly large retailers, we've assisted with like a migration from a very restricted on-prem data warehouse system to much more flexible cloud-based data architecture. And those technology changes can actually support team organization changes to where you're going from kind of an ETL very traditional corporate paradigm of how data is transformed to a much more flexible, modern data engineering oriented paradigm, where the data engineers can become integrated with the data science team to support their applications. And so those experiences have been very exciting.
In smaller startups, what we've often seen is that startups tend to be driven by people from the application side. And so you sometimes end up with Frankenstein's monster architecture to support analytics. People running analytics on transactional databases and ETL jobs or pipeline jobs take like six hours to run. And so in those cases, often we can come in and explain some basic principles of how to migrate data engineering into appropriate systems. How to reorganize your team, so that suddenly now those same jobs are taking five minutes and you can really think about scaling up data engineering and therefore data science that runs on top of data engineering. So like Joe said, we'd love to come in and more, not do the work so much, but like support teams in doing the work, help them to reorganize and help them to move to new technology.
Joe Reis: 01:01:40
But it's a recognition too, that teams really want to-
Matt Housley: 01:01:43
Yes.
Joe Reis: 01:01:43
... learn this stuff and feel like the rock stars. And so I think it was early on, we were so small that we just realized, like the jujitsu move that we could do is coming in and instead of us doing the work for you and potentially having you hate our guts, because we're doing your job for you, why don't we just teach you how to kill it with this technology and then make you look like the hero at the end of the day. And people love that. They like us because now we're advising them and helping them and making them better. And in general, this I think is just a much cleaner and better approach. We feel good about the work we do, customers like it and devs love it. And so it's great.
Jon Krohn: 01:02:23
Nice. Sounds like a really cool model. A specific question for you related to something that you said in a YouTube video, Joe, that we will include in the show notes. You mentioned in that video that companies have been trying AI when they're barely doing even much simpler analytics like business intelligence. So companies want to put the cart before the horse very often with you. They may not have their data organized in any way, they may not have, as you mentioned, cloud systems set up, they might just have information on individuals' laptops, and they come to you and say, "Hey, we want to be an AI company." So what do you do in those kinds of situations?
Joe Reis: 01:03:05
Yeah, I think the reference was most companies are barely doing BI let alone AI. And that was recorded a couple years ago, and I still think that it holds true. Everything that we've seen still indicates that. I think the adoption of machine learning is increasing for sure. And I think the real life use cases of machine learning are slowly increasing, but at the same time, it's interesting. I think there's a lot of misconceptions about AI. I personally try and remove those two letters from my vocabulary as much as possible, unless we're literally talking about AI and I think there's a very definite use case or definition for it. But in a lot of cases, it's being sold as this panacea and this sort of magical fairy dust that you can just magically transform your company to an AI company.
And an AI company, I suppose if we're going to agree on definitions is just a company that has AI running everything. I don't know if it's Skynet or what but it's AI powered at the end of the day.
But I guess if you tried to do this, if you tried to do machine learning in production and try and actually make these things happen, you know how hard it is. And like you need data and this is where I see a lot of companies fall short, like let's do AI. You don't even know anything about your data. There's no analysis. Data probably doesn't even exist. But the prerequisite, even Google will tell you this too, it was like, do analytics first, before you start doing machine learning. You got to know your data. You don't just blindly jump in.
And I know what informed this was I've worked at a couple automated machine learning companies as an engineer. And so I know firsthand how hard it is of solving the problem of give me a data set and I'll give you some predictions. I think it works really well on unstructured data where maybe there's a canned kind of result where you're just doing object identification. But if you're talking about structured data rows and columns, like this is an insanely hard problem to solve-
Jon Krohn: 01:05:07
It is.
Joe Reis: 01:05:07
... really hard.
Matt Housley: 01:05:09
I'll follow that up. Last week, we had a conversation with Josh Tobin and he was talking about this process of going from a draft machine learning model, all the way to a high quality production machine learning model. And he advocates for this idea that your first draft is just business logic. It might be in SQL, it might be in Python, but it's just business logic. It's like things that you kind of already know about your target audience or whatever you're trying to optimize for. And then you let that run and then you take the feedback from that to train your next generation of model and so on. And so the point is, if you can't even do that step one, then you're not going to have a lot of success in delivering really high performance models down the line. I'm curious to hear your thoughts on that, Jon, go ahead.
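[Editor's sketch: the progression Matt describes, where a first draft that is just business logic runs, produces feedback, and that feedback trains the next generation of model, might look roughly like the following in Python. Everything here is hypothetical and for illustration only: the customer fields, the three-visit threshold, and the per-segment conversion rate standing in for a "model" are assumptions, not anything described on the show.]

```python
from collections import defaultdict


def rule_based_offer(customer):
    """Draft 'model': plain business logic the team already believes.

    The 3-visit threshold is a made-up rule for illustration.
    """
    return customer["visits_last_30d"] >= 3


def collect_feedback(customers, decide):
    """Run the current decision rule and log the outcome for each customer it selects."""
    log = []
    for c in customers:
        if decide(c):
            log.append((c["segment"], c["converted"]))
    return log


def fit_segment_rates(log):
    """Next-generation 'model': conversion rate per segment, learned from the log."""
    shown = defaultdict(int)
    converted = defaultdict(int)
    for segment, did_convert in log:
        shown[segment] += 1
        converted[segment] += int(did_convert)
    return {segment: converted[segment] / shown[segment] for segment in shown}


# Hypothetical data; "converted" stands in for an outcome observed after the offer.
customers = [
    {"segment": "new", "visits_last_30d": 5, "converted": True},
    {"segment": "new", "visits_last_30d": 4, "converted": False},
    {"segment": "returning", "visits_last_30d": 3, "converted": True},
    {"segment": "returning", "visits_last_30d": 1, "converted": True},
]

feedback = collect_feedback(customers, rule_based_offer)
rates = fit_segment_rates(feedback)
print(rates)
```

[The point of the sketch is only the ordering: without the logged feedback from step one, there is nothing for the later, learned step to train on.]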
Joe Reis: 01:05:49
Oh, Jon, yeah, I'd love to hear your thoughts too, actually.
Jon Krohn: 01:05:55
So there was a story that I told you guys before we started recording about, I did some consulting at a hedge fund where the most senior people there, including the most senior technologists right up to the CTO, thought that their competitors were now being run by AI and that they were going to be left behind. And so, just from asking them follow up questions, I just discovered that people, even highly technical people who have technology platforms and could have tens of billions of dollars of assets under management, have this idea in their head from, I guess, film and television and just the word AI that somebody out there, Google maybe, or their competitors at other hedge funds, just have a machine that takes in any kind of input and spews out the answer to whatever you want it to be doing, to be making your buying and selling decisions if you're a hedge fund.
And so it is amazing, this disconnect between AI in some people's expectations and the reality of the nitty gritty when we're working with data pipelines and building models and feature drift. And as you say, Joe, with structured data, like it is extremely... Like I cannot fathom a system today that you could just feed in any kind of structured data, here's the new table, figure it out and tell me what stocks to buy and sell. Like it's fantasy.
Joe Reis: 01:07:43
But the thing is though, I think back to the question though. Why is it that companies want to jump into AI first? And I think as you alluded to, in this example, there's a lot of FOMO. If you're missing out and what this does is it creates like a very fascinating prisoner's dilemma game where the optimal decision is actually to do AI. You would, that would be the rational choice. And so your choice is to go and do that because your opponents probably doing that too, and so on and so forth. So I think a lot of that is maybe fear driven marketing on the part of certain large companies but what I also notice is there's... I've worked with certain executives in the past where like, the things they want to do is I want to be able to talk to my peers about all the AI I'm doing.
Whether I'm actually doing it or not, it's a different story, but I want to be able to talk about the fact that we have AI initiatives at my company. So it becomes sort of this interesting keeping up with the Joneses with execs too. And I've seen this firsthand, which is why I know that a lot of these companies want to jump straight into it. Not necessarily because there's a practical use case, but because this is a great way for promotions, you'll work at a company where they're doing like slightly cooler AI stuff. It's a reality of it.
Jon Krohn: 01:09:09
Yeah. And so if people want to check out this video that Joe did in late 2020, it's called How to Move From Barely Doing BI to Doing AI. That's what sparked this whole question. And yeah, thanks for the great and thorough answer, both of you. We will include the link to that video in the show notes, as well as everything else that we can think of that we've mentioned on the show. So we are nearing the end of this episode and something that we always ask guests. And for the first time, I'm going to be able to ask two guests this question. Do you have a book recommendation for us?
Matt Housley: 01:09:45
Shall I go first?
Joe Reis: 01:09:46
Yeah.
Matt Housley: 01:09:46
So I'm going to recommend a book that is actually more on my reading list. I've been reading about this concept and want to read the book that it comes from. But it's a book called The Gray Rhino by, I think it's Michele Wucker. And the idea is, I think many of us have read books by Nassim Nicholas Taleb, where he advocates for this notion of a black swan. And Michele says basically, well, actually many of these events that we call black swans are highly, highly predictable, like a gray rhino, right? Doesn't mean 100% probability, you're going to see a gray rhino, but it's a high probability. And so things like the COVID-19 pandemic, for example, at least once it hit China, what we saw with US businesses and policy makers is they kind of said, "Ah that's a Chinese problem. Maybe it will come here." Instead of saying, "You know what, there's a high probability that it's going to cross our shores eventually. And maybe we should do something about that."
Joe Reis: 01:10:44
My recommendation, let's see, it's not 50 Shades of Gray Rhinos, but I think, okay, so I'm reading The Ministry for the Future right now. It's about climate change. I think that's a really good book. But I think my all time favorite book is still Poor Charlie's Almanack. So it's a collection of writings and talks from Charlie Munger, who I consider to be the smartest person on the planet. So I'd [inaudible 01:11:13] recommend that one.
Jon Krohn: 01:11:14
Supposedly he reads 500 pages a day.
Joe Reis: 01:11:17
I wouldn't be surprised. He's just crazy because he's got like one eye. So his eyes get tired. And I know that because Matt and I actually, we were at a talk at Charlie's Daily Journal shareholders meeting a couple years ago. And like you had to stand on a certain, you had to let... If you're on a certain side of him, you had to let him know because he can't see you. But the guy he's an information machine. I would say more importantly, it was a machine. I saw him actually at the recent Berkshire meeting in Omaha a couple months ago. And I swear he gets smarter every year and he's like 98 right now, but I've seen him for a long time, but it's like for some reason it's contrary to human biology, but he just gets smarter and wiser every year. It's the weirdest thing.
Jon Krohn: 01:12:03
Yeah, one of the strongest negative correlations with developing Alzheimer's is level of education. And the theory here is that the more that you learn, the more pathways that you create between memories and ideas in your brain. And so even if, so it... Let's pretend we have two identical people, twins who are genetically exactly the same, and one of them doesn't learn very much. Is competent in the world but doesn't read books, isn't trying to learn or grow. And then the other one finishes a PhD, becomes a data scientist and then a data engineer, and their whole life is constantly learning new ideas and concepts. They're always reading. So even if both of those people, due to that genetic predisposition, and let's say they have the same diet and other environmental factors or whatever, so that they develop Alzheimer's at the same rate in their brains.
Then the person who has all of these extra pathways, even as some of those pathways get blocked, there are other ways of retrieving information that aren't available to the person who hasn't developed all those pathways, so that could be it. So if Charlie's reading 500 pages a day maybe there's no reason to think that you would have any cognitive decline.
Joe Reis: 01:13:40
Oh no, no. And it's quite the contrary. Him and Buffett, I would say, are just two of, I would say, the most active thinkers I've seen. Gates is probably, Bill Gates is probably among them too. But there's just a different level of, I think, curiosity. The person as I said... The person I saw read the fastest was actually Kim Peek. So have you ever watched the movie Rain Man or heard of it?
Jon Krohn: 01:14:06
No, I haven't, but I know it.
Joe Reis: 01:14:09
It's about a savant, right. But it's based on somebody who actually lives in or lived in Salt Lake City, and Kim Peek was considered the human Google. I've actually sat across from and watched him. Because he was probably oblivious I was sitting across from him. His brain was like somehow fused. And so he was like on paper, mentally disabled, but he could read like probably a 300 page book in like 15 minutes, probably less actually, because I clocked it. But he could read both pages simultaneously and just keep flipping like this and he would have 99% recall. So you could ask him any fact, like what day did this happen on and be like, "Yeah, it's this, what are you talking about?" So anyway, that's another example too.
I would say you just never know. I mean, he was like a living Google. It just makes you wonder how many other people are like that. I think for me, what's interesting is just intellect's one of these things where I think all of us probably have our own idea of what it is. And then there's just other people who are just in completely different leagues in their own special way. So anyway...
Jon Krohn: 01:15:13
Notoriously difficult word to define. And when you say fused, you mean that the two hemispheres of his brain-
Joe Reis: 01:15:19
Yeah.
Jon Krohn: 01:15:19
... weren't separated? Huh.
Joe Reis: 01:15:21
No.
Jon Krohn: 01:15:21
Wow.
Joe Reis: 01:15:21
Yeah. You should go watch a video on him sometime Kim Peek. It's fascinating.
Jon Krohn: 01:15:25
Sounds cool. Yeah, I'll do that, at my next meal. Seriously that's when I watch stuff. So awesome episode guys. Thank you so much.
Joe Reis: 01:15:38
Thank you.
Jon Krohn: 01:15:39
This, my first three-way I would say was a great success.
Joe Reis: 01:15:42
I'm glad we could help you.
Jon Krohn: 01:15:43
We had a lot of fun and hopefully the audience enjoyed this as well. So how can listeners keep up with the latest from both of you? Matt, do you want to go first?
Matt Housley: 01:15:55
Yeah, I'm on LinkedIn. I'm also on Twitter. Not really as active there, but I do have this ambition to try to be more active there. We also have a weekly newsletter that comes out where we both write articles and maybe Joe, you can talk a bit about that.
Joe Reis: 01:16:09
Yeah. It's on ternarydata.com, right now. We're actually going to be moving it to Substack very shortly. So that's actually the first time we announced it, but that'll be happening, the newsletter. It's a very popular underground newsletter, but unfortunately if you aren't subscribed, you can't get the articles and we like to keep it that way. It's sort of like an underground mix tape from back in the day, but you can find me on LinkedIn. I'm not on any of the social medias. You can actually find an article somewhere on the internet about why I'm not on any social media except for LinkedIn, but that's how you can find me.
Jon Krohn: 01:16:40
Cool. Sounds great. Well, it's funny that you say, I mean, LinkedIn is definitely a social media platform and on that social media platform, you have 23,000 followers. So not doing a great job of staying isolated from social media. But yeah, that sounds interesting. We're out of time. I'm sure listeners can find that article. Maybe if you ping it over to me, Joe, I can make sure that's in the show notes for people to read about why you're not on other social media platforms. Super cool. Thanks so much for taking the time. It's been wonderful having you on the show. Thank you, Matt. Thank you, Joe. And looking forward to catching up with both of you again.
Joe Reis: 01:17:24
Thank you.
Matt Housley: 01:17:25
That was great, thanks for having us.
Jon Krohn: 01:17:31
Well, I think it was pretty fun having two data engineering experts trade off answering my questions and backing up each other's responses with even more rich detail. In today's episode, Matt and Joe filled us in on how pure mathematicians seem to be more drawn to data engineering while applied mathematicians and statisticians seem to be more drawn to data science. They talked about how their book Fundamentals of Data Engineering is not focused on specific technologies, but instead on the broader data engineering role, process, and philosophy. They talked about their six data engineering undercurrents, namely security, data management, data operations, data architecture, orchestration, and software engineering. They talked about how there are trade-offs in any data pipeline, such as latency considerations, but faster is typically the default assumption, and they provided their data engineering top tips, including doing all work in the cloud, as opposed to locally, and ensuring cross-functional collaboration.
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Joe and Matt's social media profiles, as well as my own social media profiles at SuperDataScience.com/595. That's SuperDataScience.com/595. If you'd like to ask questions of future guests of the show like one audience member, Leandro, did during today's episode, then consider following me on LinkedIn or Twitter as that's where I post who upcoming guests are and ask you to provide your inquiries.
All right, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience Episode for you. And thanks of course to Ivana Zibert, Mario Pombo, Serg Masis, Sylvia Ogweng and Kirill Eremenko on the SuperDataScience team for managing, editing, researching, summarizing, and producing another awesome episode for us today.
Keep on rocking it up there folks, and I'm looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.