51 minutes
SDS 609: Data Mesh
Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn
Zhamak Dehghani is reimagining how the business world interacts with its data. In this episode, Jon Krohn and his guest unpick the tangles of terminology and explore how the data mesh’s approach towards secure interconnectivity will help solve a roster of data-led business problems.

Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Zhamak Dehghani
Zhamak Dehghani is the CEO and founder of a stealth tech startup reimagining the future of the data developer experience. She's a member of multiple technology advisory boards. Zhamak is an advocate for the decentralization of all things, including architecture, data, and ultimately power. She is the founder of the data mesh concept.
Overview
The business world is witnessing yet another trend, and this time it’s the “data mesh”. The term has been appropriated as a catch-all for pretty much anything to do with wrangling and analyzing unwieldy amounts of siloed data. But what is a data mesh, and are companies using the word correctly?
Episode guest Zhamak Dehghani and host Jon Krohn consider why some terms catch on while others (like "data quantum", which Zhamak coined in her latest book Data Mesh but concedes is too "nerdy" to have caught on) are strongly resisted. They also tackle the question: What needs to change to ensure we stop treating data mesh as a buzzword and make full use of it in the business environment?
Zhamak first addresses the terminology, explaining that a data mesh helps businesses get insights across application silos. By "meshing" data, the approach lets teams across departments work with the same data simultaneously: each data node has an owner, who can choose to make it accessible to teams across the company. Data meshes allow groups to work safely and autonomously, and that's not all: machine learning models can also run within a data mesh, returning helpful results and insights automatically.
Zhamak reflects that data meshes are ultimately better at solving the problems that data warehouses and data lakes were themselves devised to address. Large companies with lots of data will quickly find that data meshes mark a turning point in helping individual groups work with the company's essential information. By standardizing interfaces and federating data, data meshes interconnect and synchronize how people across departments work, without anyone needing to worry about losing or compromising their precious information.
It sounds like a promising start, but Zhamak notes that creating the "perfect" data mesh isn't entirely free of obstacles. In its design, it is imperative to ensure solid architecture at both the front and back end of the system, for ease of use and for security. On the latter, Zhamak contends that the data mesh's focus on localizing data will help avoid breaches. The empathetic technologist also shares with Jon why she decided to launch her stealth startup, which she plans will address obstacles concerning interoperability and security, among others.
In this episode you will learn:
- The importance of data meshes [03:29]
- How standardizing database interfaces helps tech giants like Amazon [06:40]
- Current challenges with data meshes [09:33]
- How data meshes give users the freedom to work with data [17:09]
- The missing piece of the puzzle for data meshes [22:11]
- How data meshes connect with the metaverse and Web3 [33:18]
- The times when data meshes aren’t fit for purpose [42:24]

Items mentioned in this podcast:
- Zencastr - Use the special link zen.ai/sds and code SDS to save 30% off your first three months of Zencastr Professional. #madeonzencastr
- Data Mesh by Zhamak Dehghani
- Software Architecture: The Hard Parts by Neal Ford, Mark Richards, Pramod Sadalage, Zhamak Dehghani
- Build: An Unorthodox Guide to Making Things Worth Making by Tony Fadell
- Where Mathematics Comes From by George Lakoff, Rafael Núñez
- Forgive for Good by Frederic Luskin
- Jon Krohn’s Podcast page
Follow Zhamak:
Podcast Transcript
Jon: 00:00
This is episode number 609 with Zhamak Dehghani, entrepreneur, author, and founder of the Data Mesh concept. Today's episode is brought to you by Zencastr, the easiest way to make high-quality podcasts.
Welcome to The SuperDataScience podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I'm your host, Jon Krohn. Thanks for joining me today. And now let's make the complex simple.
Welcome back to The SuperDataScience podcast. Our guest today is the illustrious and visionary Zhamak Dehghani. As the founder of the concept of the Data Mesh, Zhamak is an advocate for decentralization, including with respect to distributed AI. Zhamak is newly the CEO and founder of a stealth tech startup, reimagining the future of the data developer experience through data meshes. She previously worked as a software engineer, software architect, and as a technology incubation director. She authored the O'Reilly book, Data Mesh, and also co-authored an O'Reilly book on software architecture. She holds a bachelor of engineering degree in computer software from the Shahid Beheshti University in Iran, and a master's in information technology management from the University of Sydney in Australia.
Today's episode should be broadly interesting to anyone who is keen to get a glimpse of the future of how organizations will work with data and AI. In this episode, Zhamak details what a data mesh is, why data meshes are essential today and will be even more so in the coming years, the biggest challenges facing distributed data architectures, why now was the right time for her to launch her own data mesh startup, and her tricks for keeping up with the rapid pace of technological change. All right, you ready for this awesome episode? Let's go.
Zhamak, welcome to The SuperDataScience podcast. It's such a delight to have you here. Where in the world are you calling in from?
Zhamak: 02:21
It's great to be here. I'm calling from north of San Francisco, an area called Marin County.
Jon: 02:28
Oh, yeah. Nice. Is that the same thing as the wine country, kind of around?
Zhamak: 02:33
Just before.
Jon: 02:34
Just before.
Zhamak: 02:35
Just before the wine...
Jon: 02:36
In-between?
Zhamak: 02:37
In-between, about 20, 25 minutes north of Golden Gate Bridge in traffic.
Jon: 02:43
Nice. And so we know each other through Scott Hirleman, who's the host of Data Mesh Radio, and he has been a big builder of the data mesh community. He's a huge advocate for it. We're going to be talking about data meshes throughout today's episode. And I think we should jump right into it. So for five decades, organizations stored their data in data warehouses. But then in the last decade, data architectures evolved a ton. So first data lakes came along, and then more recently, things like multimodal cloud data architectures popped up. Now you're proposing an even newer architecture called the data mesh. So could you elaborate on what a data mesh is and why we need them now?
Zhamak: 03:33
Exactly. Maybe I'd start with why we need them now, and think about that history of data architecture evolution. I think all of those solutions were meaningful responses to a problem at a point in time, and they did phenomenal work. So as an example, when we look at why we had data warehousing: it was addressing the problem of needing to get data out of application silos, and that was very hard. Business intelligence was hard. So we had the data warehouse. And since then, we have made incremental, evolutionary improvements on it. So in 2010, we got the data lake, because we still wanted to run crosscutting analytical workloads, but the process of modeling data perfectly to do that analysis was too costly, causing friction.
So we said, "Well, let's dump the data just as is in this [inaudible 00:04:36] lake, less structured, and we can still get meaningful data out of it for the data scientists' kind of scientific workloads and ML workloads." And I think data mesh came, again, very recently, at a point in time when those centralized approaches weren't meeting the needs of complex organizations that need to move really fast. So they're each a response to a problem that arises with the evolution of technology. And data mesh particularly, what is it? It's a decentralized socio-technical approach to managing, sharing, and accessing data for ML and analytical use cases.
Jon: 05:21
Socio-technical?
Zhamak: 05:22
Socio-technical, yes. So it started as an architecture, to be honest, but then very quickly you realize architecture and people are very close: the way we organize ourselves and the architecture mirror each other (Conway's Law), and vice versa, architecture influences the way we organize teams. And data mesh is a response to organizational complexity, to the rapid change of organizations' missions and their growth, and to the application of data in all sorts of different teams and areas; it needed to respond to organizational complexity first. Hence it was phrased as socio-technical. It concerns itself with architecture, technology, and people.
Jon: 06:07
So a data mesh would allow an organization to collaborate on data projects in a way that otherwise wouldn't be possible?
Zhamak: 06:20
Exactly. Exactly. So data mesh allows kind of independent autonomous teams that are organizing themselves around a particular business outcome, a particular business mission, a particular business function to do data work.
Jon: 06:39
The finance team, the HR team, the data science team, all of these people are connected to the data mesh. It allows different parts of the organization to make use of one kind of consistent... I don't know how to phrase it, you could probably do it better than me. But you can have one kind of consistent data process that is then accessible to different kinds of teams that might use it in very different ways. So the way that a finance team works with data, maybe with Excel spreadsheets or financial modeling, is very different from how human resources would use the data, or how the data science team would. Is that kind of the idea?
Zhamak: 07:24
You're getting there. So maybe a little bit... "domain" is a funny word and it can be interpreted in many different ways. But you are right that we want to structure data ownership and data-sharing capabilities and responsibilities around parts of the business that can operate fairly autonomously. So as you said, the finance team generates some data and consumes some data. We want to give them autonomy to do that, but in an interconnected fashion; that cohesive kind of experience you're talking about is about standardizing the interfaces between these teams. So when the API revolution and the microservices revolution happened, back in building larger-scale operational systems, Amazon had this idea of two [inaudible 00:08:14] teams: the teams are autonomous, and the teams only communicate through APIs.
So bringing that idea to the world of data: you have a retailer with a team that's focusing on customer acquisition, a team that is focusing on eCommerce, a team that is focusing on order management, logistics, and so on.
And each of these has a very clear business outcome and function, instead of being organized around technologies: your eCommerce application, your order management services. So why don't we extend this idea of domain-oriented organization and architecture to data? And if we did that, what are the foundational technologies and principles we would have to put in place so we don't end up with the data siloing we have today? Because data siloing around domains and applications is what resulted in the collection and accumulation of data in warehouses and lakes. So we have to make some changes to have [inaudible 00:09:12], to have autonomy of teams and business functions, and yet interconnectivity and access to data across these boundaries.
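To make the Amazon analogy concrete, the standardized interfaces between domain teams can be pictured as a shared contract that every data product implements, while consumers in other domains rely only on that contract, never on a team's internals. This is a minimal Python sketch; the class and field names are invented for illustration and are not from Zhamak's book:

```python
from abc import ABC, abstractmethod
from typing import Any

class DataProduct(ABC):
    """Standardized contract that every domain team's data product exposes."""

    @abstractmethod
    def schema(self) -> dict:
        """Describe the shape of the data being shared."""

    @abstractmethod
    def read(self) -> list[dict[str, Any]]:
        """Serve the data through the product's output port."""

class OrdersDataProduct(DataProduct):
    """Owned by the order-management team; other teams never touch its internals."""

    def __init__(self) -> None:
        self._orders = [
            {"order_id": 1, "customer_id": 42, "total": 99.5},
            {"order_id": 2, "customer_id": 7, "total": 15.0},
        ]

    def schema(self) -> dict:
        return {"order_id": "int", "customer_id": "int", "total": "float"}

    def read(self) -> list[dict[str, Any]]:
        return list(self._orders)

# A consumer in another domain (e.g. finance) relies only on the shared contract.
def total_revenue(product: DataProduct) -> float:
    return sum(row["total"] for row in product.read())

print(total_revenue(OrdersDataProduct()))  # 114.5
```

The point of the sketch is that `total_revenue` would work unchanged against any team's product, because the teams communicate only through the standardized interface.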
Jon: 09:22
Love it. I think I finally get what a data mesh is. Thank you so much for that amazing explanation. So what are the biggest kinds of challenges that we face with these kinds of distributed data architectures like data meshes? Are there privacy concerns in that situation or is it kind of like federated learning where a data mesh actually helps with privacy concerns?
Zhamak: 09:47
Absolutely. I mean, you touched on one of the most important points: a lot of cross-cutting concerns become complex to manage. There is an inherent kind of complexity that comes with distributed systems. When you have a centralized, monolithic system, it's very easy to say, "I'm going to put walls around it," and suddenly I've got security and privacy; when you separate, you don't. Which is actually a... I don't agree with that point. I think it makes us feel comfortable, but in reality it's actually very hard and less secure. And we can talk about privacy and security in a second. But in general, for data mesh to be successful, one of the challenges is the engineering discipline that has to be put in place so that these autonomous units (in data mesh language, we call them data products) can be secure, can be standardized in terms of their interfaces, and can have a level of understanding and trust built into them, through both engineering practices and their APIs and interfaces, so that applying and using them doesn't put a lot of burden on the user.
The user doesn't have to deal with so much diversity. So I think the most challenging part, which I don't think we have really built and solved, is the engineering, the operating system that runs data mesh, right? How do we automate these cross-cutting concerns, like the privacy that you mentioned? That's why data mesh has a fourth component around computational policies. It expects that every single data product has a set of policies that are applied to it and maintained by it, in an automated fashion. So right now-
Jon: 11:48
So the data science team can't get access to everybody's pay information from the HR team across the company. Even though they're interconnected, privacy is built into the way the system is engineered, so it's not a free-for-all where anyone can access any data across the organization.
Zhamak: 12:09
Exactly. Because there are two things that happen. One is the localization of data to a particular domain: you can then have this fine-grained access to different data products, and each of them gets managed differently by the rightful owner, by the part of the business, the line-of-business team, that is actually managing that data. So that fine-grained, data-product-oriented application of the policies is one way that makes the data more secure. And then the other way, of which I don't see much implementation yet, is that once you have this idea, a data mesh is computational data, in a way. You've got computation, policy computation, data transformation computation, as well as the data itself, as one unit with clear contracts for sharing it.
So once you have this computational data in a way, then the access to the data in future, through those computational APIs look very different.
So right now, access to data is like, "Give me all the bits and bytes." And of course we are building differential privacy and a whole bunch of other techniques, but the access of the future is going to be different from "just give me the files, the rows and columns." It might be "actually run this computation that gives me some insight about the distribution of the data, and just give me the bit that's relevant." And it's probably more secure. So a big shift with data mesh that a lot of people miss is the reason I didn't use the language of "data" or "dataset", and instead used the language of a data product. And I had this quirky name in the book called data quantum, which didn't catch on, so I'm not going to use it much.
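The shift from "give me the files" to "run this computation for me" can be sketched as a node that exposes only approved computations, governed by its policies, and never its raw rows. All names below are hypothetical illustrations of the idea, not a real data mesh API:

```python
import statistics

class SalaryDataProduct:
    """Hypothetical data product: raw rows never leave the node."""

    # The owning team's policy: only these computations may run on the data.
    _ALLOWED = {"mean": statistics.mean, "median": statistics.median, "count": len}

    def __init__(self, salaries):
        self._salaries = list(salaries)

    def compute(self, metric: str):
        """Run an approved computation locally and return only the result."""
        if metric not in self._ALLOWED:
            raise PermissionError(f"computation {metric!r} is not permitted")
        return self._ALLOWED[metric](self._salaries)

hr_node = SalaryDataProduct([50_000, 60_000, 70_000])
print(hr_node.compute("median"))  # 60000: the insight, not the rows
# Note there is no read_rows() method at all: direct access is simply not offered.
```

In this toy version, a consumer like the data science team gets the median salary but has no path to the underlying pay records, which mirrors the privacy property discussed above.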
Jon: 14:04
Was it data quantile?
Zhamak: 14:08
Yeah. Data quantum and data quanta as in-
Jon: 14:08
Oh, data quantum?
Zhamak: 14:10
Yes, as one and data quanta as many. But yeah, so data quantum, it was a language that I used in the book and it's just too nerdy for people to feel comfortable with it. But the idea was that-
Jon: 14:22
And this is just quickly, for listeners: the book that Zhamak is referring to is her book. She's the sole author of Data Mesh: Delivering Data-Driven Value at Scale, which was published by O'Reilly earlier in 2022. And it is very popular and well-reviewed on Amazon as well as oreilly.com. So yeah, just a little bit of context.
Zhamak: 14:50
Thanks a lot for that.
Jon: 14:50
A lot of terms... no, I would've provided a brief intro to the fact that you're an author in the episode intro, but it's nice to dig into it a little bit. So in that book, you're not only explaining what data meshes are but also defining terms, setting a standard, in a way, for the terminology in the data mesh world. And some of those terms, you're saying, catch on better than others, and data quantum hasn't taken off just yet.
Zhamak: 15:17
It's not taking off. So "data product" is basically a node on the mesh, and it very quickly got adopted. There are many manifestations of it, but a certain manifestation that I had in mind, which I call data quantum in the book, yes, it's not very popular. It's kind of scary and from the future, so people don't want to come close to it. It's about the idea that the units we will use, these portable units of exchange of value, of data, will have some additional capabilities. And one of those capabilities is being able to perform computation on the data. That really opens up the possibility of secure data processing that doesn't require direct access to the data itself; you can push processing out to these nodes.
Jon: 16:12
Super cool. Trying to create studio-quality podcast episodes remotely used to be a big challenge for us, with lots of separate applications involved. So when I took over as host of SuperDataScience, I immediately switched us to recording with Zencastr. Zencastr not only dramatically simplified the recording process, we now just use one simple web app, it also dramatically increased the quality of our recordings. Zencastr records lossless audio and up to 4K video and then asynchronously uploads these flawless media files to the cloud. This means that internet hiccups have zero impact on the finished product that you enjoy. To have recordings as high quality as SuperDataScience yourself, go to zencastr.com/pricing and use the code SDS to get 30% off your first three months of Zencastr Professional. It's time for you to share your story.
I didn't know that at all about data meshes until just now. So hopefully many listeners out there are also learning this. It sounds like a big, revolutionary part of the data mesh idea is that you don't need to be pulling out the raw data, a table of data; instead, you're asking for the computation to happen, and happen separately from you, in the data mesh. I don't know if we can say it like that. And then it returns for you the result or the insight, as opposed to a structured table that you then have to run some model on yourself. That's cool.
Zhamak: 17:43
Yeah. And I think these are kind of parallel, and this is a bit futuristic. We don't have that. I mean, for somebody who's listening, the thing about data mesh is that there is no beautiful implementation of it yet... it came from the future a little bit, and we try to build it with our present tools. So people listening to this, if they're coming from data virtualization or federation [inaudible 00:18:10] engines, will say, "Well, we have that right now. It's called data virtualization. You run SQL statements distributed." Yes, that's one way of performing computation, but SQL is not the only way to perform data computation. If machine learning engineers are training models, they're not necessarily running SQL. Right? So-
Jon: 18:31
Probably not.
Zhamak: 18:32
Very likely not. Exactly. So yeah, there are pieces, slices of data technology that exist that can be adapted to this model and take us one step forward. But we still don't have a generalized way of bringing these ideas to life and then allowing different technology providers to stick to it. We don't have a codification of this architecture. So some of the stuff we talk about today is a little more futuristic than what is possible right now.
Jon: 19:05
Nice. So speaking of which, I found out just before we started recording, it's something that you just made public on your LinkedIn profile, I guess days before we started recording. This episode is coming out in September 2022, but we're recording in August. So your job title just switched to founder and CEO of a stealth startup, and this probably relates to the kind of future that you've been envisioning and talking about. So why now, after a long history at your previous company? You were at ThoughtWorks for nearly 11 years. There must be something special about this moment in time that made you feel like it was the right time to step out and start your own thing. What's going on out there?
Zhamak: 19:59
Sure. For the last four-plus years, I've been focusing on this: what does a future of data look like that is scalable, resilient, and yet intentionally responsible? And on the path, the trajectory, that movements like data mesh can put us on. I was privileged enough to have a platform at ThoughtWorks and, with the help of the likes of Martin Fowler and his reach and audience, to be able to put these ideas out there and get them heard. Of course people liked them, and the industry has spoken. But then there is a point in time when you sit back and look: okay, how is this little butterfly wing changing and reshaping the future of technology? And what I saw was that the pain points were real. With data mesh, the pain points surface, people are excited about the idea, the technology. A lot of technology providers are kind of sitting on the edge, relabeling their technology as being used in data mesh. But [inaudible 00:21:06]
Jon: 21:06
What you're saying there to be clear is that there's lots of big technology providers out there that are jumping on what is clearly a trendy data name, data meshes?
Zhamak: 21:14
Exactly.
Jon: 21:15
And they are renaming existing services that are data mesh-y as data meshes and saying, "Yeah, we've got that already. We've got it. No problem." And exactly as you're saying, what data meshes will be in the not-too-distant future is a far cry from what some of these big existing vendors are calling data meshes today.
Zhamak: 21:39
That's the plan. Yeah. It felt to me, to be honest... people who make these sudden changes in their lives probably have this [inaudible 00:21:49] moment of, "Will I regret this on my deathbed?" And it was one of those litmus tests: will I regret it if I don't participate in shaping the technology itself? And I was stupid enough to say, "Yes, I will," and jump in the ring. Yeah.
Jon: 22:09
Awesome. So clearly the tools of today, whether we're talking about data meshy tools that people might be branding today, or even more archaic tools that can't even pretend to be data meshes, clearly the tools of today are not cutting it in your view for organizations to be working with data together, to be collaborating on data. So what is your vision for the future? Give us a taste.
Zhamak: 22:40
I have to be careful because I can't share a whole lot, so I'll dance around it a bit. But what I see really missing is a layer that enables this very autonomous experience of working with data while still being able to share it: these computational data nodes that we talked about, these data products. The platform that really makes it easy for developers to create them, use them, connect them, and share them, and makes data scientists' lives easier so they can find them, easily trust them, and get access to them. So there's a whole set of platform capabilities that can be built to reshape the experience of data developers. I mean, I'm just using "data developer" as a label.
Jon: 23:30
As an example of one of the many types of potential users of a data mesh. I think this has been implied by some of the things I've been saying about my rough understanding of data mesh, but the users of the data mesh could vary widely, from highly technical machine learning practitioners and software developers who are maybe writing code to interface with aspects of the data mesh, to point-and-click users like HR managers or things like that. Is that correct?
Zhamak: 23:59
That is correct. I mean, every data product provides... the idea is that you provide multimodal access to your data, so it's the same data, but you can support those levels. The layers that the company is going to focus on are probably below that, below the point-and-click type interfaces. I think we want to reshape how all the other tools come in and connect and provide those higher-level experiences. Now, if I say more, I'll say too much, but really think of it as an operating system, a new operating system for data that organizes the future of technology.
But the way to think about it is: right now, a lot of our technologies, with the decomposition of large tech into smaller pieces, what we call modern data engineering tools or the modern data stack, are still organized around an operating system of pipelines. Pipelining data from one place to another... "pipeline" is such a common word in our language. If data mesh is successful, the pipeline disappears as a first-class concern in our language.
Yeah, because pipelines have a lot of challenges when they're scaled: they don't have clear contracts, and they're very task-oriented or job-oriented. So very quickly they get hairy and complex and hard to debug and hard to maintain. Clear boundaries of contracts and interfaces are not embedded into them. And there is a separation of data from the computation; for me, it's a separation-of-body-from-soul type scenario. So the technology right now is very much organized around pipelines and then centralized storage, and it could be a feature store or a warehouse or a lake or whatever, and then layering: a metadata layer, an access control layer.
So we want to build something that actually gives a different operating model, a data mesh like operating model that then the tools and technologies can attach to it. But the basic fundamental constructs look very different. They don't look like pipelines.
Jon: 26:16
So I guess if we shift away from pipelines and layers and centralized stores, with this data mesh, if it's not centralized, the data, I guess, is stored within the individual nodes. So it's more proximal to the teams working with it, but still available to other teams across the organization.
Zhamak: 26:42
At the logical level. At the logical level, the control and lifecycle management of the data is within the node, within the teams that actually do that work. But the interconnectivity allows federated training or federated access or querying. Underneath that, if you are an organization that has standardized on a particular storage system, I don't know, Amazon S3 or whatever storage system you use, then at the physical layer you can have all of the physical storage of those nodes harmonized, even co-located for rapid access. So those are, for me, two separate layers of concern: where and how are we physically storing, optimizing for the machine in terms of access and movement and so on; and how are we logically storing or presenting that storage, optimizing for people and autonomy? That logical layer doesn't have an organizing system right now.
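One way to picture the two layers Zhamak separates is a catalog in which each node is logically owned and managed by a domain team, while the physical bytes of every node happen to live in one standardized store. The bucket paths and team names below are invented for illustration:

```python
# Logical layer: each domain team owns and manages its own data product node...
catalog = {
    "finance.revenue": {"owner": "finance-team", "path": "s3://lake/finance/revenue/"},
    "hr.headcount":    {"owner": "hr-team",      "path": "s3://lake/hr/headcount/"},
}

def owner(node: str) -> str:
    """Logical concern: which team controls this node's lifecycle and access."""
    return catalog[node]["owner"]

def physical_location(node: str) -> str:
    """Physical concern: where the bytes actually live, here one shared bucket."""
    return catalog[node]["path"]

# Autonomy at the logical level, harmonized storage at the physical level:
print(owner("hr.headcount"))              # hr-team
print(physical_location("hr.headcount"))  # s3://lake/hr/headcount/
```

The sketch shows why the two concerns can be optimized independently: swapping the shared bucket for another store changes `physical_location` without touching who owns what.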
Jon: 27:52
Super cool. So we've talked about how this distribution and this federation... I don't think we've defined federation for the listeners yet. This idea of federated learning is the ability to, say, have data on your phone that could be useful for training a machine learning model, but it's data that is very private to you, maybe healthcare data or something else that's sensitive, or for whatever reason you don't want to share it. Historically, the way that big tech companies have trained machine learning models is by pulling the data from your phone and centralizing it themselves. With federated learning, your data stays on your phone, but we're able to train machine learning models with those data anyway. So in that respect, this idea of federation, this distributed sense of it, seems kind of related.
And then you also specifically specialize in this idea of distributed AI, so I'm starting to get the sense of how that's related to data meshes. If the data mesh allows different groups within a broader organization to work autonomously, and potentially to have machine learning models running in the mesh that return results or insights to them, as opposed to them needing to work with the raw data, I can see how that's related to this idea of distributed AI. But up until I was researching for your episode, I hadn't heard the term distributed AI. So maybe you can tell me what I got wrong in my high-level explanation, or refine the idea further.
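The federated learning idea Jon describes can be sketched as a simplified federated averaging loop: each device computes a model update on its own private data, and only the updated weights, never the raw data, are averaged centrally. This is a toy one-parameter model for illustration, not a production implementation:

```python
def local_update(w, data, lr=0.1):
    """One step of gradient descent for a 1-parameter model y = w*x, run on-device."""
    grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
    return w - lr * grad

def federated_round(global_w, devices):
    """The server averages the locally computed weights (FedAvg, much simplified)."""
    updates = [local_update(global_w, data) for data in devices]
    return sum(updates) / len(updates)

devices = [
    [(1.0, 2.0), (2.0, 4.0)],   # device A's private data (its true w is 2.0)
    [(1.0, 2.1), (3.0, 6.3)],   # device B's private data (its true w is 2.1)
]

w = 0.0
for _ in range(50):
    w = federated_round(w, devices)
print(round(w, 2))  # 2.07: a blend of both devices' data, learned without moving it
```

Each device's raw `(x, y)` pairs never leave the `local_update` call; the server only ever sees weights, which is the privacy property being discussed.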
Zhamak: 29:53
Yeah, absolutely. I think you are right. And again, data mesh doesn't try to hyper-optimize for that scenario, but the idea is that that scenario is made possible: you don't have to move data around. And AI, distributed AI as you described it, or distributed federated training, is one application of distributed insights or analytics. Another application could be as simple as some simple analysis of the distribution of the data and the shape of the data, or it could be-
Jon: 30:22
What's the average?
Zhamak: 30:23
What's the average? What's the median? Or generating live reports. Dashboards are pretty; I don't personally get a lot of value out of them, but nevertheless they create a digital experience for decision makers, and that's the other end of the spectrum. Or anything in between. So what we want to do with data mesh is really remove centralization as a bottleneck: centralization of organization, centralization of technology, and ultimately centralization of power. So yes, all of those scenarios should be possible distributedly. Having said that, to be completely pragmatic, we have to allow an interim kind of world where new datasets, or new data products, I have to correct myself, it's not just about datasets.
But new data products get created from the upstream data products; they get aggregated, more business logic gets applied to them, and they become a data product in their own right.
And they become another federated source for those computations, which implies a level of data transformation and data copying into new nodes. But if we do that, it's because those nodes have inherent value. As an example, you might have three different sources that customer touchpoint information comes from, and you want to create this, I don't know, customer touchpoints aggregate, across call center systems and teams, eCommerce, or whatever, to get a holistic view of your customer. And you might say there is business logic we could apply here, some intelligence to actually detect that it's the same customer who came through different touchpoints, and so on. And we create a new data product, customer touchpoints, as an example.
So this idea of a mesh should allow this kind of infinite, scalable landscape of data products, where interconnecting existing ones generates higher-value data products. But each of those nodes is valuable in itself. It provides some value to some user, and then interconnection can generate higher-value data products.
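As a rough sketch of that composition, here is one way the customer-touchpoints example could be modeled in code. This is a hypothetical illustration only: the `DataProduct` class, the matching-by-email logic, and all names are assumptions for the sketch, not any real data mesh API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical sketch: each data product is an independently owned node that
# serves data, and nodes can be interconnected into higher-value products.

@dataclass
class DataProduct:
    name: str
    owner: str                                  # domain team accountable for this node
    upstream: list = field(default_factory=list)
    transform: Optional[Callable] = None        # business logic applied to upstream outputs
    records: list = field(default_factory=list)

    def serve(self):
        """Return this product's data, computing from upstream nodes if needed."""
        if self.transform is not None:
            return self.transform([up.serve() for up in self.upstream])
        return self.records

# Source products for customer touchpoints, owned by different domains.
call_center = DataProduct("call-center-touchpoints", "support-team",
                          records=[{"email": "a@x.com", "channel": "phone"}])
ecommerce = DataProduct("ecommerce-touchpoints", "web-team",
                        records=[{"email": "a@x.com", "channel": "web"},
                                 {"email": "b@y.com", "channel": "web"}])

def unify(inputs):
    """Detect the same customer across touchpoints (here, naively, by email)."""
    merged = {}
    for records in inputs:
        for r in records:
            merged.setdefault(r["email"], []).append(r["channel"])
    return merged

# A new, higher-value data product built by interconnecting existing nodes.
touchpoints_360 = DataProduct("customer-touchpoints", "customer-domain-team",
                              upstream=[call_center, ecommerce],
                              transform=unify)

print(touchpoints_360.serve())
# {'a@x.com': ['phone', 'web'], 'b@y.com': ['web']}
```

Note that each node remains valuable on its own: `call_center` and `ecommerce` still serve their domain users directly, while the downstream aggregate adds the holistic customer view.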
Jon: 32:54
Super cool. Okay. So I'm starting to get how transformative data meshes are going to be and how exciting it is that you are launching your own company in this space. And so data meshes, distributed AI, these are terms that are very much the spirit or the zeitgeist of 2022. How do you see data meshes coinciding with other technologies that are coming over the horizon, that are becoming more mature? So things like Web3, the metaverse, blockchain, quantum computing. Does the emergence of data meshes relate to these other kinds of emerging technologies?
Zhamak: 33:39
I think maybe there is an underpinning thread across some of them, like Web3 and blockchain and so on. And that underpinning kind of theme is autonomy. It's decentralization. So I think it's in the same vein. It looks like the world is speaking in terms of what is a scalable way to respond to this super complex system that we as humans and machines are co-creating. So this distribution, decentralization, autonomy, is a theme across those. In terms of application, could blockchain become a component of, let's say, your federated governance, to basically permanently store and share policies and never lose track of the policy that was applied upstream as you move downstream? Why not? I haven't explored that, but I think there might be some synergy in terms of applying existing common technologies across those as foundational blocks.
One thing I would say though: when we imagine decentralization, particularly with Web3 and so on, there's an image of anarchy that comes to mind. There's a connotation. Now, data mesh, while it's very much founded in autonomy and decentralization so that we can scale, has another side which is around interoperability and interconnectivity and a sense of intentional responsibility. So it's not that teams are autonomous and can do whatever they want. The teams are autonomous to do what needs to be done for the objectives of those teams. But they're also accountable and responsible as good citizens of an ecosystem, so complying with standards and so on. Now, that might be less pronounced in one community or trend than it is in data mesh.
Jon: 35:33
Wow. Yeah, that was an awesome answer, Zhamak. I love it. And so clearly data meshes, Web3, blockchain, there's lots of change happening for data scientists and others to keep up with. How do you keep up with this fast technological change? Do you have recommendations for listeners?
Zhamak: 35:53
Immersion, immersion, immersion. I mean, immerse yourself however, whenever, wherever you can. Before coming to this podcast, I went for a run and listened to a podcast. Get your hands dirty. I mean, what I find really helpful is to pick a micro project. I don't have a lot of time to go deep in one tool or another, but just pick a little micro project to get your hands dirty for even a few hours to get a sense of the technology, and then immerse yourself with podcasts and books and YouTube. But have a good filter. I must admit there is a lot of low-quality misinformation out there as well. So build a good filter to filter out, "Okay, this is ad driven. There's not a whole lot of value, and this is misleading." And I really don't know what to instruct in terms of developing that top-of-the-funnel filter for yourself to stay sane.
Jon: 36:57
Well, it seems like, just like with the term AI, this term data mesh is being misappropriated by various people who would like to piggyback on how trendy and popular it is. This is probably something that you wouldn't say yourself, but I can give the recommendation to the listener: trust a reliable source like you, who is very much at the vanguard, leading the cutting edge of defining data meshes. Start with something like your book and, as you say, do micro projects related to the things you're learning in it. So get partway through a chapter and think, "Oh, I could implement some simple version of this in code." Give that a shot. That sounds like a really great way to get started. And you might be too modest to suggest it, but I love books in general because publishers, especially well-reputed publishers like O'Reilly, do a great job of ensuring that there isn't bunk in there and that it isn't ad driven, as you say.
Yeah.
And so I think a book like yours, or books in general, are a great place to start when you want to dig into something, as opposed to just relying on, say, whatever blog post or YouTube video you come across. So amazing answers to my questions. Thank you so much, Zhamak. We also had a huge amount of engagement from the audience, so we had great audience questions for you. I had hundreds of people react to my LinkedIn post that you were going to be on the show this week, and some amazing questions. Phil Mourot, for example, who is a senior researcher at the AI Institute of New Zealand, asks: how does a data mesh fit in? How can we use the data mesh during the creation and tuning of our machine learning models? Do we need a centralized system to build machine learning models and a decentralized system to run inferences?
Zhamak: 39:06
That's a fantastic question. So I would say both. In fact, both training and inference can be done distributedly. You might need to build a few things and bridge some gaps that exist, but that's the idea. So the way you would train a machine learning model in a perfect kind of data mesh implementation is that you act as a consumer of the data, and with data coming from many different places, you directly connect to those data products. In fact, in the book, I go through the experience of a data scientist as an example, as a persona of a data user: how you can be part of the mesh and interact with the mesh. So while you are hypothesizing about, "Okay, can I find some patterns here? Can I discover this class of segmentation of my customers based on various attributes?" you need to discover, on the mesh, what the data products are.
You need to connect to the data products.
And if there is something that's not meeting your requirements, there is a data product owner in that domain that you can directly talk to. You don't need to go through a data team middleman. So you go to them and say, "Oh, it seems like some data is missing. Can we augment your data product?" And because they have autonomy, hopefully that process is much faster. So the training can be done kind of distributedly by accessing the data from different data products. And if you happen to require yet another data product that is designed for your case, and hopefully for others, then make that a data product. So this idea of a feature store, where we dump all the features that we've discovered into the feature store, is kind of a paradox in contrast with data mesh.
So as part of the training, you're directly consuming from those sources, and hopefully those sources are... not hopefully, they must be cleansed data that is suitable, with some sort of guarantees and service level agreements that satisfy your use case. And if you need to do some work on it, to really, I don't know, get some sentiment analysis or something else, you do that in your new data product. So your training computation happens in that. So training can be done distributedly, and ideally will be. And also inference: if your model can sit in the flow of the data, consuming data from the sources that you're inferring from, that also becomes, again, a computation of a data product and generates data sets that other people downstream are using.
And that's why I really see data mesh as a governing structure, as an operating model, in that whether your computation is machine learning model training or inference or a simple, I don't know, analysis and transformation, we have a similar structure, a similar graph structure, at the macro level.
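As a minimal sketch of that consumer experience, the following illustrates computing a global statistic by combining per-node partial results, so raw rows never have to be copied into a central store first. The registry, product names, and port interface are all assumptions for illustration, not a real data mesh API.

```python
# Hypothetical sketch: a data scientist discovers data products on the mesh,
# reads from each node through its output port, and aggregates partial
# results (sum, count) instead of centralizing the raw data.

class DataProductPort:
    """A data product's output port, serving cleansed data with guarantees."""
    def __init__(self, name, owner, rows):
        self.name, self.owner, self.rows = name, owner, rows

    def read(self):
        return self.rows

# Mesh discovery: a registry mapping product names to nodes (illustrative).
mesh = {
    "orders": DataProductPort("orders", "sales-team",
                              [{"amount": 120.0}, {"amount": 80.0}]),
    "subscriptions": DataProductPort("subscriptions", "billing-team",
                                     [{"amount": 40.0}]),
}

def federated_mean(product_names, field_name):
    """Compute a global mean from per-node (sum, count) contributions."""
    total, count = 0.0, 0
    for name in product_names:
        rows = mesh[name].read()
        total += sum(r[field_name] for r in rows)
        count += len(rows)
    return total / count

print(federated_mean(["orders", "subscriptions"], "amount"))  # 80.0
```

The same shape extends to training: each consuming computation becomes a node of its own, producing a new data product for downstream users rather than feeding a central pipeline.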
Jon: 42:11
Nice. And then the other questions. So there are questions from Andy Billington, Andre Richter, Raju Basumatary and Juan Pablo, and all of them are around the same idea, which is: are there situations where a data mesh is not appropriate? Or are there situations where you've seen a failed implementation of a data mesh, and how can we avoid those kinds of issues? How can we resolve data mesh failures?
Zhamak: 42:41
Yeah, absolutely. That's a great question. And I might tweak the question slightly and say, "Are there situations at this point in time where data mesh is not the right solution?" So we have to take a slice of time where we are. And there are plenty of scenarios, in this slice of our time, where data mesh is not the right solution because of the level of complexity, technical complexity, and metal work that you have to do to build the frameworks for this to come to life. Unless your organization has the inherent organizational complexity to justify those efforts, don't do it. If data lakes and data warehouses are not a pain point for you, don't do it, because there is a lot of footwork that needs to happen. But maybe fast forward ten years down the track, if data mesh has become business as usual, disappeared into the background, it's just how we do things and the tooling is suitable, then even if you're a small company without the complexity, it might be the right thing for you.
Jon: 43:45
Awesome. That was a great answer. All right. So Zhamak, we've had an amazing conversation. I've learned so much about data meshes from you and you are a foremost expert on it. So I'm not surprised that I did. But I came into the show having this kind of vague understanding and I'm leaving it feeling like I know what's going on. And I'm really excited about data meshes. So we are starting to wind down the episode, which means that I just have my two final questions for you that I ask all of our guests. And the first is, do you have a book recommendation for us?
Zhamak: 44:13
Oh, I have multiple. But if I have to give one, I would give the one I'm reading because it's relevant. If you're building something, if you're building a product like I am, I would recommend Build: An Unorthodox Guide to Making Things Worth Making by Tony Fadell. It's a relatively new book, highly rated already. So I would recommend that. If you want to break your brain, I can recommend something for that. Or if you want to enlighten your heart, I can recommend something for that. [inaudible 00:44:50] for you.
Jon: 44:50
Yeah. Can you do both quickly?
Zhamak: 44:52
We can do both quickly. If you want to really break your brain, Where Mathematics Comes From, by George Lakoff, if I say his name correctly. I actually refer to his other work in my book, and it really breaks your brain: how our brain conceptualizes mathematics. And if you want to enlighten your heart, forgiveness: Forgive for Good by Frederic Luskin. He's a professor. I think he has a forgiveness project at Stanford University. That's my three.
Jon: 45:26
That sounds great. I would love to read that book, Where Math Comes From. I think about that kind of idea philosophically a lot. And so I'd love to dig into that. And then forgiveness is-
Zhamak: 45:38
Sorry. Go ahead.
Jon: 45:39
No, you first.
Zhamak: 45:40
Oh no, yeah. The brain-breaking one, I think it's probably one of those polarizing books that you might agree or disagree with, but nevertheless it's a good exercise.
Jon: 45:47
Nice. Yeah. And then the forgiveness thing is huge. I mean, many years ago I used to hold grudges. I realized through a daily mindfulness practice, I was like, "Wow, there are particular things, particular people in my past where I don't like what they did. That's unforgivable, and I hope they have some kind of karmic event happen. How could somebody like that continue to feel good about themselves?" But I realized through that daily mindfulness practice that holding onto all of that was super unhelpful to me. That was years ago, so why am I letting that experience cloud my day? And it actually led to some experiences where either I was just able to let it go, and it sounds like this book might have lots of guidance for that, which gives you so much more brightness and capacity in your day and creative solutioning.
And in some situations I even reached back out to someone and just sent them a message and said, "Hey, how are you doing? Let's have a conversation," and had a half-hour phone call with somebody that I'd been holding a grudge against. And you're like, "You know what? They're not so bad, and I hope they're doing well." So yeah, it can make a big difference in your life. I love that. All right, the audience probably doesn't need to hear my thoughts on forgiveness; you can refer to the experts on that. And so that just leaves us with our final question, Zhamak. Clearly you are a world expert on data meshes, this emerging topic, and things are only going to get bigger for you now that you've set out on your own and launched your own startup. How can people keep up with your thoughts on the one hand, as well as what you're doing as an entrepreneur on the other?
Zhamak: 47:50
I wish I had a better answer. I hope the company will soon have a website and a way to reach out, but at the moment it's just Twitter and LinkedIn, and platforms like yours, where they host me. But I usually share things on both Twitter and LinkedIn.
Jon: 48:07
Awesome. We will be sure to provide both your Twitter and LinkedIn URLs in the show notes. You definitely have a very active LinkedIn profile and people engage a lot with your content. So we're delighted to also be able to provide you with another platform here on the show. Thank you so much Zhamak for taking the time with us. It's been an amazing episode and hopefully we'll catch you again soon.
Zhamak: 48:28
Thank you, Jon, for hosting me. Thank you for helping me reach a new audience of data scientists and ML engineers. I'm grateful for that.
Jon: 48:37
Nice. My pleasure.
Zhamak radiated presence, warmth and confidence throughout today's episode. I hope that energy shone through to you too. I personally left our conversation feeling excited about what the future of data and AI holds for us. In the episode, Zhamak filled us in on how data meshes solve data silos while supporting greater autonomy and security organization-wide; how both AI model training and inference can be done distributedly with a data mesh; how data meshes can provide insights via distributed AI computation; how data meshes enable users to work with data products autonomously while facilitating a shift away from inefficient data pipelines, layers, and centralized data stores; and finally, how decentralization is a common thread running through data meshes, Web3, and the blockchain.
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Zhamak's LinkedIn and Twitter profiles, as well as my own social media profiles at superdatascience.com/609.
That's superdatascience.com/609. If you'd like to ask questions that future guests of the show like several audience members did of Zhamak during today's episode, then consider following me on LinkedIn or Twitter as that's where I post who upcoming guests are and ask you to provide your inquiries.
Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you.
And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another stellar episode for us today. For details of everyone on the team and their responsibilities on the show, you can visit jonkrohn.com/podcast. All right then. Until next time, keep on rocking it out there, folks. And I'm looking forward to enjoying another round of the SuperDataScience podcast with you very soon.