Podcasts SDS 669: Streaming, reactive, real-time machine learning

101 minutes
Business, Data Science, Machine Learning

SDS 669: Streaming, reactive, real-time machine learning

Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn

This week, discover the future of machine learning with Adrian Kosowski, Co-Founder and Chief Product Officer at Pathway. Adrian shares his expertise on streaming data processing and reactive data processing, and tells us how they’re revolutionizing real-time model training and data applications. It’s not one to be missed!

Thanks to our Sponsors:

Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.

About Adrian Kosowski

Adrian obtained his PhD in algorithms at the age of 20. He specializes in network science and modeling processes involving graphs, time, and all things random. He is currently a co-founder of Pathway (pathway.com) – a programming framework which takes care of data updates in data streams. Adrian spent a decade in academia, at Inria and Ecole Polytechnique in France. He took a strong interest in bio-inspired distributed systems, working on topics ranging from DNA computing to modeling ant behavior. His publication record includes two Best Paper talks at major ACM conferences. Adrian is also a co-founder of the competitive programming website spoj.com, which has been used by a million people to boost their programming skills.

Overview

In today’s world, data is everything. The ability to process, react, and make decisions quickly and in real-time, is the difference between success and failure in businesses. That’s where Dr. Adrian Kosowski and the power of streaming data processing comes in. He joins Jon to explain how streaming data processing is revolutionizing the way machine learning models are trained, making it possible to do so in real time and with incredible efficiency. The result is the ability for data applications to react instantly and automatically to never-before-seen input data, potentially saving firms vast sums of money and enhancing their competitive edge.

But technology isn’t the only vital component behind streaming data processing. Adrian also discusses the important role of a computer scientist and when they should step in as a product leader. Adrian also details why Pathway selected Python for its platform interface and Rust for high performance behind the scenes, revealing important information that startup leaders will surely be interested in hearing.

As you listen to Adrian discuss future opportunities for ML startups, it’s easy to see how this is the future of machine learning and the potential applications are endless. From financial fraud detection to the global supply chain network, the ability to handle and react to data in real time is changing the game. Wherever you are on in your data science journey, Adrian’s insights into streaming data processing are sure to leave you fascinated, inspired and leaving you wanting to learn more.

In this episode you will learn:

About Pathway’s reactive data processing framework [04:45]
Reactive data processing use cases [17:08]
What is the difference between batch and streaming processing [33:18]
Transformers in data engineering and data streaming [53:44]
The benefits of Adrian’s technical background as a CPO [1:04:17]
Adrian’s responsibilities and favorite tools as a CPO [1:15:25]
Emerging ML approaches and tools for startups [1:28:49]

Items mentioned in this podcast:

Posit
AWS Trainium
AWS Inferentia
Pathway
SDS 632: Liquid Neural Networks
Spoj
Rust
MapReduce
GroupBy
INRIA
Rust Documentation
Networks: An Introduction by Mark Newman
ODSC East 2023 | Open Data Science Conference
Jon’s Podcast Page

Follow Adrian:

Follow Jon:

Episode Transcript:

Download The Transcript

Podcast Transcript

Jon Krohn: 00:00:00

This is episode number 669 with Adrian Kosowski, Co-Founder and Chief Product Officer at Pathway. Today’s episode is brought to you by Posit, the open-source data science company, and by AWS Cloud Computing Services.

00:00:17

Welcome to the SuperDataScience podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple.

00:00:48

Welcome back to the SuperDataScience podcast. Today, the positively brilliant researcher and entrepreneur, Dr. Adrian Kosowski returns to the show to give us a taste of what the future of machine learning looks like. Adrian is Co-Founder and Chief Product Officer Pathway.com, a framework for real-time, reactive data processing that is based in Paris. He has over 15 years of research experience, including nine years at INRIA, a prestigious French computer science center, leading to the co-authorship of over 100 articles in a range of fields, theoretical computer science, physics, and biology, for example, and he’s covered topics in those papers, like network science, distributed algorithms, and complex systems. He previously co-founded and led business development for Spoj.com, a competitive programming platform used by millions of software developers, and he obtained his PhD in computer science at the ripe old age of 20.

00:01:38

Today’s episode will appeal primarily to hands-on practitioners like data scientists, machine learning engineers, and data engineers. However, we do our best to break down technical terms and provide concrete examples of topics so that anyone can enjoy learning about the cutting edge in training machine learning models. In this episode, Adrian details what streaming data processing is and why it’s superior in many ways, to the batch training of machine learning models that historically dominated data science. He talks about how streaming data processing allows highly efficient real-time model training, how reactive data processing enables data applications to react instantly and automatically to never before seen input data, potentially saving firms vast sums. He talks about when it makes sense for computer scientists to become a product leader like he did. He talks about why Pathway selected the particular programming languages they did for their platform, and the big up-and-coming opportunity for data and machine learning startups. All right, you ready for this mind-blowing episode? Let’s go.

00:02:40

Adrian, welcome back to the SuperDataScience podcast for your first full-length episode. You were here…we met at the Open Data Science Conference West in San Francisco, back in the northern hemisphere autumn. And you recorded an awesome episode on Liquid Neural Networks – that’s episode number 632. Fascinating technical topic. Adrian, where in the world are you calling in from today?

Adrian Kosowski: 00:03:09

So, I’m based in Paris. I’m calling in from just outside Paris, France, from a place which used to be the countryside, but is now meant to be the Silicon Valley of France.

Jon Krohn: 00:03:19

Oh, yeah. And it’s the Pathway Office, is that right?

Adrian Kosowski: 00:03:22

It is. It is.

Jon Krohn: 00:03:23

Nice. And, so you’re the co-founder. You’re a co-founder, and you’re the Chief Product Officer at Pathway, which is a reactive data processing framework that allows people to create real-time data products much more easily. So, I know that we’re going to get into a lot of what Pathway is, but before we even get into that, I want to let our listeners know that you very kindly offered, you’re offering 10 free hoodies to the first people that respond. So, when I, when we release this episode, it’ll be, it’s always on Tuesday mornings from a North American perspective, it’ll be the morning, and I post on LinkedIn from my personal account, a big post about what the episode’s gonna be all about. When I make that post, I’ll include in it, to say, the first 10 people that ask for a Pathway hoodie, get one, and you’re offering to ship them anywhere in the world. So, it’s very kind. Thank you, Adrian.

Adrian Kosowski: 00:04:20

It’s our pleasure entirely. These are really good hoodies. We hope you’ll be satisfied.

Jon Krohn: 00:04:24

Yeah, apparently they’re hoodies so good that you’ll want them, even if you’re in a very hot climate.

Adrian Kosowski: 00:04:31

That’s what they say. We also do software, but we do hoodies most of the time.

Jon Krohn: 00:04:39

Yeah, so when you guys aren’t designing hoodies tell us about the software that you make. So, you have a reactive data processing framework. Yeah. Tell us what that means and what can you do with it.

Adrian Kosowski: 00:04:49

Yeah, sure. So, reactivity is all about the art of dealing with changing data in such a way that you don’t have to worry too much about the processing part when data changes. I think if you want to be formal, there’s probably some dictionary or encyclopedic definition of reactivity which will tell you it’s about being declarative, declarative in a programming sense, like explaining the logic without imperatively saying what to do at every step of a data transformation, without explaining vocals in like a functional programming sense, explaining what the transformation should be. And that’s about the essence. So, it’s really combining the ability to be declarative with the ability to process data changes automatically in an efficient way. So, that’s a notion known as incrementality.

00:05:48

It’s the idea that when data changes, you don’t have to do a full recomputation of over models, of over things that you’ve designed in your data pipeline or in your data science project. You just do a minimally computation to react to the way data changes. So, I guess the most best known example of a reactive system out there is your spreadsheet, call it Excel sheets, whatever you prefer, software, you define the rules on the data, and when the data changes, the cells update. So, this is like one example. It’s a data processing example. It’s not one that scales very well, but it’s an example of data processing. And actually, since spreadsheets came in, I think nobody has been able to fully replicate the success of this type of approach at scale. And we come with, with our attempt. I should say that reactive reactivity is a concept that’s very well known, very familiar to frontend developers, if you’ve worked with JavaScript TypeScript.

Jon Krohn: 00:07:02

Yeah. It’s even probably the most famous framework right now for front end development is called React.js.

Adrian Kosowski: 00:07:09

It is, and the others that don’t have React in their name only active, nonetheless. All of them are. And like the, the kind of place this has got front-end developers to is that you when designing a front-end system you don’t have to do as much event handling as you would do 15 years back. So, some of you may remember having to write things like on-the-click events to describe the state change of a button, you know, when you click, you have to do it, and so on. And these days in front-end development, you don’t do it that much. Surprisingly in data processing, even data processing at scale, if you want to work in a real-time setup, you want to work with data that changes or with streams of event data, a lot of the time you still find yourself doing the equivalent of on-the-click-do or something like this.

00:08:06

The back-end equivalent is on-data-change event or on-arrival of a certain packet of data do. And this is something that has to be done behind the scenes, no question about it. It’s just that we don’t necessarily want the developer, be it the data engineer or a data scientist to be exposed to the pain of doing this type of on-something event processing after they’ve already had to put a lot of effort to design the system just to get their job done, to create a model or something.

Jon Krohn: 00:08:44

So, it seems relatively straightforward to me to understand in the case of a user interface, we have a browser app that is reactive to somebody adjusting how wide the browser is, or whether they come in on a mobile browser or a tablet or a desktop, that the website automatically adjusts, or as you’re saying, to behaviors to somebody clicking on something and the application reacting to that. In the case of data changing, what does that mean? Like how do the data change that are flowing into a machine learning system? Like, the machine learning system it could be handling different kinds of data types, or what does, yeah, what does it mean when the data changes?

Adrian Kosowski: 00:09:35

So, the most straightforward setup, I’d say, is when the data type does not change, you just have to deal with new data, the same type, just data that you haven’t seen before. If you’re a data scientist, like the ideal world is when the data sample that you’re working with is the actual data that will be, that has to be analyzed. When you’re in the online ever-changing world where new data comes in, this is never the case that the data sample that you look at, at the time of designing your model is for one, for which you need to do the insight. So, where, in some sense, where at least for the testing data, the real-world testing data, is not known to you. Sometimes even in real-world scenarios, it may be the case that the training data, so to speak, is not known to you. So, your model is retraining itself or adjusting to new incoming data.

Jon Krohn: 00:10:39

Online learning.

Adrian Kosowski: 00:10:40

Online learning and, and things like this. So, this is for the general setup. If you like diagrams, you can picture data inputs on the left, data outputs on the right, and your data pipeline in between, and whatever fresh events come in from input need to be taken into account. If you want to make life fun, fun in, in a architecture sense, you can also put a human in the loop, somebody who’s providing feedback on how your model is performing and saying, actually, we should tweak this parameter. For example, it’s like, you know, it’s, today we need to adjust because it’s a, we have a special day, and something like this. Some parameter for your forecasting prediction, anomaly detection model. Or the user can say, actually in the training data, the was a mistake, and we need to pull out the training data point and say, look, this should never have been obtained like this, or the label should have been changed. Or a certain class of automated inputs which entered the system may have entered with incorrect values. For example m, there was some confusion between m denoting meters and miles. Of data that input needs to be rescaled, and you have to sort of unlearn the data that came in previously and relearn with the new data. So, anything, anything is kind of possible in the sense of data changes for the system.

Jon Krohn: 00:12:17

So, in the past, on the podcast, we’ve talked about issues around things like feature drift, where, you design a machine learning model to be able to handle the kinds of training data that it’s encountered in the past, but then the real world changes. And so the inputs, the features that are coming into the model, so you know, you described a flow from left to right. So, on the left-hand side, in those data inputs, the inputs are fundamentally changing, the structure of the inputs is changing. So, even though, as you say, it’s the same data type, you know, it’s still a 16-bit float value, or it’s a, you know, it’s an integer, whatever, the features are now in a range that are outside of your training data because the world has changed. So, is what you are describing, this reactive data processing, it’s designed to allow your machine learning models during online learning to be able to adapt to this feature drift automatically.

Adrian Kosowski: 00:13:25

So, this is part of the story. I think reactive data processing should be treated very broadly and if you start implementing it in the larger system like data pipelines in enterprise, which are processing event data, the start of the story is in data engineering. The end of the story is in analytics. In order to benefit fully from this type of framework, from real-time data processing in general, it has to be put into place like end to end, or at least it helps to put it into place, end to end. And the, the models, the analytics models, the machine learning models that come in are kind of the cherry on the cake, but the one that allows it to [inaudible 00:14:13], a lot of value. So, we make sure to make this possible.

00:14:17

This is just to say that in a strictly machine learning context this would be a very, let’s say, a good application and at the same time, an ambitious one when you get to models which are sufficiently advanced to be very much aware of problems like feature drift versus, let’s say, an intermediate class in somewhere in between engineering and more advanced machine learnings of models which have a certain time horizon, a time window in which they learn, and they are updated as this time window moves ahead. I’d say this is like first more natural example in the real world project that you’ll be looking at for last few months of data, some kind of moving average on the data and trying to adapt to it. So, indeed it may be the case that we get into questions of models getting outdated and needing updating, model versioning, and so on. But this is some heavy machinery which comes in relatively late in the project. A lot of the time it’s actually possible to just adapt to the structure of the data itself by having a model which knows how to adapt to the structure of the data, which has this capacity to encompass horizons to be somehow scale-free with respect to the nature of the data.

Jon Krohn: 00:15:54

Nice. Every company wants to become more data-driven, especially with languages like R and Python. Unfortunately, traditional data science training is broken. The material is generic. You’re learning in isolation. You never end up applying anything you’ve learned. Posit Academy fixes this with collaborative, expert-led training that’s actually relevant to your job. Do you work in finance? Learn R and Python within the context of investment analysis. Are you a biostatistician? Learn while working through clinical analysis projects. Posit Academy is the ultimate learning experience for professional teams in any industry that want to learn R and Python for data science. 94% of learners are still coding 6 months later. Learn more at Posit.co/Academy.

00:16:40

So, let’s try to make this example a bit more concrete by going into let a specific use case. So, you and I were planning on later in the episode talking about global supply chain networks and how the pandemic broke these, and how reactive data processing combined with IoT (Internet of Things) hardware could make, help make rather global supply chains more resilient to abrupt delays and shocks. So, maybe let’s dig into that specific use case now, so that as we kind of address other questions around how reactive data processing works, we can like tie into the specific concrete example.

Adrian Kosowski: 00:17:18

Yeah. So, actually, we started Pathway working closely with actors in the logistics industry, working to improve global supply chains and global transportation patterns. Logistics is a pretty fascinating area because if you look at the importance for value in the scale of the industry, it’s about something like 10% of a world economy. It’s, so it’s really big. It’s highly concentrated, and a lot of a value is in international trade, trade that goes on containers, tucks, large vehicles. And it’s in some sense from a data processing perspective, when we were starting, this was largely terra incognita. It was of a new world of analyzing this type of data patterns related to logistics assets.