SDS 834: In Case You Missed It in October 2024

Podcast Guest: Jon Krohn

November 8, 2024

Jon Krohn heads into November with his round-up of favorite clips from October. Hear from Bradley Voytek, Natalie Monbiot, Luca Antiga, Chad Sanderson, and Ritchie Vink in conversations about the ongoing potential of AI.

 
In this month’s “In Case You Missed It”, we speak to some of the world’s most exciting thinkers in data science and AI. We start the show with a clip from Bradley Voytek, Associate Professor of Cognitive Science at UC San Diego, who discusses his interdisciplinary project mapping neuroscience research onto an image of the brain, a tool he calls the “brain viewer”. You’ll also hear from Hour One’s Natalie Monbiot, who talks to Jon about the great potential of AI to create digital representatives that we can send into the online world to spread our professional messages even further than we previously thought possible.
Jon continues his October excursion into GenAI with episode 831, where Luca Antiga, CTO at Lightning, muses on what excites him about generative AI’s prospects, in particular the possibility of automation reaching a point of minimal human input. When speaking with Chad Sanderson in the next clip, Jon finds out more about a future, far safer world of working with data. How can we establish trust between all the stakeholders working with a dataset? Expect contracts! And finally, Ritchie Vink details the benefits of the open-source library Polars.

Podcast Transcript

Jon Krohn: 00:05

This is episode number 834, our “In Case You Missed It in October” episode. 
00:27
Welcome back to the SuperDataScience Podcast. I’m your host, Jon Krohn. This is an “In Case You Missed It” episode that highlights the best parts of conversations we had on the show over the past month. 
00:40
My first clip is from episode 829 with Dr. Bradley Voytek. Brad is Associate Professor in Cognitive Science at UC San Diego. I asked him how data science facilitates breakthroughs in our understanding of the brain. 
00:55
Data science, particularly things like LLMs, we touched on this a little bit, will, I think, be able to accelerate discoveries in a lot of different fields, including neuroscience. I mean, do you agree? Do you think that there are emerging data science technologies or methodologies that could accelerate our understanding of the brain in the coming 10 years? 
Bradley Voytek: 01:16
Oh, for sure. I mean, it’s almost a given that it has to and will, right? It’s like saying, “Do you think calculators will accelerate science?” Yes. Do you think search engines are going to… I can’t even imagine running a research lab without search engines. Just the rate at which I can quickly and easily discover information has a… Just a huge impact on the way that everybody does science. So, Google is probably one of those transformative aspects of science in the last 100 years. It’s significantly shaped the way that we are able to find and retrieve information that allows us to then continue to build science and do research better, more accurately, and faster. And so, I think LLMs are going to be something similar, right? Yes, there are so many problems with the current iteration of LLMs hallucinating and things like this, right? 
02:15
But they do … you can see the glimmer of where the future will be. And so, just to give a concrete example, when I was doing my PhD, I was looking specifically at the effect of very focal brain lesions in the prefrontal cortex or the basal ganglia, two interconnected structures in the brain that are known to be involved in higher-level cognition. If somebody has a stroke that damages one of these brain regions, what impact does that have on their memory functions? That’s what I spent my PhD doing. At the start of my PhD 20 years ago, in 2004, in my naivete, I believed that there must be some kind of website that I could go to where I could click on the prefrontal cortex on the brain, like an image of a brain, and get a listing of what all the inputs and outputs to that brain region are. 
03:02
That didn’t exist, and it still doesn’t exist. Just very frustrating. And that ultimately led to a project years later, at the end of my PhD, that my wife and I published together. This issue frustrated me and stuck with me for so long because, instead of having an easily discoverable mapping of the inputs and outputs of these different brain regions, I had to go into the archives at UC Berkeley, where I did my PhD, and dig through peer-reviewed papers published in the 1970s, where they did all these anatomical tracing studies, to try and figure out what the inputs and outputs were to these brain regions. I was at a conference on a panel at Stanford in 2010 or so, with quite a number of names that your listeners will probably be familiar with, senior, eminent people in AI and neuroscience. 
03:59
And somebody asked a question on the panel, and I answered by saying, “The peer-reviewed neuroscience literature probably knows the brain.” There are something like 3 million peer-reviewed neuroscience papers that have been published and are indexed in PubMed, which is the National Library of Medicine’s (NIH) database of peer-reviewed biomedical research. If we could tap into all of that knowledge, we probably would be 50% farther along in neuroscience, but we as humans are limited in how much we can read and synthesize. And one of the faculty members, who is a sort of giant of the field, basically said, “That’s really dumb.” And I was like, “I’m pretty sure I’m right about this.” And so, back in 2010, my wife and I did a proto-NLP project, where I just did… Well, I should say she wrote the Python code to scrape all of the text out of the abstracts of all of these papers to just look at co-occurrences of words and phrases, with the hypothesis being that the more frequently two ideas were discussed in the peer-reviewed literature, the more likely they are to be related.
05:07
So very simplistically, if a paper is written about Alzheimer’s disease, it tends to also talk about memory, because Alzheimer’s disease has a significant impact on memory. It will also mention things like tauopathies, which are one of the mechanisms by which we think Alzheimer’s disease manifests. But papers that talk about Alzheimer’s disease are less likely to talk about bradykinesia, which is slowness of movement and is much more commonly observed in Parkinson’s disease. And so, by looking at the word frequencies and co-occurrences, very simplistic NLP, proto-NLP, we built a knowledge graph of neuroscience. 
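To make the proto-NLP approach Brad describes a little more concrete, here is a minimal sketch of counting term co-occurrences across abstracts. The abstracts and the term list below are invented for illustration; the original project ran over millions of PubMed abstracts.

```python
from collections import Counter
from itertools import combinations

# Toy abstracts standing in for millions of PubMed entries (illustrative only).
abstracts = [
    "Alzheimer's disease impairs memory and is associated with tauopathies.",
    "Parkinson's disease involves dopamine loss in the substantia nigra and bradykinesia.",
    "Memory deficits in Alzheimer's disease correlate with hippocampal atrophy.",
]

# A small, made-up vocabulary of neuroscience terms to look for.
terms = ["alzheimer", "parkinson", "memory", "dopamine",
         "bradykinesia", "substantia nigra", "hippocamp", "tauopath"]

cooccurrence = Counter()
for text in abstracts:
    text = text.lower()
    present = [t for t in terms if t in text]       # terms appearing in this abstract
    for a, b in combinations(sorted(present), 2):   # count each unordered pair once
        cooccurrence[(a, b)] += 1

# The hypothesis: pairs that co-occur most often are most likely to be related.
for (a, b), count in cooccurrence.most_common(5):
    print(f"{a} <-> {b}: {count}")
```

Clustering a graph built from weights like these is what surfaced groupings such as Parkinson’s disease, dopamine, and the substantia nigra in the published paper.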
05:49
And so, this was a paper that we published in 2012. We did the project in 2009, 2010, and it was a pain in the ass to publish because of the peer reviewers. We built this knowledge graph, and then we could find clusters in the graph, and we went to publish this paper, and we’re like, “Hey, look, from natural-language, free-form, peer-reviewed text we can naturally discover clusters of topics that are interrelated, like Parkinson’s disease being highly clustered with dopamine, the neurotransmitter, and neurons in the substantia nigra, which are the dopamine neurons that die off in Parkinson’s disease and give rise to motor tremors and bradykinesia.” 
06:26
And this is naturally discovered just through text co-occurrences. And the peer reviewers said something like, “Yeah, we know these things.” And it was like, “Yeah, I know that you, as an expert who’s read Principles of Neural Science and has been studying neuroscience for 20 years, know this,” but now math knows. Isn’t that amazing? And back in 2010, people weren’t really buying it. And now, I think we’re in an era where we can do that same thing. In my lab, we’re trying to build this right now, actually, that same thing, but two orders of magnitude more sophisticated. So, we actually are building that site right now where you can go click on the prefrontal cortex or whatever brain region, and it is built on everything we know about the brain from publicly available data sets of human brain imaging. So the Allen Brain Institute has a database of gene expression in the human brain.
07:20
There are about 20,000 or so different genes that are differentially expressed across the human brain. So we pull that data set in, and then there’s another data set of neurotransmitter densities based on positron emission tomography, and we pull that data set in and pull this data set in. And this has already been done by collaborators up at McGill University; Bratislav Misic is the lab head there. And they created an open-source Python package called Neuromaps, where I think Ross Markello and Justine Hansen are the first authors on the Neuromaps paper published a couple of years ago, where they did all the legwork of actually going out and pulling in all these publicly available data sets. And so, what we’re doing right now in the lab is building a brain viewer that collates all these different data sets in the browser. So you can click on an arbitrary brain region and get a listing of everything we know about that part of the brain. 
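As a rough illustration of the kind of collation Brad describes, not the actual Neuromaps API or his lab’s code, the sketch below joins two toy per-region tables, one standing in for gene expression and one for neurotransmitter density, into a single lookup keyed by brain region.

```python
import pandas as pd

# Made-up values standing in for real annotations such as Allen Institute
# gene expression and PET-derived neurotransmitter densities.
gene_expression = pd.DataFrame({
    "region": ["prefrontal cortex", "hippocampus", "basal ganglia"],
    "BDNF_expression": [0.82, 0.95, 0.41],
})
receptor_density = pd.DataFrame({
    "region": ["prefrontal cortex", "hippocampus", "basal ganglia"],
    "D2_receptor_density": [0.30, 0.22, 0.88],
})

# Collate the per-region annotations into one table keyed by region, so a
# viewer can answer "what do we know about this region?" with a single lookup.
brain_atlas = gene_expression.merge(receptor_density, on="region").set_index("region")
print(brain_atlas.loc["hippocampus"])
```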
08:08
And the next step that we’re trying to build on top of that with an industry collaborator who cannot be named yet because it’s not formalized, but alongside that browser is an LLM chat window where you can then say, “Show me the hippocampus,” and then the LLM will pop up, then illustrate on the screen in this sort of dynamic brain viewer where the hippocampus is, and then you can say, “Give me a listing of the top 10 genes that are most strongly expressed in the hippocampus uniquely compared to other brain regions.” Then you can ask, “What are the primary inputs and outputs?” And it’ll show the primary inputs and outputs.
08:41
So we’re trying to build a brain discovery engine that is LLM powered, that is trained on these peer-reviewed papers and these open data sets so that you can do better neuroscience discovery so that we’re sort of dissolving the boundaries between, like I said at the beginning of the podcast, the neurogeneticists who don’t know anything about theoretical neuroscience, who don’t know anything about neuroanatomy, and trying to dissolve all those boundaries to bring all these different data sets together in one easy-to-digest platform. So that’s honestly still probably a couple of years away, but we’re prototyping it right now.
Jon Krohn: 09:17
It was so interesting to hear from Brad about how technology might help us gather and streamline knowledge about the brain. I’m excited to check out his lab’s “brain viewer” once it’s ready. Our next clip is from episode 823. The eloquent Natalie Monbiot, who is Head of Strategy at the GenAI studio Hour One, talked to me about building digital avatars of ourselves to help “scale up” our public-facing work online. 
09:44
You’re the head of strategy at Hour One, which pioneered generating lifelike video avatars, or AI clones of humans; you’ve called them virtual humans or virtual twins, and promoted a virtual human economy, which we’ll get to momentarily, or maybe that will even tie into your answer right now. So tell us what value this technology does provide. What are the great use cases for virtual humans, for these virtual presences, and how does that create a virtual human economy? 
Natalie Monbiot: 10:15
So first of all, virtual humans should not replace real humans. And I think the whole preamble to this question that we just got passionate about suggests that. Virtual humans and the use of, let’s just call them AI avatars in content, have had success and should continue to be deployed in areas where humans don’t have any business being. And that is to say, “Okay, let’s start with where we found product market fit as a category.” 
Jon Krohn: 10:43
Nuclear cleanup sites. 
Natalie Monbiot: 10:46
Yes, or learning and development within enterprise organizations, where people are just so bored with the content, the kind of content that you have to consume. You have to hit these kinds of quotas. People need to learn everything from safety hazards to compliance and all this kind of stuff. It needs to be done. There isn’t a lot of budget assigned to this type of content. It isn’t profit generating, it’s boring, and it usually exists as a PDF, right? So this has been a ripe place for AI avatars and generative video content to play. So what you do is you can literally take PDFs and transform them into engaging, presenter-led videos through an assortment of AI avatars that you can select from the platform, with different templates that actually make it look like you’ve invested a lot in video editing, and you can instantly upgrade your content. 
11:50
So that’s one very basic area, not necessarily the sexiest, but one where there has been massive product-market fit over the last few years. And then I think the next place is, as avatars have become more commonplace, or at least as AI has become more integrated and accepted in society and culturally, and since the ChatGPT moment, I think we’ve seen more outward use cases of this technology. So for example, one of our customers, Reckitt, a pharmaceutical brand, uses AI avatars within their Amazon listings to explain baby formula products. So again, this is a place where you wouldn’t have a human being presenting the small print of these products, but the small print of these products is not easy to consume, and it’s important, and young parents need to know this information. So this small print has been transformed into these friendly, engaging, AI-avatar-led videos that explain products in a way that is a lot more digestible. 
Jon Krohn: 13:03
Now, can this be done today? I know it could be, and based on the podcast-style interview format you were describing as possible or in development, I know that this is possible. But is it done today, in that kind of example where an Amazon shopping listing is being explained, can the Amazon shopper ask questions and get a response at this time?
Natalie Monbiot: 13:28
At this time, within the Amazon use case, the way that you could do that is through having … The listing will support video, so static images and video. So within that setting you could have a series of videos that address different questions. That said, outside of that particular use case, yes, I think you can have conversations with AI avatars. Today, a real-time live conversation is not going to have the same visual quality as a video avatar that is pre-rendered, which takes a couple of minutes to render. There are some trade-offs, but we are getting to a point where it will all come together, where you can have a realistic real-time conversation that feels immersive, that feels lifelike. So that’s coming. But today, I wouldn’t say that all of those components come together for a great experience. 
Jon Krohn: 14:24
We’ve talked about the L&D training, you talked about explaining shopping listings. So yeah, now I think you were about to go into a broader virtual human economy. 
Natalie Monbiot: 14:34
Yeah. So currently, and through the Reid AI moment, we’ve seen a thought leader really use this technology to fulfill his vision and what he’s trying to do, which is to get his points of view out there in ways that resonate with people. And so he’s been playing with the medium of having an AI twin to help him with his thought leadership. He also translated a commencement speech into a dozen different languages so that he could reach people in different countries who he could normally not communicate with. That moment taught us a lot of things, and it continues to in this partnership, but what we’ve seen is that people who have IP in their image, their likeness, their ideas, who they are, have a lot to gain through this technology. And also people are getting used to cloning themselves. 
15:36
This is kind of a bit of a pivot. It’s always been the vision that, let’s say, everybody with a LinkedIn profile would have an AI avatar that could communicate on their behalf, help them to be more productive, help them to augment their skills and all of that. And I think that we’re hitting that pivotal moment thanks to capabilities: more realistic AI clones that people respond really well to, the fact that people are more receptive to AI as a communications medium in general, and then also now the fact that thought leaders are actually seeing the benefit of using this technology to fulfill their mission in terms of the brand that they’re trying to build and that kind of thing. So this touches on the virtual human economy, in that you can start to use your virtual human, your virtual twin, to help advance whatever it is that you are trying to do. 
16:33
And so in some cases, when we’re talking about people of note who are using their virtual twin to just scale and augment content, that’s one thing, but you can start to create products with that. We’ve had thought leaders make money out of their AI avatar having a job with a different platform. So for example, we had a futurist called Ian Beecroft whose AI twin became the AI correspondent for a news platform called Defiance Media, which is a hundred percent digital, AI-first, and uses AI avatars for presenting. And so that was just a new type of deal, and the ability to scale yourself and then literally make money out of your AI twin is something that is just really fascinating. And so that’s an example of what I call the virtual human economy, in which we can create our AI clones, our AI selves, and put them to work on our behalf in myriad ways. 
17:37
And sure, when I think about it through the lens of Hour One, these are our physical AI avatars, our digital representations, but equally, it can be your body of work, your books, the way that you think, your expertise, because you can see how this works for entertainment and A-listers who already trade on these assets of who they are. But then for white-collar workers, how does that work? Well, I think what we’re going to see is people being able to clone their expertise and make that expertise available at a lower cost than would ordinarily be possible if you needed their time. 
18:27
And it also opens up that type of expertise to people who couldn’t necessarily afford it. Or just imagine, and this is what I’m hoping for: you need a contract to be reviewed, but you don’t necessarily want to pay thousands of dollars and spend weeks trying to make that happen. The idea that you can license access to that expertise just as you need it, it’s interesting. And it can also become a lead generator for those experts. So you like what you saw, you like that little taste of my expertise, so maybe there’s a more involved project and then we’ll engage in person.
Jon Krohn: 19:02
We continue the thread of GenAI’s commercial applications with Dr. Luca Antiga. In a clip taken from episode 831, Luca, the Chief Technology Officer at Lightning, explains where he sees generative AI being most useful in our professional work. 
19:20
Looking a bit towards the future, generative AI obviously is transforming how software developers work, how data scientists work. What do you think are the kinds of the new skills and knowledge bases that developers, data scientists need to stay relevant in this generative AI world?
Luca Antiga: 19:40
Yeah, that’s interesting, because I hear a lot of people who complement themselves with language models, and I do as well, in maybe not small doses, it depends on what I do. Honestly, I don’t think we’ve cracked the recipe for working alongside AI for coding yet. There are a few very notable examples. But at the same time, for the more mundane tasks, it’s great. Also for complicated things, it can be great. But you need to find your dimension as a… You need to use it as a wall to bounce ideas off, back and forth. That’s where I get the most out of it. It helps me. When I have an idea, even a methodological idea, maybe even with some math theory behind it, where I would sometimes look for papers on Google, that process of getting in the vicinity of where you want to be is greatly facilitated by a language model, a powerful one, if you know how to bounce ideas back and forth. And I think that is the skill you need to develop.
21:06
Also, I sometimes develop on the back end and so on, and I see a lot of things that could be so much easier if only I could delegate them to someone else. And that someone else can be a language model, because those tasks are extremely predictable and repetitive. And yes, you can use libraries that abstract things away, but then what happens when something goes wrong? You’ve got many layers to peel off, and maybe it’s just easier to keep things simple from a library perspective and have AI fill the gap between your willingness to spend time at that level and the task you need to accomplish. 
21:50
And I do think that in the future, you will just write a function or call a function, and that function doesn’t exist until you call it, or it will be cached in many ways. But for sure, you will have to weave everything from the top to the bottom. There will be some layers that may be delegated, again, in that realm of repetitiveness and so on. I don’t think we’re at the point where we can say that software development will be rendered useless. I think it’s more like we’re nearing the point where personal automation is closer. And then what software development will become when you have personal automation, then it’ll evolve naturally. 
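Luca’s idea of a function that doesn’t exist until you call it can be sketched as a decorator that asks a code generator for an implementation on first call and caches it afterwards. Everything here is hypothetical: ask_llm_for_code is a stand-in for whatever model call you would actually use, and it returns a canned answer so the example runs on its own.

```python
import functools

def ask_llm_for_code(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call; returns a canned implementation.
    return "def impl(numbers):\n    return sorted(set(numbers))"

def materialize_on_first_call(description: str):
    """Generate the function body the first time it is needed, then reuse (cache) it."""
    def decorator(stub):
        cache = {}
        @functools.wraps(stub)
        def wrapper(*args, **kwargs):
            if "impl" not in cache:
                namespace = {}
                exec(ask_llm_for_code(description), namespace)  # generate the body once
                cache["impl"] = namespace["impl"]
            return cache["impl"](*args, **kwargs)
        return wrapper
    return decorator

@materialize_on_first_call("Return the sorted unique values of a list of numbers.")
def unique_sorted(numbers):
    ...  # intentionally empty; the body is materialized on first call

print(unique_sorted([3, 1, 3, 2]))  # [1, 2, 3]
```

The point is the shape of the workflow, describe the task, generate once, cache and reuse, rather than any particular model or API.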
22:46
Of course, if you want to be in your comfort zone of being a super expert at a language and knowing the ins and outs, that’s great. It’s probably not the way things will evolve in the future. There will still be a need for that, but just in smaller numbers. But something I already brought up, I think at some other time, in another video: when I first interacted with ChatGPT, it reminded me of something, something crystallized in front of my old-person mind, which is HyperCard. 
Jon Krohn: 23:31
HyperCard? 
Luca Antiga: 23:33
Yeah, do you remember HyperCard? 
Jon Krohn: 23:35
No. HyperCard? 
Luca Antiga: 23:38
Yeah. So HyperCard is something that, coincidentally, was created by a person under the influence of a hallucinogenic substance. But it was actually a product by Apple that shipped with classic Mac OS. I don’t know if it was software on top of it or shipped with it, I can’t remember. But it was an attempt to bring automation and programming and building systems to people who were not programmers. It had HyperTalk, which was a scripting language that resembled English from a syntax perspective. The problem with that is that it’s not the syntax; the problem is that if you need to write a for loop, it needs to be crafted in such a way that you understand what a for loop does, and you keep track of variables and blah, blah, blah. So yes, you can write it with a semicolon or [inaudible] or in English, but it doesn’t matter in the end, right? 
24:39
So it was trying to solve that problem from an angle of syntax and accessibility, but it didn’t really solve the problem throughout, which is, “How can I express what I want in natural language?” Of course, technology back then didn’t allow you to express what you want in natural language, because that was relegated to science fiction movies, and now we’re there. But the whole purpose of that thing was: can I allow someone who doesn’t have a background in computer science to write their own tech, to have their own tech materialize in front of them, because they need to solve a very specific problem and they want to solve it in the way that fits them, their immediate need? 
25:25
And so I think we’re at the point where the technology is ready to get there. And I think that’s what makes me the most excited: the ability to bring automation, and the ability to express algorithms with an intention rather than having to walk the little robot through every step of the way. So yeah, that I think is what the future of development might be. In a way it will get more accessible, so that I can just express what I want. And it’s also already partially true, but just very partially, with agents and so on. And then there will be another set of people who will just dive deep into whatever and use language models to empower them to think faster and to get results faster, and that’s inevitable. 
Jon Krohn: 26:41
Nice. Yeah. So basically to kind of summarize back to you what you’re describing is that with generative AI, already today we have some of these kinds of personal automations where you can be delegating some software development or data science tasks and over time that will become more reliable, more expansive. But for the foreseeable future at least, the role of software developer, the role of data scientist won’t disappear. It’s just that there will be more and more automations that you can spin up easily. So it provides more accessibility. It means that you don’t necessarily need to be expert in all of the different programming languages that you are developing in, say.
27:24
And so maybe the kinds of skills that become more important in that kind of environment are the kinds of collaborative skills with the team to understand what the product needs, what the business needs, creativity to be coming up with solutions that will really move the needle for some product or organization, and simultaneously principles around architecture and having systems work well. So it moves you up the stack where you don’t need to be worried so much about the low-level coding as a software developer or data scientist, more you’re thinking at a higher level and maybe sitting a little bit closer to product.
Luca Antiga: 28:03
Yeah, that’s true. I agree with that. Although it’s a bit like Blockly, right? In the sense that you need to be low-level to understand exactly what you want from a system. So we’re not at a level where an automated system will be able to figure out the architecture of something for you. But it will help you iterate much faster at getting that architecture out the door. And I don’t think there is any system right now that is ready to just replace the whole full stack. You still need to be full stack, but it’s actually easier to be full stack right now, because some of the things you just had to know in order to have an acceptable speed, you don’t need to know them all to have an acceptable speed, right? 
Jon Krohn: 28:52
“Accessibility” also concerns Chad Sanderson’s work on data contracts, as mentioned in episode 825. When we work on projects that involve data, as data science practitioners, we always need to think about how other users might come to interpret our data. This is why Chad finds data contracts so important that he is writing a book about them. 
29:14
You are the CEO of Gable, which is a data contracts platform, and you’re writing The Definitive Guide to Data Contracts with O’Reilly, probably the most prestigious technical publisher that you can be writing with for our space. So tell us about data contracts. Your book introduces them as a solution to the persistent data quality and data governance issues that organizations face, but candidly, it’s not something that I had heard much about. When I first saw that that’s what you were expert in, I was thinking about Web3 or the blockchain. It somehow sounded like that kind of contract to me, but I don’t think it has anything to do with that.
Chad Sanderson: 29:54
That’s right. So one of the big problems that has manifested itself in the last 10 or 15 years or so, really since the cloud took over as the primary place where companies store massive amounts of data, is that back in the old days, you used to have a producer of data and a consumer of data that were very tightly connected to each other, and more of a centralized team that was thinking about the data architecture and which data is actually accessible and could be used by a data scientist or a data engineer or an analyst, and they put a lot of time and effort into constructing a highly usable, highly semantically representative data model. 
30:36
But now, thanks to the internet and thanks to the cloud, you’ve got so much data flowing in from everywhere, from tens or hundreds of different sources, and when things change, it causes lots of problems for anyone who’s downstream of that data, for models, for reports, for dashboards, and things like that. So the data contract is starting to adopt a lot of the same terminology and technology as software engineers use with APIs, which are effectively service contracts. It’s an engineer saying, “Hey, this is what my application produces. You can expect this not to change. Here are some SLAs around that service.” And you can trust that there’s always going to be a certain level of latency and uptime, and we’re taking that approach and applying it to the data as well.
Jon Krohn: 31:21
So it is similar to the idea in software engineering of, what is the term in software engineering? It’s like a service contract? 
Chad Sanderson: 31:28
Service contract, yeah.
Jon Krohn: 31:30
And so you’re taking those kinds of ideas from software engineering and applying them to the data space? 
Chad Sanderson: 31:38
Yeah, exactly. Data is obviously very different from applications. You need to think about the number of records that are being emitted at any particular point in time. If a team always expects there to be a thousand events in an hour and in one particular hour, it’s one event or two events, that’s definitely a big problem. The schema matters a lot. If you suddenly drop a column or add a new column that’s an incremental version of a previous column, it’s a really big deal. If you change the semantic meaning of the data, this is obviously another really huge deal. If I’ve got a column called distance and I as the producer have defined it to mean kilometers, but then I change it to miles, that’s going to cause an issue. So the same sort of binding agreements that APIs have, sort of the explicit definitions of expectations coming from a producer, we’re starting to apply that to the data producers and not just the software engineers on the application.
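The three kinds of expectations Chad lists, record volume, schema, and semantic meaning such as units, can be sketched as simple checks run against each batch of data. The field names and thresholds below are made up for illustration.

```python
# Hypothetical expectations a data contract might encode for one dataset.
EXPECTED_COLUMNS = {"order_id": int, "distance": float}
EXPECTED_DISTANCE_UNIT = "kilometers"
MIN_EVENTS_PER_HOUR = 500

def check_batch(rows, declared_unit):
    """Return a list of contract violations for one hourly batch of records."""
    violations = []

    # 1. Volume: a team expecting ~1,000 events per hour should be alerted at 1 or 2.
    if len(rows) < MIN_EVENTS_PER_HOUR:
        violations.append(f"volume: only {len(rows)} events this hour")

    # 2. Schema: dropped or retyped columns break downstream consumers.
    for row in rows:
        for column, expected_type in EXPECTED_COLUMNS.items():
            if column not in row:
                violations.append(f"schema: missing column {column}")
            elif not isinstance(row[column], expected_type):
                violations.append(f"schema: {column} is not {expected_type.__name__}")

    # 3. Semantics: switching kilometers to miles is still valid data, but it breaks meaning.
    if declared_unit != EXPECTED_DISTANCE_UNIT:
        violations.append(f"semantics: distance declared in {declared_unit}")

    return violations

print(check_batch([{"order_id": 1, "distance": 12.5}], declared_unit="miles"))
```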
Jon Krohn: 32:35
Very cool. Sounds really valuable. In chapter two of your forthcoming book, you discuss how data quality isn’t about having pristine data, but rather about understanding the trade-offs in operationalizing data at various levels of correctness. So how can organizations strike a balance between data quality and the speed of data delivery? 
Chad Sanderson: 32:59
That’s actually a great question. So my definition of data quality is, I think, a bit different from other people’s. In the software world, folks think about quality as very deterministic. So I am writing a feature, I’m building an application, I have a set of requirements for that application, and if the software no longer meets those requirements, that’s what we call a bug. It’s a quality issue. But in the data space, you might have a producer of data that is emitting data or collecting data in some way who makes a change, which is totally sensible for their use case.
33:35
So as an example, maybe I have a column called timestamp that’s currently being recorded in local time, and I as the engineer decide to change that to UTC format. Totally fine, it makes complete sense. It’s probably exactly what you should do, but if there’s someone downstream of me expecting local time, they’re going to experience a data quality issue. So my perspective is that data quality is actually a result of mismanaged expectations between data producers and data consumers, and that’s sort of the function of a data contract: to help these two sides actually collaborate better with each other, to work better with each other, and not so much to prevent changes from happening. 
Jon Krohn: 34:15
So when you talk about data producers and data consumers like you just did there, is that typically referring to roles internal to an organization, or I guess it could equally apply to an external-facing API? 
Chad Sanderson: 34:27
Exactly, so a producer is really anyone who is making a unique transformation of the data in some way, which could mean the creation of the data itself. That might be an internal software engineer who is creating an event that’s emitted from a front end, like a user clicking a button in a web app. It could be a DBA who owns a database. It could be a data engineer who’s aggregating all of that data together and creating bronze, silver, and gold data models. It could be a data scientist who aggregates all of this into a training set that ultimately another data scientist in the company ends up using. It could be a tool like Salesforce for a CRM or SAP for an ERP, or it could be someone outside the company altogether, like another company providing an API or an FTP-style data dump or something like that. The problems are the same regardless. 
Jon Krohn: 35:24
Can you break down for us, as we’ve now been talking about data contracts and I get the utility, what they look like? How is one formatted? How do you share it, and how does somebody receive it? How do they read it?
Chad Sanderson: 35:42
Yeah, so this is where data contracts are a little bit different from service contracts, where you have something like the OpenAPI standard. In the data contract world, it’s more about having a consistent abstraction and then being able to enforce or monitor that abstraction in the different technologies where data is created or moved to. So I prefer using something like YAML or JSON to describe my contracts, and a contract has various components within it. So you might lay out the schema, the owner of the data, the SLAs, the actual data asset that is being defined or referenced by the contract, any data quality rules, PII rules, and so on and so forth, and then the goal is to translate all of those constraints into monitors and checks against the data itself as it’s flowing between systems, or potentially even before that data has been produced or deployed in some way. But I’ve seen teams that have rolled out data contracts as [inaudible 00:10:21] pages or as Excel spreadsheets. Really, anything that allows a producer to take ownership of a data asset I think works as a first step towards data contracts.
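As a concrete sketch of the YAML-style contract Chad describes, here is what one might look like, with the field names and values invented for illustration rather than drawn from any formal standard, followed by a few lines of Python (using PyYAML) that parse it so each declared constraint could be turned into a monitor or check.

```python
import yaml  # PyYAML

# A sketch of a data contract; field names and values are illustrative only.
CONTRACT = yaml.safe_load("""
name: customer_orders
owner: checkout-team@example.com
schema:
  order_id: {type: integer, required: true}
  distance: {type: float, unit: kilometers}
  created_at: {type: timestamp, timezone: UTC}
slas:
  freshness_minutes: 60
  min_rows_per_hour: 500
quality_rules:
  - distance >= 0
pii:
  - column: customer_email
    handling: hashed
""")

# Each declared constraint would become a monitor or check on the data as it moves between systems.
print("Owner:", CONTRACT["owner"])
print("Required columns:", [c for c, spec in CONTRACT["schema"].items() if spec.get("required")])
print("Freshness SLA (minutes):", CONTRACT["slas"]["freshness_minutes"])
```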
Jon Krohn: 36:57
Awesome. Yeah, crystal clear. Let’s talk about trustworthiness around data. So we’ve talked now about data correctness, which relates to trustworthiness, and so you’ve argued that the value of data hinges on its trustworthiness. So how do data contracts help establish trust between data producers and consumers? And what role do data contracts play in rebuilding trust if it’s been lost? 
Chad Sanderson: 37:24
So I think trust comes down to a couple components. One component of trust is understanding, and the second component of trust is meeting a consistent expectation. And when I say understanding, what I’m referring to there is I am more willing to trust a data source or a data set if I understand what it actually represents. When a table is called customer orders, does that mean customer orders that were placed through our website or through our application or through both or through our customer service line? Does it just refer to a certain type of customer or a certain type of order? So the more information I have about that data asset, the more that I can actually trust it, and then the second part of trust is the expectation setting. So what is going to happen to that data set over time? Is it going to be changing every month? 
38:20
Am I going to know when it changes? Will I know the context of the change so that I can adjust my training data or my query? I think the same is actually true in real life, right? If someone says to you, “Hey, Jon, I’m going to be coming over to your house later, but I might be 30 to 45 minutes late because of traffic.” You’ll respond very differently than if someone is just 45 minutes late and they don’t tell you, they just show up. So I think this is where trust comes from and the data contract is really all about setting the expectation and also helping people understand what the data actually means and how they should use it. 
Jon Krohn: 38:55
My final favorite snippet from this excellent month of interviews comes from episode 827 with Ritchie Vink. Ritchie is CEO and Co-Founder of Polars, Inc., and he made the perfect guest for answering all our listeners’ burning questions about working with the popular Polars library for DataFrame operations. We had a great chat about the open-source Python library’s incredible specs and what users can expect from it.
Jon Krohn: 39:24
So, Ritchie, what is the secret sauce? Or I guess it’s not so secret because you have blogged about it. What is the not-so-secret sauce behind Polars being so much faster and more memory efficient relative to the incumbent out there?
Ritchie Vink: 39:41
For relational data processing, DataFrame-like data processing, there are a few things you can do. It’s actually pretty old, relational data processing. It’s what databases have done for decades, and it’s what SQLite does, and it’s what ClickHouse and Snowflake do. So, all these databases exist, and they have different performance characteristics. And the differences are there for various reasons; for instance, look at SQLite, which is row-oriented, which is great for transactional processing. 
40:27
Transactional data processing is when you load data, when you have a database, and you use it when you do transactions. For instance, if you buy a product, you update that row and then you need to check whether that transaction has succeeded or not. Otherwise, you have to roll back. That’s one application of databases.
40:49
Another one is analytical data processing, and that’s more where Polars, Pandas or Snowflake come in. And in that case, doing things in a columnar way is way faster. So, columnar means that you process data column by column. This is something Pandas does as well. They’re based on NumPy, but there are other things you need to do, which Pandas has ignored, and that’s multiprocessing or multithreading, just multithreaded parallel programming. I mean, my laptop has 16 cores available. I want to use them. It’s a waste of those resources if you only use one core for expensive operations like joins or group bys. 
41:42
The other one is that Polars is close to the metal. Pandas has just taken NumPy, which was meant for… For numerical data analysis, it’s great. But when you had string data, before NumPy took over there wasn’t a really good solution for that. And if you talk about nested data like lists and structs and arbitrary nested data, Pandas actually gave up, because it just used the Python object type, which means, “Hey, we don’t know what to do with this. We let the Python interpreter see what to do with this.” And in that sense, Pandas took NumPy and built on top of it. NumPy was never really meant to have a data processing tool like a database built on top of it. 
42:40
And Polars is written from scratch. It’s written from scratch in Rust. And every performance-critical data structure, we control. And by that control, we can have very effective caching behavior, very effective resource allocation, very effective control over memory. That’s the most important part, because a lot of compute, a lot of resources, goes into control over memory.
43:26
And then the third one, which I think is very important, is that we also use an optimizer. So, we actually made a different… If you look at databases, they, A, can be really fast because of performance and how you write the code, how you write the kernels that execute the compute. But there’s also an optimizer, and this optimizer will make sure you only do the compute that’s needed. And this is very similar to what a C compiler does. If you write your C, you can be sure that the computer will never execute the code as you’ve written it. There will be a compiler in between that will try as hard as possible to prove that it doesn’t have to do certain kinds of work. And that’s actually quite similar to data processing. If you don’t need to load a column, it saves a whole I/O trip. It saves a whole resource allocation. So, this can save a huge amount of work. 
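To see the kind of work-skipping Ritchie describes, here is a small lazy Polars query on a toy in-memory table (column names invented for illustration; on real data you would start from pl.scan_csv or pl.scan_parquet instead). Because the plan is built lazily, the optimizer can push the filter down and prune the unused column before doing any work, and the filter and group-by themselves run on multiple threads.

```python
import polars as pl

# Small in-memory table; on real data you would use pl.scan_csv / pl.scan_parquet.
df = pl.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "amount": [10.0, -2.0, 7.5, 3.0],
    "note":   ["a", "b", "c", "d"],   # never used: the optimizer can prune it
})

lazy_query = (
    df.lazy()                                     # build a query plan instead of executing eagerly
      .filter(pl.col("amount") > 0)               # predicate pushdown
      .group_by("region")                         # multithreaded group-by (older Polars spells it groupby)
      .agg(pl.col("amount").sum().alias("total_amount"))
)

print(lazy_query.explain())   # inspect the optimized plan: only "region" and "amount" are kept
print(lazy_query.collect())   # execute the plan in parallel
```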
Jon Krohn: 44:31
All right, that’s it for today’s ICYMI episode. To make sure you don’t miss any of our exciting upcoming episodes, be sure to subscribe to this podcast if you haven’t already, but most importantly, I just hope you’ll keep on listening! Until next time, keep on rockin’ it out there, and I’m looking forward to enjoying another round of the Super Data Science podcast with you very soon. 