SDS 703: How Data Happened: A History, with Columbia Prof. Chris Wiggins

Podcast Guest: Chris Wiggins

August 8, 2023

In this episode, host Jon Krohn talks to Chris Wiggins about the centuries-old history of data and statistics, and why digging into that history has been so important to Chris’ multidisciplinary approach to teaching students and to writing his latest book on the emergence of data science. Chris explains how learning about data history helps bridge the divide between science and the humanities, the controversy behind Bayesian statistics, and how he maintains the data science tech stack at the New York Times.
Thanks to our sponsors, the AWS Insiders podcast and Modelbit.
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Chris Wiggins
Chris Wiggins is an associate professor of applied mathematics at Columbia University and the Chief Data Scientist at The New York Times. At Columbia he is a founding member of the executive committee of the Data Science Institute, and of the Department of Applied Physics and Applied Mathematics as well as the Department of Systems Biology, and is affiliated faculty in Statistics. He is a co-founder and co-organizer of hackNY (http://hackNY.org), a nonprofit which since 2010 has organized the hackNY Fellows Program, a structured summer internship at NYC startups. 
Overview
Chris Wiggins’ latest book, How Data Happened, was inspired by conversations with his Columbia University students. Chris knew that not all practitioners know the detailed history behind their field, and yet he felt that understanding history is always necessary for appreciating the motivations underlying theories and applications. Together with his coauthor Matthew L. Jones, Professor of Contemporary Civilization at Columbia University, Chris came to have conversations about data history as well as the role that data have played in society. Chris highlighted the importance of discussing with Matt what to include and exclude, how to acknowledge the multifarious exchange of ideas during periods of rapid change such as the Industrial Revolution, why knowing the ideologies and backgrounds of the featured mathematicians might cast light on their theorems’ proper or improper use, and how to structure an introduction to a topic that is itself concerned with structuring knowledge.
Chris notes that data scientists also rely on the humanities to establish communication and collaboration skills that are crucial to understanding human behavior and needs—both vital to running a business. Even more essential is understanding the ethics of solving a problem with data. Chris believes that no one in data science should be handing over responsibility to another department and that data scientists are actually “uniquely qualified” to be in the room for conversations surrounding ethics regarding data use.
Chris also talks about the imbalance of power in handling data. He says that data mining and data collection are a new type of “invisible” threat that can have devastating effects on people and populations, changing mindsets and ideologies and warping facts. Chris believes that regulatory power over corporations that handle large amounts of data is inefficient – and insufficient – a problem that has grown since the 1970s, as the balance of power over data flipped from governments to corporations. Given that people believe data is knowledge, without adequate regulatory frameworks, data’s inherent “truth” can be used to back up false statements and discriminatory practices.
Finally, Chris discusses the tech stack he maintains as Chief Data Scientist at the New York Times. He uses the example of the New York Times’ COVID database during the pandemic, a vital resource for many people nationally and internationally. He is hopeful that, while newspapers depend on new business models to keep themselves financially afloat, sustainable models will emerge that let local and global journalism outlets invest in their tech stacks and maintain their journalistic rigor.
Listen to hear how Chris and Matt Jones ultimately came up with the three-part organizing principle for their new book, how the subjectivity of Bayes’ theorem has made it historically controversial, and why it is necessary to be transparent about the unavoidable subjectivity involved in analyzing data and training models.
In this episode you will learn: 

  • The importance of the humanities in data science [09:18]
  • How data science “rearranges” power [17:19]
  • An overview of How Data Happened [20:36]
  • The controversial nature of Bayes’ theorem [29:16]
  • Why we need to consider data ethics [34:00]
  • How biology came to adopt data science into its field [45:44]
  • The Data Science Tech Stack at the New York Times [49:18] 
Podcast Transcript

Jon Krohn: 00:00:08

This is episode number 703 with Dr. Chris Wiggins, Associate Professor at Columbia University and Chief Data Scientist at The New York Times. Today’s episode is brought to you by the AWS Insiders podcast and by Modelbit for deploying models in seconds. 
00:00:22
Welcome to the SuperDataScience podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple. 
00:00:52
Welcome back to the SuperDataScience podcast. Today we’ve got a very special live filmed episode of the show with a very special guest, indeed, the exceptionally knowledgeable and astoundingly well-spoken, Dr. Chris Wiggins. Chris is an Associate Professor of Applied Math at Columbia University. He’s also Chief Data Scientist at the New York Times, and he co-authored two fascinating recently published books, which I’ll detail on stage for you at the onset of the live recording. Speaking of which, this episode was recorded live on stage at the New York R Conference. You may hear audience reactions throughout the episode, and there will be questions from audience members at the end. The vast majority of this episode will be accessible to anyone. There are just a couple of questions near the end that cover content on tools and programming languages that are primarily intended for hands-on practitioners. In the episode, Chris magnificently details the history of data and statistics from its infancy centuries ago to the present. He talks about why it’s a problem that most data scientists have limited exposure to the humanities. He talks about how and when Bayesian statistics became controversial. He talks about what we can do to address the key issues facing data science and machine learning today. He also talks about his computational biology research at Columbia and the tech stack used for data science at the globally revered New York Times. You ready for this sensational episode? Let’s go. 
00:02:19
Nice, Chris. Welcome. 
Chris Wiggins: 00:02:20
Thanks. That’s actually true about the ukulele. I don’t remember telling him that. I don’t know. I don’t, I don’t know how he knows that, but yes. 
Jon Krohn: 00:02:26
Yeah. Well, next time we’ll have to bring the ukulele out so you can prove it. 
Chris Wiggins: 00:02:30
Yeah, it’s fun. 
Jon Krohn: 00:02:31
So, in addition, in addition to ukulele phenomena, you’re Associate Professor of Applied Math at Columbia University, and there you were a founding member of several departments as well as of the Data Science Institute. You’re also the Chief Data Scientist at the New York Times, and you had two recent books published. So, one came out just in March, just a few months ago. It’s called How Data Happened: A History from the Age of Reason to the Age of Algorithms, and then you had another book come out in October, which is Data Science in Context: Foundations, Challenges, Opportunities. Let’s start with the most recent book, How Data Happened. First important question is, do you use data as a plural or singular term? 
Chris Wiggins: 00:03:16
Yeah, we had to fight about that. I often use it as a plural, but sometimes you can use it as a singular for the concept of it. So, I’ll, I’ll go both ways, so to speak. 
Jon Krohn: 00:03:28
Nice, I’m a big plural fan, so it’s great to hear that. 
Chris Wiggins: 00:03:33
A long time ago, John Myles White tweeted, you’re not allowed to use data as a plural unless, unless you use agendum as a singular, which when I saw that, I was like, okay, I’m gonna use agendum as a singular. So, I now, I now use that regularly. 
Jon Krohn: 00:03:46
So, this book, How Data Happened, I understand that it was inspired by conversations with students in a course that you have at Columbia, Data: Past, Present, and Future. How did those conversations bring about this book? 
Chris Wiggins: 00:03:59
That’s right. So, I mean, many things came together to lead to that project. One was my own interest in how data happened and how these different fields came to be what they are. I can remember another tweet, so I think it was Hadley Wickham who tweeted in Summer 2012, like, how did statistics get so mathy? And at the time, I had just started reading more about the history of statistics. So, being a practitioner in data doesn’t necessarily mean that you really know the history of how these fields came to be. So, I started doing more of a deep dive on that history, which I really think of as a form of root cause analysis for things, is to understand the history of the thing. So, I started reading about the history of statistics, the history of machine learning, and at that point I met an actual historian, which is my co-author, Matt Jones, who’s a professor at Columbia. So, I had actually seen him give a lecture on the history of machine learning in the spring of 2013, and I think that’s really how he and I met each other. So, we started talking about the history of machine learning, the people involved, and the way it was being applied. The next summer, summer of 2014, my friend Cathy O’Neil was working with Mark Hansen at Columbia to develop a new curriculum. 
Jon Krohn: 00:05:16
Catherine O’Neil’s Weapons of Math Destruction. Is that the same? 
Chris Wiggins: 00:05:18
Yes. So, so Cathy and I had met years earlier. Anyway, just, like, as Jared said, there just wasn’t that big a New York City data community 15 years ago. So, like, if there were people working in data, you would probably all meet each other. Anyway, so she and Mark Hansen were organizing a new summer program for data journalism in Python, and Matt and I were two of the co-instructors. So, at that point, Matt and I had collaborated, but the conversation you’re alluding to was one where Matt was hosting a group of undergraduates in his house. So, he was like a residential dorm mentor and had a bunch of students over to talk about data and, like, what is the deal with data, basically? And the students who showed up were sort of half from engineering and half from Columbia College, which has sort of more of a humanist bent.
00:06:09
And we had a really good conversation about data science, the history of data science, but also data and society. This must have been in, I think, November of 2015. And at the end of that, some of the students said, you two should really teach a class on this, which we thought was crazy and would never happen because he was in humanities and I was in engineering, and, you know, never does that work out. And then the next year, Columbia said, okay, we’re sponsoring new innovative classes that would cross two different schools. And right away we said, okay, we should totally do that. And so that’s really how the class came to pass. A lot of the hard work was Matt and I trying to think about how would you sort of carve off a history of data? What would you include? What would you not include, and how would you organize it? 
Jon Krohn: 00:06:58
Yeah. Where’s the beginning? 
Chris Wiggins: 00:06:59
Right. So, you’ve got to choose a beginning for data, and you could choose, you know, like state formation, because in a lot of ways, part of the integral process of forming a state is to count how many people are in the state or how large its area is. And that’s actually the birth of statistics. Right. The word state is right in there. Yeah. And good, because I’m a big fan of words and their meanings and their etymologies. And so for me, one of the natural places to start was when the word statistics enters the English language, which is 1770. And for Matt, being a scholar of how ideas change, he was really interested in almost the same time period because of the scientific revolution and the idea that there’s all these facets of human endeavor, which [inaudible 00:07:46] were gonna be the authority of the government or of the church.
00:07:51
And part of the scientific revolution was to say, no, there are ways that the scientific method and experimentation could help us enumerate and understand and argue for what is true about all these different things. So, we started there, we started in the late 18th century, mid 19th century, around the scientific revolution, which was a time where data was not just about counting, but was being used to argue for what is true. And almost immediately thereafter, what should be true, meaning to say, not just descriptions of things, but prescriptions of things. Almost immediately people start using data not only to say, “oh, well here’s the way the world is organized”, but trying to make policy decisions, really to make statements about how things are to be done. So, it’s a very interesting time. Right away, there start to be fights, and it’s much more fun to write a history book that’s all about fights. So, that’s another good reason to start in the 19th century, 18th century, is because it’s not just people counting things and everybody says, “yes, that’s how many, you know, I don’t know, cows we have in France” or what have you. It’s more like, okay, let’s have an actual debate about whether or not data should be allowed to have a seat at the table in understanding this craft. 
Jon Krohn: 00:08:57
Very cool. And we’re gonna talk more later on about some of the key philosophies around being prescriptive with data. For now, let’s dig a little bit more into how you described this being a special course where you were bridging the humanities and a quantitative discipline, engineering. Why are the humanities mostly missing from data science curricula, and why is that a problem? 
Chris Wiggins: 00:09:23
Okay. Well, okay. So, the first one, you know, I sort of feel like it’s a historical accident in the way that United States higher education, arguably following a German model of research-based universities, carved off different ideas. And we have definitely moved away, by a hundred years ago we had moved away, from the idea of a natural philosopher, somebody who would understand all these different aspects, to somebody who would be specialized in different things. And I think that’s a very efficient way to run any large org, is to have specialization, including if your org is a university. This is of course well documented in, you know, C. P. Snow’s famous essay on the two cultures from, I believe, the 1950s. That’s sort of part of what’s understood about, you know, the way we educate people: there are humanists over here and there are scienticians over there, and engineers maybe over yonder, and they’re not expected to cross-pollinate and to look at the same topic in different ways and try to reach a consensus. Again, getting back to that dinner with the students, that was part of what we were experiencing there: students coming from very different bents trying to understand data as a socio-technical system, so to speak, right? There are technical aspects to data, and then one of the things that’s driving so many people to be interested in data is the impact on society of data and data-empowered algorithms. And so we formed a class that was really meant to teach new stuff, both to the technologists and to the humanists. So, in the first document we wrote about it, we said, there is interesting material that is relevant and important that is being taught neither to the humanist nor to the technologist. And we wanted to carve out a class and later a book about that material. 
Jon Krohn: 00:11:07
Very nice. I’m really excited to read the book now in its entirety. It sounds like a fascinating area. Did we address why it’s a problem to not have the humanities in data science? 
Chris Wiggins: 00:11:21
No, well done. You, you successfully caught me in not answering that. So, that’s a good question. I mean, it depends on what you’re optimizing for. So, let’s say like you’re trying to get a job, maybe it’s not a problem that you don’t have a lot of humanities on your CV, because your hiring manager might say, okay, well, I’m looking for somebody who’s got this class, that class, and the other class, ticks off those boxes. And that’s a self-perpetuating system in which, you know, somebody’s looking for these classes, you have these classes, problem solved, which is to say there is no problem. I would say there’s some problems that are not particular to data, and there’s some problems that are particular to data. So, a problem that is not particular to data, but I think for technologists in general, is that we under-appreciate a set of skills I’ll call collaboration skills. So, Jeff Hammerbacher has this essay from 2009; “Information Platforms and the Rise of the Data Scientist” is the name of the essay. I encourage you to check it out. And in 2009, he talks about why they created a new job title, data scientist, at Facebook, and he says, “In our team, the things that people work on were quite diverse. You might be doing hypothesis testing, building a data-intensive product, doing a regression, and communicating the results to the rest of the organization in a clear and concise fashion.” And if you look back at his paragraph, he says things like building a data processing pipeline in Hadoop, doing a regression in R, doing something else in Python. And it’s fun to look back at that paragraph and see, okay, what tools might you use for those things today?
00:12:57
But the last element in that list, communicating in a clear and concise fashion to the rest of the organization: that is still true and is certainly something that I think of as part of what data scientists do, is to communicate to people, often people who are complementary to them, not to other data scientists whose brains are sort of shaped the same way, but to people who are complementary to them. So, in terms of how it’s a problem, I think technologists limit their careers if they don’t see that communication is part of the job and that it is actually in their best interest to think about communication. Related is something that may be called strategy, but is also just: why are we doing this? So, I think technologists do themselves a career disservice if they are not thinking about why is this an important technological problem to solve. 
00:13:43
And once you understand that, then you can start pointing out when it’s actually not the right solution. When somebody comes to you and says, okay, we should build this thing or do this model, if you really understand why you’re doing that, it often puts the technologist in a place uniquely to say, that is not the right tool for the right job, and that is not the right problem we should be solving. So, in terms of what’s wrong with missing out on a humanities training, I think part of it is the communications, and part of it is the realization that understanding sort of outside the lines and outside the code gives you a view into strategy. Now, the parts that I think are particular to data: one is, you know, the way we do data science now really draws on all these disparate fields, right? Some of you may have PhDs or master’s in data science, but most of us have a PhD in something totally unrelated, us included. 
00:14:31
So, part of what’s useful about engaging in data is, I think, understanding the history and understanding how the way we use data to make sense of things and to declare what is true is itself informed by a variety of traditions, including the natural sciences, statistics, and certainly computer science, as well as other things. So, a humanities background, which means you’re not afraid to pursue the history. And again, I hope that’s one of the things people benefit from in the book. I think it’s also useful to people’s understanding of what is the right tool for the right job and what is the right definition of mission accomplished for a given problem. The second thing that I think is useful about a humanities background is, as data science is having more and more impact, more and more data scientists are trying to question and come to grips with the ethical impact of what they do.
00:15:22
So, I think a background and a training in a shared vocabulary in the ethics, the applied ethics, of data is very useful. And I don’t, I don’t like it when I see technologists say, that’s not my department. You know, somebody else is gonna figure out the ethical impacts, maybe somebody in product or something else. I feel like data scientists are uniquely qualified to be a voice in that conversation, and I don’t want data scientists to just give up the role of participating in that conversation. That said, just like any other technical material, it’s useful to have some amount of common vocabulary so that you can really enter the arena and engage in those conversations.
Jon Krohn: 00:16:49
This episode is supported by the AWS Insiders podcast: a fast-paced, entertaining and insightful look behind the scenes of cloud computing, particularly Amazon Web Services. I checked out the AWS Insiders show myself, and enjoyed the animated interactions between seasoned AWS expert Rahul (he’s managed over 45,000 AWS instances in his career) and his counterpart Hilary, a charismatic journalist-turned-entrepreneur. Their episodes highlight the stories of challenges, breakthroughs, and cloud computing’s vast potential that are shared by their remarkable guests, resulting in both a captivating and informative experience. To check them out yourself, search for AWS Insiders in your podcast player. We’ll also include a link in the show notes. My thanks to AWS Insiders for their support.
00:16:52
Yeah, that makes perfect sense. One of the key areas where people are having a big impact with data and data science, making a big social impact today, is with this huge revolution of large language models. We’re seeing these pop up all over the place, lots of ethical issues popping up. I wonder if this relates to your comment. You’ve said that data and data science rearrange power, and it seems like that potential to rearrange power has dramatically accelerated in the past year. Do you care to elaborate on what you mean by power rearrangement? 
Chris Wiggins: 00:17:34
Absolutely. So, that line is stolen from Phillip Rogaway, who’s a cryptographer. So, after the Snowden revelations in 2013, Phillip Rogaway wrote an essay directed at his fellow cryptographers, fellow members of the cryptographic community. And the opening is “Cryptography rearranges power: it changes who can do what, from what. Therefore, cryptography is inherently political.” And here political doesn’t mean relating to voting, it means relating to the dynamics of power. And I think for many of us, the impact of data science on information ecosystems writ large over the last six years has made many data scientists realize data science is actually rearranging power. It’s changing who can do what. And it suddenly is making data scientists realize, wow, there is a political aspect to what we do. Again, not political relating to who you vote for, but relating to the dynamics of power.
00:18:33
So, many capabilities rearrange power; many capabilities rearrange who can do what, from what. Data science is no different, right, as a piece of technology. Some aspects of data science that are special include that the damage done is often probabilistic. Meaning, when you go to a website, right, it’s not necessarily the case that you’re gonna be seeing something false, malicious, harmful, right? Somehow, effectively, dice are being rolled there. And often the damage is invisible, as opposed to, like, a car crash or something where everybody could look at it and say, okay, that’s clearly bad. Sometimes the effects are themselves only revealed statistically or by, you know, one-off edge case anecdotes where it’s real clear how an algorithm led something really bad to happen. Often they’re only revealed by careful statistical analyses. So, that phrase, you know, rearranges power, is borrowed from Rogaway. And I think it’s the jarring way that he wanted to say to his fellow, you know, technical community, we should realize that our capabilities are political, right? He wrote this 10 years ago. I think we wanted to jar the data science community and say, recognize that as your algorithms are powering things, they’re shaping people’s political and personal and professional realities, and even what they think is true. Your tools have power and they rearrange power, and therefore they are political. 
Jon Krohn: 00:19:58
Yeah. Makes perfect sense to me. And it should feel empowering then to the people here in the audience at this data science conference, as well as all the listeners, the data scientists out there: the tools that we wield are increasingly powerful and seem to be accelerating in power. So, I think you’ve made an excellent case that these kinds of considerations, the humanities, the relationship of data science to politics (again, in the way that you meant it there, in terms of power dynamics), are terrifically important. And so there’s a great case for reading your book. Give us an overview of the three main parts of the book. 
Chris Wiggins: 00:20:39
Yeah. So, part of the work with my co-author of thinking about how we were gonna construct one coherent story that would tie one thread through all of the data work over the last 200 years was to break things into three eras, or three parts of the book. So, part one sort of corresponds to what I think people would think of as a history of statistics and the way that making sense of the world through data became part of the academy, including that story of how making sense of data became an academic field, and it basically ends in World War II. So, that history is, I think, well described in many other books, the history of statistics, including how it became about policy concerns, including the word statistics. So, you made the mind-blown-
Jon Krohn: 00:21:30
I did not know that [crosstalk 00:21:30] 
Chris Wiggins: 00:21:31
-sign, which podcast listeners may not be able to hear Jon actually doing [inaudible 00:21:36]. But to drill down on that, when the word statistics enters the English language in 1770, it has nothing to do with math or even numbers, right? It’s about statecraft, running a state. And almost immediately you start seeing these fights between self-described high statisticians, who understand the greatness of the men that run these countries, dismissive of the vulgar statistics being done by the mere table makers, statisticians who were looking at numbers and constructing, for example, a table where every row is a country and the columns might be area and population and things like that. That form of lesser statistics (as they called it, often translated as vulgar statistics) was dismissed. And we wanted to capture that story also, because so many times in our present day, let’s say over the last two to three decades, there are fields where there’s a craft, and people understand that craft often without any numbers whatsoever. And then suddenly some technological shift happens, or at least a mindset shift happens, and a set of other people show up and say, we should be able to understand this craft using numbers and later using statistics. 
00:22:52
You know, I’m realizing I didn’t get to the answers to the question. Okay. So, part, so part- 
Jon Krohn: 00:22:55
It’s fascinating. 
Chris Wiggins: 00:22:56
So, part one is the story of statistics as we currently understand it, right? It’s like data and math and how they came together. Part two opens up at Bletchley Park. So, part two really opens up with World War II and the creation of special-purpose digital computation. The creation of programmable computers, which we all have grown up with and think were around at the time of the dinosaurs, really was born of a data science problem, right? So, we wanted to tell the story of Bletchley Park and how dealing with streams of messy real-world data for an extremely heuristic-based problem, where you needed to get the job done often that day, right? Because they were breaking German codes based on settings of a rotor that were changed every day.
00:23:38
People did not care about your pure philosophical ideas about how statistical analysis was to be done. They needed to get that job done right now. En route, they built the first digital programmable computers for solving that problem. That’s a story, by the way, that was not known because it was intentionally occult, meaning it was part of the state secrets act that nobody was allowed to talk about what happened at Bletchley Park. And this persisted for decades. The story of the role of data in Bletchley Park was untold. So, part two is about the birth of data as an engineering problem, which gets taken up from Bletchley Park to Bell Labs across the Atlantic here in the United States and then becomes an industrial concern as many companies take on data. Which, I should say, before the companies took on data, there was a lot of work by the intelligence community.
00:24:33
So, we try to take pains to explain who was funding the birth of computing with data. And it was, you know, the NSA and the proto-NSA and the intelligence community that was leading to both methodological advances and computational advances, funding, you know, IBM’s nascent computers, which then IBM had to turn around: once those machines were too old and the intelligence community had moved on to a new model it was funding, IBM had to sell that older model to somebody. And that somebody was largely industrial concerns. So, you can look back on papers like the paper in 1958 by IBM advocating for the field of business intelligence, and if you know what machines they were using and what machines they were trying to get rid of, because the NSA, or whoever in the intelligence community, was funding new machines, you can see how that arc took place. So, part two is about data as an engineering concern, and part three is really about the present day. So, part three is about the battle for data ethics, how we got here, the impact of the advertising economy and persuasion architecture on understanding our reality, and chapter 13 is essentially: what are we gonna do about it? What are the powers which are contesting each other right now? And the resolution of those disputes is really going to determine how data impacts our lives. 
Jon Krohn: 00:25:50
Very cool. That sounds fascinating. Early on in the book, in chapter two, a term is introduced: social physics. That actually sounds like something that could be key in that final section on the battle for data ethics. But you introduce it at the beginning of the book, so it must be important throughout. What is social physics? 
Chris Wiggins: 00:26:12
So, when we go back to the 19th century, the mid-19th century, before we get to words like correlation or regression, we start out with social physics, which is a term invented by a Belgian astronomer who wanted to put data to work to understand and improve society. So, his name was Adolphe Quetelet. Quetelet is mostly, people don’t know him today, but if you’ve heard of him, it’s because of the body mass index, which I’m sure would make him spin in his grave, to know that that’s what he’s remembered for today. But he wanted to take the methods of physics, which was the field he was coming from, and use those methods to understand society, right? So, he was advocating, he played around with different terms. One of them was social physics. Another one was social mechanics, which he took directly from the field of celestial mechanics. And he wanted to take those methods, including celestial mechanics, and apply them directly to society. 
00:27:03
So, there are several things that I think are useful lessons there. And I think it’s one example of how history makes the present strange, so to speak: you can see in the present day the way people come from other fields. You and I have PhDs in a different field entirely and are influenced by that. Like, we come from a certain training and we come from a certain set of epistemic virtues, which is to say, a set of beliefs about what a solved problem looks like. And we want to take those methods and use them for other problems that we think are important. And Quetelet was certainly in that bag. You know, he came from astrophysics. He had seen, you know, work by Gauss and others to find the true location of a planet by looking at what we would now call a bell curve or a normal distribution of different measurements. And then the center of those is interpretable as the true location of a planet, for example.
00:27:53
And he wanted to take those ideas and look at society and say, well, if you look at the number of crimes in a country over different years, and they all seem to center around some value, there must be some true intrinsic “crimeiness” of that society. And he wanted to develop that sort of physics-based, mechanistic approach to understanding society and its ills by simply importing into the field (there was no field of sociology at the time, but into the field of understanding societies) the methods, including the quantitative methods, of physics. 
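
To make the method Quetelet imported concrete, here is a minimal Python sketch, with invented numbers rather than anything from the episode, of the astronomers’ logic: repeated noisy measurements of one true quantity pile up in a bell curve whose center estimates the truth, and Quetelet’s leap was to read a stable average in social statistics the same way.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: a planet's "true" position and noisy telescope readings.
true_position = 103.7
measurements = true_position + rng.normal(loc=0.0, scale=2.5, size=500)

# Gauss's move, as Quetelet received it: the measurements scatter in a bell
# curve, and its center (the mean) is the estimate of the true value.
estimate = measurements.mean()
spread = measurements.std(ddof=1)
print(f"estimated position: {estimate:.2f} +/- {spread:.2f}")

# Quetelet's leap: apply the same logic to society. If yearly crime counts
# cluster around a stable center, he read that center as an intrinsic
# property of the society itself, a move his contemporaries contested.
crime_counts = rng.normal(loc=1200, scale=40, size=30)  # 30 invented years
print(f"'intrinsic' crime rate: {crime_counts.mean():.0f} per year")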
Jon Krohn: 00:28:25
Extremely interesting. 
00:28:28
Deploying machine learning models into production doesn’t need to require hours of engineering effort or complex home-grown solutions. In fact, data scientists may now not need engineering help at all! With Modelbit, you deploy ML models into production with one line of code. Simply call modelbit.deploy() in your notebook and Modelbit will deploy your model, with all its dependencies, to production in as little as 10 seconds. Models can then be called as a REST endpoint in your product, or from your warehouse as a SQL function. Very cool. Try it for free today at modelbit.com, that’s M-O-D-E-L-B-I-T.com 
00:29:07
So, that kind of covers a key topic from the first part of the book. In the second part of the book, you mentioned Bletchley Park and cryptography. My understanding from chapter six of your book is that Bayes’ theorem played a key role at Bletchley Park. Why is Bayes in general controversial in some circles today? And has it always been, historically?
Chris Wiggins: 00:29:30
Well, I wouldn’t say always, but certainly for a long time. So, in the 20th century, Bayes was largely derided by statisticians. Now, why would you use it? Well, because it’s true, meaning, like, Bayes’ rule just follows definitionally from the definition of a joint distribution and its relationship to conditional distributions. So, you wouldn’t think there’s anything magic about it. The part that’s magic about it is that often the world hands you data, and what you really want is not the probability of the observations given what’s true. What you really want to know is what’s true. You want to know the probability of what’s true given the data. The only way to do that is to use Bayes’ rule, and therefore, the only way to do it is to have prior belief about what is true. And at Bletchley Park, again, they weren’t coming from any particular anointed tradition in statistics, like there were no statisticians there. They were doing whatever it took to get the job done.
00:30:24
And there’s an interchange, which is related by the statistician I. J. Good, where long after Alan Turing’s death I. J. Good says that he was talking to Alan Turing and said to him, “Oh, you’re using Bayes rule, aren’t you?” And Alan Turing says, “I suppose so”, which I find to be an amazingly non-committal reply. It’s not clear if he sort of just didn’t even know what Bayes rule – no, I’m sure he knew what Bayes rule was. Anyway, so I think part of the story there is that at Bletchley, they weren’t coming from a tradition where they knew that Bayes was considered heretical and to be rejected. The main dominant schools at that time were the mathematical schools of Fisher and of Jerzy Neyman and Karl Pearson, sorry, Egon Pearson, the son of Karl Pearson. And both of them hated each other. And the only thing they hated worse than each other was Bayes. They hated Bayesian statistics. 
00:31:22
Because you have to have some prior belief about the thing that you really want, which is dismissed as saying, oh, well, therefore you’re being subjective, right? Because if you have prior belief about something before you’ve seen data, you have to quantify how sure you are about something. And so using Bayes was dismissed as a subjective school of statistics. Now, the way we use Bayes today is often in the presence of such large data that the amount of subjectivity is washed out. And being Bayesian often today simply means following the rules of probability, writing down generative models for how you think the data were produced, which is natural. I mean, it’s basically just using probability’s rules. So, it’s not very daring to use Bayes’ rule. Again, it’s definitional. 
00:32:06
The thing that draws people’s ire is that, particularly when you have small data, using Bayes’ rule means that you’re willing to put in some prior belief about something. So, for example, actually, it has been controversial for a long time, even when Bayes published his essay. So, Bayes, sorry, he didn’t publish his essay. He died. His friend published his essay. So, Bayes, who was a reverend, had this essay about the probability of, I believe it was the miracle of Christ’s resurrection, given that miracles had been reported. And there was an active debate happening at that time about how we should interpret it, could we use mathematics and probability to speak quantitatively about the probability of this miracle, right? Did it really happen, given that it was reported? And so he wrote an essay about it, put it in his desk, and then he died, and then his friend published the essay.
00:32:58
So, in doing that, he started off what went on to be an activity in theology where people actually did try to say, well, can we estimate the probability of God’s existence? So, this happens again later, this actually is not in the book, by the way, right? Actually, do we, do we [inaudible 00:33:16] Anyways, there’s a whole school of philosophy where people try to calculate the probability of the existence of God, but the only way to do it is to put in prior belief about it. So, that’s the part that’s heretical, is being willing to put in prior belief about the thing that you really want to know if it’s true. 
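
To illustrate the point about priors getting washed out by data, here is a minimal sketch, with invented numbers rather than anything from the book or the episode. In the textbook coin-flip model with a Beta prior, the posterior has a closed form, so you can watch two analysts with very different subjective priors converge as observations accumulate:

# Bayes' rule: P(theta | data) is proportional to P(data | theta) * P(theta).
# For a coin with unknown bias theta, a Beta(a, b) prior plus binomial data
# yields a Beta(a + heads, b + tails) posterior (the classic conjugate pair),
# whose mean is (a + heads) / (a + b + heads + tails).

def posterior_mean(a: float, b: float, heads: int, tails: int) -> float:
    """Posterior mean of the coin's bias under a Beta(a, b) prior."""
    return (a + heads) / (a + b + heads + tails)

# Two subjective priors: a skeptic who strongly believes the coin is fair,
# and an agnostic with a flat, uninformative prior.
skeptic = (50.0, 50.0)
agnostic = (1.0, 1.0)

true_bias = 0.7
for n in (10, 100, 10_000):
    heads = round(n * true_bias)  # idealized data, for illustration only
    tails = n - heads
    s = posterior_mean(*skeptic, heads, tails)
    g = posterior_mean(*agnostic, heads, tails)
    print(f"n={n:>6}: skeptic={s:.3f}, agnostic={g:.3f}")

# At n=10 the skeptic's prior dominates (estimate near 0.5); by n=10,000
# both posteriors sit near 0.7, and the subjectivity has been washed out.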
Jon Krohn: 00:33:32
God is heretical in statistics. 
Chris Wiggins: 00:33:35
So, certainly for Fisher and for Neyman, who were the two people who really set out the statistical synthesis as the way we teach it to our undergrads, any use of Bayes for understanding the probability that something is true is absolutely heretical. 
Jon Krohn: 00:33:49
Right? So, that’s a really fascinating story from the second part of your book. Jumping to the third part of your book, chapter 11, you talk about the battle for data ethics. So, what is the issue today around data generally being considered as fact that seems to be causing a lot of problems?
Chris Wiggins: 00:34:09
Yeah, so there’s two halves to what you just asked. One is the use of ethics as a framework and as a rhetorical tool in discussions about corporations and how they should have power. Ethics as a term has become this luggage term, so to speak, meaning it’s an open term where we sort of put all of our desires and hopes and dreams and fears, in part because there’s so little regulatory mechanism for constraining corporate power. And I pick on corporate power not because I’m particularly anti-corporate, but because that’s where data is, that’s where the power of data is centered right now. So, there was clearly a time in the 19th century where data was a state concern, and also actually for a lot of the 20th century data was a state concern. And in the 1970s, there was a lot of writing about privacy, not because people were scared of companies, but because they were scared of the government having too much data. 
00:35:05
What’s happened since, in the last 50 years, is the power of data has flipped from being in the hands of government, mostly the US government, to corporations. And the checks and balances we have around government, we do not have around corporate power. And ethics as a term has become a term that we’re using for all of this ambiguity as to how corporations should be regulated, who should make sure that companies have some sense of corporate responsibility, I should say consumer protection, in addition to trying to enhance shareholder value. So, ethics is being used as a capacious term for all of those hopes, dreams, and fears around corporate use of data. But the other thing that you’re asking there, I’m sorry, Jon, what was the other half? 
Jon Krohn: 00:35:49
Yeah, the other half of the question was around the issue of data and truthiness. Yeah, truthiness. 
Chris Wiggins: 00:35:52
So, that’s the other thing about data, is that it has long come with this rhetorical power: that if I have data, then I have more credibility, right? And somehow what I say is more true because I have numbers for it. And so one of the things that we wanted to peel open in the book is how much of a process there is of generating data, not finding data, right? Because the data are never raw. Sorry, I just slipped up and used data as a plural. The data are never raw. 
Jon Krohn: 00:36:22
Noted. 
Chris Wiggins: 00:36:23
The data are cooked, right? And it is not that we can avoid cooking data. It is instead that we should be reflexive and honest about the ways that we have cooked data. Even, for example, before we start modeling, even when we choose what data are to be collected and saved and what data are to be thrown away, we have made some choice there. That’s a subjective design choice. So, part of what we want to do in the book is to draw attention to the myriad subjective design choices in making sense of the world through data, and to allow people to think critically about that. So, that is part of the relationship to ethics: really that sort of critical capability, when somebody comes at you and says, well, this is true because the data said so, to encourage people to think about that, and to be reflexive and honest about all the myriad subjective design choices that go into any analyses of data that we do. 
Jon Krohn: 00:37:14
Really cool. We’re gonna talk more about data ethics in a second, in the context of your [inaudible 00:37:20] book, the book before your most recent book. So, all that we’ve been talking about so far was from How Data Happened, which came out in March. In October, you released another book, Data Science in Context: Foundations, Challenges, Opportunities. Writing is tremendously taxing. Debra Williams, my acquisitions editor, is here, so she knows how taxing I find it to be. And I’m sure lots of, I think to write a good book, it should be taxing. But you published two books in under a year, almost within six months. So, what compelled you to take on both of these projects at the same time?
Chris Wiggins: 00:38:00
Well, the history project, you know, in some ways began, as I said, in 2015. The second project, which is more of an applied project, grew from years of conversations with Jeannette Wing at Columbia, and also years of conversations with Alfred Spector, who, again, is just sort of part of the data community here in New York City. So, I think the first time I interacted with Alfred, he gave a talk at Columbia on the promise and perils of data science. And it was refreshing because it was a rare talk by a data professional that was speaking openly about concerns about data based on his experiences at Two Sigma, but also Google beforehand, and the difficult choices he had to make as a leader in a group that was deploying data products and thinking about the impact of data-driven algorithms.
00:38:52
So, I had been really pressing him thereafter to think through more analytically how he thought about the problems, but also what solutions he might propose, right? And by this point, there was a growing literature on problems with data, like things that made people concerned about data. And I felt there was less of a well-developed literature on what is to be done about it. Like, can we say something to engineering students in particular about not just, you know, be scared of data, but like, here’s how to think about what are the challenges of data? Here’s how to have a rubric for thinking about different types of challenges, and here’s how to have a rubric for having conversations that are constructive about how to mitigate those risks and to solve those problems as a consensus. So, as far as how it all came to pass, I think a lot of it was the pandemic, and that I wasn’t, like, seeing people and going to meetings or, like, having lots of fun in New York City. Instead, I was, like, on Zooms all the time. 
00:39:52
And that book, in particular, was really written by Zoom: me, Peter Norvig, Alfred Spector, Jeannette Wing, on a whole mess of Zooms. And the other one similarly, with Matt, I would say a lot of it was a bunch of late nights with Google Docs, where Matt, after his kids would go to sleep, would write, write, write up until about two in the morning or maybe three in the morning. And then before my kids would get up, I would wake up at, like, four in the morning and start writing. But they were both done very much as remote books, remote together. That said, they were done as remote books with people that I had met in person beforehand and started those conversations with beforehand.
Jon Krohn: 00:40:28
A really cool story. So, I mentioned that I would talk about ethics here as well. So, this book, Data Science in Context, also talks about ethics. It also, as you mentioned, details how to overcome some of these issues. So, how can we address these issues and effectively apply and deploy dependable AI systems? So, can we kind of think of Data Science in Context perhaps as a technical companion to How Data Happened? 
Chris Wiggins: 00:40:55
It’s certainly more technical than How Data Happened. I would say, I mean, you can only put so much into one book, but the technical book, the one with Cambridge University Press, is much more about how to do data and how to function as a data professional, a data scientist, possibly a technical product manager. But really the details of how to think about different algorithms, as well as how to mitigate, identify early, and get ahead of potential problems with data-driven products. The history book has to take on a whole history, right? So, there’s a lot of stuff that’s covered in the history book that doesn’t have a technical complement in the Cambridge book. They do both talk about ethics, for sure, but in the history book, we’re trying not to be prescriptive about how to do ethics, but to capture, as the chapter title says, the battle for data ethics: that there really is a battle going on, with people wanting to define ethics and to argue about what ethics means and how ethics is to be done.
00:42:00
And in the Cambridge book, we try to define ethics and make an argument for how it is to be done. So, the framing we have in the applied book, the Cambridge book, is, I would say, very much based on an academic training and 50 years of applied ethics, particularly around human subjects research. The way we think about it is not so much as a checklist, which I think is often very useful if you have, like, a problem that you’ve solved many, many times, and then, you know, okay, well, I should do this, check, check, check, check, check. But a more general set of principles, which is the way people in the human subjects research community have thought about this for 50 years, is that in different problems, you will have to strike a balance among principles that are going to be in tension. 
00:42:40
Those principles include respect for persons, which includes the autonomy of people to make informed choices; beneficence, which is really about benefits and harms and how those balance; and justice, which includes fairness but is not necessarily limited to fairness or equality. So, we argue for a definition of ethics around those principles. And we argue for thinking hard about what is the process that’s appropriate to your company or your community for designing a process that’s informed by that definition. So, again, in the academic community, in universities, this has been well established for almost 50 years now. There’s a definition of ethics, applied ethics, I should say, in terms of a balance among those three principles. And there’s a design process, which is largely around federal funding and the creation of an institutional review board, which, you know, was decreed 50 years ago as the appropriate organizational design for academic research universities. 
00:43:37
A question many people have been asking for the last, I would say, six or seven years, is what is the analog of that in the context of corporate deployment of data-empowered algorithms, which are not necessarily being used for research, right, for trying to understand what’s true. So, we try to argue in the book that these principles, we think, are still valuable, and they are comprehensive, to use a word that was used in the original report that proposed these principles in 1978. But the organizational design will be very different in different companies. Different organizations may want to have, you know, a person that’s in charge of that, a group that’s in charge of that, a deployment process which includes ethical audits as part of your checks and balances when developing or proposing a new product. 
00:44:23
And also, one thing that’s different from research is developing a product means you can actually monitor it, and you can continue to see what is the impact on users, right? Because these are deployed algorithms with data about people that are used by people, and that can allow you to mitigate and to monitor and to change the algorithms, or to shut them off entirely, in a way that’s less easy to do for a research project which you expect to end, or for a physical product that I sell you, right? So, software as a service is a thing that you can monitor, and you can change the API all the time, and people do, as opposed to, like, if I sell you a hammer, it’s difficult to go and recall the hammer. So, that’s a brief discussion, I hope, of the way we thought about it in the Cambridge book. 
Jon Krohn: 00:45:08
Sounds like a really practical, useful technical read. I’m sure lots of people are now interested in checking it out. Beyond the books that you’ve recently written, we haven’t talked much about your career, so let’s jump into that just a little bit before we open up to the audience questions. So, at Columbia, you’re an Associate Professor of Applied Math; application is clearly key throughout your books as well. Your research particularly centers on computational biology, so things like gene regulatory network reconstruction algorithms, biopolymer dynamics, biophysics. How has computation, and now more recently the prevalence of AI, changed the field of biology as you’ve been working in it? 
Chris Wiggins: 00:45:53
So, when I started working in biology in 1993, which was when I started graduate school, I was very interested in how the methods and mindset of physics could be used to help understand problems in biology. In that sense, it’s not unlike the story we told earlier about Quetelet and social physics, right? It was a time of vigorous engagement in biological physics. So, there was a lot of biological physics when I was in graduate school in the mid-nineties. In the early nineties, I would say biologists found that sort of work maybe entertaining, as long as it didn’t get in the way of real biology as it was understood. And by the end of my PhD, a transformative thing had happened when people started sequencing free-living organisms. So, 1995 was the sequencing of Haemophilus influenzae, the first free-living organism to have its whole genome sequenced.
00:46:45
And right away, people who could pay attention to where the puck was going to go knew that that meant if you could sequence Haemophilus, then you could sequence Drosophila, [inaudible 00:46:54] and eventually rice and mice and chickens and humans. And once you can sequence humans, then you can sequence multiple humans, and then you can figure out genotype-phenotype relationships statistically. So, by the time I finished my PhD, the attitude among real biologists about data had completely flipped. And biologists were publishing papers like, we really need to collaborate with people who know how to make sense of data. That said, it was totally unclear what modeling and making sense of data would mean. It was really a statistically driven problem with effective models and, we would now say, machine learning methods for making sense of it.
00:47:31
That transformation has always been in my mind when working with people in industry. So, talking to people in industry about the way they have some problem, and they understand how readers behave, or users in general, and then suddenly they have an abundance of data and try to now re-investigate that question from a statistical perspective, I can’t help but think about the lessons learned in the nineties as biology went through the pain of becoming a data-driven field. So, that’s what really led me into computational biology: reading these papers about how biologists try to make sense of data and, frankly, not being able to tell what was wheat and what was chaff. Like, I would read these papers and I just could not make heads or tails of what the methods were and whether I should believe these papers. And eventually, I felt like the only way to really know what was good and what was bad was to get in the ring and just try to start using these methods and answering biological questions. So, over the last 20 years, my research has been less about biopolymers, which is where I really started, and has moved into applications of machine learning in biology, biological sequence data, biological image data, biological network data, and working closely with real live biologists to try to think through: how do I reframe questions that are of interest to them as machine learning tasks, execute the machine learning, and then give them some interpretable understanding of their problem armed with the output of that machine learning?
Jon Krohn: 00:48:50
Yeah, it sounds like an amazing few decades to have been in the field and cool that you spotted that puck moving early on in the early nineties. In addition to your work at Columbia, as I mentioned at the onset, you’re also the Chief Data Scientist at The New York Times. You’ve been there for nearly a decade. I am sure that data science has changed a lot over the decade that you’ve been there as well. But what I’d like to focus on in the interest of being able to get to audience questions soon is what is the tech stack like today, particularly for data science at the New York Times? 
Chris Wiggins: 00:49:23
It’s good, compared to what was there when I started. So, when I started, if you wanted to make sense of data, you needed to write your own MapReduce jobs hitting buckets of unstructured JSON in S3. And then eventually we moved to, you could start writing jobs in Hive or Pig, good luck. And then eventually it was decided that the right way to do that was to start our own Hadoop instance on-prem. And that was a bad time. And then, I can’t speak to why that decision was made, but eventually all those machines were dropped into the Hudson River, and we eventually gave all of our data engineering, not all, most of our data engineering problems, to BigQuery, which means that the MapReduce is still happening, it’s just that now that’s Google’s problem. 
00:50:17
And we went from jobs that would fail silently to fast, reliable SQL, right? Fast, reliable access to a relational database was really transformative; all the data analysts became much less grumpy. So that tech stack is something that we’ve built on top of a lot, basically the GCP, the Google Cloud Platform, stack. At this point, data scientists are coding in Python, leveraging scikit-learn heavily as well as other open-source tools; when necessary, data scientists are coding in Go. The data we use are largely read from BigQuery, and a lot of our model output is pushed back to BigQuery so that analysts or future data scientists can put those relational databases to work. Sometimes we’re hosting an API if there’s something that needs to be more performant. Things are scheduled using Airflow, on an Airflow instance that’s also part of GCP, appropriately containerized. And basically, it’s on top of the GCP stack.
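As a concrete illustration of the pattern Chris lays out (read from BigQuery, model in Python with scikit-learn, push scores back to BigQuery for analysts), here is a minimal sketch. The project, table, and column names are hypothetical, and this is not the Times' actual code.

```python
# A minimal sketch of the read-model-write pattern described above.
# Project, dataset, table, and column names are hypothetical.
import pandas as pd
from google.cloud import bigquery
from sklearn.linear_model import LogisticRegression

client = bigquery.Client()  # uses default GCP credentials

# Pull training data with fast, reliable SQL (no hand-rolled MapReduce).
df = client.query("""
    SELECT user_id, pageviews, days_active, subscribed
    FROM `my-project.analytics.user_features`
""").to_dataframe()

# Fit a simple model on the features.
features = ["pageviews", "days_active"]
model = LogisticRegression().fit(df[features], df["subscribed"])

# Push model output back to BigQuery so analysts can query it later.
scores = pd.DataFrame({
    "user_id": df["user_id"],
    "subscribe_propensity": model.predict_proba(df[features])[:, 1],
})
client.load_table_from_dataframe(
    scores, "my-project.analytics.subscribe_scores"
).result()  # block until the load job finishes
```

In production, a job like this would be containerized and run on a schedule by Airflow, per Chris's description.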
Jon Krohn: 00:51:23
Awesome. Yeah, that does sound like you’re using a lot of the cool tools that we’re using at my company, Nebula. And I think ours is a relatively small headache compared to, say, the Hadoop clusters that were thrown in the Hudson.
Chris Wiggins: 00:51:37
Yeah, that was a bad time. 
Jon Krohn: 00:51:38
All right, let’s open up to audience questions. And I think we have somebody with a microphone going around so you can ask directly.
Audience member 1: 00:51:47
Wonderful discussion. I really appreciated hearing the history of data. My question is: good journalism is under a lot of stress these days in terms of resources, right? We just heard this week that the New York Times has axed its in-house sports section. How do you feel about the future of things which are kind of expensive, like the data science team you have at the New York Times? Are you properly resourced? Is the future bright? Is there a strong editorial commitment to generating this data? I benefited a lot from the New York Times Covid database in the early days of Covid, and I was really appreciative of being able to see that. So, what’s the future in terms of the commitment to the kind of work you’re doing there?
Chris Wiggins: 00:52:41
Yeah, so the Covid database is a good example. That was extremely laborious; many, many people were involved, both inside the building and outside the building, to keep that process working. The New York Times in general benefits from scale, right? It operates at a scale larger than most companies do. There are many challenges to journalism right now, some about business and some about trust. The one that’s been well identified for the last 15 years is that the craft of journalism has been intimately associated with the business of newspapering. The business of newspapering has been fed by the ad model, and the ad model has been completely disrupted by digital advertising and information platforms that take far and away the lion’s share of new digital advertising revenue. So, the lifeblood of newspapers and magazines has really dried up in the last 25 years.
00:53:40
So, I’m optimistic about the fact that people keep doing journalism and keep experimenting with new business models. At the New York Times, it’s been pretty well documented for a long time that part of the future bet is to transition from an ad model to a subscription model, in particular a digital subscription model, and that seems to be going well. For more local newspapers, it’s still unclear what the right model is that’s gonna ensure a repeatable, scalable business for local news. Lots of local news properties are experimenting with a variety of potential revenue sources: sponsorship, advertising, subscription, events. All sorts of different models are being explored. In the large, I’m optimistic, because I think journalism is important and I think people will find a way. It’s still unclear which model is gonna work well.
00:54:42
At the Times, I would say things are certainly much better resourced than at most journalism properties. There’s, I think, a continued commitment to telling the truth and also to being creative about narrative style, including data journalism. Many of the things we’re working on, I should be clear, are not only newsroom-facing; they’re also business-facing. There are so many opportunities to develop and deploy machine learning for decisions about the paywall, or for innovating on advertising in a way that can be privacy-preserving, or for building better recommendation engines. Those things sit outside the question of whether a company can support a computer-assisted reporting team or a data journalism team. Fortunately, the New York Times has the scale to support both of those, as well as data visualization, even in the opinion group, which is sort of separate from the newsroom.
00:55:37
So, the New York Times [inaudible 00:55:38] operates at a scale where we’re able to support all of those things. As to whether different companies can do that, given that most journalism companies operate at smaller scales, that’s more difficult. It’s not something I’m spending my time on directly, but I’m hopeful that companies will be able to create extremely affordable, perhaps open-source, solutions that allow local journalism properties to do some of this at scale, including subscription models, advertising models, and recommendation engines. It should be possible for small, good-faith journalism companies, including local ones, to do this for much less spend than was necessary for the New York Times to get into this back when it started its digital paywall, which was in 2011, so 12 years ago now. That was a real innovation. And at this point, my hope is that there’ll be sufficient services that more journalism properties, even small ones, can invest in this type of tech.
Jon Krohn: 00:56:40
[inaudible 00:56:41] if there’s one last question, we have time for that. Yeah, we’ve got one right here in the center, in the gray.
Audience member 2: 00:56:47
Hi there. Thank you very much for the discussion. So, I’m gonna attempt to engage with the applied ethics conversation, although my heart’s going pretty fast. I’m a pretty early-career academic technologist, I guess; I’m doing biostatistics at Weill Cornell Medicine. What I’ve seen at this conference so far is that our community is kind of a celebration of how the academic community can develop things with industry. And that’s literally celebrated here, right? That trained people can get their ideas, you know, up into big companies and get lots of renown. So, I’m wondering if you can share a little bit with us, maybe an example, of something from applied ethics moving from academia into the corporate domain in the same way that innovative ideas or engineering tools do. Obviously, it hasn’t happened completely. I’ll cut myself off there.
Chris Wiggins: 00:58:08
Yeah, so there are lots of challenges to using ethics as an organizing principle and a shared vocabulary in industry. Fortunately, I think you’re hitting on something, which is that academic training produces a community of people who already have a shared vocabulary for many of these things. Particularly through academic norms around human subjects research, people enter other places, including the workforce, with a common vocabulary and a common set of norms for thinking about the ethical impact of their work, even when they move into industry. There are many companies that have directly attempted to bring in an institutional review board process; often it’s called an organizational review board rather than an institutional review board, and often they adopt principles as understood from the human subjects research community. So, I do think there are good examples of companies where that applied ethical framing was shared by many people and was able to at least organize the conversation.
00:59:17
Ultimately, though, these things are set by leadership, right? Every time you decide how to recruit somebody, whether to promote somebody, whether to launch a product, those are pause points, if not checkpoints. They are opportunities to reflect on many things: Is this gonna enhance the business? Is this gonna be a brand risk? And also, how does our conception of applied ethics work here? Ethics is often about not an algorithm but a decision, right? The decision as to whether or not to use this algorithm in this context is a moment to ask: is this an ethical decision? So I think applied ethics can give a shared vocabulary, which can be a useful force. You’re right that many times data scientists are coming from an academic training, and they are valuable, which means they have impact, right?
01:00:11
So, when data scientists decide to flee a company, that has an impact on the company. When data scientists decide how to design and deploy a product, which, getting back to what we were saying at the beginning, doesn’t happen unless data scientists are willing to understand the context, understand how their technology is used, and speak up about how that technology should be used, that has an impact on companies. As somebody said to me a few weeks ago, there was a war for talent, and talent won. By being able to make sense of the world through data and to develop and deploy products, you are gonna have a seat at the table. You’re not a fungible, easily replaced aspect of corporate function, and therefore of corporate power. So, I do think it’s very useful for the data science community to invest the time it takes to feel comfortable discussing and coming to consensus on applied ethical questions, and not to cede ground to somebody else to make those decisions. I think it’s both a short-term and a long-term disservice to data scientists to cede that ground and say, that’s not my department; somebody who doesn’t understand data science as well as I do should be making those decisions, including the applied ethical ones.
Jon Krohn: 01:01:31
Nice. That was a great question. I can see why it would be tough to express such a great question, so nicely done, and of course a great answer as well. Before I let you go, Chris, on this show, on the SuperDataScience podcast, I always have two questions for my guests. Do you have a book recommendation for us, other than your own books?
Chris Wiggins: 01:01:53
So, maybe one way to flip the question is to ask: what are the books that I cited a lot in my own book? One of the things that surprised me in learning about the history of computation was the way that computation with data is laborious. It requires labor in a way that pencil-and-paper mathematics does not. And immediately, that labor was gendered. One of the things that struck me in researching the book was how, as soon as you had people building special-purpose hardware for making sense of data, people said, okay, that’s women’s work, and that’s men’s work. There’s a good book about that by Janet Abbate called Recoding Gender, and there’s a particular chapter in that book, Women at the Dawn of Computation, which I think does an amazing job of illustrating how persistent and how early that construct was: when people looked at the set of jobs to be done in order to acquire data, make sense of data, analyze the data, and deploy it as a piece of hardware, how immediately people said, okay, well, that should be men’s work and that should be women’s work.
01:02:57
Which, by the way, we only have access to because of some great litigation. There was a bunch of lawsuits around the ENIAC computer and how people tried to turn it into companies. And so you get all of these great interchanges with lawyers saying to the women who designed and programmed the ENIAC things like, well, when you say you programmed it, you just mean you plugged in wires, right? And the women say, no, that means I designed the algorithm and figured it out, and eventually I plugged in the wires as well. In any event, that book I think is a good one for helping us understand how data and labor, because doing things with data is complex and involves teams, have long manifested the same dynamics we see in so many parts of society, including the immediate gendering of labor. So that’s a good one.
01:03:46
I would say The Empire of Chance is another good one, on the history of statistics. That book, I think, does a good job helping us understand the tyranny of “p-valueology”: the incredible artifice by which it was decided that truth should be an algorithm, and that at the end of the algorithm, if one certain number is less than 0.05, then a thing is true, and if not, it is false. How that came to be is a series of historical accidents involving fights between mathematical statisticians, decisions by lawmaking bodies, and decisions by editors of textbooks, a set of things that has no sacredness to it whatsoever, and yet we teach it to young people as though there actually is a truth-making algorithm. And I think breaking that sacredness, and realizing that it is in fact profane, is useful, and The Empire of Chance is a good book for revealing that history.
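For readers who want to see the "truth-making algorithm" Chris is criticizing in its barest form, here is a minimal sketch of the ritual as it's typically taught. The data are synthetic, and the point, per the book, is precisely that nothing about the 0.05 cutoff is sacred.

```python
# The decision rule described above, reduced to its bare ritual form.
# The data are synthetic; the 0.05 threshold is the historical accident.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=0.5, scale=1.0, size=30)

# Two-sample t-test comparing the group means.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# The historically accidental "truth-making" step:
if p_value < 0.05:
    print(f"p = {p_value:.3f} < 0.05: declared 'true' (significant)")
else:
    print(f"p = {p_value:.3f} >= 0.05: declared 'false' (not significant)")
```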
Jon Krohn: 01:04:47
Great recommendations, and both of them straddle this blend of humanities, statistics, and quantitative discipline that we’ve had throughout this whole interview. So, thank you so much, Chris. Fantastic recommendations, fantastic interview. No doubt tons of listeners would love to be able to track your thoughts after the show as well. How can people follow you?
Chris Wiggins: 01:05:12
Just this morning, Meta disabled my fake Threads account. I made a completely bogus Threads account, and apparently their algorithms found me and said, we have now revoked your Threads account. So, you can’t follow me there. Twitter: I deleted all of my tweets in December of 2016, when I was just like, holy moly, this place has become a dumpster fire. So, there’s not a lot of Twitter presence for me; I think lately I’ve been tweeting about my books a bit, and that’s it. I guess that leaves only LinkedIn at this point [inaudible 01:05:48] like none of the above. So, some things will be happening there, but otherwise, you’ll have to find me in real life here in New York City.
Jon Krohn: 01:05:57
Very well. Thank you very much, Chris. And yeah, thank you for taking the time for this illuminating conversation. Let’s give him a hand. 
Chris Wiggins: 01:06:04
Thank you, Jon. 
Jon Krohn: 01:06:10
Thank you. 
Chris Wiggins: 01:06:11
Thanks for your excellent questions, Jon. 
Jon Krohn: 01:06:18
Hope you enjoyed that special live-filmed episode. Chris can go so deep on any topic that came up today, citing specific works and people; you can tell we only scratched the surface of his vast knowledge, and boy does he ever explain it well. He’s someone I admire and aspire to be more like.
01:06:34
In today’s episode, Chris filled us in on how statistics has its roots in the late 18th century with the state’s use of data to make predictions. He talked about how the humanities should be a part of data science curricula, because we need to be able to communicate beyond technologists and we need to appreciate the societal impact of our work. He talked about how data and data science are rearranging power, but in a way that’s difficult to track because it’s probabilistic and sometimes invisible. And he talked about how biology has been transformed by data in recent decades, in large part due to the genetic sequencing that began in the nineties.
01:07:06
And he talked about how the New York Times data science team uses a modern and accommodating tech stack, including Google BigQuery, Airflow scheduling, and Docker containers. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and URLs for Chris’s social media profiles, as well as my own social media profiles, at www.superdatascience.com/703. If you’d like to get a copy of either of Chris’s books, be it How Data Happened or Data Science in Context, well, I have good news for you: I will personally ship 10 physical copies of Chris’s books to people who, by Friday, August 11th, share what they think of today’s episode on social media. Specifically, in order to make this manageable for me, please convey your thoughts on the episode by commenting on and/or resharing the LinkedIn post that I publish about Chris’s episode from my personal LinkedIn account on the day the episode is released.
01:08:00
I will pick the 10 book recipients based on the quality of their comment or post. In the past, we found it easy to send physical books to listeners in the US, in Canada, and in the UK. There may be other regions where it’s easy too. If we can’t get a physical book to you easily, we’ll try to buy you a digital version instead. All right, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another awesome episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors. You can support this show by checking out our sponsors’ links, which are in the show notes. 
01:08:40
Or you could rate or review the show on your favorite podcasting platform. Or you could like or comment on the episode on YouTube, or you could recommend the show to a friend or a colleague whom you think would love it. But most importantly, I hope you just keep listening. You can subscribe to be sure not to miss any awesome upcoming episodes. All right, thank you. Cheers. I’m so grateful to have you tuning in and I hope I can continue to make episodes you love for years and years to come. Until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 