44 minutes
Data Science, Machine Learning

SDS 732: Data Science for Astronomy, with Dr. Daniela Huppenkothen

In this episode, Jon Krohn meets Daniela Huppenkothen at the University of Amsterdam’s astronomy department for a wide-ranging discussion about building instrumentation for telescopes, collecting data from outer space, and tackling astronomy’s problem of enormous data volumes.

About Daniela Huppenkothen

Daniela is a (Data) Scientist at the SRON Institute for Space Research. She works on the interface of data science and astronomy, developing statistical and machine learning models to infer physical knowledge from (astronomical) data. Much of her work focuses on time series analysis, but she’s also more broadly interested in mitigating systematics in observational data. She is also broadly interested in community building and collaboration management. Currently, Daniela co-leads a project to design and evaluate interventions to build positive, welcoming virtual spaces during participant-driven workshops and hackathons.

Overview

Daniela explains that the difficulty with astronomy, as opposed to other scientific fields, is that it is observational: astronomers are dealing with objects that are often light-years away, leaving them reliant on the quality of their equipment. Telescopes can reveal wavelengths of light that are invisible to the naked eye, which astronomers can use to measure changes in brightness and the life cycles of stars over time.

To drill into these data, Daniela describes three ways machine learning is applied. The first is exploratory data analysis: supervised machine learning alone is not enough, because every new telescope reveals phenomena that nobody has labeled yet, so astronomers use anomaly detection and representation learning to ask whether newly observed events fall into distinct classes or are all the same. The second is automated data processing, which identifies what the object being observed actually is. The third is causal inference, identifying the physical processes that produced the light, which Daniela says is almost always an astronomer’s ultimate goal.

Astronomy enthusiasts will be delighted to hear that a great deal of data gathered from the universe is available to the public, including data challenges where anyone can start to explore the serious questions about our universe. Daniela notes that astronomy data have a significant advantage in that they are not beholden to the ethical implications apparent to, say, biological or personal data.

Listen to hear about the Hack the Hackathon event, how to collect gravitational waves, and what Daniela looks for in PhD candidates.

Items mentioned in this podcast:

SRON Netherlands Institute for Space Research
Open PhD Position
PLAsTiCC Challenge
Starsounder
Rubin Observatory
NASA Open Data Portal
SKA Observatory
The Last Stargazers by Emily Levesque

Did you enjoy the podcast?

Have you ever been involved in a citizen science project?
Podcast Transcript

Jon: 00:03

This is episode number 732 with Dr. Daniela Huppenkothen, Scientist at the Netherlands Institute for Space Research.

00:19

Welcome back to the Super Data Science Podcast. Today, Dr. Daniela Huppenkothen joins me on the show to fill us in on how data science and machine learning in particular are essential in the fascinating field of astronomy. Daniela is a scientist at both the University of Amsterdam and SRON, which is a Dutch abbreviation that translates to the Netherlands Institute for Space Research. She was previously an associate director of the Institute for Data Intensive Research in astronomy and cosmology at the University of Washington, and she was also a data science fellow at New York University. She holds a PhD in astronomy from the University of Amsterdam. Most of today’s episode should be accessible to anyone, but there is some technical content in the second half that may be of greatest interest to hands-on data science practitioners.

01:05

In today’s episode, Daniela details the data earthlings collect in order to observe the universe around us, the three categories of ways machine learning is applied to astronomy, and how you could become an astronomer yourself if you’d like to. All right, let’s jump right into our conversation.

01:23

Daniela, welcome to the Super Data Science Podcast. Thank you for welcoming me to your facilities here at the University of Amsterdam. We are looking out, and I’ll try to get at least some photos, maybe some video footage, that we can put into the YouTube version of this episode, because there’s actual hardware here on top of the building. We’re sitting on the fourth floor looking out of a beautiful giant glass window at the telescopes, or the domes that contain the telescopes, outside, which is cool.

Daniela: 01:56

Yeah. Thank you for having me. I will mention the caveat that these are not used for research. The Netherlands is cloudy most of the time, so it’s not an ideal place for astronomy, but it’s used for a lot of teaching and outreach.

Jon: 02:09

Maybe better for meteorology then. Well, still, yeah, very cool. And the whole campus here, I’d never been to the University of Amsterdam before and it seems like a really slick campus. Of course, there’s a million bicycles parked out front.

Daniela: 02:26

Yes.

Jon: 02:27

Which is, I’ve only been here one day and I’m still getting used to the volume of cyclists that I need to look out for as I’m crossing the street.

Daniela: 02:34

Yeah. There are a lot of bicyclists everywhere in the Netherlands.

Jon: 02:37

There seems to be like a 10 to one bike to car ratio in most of the places I am.

Daniela: 02:41

Yeah, that’s probably true. And bicycles have a lot of right of way everywhere, which is nice if you’re on a bicycle.

Jon: 02:47

Yeah. So we know each other through Reshama Shaikh, who we were waxing very lyrically about just before recording. Reshama is this supernode globally in data science communities. She’s very active in women in data science, in PyData. And actually it’s something else that you know her through, which was a fascinating thing. Fill us in a little bit on this program.

Daniela: 03:13

Yeah. So I’ve been organizing hackathon-style events, mainly for astronomers, for nearly 10 years now. And for the last couple of years, with a bunch of researchers of hackathons, I’ve been organizing this series of workshops called Hack the Hackathon, where we think more broadly about what hackathons are and how we can make them, and related events like code sprints and game jams and stuff like this, welcoming. And Reshama attended last year and is on our organizing committee this year. So it’s been really fantastic to work with her on that.

Jon: 03:51

Awesome. Yeah. So Reshama, thank you so much for setting up this connection with Daniela. I know this is going to be an amazing conversation. So you’re at the University of Amsterdam, obviously, where you’re a researcher, but you are also a scientist at something called SRON, S-R-O-N, in all caps. What does that mean? What does it do?

Daniela: 04:13

SRON is an acronym. It stands for Stichting Ruimteonderzoek Nederland, which is Dutch for the Netherlands Institute for Space Research. SRON builds hardware for space telescopes, so telescopes hosted on satellites, and we build instrumentation both for telescopes that look down on the Earth to do climate studies and also telescopes that look up into the universe to study things in our universe. We are a partner to the European Space Agency. We build instrumentation that flies on satellites launched by the European Space Agency, but we also partner, for example, with NASA and with the Japanese Space Agency, JAXA, which just launched a satellite that we’re all very excited about called XRISM.

Jon: 05:11

Very cool. And so yes, that’s actually, we were talking a little bit before we started recording, and you’re actually making hardware for these telescopes?

Daniela: 05:19

Oh, not me personally. I am trying to be the glue between the people who make hardware and the people who are interested in the physics. And I try to take the data that comes out of the hardware and make it useful for the physics we want to learn about the universe, which is really what a data scientist does. But yes, SRON builds instruments, it builds parts of instruments for space telescopes. We’re currently very involved, for example, with the next generation of detectors for X-ray telescopes. There is a mission coming up where, coming up is a relative term, space missions take a long time to develop. This one will launch probably in 2037. And so SRON is one of the key institutes and one of the instruments on this large European mission called Athena.

Jon: 06:16

Very cool. Yes. And that was a perfect lead in to my next topic, which is what kind of data do you collect? How do you observe the universe? We have some idea of what’s happening out there, well beyond our planet. Yeah, it must be vast amounts of data. Yeah. What kinds of data are you collecting?

Daniela: 06:42

Yeah, so I think the first thing I always try to point out is that astronomy is, for the most part, a purely observational science. For the vast majority of our science questions, we can’t run experiments like you would do in other fields, but something in the universe produces light, and then that light travels sometimes millions or billions of light years. And then we build telescopes to collect that light. And our task is to try and understand the physical mechanisms that produced that light in the star or black hole or whatever we are observing, whatever sent that light, based on this tiny speck we see in our telescope. And so there are different options. You can look at images. I think this is the most common astronomy data the public is familiar with. You have one of those images on your computer, on your-

Jon: 07:38

Oh yeah.

Daniela: 07:38

Yeah.

Jon: 07:38

So I have on my laptop, because my company, my machine learning company, is called Nebula, a photo taken by the James Webb telescope on this sticker that’s on the top of the laptop. And I completely fumbled that when I walked in and Daniela was like, “Oh, look at that.” And I called it the James Hubble.

Daniela: 08:04

Two different telescopes, both amazing in different ways. Yeah. So most people are familiar with these beautiful images of nebulae and stuff that get released to the public, but that only works for objects that are larger than a pixel in our camera. And actually most things in the universe are really far away. And so as they get farther away, they appear like what we call a point source. So a single pixel essentially. And so for these kinds of sources, what we also do is we split light up by wavelength. So we can look at how much light gets emitted in red or in blue. We can look at wavelengths that are not in the visible. We can look all the way from radio with radio telescopes to X-rays and gamma rays.

08:57

And because we know that different things in the universe emit light at different wavelengths, we can use that to study the kind of processes that generated that light. And then we can also look at how the brightness changes as a function of time. If you look at the sky, usually you see a bunch of stars and planets and they don’t really change, except for flickering that’s due to the Earth’s atmosphere. But actually, if you point a telescope at a star, you’ll very quickly find out that it varies on timescales from seconds all the way to years. The sun, for example, has an 11 year solar cycle. It also has oscillations on much shorter timescales.

Jon: 09:41

I did not know that.

Daniela: 09:42

It has solar flares. And so other stars do that too. The things that I look at, which are black holes, we see changes in brightness on timescales of milliseconds all the way to decades. And so we have telescopes that observe as a function of time. And so we can study the time series to learn something about what’s happening in these sources.
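
One common first step when studying brightness as a function of time is a periodogram, which shows how much of the variability sits at each frequency. The sketch below is not from the episode: it is a minimal, dependency-free illustration using a plain discrete Fourier transform on a made-up light curve. The function names, the 20-sample period, and the noise level are all assumptions chosen for the example.

```python
import cmath
import math
import random


def periodogram(flux):
    """Power at each positive Fourier frequency index k = 1 .. N//2."""
    n = len(flux)
    mean = sum(flux) / n  # subtract the mean so the constant level adds no power
    powers = []
    for k in range(1, n // 2 + 1):
        total = sum((f - mean) * cmath.exp(-2j * math.pi * k * t / n)
                    for t, f in enumerate(flux))
        powers.append(abs(total) ** 2 / n)
    return powers


def strongest_period(flux, dt=1.0):
    """Period (in units of dt) of the highest periodogram peak."""
    powers = periodogram(flux)
    k_best = powers.index(max(powers)) + 1  # +1 because the list starts at k = 1
    return len(flux) * dt / k_best


def toy_light_curve(n=200, period=20.0, noise=0.5, seed=0):
    """Hypothetical evenly sampled light curve: a sinusoid plus Gaussian noise."""
    rng = random.Random(seed)
    return [10.0 + math.sin(2 * math.pi * t / period) + rng.gauss(0.0, noise)
            for t in range(n)]


if __name__ == "__main__":
    print(strongest_period(toy_light_curve()))
```

Real light curves are often unevenly sampled and have gaps, in which case methods designed for irregular sampling are usually a better fit than a plain Fourier transform.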

Jon: 10:06

This is a bit of a random tangent, but you could maybe answer this for me better than anyone I’ve ever met. So you sometimes hear, is it true there’s a risk of solar flares just suddenly causing all electronics on Earth to stop working?

Daniela: 10:18

Yeah. So really, really bright solar flares. I don’t know about all things on Earth. So the Earth has its own magnetic field, which actually shields us pretty well from charged particles and high energy radiation. This is why we have to put X-ray telescopes into space, because X-rays don’t get through the atmosphere, which is very good for us because X-rays are not super healthy, but bad for astronomy, which means we have to build expensive space missions to build X-ray telescopes. But yeah, if you have a really bright solar flare, it will deposit a lot of, especially, charged particles into the Earth’s environment. And that, I think, is a big danger for satellites, for example.

Jon: 11:07

Okay. Well, that’s somewhat reassuring. Because I have this potential doomsday scenario that we could just wake up and the solar flare… I don’t even have to wake up. It could be at any time.

Daniela: 11:18

So if I can tell you a fun story, one of the types of data I worked with during my PhD comes from a source called a magnetar, and this is the leftover of a stellar explosion. Some stars, when they explode, leave behind a remnant called a neutron star. And they’re basically the densest objects we know in the universe. They’re about the mass of the sun, but compressed to about 10 kilometer radius. And these magnetars have extremely strong magnetic fields, 10 orders of magnitude more than what we can do on Earth. And occasionally they release massive amounts of energy in the form of what we call a giant flare. And they release massive amounts of X-ray radiation in a fraction of a second, 200 milliseconds. And the one in 2004, even though the source was thousands of light years away, the gamma rays and X-rays reacted with electrons in the Earth’s magnetosphere. It was enough that it caused a measurable depression of Earth’s magnetic field. You could measure the impact of that giant flare from thousands of light years away in the Earth’s magnetic field.

Jon: 12:42

Wow.

Daniela: 12:43

Doesn’t mean it’s dangerous to us, it’s much too far away, but I thought that was pretty cool. We don’t often get measurable effects on Earth from these things that are very far away.

Jon: 12:54

And it just occurred for 200 milliseconds.

Daniela: 12:56

So the initial spike was 200 milliseconds and then it had this long tail of about 400 seconds or so.

Jon: 13:05

But it was this thing that had been traveling for thousands of years and it just passed us for a few seconds. That’s wild.

Daniela: 13:08

Yeah. Yeah. One thing I’m really interested in is what’s called fast radio bursts. These are extremely short, extremely bright blips of radio emission, and they were first detected in archival data, so data that had been collected for a different purpose. And then people looked through it and they saw this few millisecond blip of very bright radio, and radio telescopes are subject to a bunch of interference. There was one class of these blips that got tracked down to the observatory’s microwave. If you open the microwave before it’s run out-

Jon: 13:48

Like the microwave for warming up food?

Daniela: 13:50

Yeah. It creates a tiny blip of radio. But that explained one class of these, but it didn’t explain all of them. And then we saw them in other telescopes and we now know that they are real objects, but they’re microsecond to millisecond flashes of radio and we don’t know what produces them.

Jon: 14:08

Oh, really?

Daniela: 14:08

There’s dozens of theories out there, there’s purpose built telescopes, they’re really interesting. They must be very, very energetic, but we don’t know what they are. And so it’s a really fun topic to think about that even now we detect new phenomena we’ve never seen before.

Jon: 14:31

That is super cool. I guess there’s a lot to explore out there, but it’s always, so you mentioned that it’s always light, and I get that there’s lots of wavelengths, so it’s not necessarily visible light. It could be radio is a kind of light. Yeah.

Daniela: 14:45

Yes. Yeah.

Jon: 14:48

Yeah. Yeah. So everything that we do with astronomy is light in some way or another. You’re never collecting sound.

Daniela: 14:55

No.

Jon: 14:56

Okay. Okay.

Daniela: 14:56

If I say yes, I will deeply annoy some of my colleagues here who work on gravitational waves. You might have heard some of the announcements. This was a big prediction from Einstein’s theory of relativity. If you have two very massive objects that spiral in, they can produce gravitational waves. These are basically periodic distortions of space time, which sounds very sci-fi, and we can measure them on Earth. It’s extremely hard to measure. There’s actually a lot of machine learning involved in trying to do that. You’re trying to measure a signal that’s orders of magnitude weaker than the noise. And so there you can also, for example, study two black holes that spiral in. You see that in gravitational waves. There’s also particle detectors. So cosmic ray detectors, which try to detect particles emitted from astrophysical sources.

Jon: 16:04

Very cool. So astronomical observations can be light, and it sounds like that is maybe one of the more common kinds of data collected.

Daniela: 16:15

Yeah, people have looked at the sky for thousands of years and have looked at things and went, what is this? And there are, throughout human history, astronomical observations. For example, we have lots of records from ancient Chinese astronomers who saw what they call guest stars, which we now know are supernova explosions. So explosions of stars that get very, very bright for days to weeks and then fade away again. Actually, one of the units we use in astronomy has its basis in a unit developed in ancient Greece, 2000 years ago.

Jon: 16:59

Really?

Daniela: 16:59

It’s very annoying to use, but it’s still the standard in astronomy in some parts of it. Yeah.

Jon: 17:09

So light is very common, but gravitational waves are also collected and particles.

Daniela: 17:13

Yeah.

Jon: 17:14

And so just really quickly, this is… I don’t want to dwell on this too much, I want to get to the machine learning applications which are next, but just really quickly, how do you collect a gravitational wave or a particle? How do you observe that?

Daniela: 17:29

Oh, this is a bit outside of my expertise. For gravitational waves, what you do is you build a laser. The light is split and sent out in perpendicular directions in tubes that are several kilometers long. And there’s a mirror at the other end which reflects it back onto a detector. And if you have two waves and the maxima of the waves hit at the same time, you get a stronger effect. And if they hit not in phase, they cancel each other out, which is called interference. And what gravitational waves will do is they will stretch the space time of these arms that the laser goes through, and that will change the interference pattern. And based on that, you can then say, aha, a wave has gone through. The problem is a truck that has gone past might do the same. And so this is why there’s so much noise they have to contend with.
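
The interference effect Daniela describes can be sketched with the textbook formula for an ideal two-arm interferometer, where the light reaching the detector depends on the difference in arm length. This is only an illustration under stated assumptions: the 1064 nm wavelength, the 4 km arm length and the strain of 1e-21 are typical published figures for current detectors rather than numbers from the episode, and the function names are made up for this sketch.

```python
import math

LASER_WAVELENGTH_M = 1064e-9  # assumed infrared laser wavelength
ARM_LENGTH_M = 4000.0         # assumed arm length ("several kilometers")


def output_fraction(arm_difference_m, wavelength_m=LASER_WAVELENGTH_M):
    """Fraction of the input light at the detector for an ideal, lossless
    interferometer whose two arms differ in length by arm_difference_m."""
    # Light travels each arm out and back, so the path difference is doubled.
    phase = 2 * math.pi * (2 * arm_difference_m) / wavelength_m
    return math.cos(phase / 2) ** 2


def differential_arm_change(strain, arm_length_m=ARM_LENGTH_M):
    """Change in the arm-length difference for a dimensionless strain h."""
    return strain * arm_length_m


if __name__ == "__main__":
    print(output_fraction(0.0))                     # arms equal: waves in phase
    print(output_fraction(LASER_WAVELENGTH_M / 4))  # waves out of phase: cancel
    print(differential_arm_change(1e-21))           # illustrative strain value
```

Under these assumed numbers the length change is many orders of magnitude smaller than the laser wavelength, which gives a sense of why anything that shakes the mirrors, such as a passing truck, competes with the signal.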

Jon: 18:32

Yeah. The signal is orders of magnitude smaller than the noise. Yeah, that’s wild.

Daniela: 18:36

One thing that I wanted to mention about astronomy data is that a lot of the data is publicly available, and a lot of it has been under-explored. There are large, well-designed archives; for example, NASA has several of them. There are data challenges that telescopes run where you can explore specific problems in astronomy with things like machine learning. And I like astronomy because it’s a great playground for new algorithms where you don’t have to worry a lot about ethical implications. We want to do good science, we want to do things right, and we want to publish good results, but except in some limited cases, the results we publish are unlikely to directly affect anyone’s life. And so there’s this great playground of very big data sets to explore algorithms with.

Jon: 19:44

Perfect. So let’s talk about that. That was an amazing segue. You’re doing my job for me. So once we’ve collected the data and many of these data are published and available publicly for our listeners to be playing around with, what can people then do? How is machine learning or statistical models applied to these astronomical data?

Daniela: 20:07

Yeah. So part of it is the big data problem. Astronomy now has a lot of data. Astronomers used to hunch over photographic plates and make hand measurements and add annotations. This is not really what we do anymore. For example, the European Gaia satellite observed over 1.7 billion stars in our galaxy. Nobody’s going to sit down and analyze them by hand. There’s, for example, the Vera C. Rubin Observatory, which is a US telescope funded by the NSF and the DOE, which is currently being built in Chile, and that will observe something like 40 billion objects. And I also said earlier that the brightness of things changes a lot. And so, one thing that it’ll do is it will generate of the order of 10 million alerts per night, 10 million messages where it says, I have observed this source before and I’ve observed it now and it is brighter or dimmer, or there was no source and now there is, or there was a source and now there is no source. And so we have to filter out which of those 10 million are actually interesting-

Jon: 21:21

Which ones are actually aliens.

Daniela: 21:23

For example. There’s also things like the Square Kilometer Array, which is a radio telescope currently under construction in Australia and South Africa. And that will generate of the order of an exabyte of data per year. And so dealing with that kind of data volume is a challenge in itself. For me, most interestingly, it’s not just big data. Our instruments and data sets are also getting more complicated, and so that makes data processing much harder. It also means we need better theoretical models and simulations to understand what’s happening. And that is also computationally expensive. And so machine learning, I was thinking about this earlier and I think I can put it into three categories. There is the exploratory data analysis type thing and I said serendipity is very important, which means that we often, whenever we build a new telescope, we see something new that we’ve never seen before.

22:22

And that means if we deploy a supervised machine learning algorithm, we’ll never find those things. And so anomaly detection is really a key part, but then we find something new and like with these fast radio bursts, we don’t know what they are. And so the first thing you do is you try to do a bunch of things that are usually guided by some ideas of what the physics might be, but it’s a lot of exploratory data analysis. And so what we do is what’s called representation learning, where we’re trying to learn useful representations of the data and say, can we actually separate these fast radio bursts that we observed? Are there actually two classes of them or are they all the same? Are there some that are different in some way? And then there’s things like automated data processing where a lot of the classical supervised machine learning is really useful. For example, in the case-

Jon: 23:26

This is now number two?

Daniela: 23:26

Yes, this is number two. Number two is automated data processing, which is more, in general, the classical supervised machine learning. And that might be questions for the Vera Rubin Observatory: this thing that we see in our alert, is it real or not? Was it an astronomical source or was it a small satellite going through our image or a plane or something else? It’s also things like what we call transient classification where we say, this alert we’ve just seen, is it an exploding star? Is it a regular star that just has a flare? Is it a black hole? What’s happening? And then the third part that I’m really interested in is how can we use machine learning to help us in our causal inference? I think one thing that is useful in the context of machine learning and astronomy is that your ultimate goal is almost always causal inference.

24:27

We want to know what made this light, which I think is different from a lot of prediction problems that you might encounter in other machine learning contexts. And so causal inference is very hard with neural networks, for example, which are very good at finding correlations. But making these correlations mean anything is challenging. And so the things I’m interested in, for example, are neural networks as surrogate models for expensive physical simulations. So where we have a model that takes a number of parameters, say the black hole mass and some other parameters, and it generates the data that we might see. And often these are computationally expensive. And so what we do is we train a neural network on lots of these simulated data sets so that the neural network can do the same calculations but a thousand times faster. There’s also things like physics informed neural networks where you can put inductive biases to enforce certain types of physically meaningful symmetries in your neural networks and things like that.
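
To make the surrogate-model idea concrete, here is a minimal sketch, not the method used in Daniela’s group. It assumes a toy one-parameter "simulation" (a made-up analytic function standing in for a slow physics code) and fits a tiny fully connected network with one hidden layer in plain Python; every function name and hyperparameter is hypothetical and chosen only for the example.

```python
import math
import random


def expensive_simulation(mass):
    """Toy stand-in for a slow physics code: one parameter in, one observable out."""
    return math.sin(3.0 * mass) * math.exp(-mass * mass)


def init_network(hidden=8, seed=0):
    """Random starting weights for a network with one tanh hidden layer."""
    rng = random.Random(seed)
    return {
        "w1": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b2": 0.0,
    }


def predict(net, x):
    """Evaluate the surrogate at parameter value x."""
    hidden = [math.tanh(w * x + b) for w, b in zip(net["w1"], net["b1"])]
    return sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]


def train(net, xs, ys, epochs=1000, lr=0.02):
    """Full-batch gradient descent on half the mean squared error."""
    n, size = len(xs), len(net["w1"])
    for _ in range(epochs):
        g_w1, g_b1, g_w2, g_b2 = [0.0] * size, [0.0] * size, [0.0] * size, 0.0
        for x, y in zip(xs, ys):
            hidden = [math.tanh(w * x + b) for w, b in zip(net["w1"], net["b1"])]
            err = sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"] - y
            g_b2 += err
            for j in range(size):
                g_w2[j] += err * hidden[j]
                # Backpropagate through tanh: d tanh(z)/dz = 1 - tanh(z)^2
                back = err * net["w2"][j] * (1.0 - hidden[j] ** 2)
                g_w1[j] += back * x
                g_b1[j] += back
        for j in range(size):
            net["w1"][j] -= lr * g_w1[j] / n
            net["b1"][j] -= lr * g_b1[j] / n
            net["w2"][j] -= lr * g_w2[j] / n
        net["b2"] -= lr * g_b2 / n
    return net


def mean_squared_error(net, xs, ys):
    return sum((predict(net, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_training_set(n_points=25, low=-1.5, high=1.5):
    """Run the 'simulation' on an evenly spaced set of parameter values."""
    xs = [low + (high - low) * i / (n_points - 1) for i in range(n_points)]
    return xs, [expensive_simulation(x) for x in xs]


if __name__ == "__main__":
    xs, ys = make_training_set()
    net = init_network()
    print("error before training:", mean_squared_error(net, xs, ys))
    train(net, xs, ys)
    print("error after training:", mean_squared_error(net, xs, ys))
```

In practice one would use a deep learning library and a real simulator with many parameters. The point of the sketch is only that the training targets come straight from the simulator, so the trained network can be checked against it directly.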

Jon: 25:35

Yeah, this ability to use machine learning models to closely approximate some more rigorous analysis is something we’ve seen a lot. Simulations are really great for this. So a common example that I’ve talked about on the show a couple of times recently is with predicting the weather. There’s fluid dynamics simulations that are extremely compute-expensive, but then we can do it with a deep learning model for many orders of magnitude less compute and get results faster.

Daniela: 26:05

I really like that problem. I have done statistics much more than I’ve done machine learning in astronomy, because I think some of the key challenges of using machine learning in astronomy are, first of all, that we want to understand causal relationships in many cases. And then it is very hard to get ground truth training data because things are very far away. And so for a lot of things we just don’t know what the ground truth training data is. We can simulate data that looks like the real data, but if we could perfectly simulate the real data we’d be done, we wouldn’t have to actually study the universe anymore. We’d have made a perfect simulation of it.

26:52

And so we already know that that data that we might train on is not an accurate reflection of reality. And so I like problems where we can get around these kinds of challenges and building emulators, building these kind of surrogate models I think is a good example because it is one of the few problems that I can think of where you have exact ground truth training data because all you’re trying to do is just approximate your physics model as well as you can.

Jon: 27:25

Very cool. Yeah, that’s a crystal clear explanation. To recap quickly, the kinds of machine learning application categories that you’ve identified as the major ones are exploratory data analysis, things like representation learning, because this is critical for us to be able to discover new things. Like you said, when we have a new telescope, we always discover something new. And if you were to just be using supervised machine learning models in those circumstances, you would miss these new things. So exploratory data analysis is critical. Automated data processing, this is, yeah, another, it almost seems like a given with the amounts of data that you have and the inability to be hand annotating things, you’re going to need automated data processing pipelines that are identifying interesting events happening.

28:16

And then, yeah, it sounds like the one that interests you the most is assisting with causal inference. So allowing us to go in the direction of making scientific breakthroughs of being able to say, this isn’t just some correlation, this is some meaningful thing. It isn’t just a microwave that someone’s using in the facility or a truck driving by. And yeah, it’s cool that neural networks can be used for this kind of thing. So these kinds of neural networks that you’ve been using particularly, you were talking about this use case already of approximating physical simulations and being able to do that a lot more rapidly with neural networks. Is that like a dense network typically, that kind of thing?

Daniela: 28:56

Yes. Yeah. So, so far, these neural networks that we’ve needed are actually pretty simple. Fully connected neural networks with four to five layers have been totally fine to do this sort of thing. I have a PhD student who is currently working on a project related to black holes where he’s doing that kind of work, and he’s been doing a lot of fun exploration related to active learning, because the hardest part for us at the moment is not building a neural network that can approximate the data. That’s actually not that hard, but our physics model has 20 parameters and if you wanted to make a grid and you just put 10 points in each dimension, you would already have 10 to the 20 simulations that you would need to run, which we don’t have computational resources for. We would like to finish this within the age of the universe. And so his work has been around how can we figure out where to intelligently put points in that space so that the neural network learns what it needs to learn.
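
The grid arithmetic Daniela mentions is easy to check. The sketch below counts the simulations needed for a full grid and, under the purely hypothetical assumption that each simulation takes one second, compares the total run time with an approximate age of the universe of 13.8 billion years. The function names and the one-second cost are illustrative, not from the episode.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # approximate value


def grid_size(points_per_dimension, n_parameters):
    """Number of simulations needed to fill a regular grid."""
    return points_per_dimension ** n_parameters


def years_to_run(n_simulations, seconds_per_simulation=1.0):
    """Serial run time in years; the one-second default is a made-up cost."""
    return n_simulations * seconds_per_simulation / SECONDS_PER_YEAR


if __name__ == "__main__":
    n = grid_size(10, 20)  # 10 points in each of 20 dimensions
    years = years_to_run(n)
    print(n, years, years / AGE_OF_UNIVERSE_YEARS)
```

Even at one second per simulation, the full grid would take roughly 230 times the age of the universe to run serially, which is why choosing training points selectively matters.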

Jon: 30:13

Very cool. That sounds like fascinating research and it’s an ideal segue into my very next question, which is if somebody wants to be doing this research, we have a listener out there who is also fascinated by space. They want to be doing cutting edge things with machine learning, statistical computing, using these vast amounts of data that are being collected by telescopes we have and other instruments for collecting data from space, gravitational waves, particles. I understand that you have a PhD role open.

Daniela: 30:45

Yes, I do. This is actually a PhD role for neural network surrogate models. So I look at black holes. You can’t actually see black holes because they’re black, they absorb light. And so what we actually see is gas being attracted by the black hole’s gravity, and as it falls in, it radiates away light. These systems often also produce large scale jets, so part of that matter gets spewed out back into the surroundings. And we’re looking at some of the largest black holes we know, of the order of millions to billions of times the mass of the sun. And we look at the effects of those jets from these black holes on their environment and how they change when and how galaxies form stars. And we have very good numerical models for this that we can’t actually run fast enough to model our data.

31:51

And so we are going to build surrogate models with machine learning for these. And one of the telescopes that SRON is involved in, called XRISM, just launched this September, and that will have some of the best cutting edge data coming in next year to do this kind of science. And that’s what the PhD position will be on: using that data and using clever methods from machine learning and statistics to try and do this much better, because for most of the methods that people currently use to do this kind of modeling, the physics models are very, very good, but the statistical methods are about 40 years old.

Jon: 32:33

So if somebody wanted to be able to grow into this role, and there might even be some people out there today who have the skillsets, so they could actually apply today to this PhD role, which we will provide in the show notes so that you can apply if you’re interested. And this is also, it’s a cross-appointment at the University of Amsterdam as well.

Daniela: 32:52

Yeah.

Jon: 32:53

So yeah, very cool role working in a top national research laboratory for astronomy as well as a top university. Sounds incredible. Especially if you like riding a bike. So yeah, so other than being able to ride a bike, what kinds of skills do you look for in your PhD candidate?

Daniela: 33:14

I mostly look for things like curiosity. I think what makes a good researcher is basically curiosity and persistence. I think this is really a project that’s going to bridge astronomy and data science. And so I am equally open to people who come out of astronomy programs and people who come out of data science programs. Any candidate would have to have some affinity for both because it is going to be a project at an astronomy institute, so there’ll be some expectation to get some astronomy out of it. And also, you have to want to think really deeply about a topic for four years, which is not everyone’s cup of tea. But yeah, if someone out there is the kind of person who’s like, I really want to think about black holes and X-ray data for four years, I would love to have their application.

Jon: 34:20

Nice. So there’s not any kind of specific, I guess you said they could come from an astronomy background or a data science background. So I guess that means you’re open to these people coming from the astronomy background and not necessarily having Python skills or…

Daniela: 34:34

Yeah, that’s right. I think everyone in astronomy these days learns Python in their master’s program. I think the formal requirement is a master’s degree. I think that is important.

Jon: 34:45

Okay. Yeah. So in these kinds of key skills, curiosity, persistence, but yeah, so then it ends up being implicit in having the master’s in astronomy or data science. There are some kinds of technical skills that we just assume this person has. So Python would be one.

Daniela: 35:04

Python would be one. For someone out of a data science program, I would hope they come with at least working knowledge of machine learning.

Jon: 35:14

Yeah. It makes sense. Very cool.

Daniela: 35:17

For people who don’t want to do a PhD, I do encourage you to go and look at some of the data challenges. The Rubin Observatory ran one a couple of years ago called the PLAsTiCC challenge, PLAsTiCC with two Cs. You can still find it on Kaggle if you want to explore some next generation problems in astronomy. The official challenge has finished, but there are a lot of those out there that you can find.

Jon: 35:48

Very cool. Yeah, so the Rubin PLAsTiCC Challenge, I guess that could also be a way for people to get a taste of whether a four year PhD in astronomy is the kind of thing for them. Cool. And I understand you have a side project that maybe-

Daniela: 36:03

I do.

Jon: 36:03

… any of our listeners could indulge in. They like listening to podcast episodes. Maybe they would also like listening to stars.

Daniela: 36:10

They would, maybe. I have been working for a couple of years now with the digital arts department at the University of Washington, where I was before I came back to the Netherlands, with Juan Pampin, who is a researcher and artist who works with sound. And we worked together to take a few thousand time series from a space telescope called Kepler. Kepler looked at about 200,000 stars for four years continuously, made a measurement every half hour and in some cases every minute, and just measured the brightness in optical light, so in visible light. And we took all of those time series, or not all of them, we took a subset of those and we turned them into sound. And so you can listen to what stars sound like.
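For listeners curious about what turning a brightness time series into sound can look like in practice, here is a minimal sketch. It is not the project’s actual pipeline, which is not described in the episode. It assumes the simplest possible mapping, where each brightness measurement becomes one audio sample, and it uses a synthetic light curve in place of real Kepler data. The function name `light_curve_to_wav_bytes` and the two made-up pulsation periods are hypothetical.

```python
import io
import wave

import numpy as np

def light_curve_to_wav_bytes(flux, sample_rate=44100):
    """Map a brightness time series directly onto 16-bit mono audio samples."""
    flux = np.asarray(flux, dtype=float)
    centered = flux - flux.mean()  # remove the constant brightness level
    peak = np.max(np.abs(centered))
    if peak == 0:
        raise ValueError("light curve is constant; there is nothing to hear")
    # Scale to the 16-bit range; WAV files store little-endian integers.
    pcm = np.round(centered / peak * 32767).astype("<i2")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return buffer.getvalue()

# Synthetic stand-in for one star: four years at one measurement per half hour,
# with two made-up pulsation periods (in units of half-hour samples).
n_samples = 4 * 365 * 48
t = np.arange(n_samples)
flux = 1.0 + 0.01 * np.sin(2 * np.pi * t / 200.0) + 0.005 * np.sin(2 * np.pi * t / 37.0)

wav_bytes = light_curve_to_wav_bytes(flux)
duration_s = n_samples / 44100  # four years of data play back in under two seconds

# To listen, the bytes could be written to disk, for example:
# with open("star.wav", "wb") as f:
#     f.write(wav_bytes)
```

Played back at 44,100 samples per second, a pulsation that repeats every 200 measurements lands at about 220 Hz, which is why slow stellar variability can end up in the audible range.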

37:03

There’s a website, it’s called starsounder.space, and it will show you a diagram which is very typical for the kind of diagram we make in astronomy about stars. There’s an about page that will tell you what it means, and some of the stars we put into what we call our curated set, which have a little bit more description and guide you through what you’re hearing. I will issue a warning. We tried to make this… We focused on scientific accuracy over pleasantness. So some of these sounds are quite harsh. You might want to turn down your volume when you try those out. But some sound quite fun, some sound like a choir or church bells. And so-

Jon: 37:46

Oh really?

Daniela: 37:47

Yeah.

Jon: 37:48

Cool. That is really cool. And something that you just reminded me of and something that came up prior to us pressing the record button on this is that I guess in astronomy, whether we’re talking about light data, gravitational data, particle observations, and probably whether we’re talking about exploratory data analysis, automated data processing, or assisting with causal inference, in all of these cases, we’re disproportionately working with time series data, right?

Daniela: 38:17

No, not really actually.

Jon: 38:18

Oh, not really. Not necessarily. Okay.

Daniela: 38:22

I think there’s, I don’t know that I would call it an even split between the different types of data. I’ve also simplified a bunch of things, both on the machine learning side, where I suspect I have lots of colleagues who would now stand up and go, “But you missed my kind of problem in astronomy,” for which I’m sorry, and in terms of the data modes. It’s not always that clear cut. There are other data modes. For example, we look at something called polarization, which is the direction of the wave of the light and whether those directions are all the same or not. If you have polarized sunglasses, you might have seen that effect. So that is another type. Sometimes we study both types at the same time. I’m really interested in something called spectral timing, where we look at wavelength and time information at the same time. Some of my colleagues are very interested in exploring both spatial scales and large scale galaxy clusters and spectra. So it’s not as clear cut as I’ve made it out to be.

Jon: 39:30

Okay, got it, got it, got it. Yeah, and I know, it seemed to me like, Oh, these data are being collected over these vast timescales, but you end up, yeah, having things that are spatial as well.

Daniela: 39:41

It also really depends what you’re studying. Galaxy clusters, which are large assemblies of galaxies, they’re some of the largest structures in the universe. You can explore spatially very well and you can explore with spectra, they don’t really change on appreciable timescales that much because any processes take thousands to millions of years. On the other hand, these fast radio bursts, they’re of the order of microseconds to milliseconds long and they actually change quite a lot on those timescales. And so it really depends what type of astronomical object you look at, which type of data you want to collect.

Jon: 40:26

Yeah, it makes sense. And the thing that cued my mind on this obviously was you talking about the side project, and obviously the musical notes have to occur over time. And so that just jogged my memory, but I’m glad that you broadened this so that my assumption about it being mostly time series data turns out to be not true. Cool. All right, so before I let you go, Daniela, this has been a fascinating episode, but I ask all of my guests before I let them go for a book recommendation.

Daniela: 40:52

Yeah, I want to recommend a book that is mostly astronomy and data science in the old-timey way. It’s a book called The Last Stargazers by Emily Levesque, who is a professor at the University of Washington and who has written a book collecting stories from the last hundred years or so of how people observe. And that has changed quite a lot. And there are a lot of stories of people being in remote observatories. They used to have these cages up on the telescopes because you had to sit at the focal point, which would be several meters up, and there’s a story about someone getting stuck up there. There’s a story about a telescope mirror with a bullet hole in it. And so there are a lot of very entertaining, but also very interesting stories about how we used to do astronomy over the past 100 years or so that I think are really informative, because some of these things have gone away, but some of these modes of observation are actually still being used and really relevant. And so it’s a really fascinating story of how astronomy observations happen.

Jon: 42:10

Nice. That’s cool. Sounds like a great recommendation. These kinds of narratives make learning about these technical concepts a breeze. Awesome. And obviously lots of fascinating content introduced by you to us today, Daniela, including potentially career opportunities out there. How should people connect with you or follow you after this episode?

Daniela: 42:35

Yeah, I think probably at the moment the easiest is LinkedIn. I used to be pretty involved on Twitter, but that has declined a lot in the past couple of years and I’ve not yet decided what to do about this, if anything. So at the moment, I think probably LinkedIn is the easiest form to get in touch or Instagram if you want to see pictures of astronomically themed cakes, which is almost the only thing I post there.

Jon: 43:14

Nice. And so the Instagram is also easy to find by your name?

Daniela: 43:18

I think so, yeah.

Jon: 43:19

Nice. Well, then we’ll be sure to include those cakes in the show notes as well. Thank you so much, Daniela. This has been fantastic. Thank you for welcoming me to the fabulous University of Amsterdam facilities and recording here. And yeah, maybe we can check in in the future and see how machine learning applications to astronomy are coming along.

Daniela: 43:42

Yeah, thank you for having me.

Jon: 43:44

Wonderful. Hope you enjoyed that enlightening conversation with the super well-spoken astronomer, Dr. Daniela Huppenkothen. In today’s episode, Daniela covered how earthlings collect tons and tons of light, gravitational, and particle data in order to observe the universe around us, and how machine learning is used in astronomy to perform EDA, to automate data processing, and to assist with making causal inferences.

44:06

All right, that’s it for today’s episode. If you enjoyed it, consider supporting the show by sharing, reviewing, or subscribing, but most importantly, just keep on listening. And until next time, keep on rocking it out there. I’m looking forward to enjoying another round of the Super Data Science podcast with you very soon.
