
23 minutes

Data Science | Artificial Intelligence

SDS 708: ChatGPT Code Interpreter: 5 Hacks for Data Scientists

Podcast Guest: Jon Krohn

Friday Aug 25, 2023

Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn


On this week’s Five-Minute Friday, host Jon Krohn walks listeners through ChatGPT’s Code Interpreter feature, which is available to anyone with a Plus subscription to ChatGPT. Our accompanying YouTube video might also be helpful if you want to follow along visually. Jon shares five hacks that show why he is so excited about the Code Interpreter and why it is likely to become an essential item in any data scientist’s toolkit.
 

As soon as Jon uploads the freely available Boston Housing dataset, which records housing prices in the city, ChatGPT recognizes it as a frequently used teaching dataset and asks whether he would like a summary of the data or a specific analysis. It can even recognize and fix errors, train appropriate machine learning models for the analysis, and recommend improvements for fine-tuning.

Listen to (or watch) this episode for everything you need to know about the latest in augmenting our work via ChatGPT!


(00:00): This is Five-Minute Friday on five hacks for data scientists using the ChatGPT Code Interpreter.

(00:19): Welcome back to the Super Data Science Podcast. I'm your host, Jon Krohn. For today's episode, I'm going to take you on a tour of the ChatGPT Code Interpreter. I know most of our listeners consume the show in audio-only format, so I will be narrating everything as I go. But if you want, you can also check out the YouTube version of today's episode, where I've got a screen share of everything that I'm talking about, so if you have any doubts about what I'm describing in audio, you've got that video option as well. There are some really crazy things that I'm going to show you with this Code Interpreter if you haven't used it before.

(01:05): So how does the Code Interpreter work? You have to have a ChatGPT Plus subscription, which I personally do and highly recommend, because you can use GPT-4 and save lots of time on all kinds of tasks, especially tasks involving code and data science analysis, as you're going to see (or hear about) in today's episode. If you haven't used the Code Interpreter before, the first thing you'll have to do is go down to where your email address is in the bottom left-hand corner of the screen, click on Settings and Beta, go to the Beta features section of settings, and turn on the Code Interpreter. There's just a little toggle for that; it's pretty straightforward, pretty intuitive. Then you can create a new chat and make sure you select the GPT-4 model. There's a dropdown menu there: you can choose just the default GPT-4, which is super powerful on its own, or you can choose plugins, which is kind of a separate thing. You can have Wolfram Alpha there to do math for you, for example, or you can do internet searches with those plugins. But today we're focused on the Code Interpreter, so you select the Code Interpreter option from the GPT-4 dropdown menu.

(02:22): In the interest of time, I've got a chat that I've already run, so we don't have to sit here while all of my code runs. As an example, I found on GitHub a raw file of the Boston Housing dataset, which is a famous dataset for teaching examples of machine learning. It's got house prices from, I think, the 1970s in Boston. There are about 500 different regions, so 500 rows of data, and for each of those rows there are a bunch of variables. One of those variables is the median value of houses in that region, and then there are a bunch of other variables that you can use to predict house price.

(03:09): So you can easily find an example raw CSV file like this online, and you can upload lots of different kinds of file types into the Code Interpreter. I'll provide a link to the exact CSV file that I used, but it's super flexible; you don't need the one that I provided. You could use your own data or some other dataset that you found. I simply saved that CSV file to my local machine, and once you've saved your file locally, there's a plus button, right in the place where you would usually type a message in the regular ChatGPT interface, that lets you upload a file. So I uploaded the Boston CSV file, and the Code Interpreter says back to me, "I see that you've uploaded a file named Boston CSV, how can I assist you with it?" And I was like, "Oh, this is the Boston house price dataset. Are you familiar with what the variables in the file are?" And ChatGPT says back to me, "Yep, I'm familiar with the Boston Housing dataset. It's a commonly used dataset." Then it gives me an overview of the 14 variables in the file: the variable name that's in the dataset and a one-sentence description of what each of those variables means. And then it asks me, "Would you like a summary of the data or any specific analyses performed on it?"
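
If you're following along in text rather than on YouTube, a minimal sketch of the kind of code the Code Interpreter runs when it first inspects an uploaded file might look like this (the filename boston.csv is an assumption, and this isn't the exact code from my session):

```python
import pandas as pd

# Load the uploaded CSV into a DataFrame (the filename here is an assumption)
df = pd.read_csv("boston.csv")

# Inspect the first few rows, the column names, and the data types
print(df.head())
df.info()
```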

(04:34): So I kicked off this episode by saying I had five hacks for you. We've already actually covered the first hack, which is data summarization: just by uploading the file, it quickly realized what file we were looking at and summarized the key variables for me. The second hack that I have for you is exploratory data analysis. ChatGPT asked me, "Would you like a summary of the data or any specific analysis performed on it?" So let's start off with a summary of the data, and I say, "Yes, could you please do an EDA on the key variables?" For our listeners who aren't aware, an EDA is an exploratory data analysis. ChatGPT says back to me, "Of course, an exploratory data analysis involves understanding the data's main characteristics, often with visual methods."

(05:17): And so then it says, "Okay, let's focus on these seven key variables in the Boston Housing dataset and follow these steps. Let's check for missing values. Let's obtain summary stats. Let's visualize the distributions. Let's examine the relationships between variables, especially between the predictor variables and our target variable, which is the median value of the house prices." Then it goes through these things step by step, and you can see the work, you can see the Python code that was created. It starts off by checking for missing values and obtaining summary stats. The first chunk of code that it creates runs right there in the ChatGPT Code Interpreter, which is super cool; you don't even have to copy it into your own Jupyter Notebook or your local IDE or whatever. It'll just run right there in the instance. And it runs into an error. An error gets thrown, and without me having to do anything, the ChatGPT Code Interpreter says, "Ah, it appears there was an error in accessing the specified columns from the dataset. This might suggest that the provided dataset does not have the traditional columns or they might be named differently." So it looks into it further and determines that, oh, the problem is that in the dataset I uploaded, all of the variable names are in lowercase, and it was expecting them in uppercase. So it adapts automatically. This is all happening without me providing any additional prompts; it just keeps running. It figures out the error, switches everything over to the lowercase column names, and provides you with the code for how it did that.
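
A rough sketch of that missing-value check and the lowercase column fix, continuing from the loading sketch above (this is my paraphrase, not the Code Interpreter's exact code):

```python
# Check for missing values in each column
print(df.isnull().sum())

# The uploaded file happened to use lowercase column names (e.g. 'medv' rather than
# the traditional uppercase 'MEDV'), so standardize on lowercase before referencing them
df.columns = df.columns.str.lower()
```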

(06:55): And then it does the work that it said it was going to do for that exploratory data analysis. It looks for missing values and determines that there aren't any. Then it provides summary stats on the seven key variables that it identified, not me. For each of those variables, it provides relevant stats: things like the median, the range of the data, the maximum, whatever the key summary stats were for those seven key variables. It just suggested those to me automatically and did a really great job as far as I'm concerned. Then it goes on to plot distributions of the seven key variables. It creates histograms for each of the seven variables, which print out right there in the Code Interpreter so you can see them, and it also provides a one-sentence summary of each of those histogram plots. And then it does scatter plots between our target variable, the median house price, and the six key predictors that it identified for that target. So it creates these six scatter plots with correlation coefficients, and it provides another one- or two-sentence summary of each of these scatter plots and the strength of the correlation between those predictor variables and our target variable, the median house price.
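
Here's roughly what that EDA step looks like in code: summary stats, histograms, and scatter plots annotated with correlation coefficients. The particular list of "key" variables below is my assumption; the Code Interpreter picked its own seven.

```python
import matplotlib.pyplot as plt

# Summary statistics for a set of key variables (this list is an assumption)
key_vars = ["crim", "rm", "age", "dis", "lstat", "ptratio", "medv"]
print(df[key_vars].describe())

# Histograms showing the distribution of each key variable
df[key_vars].hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# Scatter plots of each predictor against the target, annotated with correlation coefficients
target = "medv"
predictors = [v for v in key_vars if v != target]
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
for ax, col in zip(axes.ravel(), predictors):
    ax.scatter(df[col], df[target], s=10, alpha=0.6)
    ax.set_title(f"{col} vs {target} (r = {df[col].corr(df[target]):.2f})")
    ax.set_xlabel(col)
    ax.set_ylabel(target)
plt.tight_layout()
plt.show()
```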

(08:25): And at the end of all of that, remember, all I did was ask it to do an exploratory data analysis. It did all of the things that I've been describing since: identifying what appropriate EDA steps are; encountering an error because the column names in my file were lowercase while it was expecting the traditional uppercase names, and fixing that error; and then running the code to determine there are no missing values, calculate some summary statistics, plot distributions of each of the key variables, and create scatter plots with them as well. Really, really cool stuff.

(09:02): So at the end of all that it says, "These insights are a pretty good starting point for further modeling or analysis. Do you have any additional requests or questions about this dataset?" And so I said, "Excellent, thanks. Can you please process the data for a linear regression where we predict house price?" One of the really cool things about this for me is that I'm not even specifying the variable name. There's a specific variable name for the house price, MEDV, and I don't need to spell that out. You can just use natural language. You could say, "I want to predict house price. How should I pre-process the data?" Or, as I actually phrased it, "Can you please process the data?"

(09:41): So that is now my third hack for you. The first hack was getting a summary of a dataset, hack two is doing an exploratory data analysis, and hack three is automated data pre-processing. It identifies great pre-processing steps for linear regression. It says: we'll handle missing data, but we already identified that we don't have any missing values, so we can skip that step; we'll do feature scaling; we'll handle categorical data; we'll do feature selection; and then we'll split the data into a training and test set. Then it goes ahead and does those steps, or at least all the steps that are relevant, because it actually identifies, "Okay, some of these steps aren't relevant for this dataset, so we'll skip those." And then it provides you with the code for how it does the scaling to standardize the data and how it splits the data into a training set and a test set.
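
A minimal sketch of that pre-processing, continuing from the earlier sketches and assuming scikit-learn's StandardScaler and train_test_split (an 80/20 split reproduces the 404/102 sample counts mentioned in the next step; this isn't the exact code it wrote):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate the 13 predictors from the target (median house value)
X = df.drop(columns=["medv"])
y = df["medv"]

# An 80/20 split yields roughly 404 training samples and 102 test samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features, fitting the scaler on the training set only to avoid leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```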

(10:42): And yeah, it summarizes: okay, we've got a training set now with 404 samples and a test set with 102 samples; there are 13 features in each of those datasets, and all the features have been standardized. Great. So then it says, "You can now proceed with training a linear regression model on this pre-processed data. Would you like me to demonstrate that or is there anything else you'd like to explore?" And that brings me to our fourth hack, which is training machine learning models and evaluating them. You can just do machine learning or statistics right here in the ChatGPT Code Interpreter. So I said, "Fabulous. Now please train an appropriate regression model and validate the model using the test set." And just like everything else, it goes ahead and does it, and it provides you with the code. It uses a linear regression model from the scikit-learn library in this case.
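
That baseline model and its test-set evaluation might look something like this sketch, continuing from the pre-processing code above (not the exact code it generated):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit a baseline linear regression on the standardized training data
lin_reg = LinearRegression()
lin_reg.fit(X_train_scaled, y_train)

# Evaluate on the held-out test set
y_pred = lin_reg.predict(X_test_scaled)
print("Test MSE:", mean_squared_error(y_test, y_pred))
print("Test R^2:", r2_score(y_test, y_pred))
```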

(11:38): It then provides me with some sensible summary stats calculated on the test data: it calculates the mean squared error and the R-squared score, and it explains what that R-squared score means in this particular instance. If you weren't familiar with mean squared error, which is something we want to minimize, you could ask for an explanation of that too, but I know what the mean squared error is; it's just the cost that we want to minimize with our machine learning algorithm here. So I said, "Do you have any ideas for how we could get an even lower mean squared error?" And it writes back, "Certainly, improving the performance of a regression model in terms of reducing the mean squared error can be approached in multiple ways. Here are some strategies", and it comes up with a set of amazing strategies. It comes up with a couple of ideas for feature engineering. It comes up with three ideas for ways we could regularize the model. It suggests other kinds of modeling approaches, more advanced than just a linear regression: you could do a random forest or gradient-boosted trees, for example, and it explains why it has these suggestions as well.

(12:50): For everything that I've been saying, it explains why. For the random forest, it says that a non-linear model can capture more complex relationships than a linear regression could, and it suggests support vector machines for the same reason. It suggests more data processing, like outlier removal and transformations that we could apply to the data. It suggests feature selection. It suggests ensembling, like stacking, bagging, or boosting. It suggests hyper-parameter tuning. And it says we should be assessing our model assumptions, which is a really sensible thing to be doing here: things like homoscedasticity and independence of errors are the key things that you need to be looking for, and it's aware of all of them. So in total it has these eight categories of ways that we can improve our machine learning model, and most of those categories are broken down into sub-bullets with lots of separate suggestions and explanations of why they might be helpful.

(13:44): So then it asks me, "Would you like to explore any of these strategies on the dataset?" And I just said, "Sure, whichever approaches you think would be best, please go for it." And it says, "All right, let's implement a few of the strategies to potentially improve the performance. Let's do some feature engineering: we'll add some polynomial features to potentially capture non-linear relationships. We'll do some regularization: we'll try both ridge and lasso regularization to see if that helps. And we'll try using a random forest instead of just doing linear regression to see how that compares. For now, we'll skip hyper-parameter tuning, but in practice that could be a good step as well." Then, again, it writes all the code. It uses scikit-learn for all of these things: the ridge regression, the lasso regression, creating the polynomial features, doing the random forest. And then it calculates the mean squared error results. Our previous mean squared error, with the first approach that it suggested, or I guess that I suggested, the plain linear regression, was 24. Then it ran these four other approaches. The linear regression with the new polynomial features it added improved the results: our mean squared error went from 24 down to 14. That's a big jump. Similarly, there were improvements with ridge regression and lasso regression relative to our baseline model, but the best results came from the random forest, and it has some observations on these.
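
A sketch of those four approaches with scikit-learn, continuing from the earlier code; the polynomial degree and the ridge/lasso alpha values here are illustrative assumptions, and your error numbers will differ:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 polynomial features to capture non-linear relationships (degree is an assumption)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

# Fit each candidate model (the alpha values are illustrative defaults, not tuned)
models = {
    "linear + polynomial features": LinearRegression().fit(X_train_poly, y_train),
    "ridge regression": Ridge(alpha=1.0).fit(X_train_scaled, y_train),
    "lasso regression": Lasso(alpha=0.1).fit(X_train_scaled, y_train),
    "random forest": RandomForestRegressor(random_state=42).fit(X_train_scaled, y_train),
}

# Compare test-set mean squared error across the approaches
for name, model in models.items():
    X_eval = X_test_poly if "polynomial" in name else X_test_scaled
    print(name, mean_squared_error(y_test, model.predict(X_eval)))
```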

(15:23): So it provides you with summaries of why our performance improved due to the polynomial features, due to the ridge regularization, due to the lasso regularization, and it notes that the random forest had the best performance overall. Then it says at the end, "So the random forest model has provided a significant improvement. However, further hyper-parameter tuning, especially for the random forest model, could potentially lead to even better results. Would you like me to proceed with hyper-parameter tuning?" And so I said, "Yeah, please go ahead with the hyper-parameter tuning, but just do it for the random forest model", because it suggested it could also potentially improve the lasso model with hyper-parameter tuning, and I said no. For simplicity, for the purpose of this demo for this podcast episode, I was thinking let's just do this as quickly as possible: just go with the random forest model. So then it comes up with a great hyper-parameter tuning approach. It suggests a set of five specific random forest hyper-parameters to vary, and then it says, let's do a grid search with cross-validation. A very sensible choice here for figuring out what the best approach is.

(16:35): Then my code was running for a while in the Code Interpreter, and that's because there were a total of 1,080 different fits that it was going to need to do across this grid search it had come up with. So I was wondering how long this was going to run for; I hadn't seen code run this long in the Code Interpreter before, and it ran for maybe 30 seconds or a minute, that kind of thing. And then it threw an error. And it understood exactly why it hit the error. It said, "It seems the grid search took longer than anticipated and was automatically interrupted due to a timeout. To address this, we can take a few measures. We could reduce the search space by selecting fewer hyper-parameter values, or we could use a randomized approach."
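
For reference, a grid search over five random forest hyper-parameters might be set up like this; the specific grid below is my guess (the episode doesn't show it), though a grid of this size, 216 combinations times 5 cross-validation folds, would account for the 1,080 fits:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# An illustrative grid over five random forest hyper-parameters (values are assumptions)
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2", 1.0],
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)
# This is the step that exceeded the Code Interpreter's execution time limit:
# grid_search.fit(X_train_scaled, y_train)
```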

(17:26): And then, without me saying anything else, it just goes ahead and says, "For the sake of time, I'll opt for the second approach and do a randomized search instead of a comprehensive grid search. This will give us a good idea about some potentially optimal hyper-parameters without doing the full grid search. Let's proceed with this." As always, it provides me with the code that it's using for that. Now we have just 250 fits to do instead of over a thousand. This still runs for a little while, maybe 15 or 20 seconds, and then it produces the results. It says, "This provided me with the set of hyper-parameters that gave us the best results on the training dataset. So now let's evaluate how this performs on the held-out test data."
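
A sketch of that randomized search and the held-out evaluation, again with assumed parameter ranges; 50 sampled settings times 5 cross-validation folds gives the 250 fits:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV

# Distributions to sample hyper-parameters from (these ranges are assumptions)
param_distributions = {
    "n_estimators": randint(100, 400),
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
    "max_features": ["sqrt", "log2", 1.0],
}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=50,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train_scaled, y_train)

# Evaluate the tuned random forest on the held-out test set
best_rf = random_search.best_estimator_
print("Best hyper-parameters:", random_search.best_params_)
print("Test MSE:", mean_squared_error(y_test, best_rf.predict(X_test_scaled)))
```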

(18:11): And it finds that it gets a mean squared error on the test data of 10.5, which is actually slightly higher than the 10.2 we had with the default hyper-parameters. But that doesn't surprise me, because we didn't do a comprehensive grid search anyway. We rushed ahead because of the limitation on how long a process can run in the Code Interpreter before being automatically interrupted, which is there so that you're not using ChatGPT to do a lot of heavy lifting, like training gigantic transformer architectures in there yourself. So if you wanted to, you could take the comprehensive grid search code and copy it out very easily; there's a copy code button so you can grab the Python code and put it right into whatever development environment you prefer, and you could run that comprehensive grid search and maybe get better results. You could also just keep talking to this ChatGPT Code Interpreter instance, come up with other ideas for ways you could potentially improve performance, and see how it goes.

(19:24): So we've gone through four of my hacks. The first hack was just getting a summary of some file that you upload. My second hack was doing an exploratory data analysis. Hack three was data pre-processing. Hack four was actually doing machine learning, and we've been on that one for a while now; that's kind of been the meat of the episode. But the fifth and final hack for you is that the Code Interpreter is incredible for explaining code. That same file upload button that you use to upload data to analyze, you can also use to upload code or documents. So you could potentially upload a document and ask it questions about that document, and it will answer them. Or you can upload code and ask it to explain the code overall or particular aspects of the code, or adapt the code for some other purpose.

(20:22): In this case, in the interest of time, of just doing something really quickly for this podcast episode, I simply asked for an explanation of this particular class, RandomizedSearchCV, which it imported from scikit-learn. So I said, "Can you please explain all the arguments in the line of code where you called RandomizedSearchCV?" And it says, "Of course, it's a powerful method for hyper-parameter tuning, and understanding its arguments is key to using it effectively. Here's a breakdown of all the arguments used." So it enumerates all of the arguments, and for each of those arguments it provides a few bullets explaining what that argument is for and why you would choose a particular value, as well as some example values that you might want to try.
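
To give a flavor of the kind of breakdown it produces, here's the randomized-search call from the sketch above with each argument annotated in my own words (this is a paraphrase, not ChatGPT's output):

```python
# Reusing the names defined in the randomized-search sketch above
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),  # estimator: the model whose hyper-parameters are being tuned
    param_distributions,                     # dict mapping hyper-parameter names to lists/distributions to sample from
    n_iter=50,                               # number of random settings to try; more = better coverage, longer runtime
    cv=5,                                    # 5-fold cross-validation for each sampled setting
    scoring="neg_mean_squared_error",        # optimize (negative) MSE, since scikit-learn maximizes scores
    random_state=42,                         # makes the random sampling reproducible
    n_jobs=-1,                               # run fits in parallel on all available CPU cores
)
```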

(21:16): So that's it. If you haven't used the Code Interpreter before, I hope that this episode blew your mind. The first time that I used it, I was completely stunned by what it can do. So yeah, knock yourself out: use some data of your own and get going. Start making your life easier. There's no need to be struggling with your code or coming up with ideas for improving your machine learning models all on your own. Just have a conversation with the ChatGPT Code Interpreter and have some fun. These hacks really do make being a data scientist more fun than ever before.

(22:02): All right, that's it for today. I'll be sure to have links to the data in the show notes in case you want to try to replicate more or less what I did here. There's probably some stochasticity in the model, so you're not going to get exactly the same answers or flow as I did, but you could get roughly the same thing. And yeah, check out the YouTube version, if you're not watching it already, where you can see me walk through all of the code.

(22:30): All right. Until next time my friend, keep on rocking it out there and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon. 


