The National Basketball Association (NBA) is one of the most famous professional sports leagues in the world. Today, the league is composed of 30 teams in North America, with more than 500 players playing each year. As of 2020, NBA players are the world’s best-paid athletes by average annual salary per player.
This project explores how to use each NBA player’s performance to predict the player’s annual salary. The objective is to create a linear regression model that predicts an NBA player’s salary from that player’s game stats for the NBA season.
For data collection, I used the “Beautiful Soup” package in Python as my web scraping tool. Many websites provide NBA players’ stats and salaries; I found that basketball-reference.com provides the most accurate data.
In this study, I scraped the website and created two datasets:
- Stats for all NBA players in the 2019–2020 NBA season
- Salary data for all NBA players in the 2020–2021 NBA season
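The scraping step can be sketched roughly as follows. This is a minimal illustration, not the exact scraper used for this project: the HTML fragment, the table id “stats”, the column names, and the helper parse_table are all hypothetical, and the real pages on basketball-reference.com may be laid out differently. The sketch parses an inline string so that it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded stats page;
# the real page layout, table id, and column names may differ.
html = """
<table id="stats">
  <thead><tr><th>Player</th><th>Pos</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>PG</td><td>25.1</td></tr>
    <tr><td>Player B</td><td>C</td><td>12.4</td></tr>
  </tbody>
</table>
"""


def parse_table(html_text, table_id):
    """Return (header, rows) for the HTML table with the given id."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table", id=table_id)
    header = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")
    ]
    return header, rows


header, rows = parse_table(html, "stats")
print(header)
print(rows)
```

Every scraped value comes back as a string, which is why a separate cleaning step is needed before modeling.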
Next, I prepared the data using “Pandas” and “NumPy” to ensure it was clean and ready for the next phase.
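A data preparation step of this kind might look like the sketch below. The rows are made up, and the column names (“Player”, “Pos”, “PTS”, “Salary”) and the salary string format are assumptions for illustration; the actual cleaning steps depend on what the scraper returns.

```python
import pandas as pd

# Made-up rows for illustration only; not real player data.
stats = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Pos": ["PG", "C", "SF"],
    "PTS": ["25.1", "12.4", ""],  # scraped values arrive as strings
})
salaries = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player D"],
    "Salary": ["$30,000,000", "$8,500,000", "$1,200,000"],
})

# Convert scraped strings to numbers; unparseable values become NaN.
stats["PTS"] = pd.to_numeric(stats["PTS"], errors="coerce")
salaries["Salary"] = (
    salaries["Salary"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Keep only players present in both tables, then drop incomplete rows.
players = stats.merge(salaries, on="Player", how="inner").dropna()
print(players)
```

An inner join is one possible choice here: it silently discards players who appear in only one of the two datasets, so it is worth checking how many rows are lost.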
Once the data was ready, I started with some simple Exploratory Data Analysis (EDA) to understand the features. During the EDA stage, I found that NBA players’ salaries are not normally distributed. A strongly skewed target tends to produce non-normal residuals, which violates an assumption of the linear regression model.
During the feature engineering stage, I applied both a log transform and a square-root (sqrt) transform to my target variable “Salary”. The sqrt transform performed better than the log transform on “Salary”.
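One way to compare the two transforms is to look at the skewness of the target before and after each one; a value closer to zero indicates a more symmetric distribution. The sketch below uses a synthetic, right-skewed sample (an exponential distribution with an arbitrary scale), not the real salary data, so which transform wins here says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "salaries"; NOT real NBA data.
rng = np.random.default_rng(0)
salary = pd.Series(rng.exponential(scale=8_000_000, size=2000))

# Sample skewness of the raw and transformed target.
skewness = {
    "raw": salary.skew(),
    "log": np.log(salary).skew(),
    "sqrt": np.sqrt(salary).skew(),
}
for name, value in skewness.items():
    print(f"{name:>4}: skewness = {value:.2f}")
```

Skewness is only one possible criterion; comparing histograms or the model’s validation score under each transform are reasonable alternatives.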
I created dummy variables for the categorical variable “Position” and checked for multicollinearity among the features. The majority of the features turned out to be highly correlated with one another, so many of them had to be dropped in the feature selection stage.
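Both steps can be sketched with Pandas. The data below is synthetic, and the column names (“Pos”, “MP”, “PTS”, “AST”) and the 0.8 correlation threshold are assumptions made for this example rather than the values used in the project.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; column names are assumptions.
rng = np.random.default_rng(1)
n = 200
minutes = rng.uniform(5, 38, n)
df = pd.DataFrame({
    "Pos": rng.choice(["PG", "SG", "SF", "PF", "C"], n),
    "MP": minutes,
    "PTS": 0.5 * minutes + rng.normal(0, 1, n),  # strongly tied to minutes
    "AST": rng.uniform(0, 10, n),                # generated independently
})

# One-hot encode the position; drop_first removes one redundant column.
encoded = pd.get_dummies(df, columns=["Pos"], drop_first=True)

# Flag numeric feature pairs whose absolute correlation exceeds a threshold.
corr = df[["MP", "PTS", "AST"]].corr().abs()
threshold = 0.8  # arbitrary cut-off chosen for this sketch
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(encoded.columns.tolist())
print(high_pairs)
```

From each flagged pair, one feature is a candidate for removal during feature selection.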
At the beginning of the model creation stage, I created a simple linear regression model as my baseline, fitting all numeric features. The initial R² looked promising: the R² on the training set is 0.61, which indicates that 61% of the variance in the target variable can be explained by the features in my model. However, the R² on the test set is only 0.49. This gap indicates that the model is too complex, with too many features, and is overfitting the training data.
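The train/test comparison can be reproduced in outline with scikit-learn. The sketch below uses synthetic data (3 informative features plus 30 pure-noise features), an 80/20 split, and a fixed random seed, all of which are assumptions for illustration; the scores it prints are unrelated to the 0.61 and 0.49 reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 3 informative features plus 30 pure-noise features.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 33))
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=3.0, size=150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: ordinary least squares on every available feature.
baseline = LinearRegression().fit(X_train, y_train)
r2_train = baseline.score(X_train, y_train)  # R² on data the model has seen
r2_test = baseline.score(X_test, y_test)     # R² on held-out data

print(f"train R²: {r2_train:.2f}")
print(f"test  R²: {r2_test:.2f}")
```

A training R² that is clearly higher than the test R² is the usual sign of overfitting, and many uninformative features relative to the number of rows make that gap more likely.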
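To address that, I used LassoCV with cv=5, which combines K-Fold Cross-Validation with Lasso (L1) regularization. Lasso shrinks the coefficients of less useful features toward zero, which helps to reduce overfitting and model complexity. Because the penalty is sensitive to feature scale, the features should be standardized before fitting.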
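A minimal sketch of this step, assuming scikit-learn and synthetic data in which only the first two of 20 features matter. Wrapping StandardScaler and LassoCV in a pipeline is one common arrangement, not necessarily the one used in this project.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only features 0 and 1 influence the target.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 20))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Standardize the features, then let 5-fold CV choose the penalty strength.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
kept = np.flatnonzero(lasso.coef_)  # indices of features with non-zero weight
print("chosen alpha:", lasso.alpha_)
print("features kept:", kept.tolist())
```

Features whose coefficients are shrunk exactly to zero drop out of the model, so the same fit performs regularization and feature selection.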
After model validation and regularization, I finalized the model and its features.
Finally, I evaluated the model: the Root Mean Squared Error (RMSE) is $1,053,417.
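For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below assumes the model was trained on the sqrt-transformed salary, so predictions are squared to return to dollars before the error is computed; that back-transform, the helper rmse, and the numbers are all illustrative assumptions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Made-up salaries in dollars; not real player data.
actual = np.array([1_000_000.0, 4_000_000.0, 9_000_000.0])

# Hypothetical predictions on the sqrt scale, squared to get back to dollars.
pred_sqrt = np.array([1100.0, 1900.0, 3000.0])
pred = pred_sqrt ** 2

error = rmse(actual, pred)
print(f"RMSE: ${error:,.0f}")
```

Reporting the error in dollars rather than on the transformed scale keeps it directly comparable to actual salaries.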
In my opinion, the performance of the model is reasonably good, considering that many factors other than stats can influence a player’s salary, such as the player’s popularity and type of contract.
Therefore, given more time, I would add more features to my model, collect more historical data adjusted for inflation, and do more fine-tuning of the linear regression model.