In the realm of music streaming, predicting a song’s popularity on Spotify merges art with analytics. This document analyzes 176,774 songs retrieved via Spotify’s API, examining factors such as danceability and loudness to reveal insights into hit potential. Logistic regression, decision tree, and random forest classifiers are applied, showing how numerous attributes contribute to a song's success. This work aims to guide aspiring artists and labels, highlighting the power of data science in the music industry's evolution.

Context

Being known to the general public is the key to success in the music industry. Nowadays, an artist is considered popular if they manage to break into the top charts of Spotify, a digital music, podcast, and video streaming platform. Being able to predict the popularity of a song can help an artist improve a track and adapt it to trends before its release. For a label, it can guide the selection of which artists to sign or which track on an album to promote as a single. Song popularity will therefore be predicted from statistics about the song’s audio (tempo, key, danceability) and metadata such as artist name and track genre.

Dataset Analysed

The dataset used for this analysis is generated from the Spotify API and spans 176,774 songs. It contains 18 features describing each song; two are non-predictive, and one, popularity, is this report's goal field. Popularity is a value between 0 and 100 generated by Spotify; it is computed from the total number of plays relative to other tracks, weighted by how recent those plays are.

Figure 1. Frequency of Popularity values 0-100

Data preparation

Firstly, duplicate songs had to be dropped. Some tracks with the same individual track ID were listed multiple times with different genre values. By removing these, the dataset was reduced from 232,725 songs to 176,774. Secondly, the non-predictive attributes were removed. These included the track ID, which is just a tag attached to the song so it can be identified, and the track name, which, because of its high cardinality, made pre-processing very complex. Extracting keywords from the track name was considered, as they could have high predictive value, but this was not carried out.

The data then only contained predictive values and the goal field, which were separated into x and y values, with y being ‘popularity’ and x being the other features. The data was split, 80% for training and 20% for testing and validation. Repeated k-fold cross validation was used, so the testing and validation sets did not need to be allocated separately. The attributes were sorted into numerical and categorical groups to be processed differently, and artist_name was then split out from the other categorical attributes. Each group was pre-processed differently:

  • The numerical data was cleaned and then scaled so the attributes were inter-comparable. This was done using SimpleImputer and StandardScaler.
  • The categorical data, except artist_name, was encoded with OneHotEncoder, which converts each category into a set of binary indicator columns.
  • The artist_name was encoded differently because of its high cardinality. By using a target encoder, the dimensions of the dataset were not altered.

By combining a pipeline with a column transformer, these pre-processing steps could be chained and the relevant transformation applied to each subset of the attributes.

Classification Prediction

Threshold selection

For classification, the dataset was split at a popularity threshold of 70 and the target converted to binary values: 1 (popular, ≥ 70) and 0 (unpopular, < 70). The reason for this was that a popularity value ≥ 70 was observed to roughly align with what could be considered ‘viral’ songs. This makes the predictions output by these models better suited to tasks such as finding the right single to promote, where the cost of failure can be high due to associated marketing costs.
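The thresholding step itself is a one-liner; the `popularity` values below are hypothetical:

```python
import pandas as pd

# Hypothetical popularity scores; the report thresholds at 70.
popularity = pd.Series([12, 55, 70, 83, 69])

# 1 = popular ("viral"), 0 = unpopular.
is_popular = (popularity >= 70).astype(int)
print(is_popular.tolist())  # [0, 0, 1, 1, 0]
```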

Figure 2. Scatter plot of popularity values with threshold in black

Balancing the dataset

Following this threshold selection, a new dataset was created to train the algorithm on balanced data, with the same number of popular and unpopular songs. This under-sampling is unlikely to discard important information, as the balanced dataset still contains a total of 7,670 rows.
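Under-sampling the majority class can be sketched with plain pandas; the toy frame below is illustrative only (the report's balanced set keeps all popular songs, 7,670 rows in total):

```python
import pandas as pd

# Toy unbalanced frame standing in for the real dataset.
df = pd.DataFrame({
    "loudness": [-4, -5, -9, -12, -8, -6, -10, -3],
    "is_popular": [1, 1, 0, 0, 0, 0, 0, 1],
})

# Keep every popular song, and draw an equal-sized random sample of unpopular ones.
popular = df[df["is_popular"] == 1]
unpopular = df[df["is_popular"] == 0].sample(n=len(popular), random_state=42)

# Concatenate and shuffle so classes are interleaved.
balanced = pd.concat([popular, unpopular]).sample(frac=1, random_state=42)
print(balanced["is_popular"].value_counts().to_dict())  # three of each class
```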

Figure 3. Unbalanced dataset (left) and balanced dataset (right)

Classification prediction with artist name

A logistic regression model was used to predict popularity with all the features of the binary unbalanced dataset in order to evaluate the importance of the features across the whole dataset.

Table 1. Logistic Regression Results with artist_name

With all the features included, it was found that the artist_name variable is more than three times more important than any other attribute. However, as the goal is to predict the popularity of a song from its audio features to help artists and labels succeed in the music industry, the artist_name variable isn’t really relevant. For example, Drake would not need this algorithm to know his next song is going to be popular. Therefore, following this model prediction, artist_name was dropped from the x features.
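Reading feature importance off a logistic regression typically means comparing coefficient magnitudes on standardized inputs. A synthetic sketch (not the report's actual data), where the second feature is constructed to dominate the signal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical standardized feature matrix with 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Make feature 1 drive the (synthetic) popularity label.
y = (X[:, 1] * 3 + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# On standardized inputs, |coefficient| serves as a rough importance measure.
importance = np.abs(model.coef_[0])
print(importance.argmax())  # feature 1 comes out on top
```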

Figure 4. Feature Importance of Logistic Regression Model with artist_name


Classification prediction without artist name

Forward and backward selection were tried manually, but feature selection proved unnecessary here: the dataset has many rows and relatively few columns, and the best models showed little sign of overfitting.

The one-hot encoded features give deeper insight into what the model values when predicting popularity. The most popular genres, such as Pop and Rap, show the highest importance; however, other factors such as loudness take priority over more niche genres.

Figure 5. Feature Importance of Logistic Regression Model without artist_name

The decision tree is also very accurate both in and out of sample. Once again, the decision tree’s feature importances show that genre is the main feature for predicting whether a song is going to be viral, and loudness is again an important feature. Encoded feature names and coefficients could not be used in this case.

Figure 6. Example tree for Random Forest Classifier with a max depth of 4 and one-hot encoded features

The random forest classifier was surprisingly performant in sample: across repeated runs it always achieved a perfect in-sample score. Its feature importances show a more even distribution across all attributes, even though genre remains the most important feature. Encoded feature names and coefficients could not be used.

Model Comparison

Logistic Regression was found to be the best model of the three, with the highest accuracy of 87.09%. The two other models, however, also had very high accuracies, and precision and recall values are very close across all three. Logistic Regression has the highest recall at 88%, which is valuable: a low false negative rate means the algorithm will not prevent an artist from entering the music market by falsely predicting their song as unpopular.
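The three metrics being compared can be computed directly from held-out predictions; the labels below are hypothetical and just illustrate the calculation:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical predictions on a small held-out set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # 0.75 (6 of 8 correct)
print(precision_score(y_true, y_pred))  # 0.75 (3 true positives / 4 predicted positives)
print(recall_score(y_true, y_pred))     # 0.75 (3 true positives / 4 actual positives)
```

Recall is the metric to watch for the use case above: it directly measures how many genuinely popular songs the model fails to flag.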

Table 2. Comparison of performance metrics on test set for models without artist_name

A comparison of feature importance is reassuring as for all models the same features have the biggest importance and impact on the prediction.

Table 3. Most important features for models without artist_name

This classification method, however, lacks a specific popularity output. Record labels investing money into a track or artist would benefit from a specific popularity value rather than a binary label: two songs could both be classified below 70, but a song with a predicted popularity of 69 clearly deserves more investment than one with a popularity of 0. Therefore, numerical prediction was conducted with two different models: Linear Regression and Random Forest Regressor.

Numerical Prediction

In this section, popularity values between 0 and 100 were predicted by two different models. The previous threshold and binary values were removed to allow a more fine-grained prediction. Accuracy will be the only performance metric: since the target is now numerical, simple confusion matrices no longer apply, so precision and recall cannot be computed.

Linear Regression

Firstly, a model predicting the popularity value without artist_name but including all other features was made. Then, the artist_name was reintroduced, improving the prediction as shown in Figure 7.

Figure 7. Predicted vs Actual Popularity without artist_name (left) and with artist_name (right)

Random Forest Regressor

Finally, a random forest regressor with the artist_name variable, and therefore including all the features, was used. Its out-of-sample accuracy came to 75%, which is satisfying given that it predicts an integer between 0 and 100 and therefore has 101 possible popularity outputs for every song.
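A minimal sketch of fitting such a regressor, on synthetic data standing in for the preprocessed feature matrix (note that `score` here reports R², one common "accuracy"-style measure for regressors; the report does not specify which metric it used):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: 5 features, popularity clipped to the 0-100 range.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.clip(50 + 20 * X[:, 0] + rng.normal(scale=5, size=300), 0, 100)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(round(model.score(X, y), 2))  # in-sample R^2, close to 1 for a forest
```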

Conclusion

The popularity of a song is time dependent and can be seen as very subjective. It was nevertheless shown that it can be predicted from audio features, with an emphasis on the song's genre, loudness, instrumentalness, and danceability. Logistic Regression was chosen as the best model for predicting whether the song of an unknown artist will be viral and reach a popularity over 70, with an accuracy of 87.09%, a precision of 86%, and a recall of 88%. Regarding the numerical prediction taking the artist’s name into account, the Random Forest Regressor performed best at predicting the popularity value between 0 and 100, with an accuracy of 74.94%. These models, however, rely on the data reported by Spotify’s API and are therefore limited, as they fully depend on Spotify's algorithm for calculating the popularity value. Download the full report below.

Download the Full Report
