In Search of Spending — Part 2

Predicting Spending in Google’s Online Store — Regression Modeling

This post is part two of a walkthrough of a recent machine learning project. Part one covered the data and business context; this installment covers the modeling process. View the entire project (including all code and the accompanying slide deck) on GitHub.

Introduction:

In part one of this series I described our data, did some exploratory data analysis, and walked through some of the preprocessing done to get the data ready for modeling.

The data includes 717k rows, with each row being one visit to Google’s online store between 2016 and 2018. Features describing geography, traffic source, device properties, page views, time, price, and spending are included in the dataset. About 2.5% of visits lead to a purchase, and the average purchase amount is $124 USD.

I added economic indicators like daily S&P 500 values to help control for changes in overall consumer spending. K-Means clustering was also used to add additional features (cluster label, silhouette score) to the data set.
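Roughly, that clustering step might have looked like the sketch below; the feature-matrix name, the number of clusters, and the random seed are assumptions rather than the project's actual values.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler

# X_num is an assumed name for the numeric features used for clustering
X_scaled = StandardScaler().fit_transform(X_num)

# Fit K-Means, then attach the cluster label and per-row silhouette score as new features
kmeans = KMeans(n_clusters=5, random_state=70)  # 5 clusters is an assumption
df['cluster'] = kmeans.fit_predict(X_scaled)
df['silhouette'] = silhouette_samples(X_scaled, kmeans.labels_)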

Certain variables examined in the EDA stage could have an impact on spending. Features like the number of previous visits and the number of page views show a clear positive relationship with both purchase amount and purchase frequency. We’ll keep an eye on the model results to see whether these features turn out to be important predictors of spending.

Steps / Iterations:

The modeling stage of this project was done in an iterative process of tuning models, trying new models, and engineering new features.

Throughout the modeling process, R-Squared and RMSE were used to judge model performance. Cross-validation, along with comparisons between training-fold and validation-fold scores, was used to limit overfitting to the training data. A held-out test set of 30% of the data was used to evaluate the final model’s performance.
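In rough terms, the evaluation setup looks like the sketch below; the variable names (X_ohe, y, model) and the random seed are assumptions, not the project's actual code.

from sklearn.model_selection import train_test_split, cross_validate

# Hold out 30% of the data as the final test set
X_train_ohe, X_test_ohe, y_train, y_test = train_test_split(X_ohe, y, test_size=0.3, random_state=70)

# 5-fold cross-validation on the training data, scoring both R-Squared and RMSE
# (model stands in for whichever pipeline is currently being evaluated)
cv_results = cross_validate(model, X_train_ohe, y_train, cv=5,
                            scoring=['r2', 'neg_root_mean_squared_error'],
                            return_train_score=True)
print(cv_results['test_r2'].mean(), cv_results['train_r2'].mean())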

  • Baseline Model — KNN: This untuned model was used as a baseline against which to judge the performance of new model iterations.
  • Tuned KNN Models: Tuned KNN models performed much better than the untuned model in terms of our metrics. However, feature importances cannot be interpreted from KNN model outputs, so from here we transition to tree-based models.
  • Decision Tree Model: This model performed worse than the tuned KNN model, but well enough to indicate promise in further tree-based models.
  • Random Forest Models: These models performed noticeably better than the tuned KNN model, though progress slowed after iterative rounds of tuning.

Baseline KNN Model:

The KNN regressor was chosen as the baseline model because it is easy to implement and generally fairly effective. Importantly, it does not output feature importances (which are crucial to our understanding of the data), so we hope to beat the results of this model using tree-based models.

At its most basic level the KNN algorithm puts all data points in an n-dimensional space, and makes a prediction on a point by looking at the k ‘nearest’ data points. We can adjust the number of neighbors to use, the way we calculate distance, and we can add in a weight function to give more influence to the closest neighbors.

For the baseline model I used scikit-learn’s default hyperparameters of k = 5 nearest neighbors and a uniform weighting function. All of the models throughout this process are optimized to maximize R-Squared.

The mean R-Squared score across the 5 CV folds is 16.1%. This is fairly low but will serve as a good starting point from which to judge the performance of future models.
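For reference, the baseline setup might look something like this sketch; the exact pipeline steps are an assumption, though pipe_knn and X_train_ohe match the names used in the tuning code below.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Baseline: default KNN (k=5, uniform weights) on scaled, one-hot-encoded features
pipe_knn = Pipeline([('scl', StandardScaler()),
                     ('knn', KNeighborsRegressor())])

baseline_r2 = cross_val_score(pipe_knn, X_train_ohe, y_train, cv=5, scoring='r2')
print(baseline_r2.mean())  # the reported mean R-Squared was about 0.161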

Tuned KNN Model:

Now that we have a baseline model, the next step is to try to improve the performance of the model by tuning the hyperparameters.

from sklearn.model_selection import GridSearchCV

# Create the grid parameter
grid_knn = [{'knn__n_neighbors': [5, 10, 20, 50, 100],
             'knn__weights': ['uniform', 'distance']}]

# Create the gridsearch object
gridsearch_knn = GridSearchCV(estimator=pipe_knn,
                              param_grid=grid_knn,
                              scoring=['r2', 'neg_root_mean_squared_error'],  # Include RMSE in results
                              refit='r2',  # Choose the best model based on R^2
                              cv=5,
                              n_jobs=-1,
                              verbose=8)

# Fit the grid search on the training data
gridsearch_knn.fit(X_train_ohe, y_train)

Here I fit the KNN model using every combination of 5/10/20/50/100 neighbors with either a uniform or a distance-based weight function. The number of neighbors adjusts how many data points to consider when predicting spending. The weight parameter tells the algorithm either to treat all of the neighbors equally, or to give more influence to the ‘closer’ neighbors.

The best version in this iteration uses n_neighbors = 50 with a ‘distance’ weight function, yielding an R-Squared value of 24.99%. With these results in hand, we can narrow the grid to home in on the optimal parameters.

Test R-Squared Value vs n_neighbors

In the above graph we can see that the test R-Squared value increases with k until it reaches a peak somewhere in the 40–60 range. Next we can run another tuning iteration to see if other values in that range perform better than our current best model using k = 50.

# Create the grid parameter using neighbor values based on our previous graphs
grid_knn_2 = [{'knn__n_neighbors': [45, 50, 55, 60],
               'knn__weights': ['distance']}]

This second iteration resulted in a slightly better model using k = 45, with an R-Squared of 25.02%. The increase is very small, indicating that it’s time to move on to other model types.

Decision Tree Models:

Decision Tree Outline — Geeks for Geeks

The next model used is a Decision Tree Regressor. A decision tree works by splitting the data according to some criterion at each node; the resulting groups of data points are then split on further criteria until a stopping condition is met. In this case, each split is chosen to minimize the Mean Squared Error (MSE).

An important reason to use tree-based models over a KNN model is that these tree-based models allow us access to feature importances. This could show us, for example, that a user’s operating system is not an important predictor of spending, but the number of pages they view is very useful in predicting spending.

This model is a stepping stone in working towards using the (usually) more effective Random Forest Regressor.
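The pipe_tree object used in the search below isn't defined in the snippet; here is a minimal sketch, assuming it mirrors the scaler-plus-estimator pattern used for the random forest later (the random_state value is a guess).

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Decision tree pipeline referenced as pipe_tree in the randomized search below
pipe_tree = Pipeline([('scl', StandardScaler()),
                      ('tree', DecisionTreeRegressor(random_state=70))])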

After an initial round of tuning, we search through the following grid to find the best values for the depth of the tree and the minimum number of data points at each leaf/split.

from sklearn.model_selection import RandomizedSearchCV

# Create the grid parameter
grid_tree2 = [{'tree__max_depth': [15, 20, 25, 30],
               'tree__min_samples_split': [30, 35, 40, 45],
               'tree__min_samples_leaf': [250, 300, 350],
               'tree__criterion': ['mse'],
               'tree__max_features': [None]}]

# Create the randomized search, with the decision tree pipeline as the estimator
gridsearch_tree2 = RandomizedSearchCV(estimator=pipe_tree,
                                      param_distributions=grid_tree2,
                                      scoring=['r2', 'neg_root_mean_squared_error'],  # Include RMSE in results
                                      refit='r2',  # Choose best model based on R^2
                                      cv=5,
                                      n_iter=10,  # Try 10 hyperparameter combinations
                                      n_jobs=-1,
                                      verbose=8)

The best model here has an R-Squared value of 24.2%. This is slightly worse than the tuned KNN model, but the decrease in R-Squared is worth it because of the inclusion of feature importances.

Best Decision Tree Hyperparameters
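If you are reproducing this, the hyperparameters and score captured in the image above can be read directly off the fitted search object; a quick sketch:

# Inspect the best decision tree found by the randomized search
print(gridsearch_tree2.best_params_)  # winning max_depth, min_samples_split, min_samples_leaf, etc.
print(gridsearch_tree2.best_score_)   # mean cross-validated R-Squared of the best model (~0.242)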

Random Forest Models:

A random forest model is the logical next step after the results from a decision tree are less than ideal. The basic idea of a random forest is that we run our data through a bunch of uncorrelated decision trees, and then use their aggregated results to make a prediction. This means that error or bias in any one tree won’t have a large negative effect on our predictions.

Animated Decision Tree — D3

Based on the results of the decision tree model, I continue to use MSE as the main criterion, and impose no limit on the maximum number of features to be used in each tree. I cast a wide net for the other parameters so that we can narrow in on the optimal values in the next iterations.

Due to limits on time and computing power, we randomly try 10 combinations of parameters from the grid instead of exhaustively testing every option.

from sklearn.ensemble import RandomForestRegressor

# Create the random forest pipeline
pipe_forest = Pipeline([('scl', StandardScaler()),
                        ('forest', RandomForestRegressor(random_state=70,
                                                         n_jobs=-1,
                                                         warm_start=True))])

# Create the grid parameter
grid_forest = [{'forest__n_estimators': [100, 150, 200],
                'forest__max_depth': [1, 5, 10, 15, 25],
                'forest__min_samples_split': [5, 10, 25, 50],
                'forest__min_samples_leaf': [5, 10, 25, 50],
                'forest__criterion': ['mse'],
                'forest__max_features': [None]}]

# Create the randomized search, with the forest pipeline as the estimator
gridsearch_forest = RandomizedSearchCV(estimator=pipe_forest,
                                       param_distributions=grid_forest,
                                       scoring=['r2', 'neg_root_mean_squared_error'],  # Include RMSE in results
                                       refit='r2',                # Choose best model based on R^2
                                       return_train_score=True,   # Include training results in cv_results
                                       cv=5,                      # Use 5 folds in CV process
                                       n_iter=10,                 # Try 10 hyperparameter combinations
                                       n_jobs=4,                  # Use parallel computing
                                       verbose=8)                 # Give updates on progress during fitting

Best RF Model — Iteration #1

The best model from this iteration (see above) produces an improved R-Squared value of 32.86%, but there are issues if you look closer…

Top 5 Models — Iteration #1

In the table above showing the top five model parameters, we can see a problem in the two right-hand columns: our model is overfitted to the training data! Overfitting simply means that the model does well on the data used to train it, but struggles when we introduce new data. This is evident in the gap between the training and validation R-Squared scores.
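One quick way to see this gap, assuming the fitted gridsearch_forest object from above (the column names come from scikit-learn's cv_results_ when using multi-metric scoring with return_train_score=True):

import pandas as pd

# Compare mean training vs. validation R-Squared for each hyperparameter combination
results = pd.DataFrame(gridsearch_forest.cv_results_)
results['r2_gap'] = results['mean_train_r2'] - results['mean_test_r2']
print(results[['params', 'mean_train_r2', 'mean_test_r2', 'r2_gap']]
      .sort_values('mean_test_r2', ascending=False)
      .head())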

We can combat the overfitting issue by increasing the possible values for the min_samples_split and min_samples_leaf hyperparameters. This means that each tree will make fewer splits. We can also increase the total number of trees in the forest.

After a few more rounds of tuning we begin to reach a plateau.

# Narrow the grid based on the previous results
grid_forest_v3 = [{'forest__n_estimators': [200],
                   'forest__max_depth': [35, 40, 45],
                   'forest__min_samples_split': [200, 300],
                   'forest__min_samples_leaf': [45, 50, 55],
                   'forest__criterion': ['mse'],
                   'forest__max_features': [None]}]

The best model from the grid above has a slightly lower R-Squared score of 32.1%, but as the table below shows, the overfitting problem is much less severe when using the parameters in this grid. In the first iteration, the difference between training and validation scores was around 10 percentage points; in this iteration we’ve cut that difference down to about 4 points without a large loss in R-Squared. This is a good indicator that the best model here will perform better when presented with new data.

Models — Iteration #3

Final Model Results:

  • Cross Validation: The model had an R-Squared value of 32.1%, meaning that 32.1% of the variation in revenue could be explained by the model. This is lower than optimal and will be improved with further iterations of this project.
  • Test Data: When the model is applied to the held-out test data, the R-Squared value is 32.6%, in line with the cross-validation score. This indicates that the earlier overfitting issues were resolved. The R-Squared value is still not optimal; the primary goal of further iterations of this project will be to improve this metric.
  • Feature Importances: According to the final model, the number of pages viewed during a visit to the online store is by far the most important predictor of spending.
import pandas as pd

# Pull the feature importances from the final fitted random forest and keep the top 10
importances = pd.Series(pipe_best_forest['forest'].feature_importances_, name='importance')
feature = pd.Series(list(X_train_ohe.columns), name='feature')
feature_importances = pd.concat([feature, importances], axis=1)
top_features = feature_importances.sort_values(by='importance', ascending=False).head(10)

In fact, the top 4 most important features all have to do with the user’s experience on the website. Other important features represent economic conditions, the price of the most recently viewed item, what kind of device the customer is using to browse, and how the customer got to Google’s store in the first place.

Conclusions & Recommendations:

  • Increase Page Views: Optimize UI/UX with a focus on increasing the number of pages viewed on average. This could include decreasing load times and improving recommendation systems.
  • Encourage Additional Store Visits: Repeat visitors are more likely to make a purchase. Potential customers could be encouraged to return by offering limited time discounts on their next purchase.
  • Collect Additional Data: The predictive capabilities of the model are limited by the nature of the data available. Notably, information on which products were added to the shopping cart or purchased is unavailable. This additional information is likely a strong predictor of spending, and would allow for more in depth customer segmenting and analysis.
  • Personalize Marketing Efforts: This model could be used to personalize marketing efforts. For example, if the model predicts that a customer will not make a purchase, they could be offered a time-sensitive discount to incentivize spending. Additionally, if a customer is predicted to make a $150 purchase, Google could offer rewards points or free shipping on orders over $200 (a toy sketch follows below).
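As a purely illustrative sketch of that last idea; the pipeline name, the input batch, and the dollar thresholds are all hypothetical:

# Route visitors to offers based on predicted spending (thresholds are arbitrary examples)
predicted_spend = pipe_best_forest.predict(X_new_visits)  # X_new_visits: a hypothetical batch of new visits

for spend in predicted_spend:
    if spend < 10:
        offer = 'time-sensitive discount to incentivize a purchase'
    elif spend < 200:
        offer = 'free shipping on orders over $200'
    else:
        offer = 'no incentive needed'
    print(round(spend, 2), offer)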

Limitations:

The Data:

  • Some data points were missing and had to be removed or imputed.
  • Some data available to Google was not available to us.
  • Data is not generalizable to all online stores.
  • It is unknown if this is an exhaustive set of all visits to the store within the 2016–2018 time frame.

The Model:

  • Predictive power is somewhat limited using the current best model.
  • Exhaustive tuning is impossible due to limits on computational power.
  • Predictions get less accurate as spending increases.

Future Improvements:

Additional Data:

  • Using more data to train the model could yield better results.
  • Accessing additional data from Google could allow for more precise modeling.
  • Increasing the time frequency of economic data (e.g. days → hours) could add additional predictive capability. Additionally, a ‘lag’ variable could be added to the data (sketched below).
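A rough illustration of what such a lagged indicator might look like, assuming a daily economic table (econ) and a visit-level table (visits), both with made-up column names:

import pandas as pd

# Add one-day and one-week lags of the S&P 500 close to the daily economic data
econ = econ.sort_values('date')
econ['sp500_lag_1d'] = econ['sp500_close'].shift(1)
econ['sp500_lag_7d'] = econ['sp500_close'].shift(7)

# Merge the lagged indicators onto the visit-level data by date
visits = visits.merge(econ[['date', 'sp500_lag_1d', 'sp500_lag_7d']], on='date', how='left')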

Improved Modeling:

  • Generalizing the model to use only more common features would allow the model to be deployed for online stores other than Google’s.

Model Deployment:

  • Deploy a version of the model as a web application using Flask or a similar tool.

Thanks for reading! Make sure to check out the GitHub page for this project to see the full code, analysis, and accompanying slide deck.