
Starbucks Capstone Challenge

Insight and analysis into the demographics that utilise promotional offers, along with predictive modelling.

Project Overview

This project is the final Capstone project of the Udacity Data Science Nanodegree. A number of projects were offered for the Capstone; the one chosen was to analyse data from Starbucks.

Starbucks is a large multinational company that sells coffee, tea and other beverages, mainly through its chain of stores and drive-thru outlets. Customers can also purchase and collect items using a mobile app, and it is data from this mobile app that is analysed in this project.

Customers can register to become a member of Starbucks and consequently be sent a variety of promotional offers. Some customers will view an offer when they receive it, whilst others will not. Customers can then go on to 'complete' the offer, provided it is used within its validity period. However, it is also possible for a customer to complete an offer without having received or viewed the promotion.

Offers provided to customers range from informational offers to 'Buy one get one free' (BOGO) and Discount offers. Each offer has a set 'difficulty', 'reward' and 'duration'. These represent the amount the customer must spend to utilise the offer, the discount or reward received for utilising it, and the period of time for which the offer is valid.

There were 3 files of data available within this project:

· Profile: demographic information about the customers

· Portfolio: a look-up of the offers, with details of the difficulty, reward and duration of each offer

· Transcript: a file containing information on spend, and the status of offers sent to customers

The aim is to analyse the different demographics of customers and relative success of offers provided to them.

NB: This is a Capstone project completed as part of a learning programme; the content and approach have not been peer-reviewed, therefore the results cannot be confirmed as accurate.

Problem Statement

In this project, two questions are addressed. The first concerns which customers respond best to which offers. For the second, accepting that a customer can complete an offer without having viewed or received it, the focus is refined to the population of customers that complete an offer having viewed it. The questions of interest in this project were therefore:

1. What demographics respond best to offers?

2. Is it possible to predict if a customer will complete an offer once it is viewed?

Question 1 can be answered through data exploration, using descriptive statistics and graphs to identify which customer groups respond more favourably to certain offer types.

To answer question 2, a predictive model will be constructed that returns the likelihood of a customer completing an offer once it is viewed. As part of the output, feature importance will also be calculated, providing further information on customer types and their response to offers.

Metrics Used

For the descriptive analysis, the mean and median were used when comparing groups; where distributions appeared skewed, the median was preferred over the mean. Percentages were used to allow comparison between categorical groups.

A predictive model was created, with Accuracy, F1 and AUC all being used to assess the adequacy of the model. Feature importance and confusion matrices were also presented.

Choice of Metric: A range of metrics was used in this project. The target variable is a binary response, and the proportion of customers achieving the True label was found to be 50.1%, so the classes being predicted are balanced. Accuracy is therefore the primary evaluation measure; however, precision is also considered, as it measures the model's ability to identify True Positive cases. Within the confusion matrix, the False Negative rate was also used to evaluate the model. The False Negative rate is of interest because it captures cases where the model believes a customer will not complete an offer when in fact they do. From a business perspective this is a missed opportunity, so in this project a good model would have a low False Negative rate.

A brief description of each metric is below:

Accuracy: The accuracy of the model represents its ability to predict the correct label, whether positive or negative. It is the total number of cases predicted correctly divided by the total number of cases; a high accuracy indicates the model is good at identifying the true labels.

F1: The F1 score is the harmonic mean of precision and recall, i.e. F1 = 2 × (Precision × Recall) / (Precision + Recall). Precision represents a model's ability to identify the true positives out of all predicted positive cases, i.e. how well the model separates True from False Positives. Recall, also referred to as 'Sensitivity', is the model's true positive rate.

AUC: The Area Under the (ROC) Curve represents a model's ability to distinguish between classes. An AUC of 0.5 indicates that the model is no better than random chance at distinguishing between cases; the closer the value is to 1, the better the model.

Feature Importance: used to indicate which features within the model were most useful in the prediction, based on the Gini score, or mean decrease in impurity.
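
As a minimal sketch, these metrics could be produced with scikit-learn as follows; 'model', 'X_test' and 'y_test' are placeholders for a fitted classifier and a held-out test set, not the exact names used in the project.

from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             confusion_matrix, classification_report)

preds = model.predict(X_test)              # hard class predictions
proba = model.predict_proba(X_test)[:, 1]  # probability of the True class

print('Accuracy:', accuracy_score(y_test, preds))
print('F1:', f1_score(y_test, preds))
print('AUC:', roc_auc_score(y_test, proba))
print(classification_report(y_test, preds))

# The False Negative rate can be read from the confusion matrix:
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print('False Negative rate:', fn / (fn + tp))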

Implementation

Analysis for this project was carried out in Jupyter v6.1.4, available through Anaconda v1.10.0, using Python 3. Standard packages such as pandas, NumPy and datetime were used, with matplotlib and seaborn for creating charts. Several scikit-learn (sklearn) packages were used for modelling: classification_report and confusion_matrix from sklearn.metrics; train_test_split and GridSearchCV from sklearn.model_selection; and the classifiers LogisticRegression and RandomForestClassifier.
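
The import block for this analysis might therefore look as follows (module paths are the standard scikit-learn locations):

import pandas as pd
import numpy as np
import datetime

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier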

Data

Three datasets were provided: Profile, Transcript and Portfolio. The files were supplied as .json and imported into a Jupyter notebook. A brief description of each file is below:

Portfolio: The portfolio file contained information regarding the types of offers. In total there were 10 offers provided by Starbucks, each with a different difficulty, reward and duration.

There were 10 rows and 6 columns in this file. Each offer had a unique offer code, which was used to link to the transcript file to show which offer a customer had received and utilised. The offer_type variable contained a label describing the offer, for instance 'BOGO' ('Buy one get one free') or 'Discount'. The 'Reward', 'Duration' and 'Difficulty' of the offer were captured in fields of the same names; these were concatenated together, along with the offer type, to create a new categorical field, 'Offer_difredu'. The channels an offer was available on were held as a list within the 'Channel' field.
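
A possible construction of 'Offer_difredu'; the exact column names and separator are assumptions based on the description above.

portfolio['Offer_difredu'] = (portfolio['offer_type'].astype(str) + '_'
                              + portfolio['difficulty'].astype(str) + '_'
                              + portfolio['reward'].astype(str) + '_'
                              + portfolio['duration'].astype(str))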

Figure 1: Portfolio data in raw form

Profile: The profile dataset contained information about Starbucks' members, with 17,000 rows and 5 columns. Within the data there was an 'id' variable representing the customer. This was unique within the data, therefore the analysis can be considered a representation of 17,000 customers.

Within the dataset there was information on the customer’s ‘Age’, ‘Income’ and ‘Gender’. In addition there was a field representing the date when the customer became a Starbucks member.

Transcript: There were 306,534 rows in this file, representing the different stages of an offer cycle as well as any transactions made. There were 4 columns in this dataset. The 'event' field captures the stage of the offer and is made up of 4 levels.

Table 1: Frequency of Events within Transcript

The dataset also contained a field called 'value', where the data within each cell was held as a dictionary. Python's .get() method was used to access these dictionaries. The contents were either the unique id of the offer sent to the customer, or the amount of money the customer transacted with Starbucks. Both were extracted to create two new fields within the data, 'amount' and 'offer id'.
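
A sketch of the extraction; the dictionary keys follow the description above, though the keys in the raw data may vary (for instance 'offer id' vs 'offer_id').

transcript['offer id'] = transcript['value'].apply(lambda d: d.get('offer id'))
transcript['amount'] = transcript['value'].apply(lambda d: d.get('amount'))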

'Time' was a marker, measured in hours, for when an activity occurred. For every customer, time starts at 0; each time an offer is received, viewed or completed, or a transaction is made, the 'time' variable captures the number of hours since the start at which the activity occurred.

A unique id, referred to as 'person', was within the data; this was used to join to the profile information.

Figure 2: Transcript data in raw form

Data Pre-Processing

Significant wrangling of the data was required to ensure that missing information was treated and that offers and their associated transactions fell within the validity period. Manipulation was also required to form the data into a single set of information that could be used for exploratory data analysis and modelling. A number of features were also engineered.

Within the transcript file, it was noted that the offer id was only present when an offer was received, viewed or completed; it was not present when a transaction was made. By ordering each customer's rows by time, it was possible to see the chronology of the customer's events and when the transactions occurred. For each customer it was assumed that, once an offer was issued, its effect remained valid until the next offer was sent. Pandas' forward-fill method was used to complete the missing rows so that the offer id was present across all rows until a new offer was issued.
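
An illustrative forward fill; once events are sorted by customer and time, the offer id is carried forward onto the rows that lack one until the next offer appears. Column names are assumptions.

transcript = transcript.sort_values(['person', 'time'])
transcript['offer id'] = transcript.groupby('person')['offer id'].ffill()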

By taking this approach, activity for each offer could be assigned into windows or 'blocks'; the total monies transacted during each block could then be calculated, along with the status of the offer issued at the start of the block (i.e. completed, or viewed). Time within each block needed to be adjusted so that each block started at 0, in order to show whether any transactions associated with the block occurred within the validity period of the offer. Pandas' .shift() method was used to consider the lagged time entry.

Features were then created to capture the total transaction amount within the offer validity period and the amount outwith it. These features were called 'in_offer_amt' and 'out_offer_amt'.
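
A simplified sketch of this block logic: a new block starts on each 'offer received' event, time is re-based to 0 within the block, and transaction amounts are split by whether they fall inside the validity window. It assumes 'duration' (in days) has already been joined on from the portfolio file, and that 'time' is in hours; column names are assumptions.

transcript = transcript.sort_values(['person', 'time'])
transcript['amount'] = transcript['amount'].fillna(0)

# Each 'offer received' event opens a new block for that customer.
new_block = (transcript['event'] == 'offer received')
transcript['block'] = new_block.groupby(transcript['person']).cumsum()

# Re-base time so that each block starts at 0.
block_start = transcript.groupby(['person', 'block'])['time'].transform('min')
transcript['block_time'] = transcript['time'] - block_start

# Split spend by whether it occurred within the offer's validity period.
in_window = transcript['block_time'] <= transcript['duration'] * 24
transcript['in_offer_amt'] = transcript['amount'].where(in_window, 0)
transcript['out_offer_amt'] = transcript['amount'].where(~in_window, 0)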

pd.get_dummies() was used on the 'event' field within the transcript to create binary 0/1 indicator variables. This allowed each customer, for each block, to have one row of information regarding the status of the offer that had been issued. The profile, transcript and portfolio files were then merged together on the appropriate id variables using pandas' pd.merge(). This resulted in a dataframe containing 72,658 rows and 29 columns.
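
A sketch of the dummy-coding and merge steps; the key names are assumptions based on the file descriptions above.

transcript = pd.get_dummies(transcript, columns=['event'])

df = transcript.merge(profile, left_on='person', right_on='id', how='left')
df = df.merge(portfolio, left_on='offer id', right_on='id', how='left',
              suffixes=('', '_offer'))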

Two features that were engineered were 'tenure' and 'cum_offer_amt'.

Tenure: It was of interest to investigate whether the length of time that a customer had been a member was significant in determining whether or not a customer would complete an offer once it was viewed. This is a reasonable hypothesis, and tests whether there is brand loyalty, i.e. whether those that have been with Starbucks the longest are more likely to complete offers than those that have not. It was not known from the data when the offers were issued, therefore the 'tenure' of the customer was taken as the difference between the date of membership and today's date. The field 'became_member_on' had to be converted to a datetime variable to make this calculation.
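
A possible tenure calculation; 'became_member_on' is assumed here to be stored as an integer such as 20170212.

profile['became_member_on'] = pd.to_datetime(profile['became_member_on'],
                                             format='%Y%m%d')
profile['tenure'] = (pd.Timestamp('today') - profile['became_member_on']).dt.days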

'cum_offer_amt': The cumulative offer amount. This feature was created to test the hypothesis that the total amount of money spent leading up to an offer is significant in predicting whether the customer completes the next offer. Like 'tenure', this variable was intended to infer aspects of 'loyalty': the more someone spends with a brand, the more likely they are to complete an offer the next time one is issued.
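
A sketch of this cumulative previous-spend feature: for each customer, the transaction amounts on all earlier rows are summed, with the current row excluded by shifting. Column names are assumptions.

df = df.sort_values(['person', 'time'])
df['cum_offer_amt'] = (df.groupby('person')['amount']
                         .transform(lambda s: s.cumsum().shift(fill_value=0)))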

Imputation

For gender, where there were 2,175 cases with a missing entry, these were replaced with 'Unk'. If a customer had not been issued an offer, then the missing entry was recorded as 'no offer'. Similarly, missing entries for 'Difficulty', 'Duration' and 'Reward' were replaced with 0, on the basis that if there was no offer there could be no duration, reward or difficulty.

For 'out_offer_amt' there were 58,495 missing cases; these were set to 0, as were missing values in the other features relating to transaction monies. It would not be sensible to impute these variables with the mean, because the missing value occurs when there is no spend activity; imputing with the mean would provide a false statistic for the variable.
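
The imputation described above might be implemented along these lines; column names are assumptions.

df['gender'] = df['gender'].fillna('Unk')
df['offer_type'] = df['offer_type'].fillna('no offer')
for col in ['difficulty', 'duration', 'reward',
            'in_offer_amt', 'out_offer_amt']:
    df[col] = df[col].fillna(0)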

Complete: It was noted that approximately 12% of the profile data represented cases where the 'Age' was 118. On further inspection, these were found to relate to customers whose profiles were incomplete, with no gender or income recorded. These cases were marked as 'incomplete', and those with all demographic information as 'complete'.

The incomplete cases were explored to understand any useful or interesting aspects of the data, with the aim of determining whether this data should be retained with imputation. Exploratory data analysis showed that customers with incomplete profiles also have very low offer completion rates; similarly, their mean transaction amounts were low.

Table 2: Percentage of offers completed and Transaction amounts by offer type for incomplete profiles

It was decided that, as these customers had missing information for Gender, Age and Income, along with low engagement with Starbucks, this group should be removed. Consequently, the project question of 'customers likely to complete offers once they are viewed' considers only the population of customers with a complete profile.

Final data set

After cleansing, imputation and feature creation, pd.get_dummies() was used on the remaining categorical fields to create binary 0/1 indicator variables. The original variables were retained for use in the Exploratory Data Analysis, with the dummy indicator variables being used within the model. The final dataset contained 63,386 rows and 47 columns. A snapshot of this dataframe is shown below for one customer.

Figure 3: Final dataset — Starbucks

Exploratory Data Analysis

The profile data originally contained 17,000 customers; however, after removing those with incomplete records and merging with the Transcript and Portfolio data, the total number of customers considered in the analysis was reduced.

Additionally, as customers can be provided with more than one offer throughout the experiment, the number of completions each customer has can also be greater than one. Instead of counting distinct customers in this analysis, we are considering the number of 'touch points'. A touch point is any interaction Starbucks has had with a customer: if a customer has been sent 5 offers over the experiment, then there have been 5 'touch points' with this customer, and there could therefore be 5 completions.

This analysis therefore considered 14,825 customers and 63,386 touch points.

Figure 4: Drop-off by offer stage

Figure 4 highlights the drop-off in touch points by offer stage. Starting with 63k touch points, 96% received an offer. Of the 61,245 offers received, 45,042 were viewed (73.5%). Within this project we are measuring the proportion of viewed offers that led to completion; from the flow chart it can be seen that 22,931 completions (50.1%) resulted once an offer was viewed.

BOGO ('Buy one get one free') was the most frequently issued offer, with 13,042 (88%) distinct customers having received or used this offer. There were 2,058 (13.8%) distinct customers that did not receive any offer.

Which offers were the most successful?

In this project we are considering the % completion once an offer is viewed. Each offer has its own 'Difficulty' (the amount required to be spent to utilise the offer), 'Reward' (the value returned to the customer) and 'Duration' (the length of validity).

The two informational offers had the lowest percentage of completion once viewed, at 7% and 8.7% respectively. In comparison, the BOGO offers where the value of the reward was greater than the difficulty of utilising the offer had the greatest percentage of completion once viewed.

Table 3: Percentages of offers completed once viewed by offer type

Which demographics respond best to offers?

Summary statistics and histograms were created to understand which demographics tend to complete offers once they are viewed.

From earlier analysis it was already determined that customers with incomplete profiles tend to have lower completion rates once an offer is viewed.

The average age of those that completed offers was 55, compared to 53 for those that did not complete offers, suggesting age is not important in whether or not a customer completes an offer.

Gender: It was found that the response to an offer type after it was viewed differed depending on whether the customer was Male, Female or 'Other'. Informational offers had the lowest % completion after being viewed, with little difference between Male and Female customers. For BOGO and Discount offers, however, Females responded noticeably better than Males: the % completion after viewing for Females that received Discount offers was 71.5%, compared to 60.1% for Males.

Table 4: Percentage of offers completed after view by gender and offer type

Tenure: A feature was created describing customer tenure: the length of time between when the customer became a Starbucks member and 'today'.

The distribution for those that did not complete offers is skewed more towards shorter tenures than for those that did, suggesting that those that completed offers after viewing them tend to have been members of Starbucks for longer.

Figure 5: Distribution of Tenure by offer completion

This is evidenced further by the summary statistics. On average, those that completed offers had a tenure around 150 days greater than those that did not (1,676 days compared to 1,526 days). As the distribution is not symmetrical, the median was also considered, and a similar message was apparent: the median tenure for those that did not complete offers was 1,366 days, compared to 1,613 days for those that did.

Cumulative amounts: One feature created was the total cumulative amount of money spent prior to an offer being issued ('cum_offer_amt'). It was found that customers that complete offers after viewing them tend to have a higher cumulative spend with Starbucks: the mean and median total cumulative amounts spent prior to an offer being issued were $35 and $11 for those that did not complete offers, compared to $63 and $39 for those that did.

Findings

From this exploratory analysis we can conclude that the demographics that tend to respond best to offers are: those with complete records, Females, those that have been members for longer, and those that have a larger cumulative spend with Starbucks.

Results

In answering the question 'Is it possible to predict if a customer will complete an offer once it is viewed?', two models were chosen. The first was an initial baseline model using Logistic Regression; the second was a Random Forest.

One of the benefits of logistic regression is that it is possible to examine the coefficient of each explanatory variable within the model to understand its effect upon the target variable. While this aids interpretation, one of the principal benefits of Random Forests over regression models is that they do not assume a linear relationship in the data, and therefore tend to outperform Logistic Regression. Given that this problem considers a binary response (customers that complete offers once they are viewed vs. those that do not), Logistic Regression was viewed as a good starting point in answering this question.

The data was split at random into two groups: a training set containing 70% of the data and a test set containing the remaining 30%. Both models were fitted to the data, with Accuracy, as well as other performance metrics such as F1 and AUC, compared to determine which modelling approach provided the best set of results. Both the Logistic Regression model and the Random Forest were initially run with default settings to understand the level of performance. The default settings for the Random Forest model are below:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

A function was also written to train and fit both models without repeating code (following DRY principles). This function also produced a classification report, confusion matrix and AUC plot.
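
A minimal sketch of such a helper is below; the function and variable names are illustrative rather than the exact code used in the project.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

def fit_and_evaluate(model, X_train, y_train, X_test, y_test):
    """Fit a classifier, print a classification report and confusion
    matrix, and plot the ROC curve with its AUC."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]

    print(classification_report(y_test, preds))
    print(confusion_matrix(y_test, preds))

    fpr, tpr, _ = roc_curve(y_test, proba)
    plt.plot(fpr, tpr, label='AUC = {:.2f}'.format(roc_auc_score(y_test, proba)))
    plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.show()
    return model

# 70/30 random split, then evaluate the baseline model:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
baseline = fit_and_evaluate(LogisticRegression(max_iter=1000),
                            X_train, y_train, X_test, y_test)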

The best performing model then underwent hyperparameter tuning to identify the set of input parameter values that maximises the evaluation metrics.

Prior to modelling, correlations between features were examined to identify possible sources of multicollinearity.

Baseline Logistic Regression Model

Figure 6: Area under the curve for baseline model

An initial baseline Logistic Regression model was created, resulting in an accuracy score of 0.58 and an average F1 score of 0.58. This model also had an AUC of 0.62, meaning there is a 62% chance that the model ranks a randomly chosen customer who completes an offer above one who does not. Looking at the classification report, precision for this model was 0.59, meaning that of the cases predicted positive, only 59% were truly positive. The Logistic Regression model also had a False Negative rate of 33%: these are cases where the model does not believe the customer will convert an offer, but they actually do. By not issuing offers to these customers, this would be a missed opportunity for Starbucks.

While this model does show some ability to differentiate between classes, a better model was sought.

Random Forest

A Random Forest model was created, which improved on the results of the Logistic Regression model, returning an Accuracy score of 73% and an F1 score of 0.74. The AUC also increased to 0.8, indicating a considerably better ability to distinguish between those that complete offers once they are viewed and those that do not.

The confusion matrix also revealed that the False Positive rate decreased from 49% with the Logistic Regression model to 24% with the Random Forest, and precision increased to 0.74. Both metrics indicate that this model is better able to identify true positives as well as true negatives: the classification of cases as true when they are in fact false had been reduced.

Random Forest with Tuning

In comparing the Random Forest and the baseline Logistic Regression model, it was found that the Random Forest produced better results. In tuning the model, the parameter n_estimators was used; this has a default value of 10.

Figure 7: Area under the curve for final model

To tune the model, scikit-learn's GridSearchCV was used to identify the optimal number of estimators. A range of values was provided, from the default of 10 up to 150.
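
The tuning step might look like the following sketch; the exact grid of values tried, beyond the stated range of 10 to 150, is an assumption.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [10, 50, 100, 150]}  # assumed grid within the stated range
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring='accuracy', cv=5)
search.fit(X_train, y_train)
print(search.best_params_)  # the project found 150 estimators to be best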

It was found that 150 estimators provided the best set of results, with Accuracy, AUC and F1 all increasing. The final model had an Accuracy of 75.7%, meaning that the model is correct around 76% of the time in its predictions, with an AUC of 0.83.

Variable Importance

Variable importance was also output for the final model, highlighting the features of most use within the model. Tenure and cum_offer_amt were the two most important features, along with Income. This reinforces the finding from the Exploratory Data Analysis that there is a degree of brand loyalty: those that have spent the most with, and been members of, Starbucks the longest are more likely to complete offers once they are viewed.

Figure 8: Feature Importance from final model

Features relating to informational offers also appeared important, though this is likely due to their low completion rate compared to other offer types.
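
A sketch of how the importances can be extracted from the fitted forest; X_train is assumed to be a pandas DataFrame so that column names are available, and 'search' is the fitted grid search from the tuning step above.

import pandas as pd

importances = (pd.Series(search.best_estimator_.feature_importances_,
                         index=X_train.columns)
                 .sort_values(ascending=False))
print(importances.head(10))  # e.g. tenure, cum_offer_amt, income, ...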

Model Evaluation and Validation

It is important to demonstrate the robustness of the final model. One method of doing this is to perform k-fold cross-validation: for the model to be stable and robust, its accuracy should not deteriorate greatly when different subsets of the training data are used for training and validation, thus verifying that the model has not overfit the data.

The cross_val_score function within scikit-learn's model_selection module was used, with the number of folds (k) set to 10. It was found that, across the 10 folds, the accuracy of the tuned Random Forest model was stable: accuracy for each fold ranged from 74.9% to 76% on the training set, compared with an Accuracy of 75.7% on the test set. Although there are small fluctuations in Accuracy, the model is consistent, with each fold also being comparable to the Accuracy obtained on the test set.

Figure 9: K-fold cross-validation for Random Forest model with tuning
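
A minimal sketch of the validation step described above, reusing the tuned model from the grid search:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(search.best_estimator_, X_train, y_train,
                         cv=10, scoring='accuracy')
print(scores.round(3))               # accuracy per fold
print(scores.mean(), scores.std())   # stability across folds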

Conclusion

The Random Forest model with 150 estimators performed best, producing an accuracy score of 0.75, an AUC of 0.83 and an F1 score of 0.76. The model has a good ability to separate the classes. The model outputs also highlighted key features of importance: 'tenure' and 'cum_offer_amt' were shown to be the top two features in determining whether or not a customer will complete an offer once it is viewed. During model validation it was also demonstrated that the performance of this model is stable. From the exploratory analysis it was shown that customers with a greater tenure tend to complete offers, as do those with a greater cumulative spend with Starbucks prior to an offer being issued. This suggests some aspect of brand loyalty.

This concept is further reinforced when considering customers that have incomplete profiles: it was found that 12% of customers have incomplete profiles, with the percentage of offers completed after viewing being low (<10%).

Summary of Project

In this project I have:

· Cleaned and tidied three datasets relating to Starbucks customers, their transactions and the offers provided by Starbucks, then joined them together to create one dataset used for exploration and modelling

· Created new features within the data, such as tenure and cumulative previous spend, to reveal further insight

· Explored the data, producing summaries and graphs showing the demographics more likely to complete an offer once it is viewed.

· Built and refined a model that predicts the customers that are likely to complete an offer if it is viewed.

Reflection

There were many aspects of this project that I enjoyed and found interesting. I particularly liked developing new features and thinking critically about why a customer may accept or reject an offer. In developing new features such as 'cum_offer_amt' (the cumulative amount a customer has spent with Starbucks prior to their next offer being issued), I was able to increase the predictive accuracy of my model. This was also shown to be the second most important feature in the model, along with 'tenure', which was another engineered feature.

On challenges, the 'time' variable did not reset each time an offer was issued; a variable representing the length of time since an offer was issued, starting at 0 for each new offer, therefore had to be built. This allowed a fair assessment of whether a customer's transactions were inside or outside the validity period. The method I took was to iterate through each customer, constructing 'windows' of activity, and then measuring time through each window. Doing this for each customer with DataFrames was time consuming. One method I employed to reduce time spent waiting on code was to develop the code against a small random sample; this allowed progress in writing the program, with the full dataset then being run against the completed and tested program.
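
A sketch of that sampling approach; sampling whole customers (rather than rows) preserves each customer's full event history so the window logic still works. The sample size here is illustrative.

# Develop against ~500 randomly chosen customers, then re-run on the full data.
sample_ids = transcript['person'].drop_duplicates().sample(n=500, random_state=42)
dev_transcript = transcript[transcript['person'].isin(sample_ids)]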

Improvement

There were two areas identified within this project that could be developed to improve the findings of the model. The first is the application of normalisation/standardisation before fitting the model. 'Age', 'Income' and 'cum_offer_amt' are all continuous variables, but are measured on different scales. Where some variables have very large values relative to the other input variables, they can dominate some types of algorithm, skewing the results towards those variables. To allow fair comparison, scikit-learn's StandardScaler() could be applied; this subtracts the mean from each observation, then divides by the standard deviation.
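
A possible standardisation step; the scaler is fitted on the training data only so that no information leaks from the test set. Column names are assumptions.

from sklearn.preprocessing import StandardScaler

num_cols = ['age', 'income', 'cum_offer_amt']  # assumed continuous columns
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])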

Within this data there is also a lack of demographic information to help understand the types of customer that respond to offers. Additional features that could be created to further understand behaviour are the 'Number of previous completions' or 'Number of previous rejections'. The hypothesis to be tested would be that the number of previous completions a customer has had influences whether they accept the next offer they are sent; this could be used to infer some aspect of loyalty, as those with more previous acceptances may have a higher likelihood of completing their next offer. A further development would be to consider the price-benefit trade-off: are customers more attracted to certain offers when purchasing certain items?

For more details, as well as the code, visit my GitHub page
