11/27/17
Kaggle, Claims, Boosting, Genetic Algorithms, FFMs, Entity Embeddings
Lately, I have focused my sabbatical time on learning through Kaggle challenges. These forums are a treasure trove of techniques you are unlikely to find in your favorite machine learning reference. Here, I will share some of what I have learned over the course of one of the largest Kaggle contests, the Porto Seguro competition.
My first Kaggle competition was the epilepsy prediction task with EEG data. Though there is no free lunch, the Random Forest was widely regarded as the go-to for establishing a quick baseline at the time. More recently, XGBoost reigns for its efficiency and speed (compile for GPU), while Microsoft's LightGBM and Yandex's CatBoost offer alternative implementations with Sklearn-like interfaces. Baidu's Regularized Greedy Forest offers another twist on tree-based, space-splitting algorithms. When these strategies are successful, each is worth considering as a member of an ensemble.
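As a concrete starting point, a quick baseline comparison along these lines might look like the sketch below; scikit-learn's RandomForestClassifier and GradientBoostingClassifier stand in for the dedicated boosting libraries, and the synthetic dataset is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a tabular competition dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for name, model in [
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("gradient boosting", GradientBoostingClassifier(random_state=0)),
]:
    # Cross-validated AUC gives a quick, comparable baseline metric
    scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
    print(f"{name}: mean AUC {scores.mean():.3f}")
```

Swapping in XGBoost, LightGBM, or CatBoost is then a one-line change, since all expose a similar fit/predict interface.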
But ensembles benefit greatly when the constituent predictions utilize different learning strategies. With multiple reliable models each failing on different samples (implied by lower correlation between their outputs), we can reduce variance through simple averaging or stacking. Still, some competitors gain extra mileage with exotic feature engineering through genetic algorithms. These mysterious features are an example of nonlinear symbolic regression, effectively pulling off the kernel trick through a stochastic evolutionary search.
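To make the correlation argument concrete, here is an illustrative blend of two dissimilar learners (a linear model and a tree ensemble) on synthetic data; the models and dataset are stand-ins, not anything from the actual competition.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Two models with different inductive biases
p1 = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
p2 = RandomForestClassifier(random_state=1).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Lower correlation between outputs suggests the models fail on different samples
print("correlation:", np.corrcoef(p1, p2)[0, 1])
print("model 1 AUC:", roc_auc_score(y_te, p1))
print("model 2 AUC:", roc_auc_score(y_te, p2))
print("average AUC:", roc_auc_score(y_te, (p1 + p2) / 2))
```

Stacking replaces the plain average with a meta-learner fit on out-of-fold predictions, but the simple blend above already captures the variance-reduction idea.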
One of the pre-Kaggle successes for stacking methods came through the Netflix Prize, where top contestants established matrix factorization as a powerful model for collaborative filtering tasks. Michael Jahrer was one of those pioneers and, incidentally, has dominated the Porto Seguro competition, maintaining a substantial margin over the closest competitors. In this contest, matrix factorization is represented by libffm rather than alternating least squares. Field-aware Factorization Machines (FFMs) shine on the highly imbalanced click-through-rate (CTR) prediction task; this has been demonstrated over several recent contests, and the technique is robust enough to be deployed in production.
Porto Seguro's features come fairly well anonymized, though some creative application of Wolfram|Alpha can be quite revealing. In the absence of the additional context a data dictionary would provide, it is reasonable to assume that pairwise feature interactions may explain much of the variance in the target distribution. FFMs are biased to view data through exactly this lens, explicitly modeling pairwise feature interactions.
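That pairwise bias can be seen directly in the FFM scoring rule: each feature keeps one latent vector per field, and the interaction between two features uses each one's vector for the other's field. The following is an untrained numpy illustration with made-up dimensions, not libffm itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_fields, k = 6, 3, 4
field_of = np.array([0, 0, 1, 1, 2, 2])         # which field each feature belongs to
# One latent vector per (feature, field) pair; learned by SGD in practice
W = rng.normal(size=(n_features, n_fields, k))

def ffm_score(x):
    """Sum of field-aware pairwise interactions for a dense sample x."""
    score = 0.0
    active = np.nonzero(x)[0]
    for i, j1 in enumerate(active):
        for j2 in active[i + 1:]:
            # j1's vector for j2's field, dotted with j2's vector for j1's field
            score += (W[j1, field_of[j2]] @ W[j2, field_of[j1]]) * x[j1] * x[j2]
    return score

x = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])    # one-hot-style sample
print(ffm_score(x))
```

A plain factorization machine would use a single latent vector per feature; the per-field copies are what give the model its "field awareness".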
On the other hand, too many features can be problematic. In Porto Seguro, I encountered Boruta feature elimination through the experiments of one prolific Kaggler. Another achieves sparse models with a simple C++ implementation of Follow The Regularized Leader (FTRL), a production-worthy online learning algorithm that has proven effective for predicting CTR.
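For illustration, here is a sketch of the per-coordinate FTRL-Proximal update (after McMahan et al.); the L1 term drives many weights exactly to zero, which is where the sparsity comes from. The hyperparameters and toy data are assumptions for the demo, not tuned values.

```python
import numpy as np

class FTRLProximal:
    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = np.zeros(dim)   # accumulated adjusted gradients
        self.n = np.zeros(dim)   # accumulated squared gradients

    def weights(self):
        # Closed-form per-coordinate solution; coordinates with |z| <= l1 stay 0
        w = np.zeros_like(self.z)
        mask = np.abs(self.z) > self.l1
        w[mask] = -(self.z[mask] - np.sign(self.z[mask]) * self.l1) / (
            (self.beta + np.sqrt(self.n[mask])) / self.alpha + self.l2)
        return w

    def update(self, x, y):
        """One logistic-loss step on input x with label y in {0, 1}."""
        w = self.weights()
        p = 1.0 / (1.0 + np.exp(-x @ w))
        g = (p - y) * x                        # logistic-loss gradient
        sigma = (np.sqrt(self.n + g * g) - np.sqrt(self.n)) / self.alpha
        self.z += g - sigma * w
        self.n += g * g
        return p

# Toy stream: only the first 3 of 10 coordinates carry signal
rng = np.random.default_rng(0)
true_w = np.zeros(10); true_w[:3] = 2.0
model = FTRLProximal(dim=10)
for _ in range(2000):
    x = rng.normal(size=10)
    y = float(rng.random() < 1.0 / (1.0 + np.exp(-x @ true_w)))
    model.update(x, y)
print("nonzero weights:", np.count_nonzero(model.weights()))
```

Because the update touches only active coordinates and stores two scalars per weight, the same logic scales to the hashed, sparse feature spaces typical of CTR models.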
The Netflix Prize predates platforms like TensorFlow that facilitate deep learning; instead, Restricted Boltzmann Machines (RBMs) proved valuable in those ensembles. Today, the Keras API makes it a breeze to implement and iterate on neural-network-based models, though most competitors found these techniques of limited value on the anonymized, tabular insurance claims data (some found larger batch sizes helpful). I was excited to encounter Entity Embeddings, an architecture that concatenates an embedding layer for each tabular data field. It leverages the power of learning a distributed representation to encode input data with some "field awareness". This can be a more effective representation for categorical data, displaying many of the fantastic properties of word embeddings like Word2Vec.
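The core idea can be sketched without a deep learning framework: one small embedding table per categorical field, a lookup per field, and concatenation into a single dense vector. The tables below are random stand-ins for weights that Keras would learn end to end; the cardinalities and embedding sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
cardinalities = [5, 12, 3]     # distinct levels per categorical field
embed_dims = [2, 4, 2]         # a small embedding size chosen per field

# One table per field: (cardinality x embedding_dim); learned in practice
tables = [rng.normal(size=(c, d)) for c, d in zip(cardinalities, embed_dims)]

def embed(row):
    """row: one integer category index per field -> concatenated dense vector."""
    return np.concatenate([table[idx] for table, idx in zip(tables, row)])

v = embed([3, 7, 1])
print(v.shape)   # total width = sum of the per-field embedding dims
```

In a Keras model, each table would be an Embedding layer whose output feeds a shared stack of dense layers, so related categories end up close together in the learned space.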
Another of my early independent projects was to explore all applicable techniques from The Elements of Statistical Learning in approaching the Heritage Health Prize. I gained a great deal from that experience, making use of regression techniques to impute age, a feature with great discriminative power. I also engineered features by mapping categorical variables to the empirical probabilities of their associated target outcomes. Of course, data sparsity can lead to a damaging degree of overfitting when we attempt to deduce patterns from few samples. In these cases, it is reasonable to smooth the estimated probability by regression to the mean. I used the natural hierarchical structure in claims interactions to regress these estimates to subpopulation means, with a shrinkage term inversely proportional to the sample size. Because this feature engineering technique uses information from the target, one must perform the encoding without leakage from a validation set. I learned that this technique is known as likelihood or target encoding, and that there are additional modeling choices one might make in smoothing to control overfitting.
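A minimal sketch of that smoothing, assuming a toy claims table and a hand-picked shrinkage strength m: each category's empirical target mean is pulled toward the global prior, more strongly when the category has few samples. In a real pipeline the statistics must be computed on training folds only to avoid leakage.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "claim": [1, 0, 1, 0, 0, 1],
})

prior = df["claim"].mean()      # global target rate
m = 5.0                          # smoothing strength (a hyperparameter)

# Per-category mean and count, then shrink toward the prior:
# rare categories (small count) land near the prior, common ones near their own mean
stats = df.groupby("city")["claim"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
df["city_te"] = df["city"].map(smoothed)
print(df)
```

Note how city C, seen only once with a positive outcome, encodes near the prior rather than at 1.0; that shrinkage is what keeps the encoding from memorizing rare categories.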
Often, interesting outcomes are so rare that datasets suitable for machine learning display strong class imbalance. This holds for the Heritage Health Prize, the epilepsy detection dataset, the Porto Seguro challenge, the CTR contests, and many others. Though Netflix was not a binary classification task, data sparsity presented many of the same challenges. When I approached the Heritage Health Prize, I did not explore upsampling, though many have found it effective in the current Porto Seguro contest. Alternatively, libraries like XGBoost include a parameter (scale_pos_weight) to reweight the positive class and facilitate learning under highly imbalanced data. Still others survey newer modules and explore sampling techniques like SMOTE.
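As a sketch of the simplest variant, naive upsampling with replacement can be done with scikit-learn's resample; SMOTE (from the separate imbalanced-learn package) would instead synthesize new minority points. The data here is synthetic.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.05).astype(int)    # roughly 5% positive class

# Resample the minority class with replacement to match the majority count
X_pos, X_neg = X[y == 1], X[y == 0]
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)

X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.concatenate([np.zeros(len(X_neg)), np.ones(len(X_pos_up))])
print("positive rate before:", y.mean(), "after:", y_bal.mean())
```

Upsampling should happen inside the cross-validation loop, on training folds only; balancing before the split leaks duplicated minority rows into validation.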
Early on, I had adopted a mentor's skepticism of Kaggle for a competition bias toward overfit, complex solutions: challenges typically featured small, tabular datasets and de-emphasized domain expertise in feature engineering. What I see in Kaggle now is a valuable community rich with institutional wisdom, and I am inspired by the deep analyses of my fellow competitors. Further, Kaggle has managed to facilitate access to terabyte-scale datasets and research-level challenges. Whether in vision, speech, natural language processing, or time series, you can compete for rank or even a million dollars. It's a great way to kick off a sabbatical or summer, or to earn independent study credit.