SmellsLikeML

Kracking Kaggle

12/5/17

Fast Baselines

In two recent projects, over- and under-fitting cost my models performance. Specifically, after submitting results for the LSTM-based comma speed challenge model, I found the model quite overfit. Conversely, in the Porto Seguro competition hosted by Kaggle, a conservative approach to model selection and ensembling cost me slots on the leaderboard. To remedy this, I am taking up a pet project of blazing through dozens of datasets over the next couple of weeks and sharing my takeaways. I hope to gain exposure to a variety of data types, reinforce my intuition for modeling choices, and hammer away at the fundamentals while quickly establishing sensible baselines.

During these recent challenges, I spent a lot of time exploring new APIs in reaching for greater model diversity. This ultimately limited the time available for other lines of inquiry, such as how I might partition a good validation set. Through an "ML modeling bootcamp", I hope to address these workflow limitations. It is always better to reach the same performance level faster. Likewise, ML is an experimental science, and it generally pays to iterate quickly on experiments.

Key takeaways from my recent projects include boilerplate code for deep learning in computer vision, hyperparameter tuning, implementations of specialized algorithms like FFM, and ensemble generation. I will use this as the jumping-off point and extend my collection of modules so that I can work through the fundamentals more efficiently, leaving more time to explore second- or third-order improvements.

More concretely, I have selected a number of interesting Kaggle datasets to explore. For each, my objective is to cover reading the data, producing a Jupyter notebook for exploratory data analysis, generating hypotheses, and implementing baseline models. Then I will survey the results from these past contests and implement key aspects of the winning models, exploring how and why each was the dominant approach at the time. Finally, I will produce a post on what I gained from repeating this exercise over a variety of data types and challenge difficulties.

To start, I perused past Kaggle datasets and compiled a list of challenges. Certainly, I do not want to spend more time than necessary getting the data, but I will at least need to accept the contest rules on the Kaggle site. To help here, I turn to kaggle-cli for a command-line interface to download the datasets. With this utility, I can accept the rules for each contest, construct a list of contests I want data for, and programmatically build the directory structures and download the corresponding datasets.

For example, you might construct a list of competitions called "twenty_in_twenty.txt" containing lines like:


web-traffic-time-series-forecasting
nips-2017-non-targeted-adversarial-attack
noaa-fisheries-steller-sea-lion-population-count
...

Then I can build a simple script to download data for contests I have already accepted rules for as follows:

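#!/usr/bin/env python
"""Download Kaggle competition data for every contest listed in a text file."""
import os
import subprocess
import sys

# Kaggle credentials are read from the environment rather than hard-coded.
kg_usr = os.environ['kg_usr']
kg_pwd = os.environ['kg_pwd']

# One competition name per line; blank lines are skipped.
contest_lst = sys.argv[1]
with open(contest_lst, 'r') as contest_file:
    contests = [line.strip() for line in contest_file if line.strip()]

for con in contests:
    # Skip contests that already have a directory from an earlier run.
    if not os.path.exists(con):
        input_dir = os.path.join(con, 'input')
        os.makedirs(input_dir)
        # Run kaggle-cli inside <contest>/input so the files land there.
        # Passing cwd= replaces the chdir in/out dance, which climbed one
        # level too far and broke every contest after the first.
        subprocess.call(
            ['kg', 'download', '-u', kg_usr, '-p', kg_pwd, '-c', con],
            cwd=input_dir)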

Depending on the datasets, this will take some time. The downloads may also occupy tens of GBs even in compressed form, so you may prefer to include a test for sufficient disk space to avoid filling up the drive. Get the data download script here.
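As a rough sketch of that disk space test, assuming Python 3 and a helper name (has_free_space) that I made up for illustration, something like the following could run before the download loop. The threshold you pass in is your own guess at how much room the datasets need.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least
    `required_gb` gigabytes free (hypothetical helper, not part of kaggle-cli)."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return free_gb >= required_gb
```

In the download script, one might call has_free_space('.', 50) before each contest and stop early when it returns False; the 50 GB figure is only a placeholder.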

The next step is to prepare for the challenge of exploring/implementing high performance models with an emphasis on rapid development of quality, reusable code.

Getting Groceries

I begin with a comparison of two companies, Instacart and Corporacion Favorita, both in the grocery retail space but each with a very different objective. Here we see a measurement bias reflecting two very different business models.

Corporacion Favorita is a national brick-and-mortar grocery retail chain in Ecuador. The contest objective rewards accurate models of total unit sales for different stores over a period of 2 weeks. Good models here can be used to understand which factors drive sales and to reduce spoilage, since the contest metric, NWRMSLE (normalized weighted root mean squared logarithmic error), penalizes errors on perishables more heavily. Contest hosts provide the additional context of national/regional holidays, important events, and oil prices. This higher-level operations vantage frames the view that aggregate sales are largely driven by social events and trends.
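To make the weighted log-error metric concrete, here is a minimal sketch of how I understand it: a weighted mean of squared differences in log(1 + sales), followed by a square root. The function name and the toy weights are my own; I am assuming the contest's weighting of 1.25 for perishable items and 1.0 for everything else.

```python
import math


def nwrmsle(y_true, y_pred, weights):
    """Weighted RMS error on log(1 + y), normalized by the total weight."""
    weighted_sq_err = sum(
        w * (math.log1p(p) - math.log1p(t)) ** 2
        for t, p, w in zip(y_true, y_pred, weights))
    return math.sqrt(weighted_sq_err / sum(weights))


# Toy example: the first item is perishable (assumed weight 1.25).
weights = [1.25, 1.0]
miss_on_perishable = nwrmsle([0, 0], [1, 0], weights)
miss_on_other = nwrmsle([0, 0], [0, 1], weights)
```

The same size of miss costs more on the perishable item, which is the property that ties the metric to spoilage.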

On the other hand, Instacart provides a service. Their business model is to make the drudgery of inescapable tasks like grocery shopping as pain free as possible with a smart app to coordinate your delivery. As an aggregator of products from different local brick-and-mortar shops, Instacart focuses on a competitive edge in the personalized shopping experience. This interpretation bears out in the information provided in that contest with (meta)data down to the shopping basket contents. Instead of predicting store-level statistics like unit sales, Instacart views success through mean F1 score on predicted items. Ostensibly, the plan is to use this for recommendation to streamline the shopping experience that much further. Rather than emphasizing operational efficiency, Instacart appears to value continued product refinements.
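For contrast, a mean F1 score over predicted baskets can be sketched as below. The function names are hypothetical, and treating two empty baskets as a perfect match is my assumption rather than something taken from the contest rules.

```python
def basket_f1(actual, predicted):
    """F1 score between the items actually ordered and the items predicted."""
    actual, predicted = set(actual), set(predicted)
    if not actual and not predicted:
        return 1.0  # assumption: empty vs. empty counts as a perfect match
    true_pos = len(actual & predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(actual_baskets, predicted_baskets):
    """Average the per-basket F1 over all orders."""
    scores = [basket_f1(a, p) for a, p in zip(actual_baskets, predicted_baskets)]
    return sum(scores) / len(scores)
```

Because the score is averaged per order, a model is rewarded for getting each customer's basket right rather than for matching aggregate unit sales.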

I chose two simple baseline modeling approaches, each congruent with its respective problem perspective.

Language Models

Text normalization is a subtle problem in NLP that does not appear to benefit from RNNs and ever larger training corpora. Imprecision in embeddings and decoding strategies results in performance degradation unless post-processing based on domain knowledge is applied. Here, I turn to dictionary-based methods and simple data mining techniques to identify rule sets. As a handicap, I chose the Russian language contest rather than the English dataset.
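A dictionary-based baseline of this kind can be sketched in a few lines: remember the most frequent normalized form seen for each raw token in training, and leave unseen tokens untouched. The function names and the toy English pairs are illustrative assumptions, not the contest data.

```python
from collections import Counter, defaultdict


def fit_lookup(pairs):
    """Map each raw token to its most frequent normalized form in training."""
    counts = defaultdict(Counter)
    for before, after in pairs:
        counts[before][after] += 1
    return {before: c.most_common(1)[0][0] for before, c in counts.items()}


def normalize(tokens, lookup):
    """Replace known tokens via the lookup; pass unknown tokens through."""
    return [lookup.get(tok, tok) for tok in tokens]


# Toy training pairs of (raw token, normalized form).
train_pairs = [('3', 'three'), ('3', 'third'), ('3', 'three'),
               ('km', 'kilometers')]
lookup = fit_lookup(train_pairs)
```

Tokens whose normalized form depends on context, like '3' here, are where mined rules would have to take over from the plain lookup.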

For comparison, I consider the Quora Question Pairs challenge. This challenge requires contestants to identify question pairs with similar intent to help Quora deduplicate posts on their Q&A platform. Since this task should benefit from identifying synonymous concepts, text embedding techniques are a natural model candidate.
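One simple way to use embeddings for this, sketched here under the assumption that word vectors are already available as a plain dict, is to average the vectors of each question and compare the results with cosine similarity. The two-dimensional toy vectors below are invented for illustration only.

```python
import math


def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    total = [0.0] * dim
    found = 0
    for tok in tokens:
        if tok in embeddings:
            total = [t + v for t, v in zip(total, embeddings[tok])]
            found += 1
    return [t / found for t in total] if found else total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Invented toy embeddings: 'buy' and 'purchase' point the same way.
toy = {'buy': [1.0, 0.0], 'purchase': [0.9, 0.1],
       'car': [0.0, 1.0], 'weather': [-1.0, 0.0]}
q1 = sentence_vector(['buy', 'car'], toy, 2)
q2 = sentence_vector(['purchase', 'car'], toy, 2)
q3 = sentence_vector(['weather'], toy, 2)
```

A threshold on cosine(q1, q2), or the similarity fed as a feature into a classifier, would then give a duplicate prediction.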

AI & Security

Adversarial attacks offer a mechanism for manipulating the intended behavior of machine learning models. While many examples focus on perturbing an image to achieve targeted or non-targeted misclassification, the attack methods apply to machine learning algorithms in general.

Because of the simplicity of the class separation boundaries learned by many ML algorithms, a simple iterative approach that perturbs the input until the instance is misclassified works quite well.
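To illustrate the iterative idea on the simplest possible case, the sketch below attacks a linear binary classifier using NumPy. For a linear score w·x + b the gradient with respect to the input is just w, so each step nudges every feature by a small amount against the current class. The function names, step size, and toy weights are all assumptions for illustration; attacks on image models follow the same loop but take the gradient from the network.

```python
import numpy as np


def predict(w, b, x):
    """Binary prediction of a linear classifier: 1 if w.x + b > 0, else 0."""
    return int(np.dot(w, x) + b > 0)


def iterative_attack(w, b, x, step=0.05, max_iter=1000):
    """Take small signed steps on the input until the predicted class flips.

    Returns the perturbed input and the number of steps taken.
    """
    x_adv = np.array(x, dtype=float)
    original = predict(w, b, x_adv)
    # Move against the score gradient for class 1, along it for class 0.
    direction = -np.sign(w) if original == 1 else np.sign(w)
    for i in range(max_iter):
        if predict(w, b, x_adv) != original:
            return x_adv, i
        x_adv = x_adv + step * direction
    return x_adv, max_iter


w = np.array([1.0, -2.0])
b = 0.5
x = np.array([1.0, 0.2])
x_adv, steps = iterative_attack(w, b, x)
```

Each feature moves by at most step times the number of iterations, so the total perturbation stays small when the decision boundary is close.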

Here, we consider adversarial attacks as an introduction to AI safety and make use of the Kaggle competition datasets to explore these methods.

Vision

Understanding the Amazon from Space offers another remote sensing challenge. Here, we seek to optimize the mean F2 score in this multi-label challenge. The training data includes tif and jpg images of ground cover in the Amazon river basin. With satellite imagery, we can tackle the problem of deforestation with scalable technologies. For this global problem, I apply global techniques, assuming that color distribution conveys semantic similarity.
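A global color feature of this sort can be sketched with NumPy as a per-channel histogram, so that images with similar color distributions end up close together. I am assuming 8-bit RGB arrays like the jpg tiles; the function names, bin count, and synthetic images are placeholders of my own.

```python
import numpy as np


def color_histogram(image, bins=8):
    """Concatenate normalized per-channel histograms of an H x W x C uint8 image."""
    feats = []
    for channel in range(image.shape[2]):
        hist, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        feats.append(hist / hist.sum())
    return np.concatenate(feats)


def histogram_distance(a, b):
    """L1 distance between two histogram feature vectors."""
    return float(np.abs(a - b).sum())


# Synthetic tiles: two mostly green, one mostly red.
green_a = np.zeros((4, 4, 3), dtype=np.uint8)
green_a[:, :, 1] = 200
green_b = np.zeros((4, 4, 3), dtype=np.uint8)
green_b[:, :, 1] = 210
red = np.zeros((4, 4, 3), dtype=np.uint8)
red[:, :, 0] = 200
```

These feature vectors could feed a nearest-neighbor or linear baseline, with one binary output per label.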

Going from macro scale to micro scale, we have the Ultrasound Nerve Segmentation dataset. This challenge aims to make pain management more effective by leveraging computer vision to minimize trauma to nerve tissue. This image segmentation task challenges contestants to identify the pixels in ultrasound imagery that correspond to nervous tissue to avoid in clinical procedures. This work should reduce the necessity of pain blockers. For this challenge, I borrowed a simple idea from an old facial keypoints detection tutorial: average over all training frames.
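The averaging baseline can be sketched as follows with NumPy: average the binary training masks pixel by pixel and predict that same thresholded average for every test frame. The 0.5 threshold, the function names, and the tiny synthetic masks are my assumptions; I include a Dice overlap score as one common way to compare a predicted mask against a target.

```python
import numpy as np


def mean_mask(train_masks, threshold=0.5):
    """Pixel-wise average of binary masks, thresholded back to a binary mask."""
    stacked = np.stack(train_masks).astype(float)
    return (stacked.mean(axis=0) >= threshold).astype(np.uint8)


def dice(pred, target):
    """Dice overlap between two binary masks; 1.0 when both are empty."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, target).sum() / denom


# Three tiny synthetic masks that mostly agree on a 2 x 2 region.
m1 = np.zeros((4, 4), dtype=np.uint8)
m1[1:3, 1:3] = 1
m2 = np.zeros((4, 4), dtype=np.uint8)
m2[1:3, 1:4] = 1
m3 = np.zeros((4, 4), dtype=np.uint8)
m3[0:3, 1:3] = 1
baseline = mean_mask([m1, m2, m3])
```

The baseline ignores the test image entirely, which is exactly what makes it a useful floor for any learned segmentation model.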