In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Imputing Missing Age in the Heritage Health Prize Data

Anonymized Member Info

In [2]:
mem = pd.read_csv('Members.csv')
In [3]:
mem.head()
Out[3]:
MemberID AgeAtFirstClaim Sex
0 14723353 70-79 M
1 75706636 70-79 M
2 17320609 70-79 M
3 69690888 40-49 M
4 33004608 0-9 M

Missing Ages

In [4]:
print('Roughly {:.2f}% of members are missing age'.format(100 * mem.AgeAtFirstClaim.isnull().mean()))
Roughly 5.09% of members are missing age

Age Distribution

In [5]:
print('For an overview of age distribution')
mem.AgeAtFirstClaim.value_counts().sort_index()
For an overview of age distribution
Out[5]:
0-9      10791
10-19    11319
20-29     8505
30-39    12435
40-49    16111
50-59    13329
60-69    12622
70-79    14514
80+       7621
Name: AgeAtFirstClaim, dtype: int64

Imputing age as 40-49 is reasonable on these grounds

In [6]:
print('The most commonly appearing age group is {} year olds'.format(mem.AgeAtFirstClaim.mode().values[0]))
The most commonly appearing age group is 40-49 year olds
In [7]:
print('{} year olds are the median age group'.format(sorted(mem.AgeAtFirstClaim.dropna().values)[mem.AgeAtFirstClaim.count()//2]))
40-49 year olds are the median age group
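For reference, that simple baseline imputation is a one-liner (a sketch; AgeImputedSimple is a hypothetical column name):

# Sketch: fill every missing age with the modal / median group
mem['AgeImputedSimple'] = mem.AgeAtFirstClaim.fillna('40-49')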

Age and Risk of Hospitalization

In [8]:
y2_dih = pd.read_csv('DaysInHospital_Y2.csv')
In [9]:
mem_rsk = pd.merge(mem, y2_dih)
In [10]:
mem_rsk[['AgeAtFirstClaim', 'DaysInHospital']].groupby('AgeAtFirstClaim', as_index=False).mean()
Out[10]:
AgeAtFirstClaim DaysInHospital
0 0-9 0.171083
1 10-19 0.148277
2 20-29 0.427275
3 30-39 0.311166
4 40-49 0.208548
5 50-59 0.296348
6 60-69 0.492074
7 70-79 0.732125
8 80+ 1.029231
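The same grouped means are easier to take in as a bar chart (a quick sketch):

# Sketch: visualize mean DaysInHospital (Y2) by age group
mem_rsk.groupby('AgeAtFirstClaim').DaysInHospital.mean().plot(kind='bar')
plt.ylabel('mean DaysInHospital (Y2)')
plt.show()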

Hospitalization risk increases dramatically with age group, so a bad imputation could really throw a downstream model off. We need more information...

Hypothesis: The context of medical ailments can be used to impute age

Intuitively, children present with pediatric issues while the elderly face geriatric ailments. Let's use the claims data to infer age.

In [11]:
clm = pd.read_csv('Claims.csv')
In [12]:
clm.head()
Out[12]:
MemberID ProviderID Vendor PCP Year Specialty PlaceSvc PayDelay LengthOfStay DSFS PrimaryConditionGroup CharlsonIndex ProcedureGroup SupLOS
0 42286978 8013252.0 172193.0 37796.0 Y1 Surgery Office 28 NaN 8- 9 months NEUMENT 0 MED 0
1 97903248 3316066.0 726296.0 5300.0 Y3 Internal Office 50 NaN 7- 8 months NEUMENT 1-2 EM 0
2 2759427 2997752.0 140343.0 91972.0 Y3 Internal Office 14 NaN 0- 1 month METAB3 0 EM 0
3 73570559 7053364.0 240043.0 70119.0 Y3 Laboratory Independent Lab 24 NaN 5- 6 months METAB3 1-2 SCS 0
4 11837054 7557061.0 496247.0 68968.0 Y2 Surgery Outpatient Hospital 27 NaN 4- 5 months FXDISLC 1-2 EM 0

The PrimaryConditionGroup field lists categorical labels like NEUMENT, METAB3, FXDISLC, ... This is a domain-knowledge-based topical grouping of ICD-9 codes into issues such as fracture/dislocation, pregnancy, or seizures. The Year and DSFS (days since first service, bucketed by month) fields offer some temporal ordering of the claims.
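Note that DSFS is stored as a string, so a lexicographic sort would place any buckets beyond '9-10 months' (if present) ahead of '2- 3 months'. If a strictly chronological order were needed, one could first parse DSFS into an integer month; a quick sketch with a hypothetical dsfs_month column:

# Sketch: pull the leading number out of DSFS strings like '0- 1 month'
clm['dsfs_month'] = clm.DSFS.str.extract(r'^(\d+)', expand=False).astype(float)

Sorting on dsfs_month instead of the raw DSFS string would then be strictly chronological.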

In [13]:
pcg_timeline = clm[['MemberID', 'Year', 'DSFS', 'PrimaryConditionGroup']]
In [14]:
pcg_timeline = pcg_timeline.sort_values(by=['MemberID', 'Year', 'DSFS'])

Now we have Member Timelines on Diagnosis categories

In [15]:
pcg_timeline.head(20)
Out[15]:
MemberID Year DSFS PrimaryConditionGroup
1982438 4 Y2 0- 1 month RESPR4
512262 210 Y1 0- 1 month GIOBSENT
1409240 210 Y1 0- 1 month GIOBSENT
2413656 210 Y1 0- 1 month GYNEC1
1915711 210 Y1 1- 2 months MSC2a3
2493322 210 Y1 1- 2 months MSC2a3
248110 210 Y1 3- 4 months PRGNCY
974258 210 Y1 6- 7 months GYNEC1
1878303 210 Y1 9-10 months GYNEC1
526294 210 Y2 0- 1 month MSC2a3
2422016 210 Y2 0- 1 month PRGNCY
9791 210 Y2 3- 4 months PRGNCY
2560310 210 Y2 3- 4 months PRGNCY
2422275 210 Y2 6- 7 months PRGNCY
2574137 210 Y2 9-10 months PRGNCY
2599749 210 Y3 0- 1 month PRGNCY
507 210 Y3 2- 3 months MSC2a3
753764 210 Y3 3- 4 months PRGNCY
1725763 210 Y3 6- 7 months PRGNCY
372916 3197 Y1 0- 1 month GIBLEED
In [16]:
mem_hist = pcg_timeline.drop(['Year','DSFS'], axis=1).groupby('MemberID', as_index=False).aggregate(lambda x: list(x))

Aggregating into a more useful form...

In [17]:
mem_hist.head(10)
Out[17]:
MemberID PrimaryConditionGroup
0 4 [RESPR4]
1 210 [GIOBSENT, GIOBSENT, GYNEC1, MSC2a3, MSC2a3, P...
2 3197 [GIBLEED, RESPR4, RESPR4, RESPR4, RESPR4, NEUM...
3 3457 [MSC2a3]
4 3713 [RESPR4, RENAL3, RENAL3, ARTHSPIN, UTI, UTI, M...
5 3741 [METAB3, METAB3, NEUMENT, MSC2a3, NEUMENT, MET...
6 3889 [HEMTOL, SEIZURE, MSC2a3, STROKE, MSC2a3, STRO...
7 4048 [MSC2a3, RENAL3, MSC2a3, RENAL3, RENAL3, MSC2a...
8 4187 [ARTHSPIN, ARTHSPIN, METAB3, MSC2a3]
9 5187 [NEUMENT, MSC2a3, INFEC4, INFEC4, RESPR4, RESP...
In [18]:
corp = mem_hist.PrimaryConditionGroup.values.tolist()
corp_mems = mem_hist.MemberID.values
corp = [[x for x in s if isinstance(x, str)] for s in corp]

Having filtered out the missing values, I compose a corpus of per-member condition-group sequences and train Gensim Word2Vec embeddings on it.
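A quick check (a sketch; n_empty is a hypothetical name) of how many members are left with an empty history after dropping missing condition groups:

# Sketch: count members whose every claim had a missing PrimaryConditionGroup
n_empty = sum(len(s) == 0 for s in corp)
print('{} of {} member histories are empty after filtering'.format(n_empty, len(corp)))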

In [19]:
from gensim.models import Word2Vec
In [20]:
embedding = Word2Vec(corp, size=5, window=3, min_count=1, workers=4)

The small vector size and window err toward a conservative model without throwing away too much, reducing the 46 condition-group categories to a lower-dimensional dense representation that captures the semantics of clinical history. Intuitively, chronic obstructive pulmonary disease (COPD) should be associated with pneumonia (PNEUM), respiratory conditions (RESPR4), and congestive heart failure (CHF).
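A quick sanity check (a sketch) that all of the condition-group tokens were embedded at the chosen dimensionality:

# Sketch: vocabulary size and shape of the embedding matrix
print(len(embedding.wv.index2word), embedding.wv.syn0.shape)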

In [21]:
embedding.most_similar('COPD')
Out[21]:
[('PNEUM', 0.9575843214988708),
 ('ROAMI', 0.8248004913330078),
 ('RESPR4', 0.740555465221405),
 ('CHF', 0.7021240592002869),
 ('CATAST', 0.6807920932769775),
 ('MISCL1', 0.6778270602226257),
 ('AMI', 0.6156846284866333),
 ('CANCRA', 0.5881134867668152),
 ('PERVALV', 0.5299927592277527),
 ('HEMTOL', 0.46958625316619873)]

Next, pregnancy (PRGNCY) is associated with perinatal conditions (PERINTL) and sepsis.

In [22]:
embedding.most_similar('PRGNCY')
Out[22]:
[('PERINTL', 0.8507484197616577),
 ('MISCL5', 0.5697264075279236),
 ('GYNEC1', 0.5654723644256592),
 ('UTI', 0.536348283290863),
 ('INFEC4', 0.5339572429656982),
 ('TRAUMA', 0.42851194739341736),
 ('SEPSIS', 0.292888343334198),
 ('PNCRDZ', 0.28065937757492065),
 ('SEIZURE', 0.2768717110157013),
 ('GIOBSENT', 0.2764700651168823)]

Naturally, seizures (SEIZURE) are associated with stroke (STROKE).

In [23]:
embedding.most_similar('SEIZURE')
Out[23]:
[('FLaELEC', 0.932746410369873),
 ('SEPSIS', 0.8229254484176636),
 ('MISCL5', 0.8178076148033142),
 ('STROKE', 0.7924022674560547),
 ('METAB1', 0.7820615768432617),
 ('CATAST', 0.6747638583183289),
 ('NEUMENT', 0.6103649139404297),
 ('PNCRDZ', 0.603363573551178),
 ('RENAL1', 0.5386935472488403),
 ('PERVALV', 0.5242663025856018)]

Let's get an overview of these associations with a t-SNE projection

In [24]:
from sklearn.manifold import TSNE
In [25]:
tsne = TSNE(perplexity=5, early_exaggeration=10, learning_rate=10, method='exact', n_components=2, init='pca', n_iter=10000)
In [26]:
low_dim_embs = tsne.fit_transform(embedding.wv.syn0)
labels = embedding.wv.index2word
In [27]:
def plot_with_labels(low_dim_embs, labels, filename='tsne_pcg.png'):
    # Scatter each 2-D t-SNE point and annotate it with its condition-group label
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points', ha='right', va='bottom')
    plt.savefig(filename)

plot_with_labels(low_dim_embs, labels)
In [28]:
embedding.wv.word_vec('FXDISLC')
Out[28]:
array([-8.46420193,  3.78131771,  3.30045986, -5.19004107,  2.38742447], dtype=float32)
In [29]:
embedding.wv.word_vec('TRAUMA')
Out[29]:
array([-4.34122133, -2.22594047,  1.73989558, -3.07826257, -0.13281628], dtype=float32)
In [30]:
embedding.similarity('FXDISLC', 'TRAUMA')
Out[30]:
0.72518325942614803
In [31]:
embedding.similarity('PERINTL', 'PRGNCY')
Out[31]:
0.85074852830052416
In [32]:
tsne_close_score = embedding.similarity('METAB1', 'STROKE')

As a check that the t-SNE projection did not distort the embedding much: METAB1 sits close to STROKE in the plot, and indeed no other condition group is more similar to STROKE in the original embedding space.

In [33]:
for pcg in embedding.wv.index2word:
    if embedding.similarity(pcg, 'STROKE') > tsne_close_score:
        print(pcg)
STROKE

Mapping each member's claim history to embedding vectors

In [34]:
pcg_seq = [[embedding.wv.word_vec(tm) for tm in sl] for sl in corp]
In [35]:
id2age_dict = dict(mem[['MemberID', 'AgeAtFirstClaim']].values)
age_lbls = list(map(lambda x: id2age_dict[x], corp_mems))
In [36]:
age_lst = sorted([age for age in list(set(age_lbls)) if isinstance(age, str)])
In [37]:
age_dict = {age:idx for idx, age in enumerate(age_lst)}
In [38]:
age_dict
Out[38]:
{'0-9': 0,
 '10-19': 1,
 '20-29': 2,
 '30-39': 3,
 '40-49': 4,
 '50-59': 5,
 '60-69': 6,
 '70-79': 7,
 '80+': 8}
In [39]:
def age_enc(age):
    # One-hot encode an age-group label into an 8-dim vector
    # (index age_dict[age]-1, so '0-9' wraps onto the last slot); NaN passes through
    if isinstance(age, str):
        return np.eye(8)[age_dict[age]-1]
    else:
        return age
In [40]:
age_vals = list(map(lambda x: age_enc(x), age_lbls))

Now we can frame a classification problem

In [41]:
train = list(zip(pcg_seq, age_vals))
In [42]:
train[0]
Out[42]:
([array([-0.58052742, -0.96445072,  3.83547616,  0.34624135, -1.6154387 ], dtype=float32)],
 array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]))
In [43]:
train[6]
Out[43]:
([array([ 2.48150635,  7.14158344,  3.04955673,  0.56842721, -0.17179467], dtype=float32),
  array([-2.18482828, -5.67479515, -2.58596134,  4.41807508, -0.90763897], dtype=float32),
  array([ 0.72910041,  0.59713644, -0.40700144, -0.66461992, -0.26520795], dtype=float32),
  array([-4.74890757, -2.88864422, -6.14804602,  4.19666672,  1.59642756], dtype=float32),
  array([ 0.72910041,  0.59713644, -0.40700144, -0.66461992, -0.26520795], dtype=float32),
  array([-4.74890757, -2.88864422, -6.14804602,  4.19666672,  1.59642756], dtype=float32),
  array([ 0.72910041,  0.59713644, -0.40700144, -0.66461992, -0.26520795], dtype=float32),
  array([ 2.48150635,  7.14158344,  3.04955673,  0.56842721, -0.17179467], dtype=float32),
  array([ 0.72910041,  0.59713644, -0.40700144, -0.66461992, -0.26520795], dtype=float32),
  array([-0.26718751, -1.60156107,  0.19903776,  0.65166146, -0.85772049], dtype=float32),
  array([-2.18482828, -5.67479515, -2.58596134,  4.41807508, -0.90763897], dtype=float32),
  array([-2.18482828, -5.67479515, -2.58596134,  4.41807508, -0.90763897], dtype=float32),
  array([-2.18482828, -5.67479515, -2.58596134,  4.41807508, -0.90763897], dtype=float32)],
 nan)

If it works, this imputer could boost age-based algorithms while introducing less bias than simple imputation schemes. Members whose age label is missing become the test set to impute; everyone else is training data.

In [44]:
test = []
# Members whose age label is NaN (len() raises) go into the test set to impute;
# deleting while iterating skips some entries, hence the extra cleanup below
for idx, itm in enumerate(train):
    try:
        l = len(itm[1])
    except (ValueError, TypeError):
        test.append(itm)
        del train[idx]
In [45]:
len(test), len(train)
Out[45]:
(5491, 107509)
In [46]:
seq_len = [len(itm[0]) for itm in train]
In [47]:
print('There are members with up to {} distinct claims.'.format(max(seq_len)))
There are members with up to 130 distinct claims.
In [48]:
plt.hist(seq_len, bins=30)
Out[48]:
(array([ 24699.,  16801.,  11854.,  10777.,   6267.,   5181.,   5247.,
          3452.,   2966.,   3332.,   2145.,   1672.,   1821.,   1307.,
          1120.,   1241.,    942.,    801.,    889.,    699.,    615.,
           577.,    386.,    421.,    442.,    376.,    343.,    460.,
           342.,    334.]),
 array([   0.        ,    4.33333333,    8.66666667,   13.        ,
          17.33333333,   21.66666667,   26.        ,   30.33333333,
          34.66666667,   39.        ,   43.33333333,   47.66666667,
          52.        ,   56.33333333,   60.66666667,   65.        ,
          69.33333333,   73.66666667,   78.        ,   82.33333333,
          86.66666667,   91.        ,   95.33333333,   99.66666667,
         104.        ,  108.33333333,  112.66666667,  117.        ,
         121.33333333,  125.66666667,  130.        ]),
 <a list of 30 Patch objects>)

Most members have fewer than 40 separate claims
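Since the batching below truncates each history to its first 10 claims (N=10, matching n_steps), it is worth checking (a sketch) what fraction of members fit entirely within that window:

# Sketch: share of members whose full claim history fits in a 10-step window
print('{:.1f}% of members have 10 or fewer claims'.format(100 * np.mean(np.array(seq_len) <= 10)))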

In [49]:
train = [elem for elem in train if isinstance(elem[1], np.ndarray)]
In [50]:
train = [x for x in train if len(x[0]) > 0]
In [51]:
np.random.shuffle(train)
val = train[:500]
train = train[500:]
In [52]:
def batch_gen(batch_size, data=train, N=10, e_dim=5):
    # Yield batches of (zero-padded sequences, labels, true sequence lengths);
    # each history is truncated/padded to N embedding vectors of dimension e_dim
    np.random.shuffle(data)
    for b in range(len(data) // batch_size):
        X, y = list(zip(*data[b*batch_size: (b+1)*batch_size]))
        s = [len(itm) for itm in X]
        out_X = []
        for idx in range(batch_size):
            padded = np.zeros((N, e_dim))
            padded[:min(N, len(X[idx])), :] = X[idx][:N]
            out_X.append(padded)
        yield out_X, list(y), s

With a generator that yields zero-padded, fixed-size batches in hand, I'll set up a TensorFlow graph around a multi-layer LSTM with dropout

In [ ]:
import tensorflow as tf
from sklearn.metrics import confusion_matrix

tf.reset_default_graph()
n_steps = 10
n_inputs = 5
n_outputs = 8
n_layers = 3
n_neurons = 32
n_epochs = 50
batch_size = 32
learning_rate = 1e-5
keep_prob = 0.5
n_steps_per_epoch = len(train) // batch_size

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None, n_outputs])
seq_length = tf.placeholder(tf.int32, [None])

with tf.variable_scope(tf.get_variable_scope(), reuse=False):
    lstm_cells = [tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)
              for layer in range(n_layers)]
cells_drop = [tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob) for cell in lstm_cells]
multi_cell = tf.contrib.rnn.MultiRNNCell(cells_drop)
outputs, states = tf.nn.dynamic_rnn(multi_cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)
top_layer_h_state = states[-1][1]  # hidden state (h) of the top LSTM layer summarizes the sequence

logits = tf.layers.dense(top_layer_h_state, n_outputs, name="softmax")
xentropy = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss")
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1)), tf.float32))

init = tf.global_variables_initializer()

val_gen = batch_gen(200, data=val)
X_bv, y_bv, seq_bv = next(val_gen)


with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        bb = batch_gen(batch_size)
        for step in range(n_steps_per_epoch):
            X_b, y_b, seq_b = next(bb)
            sess.run(training_op, feed_dict={X: X_b, y: y_b, seq_length: seq_b})
            if step % 1000 == 0:
                val_loss, acc, pred = sess.run([loss, accuracy, logits], feed_dict = {X: X_bv, y: y_bv, seq_length: seq_bv})
                print('epoch {} loss {:.03f}  accuracy {:.03f}'.format(epoch, val_loss, acc))
                cm = confusion_matrix(np.argmax(y_bv, 1), np.argmax(pred, 1))
                print(cm)
epoch 0 loss 2.083  accuracy 0.110
[[ 4  2  2  0  5  5  1  7]
 [ 3  1  3  0  3  5  1  2]
 [ 4  0  0  1  4  3  3  4]
 [ 4  0  2  1  5 10  0 10]
 [ 2  2  5  0  2  8  0  6]
 [ 0  0  6  2  7  4  0  4]
 [ 3  0  6  1 10  2  1  8]
 [ 4  1  3  0  4  4  1  9]]
epoch 0 loss 2.078  accuracy 0.110
[[ 6  3  0  0  2  9  0  6]
 [ 1  0  3  0  1  7  2  4]
 [ 4  1  2  0  3  4  1  4]
 [ 3  4  3  0  6  7  1  8]
 [ 5  0  3  0  3  6  2  6]
 [ 2  2  1  2  7  4  1  4]
 [ 1  0 10  4  5  3  2  6]
 [ 6  1  5  2  1  5  1  5]]
epoch 0 loss 2.082  accuracy 0.145
[[ 5  0  1  1  2 10  1  6]
 [ 4  1  1  2  1  5  1  3]
 [ 3  0  0  0  3  6  1  6]
 [ 3  1  3  2  5  9  0  9]
 [ 3  3  1  0  7  6  2  3]
 [ 2  0  5  1 10  3  0  2]
 [ 2  2  7  1  9  5  2  3]
 [ 6  1  3  1  1  4  1  9]]
epoch 0 loss 2.080  accuracy 0.115
[[4 1 2 1 3 8 2 5]
 [8 0 1 0 3 5 0 1]
 [2 1 2 2 2 3 1 6]
 [3 1 7 1 3 8 1 8]
 [4 0 2 0 5 9 0 5]
 [1 0 1 3 7 4 1 6]
 [0 0 4 2 9 7 1 8]
 [8 0 5 1 5 1 0 6]]
epoch 1 loss 2.080  accuracy 0.180
[[ 3  2  2  0  5  4  3  7]
 [ 2  1  3  1  1  5  0  5]
 [ 4  1  2  1  3  2  0  6]
 [ 2  1  5  6  4  6  1  7]
 [ 2  0  4  0  7  6  0  6]
 [ 0  0  4  3  6  4  2  4]
 [ 3  1  8  1  7  4  1  6]
 [ 4  1  2  2  3  2  0 12]]

Training until predictions cluster near the diagonal (accurate to within an age group or two) yields a more finely tuned imputation tool than the blanket 40-49 fill considered earlier.
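Once trained, the same graph can be applied to the held-out members to produce imputed age groups. A minimal sketch, assuming it runs inside the training session above (before the with tf.Session() block closes); idx2age, test_ne, and imputed_age are hypothetical names:

# Sketch: run the missing-age members through the network and map argmax class
# indices back to age-group labels, mirroring the index shift in age_enc
idx2age = {age_dict[age] - 1: age for age in age_lst}
test_ne = [t for t in test if len(t[0]) > 0]  # keep members with at least one usable claim
X_t, _, seq_t = next(batch_gen(len(test_ne), data=test_ne))
pred_t = sess.run(logits, feed_dict={X: X_t, seq_length: seq_t})
imputed_age = [idx2age[i] for i in np.argmax(pred_t, axis=1)]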