2020 Kaggle competition by University of Liverpool

In this competition, contestants were challenged to predict the number of open ion channels based on electrophysiological signals from human cells. This is an important problem because potential solutions can have far-reaching impacts. From human diseases to how climate change affects plants, faster detection of ion channels could greatly accelerate solutions to major world problems.

This notebook is my submission to the competition. I placed in top 10% out of 2,620 teams, earning bronze medal.

Specail thanks to Chris Deotte for his amazing analysis on the drift, and noise in the data, and for making the cleaned dataset.
The main idea comes from the published notebook by the team English Ions and Co. It uses WaveNet adaptation for Keras, originating from Siavash, who successfuly implemented it in his notebook.

Very quickly it was found how to remove the noise, and drift quite successfuly. There were a few approaches using different models, and all were good enough to predict with over 0.93 score. In about 1 month to the deadline most of the contestants were tightly packed on the leaderboard with differences in the score on 0.0001 scale. It looked like Wavenet was used by the majoirity.

A little scandal happened after the competition closing: the top 2 teams won after discovering a leak in the test data, which gave them unxepected and significant boost. So on the one hand it is all right - they didn't break any rules, and were actually quite ingenious, because the leak was actually a mistake on the organizers side, and was difficult to discover. On the other hand these solutions really have no scientific value.

The most interesting result though was the revelation of the 3rd place solution: team 'Gilles & Kha Vo & Zidmie' used Hidden Markov Models, and no leak! So one might consider that they are actually the 1st place winners. Here is their brilliant write-up about their approach.

There are a few ideas that I'd like to try later on, especially more noise cleaning. Will see if I could improve my prediction.

Project code

In [1]:
										
"""
Created on Mon May 1 10:24:45 2020

@author: alex
"""
#imports
import os
import tensorflow as tf
from tensorflow.keras.layers import *
import pandas as pd
import numpy as np
import random
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K
from tensorflow.keras import losses, models, optimizers
import tensorflow_addons as tfa

from sklearn.model_selection import GroupKFold
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
import seaborn as sns
										
									
In [2]:
										
#This callback implements a cyclical learning rate policy (CLR).
#The method cycles the learning rate between two boundaries with some constant frequency, as detailed in this paper (https://arxiv.org/abs/1506.01186).
from clr_callback import *
										
									
In [3]:
										
# initiating main params
EPOCHS = 180
NNBATCHSIZE = 16
SEED = 1970
LR = 0.0025
SPLITS = 6
BS = 4000
train_groups = {
0: [(0, 1000000)],
1: [(1500000, 2000000), (3500000, 4000000)],
2: [(2500000, 3000000), (4000000, 4500000)],
3: [(1000000, 1500000), (3000000, 3500000)],
4: [(2000000, 2500000), (4500000, 5000000)]}

clr_triangular = CyclicLR(base_lr = LR, max_lr = 0.005,  step_size=4000., mode='triangular2')
										
									
In [4]:
										
# read cleaned data by Chris Deotte
train = pd.read_csv('./input/train_clean.csv', dtype={'time': np.float32, 'signal': np.float32, 'open_channels':np.int32})
test  = pd.read_csv('./input/test_clean.csv', dtype={'time': np.float32, 'signal': np.float32})
test_raw  = pd.read_csv('./input/test.csv', dtype={'time': np.float32, 'signal': np.float32})
sub  = pd.read_csv('./input/sample_submission.csv', dtype={'time': np.float32})

# normalize the data (standard scaler)
def normalize(train, test):
    tim = train.signal.mean()
    tis = train.signal.std()
    train['signal'] = (train.signal - tim) / tis
    test['signal'] = (test.signal - tim) / tis
    return train, test

train, test = normalize(train, test)
										
									
In [5]:
										
def plot_time_data(data_df, title="Time variation data", color='b'):
    plt.figure(figsize=(18,8))
    plt.plot(data_df["time"], data_df["open_channels"], color=color)
    plt.title(title, fontsize=24)
    plt.xlabel("Time [sec]", fontsize=20)
    plt.ylabel("Opened", fontsize=20)
    plt.show()

def plot_time_signal(clean=True, d_range = (0,-1), title="Signal for Test set", color='b'):
    plt.figure(figsize=(18,8))
    if clean:
        plt.plot(test[d_range[0]:d_range[1]].time, test[d_range[0]:d_range[1]].signal, color=color)
    else:
        plt.plot(test_raw[d_range[0]:d_range[1]].time-500, test_raw[d_range[0]:d_range[1]].signal, color=color)
    plt.title(title, fontsize=24)
    plt.xlabel("Time [sec]", fontsize=20)
    plt.ylabel("Signal amp", fontsize=20)
    plt.show()
										
									

Description of Data

The training data is recordings in time. At each 10,000th of a second, the strength of the signal was recorded and the number of ion channels open was recorded. It is our task to build a model that predicts the number of open channels from signal at each time step. Furthermore we are told that the data was recorded in batches of 50 seconds. Therefore each 500,000 rows is one batch. The training data contains 10 batches and the test data contains 4 batches. Let's display the number of open channels and signal strength together for each training batch.

In [6]:
										
plt.figure(figsize=(20,5)); res = 1000
plt.plot(range(0,train.shape[0],res),train.open_channels[0::res])
for i in range(11): plt.plot([i*500000,i*500000],[-5,12.5],'r')
for j in range(10): plt.text(j*500000+200000,10,str(j+1),size=20)
plt.xlabel('Row',size=16); plt.ylabel('Channels Open',size=16);
plt.title('Training Data Open Channels - 10 batches',size=20)
plt.show()
										
									
In [7]:
										
plt.figure(figsize=(20,5)); res = 1000
plt.plot(range(0,train.shape[0],res),train.signal[0::res])
for i in range(11): plt.plot([i*500000,i*500000],[-5,12.5],'r')
for j in range(10): plt.text(j*500000+200000,10,str(j+1),size=20)
plt.xlabel('Row',size=16); plt.ylabel('Signal',size=16);
plt.title('Training Data Signal - 10 batches',size=20)
plt.show()
										
									
In [8]:
										
#Original test data signal
plot_time_signal(False, color='green', title="Original signal for Test set")
										
									
In [9]:
										
#Cleaned test data signal
plot_time_signal(True, color='red', title="Signal of the Test set with removed drift")
										
									
In [10]:
										
def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    tf.random.set_seed(seed)

# create batches of 4,000 observations
# create batches of 500,000 observations
def batching(df, batch_size):
    df['group'] = df.groupby(df.index//batch_size, sort=False)['signal'].agg(['ngroup']).values
    df['group'] = df['group'].astype(np.uint16)
    return df

# get lead and lags features
def lag_with_pct_change(df, windows):
    for window in windows:
        df['signal_shift_pos_' + str(window)] = df.groupby('group')['signal'].shift(window).fillna(0)
        df['signal_shift_neg_' + str(window)] = df.groupby('group')['signal'].shift(-1 * window).fillna(0)
    return df

# main module to run feature engineering
def feature_engineering(df, batch_size):
    # create batches
    df = batching(df, batch_size = batch_size)
    # create leads and lags (1, 2, 3 making them 6 features)
    df = lag_with_pct_change(df, [1, 2, 3])
    # create signal ** 2 (this is the new feature)
    df['signal_2'] = df['signal'] ** 2
    # df['signal_3'] = df['signal'] ** 3
    return df

# fillna with the mean and select features for training
def feature_selection(train, test):
    features = [col for col in train.columns if col not in ['index', 'group', 'open_channels', 'time']]
    train = train.replace([np.inf, -np.inf], np.nan)
    test = test.replace([np.inf, -np.inf], np.nan)
    for feature in features:
        feature_mean = pd.concat([train[feature], test[feature]], axis = 0).mean()
        train[feature] = train[feature].fillna(feature_mean)
        test[feature] = test[feature].fillna(feature_mean)
    return train, test, features
										
									
In [11]:
										
# WaveNet with CBR blocks realization
def Classifier(shape_):

def cbr(x, out_layer, kernel, stride, dilation):
    x = Conv1D(out_layer, kernel_size=kernel, dilation_rate=dilation, strides=stride, padding="same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    return x

def wave_block(x, filters, kernel_size, n):
    dilation_rates = [2**i for i in range(n)]
    x = Conv1D(filters = filters,
               kernel_size = 1,
               padding = 'same')(x)
    res_x = x
    for dilation_rate in dilation_rates:
        tanh_out = Conv1D(filters = filters,
                          kernel_size = kernel_size,
                          padding = 'same',
                          activation = 'tanh',
                          dilation_rate = dilation_rate)(x)
        sigm_out = Conv1D(filters = filters,
                          kernel_size = kernel_size,
                          padding = 'same',
                          activation = 'sigmoid',
                          dilation_rate = dilation_rate)(x)
        x = Multiply()([tanh_out, sigm_out])
        x = Conv1D(filters = filters,
                   kernel_size = 1,
                   padding = 'same')(x)
        res_x = Add()([res_x, x])
    return res_x

inp = Input(shape = (shape_))
x = cbr(inp, 64, 7, 1, 1)
x = BatchNormalization()(x)
x = wave_block(x, 16, 3, 12)
x = BatchNormalization()(x)
x = wave_block(x, 32, 3, 8)
x = BatchNormalization()(x)
x = wave_block(x, 64, 3, 4)
x = BatchNormalization()(x)
x = wave_block(x, 128, 3, 1)
x = cbr(x, 32, 7, 1, 1)
x = BatchNormalization()(x)
x = wave_block(x, 64, 3, 1)
x = cbr(x, 32, 7, 1, 1)
x = BatchNormalization()(x)
x = Dropout(0.2)(x)
out = Dense(11, activation = 'softmax', name = 'out')(x)

model = models.Model(inputs = inp, outputs = out)

opt = Adam(lr = LR)
opt = tfa.optimizers.SWA(opt)
model.compile(loss = losses.CategoricalCrossentropy(), optimizer = opt, metrics = ['accuracy'])
return model
										
									
In [12]:
										
train = feature_engineering(train, batch_size = BS)
test = feature_engineering(test, batch_size = BS)
train, test, features = feature_selection(train, test)

seed_everything(SEED)
K.clear_session()
config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,inter_op_parallelism_threads=1)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=config)
tf.compat.v1.keras.backend.set_session(sess)
oof_ = np.zeros((len(train), 11)) # build out of folds matrix with 11 columns, they represent our target variables classes (from 0 to 10)
preds_ = np.zeros((len(test), 11))
target = ['open_channels']
group = train['group']
kf = GroupKFold(n_splits=5)
splits = [x for x in kf.split(train, train[target], group)]

new_splits = []
for sp in splits:
new_split = []
new_split.append(np.unique(group[sp[0]]))
new_split.append(np.unique(group[sp[1]]))
new_split.append(sp[1])
new_splits.append(new_split)
# pivot target columns to transform the net to a multiclass classification estructure (you can also leave it in 1 vector with sparsecategoricalcrossentropy loss function)
tr = pd.concat([pd.get_dummies(train.open_channels), train[['group']]], axis=1)

tr.columns = ['target_'+str(i) for i in range(11)] + ['group']
target_cols = ['target_'+str(i) for i in range(11)]
train_tr = np.array(list(tr.groupby('group').apply(lambda x: x[target_cols].values))).astype(np.float32)
train = np.array(list(train.groupby('group').apply(lambda x: x[features].values)))
test = np.array(list(test.groupby('group').apply(lambda x: x[features].values)))

for n_fold, (tr_idx, val_idx, val_orig_idx) in enumerate(new_splits[0:], start=0):
train_x, train_y = train[tr_idx], train_tr[tr_idx]
valid_x, valid_y = train[val_idx], train_tr[val_idx]
print(f'Our training dataset shape is {train_x.shape}')
print(f'Our validation dataset shape is {valid_x.shape}')

shape_ = (None, train_x.shape[2]) # input is going to be the number of feature we are using (dimension 2 of 0, 1, 2)
model = Classifier(shape_)
# using our lr_schedule function
ES_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
checkpoint = ModelCheckpoint(f'NN_fold_{n_fold}.hdf5', monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
model.fit(train_x,train_y,
          epochs = EPOCHS,
          callbacks = [ES_callback, clr_triangular, checkpoint],
          batch_size = NNBATCHSIZE,
          verbose = 2,
          validation_data = (valid_x,valid_y))
model.load_weights(f'NN_fold_{n_fold}.hdf5')

preds_f = model.predict(valid_x)
f1_score_ = f1_score(np.argmax(valid_y, axis=2).reshape(-1),  np.argmax(preds_f, axis=2).reshape(-1), average = 'macro') # need to get the class with the biggest probability
print(f'Training fold {n_fold + 1} completed. macro f1 score : {f1_score_ :1.5f}')
preds_f = preds_f.reshape(-1, preds_f.shape[-1])
oof_[val_orig_idx,:] += preds_f
te_preds = model.predict(test)
model.save("model-wavenet.h5")
te_preds = te_preds.reshape(-1, te_preds.shape[-1])
preds_ += te_preds / SPLITS
										
									
In [13]:
										
# calculate the oof macro f1_score
f1_score_ = f1_score(np.argmax(train_tr, axis = 2).reshape(-1),  np.argmax(oof_, axis = 1), average = 'macro') # axis 2 for the 3 Dimension array and axis 1 for the 2 Dimension Array (extracting the best class)
print(f'Training completed. oof macro f1 score : {f1_score_:1.5f}')
sample_submission = sub.copy()
sample_submission['open_channels'] = np.argmax(preds_, axis = 1).astype(int)
sample_submission.to_csv('submission_wavenet1970-CLR-ES-1.csv', index=False, float_format='%.4f')
										
									
Training completed. oof macro f1 score : 0.93313
In [14]:
										
# Open channels predictions
plot_time_data(sample_submission, color='gold', title="Predictions")