Error analysis

Summary of findings

  • About 20 percent of the rows account for 75 percent of the total error.
  • Are the worst scores driven by bad stocks or by bad time_ids? Comparing errors aggregated by stock_id with errors aggregated by time_id, the time_id groups show much higher variance than the stock_id groups. This could be due to fast-changing shocks to the market that hit all stocks at once. We might be able to detect such shocks if they occur in the last minute or two of the input data window.
  • The worst scoring time_id is 25504 and the worst scoring stock is stock_31.
  • The worst scoring rows are usually due to an extremely low target, since the target is used as a divisor in the competition metric, RMSPE (see the sketch after this list). From the plots I have made, I cannot see any clear signal that would warn me of an extremely low target.
  • Action to take: look for aggregations across time_ids that could help our models detect periods of low targets.
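
For reference, a minimal sketch of the competition metric as I understand it (RMSPE, root mean squared percentage error); the division by the target is what makes very small targets so punishing:

def rmspe(y_true, y_pred):
    """Root mean squared percentage error; tiny y_true values inflate the ratio."""
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))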

Let's read in our predictions and calculate the squared percentage error (spe) for analysis.

train = pd.read_pickle(TRAIN)
oof_preds = np.load(os.path.join(MODEL, 'oof_predictions.npy'))
train['pred'] = oof_preds
# Per-row squared percentage error: the contribution to RMSPE before the mean and square root.
train['spe'] = ((train['target'] - train['pred']) / train['target']) ** 2
train['raw_error'] = train['pred'] - train['target']
train['rv'] = train['log_return_realized_volatility']  # short alias for convenience
print(rdf(train))  # rdf: project helper that (presumably) reports the overall OOF RMSPE
train.head(3)
0.21259060811686517
row_id stock_id time_id target fold log_return_max_sub_min log_return_std log_return_realized_volatility log_return_mean_decay log_return_mean_decay_flip ... stock_id_order_size_mean_mean stock_id_order_size_mean_std stock_id_spread_mean_decay_95_mean stock_id_spread_mean_decay_95_std stock_id_spread_pct_momentum_mean stock_id_spread_pct_momentum_std pred spe raw_error rv
0 0-5 0 5 0.004136 1.0 0.001945 0.000184 0.004499 0.000002 1.640986e-05 ... 30.978695 9.120817 0.000964 0.000868 0.973997 0.251807 0.003911 0.002965 -0.000225 0.004499
1 0-11 0 11 0.001445 0.0 0.000872 0.000049 0.001204 -0.000002 1.816128e-08 ... 30.978695 9.120817 0.000964 0.000868 0.973997 0.251807 0.001534 0.003801 0.000089 0.001204
2 0-16 0 16 0.002168 4.0 0.001583 0.000097 0.002369 -0.000006 4.263426e-06 ... 30.978695 9.120817 0.000964 0.000868 0.973997 0.251807 0.002236 0.000988 0.000068 0.002369

3 rows × 372 columns

top_err = train.sort_values('spe', ascending=False)
top_err[:100].raw_error.hist()
plt.title('top error histogram')
plt.show()

top_err[:100]['target'].mean()
0.0008222269300000001
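
For context (a quick check of my own, not in the original run), compare this to the mean target over all rows to quantify how unusually small the targets of the worst-error rows are:

# How small is the mean target of the 100 worst rows relative to the overall mean target?
print(top_err[:100]['target'].mean() / train['target'].mean())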

Feature correlation with the target

Let's see how well all of the columns correlate with the target. All columns except 'target', 'pred', and the error columns we just added are features that can be used for training.

train.corrwith(train['target']).abs().sort_values(ascending=False)[:20]
target                            1.000000
pred                              0.922224
real_vol_mean_decay_0.85_-1       0.885936
real_vol_mean_decay_0.75_-1       0.885581
real_vol_mean_decay_0.9_-1        0.884346
real_vol_mean_decay_0.95_-1       0.881675
real_vol_mean_decay_0.65_-1       0.880945
real_vol_mean_decay_0.99_-1       0.878880
real_vol_mean_decay_0.99_1        0.877295
real_vol_mean_decay_0.85_-1_2     0.876040
real_vol_mean_decay_0.75_-1_2     0.875160
real_vol_mean_decay_0.9_-1_2      0.874569
log_return_realized_volatility    0.873777
rv                                0.873777
log_return_std                    0.873740
real_vol_mean_decay_0.95_1        0.873627
real_vol_mean_decay_0.55_-1       0.873266
real_vol_mean_decay_0.95_-1_2     0.871917
real_vol_mean_decay_0.65_-1_2     0.869714
real_vol_mean_decay_0.99_-1_2     0.869073
dtype: float64

Error distribution

What is the general distribution of errors? Are they low or high variance?

If they have high variance, with a few very large errors, we may be able to find an easy way to reduce those errors, which would have a large impact on our overall score.

sns.kdeplot(train['spe'])
plt.title('squared percentage error kernel density plot')
plt.show()

What the heck. It looks like there are one or a few very large errors, with most near zero.

train['spe'].hist()
plt.title('squared percentage error histogram')
plt.show()

Let's see the highest and lowest scoring rows. They will show that a small number of rows are greatly increasing the error.

train['spe'].sort_values(ascending=False)
110278    2.533531e+02
108254    5.130760e+01
110479    4.594868e+01
275261    4.387754e+01
107425    1.892384e+01
              ...     
15927     1.529692e-12
335831    1.403667e-12
346117    1.188800e-12
277792    4.652826e-14
259398    8.194316e-16
Name: spe, Length: 428932, dtype: float64

Let's look at the cumulative sum of errors to see how spread out they are.

df = train['spe'].sort_values().cumsum().reset_index(drop=True)
df.index /= df.index.values[-1]
df /= df.max()
df.plot()
plt.suptitle('Cumulative sum of sorted errors')
plt.xlabel('fraction of rows')
plt.ylabel('fraction of error')
plt.show()

It looks like the last 20% of the rows carry about 75% of the error.
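
A quick sketch (my own check, not in the original run) to verify that figure:

# What fraction of the total spe do the worst 20% of rows account for?
spe_sorted = train['spe'].sort_values(ascending=False)
n_top = int(0.2 * len(spe_sorted))
print(spe_sorted.iloc[:n_top].sum() / spe_sorted.sum())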

Comparing stock_id and time_id group error distribution

Let's see if the error is balanced across stock_ids and time_ids.

Stock_id

print('Errors by stock_id sorted in descending order')
stock = train.groupby('stock_id')['spe'].sum()
stock.sort_values(ascending=False)
Errors by stock_id sorted in descending order
stock_id
31     1092.696514
37      329.793868
18      311.135581
33      296.696029
88      287.739433
          ...     
84      104.070718
96      102.221548
14      100.687571
124      98.935379
56       87.401965
Name: spe, Length: 112, dtype: float64

I was expecting the minimum to be much, much lower than the maximum. The worst offender, stock_id 31, is only about 3 times worse than the second at roughly 330, and the totals gradually fall to around 100 for the lowest stock. So stock 31 needs a little special attention, but we still need to look at the errors of all stocks. I am guessing that the time_id aggregations won't be as balanced.

I am curious to see how the correlation between the target and log_return_realized_volatility (the most standard predictor) compares between the best and worst scoring stock.

train[train.stock_id == 31][['log_return_realized_volatility', 'target']].corr()
log_return_realized_volatility target
log_return_realized_volatility 1.000000 0.820801
target 0.820801 1.000000
train[train.stock_id == 56][['log_return_realized_volatility', 'target']].corr()
log_return_realized_volatility target
log_return_realized_volatility 1.000000 0.901207
target 0.901207 1.000000

It turns out that the worst performing stock has a lower correlation.

Time_id

print('Errors by time_id sorted in descending order')
time = train.groupby('time_id')['spe'].sum()
worst_times = time.sort_values(ascending=False)
worst_times
Errors by time_id sorted in descending order
time_id
25504    258.750772
24034     69.625326
8534      58.088269
20439     53.063871
27174     49.766181
            ...    
11499      1.334917
17337      1.296191
22893      1.153656
32082      1.127673
4432       1.029422
Name: spe, Length: 3830, dtype: float64

I was correct that the time_id group scores are more spread out than the stock_id scores. The worst time_id is about 250 times as bad as the best scoring time_id.
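
A quick side-by-side of both spreads, using the stock and time aggregations computed above (my own sketch):

# Compare the max/min spread of group errors for stock_ids vs time_ids.
print('stock_id spread:', stock.max() / stock.min())  # roughly 1093 / 87   -> ~12.5x
print('time_id spread: ', time.max() / time.min())    # roughly 259 / 1.03  -> ~250x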

I am curious to see how the correlation between the target and log_return_realized_volatility (the most standard predictor) compares between the best and worst scoring time_id.

train[train.time_id == 25504][['log_return_realized_volatility', 'target']].corr()
log_return_realized_volatility target
log_return_realized_volatility 1.000000 0.828771
target 0.828771 1.000000
train[train.time_id == 4432][['log_return_realized_volatility', 'target']].corr()
log_return_realized_volatility target
log_return_realized_volatility 1.000000 0.896323
target 0.896323 1.000000

Similar to stock_id, the best time_id group has a higher correlation.

A further look at the worst time_ids: plotting realized volatility, target, and prediction for all stocks

top_5_worst_times = worst_times[:5].index

There is a great imbalance here, with the top time_id accruing about 250 times the error of the bottom. If we can identify the reason for the errors in the hardest-hit time_ids, we have the best chance to improve our models.

I suspect that either the first 10-minute window is low volatility, followed by a second 10-minute window with high volatility, or vice versa. For the worst time_id, 25504, let's see how the realized volatility of the first 10 minutes compares to its overall average, and how the target compares to the average target.

Let's normalize the target and the realized volatility of the first 10 minutes. This puts the variables on the same scale that is implicitly used when calculating the correlation between them.

tar_mean = train['target'].mean()
rv_mean = train.log_return_realized_volatility.mean()
tar_std = train['target'].std()
rv_std = train.log_return_realized_volatility.std()
train['norm_real_vol'] = (train.log_return_realized_volatility - rv_mean) / rv_std
train['norm_target'] = (train.target - tar_mean) / tar_std

Let's look at the three worst scoring time_ids, and then the best scoring time_ids.

fig, axes = plt.subplots(3, 1, figsize=(15, 20))
train[train.time_id == 25504][['norm_real_vol', 'norm_target']].plot(ax=axes[0])
axes[0].set_title('Worst scoring time_id, 25504: normalized target and realized volatility')

train[train.time_id == 27174][['norm_real_vol', 'norm_target']].plot(ax=axes[1])
axes[1].set_title('2nd worst scoring time_id, 27174: normalized target and realized volatility')

train[train.time_id == 24034][['norm_real_vol', 'norm_target']].plot(ax=axes[2])
axes[2].set_title('3rd worst scoring time_id, 24034: normalized target and realized volatility')
plt.show()

fig, axes = plt.subplots(3, 1, figsize=(15, 20))
train[train.time_id == 29850][['norm_real_vol', 'norm_target']].plot(ax=axes[0])
axes[0].set_title('Best scoring time_id, 29850: normalized target and realized volatility')

train[train.time_id == 4432][['norm_real_vol', 'norm_target']].plot(ax=axes[1])
axes[1].set_title('2nd best scoring time_id, 4432: normalized target and realized volatility')

train[train.time_id == 21116][['norm_real_vol', 'norm_target']].plot(ax=axes[2])
axes[2].set_title('3rd best scoring time_id, 21116: normalized target and realized volatility')

plt.show()

I am a bit confused. The 3rd worst scoring time_id looks really bad, but the first two look almost like best scoring time_ids. Maybe there is something I can't see yet that is hurting my model. Let's plot the same plots again, along with our predictions. We will have to scale the predictions with the same mean and std as the target to put them on the same scale.

train['norm_pred'] = (train.pred - tar_mean) / tar_std

fig, axes = plt.subplots(3, 1, figsize=(15, 20))
train[train.time_id == 25504][['norm_real_vol', 'norm_target', 'norm_pred']].plot(ax=axes[0])
axes[0].set_title('Worst scoring time_id, 25504: normalized target, realized volatility, and prediction')

train[train.time_id == 27174][['norm_real_vol', 'norm_target', 'norm_pred']].plot(ax=axes[1])
axes[1].set_title('2nd worst scoring time_id, 27174: normalized target, realized volatility, and prediction')

train[train.time_id == 24034][['norm_real_vol', 'norm_target', 'norm_pred']].plot(ax=axes[2])
axes[2].set_title('3rd worst scoring time_id, 24034: normalized target, realized volatility, and prediction')
plt.show()

fig, axes = plt.subplots(3, 1, figsize=(15, 20))
train[train.time_id == 29850][['norm_real_vol', 'norm_target', 'norm_pred']].plot(ax=axes[0])
axes[0].set_title('Best scoring time_id, 29850: normalized target, realized volatility, and prediction')

train[train.time_id == 4432][['norm_real_vol', 'norm_target', 'norm_pred']].plot(ax=axes[1])
axes[1].set_title('2nd best scoring time_id, 4432: normalized target, realized volatility, and prediction')

train[train.time_id == 21116][['norm_real_vol', 'norm_target', 'norm_pred']].plot(ax=axes[2])
axes[2].set_title('3rd best scoring time_id, 21116: normalized target, realized volatility, and prediction')

plt.show()

I am still confused about the third worst time_id. It looks like the predictions follow the first 10-minute volatility and should be far off. Maybe I am missing something by normalizing. Let's just look at the raw values of the predictions and the target, with the error plotted underneath for reference.

A plot that finally brings some understanding of the large errors

fig, axes = plt.subplots(6, 1, figsize=(15, 35))
train[train.time_id == 25504][['target', 'pred']].plot(ax=axes[0])
axes[0].set_title('Worst scoring time_id, 25504: raw target and prediction')
train[train.time_id == 25504][['spe']].plot(ax=axes[1])

train[train.time_id == 27174][['target', 'pred']].plot(ax=axes[2])
axes[2].set_title('2nd worst scoring time_id, 27174: raw target and prediction')
train[train.time_id == 27174][['spe']].plot(ax=axes[3])

train[train.time_id == 24034][['target', 'pred']].plot(ax=axes[4])
axes[4].set_title('3rd worst scoring time_id, 24034: raw target and prediction')
train[train.time_id == 24034][['spe']].plot(ax=axes[5])

plt.show()

Key takeaway: the largest errors occur when the target is small, because the error is divided by the target in the metric calculation. The two worst scoring time_ids each have a single prediction with an extremely low target, while the 3rd worst time_id has a generally low target across all stocks.
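
To make the arithmetic concrete, take the worst row (stock 31 at time_id 25504; its exact values appear in the top_spe table further below) and plug its rounded values back into the spe formula:

# A raw error of ~0.0022 is small in absolute terms, but dividing by a target of
# ~0.000139 before squaring inflates it to an spe of roughly 253.
raw_error, target = 0.002212, 0.000139
print((raw_error / target) ** 2)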

Looking at rows with the highest error, testing hypothesis that a small target is the culprit

Let's start with the top 50 errors.

fig, axes = plt.subplots(2, 1, figsize=(15, 10))
top = train.sort_values('spe', ascending=False).head(50).reset_index()
top[['target', 'pred', 'log_return_realized_volatility']].plot(ax=axes[0])
axes[0].set_title('Top 50 squared percentage errors')
top[['spe']].plot(ax=axes[1])
axes[1].set_title('spe')
plt.show()

It definitely seems that the largest errors come from overpredicting the target when the target is especially small. I think overprediction may be the only case where we get a large penalty from the metric. Let's look at the largest losses where we underpredict.

df = train[train.pred < train.target]
df.sort_values('spe', ascending=False)[['pred', 'target', 'spe']].head(3)
pred target spe
202001 0.002458 0.069165 0.930183
378265 0.001178 0.019343 0.881915
78772 0.001295 0.019935 0.874268

The largest spe among underpredictions is about 0.93. For reference, if we predicted 0 for every row, each spe would be exactly 1, so the average spe and the overall metric would also be 1. The asymmetry makes sense: an underprediction with 0 ≤ pred ≤ target can never produce an spe above 1, while overpredicting a tiny target can make the ratio arbitrarily large. Given that the overpredictions produce spe penalties far above 1 at the high end (the worst around 250) despite relatively small absolute errors, while the underpredictions max out below 1, I am confident that overpredicting especially small targets is the largest source of error.
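
A tiny illustration of that asymmetry with made-up numbers:

# Underpredicting all the way to 0 costs exactly 1; overpredicting a tiny target explodes.
t = 0.0001
print(((t - 0.0) / t) ** 2)    # 1.0
print(((t - 0.003) / t) ** 2)  # 841.0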

Looking for signals of low volatility

Let's look at some of the book and trade data for the top losses. Maybe we can find features that show we are about to enter a low volatility period, even if the previous 10 minutes were high.

def add_percentile(df, col): 
    """Adds `{col}_percentile` to compare where each row ranks in terms of `col`."""
    if type(col) == list: 
        for c in col: df = add_percentile(df, c)
    else: 
        df[col + '_percentile'] = (df[col].rank(pct=True) * 100).astype(int)
    return df
add_percentile(train, ['pred', 'target', 'log_return_realized_volatility'])
top_spe = train.sort_values('spe', ascending=False)
top_spe.head()
row_id stock_id time_id target fold log_return_max_sub_min log_return_std log_return_realized_volatility log_return_mean_decay log_return_mean_decay_flip ... spe raw_error rv ratio norm_real_vol norm_target norm_pred pred_percentile target_percentile log_return_realized_volatility_percentile
110278 31-25504 31 25504 0.000139 1.0 0.002713 0.000144 0.003528 -1.915218e-06 1.643631e-06 ... 253.353148 0.002212 0.003528 283529.781250 -0.196735 -1.274253 -0.520822 37 0 56
108254 31-8534 31 8534 0.000105 3.0 0.002760 0.000081 0.001986 2.102843e-06 5.847572e-07 ... 51.307601 0.000754 0.001986 111114.992188 -0.626816 -1.285737 -1.028943 1 0 22
110479 31-27174 31 27174 0.000123 1.0 0.002623 0.000109 0.002662 -2.080473e-06 1.383185e-05 ... 45.948679 0.000835 0.002662 60701.390625 -0.438237 -1.279647 -0.995348 2 0 39
275261 81-28319 81 28319 0.000389 0.0 0.002005 0.000123 0.003025 -2.650691e-06 -2.003285e-05 ... 43.877539 0.002579 0.003025 407105.843750 -0.336934 -1.189000 -0.310741 52 0 47
107425 31-1544 31 1544 0.000289 2.0 0.001773 0.000086 0.002103 5.815194e-07 1.080874e-05 ... 18.923841 0.001259 0.002103 172874.531250 -0.594068 -1.223036 -0.794324 13 0 25

5 rows × 379 columns

def plot_book(stock_id, time_id, train=train): 
    """Plots book and trade data for single stock and time combination. 
    This is to try and tease out indicators for especially large errors."""
    mask = (train.stock_id == stock_id) & (train.time_id == time_id)
    target, pred, spe, realized_volatility, pred_percentile, target_percentile, \
    realized_volatility_percentile, size_sum, ratio \
        = train.loc[mask, ['target', 'pred', 'spe', 
                           'log_return_realized_volatility',
                           'pred_percentile', 
                           'target_percentile', 
                           'log_return_realized_volatility_percentile', 
                           'size_sum', 
                           'ratio'
                          ]].values[0].round(5)
    
    bt = load_bt(DATA_RAW, stock_id, 'train')
    bt['wap'] = (bt['bid_price1'] * bt['ask_size1'] + bt['ask_price1'] * bt['bid_size1']) /\
                          (bt['bid_size1'] + bt['ask_size1'])
    bt['total_bid_size'] = bt['bid_size1'] + bt['bid_size2']
    bt['total_ask_size'] = bt['ask_size1'] + bt['ask_size2']
    book = bt[bt.time_id == time_id].set_index('seconds_in_bucket')
    
    fig, axes = plt.subplots(3, 1, figsize=(15, 15))
    
    cmap = {'ask_price1': 'yellow', 
            'ask_price2': 'khaki',
            'ask_size1': 'yellow', 
            'ask_size2': 'khaki',
            'bid_price1': 'blue', 
            'bid_price2': 'lightblue', 
            'bid_size1': 'blue', 
            'bid_size2': 'lightblue',
            'total_ask_size': 'yellow', 
            'total_bid_size': 'blue', 
            'wap': 'green', 
            'size': 'red', 
            'price': 'red', 
           }
    
    book[['ask_price2', 'ask_price1', 'bid_price1',  'bid_price2', 'wap', 'price']].plot(ax=axes[0], color=cmap)
    book[['bid_size1', 'ask_size1', 'bid_size2','ask_size2']].plot(ax=axes[1], color=cmap)
    book[['total_bid_size', 'total_ask_size']].plot(ax=axes[2], color=cmap)
    book[['size']].plot(marker='o', ax=axes[1], color=cmap)

    
    axes[0].set_title('Price info')
    axes[1].set_title('Size info')
    axes[2].set_title('Total size info')
    axes[0].legend(loc='upper left')
    axes[1].legend(loc='upper left')
    axes[2].legend(loc='upper left')
    
    fig.suptitle(f'\
        Stock: {stock_id}         missing seconds: {600 - len(book)}\n\
        Time_id: {time_id}\n\
        spe: {spe} \n\
        target: {target} \n\
        pred: {pred} \n\
        realized_volatility: {realized_volatility},\n\n\
        error: {round(pred - target, 5)} \n\
        target_percentile: {target_percentile} \n\
        pred_percentile: {pred_percentile} \n\
        realized_volatility_percentile: {realized_volatility_percentile}           total sales: {size_sum}         ratio: {ratio}\n')
    plt.tight_layout()
    plt.show()
train['d'] = train['pred'] - train['target']

Plotting rows with the largest raw errors (the biggest overpredictions)

for stock_id, time_id in train.sort_values('d', ascending=False)[['stock_id', 'time_id']].values[: 10]: 
    plot_book(stock_id, time_id)

def mean_decay(x, decay=.9, step=-1, axis=0): 
    """Returns an exponentially weighted mean; with step=-1 the weights
    decay away from the end, so the end of the array matters most."""
    weights = np.power(decay, np.arange(x.shape[axis])[::step]).astype(np.float32)
    return np.sum(weights * x, axis=axis) / weights.sum()
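
A quick usage example of mean_decay (my own illustration, not part of the original run):

x = np.array([1.0, 1.0, 1.0, 5.0])
print(mean_decay(x, decay=0.9, step=-1))  # ~2.16: the final value gets the most weight
print(x.mean())                           # 2.0, for comparison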

Plotting top 3 instances for each of the top 5 worst scoring time_ids

for t in top_5_worst_times: 
    tmp = train[train.time_id == t].sort_values('spe', ascending=False)
    for stock_id, time_id in tmp[['stock_id', 'time_id']].values[: 3]: 
        plot_book(stock_id, time_id)

Let's look at the most penalized instances and see whether we can spot a pattern.

low = train.sort_values('log_return_realized_volatility')

Plotting rows with low volatility during the 10 minutes of input data, for reference

Let's see what the lowest volatility instances look like before looking at the instances where we went from high volatility to a low target.

for stock_id, time_id in low[['stock_id', 'time_id']].values[: 10]: 
    plot_book(stock_id, time_id)

They all have a very low ratio of sales to ask and bid size.

Plotting the worst scoring rows of stock 31, the worst scoring stock

for stock_id, time_id in top_spe[top_spe.stock_id == 31][['stock_id', 'time_id']].values[: 10]: 
    plot_book(stock_id, time_id)

Plotting the first 10 rows of stock 31 to compare to the worst scoring rows

for stock_id, time_id in train[train.stock_id == 31][['stock_id', 'time_id']].values[: 10]: 
    plot_book(stock_id, time_id)

Plotting rows with roughly average measured realized volatility

comp = train[(train.log_return_realized_volatility_percentile < 60) & (train.log_return_realized_volatility_percentile > 40)]
for stock_id, time_id in comp[['stock_id', 'time_id']].values[:10]: 
    plot_book(stock_id, time_id)

I can’t see any obvious patterns in these plots that I could use to create different features than I already have.