Feature engineering

Make tidy features from wide time series data.
import gc, os, time
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()
plt.rcParams['figure.figsize'] = (14,6)
plt.rcParams['font.size'] = 16
PATH_DATA = 'data'
PATH_DATA_RAW = 'data/raw'
PATH_DATA_FEATURES = 'data/features'
os.listdir(PATH_DATA_RAW)
################## Load data ####################
chunks = pd.read_csv(os.path.join(PATH_DATA_RAW, 'sales_train_evaluation.csv'), chunksize=1000)
df_stv = pd.concat(list(chunks)) # Safe for low-RAM situations
df_cal = pd.read_csv(os.path.join(PATH_DATA_RAW, 'calendar.csv'))
df_prices = pd.read_csv(os.path.join(PATH_DATA_RAW, 'sell_prices.csv'))
df_ss = pd.read_csv(os.path.join(PATH_DATA_RAW, 'sample_submission.csv'))

What you can get out of this notebook

  1. How to make lag features from the horizontal “rectangle” data representation, which is how the data starts.
  2. How to use numpy to do quick rolling window aggregations.

Making a grid to align all features

This section develops code for make_grid_df, which will yield:

  * A dataframe to align all features.
  * A numpy array where each row is a time series. This data representation can be good for fast feature engineering.

Add prediction horizon

We will start by adding the prediction horizon to our original data so that our features will be generated for all of the training data and the test data at the same time.

last_day = int(df_stv.columns[-1][2:])
pred_horizon = 28
for i in range(last_day + 1, last_day + 1 + pred_horizon): 
    df_stv[f'd_{i}'] = np.nan

Make a tidy grid

We want our data in a tidy format, where we have a row for every product/sales_day combination. To do this, we start by reshaping our data to long format. I will call this our grid_df, on which we will build our features.

Using pandas

We can use the pandas DataFrame .melt method.

s = time.time()
start_time = time.time()
DROP_COLS = ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']
grid_df = df_stv.drop(DROP_COLS, axis=1).melt(id_vars='id', var_name='d', value_name='sales')
# print(f"Total time for melt: {(time.time() - start_time)/60} min")
print(f"Total time for melt: ", time_taken(start_time))


# Saving space
start_time = time.time()
grid_df['d'] = grid_df.d.str[2:].astype(np.int16)
print(f"Total time for day col change: ", time_taken(start_time))

start_time = time.time()
grid_df['id'] = grid_df.id.astype('category')
print(f"Total time for category: ", time_taken(start_time))

print(time_taken(s))
display(grid_df)

del s

Faster grid creation using numpy

d_cols = [col for col in df_stv.columns if col.startswith('d_')]
g = pd.DataFrame({'id': pd.Series(np.tile(df_stv.id, len(d_cols))).astype('category'), 
                  'd': np.concatenate([[int(s[2:])] * df_stv.shape[0] for s in d_cols]).astype(np.int16), 
                  'sales': df_stv[d_cols].values.T.reshape(-1,)})

print(f'Both grids are the same: {grid_df.equals(g)}')

Isolate numpy array in “rectangle” representation

I will take the sales values as they are to form my base “rectangle” of sales. I think I can take this rectangle and quickly reshape it so that it lines up with grid_df. If I am correct, we can use this to create features quickly.

Test: Reshape the basic rectangle so that it matches sales of grid_df

rec = df_stv[d_cols].values
test_sales = rec.T.reshape(-1)
print('test_sales matches sales?? ', (np.nan_to_num(test_sales) == grid_df['sales'].fillna(0)).all())

The competition guide states that leading zero sales should not be considered, so we need to convert these leading zeros to NaNs.


source

nan_leading_zeros

 nan_leading_zeros (rec)

Leading zeros indicate an item was not for sale. We will mark them as np.nan to ensure they are not used for training.

print(rec[: 10, :5])
nan_leading_zeros(rec[: 10, :5])
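
The idea only needs a cumulative check along each row. Below is a minimal sketch, assuming sales are non-negative so that a zero cumulative sum marks days before an item's first sale (the packaged function may differ in detail):

def nan_leading_zeros_sketch(rec):
    # Days whose cumulative sales are still zero precede the item's first sale,
    # so they are leading zeros; mark them as np.nan in place.
    rec[np.cumsum(np.nan_to_num(rec), axis=1) == 0] = np.nan
    return rec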

Main function


source

make_grid_df

 make_grid_df (df:pandas.core.frame.DataFrame, pred_horizon=28)

Specific to the M5 competition data. Returns a grid_df to align all features, and the sales data in a “rectangle” representation: a 2D numpy array where every row is an item's time series.

grid_df, rec = make_grid_df(os.path.join(PATH_DATA_RAW, 'sales_train_evaluation.csv'), pred_horizon=28)
grid_df
rec
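
Putting the pieces from this section together, a rough sketch of what make_grid_df does (the packaged version may handle more details, such as dtypes or whether the leading-zero NaNs also appear in grid_df['sales']):

def make_grid_df_sketch(path_csv, pred_horizon=28):
    # Read the wide training data and append empty columns for the prediction horizon.
    df_stv = pd.read_csv(path_csv)
    last_day = int(df_stv.columns[-1][2:])
    for i in range(last_day + 1, last_day + 1 + pred_horizon):
        df_stv[f'd_{i}'] = np.nan

    # "Rectangle": one row per item, one column per day; NaN-out leading zeros.
    d_cols = [col for col in df_stv.columns if col.startswith('d_')]
    rec = nan_leading_zeros(df_stv[d_cols].values)

    # Tidy grid flattened day-major, matching the numpy construction above.
    grid_df = pd.DataFrame({
        'id': pd.Series(np.tile(df_stv.id, len(d_cols))).astype('category'),
        'd': np.concatenate([[int(c[2:])] * df_stv.shape[0] for c in d_cols]).astype(np.int16),
        'sales': rec.T.reshape(-1),
    })
    return grid_df, rec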

Base features

Functions to create basic calendar and price features

We start with a grid_df so that all our features will be aligned on the same index, making it easy to add features for training. grid_df was created in the last few cells.

Base categorical variables given by hierarchical levels

First, we will add the grouping levels of the data as features. This is easy because the features are already included in df_stv columns. We just need to make a copy of these columns for every day of training and prediction.
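
A minimal sketch of that idea (illustrative only; the actual add_base also receives rec and may be implemented more efficiently):

def add_base_sketch(grid_df, df_stv):
    # Each id already carries its hierarchy labels in df_stv, so mapping by id
    # copies the label onto every (id, day) row of grid_df.
    for col in ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']:
        mapping = dict(zip(df_stv['id'], df_stv[col]))
        grid_df[col] = grid_df['id'].map(mapping).astype('category')
    return grid_df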


source

add_base

 add_base (grid_df, df_stv, rec)

Adds the basic categorical features to grid_df.

add_base(grid_df, df_stv, rec)
grid_df.info()
grid_df.head(3)

Price features


source

create_price_fe

 create_price_fe (df_prices)

Adds price features onto df_prices. This is the step we take before merging prices onto our grid_df.

df_prices = create_price_fe(df_prices)
df_prices.info()

source

add_price_fe

 add_price_fe (grid_df, df_prices, df_cal)

Adds on price features to grid_df.

grid_df = add_price_fe(grid_df, df_prices, df_cal)
grid_df.info()

Calendar features

grid_df = grid_df[['id', 'd', 'sales']]

source

add_cal_fe

 add_cal_fe (grid_df, df_cal)

Adds calendar features onto grid_df.

grid_df = add_cal_fe(grid_df, df_cal)
grid_df.info()

Snap features

The columns with names like ‘snap_CA’ indicate whether SNAP (also known as EBT) benefits are accepted in California on each day. On what days can people use their SNAP benefits?

df = df_cal.iloc[-180:, :].copy()
df['date'] = pd.to_datetime(df.date)
fig, ax = plt.subplots(3, 1)
df.set_index('date').snap_CA.plot(ax=ax[0], title='California')
df.set_index('date').snap_TX.plot(ax=ax[1], title='Texas')
df.set_index('date').snap_WI.plot(ax=ax[2], title='Wisconsin')

ax[2].xaxis.set_major_locator(mdates.MonthLocator(bymonthday=1))
ax[2].xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))
fig.suptitle('Days where people can use SNAP benefits')
fig.autofmt_xdate()
fig.tight_layout()
plt.show()
print()
df_cal['date'] = pd.to_datetime(df_cal['date'])
df_cal.groupby(df_cal.date.dt.day)[['snap_CA', 'snap_TX', 'snap_WI']].sum().plot(
    title='Are the snap days distributed across these days for the entire dataset?', kind='bar', figsize=(14, 6))
plt.show()

It seems the snap days are consistent for every month in the dataset, since each day of the month has the same number of snap occurrences across the dataset.

############### Snap days of month ###########################
ca = grid_df[grid_df['snap_CA'] == 1].tm_d.unique()
tx = grid_df[grid_df['snap_TX'] == 1].tm_d.unique()
wi = grid_df[grid_df['snap_WI'] == 1].tm_d.unique()
print('For each state, what days of the month are snap days?')
print('CA:', ca)
print('TX:', tx)
print('WI:', wi)

Simple feature

Just map each snap day to 1 through 10 so the model will know which of the 10 snap days it is.


source

add_snap_transform_1

 add_snap_transform_1 (grid_df)

Adds a column that shows which of the 10 snap days it is. The value is 0 if it is not a snap day.

A more meaningful mapping?

I would like to transform the snap information in a way that might give more information about non snap days. In particular, I want the “gap” days, non-snap days right in between two snap days, to be considered different from the long stretch of non-snap days towards the end of the month. I’d also like the non-snap days leading into the first snap day to be the same, no matter what state we are considering, so that algorithms can use this feature without needing state information to decode meaning.


source

add_snap_transform_2

 add_snap_transform_2 (grid_df)

This maps snap days and non-snap days in a way that may be more meaningful than snap_transform_1.

Any day above 40 will be a snap day. Lower values are non-snap days, and the lowest values are “gap” days in between snap days. In this way I’m hoping the model can use this feature as non-categorical, and be able to efficiently sort out when higher-demand days may be. Also, 16-21 will always be the days following the last snap day and 27-31 will be the days leading up to the first snap day. My theory is these numbers will encode more meaning with less confusion caused by states having different snap days.

Special event features

How many days have events?

print(f'Type 1: {grid_df.event_type_1.count()/grid_df.shape[0] * 100:.2f} percent')
print(f'Type 2: {grid_df.event_type_2.count()/grid_df.shape[0] * 100:.2f} percent')

What special events do we have?

print('Unique types in event_type_1:')
display(grid_df.event_type_1.unique().tolist())
print('Unique types in event_type_2:')
display(grid_df.event_type_2.unique().tolist())
print('Unique names in event_name_1:')
display(grid_df.event_name_1.unique().tolist())

Do we ever have event_type_2 if there is not an event_type_1?

How often do we have 2 events on the same day?

grid_df[grid_df.event_name_2.notnull()].drop_duplicates('d')
mask = (grid_df['event_type_1'] == 'Religious') & (grid_df['event_type_2'] == 'Cultural')
grid_df[mask].drop_duplicates('d')

Since there are only a few days with 2 events, I will only consider event_type_1 for my new event features. In the cases where I think event_type_2 will be more relevant, I will move it to event_type_1. I think cultural event types may be more important than religious events. This basically amounts to making Easter cultural; Cinco De Mayo is also accounted for.
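
A sketch of that swap, reusing the mask from the cell above (illustrative; the packaged add_event_features may handle this differently):

# Work on plain object columns to avoid category mismatches while editing.
for c in ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']:
    grid_df[c] = grid_df[c].astype('object')

# Promote a Cultural secondary event over a Religious primary event.
swap = (grid_df['event_type_1'] == 'Religious') & (grid_df['event_type_2'] == 'Cultural')
grid_df.loc[swap, ['event_name_1', 'event_type_1']] = \
    grid_df.loc[swap, ['event_name_2', 'event_type_2']].values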


source

add_event_features

 add_event_features (grid_df, n_items=30490, n_days_in_data=1969)

Adds some features related to special events like holidays to grid_df. The columns added are: next_event_type_1, last_event_type_1, days_since_event, and days_until_event.

del df_cal, df_prices, df_ss

Main function


source

fe_base_features

 fe_base_features (path_data_raw:str <path to raw data folder>='data/raw',
                   path_features:str <path to feature folder>='data/features',
                   path_to_train_file:str <path to train data>=None)

Creates the basic categorical, price, and calendar features using the functions add_base, create_price_fe, add_price_fe, add_cal_fe, add_snap_transform_1, add_snap_transform_2, and add_event_features.

fe_base_features(PATH_DATA_RAW, os.path.join(PATH_DATA, 'features'))
display(load_file(f'{os.path.join(PATH_DATA, "features")}/fe_base.csv').info())
display(load_file(f'{os.path.join(PATH_DATA, "features")}/fe_price.csv').info())
display(load_file(f'{os.path.join(PATH_DATA, "features")}/fe_cal.csv').info())
display(load_file(f'{os.path.join(PATH_DATA, "features")}/fe_snap_event.csv').info())

Encoding features with target statistics


source

encode_target

 encode_target (df:pandas.core.frame.DataFrame, target:str, cols:Union[list,str],
                func:Union[str,<built-in function callable>], verbose=True)

Uses pandas groupby(col)[target].transform(func) to encode each col in cols. The target col can be any numerical column.
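
In other words, something along these lines (a simplified sketch; the output column names and dtype are illustrative):

def encode_target_sketch(df, target, cols, func, verbose=True):
    cols = [cols] if isinstance(cols, str) else cols
    for col in cols:
        # Name the new column after the grouping column and the aggregation.
        name = f'enc_{col}_{func if isinstance(func, str) else func.__name__}'
        if verbose:
            print(f'encoding {col} -> {name}')
        df[name] = df.groupby(col)[target].transform(func).astype(np.float32)
    return df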

Main function

Poor encoding

Below is the method I used in the competition. I found that the models performed much better during training, and much worse during validation, so I dropped these encodings from training.

I should have done the encoding more carefully, using no future data. I will explore encodings more in the post competition experiments.


source

fe_encodings

 fe_encodings (path_features:<path to feature folder>='data/features',
               path_out_features:str <path to feature folder for output>=None,
               start_test:int <First day to start nans>=1942)

Creates target encoding with mean and std for various columns, with sales after start_test set to np.nan.

fe_encodings()
display(load_file(f'{PATH_DATA_FEATURES}/fe_enc_mean.csv').info())
display(load_file(f'{PATH_DATA_FEATURES}/fe_enc_std.csv').info())
display(load_file(f'{PATH_DATA_FEATURES}/fe_enc_special.csv').info())

Lags and rolling features

These are the features created from the raw sales data directly.

Be careful with lag features. We must be mindful that we will not have the same information for all days in the forecast horizon. If we want a single model to predict all days, we can only use lagging features from 28 days and older. In order to create one set of features that we can use for all predictions we will do the following:

  * Create all lagging features as if we are building a model for 1 day into the future. So “lag_1” means sales one day before the first day of the prediction horizon.
  * When building a model for the nth day of the horizon, we need to shift the lag features n - 1 extra days. Since we have 30490 time series, we do this by shifting the index of the features by (n - 1) * 30490 so that the lag features for all training and testing data will be lagged by an extra (n - 1) days, keeping the information aligned properly (see the sketch below).
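
For example, to reuse the 1-day-ahead features when training a model for day n of the horizon (a sketch; lag_cols is a hypothetical list of lag feature column names):

n = 7                               # model for the 7th day of the horizon
num_series = 30490                  # rows per day in grid_df
extra_shift = (n - 1) * num_series

# Pushing each lag column down by extra_shift rows lags it by n - 1 extra days.
for col in lag_cols:                # lag_cols: hypothetical, e.g. ['lag_1', 'lag_2', ...]
    grid_df[col] = grid_df[col].shift(extra_shift)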

Basic lag features

Before reshaping the data to become a column, we need to shift our rectangle by the lag amount, prepending the data with np.nans to make up for the data we have cut off. Therefore, all the d_1 products in grid_df will have np.nan for lag_1. In fact, as we carry out this process for all lag days, rows with sales on d_x will have np.nan values for all lags lag_y where y >= x.
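
Concretely, the shift-and-flatten step looks something like this (a sketch of the mechanics, not necessarily the exact make_lag_col code):

lag = 1
# Drop the last `lag` days and pad the front with NaN so the shape is unchanged.
shifted = np.concatenate(
    [np.full((rec.shape[0], lag), np.nan), rec[:, :-lag]], axis=1)
# Flatten day-major so the result lines up with grid_df['sales'].
lag_col = shifted.T.reshape(-1)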

grid_df, rec = make_grid_df(os.path.join(PATH_DATA_RAW, 'sales_train_evaluation.csv'))
grid_df.shape

source

make_lag_col

 make_lag_col (rec:<built-in function array>, lag:int)

Transform the ‘rectangle’ of time series into a lag feature

rec[:2, :5]
make_lag_col(rec[:2, :5], 1)

source

add_lags

 add_lags (grid_df, rec, lags=range(1, 16))
add_lags(grid_df, rec)
grid_df.info()

Pandas shift

We can also just use pandas shift, which is easier to implement and not much slower. The only downside is that we must use num_series to shift by the correct increment. Here we will do it for g, which was the same as grid_df before adding lags. Our time was not wasted though, because we learned skills that we will need for making rolling windows.

rec.shape
g, rec_tmp = make_grid_df(os.path.join(PATH_DATA_RAW, 'sales_train_evaluation.csv'))
num_series = df_stv.shape[0]
for i in range(1,16):
    g[f'lag_{i}'] = g['sales'].shift(num_series * i).astype(np.float16)
del g
gc.collect()

Main function


source

fe_lags

 fe_lags (path_data_raw:str <path to raw data folder>='data/raw',
          path_features:str <path to feature folder>='data/features',
          path_to_train_file:str <path to train data>=None)

Creates lags and rolling window features using add_lags and add_rolling_cols.

fe_lags()
# fe_lags(PATH_DATA_RAW, path_features='.')
max_lag = 84
cols_per_file = 14
for lag in range(1, max_lag, cols_per_file):
    display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_lags_{lag}_{lag + cols_per_file - 1}.csv').info())

Rolling features

rolling window function

Please check out “Efficient rolling statistics with NumPy” by Erik Rigtorp. The article shows some cool numpy tricks to do really fast rolling window calculations by creating “rolling window views”.

Update to the article 2021-04-21: “NumPy now comes with a builtin function sliding_window_view that does exactly this. There’s also the Bottleneck library with optimized functions for rolling mean, standard deviation etc.”


source

rolling_window

 rolling_window (a:<built-in function array>, window:int)

A super fast way of getting a rolling window view of size window on a numpy array. Reference: https://rigtorp.se/2011/01/01/rolling-statistics-numpy.html
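
The trick from the article builds a strided view instead of copying data; a minimal sketch of that idea (the packaged function should behave equivalently):

def rolling_window_sketch(a, window):
    # Add a trailing axis of length `window` that strides over the last axis of `a`,
    # giving shape (..., num_windows, window) without copying any data.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)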

x = np.arange(10).reshape((2,5))
x
rw = rolling_window(x, 3)
rw
np.mean(rw, axis=-1)
np.std(rw, axis=-1)
np.median(rw, axis=-1)
del x, rw

source

split_array

 split_array (ary, sections, axis=0)

Works just like np.split, but sections must be a single integer. It will work, even when sections doesn’t evenly divide the length of ary.

This avoids errors that occur when using make_rolling_col with high n_splits that do not divide the number of series evenly.

x = np.array(range(9))
x
split_array(x, 4)
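
For comparison, numpy's own np.array_split also accepts uneven integer sections, so it behaves much like the helper above (the exact handling of the remainder may differ):

np.array_split(x, 4)
# [array([0, 1, 2]), array([3, 4]), array([5, 6]), array([7, 8])]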

source

make_rolling_col

 make_rolling_col (rw, function, n_splits=10)

Returns a one dimensional np.array after function has been applied to the rolling window view rw.

rw = rolling_window(rec, 3)
print(f'The shape {rw.shape}, represents (num_series, num_windows, window_size)')
function = np.mean
col = make_rolling_col(rw, function)

print('Make sure the shape of the resulting column matches grid_df')
print(grid_df.shape[0],'=',  col.shape[0])
x = np.arange(10).reshape((2,5))
x
rw = rolling_window(x, 3)
make_rolling_col(rw, function, n_splits=1)
del rw, function, col

Some more functions for rolling windows

These functions are designed to act on a rolling_window array created by the rolling_window function, similar to np.mean or np.std.


source

mean_decay

 mean_decay (rolling_window, axis=-1)

Returns the mean_decay along an axis of a rolling window object, which is created by the rolling_window() function.


source

diff_nanmean

 diff_nanmean (rolling_window, axis=-1)

For M5 purposes, used on an object generated by the rolling_window function. Returns the mean of the first difference of a window of sales.


source

diff_mean

 diff_mean (rolling_window, axis=-1)

For M5 purposes, used on an object generated by the rolling_window function. Returns the mean of the first difference of a window of sales (like diff_nanmean, but without NaN handling).
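
Both reduce to a one-liner on the windowed view. Minimal sketches assuming the documented behaviour (the library versions may differ in edge-case handling):

def diff_nanmean_sketch(rolling_window, axis=-1):
    # Mean of day-over-day changes within each window, ignoring NaNs.
    return np.nanmean(np.diff(rolling_window, axis=axis), axis=axis)

def diff_mean_sketch(rolling_window, axis=-1):
    # Same, but NaNs propagate (faster when windows contain no NaNs).
    return np.mean(np.diff(rolling_window, axis=axis), axis=axis)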


source

add_rolling_cols

 add_rolling_cols (grid_df:pandas.core.frame.DataFrame, rec:<built-in function array>,
                   windows:list, functions:list, function_names:list=None,
                   n_splits:list=10)

Adds rolling features to grid_df.

grid_df = add_rolling_cols(grid_df, 
                 rec, 
                 windows=[7, 14, 30, 60, 140], 
                 functions=[np.mean, np.std], 
                 function_names=['mean', 'std'])
grid_df.info()

Main function


source

fe_rw_stats

 fe_rw_stats (path_data_raw:str <path to raw data folder>='data/raw',
              path_features:str <path to feature folder>='data/features',
              path_to_train_file:str <path to train data>=None)

Creates lags and rolling window features using add_lags and add_rolling_cols.

fe_rw_stats()
# fe_rw_stats(PATH_DATA_RAW, path_features='.')
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_rw_1.csv').info())
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_rw_2.csv').info())
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_rw_3.csv').info())

Average for each day of the week

Feature telling how long it’s been since there’s been a sale


source

get_days_since_sale

 get_days_since_sale (grid_df, num_series=30490)

Returns a column that shows how many days it has been since there was a sale.


source

add_dow_means

 add_dow_means (grid_df, rec, n_weeks)

Adds features to grid_df for the mean of each day of the week for the past n_weeks.

For any row, the column ‘mean_{n_weeks}dow{i}’ represents the mean of the last n_weeks of sales for the day of the week that is i days behind the date of this row. So if today is Friday, n_weeks=4 and i = 1, this column is equal to the mean sales of the last 4 Thursdays.
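
Using the lag machinery from earlier, one way to build such a column (an illustrative composition, assuming make_lag_col returns a flattened column aligned with grid_df; not necessarily how add_dow_means is written):

n_weeks, i = 4, 1
# Average the same weekday over the last n_weeks: lags i, i + 7, ..., i + 7 * (n_weeks - 1).
weekly_lags = np.stack([make_lag_col(rec, i + 7 * w) for w in range(n_weeks)])
mean_4_dow_1 = np.nanmean(weekly_lags, axis=0)   # aligns with grid_df rows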

Main function


source

fe_dow_means

 fe_dow_means (path_data_raw:str <path to raw data folder>='data/raw',
               path_features:str <path to feature folder>='data/features',
               path_to_train_file:str <path to train data>=None)

Creates the features for day of week means using add_dow_means

We also use ‘get_days_since_sale’ in this script, since there isn’t another feature group that this feature fits naturally with.

# fe_dow_means(PATH_DATA_RAW, '.')
fe_dow_means()
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_dow_means_and_days_since_sale.csv').info())

Shifted lag rolling features

Perhaps I also want to know the 7 day rolling mean, but from 7 days ago. This could go directly into a model, or we could create a weekly momentum_7_rolling_mean_7 = shift_1_rolling_mean_7/shift_8_rolling_mean_7. We have already calculated these features; we just need to shift the columns by num_series * (shift_days - 1). We subtract 1 from shift_days because the column shift_1_rolling_mean_7 is already shifted 1 day.
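
For one shift value the operation is just the following (a sketch of what add_shift_cols automates, with column names as in the text):

num_series = 30490
shift_days = 8
base = 'shift_1_rolling_mean_7'                    # already shifted 1 day

# shift_8_rolling_mean_7: the same rolling mean, but ending 7 days earlier.
grid_df[f'shift_{shift_days}_rolling_mean_7'] = (
    grid_df[base].shift(num_series * (shift_days - 1)))

# Weekly momentum: this week's level relative to the previous week's.
grid_df['momentum_7_rolling_mean_7'] = (
    grid_df[base] / grid_df[f'shift_{shift_days}_rolling_mean_7'])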


source

add_shift_cols

 add_shift_cols (grid_df:pandas.core.frame.DataFrame, shifts:list,
                 cols:list, num_series:int=30490, momentum:bool=True)

Adds shift_{shift} and momentum_{shift - 1} features for each int shift in shifts for each column in cols. cols must be a list of columns that begin with ‘shift_1’ for this function to work.

############## Adding shifted rolling mean ###############
shifts = [8, 15, 22, 29]
cols = [f'shift_1_rolling_mean_{i}' for i in [7, 14]]
add_shift_cols(grid_df, shifts, cols, num_series=df_stv.shape[0])
grid_df.info()

Main function


source

fe_shifts_momentum

 fe_shifts_momentum (path_features:str <path to feature folder>='data/features',
                     path_out_features:str <path to feature folder for output>='',
                     num_series:int <Number of series for shifting>=30490)

Creates shifts and momentum features using add_shift_cols

Parameters


path_features: Param(‘path to feature folder’, str)=‘data/features’

path_out_features: Param(‘path to feature folder for output’, str)=‘data/features’ This is mainly for running on kaggle, where path_features is set to an input dataset (because we need the rolling window stat features to be present) and path_out_features is set to the working directory for the output.

num_series: Param(‘Number of series for shifting’, int)=30490

# fe_shifts_momentum('.', num_series=df_stv.shape[0])
# time.sleep(1)
fe_shifts_momentum(num_series=df_stv.shape[0])
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_shifts_mom_1.csv').info())
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_shifts_mom_2.csv').info())
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_shifts_mom_3.csv').info())

Dimensionality reduction of lags

Since we have so many lags, I will try to use PCA to reduce the number of features. I can’t fit all the lags into memory, so I will create the PCA features iteratively: start with lags 71 through 84, save the file, then carry the top 7 components into another PCA with lags 57 through 70, and so on until I have 14 PCA components for lags 1 through 84. Then I can decide how many lag features I want to keep without reducing their dimension.
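
A rough sketch of the first iteration of that loop using scikit-learn's IncrementalPCA (lags_df, the column names, and the fillna(0) step are illustrative assumptions):

from sklearn.decomposition import IncrementalPCA

# 14 lag columns at a time is what fits in memory.
lag_cols = [f'lag_{i}' for i in range(71, 85)]
ipca = IncrementalPCA(n_components=7, batch_size=500_000)

# Reduce this block of lags to its top 7 components and save the output;
# the top 7 are then carried into the next iteration alongside lags 57-70,
# where another 14-column ipca is fit, and so on back to lag 1.
top7 = ipca.fit_transform(lags_df[lag_cols].fillna(0))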

Main function


source

fe_ipca_lags

 fe_ipca_lags (path_data_raw:str <path to raw data folder>='data/raw',
               path_features:str <Path to feature file>='data/features',
               path_to_train_file:str <path to train data>=None,
               end:int <last day to start lags from>=1,
               restart:int <start if resuming with lags_df.pkl in restart_dir>=None,
               target:str <Name of target column>='sales')

Creates ipca columns for 84 lag days, starting from the end and accumulating backward. With 16 GB of RAM, we can only fit 14 lag columns with ipca at a time, so for each iteration we:

1) Create 14 new lag days, and use ipca to reduce them to the top 7 components.
2) Combine those with the top 7 components from the previous step.
3) Perform ipca on these 14 features, save the output, and keep the top 7 components for the next iteration.

In the end we will have files with the top 14 ipca components for each of these separate ranges: Days 1_84, Days 15_84, Days 29_84, Days 43_84, Days 57_84, and Days 71_84.

# fe_ipca_lags(PATH_DATA_RAW, '.')
fe_ipca_lags()
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_ipca_1_84.csv').info())
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_ipca_15_84.csv').info())
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_ipca_29_84.csv').info())
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_ipca_43_84.csv').info())
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_ipca_57_84.csv').info())
display(load_file(f'{PATH_DATA_FEATURES}/shift_fe_ipca_71_84.csv').info())

Let’s see all the features we created

get_file_cols_dict(PATH_DATA_FEATURES)
!rm data/features/*csv

Make all features


source

fe

 fe ()