Brief EDA

import os
import pandas as pd
# Small data paths so Continuous Integration tests run fast
PATH_DATA = 'data'
PATH_DATA_RAW = 'data/raw'
os.listdir(PATH_DATA_RAW)
['sample_submission.csv',
 'sell_prices.csv',
 'calendar.csv',
 'sales_train_evaluation.csv']
# # no_test
# PATH_DATA = 'data'
# PATH_DATA_RAW = 'data/raw'
# os.listdir(PATH_DATA_RAW)
chunks = pd.read_csv(os.path.join(PATH_DATA_RAW, 'sales_train_evaluation.csv'), chunksize=1000)
df_stv = pd.concat(chunks)  # reading in chunks keeps peak memory low
df_cal = pd.read_csv(os.path.join(PATH_DATA_RAW, 'calendar.csv'))
df_prices = pd.read_csv(os.path.join(PATH_DATA_RAW, 'sell_prices.csv'))
df_ss = pd.read_csv(os.path.join(PATH_DATA_RAW, 'sample_submission.csv'))

sales_train_evaluation: We have six hierarchical ID columns and 1941 days of sales data for each of the 30490 item/store combinations

display(df_stv.head())
display(df_stv.info())
id item_id dept_id cat_id store_id state_id d_1802 d_1803 d_1804 d_1805 ... d_1932 d_1933 d_1934 d_1935 d_1936 d_1937 d_1938 d_1939 d_1940 d_1941
0 FOODS_1_001_CA_1_evaluation FOODS_1_001 FOODS_1 FOODS CA_1 CA 1 0 0 0 ... 2 3 1 0 0 0 1 0 0 0
1 FOODS_1_001_CA_2_evaluation FOODS_1_001 FOODS_1 FOODS CA_2 CA 1 6 0 0 ... 1 0 0 1 1 0 0 1 2 0
2 FOODS_1_001_CA_3_evaluation FOODS_1_001 FOODS_1 FOODS CA_3 CA 0 0 0 0 ... 1 2 2 0 0 1 0 3 2 2
3 FOODS_1_001_CA_4_evaluation FOODS_1_001 FOODS_1 FOODS CA_4 CA 0 1 0 0 ... 1 0 0 0 0 1 1 0 0 0
4 FOODS_1_001_TX_1_evaluation FOODS_1_001 FOODS_1 FOODS TX_1 TX 1 0 2 1 ... 1 1 1 1 5 0 2 2 0 2

5 rows × 146 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Columns: 146 entries, id to d_1941
dtypes: int64(140), object(6)
memory usage: 80.0+ KB
None
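Most downstream work treats this table in long format, with one row per item/day rather than one column per day. A minimal sketch of that reshape, using a toy one-row stand-in for df_stv (the column and value names are illustrative, not from the real file):

```python
import pandas as pd

# Toy stand-in for df_stv: ID columns plus two day columns.
df = pd.DataFrame({
    'id': ['FOODS_1_001_CA_1_evaluation'],
    'item_id': ['FOODS_1_001'],
    'd_1940': [2],
    'd_1941': [0],
})

# Wide -> long: one row per (id, day), the shape that merges with
# the calendar and prices tables expect.
id_cols = ['id', 'item_id']
df_long = df.melt(id_vars=id_cols, var_name='d', value_name='sales')
print(df_long)
```

On the full data the same call would use all six ID columns as `id_vars`.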

calendar gives us date information, including special events such as holidays and whether SNAP (EBT food stamp) purchases are accepted in each state, for each day of the training and evaluation periods.

display(df_cal.head())
display(df_cal.info())
date wm_yr_wk weekday wday month year d event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI
0 2016-01-02 11549 Saturday 1 1 2016 d_1800 NaN NaN NaN NaN 1 0 1
1 2016-01-03 11549 Sunday 2 1 2016 d_1801 NaN NaN NaN NaN 1 1 1
2 2016-01-04 11549 Monday 3 1 2016 d_1802 NaN NaN NaN NaN 1 0 0
3 2016-01-05 11549 Tuesday 4 1 2016 d_1803 NaN NaN NaN NaN 1 1 1
4 2016-01-06 11549 Wednesday 5 1 2016 d_1804 NaN NaN NaN NaN 1 1 1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          170 non-null    object
 1   wm_yr_wk      170 non-null    int64 
 2   weekday       170 non-null    object
 3   wday          170 non-null    int64 
 4   month         170 non-null    int64 
 5   year          170 non-null    int64 
 6   d             170 non-null    object
 7   event_name_1  18 non-null     object
 8   event_type_1  18 non-null     object
 9   event_name_2  1 non-null      object
 10  event_type_2  1 non-null      object
 11  snap_CA       170 non-null    int64 
 12  snap_TX       170 non-null    int64 
 13  snap_WI       170 non-null    int64 
dtypes: int64(7), object(7)
memory usage: 18.7+ KB
None
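The `d` column is what links each sales observation to its calendar row. A small sketch of that join, on toy frames shaped like the real tables (values here are made up):

```python
import pandas as pd

# Toy long-format sales and a matching calendar slice.
sales_long = pd.DataFrame({'id': ['x', 'x'],
                           'd': ['d_1802', 'd_1803'],
                           'sales': [1, 0]})
cal = pd.DataFrame({'d': ['d_1802', 'd_1803'],
                    'date': ['2016-01-04', '2016-01-05'],
                    'wm_yr_wk': [11549, 11549],
                    'snap_CA': [1, 1]})

# Left-merge on `d` attaches dates, week ids, and SNAP flags
# to every sales observation.
merged = sales_long.merge(cal, on='d', how='left')
print(merged)
```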

sell_prices gives the weekly price for each store/item combination.

display(df_prices.head())
display(df_prices.info())
store_id item_id wm_yr_wk sell_price
0 CA_1 HOBBIES_1_001 11549 8.26
1 CA_1 HOBBIES_1_001 11550 8.26
2 CA_1 HOBBIES_1_001 11551 8.26
3 CA_1 HOBBIES_1_001 11552 8.26
4 CA_1 HOBBIES_1_001 11601 8.26
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1750 entries, 0 to 1749
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   store_id    1750 non-null   object 
 1   item_id     1750 non-null   object 
 2   wm_yr_wk    1750 non-null   int64  
 3   sell_price  1750 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 54.8+ KB
None
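Because prices are weekly, joining them onto sales requires `wm_yr_wk` in the key along with the store and item. A hedged sketch on toy data (the rows are invented for illustration):

```python
import pandas as pd

sales = pd.DataFrame({'store_id': ['CA_1'], 'item_id': ['HOBBIES_1_001'],
                      'wm_yr_wk': [11549], 'sales': [3]})
prices = pd.DataFrame({'store_id': ['CA_1'], 'item_id': ['HOBBIES_1_001'],
                       'wm_yr_wk': [11549], 'sell_price': [8.26]})

# The join key is (store, item, week); weeks where an item has no
# price row come back as NaN, a useful "not for sale" signal.
with_price = sales.merge(prices, on=['store_id', 'item_id', 'wm_yr_wk'], how='left')
print(with_price)
```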

sample_submission shows the expected format of a submission. Each item/store combination gets one row, with the 28-day forecast spread across columns F1 to F28. We submit predictions for the validation period and the evaluation period in the same file.

display(df_ss.head())
display(df_ss.info())
id F1 F2 F3 F4 F5 F6 F7 F8 F9 ... F19 F20 F21 F22 F23 F24 F25 F26 F27 F28
0 HOBBIES_1_001_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 HOBBIES_2_001_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 HOUSEHOLD_1_001_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 HOUSEHOLD_2_001_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 FOODS_1_001_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 29 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 29 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      140 non-null    object
 1   F1      140 non-null    int64 
 2   F2      140 non-null    int64 
 3   F3      140 non-null    int64 
 4   F4      140 non-null    int64 
 5   F5      140 non-null    int64 
 6   F6      140 non-null    int64 
 7   F7      140 non-null    int64 
 8   F8      140 non-null    int64 
 9   F9      140 non-null    int64 
 10  F10     140 non-null    int64 
 11  F11     140 non-null    int64 
 12  F12     140 non-null    int64 
 13  F13     140 non-null    int64 
 14  F14     140 non-null    int64 
 15  F15     140 non-null    int64 
 16  F16     140 non-null    int64 
 17  F17     140 non-null    int64 
 18  F18     140 non-null    int64 
 19  F19     140 non-null    int64 
 20  F20     140 non-null    int64 
 21  F21     140 non-null    int64 
 22  F22     140 non-null    int64 
 23  F23     140 non-null    int64 
 24  F24     140 non-null    int64 
 25  F25     140 non-null    int64 
 26  F26     140 non-null    int64 
 27  F27     140 non-null    int64 
 28  F28     140 non-null    int64 
dtypes: int64(28), object(1)
memory usage: 31.8+ KB
None
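The id suffix is what distinguishes the two forecast horizons stacked in the same file. A small sketch of building and splitting a submission-shaped frame (the ids and zero forecasts are placeholders):

```python
import pandas as pd

# Two placeholder rows: one per horizon for the same series.
ids = ['FOODS_1_001_CA_1_validation', 'FOODS_1_001_CA_1_evaluation']
sub = pd.DataFrame({'id': ids})
for i in range(1, 29):          # F1..F28 forecast columns
    sub[f'F{i}'] = 0.0

# The suffix recovers which horizon each row belongs to.
val = sub[sub.id.str.endswith('_validation')]
ev = sub[sub.id.str.endswith('_evaluation')]
print(sub.shape, len(val), len(ev))
```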

Saving small data

To run continuous integration on GitHub, and to let nbdev test all of our code, we will create truncated versions of our files. They need to be under 100 MB to fit on GitHub and to keep tests fast.

Let's save only the last 28 * 5 days of sales data, for one item from each dept_id / store_id combination.

df_len = df_stv.shape[1]
df_stv_small = df_stv.iloc[:, list(range(6)) + list(range(df_len - 28 * 5, df_len))]
original_col_order = df_stv_small.columns
df_stv_small = df_stv_small.groupby(['dept_id', 'store_id']).first().reset_index()[original_col_order]
print(f'Now there are only {df_stv_small.shape[0]} rows and {df_stv_small.shape[1]} columns')
df_stv_small.head(3)
Now there are only 70 rows and 146 columns
id item_id dept_id cat_id store_id state_id d_1802 d_1803 d_1804 d_1805 ... d_1932 d_1933 d_1934 d_1935 d_1936 d_1937 d_1938 d_1939 d_1940 d_1941
0 FOODS_1_001_CA_1_evaluation FOODS_1_001 FOODS_1 FOODS CA_1 CA 1 0 0 0 ... 2 3 1 0 0 0 1 0 0 0
1 FOODS_1_001_CA_2_evaluation FOODS_1_001 FOODS_1 FOODS CA_2 CA 1 6 0 0 ... 1 0 0 1 1 0 0 1 2 0
2 FOODS_1_001_CA_3_evaluation FOODS_1_001 FOODS_1 FOODS CA_3 CA 0 0 0 0 ... 1 2 2 0 0 1 0 3 2 2

3 rows × 146 columns

For the small calendar file, we only need days d_1802 onward, so we keep every week whose wm_yr_wk is at least that of d_1802.

min_wm_yr_wk = df_cal[df_cal.d == 'd_1802'].wm_yr_wk.values[0]
df_cal_small = df_cal[df_cal.wm_yr_wk >= min_wm_yr_wk]
df_cal_small.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 170 entries, 0 to 169
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          170 non-null    object
 1   wm_yr_wk      170 non-null    int64 
 2   weekday       170 non-null    object
 3   wday          170 non-null    int64 
 4   month         170 non-null    int64 
 5   year          170 non-null    int64 
 6   d             170 non-null    object
 7   event_name_1  18 non-null     object
 8   event_type_1  18 non-null     object
 9   event_name_2  1 non-null      object
 10  event_type_2  1 non-null      object
 11  snap_CA       170 non-null    int64 
 12  snap_TX       170 non-null    int64 
 13  snap_WI       170 non-null    int64 
dtypes: int64(7), object(7)
memory usage: 19.9+ KB

For the small prices file, we can filter using the items in df_stv_small and the minimum wm_yr_wk we just computed.

item_mask = df_prices.item_id.isin(set(df_stv_small.item_id.tolist()))
date_mask = df_prices.wm_yr_wk >= min_wm_yr_wk
df_prices_small = df_prices[item_mask & date_mask]
df_prices_small.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1750 entries, 0 to 1749
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   store_id    1750 non-null   object 
 1   item_id     1750 non-null   object 
 2   wm_yr_wk    1750 non-null   int64  
 3   sell_price  1750 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 68.4+ KB

The sample submission should contain only the ids in df_stv_small, in both their validation and evaluation variants.

item_list = df_stv_small.id.to_list() + df_stv_small.id.str.replace('evaluation', 'validation').to_list()
df_ss_small = df_ss[df_ss.id.isin(item_list)]
os.makedirs('small_data/raw', exist_ok=True)
df_stv_small.to_csv(os.path.join('small_data/raw', 'sales_train_evaluation.csv'), index=False)
df_cal_small.to_csv(os.path.join('small_data/raw', 'calendar.csv'), index=False)
df_prices_small.to_csv(os.path.join('small_data/raw', 'sell_prices.csv'), index=False)
df_ss_small.to_csv(os.path.join('small_data/raw', 'sample_submission.csv'), index=False)
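Before committing, it is worth confirming that every saved file really is under GitHub's 100 MB limit. A quick sketch of that check, demonstrated here on a temporary directory with a dummy file (the real run would point `path` at small_data/raw):

```python
import os
import tempfile

limit_bytes = 100 * 1024 ** 2  # GitHub's per-file limit

# Dummy directory standing in for small_data/raw.
path = tempfile.mkdtemp()
with open(os.path.join(path, 'calendar.csv'), 'w') as f:
    f.write('date,wm_yr_wk\n2016-01-02,11549\n')

sizes = {name: os.path.getsize(os.path.join(path, name))
         for name in os.listdir(path)}
too_big = [name for name, size in sizes.items() if size >= limit_bytes]
print(sizes, too_big)
```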