Brief EDA

import os
import pandas as pd
# Small data paths so Continuous Integration tests run fast
PATH_DATA = 'data'
PATH_DATA_RAW = 'data/raw'
os.listdir(PATH_DATA_RAW)
['sample_submission.csv',
 'sell_prices.csv',
 'calendar.csv',
 'sales_train_evaluation.csv']
# # no_test
# PATH_DATA = 'data'
# PATH_DATA_RAW = 'data/raw'
# os.listdir(PATH_DATA_RAW)
chunks = pd.read_csv(os.path.join(PATH_DATA_RAW, 'sales_train_evaluation.csv'), chunksize=1000)
df_stv = pd.concat(chunks)  # reading in chunks keeps peak memory low
df_cal = pd.read_csv(os.path.join(PATH_DATA_RAW, 'calendar.csv'))
df_prices = pd.read_csv(os.path.join(PATH_DATA_RAW, 'sell_prices.csv'))
df_ss = pd.read_csv(os.path.join(PATH_DATA_RAW, 'sample_submission.csv'))

sales_train_evaluation: We have six hierarchical ID columns and 1941 days of sales data for each of the 30490 item/store combinations

display(df_stv.head())
display(df_stv.info())
id item_id dept_id cat_id store_id state_id d_1802 d_1803 d_1804 d_1805 ... d_1932 d_1933 d_1934 d_1935 d_1936 d_1937 d_1938 d_1939 d_1940 d_1941
0 FOODS_1_001_CA_1_evaluation FOODS_1_001 FOODS_1 FOODS CA_1 CA 1 0 0 0 ... 2 3 1 0 0 0 1 0 0 0
1 FOODS_1_001_CA_2_evaluation FOODS_1_001 FOODS_1 FOODS CA_2 CA 1 6 0 0 ... 1 0 0 1 1 0 0 1 2 0
2 FOODS_1_001_CA_3_evaluation FOODS_1_001 FOODS_1 FOODS CA_3 CA 0 0 0 0 ... 1 2 2 0 0 1 0 3 2 2
3 FOODS_1_001_CA_4_evaluation FOODS_1_001 FOODS_1 FOODS CA_4 CA 0 1 0 0 ... 1 0 0 0 0 1 1 0 0 0
4 FOODS_1_001_TX_1_evaluation FOODS_1_001 FOODS_1 FOODS TX_1 TX 1 0 2 1 ... 1 1 1 1 5 0 2 2 0 2

5 rows × 146 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Columns: 146 entries, id to d_1941
dtypes: int64(140), object(6)
memory usage: 80.0+ KB
None
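Most downstream work treats this table in long format, with one row per item/day rather than one column per day. A minimal sketch of that reshape, using a toy one-row stand-in for df_stv (the column and value names are illustrative, not from the real file):

```python
import pandas as pd

# Toy stand-in for df_stv: ID columns plus two day columns.
df = pd.DataFrame({
    'id': ['FOODS_1_001_CA_1_evaluation'],
    'item_id': ['FOODS_1_001'],
    'd_1940': [2],
    'd_1941': [0],
})

# Wide -> long: one row per (id, day), the shape that merges with
# the calendar and prices tables expect.
id_cols = ['id', 'item_id']
df_long = df.melt(id_vars=id_cols, var_name='d', value_name='sales')
print(df_long)
```

On the full data the same call would use all six ID columns as `id_vars`.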

calendar gives us date information, including special events such as holidays and whether SNAP (EBT food stamp) purchases are accepted in each state, for each day of the training and evaluation periods.

display(df_cal.head())
display(df_cal.info())
date wm_yr_wk weekday wday month year d event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI
0 2016-01-02 11549 Saturday 1 1 2016 d_1800 NaN NaN NaN NaN 1 0 1
1 2016-01-03 11549 Sunday 2 1 2016 d_1801 NaN NaN NaN NaN 1 1 1
2 2016-01-04 11549 Monday 3 1 2016 d_1802 NaN NaN NaN NaN 1 0 0
3 2016-01-05 11549 Tuesday 4 1 2016 d_1803 NaN NaN NaN NaN 1 1 1
4 2016-01-06 11549 Wednesday 5 1 2016 d_1804 NaN NaN NaN NaN 1 1 1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          170 non-null    object
 1   wm_yr_wk      170 non-null    int64 
 2   weekday       170 non-null    object
 3   wday          170 non-null    int64 
 4   month         170 non-null    int64 
 5   year          170 non-null    int64 
 6   d             170 non-null    object
 7   event_name_1  18 non-null     object
 8   event_type_1  18 non-null     object
 9   event_name_2  1 non-null      object
 10  event_type_2  1 non-null      object
 11  snap_CA       170 non-null    int64 
 12  snap_TX       170 non-null    int64 
 13  snap_WI       170 non-null    int64 
dtypes: int64(7), object(7)
memory usage: 18.7+ KB
None
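The `d` column is what links each sales observation to its calendar row. A small sketch of that join, on toy frames shaped like the real tables (values here are made up):

```python
import pandas as pd

# Toy long-format sales and a matching calendar slice.
sales_long = pd.DataFrame({'id': ['x', 'x'],
                           'd': ['d_1802', 'd_1803'],
                           'sales': [1, 0]})
cal = pd.DataFrame({'d': ['d_1802', 'd_1803'],
                    'date': ['2016-01-04', '2016-01-05'],
                    'wm_yr_wk': [11549, 11549],
                    'snap_CA': [1, 1]})

# Left-merge on `d` attaches dates, week ids, and SNAP flags
# to every sales observation.
merged = sales_long.merge(cal, on='d', how='left')
print(merged)
```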

sell_prices gives the weekly price for each store/item combination.

display(df_prices.head())
display(df_prices.info())
store_id item_id wm_yr_wk sell_price
0 CA_1 HOBBIES_1_001 11549 8.26
1 CA_1 HOBBIES_1_001 11550 8.26
2 CA_1 HOBBIES_1_001 11551 8.26
3 CA_1 HOBBIES_1_001 11552 8.26
4 CA_1 HOBBIES_1_001 11601 8.26
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1750 entries, 0 to 1749
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   store_id    1750 non-null   object 
 1   item_id     1750 non-null   object 
 2   wm_yr_wk    1750 non-null   int64  
 3   sell_price  1750 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 54.8+ KB
None
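Because prices are weekly, joining them onto sales requires `wm_yr_wk` in the key along with the store and item. A hedged sketch on toy data (the rows are invented for illustration):

```python
import pandas as pd

sales = pd.DataFrame({'store_id': ['CA_1'], 'item_id': ['HOBBIES_1_001'],
                      'wm_yr_wk': [11549], 'sales': [3]})
prices = pd.DataFrame({'store_id': ['CA_1'], 'item_id': ['HOBBIES_1_001'],
                       'wm_yr_wk': [11549], 'sell_price': [8.26]})

# The join key is (store, item, week); weeks where an item has no
# price row come back as NaN, a useful "not for sale" signal.
with_price = sales.merge(prices, on=['store_id', 'item_id', 'wm_yr_wk'], how='left')
print(with_price)
```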

sample_submission shows the expected format of a submission. Each item/store combination gets one row, with the 28-day forecast spread across columns F1 to F28. We submit predictions for the validation period and the evaluation period in the same file.

display(df_ss.head())
display(df_ss.info())
id F1 F2 F3 F4 F5 F6 F7 F8 F9 ... F19 F20 F21 F22 F23 F24 F25 F26 F27 F28
0 HOBBIES_1_001_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 HOBBIES_2_001_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 HOUSEHOLD_1_001_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 HOUSEHOLD_2_001_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 FOODS_1_001_CA_1_validation 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 29 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 29 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      140 non-null    object
 1   F1      140 non-null    int64 
 2   F2      140 non-null    int64 
 3   F3      140 non-null    int64 
 4   F4      140 non-null    int64 
 5   F5      140 non-null    int64 
 6   F6      140 non-null    int64 
 7   F7      140 non-null    int64 
 8   F8      140 non-null    int64 
 9   F9      140 non-null    int64 
 10  F10     140 non-null    int64 
 11  F11     140 non-null    int64 
 12  F12     140 non-null    int64 
 13  F13     140 non-null    int64 
 14  F14     140 non-null    int64 
 15  F15     140 non-null    int64 
 16  F16     140 non-null    int64 
 17  F17     140 non-null    int64 
 18  F18     140 non-null    int64 
 19  F19     140 non-null    int64 
 20  F20     140 non-null    int64 
 21  F21     140 non-null    int64 
 22  F22     140 non-null    int64 
 23  F23     140 non-null    int64 
 24  F24     140 non-null    int64 
 25  F25     140 non-null    int64 
 26  F26     140 non-null    int64 
 27  F27     140 non-null    int64 
 28  F28     140 non-null    int64 
dtypes: int64(28), object(1)
memory usage: 31.8+ KB
None
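The id suffix is what distinguishes the two forecast horizons stacked in the same file. A small sketch of building and splitting a submission-shaped frame (the ids and zero forecasts are placeholders):

```python
import pandas as pd

# Two placeholder rows: one per horizon for the same series.
ids = ['FOODS_1_001_CA_1_validation', 'FOODS_1_001_CA_1_evaluation']
sub = pd.DataFrame({'id': ids})
for i in range(1, 29):          # F1..F28 forecast columns
    sub[f'F{i}'] = 0.0

# The suffix recovers which horizon each row belongs to.
val = sub[sub.id.str.endswith('_validation')]
ev = sub[sub.id.str.endswith('_evaluation')]
print(sub.shape, len(val), len(ev))
```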

Saving small data

To run continuous integration on GitHub, and to let nbdev test all of our code, we will create truncated versions of our files. They need to be under 100 MB to fit on GitHub and to keep tests fast.

Let's save only the last 28 * 5 days of sales data, for one item from each dept_id / store_id combination.

df_len = df_stv.shape[1]
df_stv_small = df_stv.iloc[:, list(range(6)) + list(range(df_len - 28 * 5, df_len))]
original_col_order = df_stv_small.columns
df_stv_small = df_stv_small.groupby(['dept_id', 'store_id']).first().reset_index()[original_col_order]
print(f'Now there are only {df_stv_small.shape[0]} rows and {df_stv_small.shape[1]} columns')
df_stv_small.head(3)
Now there are only 70 rows and 146 columns
id item_id dept_id cat_id store_id state_id d_1802 d_1803 d_1804 d_1805 ... d_1932 d_1933 d_1934 d_1935 d_1936 d_1937 d_1938 d_1939 d_1940 d_1941
0 FOODS_1_001_CA_1_evaluation FOODS_1_001 FOODS_1 FOODS CA_1 CA 1 0 0 0 ... 2 3 1 0 0 0 1 0 0 0
1 FOODS_1_001_CA_2_evaluation FOODS_1_001 FOODS_1 FOODS CA_2 CA 1 6 0 0 ... 1 0 0 1 1 0 0 1 2 0
2 FOODS_1_001_CA_3_evaluation FOODS_1_001 FOODS_1 FOODS CA_3 CA 0 0 0 0 ... 1 2 2 0 0 1 0 3 2 2

3 rows × 146 columns

For the small calendar file, we only need days d_1802 onward, so we keep every week whose wm_yr_wk is at least that of d_1802.

min_wm_yr_wk = df_cal[df_cal.d == 'd_1802'].wm_yr_wk.values[0]
df_cal_small = df_cal[df_cal.wm_yr_wk >= min_wm_yr_wk]
df_cal_small.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 170 entries, 0 to 169
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          170 non-null    object
 1   wm_yr_wk      170 non-null    int64 
 2   weekday       170 non-null    object
 3   wday          170 non-null    int64 
 4   month         170 non-null    int64 
 5   year          170 non-null    int64 
 6   d             170 non-null    object
 7   event_name_1  18 non-null     object
 8   event_type_1  18 non-null     object
 9   event_name_2  1 non-null      object
 10  event_type_2  1 non-null      object
 11  snap_CA       170 non-null    int64 
 12  snap_TX       170 non-null    int64 
 13  snap_WI       170 non-null    int64 
dtypes: int64(7), object(7)
memory usage: 19.9+ KB

For the small prices file, we can filter using the items in df_stv_small and the minimum wm_yr_wk we just computed.

item_mask = df_prices.item_id.isin(set(df_stv_small.item_id.tolist()))
date_mask = df_prices.wm_yr_wk >= min_wm_yr_wk
df_prices_small = df_prices[item_mask & date_mask]
df_prices_small.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1750 entries, 0 to 1749
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   store_id    1750 non-null   object 
 1   item_id     1750 non-null   object 
 2   wm_yr_wk    1750 non-null   int64  
 3   sell_price  1750 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 68.4+ KB

The sample submission should contain only the ids in df_stv_small, in both their validation and evaluation variants.

item_list = df_stv_small.id.to_list() + df_stv_small.id.str.replace('evaluation', 'validation').to_list()
df_ss_small = df_ss[df_ss.id.isin(item_list)]
os.makedirs('small_data/raw', exist_ok=True)
df_stv_small.to_csv(os.path.join('small_data/raw', 'sales_train_evaluation.csv'), index=False)
df_cal_small.to_csv(os.path.join('small_data/raw', 'calendar.csv'), index=False)
df_prices_small.to_csv(os.path.join('small_data/raw', 'sell_prices.csv'), index=False)
df_ss_small.to_csv(os.path.join('small_data/raw', 'sample_submission.csv'), index=False)
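Before committing, it is worth confirming that every saved file really is under GitHub's 100 MB limit. A quick sketch of that check, demonstrated here on a temporary directory with a dummy file (the real run would point `path` at small_data/raw):

```python
import os
import tempfile

limit_bytes = 100 * 1024 ** 2  # GitHub's per-file limit

# Dummy directory standing in for small_data/raw.
path = tempfile.mkdtemp()
with open(os.path.join(path, 'calendar.csv'), 'w') as f:
    f.write('date,wm_yr_wk\n2016-01-02,11549\n')

sizes = {name: os.path.getsize(os.path.join(path, name))
         for name in os.listdir(path)}
too_big = [name for name, size in sizes.items() if size >= limit_bytes]
print(sizes, too_big)
```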