Saturday, November 9, 2019

008 - Reducing the CBOE data file

08_cboe_reduce_df

Reducing the size of the CBOE DataFrame

Summary

In theory, this should be the last step of the data preparation process. The goal of this script is to take the bloated pandas DataFrame I've constructed and winnow it down to the data I'm interested in using.

In reality, this will end the first attempt at this data prep process. As I've made my way through Learning Python, I want to use some of the accumulated knowledge to simplify and modularize the data read, scrubbing, and reduction steps.

In addition, the CBOE recently published corrections to their historical data, so I will apply the modified process to those new data files.

In [1]:
# =============================================================================
# Set-up
# =============================================================================

import os
import pandas as pd
from datetime import timedelta

# Alter display settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 500)


# =============================================================================
# Read the augmented file
# =============================================================================

# Directory & file that houses the concatenated CBOE csv data
# (Note: The change in directory reflects the fact that I now
# have two sets of CBOE data ... so I will segregate the two
# by the date downloaded and the underlying ticker symbol)
proc_dir = r"/Users/alexstephens/data/cboe/proc/1905_SPX"
pkl_file = os.path.join(proc_dir, r"cboe_mmyy_aug_clean_df.pkl")

# Read the .pkl file
df = pd.read_pickle(pkl_file)

# Echo the columns
list(df.columns.values)

# Echo the first two rows (transposed)
df.head(2).transpose()
Out[1]:
0 1
underlying_symbol ^SPX ^SPX
quote_date 2004-01-30 00:00:00 2004-01-30 00:00:00
root SYG SYG
expiration 2004-02-21 00:00:00 2004-02-21 00:00:00
strike 600 600
option_type C P
open 0 0
high 0 0
low 0 0
close 0 0
trade_volume 0 0
bid_size_1545 50 0
bid_1545 529.6 0
ask_size_1545 50 10
ask_1545 531.6 0.05
underlying_bid_1545 1132.04 1132.04
underlying_ask_1545 1132.04 1132.04
implied_underlying_price_1545 1129.63 1129.63
active_underlying_price_1545 1129.63 1129.63
implied_volatility_1545 0 0.8085
delta_1545 1 -0.0004
gamma_1545 0 6e-06
theta_1545 -0.039398 -0.007394
vega_1545 0 0.003898
rho_1545 34.4217 -0.027385
bid_size_eod 50 0
bid_eod 529.9 0
ask_size_eod 50 10
ask_eod 531.9 0.05
underlying_bid_eod 1131.13 1131.13
underlying_ask_eod 1131.13 1131.13
vwap 0 0
open_interest 0 40
delivery_code 0 0
ex_dd 21 21
ex_mm 02 02
ex_yy 2004 2004
ex_yymmdd 20040221 20040221
ex_dayname Saturday Saturday
ex_weeknum 3 3
strike_str 600 600
dt_ty_sk_key 20040221C600 20040221P600
dt_ty_key 20040221C 20040221P
dte 22 22
mid_underlying_price_1545 1132.04 1132.04
In [2]:
# =============================================================================
# Generate date vectors / cutoffs
# =============================================================================

# Collect min/max quote dates
min_date = df['quote_date'].min()
max_date = df['quote_date'].max()

# Since we may want to compute the VIX up to the last quote day in the
# history, pull expirations that extend out 37 days past the max_date
vix_date = max_date + timedelta(days=37)

# Retain only expiration cycles that happen before the vix_date
# Shape before: (13057658, 45)
# Shape after: (12471324, 45)
df = df[(df.expiration <= vix_date)]
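
The 37-day pad follows the VIX methodology, which draws on options with roughly 23 to 37 days to expiration; keeping one extra expiration cycle past the last quote date leaves enough runway for that calculation. A quick sanity check on the cutoff (a sketch using the variables above):

# Confirm the retained expirations extend at most 37 days past the
# final quote date in the history
assert df['expiration'].max() <= vix_date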
In [3]:
# =============================================================================
# Remove & rename columns
# =============================================================================

# Drop columns
drop_columns = ['open', 'high', 'low', 'close',
                'underlying_bid_1545', 'underlying_ask_1545',
                'bid_size_eod', 'bid_eod',
                'ask_size_eod', 'ask_eod',
                'underlying_bid_eod', 'underlying_ask_eod',
                'implied_underlying_price_1545',
                'vwap', 'delivery_code',
                'ex_weeknum', 'ex_yymmdd', 'strike_str',
                ]

df.drop(columns=drop_columns, inplace=True)

# Rename columns
df.rename(columns={'implied_volatility_1545': 'iv_1545',
                   'active_underlying_price_1545': 'aup_1545',
                   'mid_underlying_price_1545': 'mup_1545',
                   'open_interest': 'oi'}, inplace=True)

# Echo the first two rows (transposed)
df.head(2).transpose()
Out[3]:
0 1
underlying_symbol ^SPX ^SPX
quote_date 2004-01-30 00:00:00 2004-01-30 00:00:00
root SYG SYG
expiration 2004-02-21 00:00:00 2004-02-21 00:00:00
strike 600 600
option_type C P
trade_volume 0 0
bid_size_1545 50 0
bid_1545 529.6 0
ask_size_1545 50 10
ask_1545 531.6 0.05
aup_1545 1129.63 1129.63
iv_1545 0 0.8085
delta_1545 1 -0.0004
gamma_1545 0 6e-06
theta_1545 -0.039398 -0.007394
vega_1545 0 0.003898
rho_1545 34.4217 -0.027385
oi 0 40
ex_dd 21 21
ex_mm 02 02
ex_yy 2004 2004
ex_dayname Saturday Saturday
dt_ty_sk_key 20040221C600 20040221P600
dt_ty_key 20040221C 20040221P
dte 22 22
mup_1545 1132.04 1132.04
In [4]:
# =============================================================================
# Define a minimum set of print columns
# =============================================================================

# Print columns
print_columns = ['quote_date', 'root', 'expiration', 'ex_dayname',
                 'option_type', 'strike', 'mup_1545',
                 'bid_1545', 'ask_1545', 'trade_volume', 'oi',
                 'dte', 'iv_1545', 'delta_1545']


# =============================================================================
# Sort the option chain
# =============================================================================

# Sort by: quote_date > expiration > strike
df.sort_values(['quote_date', 'expiration', 'strike'],
               ascending=[True, True, True],
               inplace=True)

# Reindex the data (if desired)
# df.reset_index(drop=True, inplace=True)

# Echo df
df.head(2).transpose()
Out[4]:
10238 10239
underlying_symbol ^SPX ^SPX
quote_date 2004-01-02 00:00:00 2004-01-02 00:00:00
root SYG SYG
expiration 2004-01-17 00:00:00 2004-01-17 00:00:00
strike 600 600
option_type C P
trade_volume 0 0
bid_size_1545 1 0
bid_1545 506.5 0
ask_size_1545 1 10
ask_1545 508.5 0.05
aup_1545 1107.23 1107.23
iv_1545 1.4717 0.9688
delta_1545 0.9886 -0.0004
gamma_1545 9.4e-05 7e-06
theta_1545 -0.379398 -0.011504
vega_1545 0.064713 0.003389
rho_1545 22.4056 -0.019588
oi 1 131
ex_dd 17 17
ex_mm 01 01
ex_yy 2004 2004
ex_dayname Saturday Saturday
dt_ty_sk_key 20040117C600 20040117P600
dt_ty_key 20040117C 20040117P
dte 15 15
mup_1545 1107.8 1107.8
In [5]:
# =============================================================================
# Isolate Friday expirations (+/- 1 day)
# =============================================================================

# Daily date vector spanning the quote_date range
quote_range = pd.date_range(start=min_date, end=vix_date, freq='d')

# Monthly date vector spanning the quote_date range (but end-of-month)
eom_range = list(pd.date_range(start=min_date, end=vix_date, freq='M'))

# Create a pandas dataframe using the vector of dates
qr_df = pd.DataFrame(quote_range, columns=['date'])

# Append the day name for readability
qr_df['day_name'] = qr_df['date'].dt.day_name()

# Append the week-of-month number (days 1-7 -> 1, 8-14 -> 2, ...)
qr_df['week_num'] = qr_df['date'].apply(lambda d: (d.day-1) // 7 + 1)

# Isolate all Fridays of every month in the series (+/- 1 day)
all_fri = qr_df[(qr_df.day_name == 'Friday')].date
all_thu = all_fri + timedelta(-1)
all_sat = all_fri + timedelta(+1)

# Create a list combining all three date lists as a global 'end-of-week' filter
all_eow = sorted(list(all_thu) + list(all_fri) + list(all_sat))

# Identify the 3rd Friday of every month
third_fri = qr_df[(qr_df.day_name == 'Friday') & (qr_df.week_num == 3)].date
third_thu = third_fri + timedelta(-1)
third_sat = third_fri + timedelta(+1)

# Create a list combining all 3rd-week eow dates
third_eow = sorted(list(third_thu) + list(third_fri) + list(third_sat))
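
The one-day pad around Friday matters here: the standard SPX monthlies in this era carry Saturday expiration dates (visible in the ex_dayname column above), and holiday-shifted cycles can land on a Thursday, so filtering on Fridays alone would drop both. A quick check that the window contains only the intended weekdays:

# The end-of-week filter should only ever contain Thu/Fri/Sat dates
assert {d.day_name() for d in all_eow} <= {'Thursday', 'Friday', 'Saturday'}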
In [6]:
# =============================================================================
# Isolate unique expiration dates
# =============================================================================

# Create a pandas dataframe using the vector of expiration dates
ex_date = sorted(df.expiration.unique())
ex_df = pd.DataFrame(pd.to_datetime(ex_date), columns=['ex_date'])

# Create a day number field
ex_df['ex_day_num'] = ex_df['ex_date'].dt.day

# Append the day name for readability
ex_df['ex_day_name'] = ex_df['ex_date'].dt.day_name()

# Append the week-of-month number (days 1-7 -> 1, 8-14 -> 2, ...)
ex_df['ex_week_num'] = ex_df['ex_date'].apply(lambda d: (d.day-1) // 7 + 1)
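
Before applying any filters, a quick tally of expiration weekdays is a useful cross-check on the end-of-week logic below:

# Distribution of expiration dates across weekdays
print(ex_df['ex_day_name'].value_counts())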
In [8]:
# =============================================================================
# [!] Make a copy of the df DataFrame -- filter on tmp_df only
# =============================================================================

# shape = (12471324, 27)
tmp_df = df.copy()
tmp_df.shape
Out[8]:
(12471324, 27)
In [9]:
# =============================================================================
# Apply root filters
# =============================================================================

# Remove quarterlies
quarterly_roots = ['SPXQ', 'SPXPM',
                   'SAQ', 'SKQ', 'SLQ',
                   'QSE', 'QSZ', 'QZQ',
                   'SQG', 'SQP', 'SZQ',
                   ]

# shape = (11904431, 27)
excl_01 = tmp_df[(tmp_df.root.isin(quarterly_roots))]
tmp_df = tmp_df[~(tmp_df.root.isin(quarterly_roots))]


# Remove binary options (B*), Variance Strip (VSTRP), Range Option (SRO)
binary_options = ['BSZ', 'BSF', 'BSK', 'SRO', 'VSTRP']

# shape = (11616379, 27)
excl_02 = tmp_df[(tmp_df.root.isin(binary_options))]
tmp_df = tmp_df[~(tmp_df.root.isin(binary_options))]
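
The root lists above are specific to this SPX history; on a different download, a frequency count makes the candidate roots easy to spot:

# Tally rows per root symbol to see which series are present
print(tmp_df['root'].value_counts())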
In [10]:
# =============================================================================
# Isolate all Friday expirations (+/- 1 day)
# =============================================================================

# Isolate all Friday expirations (+/- 1 day)
# shape = (10015038, 28)
excl_03 = tmp_df[~tmp_df['expiration'].isin(all_eow)]
tmp_df = tmp_df[tmp_df['expiration'].isin(all_eow)]
In [11]:
# =============================================================================
# Exclude weeklies with a PM-settlement on the third Friday of the month
# =============================================================================

# Remove the PM expirations that occur on the third Friday of the month
weekly_roots = ['JXA', 'JXB', 'JXD', 'JXE', 'SPXW']


excl_04 = tmp_df[(tmp_df['expiration'].isin(third_eow) &
                  tmp_df['root'].isin(weekly_roots))]

# shape = (9555202, 28)
tmp_df = tmp_df[~(tmp_df['expiration'].isin(third_eow) &
                  tmp_df['root'].isin(weekly_roots))]
In [12]:
# =============================================================================
# Exclude weeklies with an end-of-month settlement date on a Thursday
# =============================================================================

excl_05 = tmp_df[((tmp_df['expiration'].isin(eom_range)) &
                  (tmp_df['ex_dayname'] == 'Thursday'))]

# shape = (9313292, 28)
tmp_df = tmp_df[~((tmp_df['expiration'].isin(eom_range)) &
                  (tmp_df['ex_dayname'] == 'Thursday'))]


# =============================================================================
# Apply numeric filters
# =============================================================================

# Remove rows with all zero greeks
zero_greek = ((tmp_df.delta_1545 == 0) &
              (tmp_df.gamma_1545 == 0) &
              (tmp_df.theta_1545 == 0) &
              (tmp_df.vega_1545 == 0) &
              (tmp_df.rho_1545 == 0))

excl_06 = tmp_df[(zero_greek)]
tmp_df = tmp_df[~(zero_greek)]


# Remove rows with artificial error code replacements
error_greek = ((tmp_df.delta_1545 == -999999) |
               (tmp_df.gamma_1545 == -999999) |
               (tmp_df.theta_1545 == -999999) |
               (tmp_df.vega_1545 == -999999) |
               (tmp_df.rho_1545 == -999999))

excl_07 = tmp_df[(error_greek)]
tmp_df = tmp_df[~(error_greek)]
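
The filter-and-negate pattern above repeats for every exclusion in this script. As a small step toward the modularization mentioned in the summary, a hypothetical helper could capture it:

# Hypothetical helper (not used above): partition a DataFrame by a mask
def split_mask(frame, mask):
    """Return (excluded, retained) slices of frame for a boolean mask."""
    return frame[mask], frame[~mask]

# e.g., the zero-greek filter would become:
# excl_06, tmp_df = split_mask(tmp_df, zero_greek)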
In [13]:
# =============================================================================
# Check for duplicate contract keys (ignoring root)
# =============================================================================

# First isolate the columns to group by
dd = tmp_df[['quote_date',
             'expiration',
             'option_type',
             'strike',
             'dte',
             'root']]

# Count rows per unique (non-root) contract key
uniq_dt = dd.groupby(['quote_date',
                      'expiration',
                      'option_type',
                      'strike',
                      'dte']).size()

# Flatten the group-by result and keep only keys appearing more than once
uniq_dt = uniq_dt.reset_index().rename(columns={0: 'cnt'})
uniq_dt = uniq_dt[uniq_dt.cnt > 1]
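
As a cross-check on the group-by count, the same rows can be flagged more tersely with DataFrame.duplicated on the key columns:

# Mark every row whose (non-root) contract key appears more than once
key_cols = ['quote_date', 'expiration', 'option_type', 'strike', 'dte']
dup_mask = tmp_df.duplicated(subset=key_cols, keep=False)
print(dup_mask.sum())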
In [14]:
print(uniq_dt.head(10))
      quote_date expiration option_type  strike  dte  cnt
63373 2004-06-23 2005-06-18           C   500.0  360    2
63374 2004-06-23 2005-06-18           C   600.0  360    2
63375 2004-06-23 2005-06-18           C   700.0  360    2
63376 2004-06-23 2005-06-18           C   750.0  360    2
63377 2004-06-23 2005-06-18           C   800.0  360    2
63378 2004-06-23 2005-06-18           C   850.0  360    2
63380 2004-06-23 2005-06-18           C   900.0  360    2
63382 2004-06-23 2005-06-18           C   950.0  360    2
63384 2004-06-23 2005-06-18           C   995.0  360    2
63385 2004-06-23 2005-06-18           C  1005.0  360    2
In [19]:
# Display the first two sets of duplicates (4 rows in total)
tmp_df[(tmp_df.quote_date == pd.to_datetime('2004-06-23')) &
       (tmp_df.expiration == pd.to_datetime('2005-06-18')) &
       (tmp_df.option_type == 'C')].head(4).transpose()
Out[19]:
61806 61808 61810 61812
underlying_symbol ^SPX ^SPX ^SPX ^SPX
quote_date 2004-06-23 00:00:00 2004-06-23 00:00:00 2004-06-23 00:00:00 2004-06-23 00:00:00
root SPL SYF SPL SYG
expiration 2005-06-18 00:00:00 2005-06-18 00:00:00 2005-06-18 00:00:00 2005-06-18 00:00:00
strike 500 500 600 600
option_type C C C C
trade_volume 0 0 0 0
bid_size_1545 50 50 50 50
bid_1545 636.7 635.6 545.9 538.7
ask_size_1545 50 50 50 50
ask_1545 640.7 637.6 549.9 540.7
aup_1545 1119.01 1119.01 1119.01 1119.01
iv_1545 0.5749 0.5749 0.3985 0.4484
delta_1545 0.9629 0.9629 0.9713 0.9587
gamma_1545 0.000124 0.000124 0.000145 0.000174
theta_1545 -0.106659 -0.106659 -0.082949 -0.1029
vega_1545 0.919736 0.919736 0.743699 1.00343
rho_1545 424.778 424.778 536.162 517.671
oi 0 0 0 0
ex_dd 18 18 18 18
ex_mm 06 06 06 06
ex_yy 2005 2005 2005 2005
ex_dayname Saturday Saturday Saturday Saturday
dt_ty_sk_key 20050618C500 20050618C500 20050618C600 20050618C600
dt_ty_key 20050618C 20050618C 20050618C 20050618C
dte 360 360 360 360
mup_1545 1144.2 1144.2 1144.2 1144.2
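
Each duplicate pair is the same contract quoted under two different roots (SPL alongside SYF or SYG here). One hypothetical way to collapse them, which I am not applying in this pass, is to keep a single row per contract key:

# Hypothetical de-duplication (not applied -- the duplicates are carried
# through to the saved file): keep the first row per contract key
key_cols = ['quote_date', 'expiration', 'option_type', 'strike', 'dte']
dedup_df = tmp_df.drop_duplicates(subset=key_cols, keep='first')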
In [20]:
# =============================================================================
# Save output to a file
# =============================================================================

# Save the output
df = tmp_df.copy()
out_file = os.path.join(proc_dir, r"cboe_mmyy_std_clean_df.pkl")
df.to_pickle(out_file)

These steps bring the file size down from ~5 GB to just under 2 GB.
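
To verify the reduction on a fresh run, the on-disk sizes can be checked directly (a quick sketch using the paths defined above):

# Compare on-disk sizes of the input and reduced pickles, in GB
print(os.path.getsize(pkl_file) / 1e9)
print(os.path.getsize(out_file) / 1e9)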

The above steps will serve as a sketch for a more modularized data processing script, where I combine all of the features of previous scripts.
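
As a rough preview of that layout, here is a minimal skeleton (the function names are hypothetical placeholders, not the final design):

import pandas as pd

def read_augmented(pkl_path):
    """Load the augmented CBOE pickle."""
    return pd.read_pickle(pkl_path)

def scrub(frame):
    """Placeholder: drop/rename columns as in the cells above."""
    return frame

def reduce_rows(frame):
    """Placeholder: apply the root, expiration, and greek filters."""
    return frame

def run(in_path, out_path):
    frame = reduce_rows(scrub(read_augmented(in_path)))
    frame.to_pickle(out_path)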
