Reducing the size of the CBOE Data Frame
Summary
In theory, this should be the last step of the data preparation process. The goal of this script is to take the bloated Pandas data frame I've constructed and winnow it down to the data that I'm interested in using.
In reality, this will end the first attempt at this data prep process. As I've made my way through Learning Python, I want to use some of that accumulated knowledge to simplify and modularize the data read, scrubbing, and reduction steps.
In addition, the CBOE recently published corrections to their historical data. So, I will apply the modified process to those new data files.
# =============================================================================
# Set-up
# =============================================================================
import os
import pandas as pd
from datetime import timedelta
# Alter display settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 500)
# =============================================================================
# Read the augmented file
# =============================================================================
# Directory & file that houses the concatenated CBOE csv data
# (Note: The change in directory reflects the fact that I now
# have two sets of CBOE data ... so I will segregate the two
# by the date downloaded and the underlying ticker symbol)
proc_dir = r"/Users/alexstephens/data/cboe/proc/1905_SPX"
pkl_file = os.path.join(proc_dir, r"cboe_mmyy_aug_clean_df.pkl")
# Read the .pkl file
df = pd.read_pickle(pkl_file)
# Echo the columns
list(df.columns.values)
# Echo the first two rows (transposed)
df.head(2).transpose()
# =============================================================================
# Generate date vectors / cutoffs
# =============================================================================
# Collect min/max quote dates
min_date = df['quote_date'].min()
max_date = df['quote_date'].max()
# Since we may want to compute the VIX up to the last quote day in the
# history, pull expirations that extend out 37 days past max_date
vix_date = max_date + timedelta(days=37)
# Retain only expiration cycles that happen before the vix_date
# Shape before: (13057658, 45)
# Shape after: (12471324, 45)
df = df[(df.expiration <= vix_date)]
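# Quick sanity check (a small verification addition, not part of the
# original filter): nothing should survive past the vix_date cutoff
assert df.expiration.max() <= vix_date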
# =============================================================================
# Remove & rename columns
# =============================================================================
# Drop columns
drop_columns = ['open', 'high', 'low', 'close',
                'underlying_bid_1545', 'underlying_ask_1545',
                'bid_size_eod', 'bid_eod',
                'ask_size_eod', 'ask_eod',
                'underlying_bid_eod', 'underlying_ask_eod',
                'implied_underlying_price_1545',
                'vwap', 'delivery_code',
                'ex_weeknum', 'ex_yymmdd', 'strike_str',
                ]
df.drop(drop_columns, axis=1, inplace=True)
# Rename columns
df.rename(columns={'implied_volatility_1545': 'iv_1545',
                   'active_underlying_price_1545': 'aup_1545',
                   'mid_underlying_price_1545': 'mup_1545',
                   'open_interest': 'oi'}, inplace=True)
# Echo the first two rows (transposed)
df.head(2).transpose()
# =============================================================================
# Define a minimum set of print columns
# =============================================================================
# Print columns
print_columns = ['quote_date', 'root', 'expiration', 'ex_dayname',
                 'option_type', 'strike', 'mup_1545',
                 'bid_1545', 'ask_1545', 'trade_volume', 'oi',
                 'dte', 'iv_1545', 'delta_1545']
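# Example use (assuming all of these columns survived the drop above):
# a compact, readable view of the option chain
df[print_columns].head(2).transpose()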
# =============================================================================
# Sort the option chain
# =============================================================================
# Sort by: quote_date > expiration > strike
df.sort_values(['quote_date', 'expiration', 'strike'],
               ascending=[True, True, True],
               inplace=True)
# Reindex the data (left commented out; note that reset_index()
# returns a new frame unless inplace=True is passed)
# df.reset_index(drop=True, inplace=True)
# Echo df
df.head(2).transpose()
# =============================================================================
# Isolate Friday expirations (+/- 1 day)
# =============================================================================
# daily date vector spanning the quote_date range
quote_range = pd.date_range(start=min_date, end=vix_date, freq='d')
# Monthly date vector spanning the quote_date range (but end-of-month)
eom_range = list(pd.date_range(start=min_date, end=vix_date, freq='M'))
# Create a pandas dataframe using the vector of dates
qr_df = pd.DataFrame(quote_range, columns=['date'])
# Append the day name for readability
qr_df['day_name'] = qr_df['date'].dt.day_name()
# Append the week number
qr_df['week_num'] = qr_df['date'].apply(lambda d: (d.day-1) // 7 + 1)
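# Worked example of the week-number formula: days 1-7 map to week 1,
# days 8-14 to week 2, and e.g. day 17 -> (17-1)//7 + 1 = 3 (third week)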
# Isolate all Fridays of every month in the series (+/- 1 day)
all_fri = qr_df[(qr_df.day_name == 'Friday')].date
all_thu = all_fri + timedelta(-1)
all_sat = all_fri + timedelta(+1)
# Create a list combining all three date lists as a global 'end-of-week' filter
all_eow = sorted(list(all_thu) + list(all_fri) + list(all_sat))
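# Spot check (a small verification addition): the end-of-week filter
# should contain only Thursdays, Fridays, and Saturdays
assert set(pd.Series(all_eow).dt.day_name()) <= {'Thursday', 'Friday', 'Saturday'}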
# Identify the 3rd Friday of every month
third_fri = qr_df[(qr_df.day_name == 'Friday') & (qr_df.week_num == 3)].date
third_thu = third_fri + timedelta(-1)
third_sat = third_fri + timedelta(+1)
# Create a list combining all 3rd-week eow dates
third_eow = sorted(list(third_thu) + list(third_fri) + list(third_sat))
# =============================================================================
# Isolate unique expiration dates
# =============================================================================
# Create a pandas dataframe using the vector of expiration dates
ex_date = sorted(df.expiration.unique())
ex_df = pd.DataFrame(pd.to_datetime(ex_date), columns=['ex_date'])
# Create a day number field
ex_df['ex_day_num'] = ex_df['ex_date'].dt.day
# Append the day name for readability
ex_df['ex_day_name'] = ex_df['ex_date'].dt.day_name()
# Append the week number
ex_df['ex_week_num'] = ex_df['ex_date'].apply(lambda d: (d.day-1) // 7 + 1)
# =============================================================================
# [!] Make a copy of the df DataFrame -- filter on tmp_df only
# =============================================================================
# shape = (12471324, 27)
tmp_df = df.copy()
tmp_df.shape
# =============================================================================
# Apply root filters
# =============================================================================
# Remove quarterlies
quarterly_roots = ['SPXQ', 'SPXPM',
                   'SAQ', 'SKQ', 'SLQ',
                   'QSE', 'QSZ', 'QZQ',
                   'SQG', 'SQP', 'SZQ',
                   ]
# shape = (11904431, 27)
excl_01 = tmp_df[(tmp_df.root.isin(quarterly_roots))]
tmp_df = tmp_df[~(tmp_df.root.isin(quarterly_roots))]
# Remove binary options (B*), Variance Strip (VSTRP), Range Option (SRO)
binary_options = ['BSZ', 'BSF', 'BSK', 'SRO', 'VSTRP']
# shape = (11616379, 27)
excl_02 = tmp_df[(tmp_df.root.isin(binary_options))]
tmp_df = tmp_df[~(tmp_df.root.isin(binary_options))]
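# Reconciliation check (a small addition): the two exclusion sets plus
# the retained rows should partition the 12471324-row pre-filter frame
assert len(excl_01) + len(excl_02) + len(tmp_df) == 12471324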
# =============================================================================
# Isolate all Friday expirations (+/- 1 day)
# =============================================================================
# Keep only rows whose expiration falls within the end-of-week window
# shape = (10015038, 28)
excl_03 = tmp_df[~tmp_df['expiration'].isin(all_eow)]
tmp_df = tmp_df[tmp_df['expiration'].isin(all_eow)]
# =============================================================================
# Exclude weeklies with a PM-settlement on the third Friday of the month
# =============================================================================
# Remove the PM expirations that occur on the third Friday of the month
weekly_roots = ['JXA', 'JXB', 'JXD', 'JXE', 'SPXW']
excl_04 = tmp_df[(tmp_df['expiration'].isin(third_eow) &
                  tmp_df['root'].isin(weekly_roots))]
# shape = (9555202, 28)
tmp_df = tmp_df[~(tmp_df['expiration'].isin(third_eow) &
                  tmp_df['root'].isin(weekly_roots))]
# =============================================================================
# Exclude weeklies with an end-of-month settlement date on a Thursday
# =============================================================================
excl_05 = tmp_df[((tmp_df['expiration'].isin(eom_range)) &
                  (tmp_df['ex_dayname'] == 'Thursday'))]
# shape = (9313292, 28)
tmp_df = tmp_df[~((tmp_df['expiration'].isin(eom_range)) &
                  (tmp_df['ex_dayname'] == 'Thursday'))]
# =============================================================================
# Apply numeric filters
# =============================================================================
# Remove rows with all zero greeks
zero_greek = ((tmp_df.delta_1545 == 0) &
              (tmp_df.gamma_1545 == 0) &
              (tmp_df.theta_1545 == 0) &
              (tmp_df.vega_1545 == 0) &
              (tmp_df.rho_1545 == 0))
excl_06 = tmp_df[(zero_greek)]
tmp_df = tmp_df[~(zero_greek)]
# Remove rows with artificial error code replacements
error_greek = ((tmp_df.delta_1545 == -999999) |
               (tmp_df.gamma_1545 == -999999) |
               (tmp_df.theta_1545 == -999999) |
               (tmp_df.vega_1545 == -999999) |
               (tmp_df.rho_1545 == -999999))
excl_07 = tmp_df[(error_greek)]
tmp_df = tmp_df[~(error_greek)]
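# Verify the scrubs took (a small verification addition): no error-code
# or all-zero greek rows should remain
greek_cols = ['delta_1545', 'gamma_1545', 'theta_1545',
              'vega_1545', 'rho_1545']
assert (tmp_df[greek_cols] != -999999).all().all()
assert not (tmp_df[greek_cols] == 0).all(axis=1).any()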
# First isolate the columns to group by
dd = tmp_df[['quote_date',
             'expiration',
             'option_type',
             'strike',
             'dte',
             'root']]
# Count rows per contract key (everything except root)
uniq_dt = dd.groupby(['quote_date',
                      'expiration',
                      'option_type',
                      'strike',
                      'dte']).size()
# Flatten the group-by result and keep the keys that appear more than once
uniq_dt = uniq_dt.reset_index().rename(columns={0: 'cnt'})
uniq_dt = uniq_dt[uniq_dt.cnt > 1]
print(uniq_dt.head(10))
# Display the first duplicates (the first two duplicate pairs -- 4 rows in total)
tmp_df[(tmp_df.quote_date == pd.to_datetime('2004-06-23')) &
       (tmp_df.expiration == pd.to_datetime('2005-06-18')) &
       (tmp_df.option_type == 'C')].head(4).transpose()
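# One possible way to collapse such duplicate chains (a sketch only,
# not applied here; 'root_rank' is a hypothetical preference mapping):
# root_rank = {'SPX': 0}  # prefer the standard SPX root; others rank 1
# dedup = (tmp_df.assign(rk=tmp_df.root.map(root_rank).fillna(1))
#                .sort_values('rk')
#                .drop_duplicates(['quote_date', 'expiration',
#                                  'option_type', 'strike'])
#                .drop(columns='rk'))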
# =============================================================================
# Save output to a file
# =============================================================================
# Save the output
df = tmp_df.copy()
out_file = os.path.join(proc_dir, r"cboe_mmyy_std_clean_df.pkl")
df.to_pickle(out_file)
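# Optional read-back check (a small addition): confirm the pickle
# round-trips with the same shape
chk = pd.read_pickle(out_file)
assert chk.shape == df.shape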
These steps bring the file size down from roughly 5 GB to just under 2 GB.
The steps above will serve as a sketch for a more modularized data-processing script, one that combines the features of the previous scripts.
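As a first pass at that modularization, the reduction might be factored into a few small functions along these lines (a rough sketch with hypothetical function names, assuming the same pandas setup and root lists as above):

def read_augmented(pkl_file):
    # Load the concatenated, augmented CBOE frame
    return pd.read_pickle(pkl_file)

def trim_expirations(df, days_past_max=37):
    # Drop expiration cycles beyond the VIX look-ahead window
    vix_date = df['quote_date'].max() + timedelta(days=days_past_max)
    return df[df.expiration <= vix_date]

def drop_roots(df, roots):
    # Remove non-standard roots (quarterlies, binaries, range options, ...)
    return df[~df.root.isin(roots)]

def reduce_cboe_frame(pkl_file):
    df = read_augmented(pkl_file)
    df = trim_expirations(df)
    return drop_roots(df, quarterly_roots + binary_options)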