CBOE Data Frame Processing¶
Summary¶
What follows is a long script. In it I methodically step through a 'data reduction' process. The goals of this process are (i) to cut out superfluous and/or noisy data, (ii) to append information that will be useful in subsequent analyses, and (iii) to reformat the data so that call and put information sit side-by-side on a single row (instead of on separate rows).
This is a very linear process. I basically whittle away at the data frame until I reach a point where I think I've distilled the raw data into a usable dataset. I will still need to conduct additional analyses (e.g., a sanity check of the greeks), but that is a more involved series of tests. This script is really about reducing the data to its purest form.
Note: Below I make reference to a set of interest rate data. I append those data to the raw CBOE file. I will discuss the steps I took to 'construct' that file in a later post.
# Import modules
import os
import pandas as pd
from datetime import timedelta
# =============================================================================
# Set-up
# =============================================================================
# Alter pandas display settings
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)
# Location of the processed CBOE csv files
proc_dir = r'/Users/alexstephens/data/cboe/proc/1911_SPX'
# Define the cleaned, concatenated pickle file
clean_pkl = r'cboe_spx_1911_clean.pkl'
clean_pkl_file = os.path.join(proc_dir, clean_pkl)
# Define the raw dataframe
raw_pkl = r'cboe_spx_1911_clean_df_raw.pkl'
raw_pkl_file = os.path.join(proc_dir, raw_pkl)
# Define the reduced dataframe
red_pkl = r'cboe_spx_1911_clean_df_red.pkl'
red_pkl_file = os.path.join(proc_dir, red_pkl)
# Define the merged dataframe
mrg_pkl = r'cboe_spx_1911_clean_df_mrg.pkl'
mrg_pkl_file = os.path.join(proc_dir, mrg_pkl)
# Location of the interest rate data
fred_dir = r'/Users/alexstephens/data/fred/IRTCM_csv_2/data'
# Define the interest rate dataframe
ir_pkl = r'fred_ir_01M_03M_01Y.pkl'
ir_pkl_file = os.path.join(fred_dir, ir_pkl)
Pull in the raw CBOE data and the interest rate data we'll merge onto the option data.
# =============================================================================
# Main
# =============================================================================
# Read the .pkl file
df = pd.read_pickle(clean_pkl_file)
# (12762972, 34)
df.shape
# Read the interest rate .pkl file
df_ir = pd.read_pickle(ir_pkl_file)
df_ir.shape
# =============================================================================
# Remove the end-of-day and delivery code columns
# =============================================================================
# Define columns to drop
drop_columns = ['bid_size_eod', 'bid_eod',
'ask_size_eod', 'ask_eod',
'underlying_bid_eod', 'underlying_ask_eod',
'delivery_code',
]
df.drop(drop_columns, axis=1, inplace=True)
# Echo the first four rows (transposed)
df.head(4).transpose()
# =============================================================================
# Add / Modify Columns
# =============================================================================
# Apply upper-case to the option_type field
df['option_type'] = df['option_type'].str.upper()
# Add the day name to ease visualization of the expiration day
df['ex_dayname'] = df['expiration'].dt.day_name()
# Compute the number of days-to-expiration (dte); Use the .dt
# accessor for Series to extract the numerical days to expiry
df['dte'] = (df['expiration'] - df['quote_date']).dt.days
# Compute the underlying mid-price (mup)
df['mup_1545'] = 0.5*(df['underlying_bid_1545'] + df['underlying_ask_1545'])
# Echo the first four rows (transposed)
df.head(4).transpose()
# =============================================================================
# Generate date vectors / cutoffs
# =============================================================================
# Collect min/max quote dates
min_date = df['quote_date'].min()
max_date = df['quote_date'].max()
# Since we may want to compute the VIX up to the last quote day in the
# history, pull expirations that extend out 37 days past the max_date
# (38 days below gives a one-day buffer)
vix_date = max_date + timedelta(days=38)
# =============================================================================
# Sort the option chain
# =============================================================================
# Sort by: quote_date > expiration > strike
df.sort_values(['quote_date', 'expiration', 'strike'],
               ascending=[True, True, True],
               inplace=True)
This step is important because it isolates options that expire on Fridays. This should be simple, but the early history contains options that have a Saturday expiration. In addition, certain holidays land on a Friday; in those cases, the expiration date is moved to the preceding Thursday. So the step below uses the timedelta function to identify the Thursday and Saturday surrounding each expiration Friday.
# =============================================================================
# Isolate friday expirations (+/- 1 day)
# =============================================================================
# daily date vector spanning the quote_date range
quote_range = pd.date_range(start=min_date, end=vix_date, freq='d')
# Monthly date vector spanning the quote_date range (but end-of-month)
eom_range = list(pd.date_range(start=min_date, end=vix_date, freq='M'))
# Create a pandas dataframe using the vector of dates
qr_df = pd.DataFrame(quote_range, columns=['date'])
# Append the day name for readability
qr_df['day_name'] = qr_df['date'].dt.day_name()
# Append the week number
qr_df['week_num'] = qr_df['date'].apply(lambda d: (d.day-1) // 7 + 1)
# Isolate all Fridays of every month in the series (+/- 1 day)
all_fri = qr_df[(qr_df.day_name == 'Friday')].date
all_thu = all_fri + timedelta(-1)
all_sat = all_fri + timedelta(+1)
# Create a list combining all three date lists as a global 'end-of-week' filter
all_eow = sorted(list(all_thu) + list(all_fri) + list(all_sat))
# Identify the 3rd Friday of every month
third_fri = qr_df[(qr_df.day_name == 'Friday') & (qr_df.week_num == 3)].date
third_thu = third_fri + timedelta(-1)
third_sat = third_fri + timedelta(+1)
# Create a list combining all 3rd-week eow dates
third_eow = sorted(list(third_thu) + list(third_fri) + list(third_sat))
# =============================================================================
# Isolate unique expiration dates
# =============================================================================
# Create a pandas dataframe using the vector of expiration dates
ex_date = sorted(df.expiration.unique())
ex_df = pd.DataFrame(pd.to_datetime(ex_date), columns=['ex_date'])
# Create a day number field
ex_df['ex_day_num'] = ex_df['ex_date'].dt.day
# Append the day name for readability
ex_df['ex_day_name'] = ex_df['ex_date'].dt.day_name()
# Append the week number
ex_df['ex_week_num'] = ex_df['ex_date'].apply(lambda d: (d.day-1) // 7 + 1)
Next I make a copy of the data frame and rename some columns to save on screen space.
# =============================================================================
# Define filters below - use a copy of the original data frame
# =============================================================================
# Make a copy of the data frame; shape = (12762972, 30)
df.shape
df_red = df.copy()
df_red.shape
# Rename columns
df_red.rename(columns={'underlying_symbol': 'u',
'trade_volume': 'tvol',
'underlying_bid_1545': 'u_bid_1545',
'underlying_ask_1545': 'u_ask_1545',
'implied_underlying_price_1545': 'iup_1545',
'active_underlying_price_1545': 'aup_1545',
'implied_volatility_1545': 'iv_1545',
'open_interest': 'oi',
}, inplace=True)
Here is where I start to chip away at the raw data frame. In each step I split the data frame into (i) the slimmed-down data set and (ii) the data I exclude, so that I can review the excluded rows if something looks peculiar. In each step I also add the resulting shape as an inline comment so I can track the repeatability of these steps. (As an aside, this keep/exclude pattern could be wrapped in a small helper; see the sketch below.)
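A minimal sketch of such a helper (the split_mask name is mine and does not appear in the script):
# Split a data frame into (kept, excluded) halves given a boolean mask
def split_mask(frame, mask):
    kept = frame[mask]
    excluded = frame[~mask]
    print('kept:', kept.shape, 'excluded:', excluded.shape)
    return kept, excluded
# Example usage (equivalent to the first date filter below):
# df_red, excl_01 = split_mask(df_red, df_red.expiration <= vix_date)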
# =============================================================================
# Apply date-based filters
# =============================================================================
# Retain only expiration cycles that happen before the vix_date
excl_01 = df_red[(df_red.expiration > vix_date)]
df_red = df_red[(df_red.expiration <= vix_date)] # (12181994, 30)
# =============================================================================
# Apply non-standard-root-based filters
# =============================================================================
# Remove low volume roots; Those with fewer than 250 observations
# across the historical window. Adding manually because I want to
# see what is in the data. (Note: These did not appear in the 1905 data)
low_volume_roots = ['AD', 'BJH', 'BNK', 'BQA', 'DFD', 'DFN', 'EOP', 'FSL',
'GEU', 'HBQ', 'HGC', 'IIQ', 'LZX', 'MLS', 'MYB', 'MZS',
'NXL', 'OBI', 'OSA', 'OSZ', 'QGR', 'QJV', 'QLJ', 'QNE',
'RDA', 'RXA', 'RXS', 'SBO', 'SFU', 'SOY', 'SQR', 'SXX',
'TMQ', 'TXF', 'UMO', 'USL', 'UWB', 'VAS', 'VOP', 'WCO',
'XAV', 'XGI', 'YDS', 'YFS', 'YKH', 'YRN', 'YXD', 'YXI',
'YZP', 'ZPO', ]
excl_02 = df_red[(df_red.root.isin(low_volume_roots))]
df_red = df_red[~(df_red.root.isin(low_volume_roots))] # (12180688, 30)
# Remove quarterly expirations
quarterly_roots = ['SPXQ', 'SPXPM', 'SAQ', 'SKQ', 'SLQ', 'QSE', 'QSZ', 'QZQ',
'SQG', 'SQP', 'SZQ', ]
excl_03 = df_red[(df_red.root.isin(quarterly_roots))]
df_red = df_red[~(df_red.root.isin(quarterly_roots))] # (11621334, 30)
# Remove binary options (B*), Variance Strip (VSTRP), Range Option (SRO)
# (Note: These no longer appear in the 1911 data)
binary_options = ['BSZ', 'BSF', 'BSK', 'SRO', 'VSTRP']
excl_04 = df_red[(df_red.root.isin(binary_options))]
df_red = df_red[~(df_red.root.isin(binary_options))] # (11621334, 30)
# =============================================================================
# Isolate all Friday expirations (+/- 1 day)
# =============================================================================
# Isolate all Friday expirations (+/- 1 day)
excl_05 = df_red[~df_red['expiration'].isin(all_eow)]
df_red = df_red[df_red['expiration'].isin(all_eow)] # (10019993, 30)
# =============================================================================
# Exclude weeklies with a PM-settlement on the third Friday of the month
# =============================================================================
# Remove the PM expirations that occur on the third Friday of the month
weekly_roots = ['JXA', 'JXB', 'JXD', 'JXE', 'SPXW']
excl_06 = df_red[(df_red['expiration'].isin(third_eow) &
df_red['root'].isin(weekly_roots))]
df_red = df_red[~(df_red['expiration'].isin(third_eow) &
df_red['root'].isin(weekly_roots))] # (9560129, 30)
# =============================================================================
# Exclude weeklies with an end-of-month settlement date on a Thursday
# =============================================================================
# Remove the weeklies that land on EOM & Thursday
excl_07 = df_red[((df_red['expiration'].isin(eom_range)) &
(df_red['ex_dayname'] == 'Thursday'))]
df_red = df_red[~((df_red['expiration'].isin(eom_range)) &
(df_red['ex_dayname'] == 'Thursday'))] # (9318255, 30)
# =============================================================================
# Remove options where the number of days-to-expiry are > 360
# =============================================================================
# Remove options with an expiry > 1 year (~ 360 days)
excl_08 = df_red[(df_red['dte'] > 360)]
df_red = df_red[(df_red['dte'] <= 360)] # (8291264, 30)
This step is a little different. I use the groupby() method to tally the number of rows for each combination of 'quote_date', 'expiration', 'option_type', 'strike', and days-to-expiration ('dte'). In theory, we should have one row for each combination. However, there are a couple of collisions in the data: it appears that multiple roots can show up for the same combination. This will require some additional cleansing to de-duplicate these rows (one possible rule is sketched after the tally below).
# =============================================================================
# Review of duplicates (quote_date, expiration, option_type, strike)
# =============================================================================
# First isolate the columns to group by
dd = df_red[['quote_date', 'expiration', 'option_type',
'strike', 'dte', 'root']]
# Isolate the unique (non-root) columns
uniq_dt = dd.groupby(['quote_date', 'expiration', 'option_type',
'strike', 'dte']).size()
# Fill-in the group-by matrix
uniq_dt = uniq_dt.reset_index().rename(columns={0: 'cnt'})
uniq_dt = uniq_dt[uniq_dt.cnt > 1]
# Save results to an offline file
uniq_csv = r'xtab_duplicate_rows.csv'
uniq_csv_file = os.path.join(proc_dir, uniq_csv)
uniq_dt.to_csv(uniq_csv_file) # 1694 rows
# Print the head() of the options exhibiting multiple rows
uniq_dt.head(10)
# Print the tail() of the options exhibiting multiple rows
uniq_dt.tail(10)
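When I get to that de-duplication, one possible rule (a sketch only; the script does not do this yet) would be to keep, within each duplicated combination, the row with the largest open interest:
# Possible de-duplication rule (sketch): within each
# (quote_date, expiration, option_type, strike, dte) group,
# keep the row with the largest open interest ('oi')
dedup_cols = ['quote_date', 'expiration', 'option_type', 'strike', 'dte']
df_dedup = (df_red.sort_values('oi', ascending=False)
                  .drop_duplicates(subset=dedup_cols, keep='first'))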
In this step I exploit the fact that (apart from the small number of examples shown above) each combination of 'quote_date', 'expiration', 'strike', and 'root' should have two entries in the data: one is the call option associated with that combination of attributes, and the other is the put option.
To be able to identify each call and put, I construct a join/merge key from those fields. This may not be the most efficient method, but I'm essentially creating a string that is the concatenation of each row's 'quote_date', 'expiration', 'strike', and 'root'. I then use this key to merge the call and put data onto a single row in the history (along with a set of common columns that are not option-specific). (A more compact way to build the same key is sketched after the code below.)
# =============================================================================
# Reshape the reduced dataframe into one where the call and put data for a
# given (quote_date, expiration, root, strike) are on the same row
# =============================================================================
# =============================================================================
# Build join key(s)
# =============================================================================
# quote_date string
qd_dd = [str(day) if (day > 9) else ('0' + str(day))
for day in df_red['quote_date'].dt.day]
qd_mm = [str(month) if (month > 9) else ('0' + str(month))
for month in df_red['quote_date'].dt.month]
qd_yy = df_red['quote_date'].dt.year.map(str)
# key string
qd_str = (qd_yy + qd_mm + qd_dd)
# expiration string
ex_dd = [str(day) if (day > 9) else ('0' + str(day))
for day in df_red['expiration'].dt.day]
ex_mm = [str(month) if (month > 9) else ('0' + str(month))
for month in df_red['expiration'].dt.month]
ex_yy = df_red['expiration'].dt.year.map(str)
# strike * 100 string
strike_str = (df_red.strike*100)
strike_str = strike_str.astype(int).astype(str)
# root string
root_str = df_red.root.astype(str)
# key string
key_str = (qd_yy + qd_mm + qd_dd + "_" + ex_yy + ex_mm + ex_dd + "_" +
strike_str + "_" + root_str)
# Append the key(s) to the data frame via an assign()
df_red = df_red.assign(key_str=list(key_str))
df_red = df_red.assign(qd_str=list(qd_str))
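As a side note, the same keys can be built more compactly with the dt.strftime accessor (a sketch, assuming the 'quote_date' and 'expiration' columns are datetime64, which they are here):
# Equivalent, more compact key construction using strftime
qd_str_alt = df_red['quote_date'].dt.strftime('%Y%m%d')
ex_str_alt = df_red['expiration'].dt.strftime('%Y%m%d')
key_str_alt = (qd_str_alt + "_" + ex_str_alt + "_" +
               (df_red['strike']*100).astype(int).astype(str) + "_" +
               df_red['root'].astype(str))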
In theory each combo of 'quote_date', 'expiration', 'strike', and 'root' should have 2 entries in the data. However, I need to check for cases where a key appears only once. I'll prune those 'stragglers' from the data.
# =============================================================================
# Check for straggler rows
# =============================================================================
# Check that each key has both a call and a put entry
key_check = df_red.groupby(['key_str']).size()
key_check.reset_index().rename(columns={0: 'Freq'})
# Isolate the keys with single entries
single_keys = list(key_check[key_check < 2].index) # 156
# =============================================================================
# Exclude straggler rows
# =============================================================================
excl_09 = df_red[(df_red.key_str.isin(single_keys))]
df_red = df_red[~(df_red.key_str.isin(single_keys))] # (8291108, 32)
df_red.reset_index(inplace=True)
# Print some examples of keys that appear only once. (Note that the
# strike was multiplied by 100 when creating the key.)
single_keys[:10]
Here is where I join historical interest rate data onto the CBOE file, using the simple bucketing scheme described in the comments below. I've appended these data to enable a later VIX-style calculation for each option chain in the history. I will describe the sourcing of these data in a later post. (An equivalent formulation of the bucketing is sketched after the code below.)
# =============================================================================
# Merge interest rates onto file
# =============================================================================
# Assign an interest rate to each row using a simple bucketing scheme. If the
# days-to-expiration (dte) are [0, 60] then use the 1-month rate. If the dte
# are (60, 120] then use the 3-month rate. If the dte are (120, 360] then use
# the 1-year rate.
# Merge the interest rate data
df_red = df_red.merge(df_ir[['qd_str', 'R01M_cln', 'R03M_cln', 'R01Y_cln']],
how='left', suffixes=('', ''),
left_on='qd_str', right_on='qd_str')
# initialize the column
df_red = df_red.assign(ir=0.00)
# assign the interest rate
df_red.loc[(df_red.dte <= 60), 'ir'] = df_red['R01M_cln']
df_red.loc[(df_red.dte > 60) & (df_red.dte <= 120), 'ir'] = df_red['R03M_cln']
df_red.loc[(df_red.dte > 120) & (df_red.dte <= 360), 'ir'] = df_red['R01Y_cln']
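The same bucketing can also be expressed as a single numpy.select call (a sketch; it assumes numpy is imported as np):
import numpy as np
# dte-based rate bucketing expressed with np.select
conditions = [
    (df_red.dte <= 60),
    (df_red.dte > 60) & (df_red.dte <= 120),
    (df_red.dte > 120) & (df_red.dte <= 360),
]
choices = [df_red['R01M_cln'], df_red['R03M_cln'], df_red['R01Y_cln']]
df_red['ir'] = np.select(conditions, choices, default=0.00)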
Here I split the data into calls and puts.
# =============================================================================
# Isolate call-only and put-only data frames
# =============================================================================
# Note: After the filter, the number of call and put rows is equal
df_c = df_red[(df_red.option_type == 'C')].copy() # (4145554, 32)
df_p = df_red[(df_red.option_type == 'P')].copy() # (4145554, 32)
# Sort and reset the index of each call & put data frame
# Sort by: quote_date > expiration > strike > root
df_c.sort_values(['quote_date', 'expiration', 'strike', 'root'],
                 ascending=[True, True, True, True],
                 inplace=True)
df_c.reset_index(inplace=True)
df_p.sort_values(['quote_date', 'expiration', 'strike', 'root'],
                 ascending=[True, True, True, True],
                 inplace=True)
df_p.reset_index(inplace=True)
I now define common columns (those that are not option-specific) and option-specific columns. I'll join the common columns, the call data, and the put data using the key described above. This should result in a data frame where each row contains both call and put prices for a given combination of 'quote_date', 'expiration', 'strike', and 'root'.
# =============================================================================
# Merge the call and put data onto single rows
# - Define common columns & option-specific columns (+ the key)
# - Isolate call- and put-only common columns and option-specific columns
# - Merge the option-specific data first (with '_c' and '_p' suffixes)
# - Merge the common data and the option-specific data
# =============================================================================
# Define common columns (those not specific to a given call/put option)
common_cols = ['u', 'quote_date', 'root', 'expiration', 'strike',
'u_bid_1545', 'u_ask_1545', 'mup_1545',
'iup_1545', 'aup_1545',
'ex_dayname', 'dte', 'ir', 'key_str']
# Define columns specific to a given call/put option
option_cols = ['option_type', 'open', 'high', 'low', 'close', 'tvol',
'bid_size_1545', 'bid_1545',
'ask_size_1545', 'ask_1545', 'iv_1545', 'delta_1545',
'gamma_1545', 'theta_1545', 'vega_1545', 'rho_1545', 'vwap',
'oi', 'key_str']
# Pull the call- and put-specific common and option data
df_c_common = df_c[common_cols]
df_p_common = df_p[common_cols]
df_c_option = df_c[option_cols]
df_p_option = df_p[option_cols]
# Merge the option-specific data first
df_mrg_option = df_c_option.merge(df_p_option, how='outer',
suffixes=('_c', '_p'),
left_on='key_str',
right_on='key_str')
# Then merge the option-specific data onto the common columns
df_mrg = df_c_common.merge(df_mrg_option, how='outer',
left_on='key_str', right_on='key_str')
# reindex
df_mrg.reset_index(inplace=True)
Here is what the resulting data frame looks like.
# Echo the first four rows (transposed)
df_mrg.head(4).transpose()
# Echo the last four rows (transposed)
df_mrg.tail(4).transpose()
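A couple of quick sanity checks worth running at this point (a sketch): if each key really maps to exactly one call and one put, the merged frame should have one row per key and the same number of rows as the call-only frame.
# Quick sanity checks on the merged frame
print(df_mrg['key_str'].is_unique)   # should be True if each key has one call and one put
print(len(df_mrg) == len(df_c))      # should be True for the same reason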
Save the results.
# =============================================================================
# Save the various output files
# =============================================================================
df_raw = df.copy()
df_raw.to_pickle(raw_pkl_file)
df_red.to_pickle(red_pkl_file)
df_mrg.to_pickle(mrg_pkl_file)