CBOE Data Preprocessing (Review)
Summary
The purpose of this brief exercise is to summarize the preprocessed data. I want to review the November (aka '1911') data along the dimensions that yielded peculiar results in the May (or '1905') data.
Of primary interest is the distribution of the SPX option roots, i.e., the set of character fields that identify a particular option series. The 1905 data contained a lot of "odd" roots: some were associated with binary options and variance strip options, while others appeared only a handful of times in the history.
The point is that a detailed, offline review of what the 1911 data contain is important. This script generates several files that will assist in that review.
Based on the initial results echoed below, there are some interesting findings. Namely:
- Several of the float fields were updated. In the May data, there was no difference between the underlying bid and ask prices. Those prices now exhibit differences, as expected (see the sanity-check sketch after this list).
- The Volume-Weighted Average Price (VWAP) no longer contains off-the-wall numeric entries (i.e., numbers like 2e+294), so this field appears to have been cleaned up.
- The Delivery Code still doesn't appear to contain any useful information.
- Perhaps most importantly, the set of roots appears to have changed slightly. The binary and variance strip options no longer appear in the history, but there are a handful of option roots that (i) appear infrequently and (ii) appear (as far as I can tell) to be unrelated to the known set of SPX roots. For example, the root 'DFD' appears a total of 6 times.
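As a quick cross-check of the first three items, something along the following lines can be run against the cleaned pickle. This is a minimal sketch only: the column names ('underlying_bid', 'underlying_ask', 'vwap', 'delivery_code') are my guesses at the cleaned file's schema, not confirmed names.
# Sanity-check sketch for the list above. The column names ('underlying_bid',
# 'underlying_ask', 'vwap', 'delivery_code') are assumed, not confirmed.
import pandas as pd

df = pd.read_pickle('/Users/alexstephens/data/cboe/proc/1911_SPX/cboe_spx_1911_clean.pkl')

# (1) The underlying bid/ask prices should now differ
spread = df['underlying_ask'] - df['underlying_bid']
print('Share of quotes with a positive underlying spread:', (spread > 0).mean())

# (2) Flag any residual off-the-wall VWAP entries (numbers like 2e+294)
print('Extreme VWAP rows:', (df['vwap'].abs() > 1e6).sum())

# (3) A delivery code with a single distinct value carries no information
print('Distinct delivery codes:', df['delivery_code'].nunique())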
Below I present only a rough, initial look at the data. In a subsequent post, I will summarize the results further and focus on the new set of roots in the historical data.
# Import modules
import os
import sys
import pandas as pd
# Import local modules
local_module_path = '/Users/alexstephens/Development/Python/Option_Data_Analysis/modules'
sys.path.append(local_module_path)
import cboe
# =============================================================================
# Set-up
# =============================================================================
# Alter pandas display settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 500)
# Location of the processed CBOE csv files
proc_dir = r'/Users/alexstephens/data/cboe/proc/1911_SPX'
# Define the cleaned, concatenated pickle file
clean_pkl = r'cboe_spx_1911_clean.pkl'
clean_pkl_file = os.path.join(proc_dir, clean_pkl)
# Read the .pkl file
df = pd.read_pickle(clean_pkl_file)
# =============================================================================
# Create cross-tabs for offline review
# =============================================================================
cboe.xtab_to_csv(df.quote_date,
                 df.root,
                 os.path.join(proc_dir, r'xtab_quote_date_by_root.csv'))

cboe.xtab_to_csv(df.expiration,
                 df.root,
                 os.path.join(proc_dir, r'xtab_expiration_by_root.csv'))
# =============================================================================
# Compute variable summaries for offline review
# =============================================================================
(flt_sum, flt_dict,
 str_sum, str_dict) = cboe.check_df_variable_range(data=df,
                                                   ctypes=cboe.col_types)
# Write the float summary to a file
flt_sum.transpose().to_csv(os.path.join(proc_dir, r'xtab_float_range.csv'))
# Note: the 'check_df_variable_range' function also prints a summary of
# each string field
# Print the string data summary
print('-' * 80)
print(str_sum.transpose())
print('-' * 80)
# Print the numeric data summary (a subset of quantiles)
print('-' * 80)
print(flt_sum.loc[['count', 'min', '5%', '50%', '95%', 'max']].transpose())
print('-' * 80)
# =============================================================================
# Review roots
# =============================================================================
# Write a cross tabulation of all roots to a file
cboe.xtab_to_csv(df.root,
                 df.option_type,
                 os.path.join(proc_dir, r'xtab_type_by_root.csv'))
# Examine the counts of each root by creating a cross tabulation and
# then isolating those with fewer than 250 observations in the history
# Passing a scalar as the second argument broadcasts it into a constant
# column, so the "cross tab" collapses to a count of observations per root
root_xtab = pd.crosstab(df.root, 'Dummy_Column')
# Define roots that have low / high volume (or counts) in the history
low_vol_roots = root_xtab[(root_xtab.Dummy_Column < 250)]
hi_vol_roots = root_xtab[(root_xtab.Dummy_Column >= 250)]
# Print the root crosstab
print(root_xtab)
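To dig into the infrequent roots (entries like 'DFD' above), a natural follow-up is to pull the raw rows behind the low-count roots. Below is a minimal sketch that reuses the low_vol_roots table defined above; it assumes nothing beyond the columns already used in this script.
# Sketch: inspect the raw rows behind the low-count roots
odd_rows = df[df.root.isin(low_vol_roots.index)]
print(odd_rows.groupby('root').quote_date.agg(['count', 'min', 'max']))
The count column should reproduce the Dummy_Column counts above, and the min/max quote dates give a first hint of when each odd root was active.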
Note: the two functions below are new and have been subsumed into the cboe module.
def xtab_to_csv(var1, var2, output_file):
    """
    Purpose:
        Take two DataFrame variables and do a cross tab. Then dump
        the results in a csv file for offline review.
    Input:
        var1:        A DataFrame variable
        var2:        A DataFrame variable
        output_file: The fully qualified name of the output file
    Output:
        A csv file (written to output_file) containing the cross tab
    """
    xtab = pd.crosstab(var1, var2)
    xtab.to_csv(output_file)
def check_df_variable_range(data, ctypes):
    """
    Purpose:
        Take a DataFrame and compute summary statistics for each column
        based on the type of the column (float, string)
    Input:
        data:   A DataFrame housing historical option chain data
        ctypes: A dictionary mapping column names to data types
    Output:
        float_summary:  A DataFrame summarizing quantiles (for floats)
        float_dict:     The raw column-level quantile data (for floats)
        string_summary: A DataFrame of count/unique/top/freq (for strings)
        string_dict:    The raw column-level summary data (for strings)
    """
    # Define quantiles
    qtiles = [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]

    # Loop over each column and summarize it according to its type
    float_dict = dict()
    string_dict = dict()
    for key, value in ctypes.items():
        if value == float:
            float_dict[key] = data[key].describe(percentiles=qtiles)
        elif value == str:
            string_dict[key] = data[key].describe()
        else:
            raise ValueError('Unexpected column type for {}'.format(key))

    # Create a string summary table (printing each column as we go)
    key_count = 0
    for key, value in string_dict.items():
        print('-' * 80)
        print('Printing summary for column {}'.format(key))
        print(string_dict[key])
        print('-' * 80)
        if key_count == 0:
            string_summary = pd.DataFrame(string_dict[key])
            key_count += 1
        else:
            string_summary = pd.merge(string_summary,
                                      pd.DataFrame(string_dict[key]),
                                      left_index=True,
                                      right_index=True,
                                      suffixes=('', '2'))

    # Create a float summary table
    key_count = 0
    for key, value in float_dict.items():
        if key_count == 0:
            float_summary = pd.DataFrame(float_dict[key])
            key_count += 1
        else:
            float_summary = pd.merge(float_summary,
                                     pd.DataFrame(float_dict[key]),
                                     left_index=True,
                                     right_index=True,
                                     suffixes=('', '2'))

    return float_summary, float_dict, string_summary, string_dict