Sunday, November 17, 2019

011 - CBOE Data Preprocessing (Review)

11_cboe_data_processing_review

CBOE Data Preprocessing (Review)

Summary

The purpose of this brief exercise is to summarize the preprocessed data. I want to review the November data (aka the '1911' data) along the dimensions that yielded peculiar results in the May (or 1905) data.

Of primary interest is the distribution of the SPX option roots, i.e., the set of character fields that identify a particular option series. In the 1905 data, there were a lot of "odd" roots: some were associated with binary options and variance strip options, while others appeared only a handful of times in the history.

The point is that it is important to do a detailed, offline review of what the 1911 data contain. This script generates several files that will assist in this review.

Based on the initial results echoed below, there are some interesting findings. Namely:

  1. Several of the float fields were updated. In the May data, there was no difference between the underlying bid/ask prices. Now those prices exhibit differences (as expected).
  2. The Volume Weighted-Average Price (VWAP) no longer contains off-the-wall numeric entries (i.e., numbers like 2e+294), so this field appears to have been cleaned up.
  3. The Delivery Code still doesn't appear to contain any useful information.
  4. Perhaps most importantly, the set of roots appears to have changed slightly. The binary and variance strip options no longer appear in the history, but there are a handful of option roots that (i) appear infrequently and (ii) appear, at first glance, to be unrelated to the known set of SPX roots. For example, the root 'DFD' appears a total of 6 times.
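A quick way to surface such infrequent roots is a simple frequency count. A minimal sketch on toy data (the root values and the cutoff of 10 are illustrative only, not taken from the actual file):

```python
import pandas as pd

# Toy root column standing in for df.root (values are illustrative only)
roots = pd.Series(['SPX'] * 100 + ['SPXW'] * 80 + ['DFD'] * 6 + ['NXL'] * 2)

# Count occurrences of each root and flag those below an arbitrary cutoff
counts = roots.value_counts()
rare = counts[counts < 10]
print(rare)
```

On the real data, the index of `rare` is the set of roots that would warrant the detailed offline review described above.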

Below I present only a rough, initial look at the data. In a subsequent post, I will summarize the results further and focus on the new set of roots in the historical data.

In [19]:
# Import modules
import os
import sys
import pandas as pd

# Import local modules
local_module_path = '/Users/alexstephens/Development/Python/Option_Data_Analysis/modules'
sys.path.append(local_module_path)
import cboe

# =============================================================================
# Set-up
# =============================================================================

# Alter pandas display settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 500)

# Location of the processed CBOE csv files
proc_dir = r'/Users/alexstephens/data/cboe/proc/1911_SPX'

# Define the cleaned, concatenated pickle file
clean_pkl = r'cboe_spx_1911_clean.pkl'
clean_pkl_file = os.path.join(proc_dir, clean_pkl)

# Read the .pkl file
df = pd.read_pickle(clean_pkl_file)
In [20]:
# =============================================================================
# Create cross-tabs for offline review
# =============================================================================
cboe.xtab_to_csv(df.quote_date,
                 df.root,
                 os.path.join(proc_dir, r'xtab_quote_date_by_root.csv'))

cboe.xtab_to_csv(df.expiration,
                 df.root,
                 os.path.join(proc_dir, r'xtab_expiration_by_root.csv'))
In [21]:
# =============================================================================
# Compute variable summaries for offline review
# =============================================================================
(flt_sum, flt_dict,
 str_sum, str_dict) = cboe.check_df_variable_range(data=df,
                                                   ctypes=cboe.col_types)
# Write the float summary to a file
flt_sum.transpose().to_csv(os.path.join(proc_dir, r'xtab_float_range.csv'))

# Note: the 'check_df_variable_range' function prints a summary of the string
# fields
--------------------------------------------------------------------------------
Printing summary for column underlying_symbol
count     12762972
unique           1
top           ^SPX
freq      12762972
Name: underlying_symbol, dtype: object
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Printing summary for column quote_date
count                12762972
unique                   3626
top       2018-03-26 00:00:00
freq                    13700
first     2004-01-02 00:00:00
last      2018-05-29 00:00:00
Name: quote_date, dtype: object
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Printing summary for column root
count     12762972
unique          91
top           SPXW
freq       6003969
Name: root, dtype: object
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Printing summary for column expiration
count                12762972
unique                    960
top       2017-12-15 00:00:00
freq                   206695
first     2004-01-17 00:00:00
last      2020-12-18 00:00:00
Name: expiration, dtype: object
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Printing summary for column option_type
count     12762972
unique           2
top              C
freq       6381514
Name: option_type, dtype: object
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Printing summary for column delivery_code
count     0
unique    0
Name: delivery_code, dtype: int64
--------------------------------------------------------------------------------
In [22]:
# Print the string data summary
print('-' * 80)
print(str_sum.transpose())
print('-' * 80)
--------------------------------------------------------------------------------
                      count unique
underlying_symbol  12762972      1
quote_date         12762972   3626
root               12762972     91
expiration         12762972    960
option_type        12762972      2
delivery_code             0      0
--------------------------------------------------------------------------------
In [23]:
# Print the numeric data summary (a subset of quantiles)
print('-' * 80)
print(flt_sum.loc[['count', 'min', '5%', '50%', '95%', 'max']].transpose())
print('-' * 80)
--------------------------------------------------------------------------------
                                    count           min           5%        50%          95%           max
strike                         12762972.0  5.000000e+00   700.000000  1710.0000   2690.00000  4.850000e+04
open                           12762972.0  0.000000e+00     0.000000     0.0000     52.40000  4.482300e+03
high                           12762972.0  0.000000e+00     0.000000     0.0000     54.75000  9.105000e+03
low                            12762972.0  0.000000e+00     0.000000     0.0000     50.40000  2.751000e+03
close                          12762972.0  0.000000e+00     0.000000     0.0000     52.65000  2.751000e+03
trade_volume                   12762972.0 -2.075096e+09     0.000000     0.0000    650.00000  3.025715e+09
bid_size_1545                  12762972.0 -5.100000e+01     0.000000    55.0000    486.00000  2.040400e+04
bid_1545                       12762972.0  0.000000e+00     0.000000    51.4000    861.70000  7.800000e+03
ask_size_1545                  12762972.0  0.000000e+00     1.000000    75.0000    606.00000  3.366500e+04
ask_1545                       12762972.0  0.000000e+00     0.100000    53.9000    869.70000  3.952600e+03
underlying_bid_1545            12762972.0  6.743500e+02  1097.860000  2057.8500   2711.68000  2.868850e+03
underlying_ask_1545            12762972.0  6.749600e+02  1098.260000  2058.2500   2712.24000  2.869430e+03
implied_underlying_price_1545  12762972.0  0.000000e+00  1077.130500  2045.1187   2700.96280  2.869604e+03
active_underlying_price_1545   12762972.0  6.752100e+02  1098.020000  2058.2500   2711.95000  2.869190e+03
implied_volatility_1545        12762972.0  0.000000e+00     0.020000     0.2042      0.66360  3.998010e+01
delta_1545                     12762972.0 -1.000000e+00    -0.981800     0.0000      1.00000  1.000000e+00
gamma_1545                     12762972.0  0.000000e+00     0.000000     0.0004      0.00470  9.627000e-01
theta_1545                     12762972.0 -4.641463e+02    -0.523400    -0.1018      0.00000  0.000000e+00
vega_1545                      12762972.0  0.000000e+00     0.000000     0.5355      5.85130  1.871020e+01
rho_1545                       12762972.0 -3.177816e+04  -671.032845     0.0000    822.61746  4.381029e+03
bid_size_eod                   12762972.0 -6.400000e+01     0.000000    44.0000    450.00000  2.040400e+04
bid_eod                        12762972.0  0.000000e+00     0.000000    51.2000    861.30000  1.873089e+07
ask_size_eod                   12762972.0  0.000000e+00     1.000000    51.0000    555.00000  2.951100e+04
ask_eod                        12762972.0  0.000000e+00     0.100000    54.0000    869.90000  1.367453e+04
underlying_bid_eod             12762972.0  6.756900e+02  1096.190000  2056.2800   2682.14000  2.835400e+03
underlying_ask_eod             12762972.0  6.763600e+02  1097.240000  2060.3100   2735.12000  2.898180e+03
vwap                           12762972.0  0.000000e+00     0.000000     0.0000     52.40000  4.738666e+03
open_interest                  12762972.0  0.000000e+00     0.000000    22.0000  17187.00000  1.745682e+09
--------------------------------------------------------------------------------
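One item worth flagging in the table above: trade_volume and the bid_size fields have negative minimums, which should be impossible for count fields (and the magnitude of the trade_volume minimum suggests an integer overflow somewhere upstream). A minimal sketch of the kind of check I would run, on toy data (the values are illustrative, not drawn from the actual file):

```python
import pandas as pd

# Toy frame standing in for the cleaned data (values are illustrative)
df = pd.DataFrame({'root': ['SPX', 'SPXW', 'SPX'],
                   'trade_volume': [650, -2075096000, 0]})

# Count fields should never be negative; isolate the offending rows
bad = df[df.trade_volume < 0]
print(bad)
```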
In [24]:
# =============================================================================
# Review roots
# =============================================================================
# Write a cross tabulation of all roots to a file
cboe.xtab_to_csv(df.root,
                 df.option_type,
                 os.path.join(proc_dir, r'xtab_type_by_root.csv'))

# Examine the counts of each root by creating a cross tabulation and
# then isolating those with fewer than 250 observations in the history
root_xtab = pd.crosstab(df.root, 'Dummy_Column')

# Define roots that have low / high volume (or counts) in the history
low_vol_roots = root_xtab[(root_xtab.Dummy_Column < 250)]
hi_vol_roots = root_xtab[(root_xtab.Dummy_Column >= 250)]
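The low-count table can then be used to screen the full history down to the well-populated roots. A minimal sketch on toy data (using value_counts in place of the dummy-column crosstab; the roots shown are illustrative):

```python
import pandas as pd

# Toy data: 'DFD' is a low-count root, 'SPX' a high-count one (illustrative)
df = pd.DataFrame({'root': ['SPX'] * 300 + ['DFD'] * 6})

# Count observations per root and split at the 250-observation cutoff
counts = df.root.value_counts()
low_vol_roots = counts[counts < 250]

# Drop rows whose root falls in the low-count set
df_common = df[~df.root.isin(low_vol_roots.index)]
print(sorted(df_common.root.unique()))  # → ['SPX']
```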
In [25]:
# Print the root crosstab
print(root_xtab)
col_0  Dummy_Column
root               
AD                6
BJH              40
BNK               6
BQA              18
DFD               6
DFN              42
EOP              18
FSL               8
GEU              12
HBQ               6
HGC             106
IIQ               4
JXA            5422
JXB            5774
JXD            5439
JXE            1962
LZX               8
MLS               8
MYB              52
MZS              96
NXL               2
OBI               4
OSA              56
OSZ              58
QGR               6
QJV              62
QLJ               6
QNE              16
QSE           19029
QSZ           13327
QZQ           15597
RDA               4
RXA              32
RXS              96
SAQ           18101
SBO             114
SFU               2
SKQ            6167
SLQ           17436
SOY              90
SPB           33955
SPL           38850
SPQ          126108
SPT          158569
SPV           16704
SPX         4789130
SPXPM         58688
SPXQ         365596
SPXW        6003969
SPZ           79135
SQG            2756
SQP           15610
SQR               2
SVP           24212
SXB          108621
SXG           64010
SXJ            3642
SXM           66698
SXX              46
SXY          130642
SXZ           98682
SYF            6582
SYG           63076
SYU           31748
SYV            4926
SYZ           26164
SZD           12248
SZJ           35537
SZP          162065
SZQ           27047
SZT           56022
SZU           12248
SZV           30172
TMQ              18
TXF              14
UMO               2
USL               6
UWB              10
VAS               6
VOP              12
WCO               2
XAV             112
XGI              52
YDS               8
YFS               4
YKH               2
YRN              12
YXD               2
YXI               6
YZP               2
ZPO               4

Note: The two functions below are new and have been subsumed into the cboe module

In [26]:
def xtab_to_csv(var1, var2, output_file):
    """
    Purpose:
        Take two DataFrame variables, compute a cross tabulation, and dump
        the results to a csv file for offline review.
    Input:
        var1: A DataFrame variable
        var2: A DataFrame variable
        output_file: The fully qualified name of the output csv file
    Output:
        None; the cross tabulation is written to output_file
    """
    xtab = pd.crosstab(var1, var2)
    xtab.to_csv(output_file)


def check_df_variable_range(data, ctypes):
    """
    Purpose:
        Take a DataFrame, and compute summary statistics for each column
        based on the type of column (float, string)
    Input:
        data: A DataFrame housing historical option chain data
        ctypes: A dictionary mapping column names to data types
    Output:
        float_summary: A DataFrame of quantile summaries (for floats)
        float_dict: The raw column-level quantile data (for floats)
        string_summary: A DataFrame of count/unique/top/freq summaries
        string_dict: The raw column-level summary data (for strings)
    """
    # Define quantiles
    qtiles = [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]

    # Loop over each column & print the summary info
    float_dict = dict()
    string_dict = dict()
    for key, value in ctypes.items():
        if (value == float):
            float_dict[key] = data[key].describe(percentiles=qtiles)
        elif (value == str):
            string_dict[key] = data[key].describe()
        else:
            raise ValueError('Unexpected column type for {}'.format(key))

    # Create a string summary table
    key_count = 0
    for key, value in string_dict.items():
        print('-' * 80)
        print('Printing summary for column {}'.format(key))
        print(string_dict[key])
        print('-' * 80)
        if key_count == 0:
            string_summary = pd.DataFrame(string_dict[key])
            key_count += 1
        else:
            string_summary = pd.merge(string_summary,
                                      pd.DataFrame(string_dict[key]),
                                      left_index=True,
                                      right_index=True,
                                      suffixes=('', '2'))

    # Create a float summary table
    key_count = 0
    for key, value in float_dict.items():
        if key_count == 0:
            float_summary = pd.DataFrame(float_dict[key])
            key_count += 1
        else:
            float_summary = pd.merge(float_summary,
                                     pd.DataFrame(float_dict[key]),
                                     left_index=True,
                                     right_index=True,
                                     suffixes=('', '2'))

    return float_summary, float_dict, string_summary, string_dict
