Friday, October 25, 2019

002 - CBOE Data Type Check Review

02_cboe_type_check

Checking the CBOE Historical Option Chain Data

Overview

In the last post, I concatenated all of the CBOE historical data files into a single .csv file.

I know from previous attempts to read the files that there are some noisy entries that I need to scrub from the data. I don't recall what form the noise takes, so in this notebook I will systematically review each (non-string) element.

I will work under the assumption that the field types listed in the previous post apply to all entries in a given field.

So the strategy for this script will be to scan each (non-string) field and confirm that it is either an integer or a float.
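The core of that check is just attempting the conversion and catching the failure. As a minimal sketch of the idea (the helper name below is mine; the real check lives inside check_line() further down):

def is_valid_float(value):
    # Return True if the string converts cleanly to a float
    try:
        float(value)
        return True
    except ValueError:
        return False

print(is_valid_float('1.25'), is_valid_float('N/A'))   # True False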

In [1]:
# Import modules
import os

# Directory & file that houses the concatenated CBOE csv data
proc_dir = r"/Users/alexstephens/data/cboe/proc"
proc_csv = r"cboe_mmyy_all.csv"
proc_file = os.path.join(proc_dir, proc_csv)

Ultimately, the objects listed below will be embedded in a module. For now, I explicitly define the column names and column data types. I define the column names as a list, and the column data types as a dictionary.

Update: I've changed the column names from a list to a tuple to avoid the chance that the list is somehow modified when it is passed to a function.
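As a quick toy illustration of that design choice (not part of the processing code): a list passed into a function can be mutated in place, whereas a tuple cannot.

def sloppy(cols):
    # in-place mutation: silently changes the caller's list, but fails on a tuple
    cols.append('oops')

names_list = ['a', 'b', 'c']
sloppy(names_list)
print(names_list)              # ['a', 'b', 'c', 'oops'] -- the caller's list changed

names_tuple = ('a', 'b', 'c')
try:
    sloppy(names_tuple)
except AttributeError:
    print('tuples have no append(), so accidental mutation is not possible')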

In [2]:
# CBOE column names
col_names = (
         'underlying_symbol',
         'quote_date',
         'root',
         'expiration',
         'strike',
         'option_type',
         'open',
         'high',
         'low',
         'close',
         'trade_volume',
         'bid_size_1545',
         'bid_1545',
         'ask_size_1545',
         'ask_1545',
         'underlying_bid_1545',
         'underlying_ask_1545',
         'implied_underlying_price_1545',
         'active_underlying_price_1545',
         'implied_volatility_1545',
         'delta_1545',
         'gamma_1545',
         'theta_1545',
         'vega_1545',
         'rho_1545',
         'bid_size_eod',
         'bid_eod',
         'ask_size_eod',
         'ask_eod',
         'underlying_bid_eod',
         'underlying_ask_eod',
         'vwap',
         'open_interest',
         'delivery_code',
)

# Assumed CBOE column types
col_types = {
         'underlying_symbol': str,                      # String
         'quote_date': str,                             # Date
         'root': str,                                   # String
         'expiration': str,                             # Date
         'strike': float,                               # Float
         'option_type': str,                            # String
         'open': float,                                 # Float
         'high': float,                                 # Float
         'low': float,                                  # Float
         'close': float,                                # Float
         'trade_volume': float,                         # Float
         'bid_size_1545': float,                        # Float
         'bid_1545': float,                             # Float
         'ask_size_1545': float,                        # Float
         'ask_1545': float,                             # Float
         'underlying_bid_1545': float,                  # Float
         'underlying_ask_1545': float,                  # Float
         'implied_underlying_price_1545': float,        # Float
         'active_underlying_price_1545': float,         # Float
         'implied_volatility_1545': float,              # Float
         'delta_1545': float,                           # Float
         'gamma_1545': float,                           # Float
         'theta_1545': float,                           # Float
         'vega_1545': float,                            # Float
         'rho_1545': float,                             # Float
         'bid_size_eod': float,                         # Float
         'bid_eod': float,                              # Float
         'ask_size_eod': float,                         # Float
         'ask_eod': float,                              # Float
         'underlying_bid_eod': float,                   # Float
         'underlying_ask_eod': float,                   # Float
         'vwap': float,                                 # Float
         'open_interest': float,                        # Float
         'delivery_code': str,                          # String
}

For now, I will assume that any field categorized as a string is just a string. The fields I really need to check are the float fields. Any deviation from the float type needs to be identified and the location needs to be recorded.

The function defined below attempts to do this by looking at each element and type checking the non-string fields. There has to be a better way to do this, but it seems to work.

Update: I encountered a subtlety when reviewing some of the output. A given row in the dataset can have multiple fields that contain non-standard entries (error codes). If the four pieces of location information were added to bad_value as a flat list, the list would just keep growing: instead of holding 4 objects it would hold 4 * N objects, where N is the number of fields with non-standard entries, and the grouping by field would be lost. As a work-around, I pass a 4-tuple to the bad_value.append() call, so each offending field is captured as its own record. However, since check_line() can now return N tuples for a single row, I will need to compensate for this when scanning through the file. I discuss this two cells down.
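A minimal sketch of what that looks like for a single row (the row number here is made up for illustration; the column indices follow the ordering in col_names):

# Hypothetical row 42 with two offending fields
bad_value = []
bad_value.append((42, 20, 'delta_1545', '1.#QNAN0'))
bad_value.append((42, 31, 'vwap', '-1.#IND00'))
print(bad_value)
# [(42, 20, 'delta_1545', '1.#QNAN0'), (42, 31, 'vwap', '-1.#IND00')]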

In [3]:
def check_line(s, n, cnames, ctypes):
    """
    Takes comma delimited line from a .csv file, splits it, and then tries
    to check the column type based on a passed dictionary of the form
    { <column_name> : <column_type> }

    The returned value consists of N 4-tuples embedded in a list.  Each
    4-tuple contains the location information regarding the non-standard
    column entry (i.e., the row number, column number, column name) and it
    also contains the value of the non-standard field.  The value of N can
    range from 0 (i.e., no non-standard fields) up to the total number of
    integer and float columns in the column data type dictionary.
    """

    # split the input line
    s = s.rstrip().split(sep=',')

    # number of string elements, number of column names
    ns = len(s)
    nc = len(cnames)

    # lists to capture the value and location of unexpected input
    bad_value = []

    # use an index to loop over all split elements
    if (ns == nc):
        for i in range(ns):
            tmp_s = s[i]
            tmp_typ = ctypes[cnames[i]]
            # only test columns that are not expected to house a string,
            # but if there is a type error in that column, then capture the
            # location, column name, and the offending value as a tuple
            if (tmp_typ != str):
                if (tmp_typ == float):
                    try:
                        float(tmp_s)
                    except ValueError:
                        bad_value.append((n, i, cnames[i], tmp_s))
                elif (tmp_typ == int):
                    try:
                        int(tmp_s)
                    except ValueError:
                        bad_value.append((n, i, cnames[i], tmp_s))
                else:
                    print('Warning: Column type not recognized\n')
                    exit()
    else:
        print('Warning: Unequal string length and column length\n')
        exit()

    # Default return is an empty list, else it houses info on the bad value
    return bad_value

Now step through each line in the file and execute the check_line() function.

Update: To address the change to check_line(), I now call unexp_values.extend(). This iterates over all N of the 4-tuples returned from check_line() and appends each one individually, which keeps all of the 4-tuples "lined up" in the output list. As we will see, this facilitates scanning the output.
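To see the difference concretely, here is a toy comparison (with hypothetical tuples) of append() versus extend() on the accumulator:

tmp_val = [(42, 20, 'delta_1545', '1.#QNAN0'),
           (42, 31, 'vwap', '-1.#IND00')]

nested = []
nested.append(tmp_val)    # one element: the whole per-row list, nested
flat = []
flat.extend(tmp_val)      # two elements: each 4-tuple at the top level

print(len(nested), len(flat))   # 1 2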

In [4]:
# Define lists to hold scan results
unexp_values = []

# Scan the concatenated file for unexpected values in float fields (slow...)
with open(proc_file, 'r') as infile:
    first_line = True
    for line in infile:
        if (first_line):
            first_line = not(first_line)
            line_num = 1
        else:
            tmp_val = check_line(s=line,
                                 n=line_num,
                                 cnames=col_names,
                                 ctypes=col_types)
            if (tmp_val != []):
                # Note the use of extend() here.  The output of check_line()
                # can be N 4-tuples, so we use extend() to iterate over each
                # tuple and append it to the output object
                unexp_values.extend(tmp_val)
            line_num += 1

Use comprehensions to isolate the unique rows, column names, and non-standard fields observed in the file.

In [5]:
# Isolate the sets of unique rows, columns, and strings that caused errors
unexp_rows = {item[0] for item in unexp_values}
unexp_columns = {item[2] for item in unexp_values}
unexp_elements = {item[3] for item in unexp_values}

Echo the results.

In [6]:
# Print a sample of the results
print('-' * 80)
print('Number of unique affected rows = ', len(unexp_rows))
print('Unique offending columns = ', unexp_columns)
print('Unique offending field entries = ', unexp_elements)
print('-' * 80)
--------------------------------------------------------------------------------
Number of unique affected rows =  777
Unique offending columns =  {'vwap', 'vega_1545', 'theta_1545', 'delta_1545', 'rho_1545', 'gamma_1545'}
Unique offending field entries =  {'-1.#R', '-1.#IND00', '1.#QNAN0', '1.#QNB', '-1.#J'}
--------------------------------------------------------------------------------

OK, so what do we see in the results?

  1. The number of affected rows is 777. That is consistent with a simple scan of the .csv file at the OS prompt (i.e., grep '#' cboe_mmyy_all.csv | wc -l).
  2. 777 / 13,057,659 is a small proportion of affected rows (i.e., 99.994% of rows are clean).
  3. Only 6 columns contain non-standard elements: the volume-weighted average price (vwap) and five of the greeks.
  4. There are only 5 unique non-standard field entries. This should make scanning for them relatively simple.

The next step will be to purge these from the file.
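As a rough sketch of what that purge might look like (just one possible approach; the output file name and the choice to blank rather than drop the offending fields are my assumptions, not a final decision):

# Hypothetical clean-up pass: blank out the five known error tokens
bad_tokens = {'-1.#R', '-1.#IND00', '1.#QNAN0', '1.#QNB', '-1.#J'}
clean_file = os.path.join(proc_dir, r"cboe_mmyy_all_clean.csv")

with open(proc_file, 'r') as infile, open(clean_file, 'w') as outfile:
    for line in infile:
        fields = line.rstrip('\n').split(',')
        fields = ['' if f in bad_tokens else f for f in fields]
        outfile.write(','.join(fields) + '\n')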
