Thursday, October 24, 2019

001 - Initial Review of the CBOE Data

01_cboe_data_review

Examining the CBOE Historical Option Chain Data

First steps

I have a working knowledge of Python, but not much experience using the various tools (e.g., Anaconda, Spyder, Jupyter Notebooks) that help with learning the language.

So, in some of the first posts I will use Jupyter to explore and review the historical option price data purchased from the CBOE.

Basically, my initial goal is to learn how to use the tools. Optimization of data extraction and cleansing will come later.

CBOE historical data

Before I get too deep into Python, it makes sense to review the data to be used for backtesting option strategies.

I purchased roughly 14 years of SPX option chain data (including the pre-computed greeks) from the CBOE Data Shop.

Specifically, the historical option price data extend from 01/02/2004 to 05/29/2018. As noted on the CBOE End-of-Day Option Quotes with Calcs page, the data ...

provides all of the fields in the end-of-day Option quotes file plus market implied volatility for each option, as well as, the greeks (Delta, Gamma, Theta, Vega and Rho). Implied volatility and Greeks are calculated off of the 1545 timestamp, since it is considered a more accurate snapshot of market liquidity than the end of day market.

The CBOE data (which are packaged as .csv files) contain option prices, greeks, and certain metadata for each available strike associated with an underlying and an expiration cycle on a given trading day. Since that is a mouthful, it is probably easiest to just list the available attributes. The data include the fields listed below.

  • Underlying Symbol = The ticker symbol of the underlying. Note that indices are identified with the ^ prefix. (a string)
  • Quote Date = The date of the quote (a string: in yyyy-mm-dd format)
  • Root = The root is another name for the OCC option symbol. It is also synonymous with the option class symbol. (a string)
  • Expiration = The option expiration date (a string: in yyyy-mm-dd format)
  • Strike = The option strike price (an integer)
  • Option Type = The option type (a string: "c" = call, "p" = put)
  • Open = The opening price (a float)
  • High = The high price of the day (a float)
  • Low = The low price of the day (a float)
  • Close = The closing price (a float)
  • Trade Volume = The number of contracts traded that day (an integer)
  • Bid Size 1545 = The number of bids at 3:45pm (an integer)
  • Bid 1545 = The bid amount at 3:45pm (a float)
  • Ask Size 1545 = The number of asks at 3:45pm (an integer)
  • Ask 1545 = The ask amount at 3:45pm (a float)
  • Underlying Bid 1545 = The underlying (i.e., SPX) bid amount at 3:45pm (a float)
  • Underlying Ask 1545 = The underlying (i.e., SPX) ask amount at 3:45pm (a float)
  • Implied Underlying Price 1545 = The implied price is defined as the Forward Index minus the cost of carry. The forward index is defined as "call - put + strike" where call/put is the midpoint of the bid/ask. (a float)
  • Active Underlying Price 1545 = The active_underlying is the price used as the stock price input (it could be the implied price, or the last trade price or the midpoint of the bid/ask). (a float)
  • Implied Volatility 1545 = The option implied volatility at 3:45pm (a float)
  • Delta 1545 = The option delta at 3:45pm (a float)
  • Gamma 1545 = The option gamma at 3:45pm (a float)
  • Theta 1545 = The option theta at 3:45pm (a float)
  • Vega 1545 = The option vega at 3:45pm (a float)
  • Rho 1545 = The option rho at 3:45pm (a float)
  • Bid Size Eod = The number of bids at end-of-day (EOD) (an integer)
  • Bid Eod = The bid amount at EOD (a float)
  • Ask Size Eod = The number of asks at end-of-day (EOD) (an integer)
  • Ask Eod = The ask amount at EOD (a float)
  • Underlying Bid Eod = The underlying (i.e., SPX) bid amount at EOD (a float)
  • Underlying Ask Eod = The underlying (i.e., SPX) ask amount at EOD (a float)
  • VWAP = The volume-weighted average price (a float)
  • Open Interest = The option open interest (an integer)
  • Delivery Code = Unknown
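To get a feel for how these fields parse, here is a minimal sketch using the standard csv module on a tiny synthetic sample shaped like the list above (the snake_case header names are my assumption, not copied from the actual files):

```python
import csv
import io

# Two synthetic rows; the snake_case header names are assumed, not real
sample = io.StringIO(
    "underlying_symbol,quote_date,expiration,strike,option_type,bid_1545,ask_1545\n"
    "^SPX,2004-01-02,2004-01-17,1100,c,11.5,12.5\n"
    "^SPX,2004-01-02,2004-01-17,1100,p,9.5,10.5\n"
)

rows = list(csv.DictReader(sample))

# csv delivers every value as a string, so numeric fields need explicit casts
mids = [(float(r["bid_1545"]) + float(r["ask_1545"])) / 2 for r in rows]
print(mids)  # [12.0, 10.0]
```

Later posts will likely let pandas handle the type conversion, but the plain csv module makes the string-typed nature of the raw data explicit.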

Raw Data Files

I requested delivery of the data in monthly snapshots. The data extend from the beginning of 2004 until the end of May 2018, so I'd expect to see ((12 x 14) + 5) files. However, I have no idea how many rows each file contains.
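That expected count is a one-liner to verify:

```python
# 14 full calendar years (2004-2017) at 12 monthly files each,
# plus January through May of 2018
expected_files = 12 * 14 + 5
print(expected_files)  # 173
```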

The file sizes increase over time. This is likely due to the advent of weekly expiration cycles as well as other, non-standard expiries.

The script below simply takes all of the downloaded CBOE data files (in .csv format) and counts the total number of lines.

The primary innovation in this script, for me, is the use of modules that help with navigating the file system (import os) and scanning filenames (import fnmatch).

In [2]:
# Import modules
import os
import fnmatch

# Directory that houses the raw CBOE csv data
csv_dir = r"/Users/alexstephens/data/cboe/csv"

# Create a sorted list of the raw *.csv files to process
file_list = os.listdir(csv_dir)
file_suffix = '*.csv'

# For each file, join the full path to the filename
csv_list = []
for file in file_list:
    if fnmatch.fnmatch(file, file_suffix):
        csv_list.append(os.path.join(csv_dir, file))
csv_list.sort()

# Define counters
file_count = 0
line_count = 0

# Count the number of lines in each file
for csv_file in csv_list:
    file_count += 1
    with open(csv_file) as infile:
        for line in infile:
            line_count += 1

# Echo the results
print('-' * 80)
print('Number of files {}'.format(file_count))
print('Number of lines {}'.format(line_count))
print('-' * 80)
print('\n')
--------------------------------------------------------------------------------
Number of files 173
Number of lines 13057831
--------------------------------------------------------------------------------
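As an aside, the listdir/fnmatch combination above can be collapsed into a single call with the glob module. A sketch, demonstrated on a throwaway directory rather than the real data:

```python
import glob
import os
import tempfile

# Build a throwaway directory with a few dummy files to pattern-match against
csv_dir = tempfile.mkdtemp()
for name in ("b.csv", "a.csv", "notes.txt"):
    open(os.path.join(csv_dir, name), "w").close()

# glob filters on the pattern and joins the directory path in one step
csv_list = sorted(glob.glob(os.path.join(csv_dir, "*.csv")))
print([os.path.basename(p) for p in csv_list])  # ['a.csv', 'b.csv']
```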


To me, 13 million lines seems like a lot of data, but in reality it is not that big. Since I think I can process a single file of that size, I will concatenate all 173 files into one large-ish .csv file.

Navigating a single file should allow for a quick review and initial cleansing of the data. Cleansing is an important step, because I know that some fields contain special values, but (again) I don't know how many of these error codes exist in the data.
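As a placeholder for that cleansing step, here is the kind of check I have in mind. The column name and the sentinel value (an implied volatility of 0.0 marking a failed calculation) are purely hypothetical, since I don't yet know which error codes the data actually use:

```python
import csv
import io

# Synthetic rows; the column name and the 0.0 sentinel are hypothetical
sample = io.StringIO(
    "strike,implied_volatility_1545\n"
    "1100,0.1825\n"
    "1150,0.0\n"
    "1200,0.1610\n"
)

# Count rows flagged with the (hypothetical) sentinel value
bad_rows = sum(1 for r in csv.DictReader(sample)
               if float(r["implied_volatility_1545"]) == 0.0)
print(bad_rows)  # 1
```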

In [3]:
# Print the head of the file list
csv_list[:5]
Out[3]:
['/Users/alexstephens/data/cboe/csv/UnderlyingOptionsEODCalcs_2004-01.csv',
 '/Users/alexstephens/data/cboe/csv/UnderlyingOptionsEODCalcs_2004-02.csv',
 '/Users/alexstephens/data/cboe/csv/UnderlyingOptionsEODCalcs_2004-03.csv',
 '/Users/alexstephens/data/cboe/csv/UnderlyingOptionsEODCalcs_2004-04.csv',
 '/Users/alexstephens/data/cboe/csv/UnderlyingOptionsEODCalcs_2004-05.csv']
In [4]:
# Print the tail of the file list
csv_list[-5:]
Out[4]:
['/Users/alexstephens/data/cboe/csv/UnderlyingOptionsEODCalcs_2018-01.csv',
 '/Users/alexstephens/data/cboe/csv/UnderlyingOptionsEODCalcs_2018-02.csv',
 '/Users/alexstephens/data/cboe/csv/UnderlyingOptionsEODCalcs_2018-03.csv',
 '/Users/alexstephens/data/cboe/csv/UnderlyingOptionsEODCalcs_2018-04.csv',
 '/Users/alexstephens/data/cboe/csv/UnderlyingOptionsEODCalcs_2018-05.csv']

Of course, each of the 173 monthly files contains its own header row, and we only want to retain one of them.

Below I use simple (but maybe not efficient) logic to skip the superfluous headers when concatenating the files.
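For comparison, a more compact variant of the same idea reads the header with readline() on every file but only writes it for the first one. A sketch, tested on two tiny in-memory files rather than the real data:

```python
import io

def concat_keep_first_header(infiles, outfile):
    """Concatenate text files, keeping only the first file's header line."""
    for i, infile in enumerate(infiles):
        header = infile.readline()
        if i == 0:
            outfile.write(header)
        outfile.write(infile.read())

# Two toy "files" sharing the same header row
f1 = io.StringIO("strike,close\n1100,11.5\n")
f2 = io.StringIO("strike,close\n1150,9.8\n")
out = io.StringIO()
concat_keep_first_header([f1, f2], out)
print(out.getvalue().splitlines())  # ['strike,close', '1100,11.5', '1150,9.8']
```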

In [5]:
# Set a debugging flag
do_debug = False

# Directory that houses the concatenated CBOE csv data
proc_dir = r"/Users/alexstephens/data/cboe/proc"

# Enable the option to concatenate a small sample of 10 files
# using the do_debug flag; otherwise concatenate all files
if (do_debug):
    csv_list = csv_list[0:10]
    proc_csv = r"cboe_mmyy_smp.csv"
else:
    proc_csv = r"cboe_mmyy_all.csv"
proc_file = os.path.join(proc_dir, proc_csv)

# Concatenate the individual csv files into a single file. 
# Retain the header from the first file.  For the remaining 
# files, do not output the header
first_file = True
with open(proc_file, 'w') as outfile:
    # Loop over all the .csv files
    for csv_file in csv_list:
        if (first_file):
            first_file = not(first_file)
            with open(csv_file) as infile:
                for line in infile:
                    outfile.write(line)
        else:
            with open(csv_file, 'r') as infile:
                first_line = True
                for line in infile:
                    if (first_line):
                        first_line = not(first_line)
                    else:
                        outfile.write(line)
                        
# Count the rows in each individual csv file (each count includes
# that file's header row)
csv_file_rows = []
for csv_file in csv_list:
    with open(csv_file) as infile:
        csv_file_rows.append(sum(1 for line in infile))

# Compute total rows (excluding the header row in the N-1 extra csv files)
csv_file_total_rows = sum(csv_file_rows) - (len(csv_file_rows) - 1)

# Compute the total number of rows in the concatenated file
with open(proc_file) as infile:
    concat_file_total_rows = sum(1 for line in infile)

# Validate row counts
print('-' * 80)
print('Individual File Row Count = {}'.format(csv_file_total_rows))
print('Concatenated File Row Count = {}'.format(concat_file_total_rows))
print('-' * 80)
print('\n')
--------------------------------------------------------------------------------
Individual File Row Count = 13057659
Concatenated File Row Count = 13057659
--------------------------------------------------------------------------------


The size of the output .csv file is ~3 GB.
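Even at ~3 GB, the concatenated file never has to be loaded into memory all at once. Streaming it one row at a time keeps memory use flat regardless of file size; a sketch of the pattern on a small in-memory stand-in:

```python
import csv
import io
from collections import Counter

# In-memory stand-in for the large concatenated file
big = io.StringIO(
    "underlying_symbol,quote_date\n"
    "^SPX,2004-01-02\n"
    "^SPX,2004-01-02\n"
    "^SPX,2004-01-05\n"
)

# Aggregate while streaming; only one row is ever held in memory
rows_per_date = Counter(r["quote_date"] for r in csv.DictReader(big))
print(dict(rows_per_date))  # {'2004-01-02': 2, '2004-01-05': 1}
```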