Monday, November 11, 2019

010 - CBOE Data Preprocessing

Summary

The data preprocessing steps have been consolidated into a module called cboe. In addition, I now have two sets of data to process: one downloaded in May (housed in the 1905_SPX directory) and a corrected version downloaded in November (housed in the 1911_SPX directory).

A cursory review of the data (i.e., a line count) shows the two sets of historical data are different. In this script I preprocess both sets using the newly consolidated module.
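For context, here is a rough sketch of the kind of helpers that were folded into the module. The names and signatures below are illustrative only; the actual cboe functions (get_cboe_csv_file_list, count_cboe_csv_file_lines, etc.) may differ in detail.

# Illustrative sketch only -- not the actual cboe module code.
import os


def get_csv_file_list(path):
    """Return a sorted list of .csv file paths found directly under path."""
    return sorted(os.path.join(path, f)
                  for f in os.listdir(path)
                  if f.lower().endswith('.csv'))


def count_csv_file_lines(file_list, echo_every=10):
    """Count the total lines across file_list, echoing progress as we go."""
    total = 0
    for i, fname in enumerate(file_list):
        with open(fname, 'r') as fh:
            total += sum(1 for _ in fh)
        if i % echo_every == 0:
            print('Counted lines for {} files of {}'.format(i, len(file_list)))
    print('Total lines in {} files = {}'.format(len(file_list), total))
    return total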

1905_SPX Data Preprocessing

Nothing should change here: only the way the files are processed has changed, so the contents of the output should be identical.

To confirm this was the case, I saved copies of my previously concatenated and preprocessed data files. After running through the steps below, I did a SHA-256 checksum test on both sets of files. They were identical, so moving the steps to functions (and tweaking the functions) doesn't appear to have modified the output.
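A minimal sketch of that comparison, assuming the earlier copies were saved next to the new files with a .bak suffix (the suffix and file names are placeholders, not the actual backup locations):

# Sketch of the SHA-256 comparison; the .bak paths are placeholders.
import hashlib


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


pairs = [('cboe_spx_1905.csv', 'cboe_spx_1905.csv.bak'),
         ('cboe_spx_1905_clean.csv', 'cboe_spx_1905_clean.csv.bak')]
for new_file, old_file in pairs:
    print(new_file, sha256sum(new_file) == sha256sum(old_file))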

(An aside on configuring Jupyter)

The first time I attempted to run the import cboe call below, I received a ModuleNotFoundError. I've tried to modify the JUPYTER_PATH in my config file; however, that doesn't seem to work.

In [1]:
# Import modules
import os
import cboe
import pandas as pd

# =============================================================================
# Set-up
# =============================================================================

# Location of the raw csv files
csv_dir = r'/Users/alexstephens/data/cboe/csv/1905_SPX'

# Echo the number of files / lines to concatenate
csv_list = cboe.get_cboe_csv_file_list(path=csv_dir)
csv_count = cboe.count_cboe_csv_file_lines(file_list=csv_list)

# Location of the processed CBOE csv files
proc_dir = r"/Users/alexstephens/data/cboe/proc/1905_SPX"

# Define the concatenated raw file
proc_csv = r"cboe_spx_1905.csv"
proc_file = os.path.join(proc_dir, proc_csv)

# Define the cleaned, concatenated csv file
clean_csv = r"cboe_spx_1905_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)

# Define the cleaned, concatenated pickle file
clean_pkl = r"cboe_spx_1905_clean.pkl"
clean_pkl_file = os.path.join(proc_dir, clean_pkl)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-6e4f339c7e54> in <module>
      1 # Import modules
      2 import os
----> 3 import cboe
      4 import pandas as pd
      5 

ModuleNotFoundError: No module named 'cboe'

The following approach will work, but it is ugly.

In [6]:
# Import modules
import os
import sys
import pandas as pd

# Import local modules
local_module_path = '/Users/alexstephens/Development/Python/Option_Data_Analysis/modules'
sys.path.append(local_module_path)
import cboe

# =============================================================================
# Set-up
# =============================================================================

# Location of the raw csv files
csv_dir = r'/Users/alexstephens/data/cboe/csv/1905_SPX'

# Location of the processed CBOE csv files
proc_dir = r"/Users/alexstephens/data/cboe/proc/1905_SPX"

# Define the concatenated raw file
proc_csv = r"cboe_spx_1905.csv"
proc_file = os.path.join(proc_dir, proc_csv)

# Define the cleaned, concatenated csv file
clean_csv = r"cboe_spx_1905_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)

# Define the cleaned, concatenated pickle file
clean_pkl = r"cboe_spx_1905_clean.pkl"
clean_pkl_file = os.path.join(proc_dir, clean_pkl)
In [7]:
# Echo the number of files / lines to concatenate
csv_list = cboe.get_cboe_csv_file_list(path=csv_dir)
csv_count = cboe.count_cboe_csv_file_lines(file_list=csv_list)
Counted lines for 0 files of 173
Counted lines for 10 files of 173
Counted lines for 20 files of 173
Counted lines for 30 files of 173
Counted lines for 40 files of 173
Counted lines for 50 files of 173
Counted lines for 60 files of 173
Counted lines for 70 files of 173
Counted lines for 80 files of 173
Counted lines for 90 files of 173
Counted lines for 100 files of 173
Counted lines for 110 files of 173
Counted lines for 120 files of 173
Counted lines for 130 files of 173
Counted lines for 140 files of 173
Counted lines for 150 files of 173
Counted lines for 160 files of 173
Counted lines for 170 files of 173
Total lines in 173 files = 13057831
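For reference, the concatenation step amounts to keeping the header from the first monthly file and appending only the data rows from the rest. A rough sketch (illustrative only; the actual concat_cboe_csv_files may differ):

# Illustrative sketch of header-aware CSV concatenation.
def concat_csv_files(csv_list, output_file):
    print('Concatenating {} files ...'.format(len(csv_list)))
    with open(output_file, 'w') as fout:
        for i, fname in enumerate(csv_list):
            with open(fname, 'r') as fin:
                header = fin.readline()
                if i == 0:
                    fout.write(header)   # keep the header only once
                for line in fin:
                    fout.write(line)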
In [8]:
# =============================================================================
# Concatenate the raw monthly CBOE csv files
# =============================================================================
cboe.concat_cboe_csv_files(csv_list=csv_list, output_file=proc_file)
Concatenating 173 files ...
In [9]:
# =============================================================================
# Scan the concatenated file for unexpected values in non-string fields
# =============================================================================
raw_bads, uniq_bads = cboe.check_cboe_file(cboe_csv_file=proc_file,
                                           cnames=cboe.col_names,
                                           ctypes=cboe.col_types)
Scanned 500000 rows ...
Scanned 1000000 rows ...
Scanned 1500000 rows ...
Scanned 2000000 rows ...
Scanned 2500000 rows ...
Scanned 3000000 rows ...
Scanned 3500000 rows ...
Scanned 4000000 rows ...
Scanned 4500000 rows ...
Scanned 5000000 rows ...
Scanned 5500000 rows ...
Scanned 6000000 rows ...
Scanned 6500000 rows ...
Scanned 7000000 rows ...
Scanned 7500000 rows ...
Scanned 8000000 rows ...
Scanned 8500000 rows ...
Scanned 9000000 rows ...
Scanned 9500000 rows ...
Scanned 10000000 rows ...
Scanned 10500000 rows ...
Scanned 11000000 rows ...
Scanned 11500000 rows ...
Scanned 12000000 rows ...
Scanned 12500000 rows ...
Scanned 13000000 rows ...
--------------------------------------------------------------------------------
Number of unique affected rows =  777
Unique offending columns =  {'rho_1545', 'vega_1545', 'gamma_1545', 'delta_1545', 'theta_1545', 'vwap'}
Unique offending field entries =  {'1.#QNAN0', '-1.#J', '-1.#R', '-1.#IND00', '1.#QNB'}
--------------------------------------------------------------------------------
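The offending entries appear to be Windows-style renderings of non-finite floats (quiet NaN / indeterminate values, some of them truncated), which pandas cannot parse as float64. A rough sketch of this kind of scan, assuming it simply tries to parse every float-typed field row by row (illustrative only; the actual check_cboe_file may differ):

# Illustrative sketch: flag any float-typed field that fails float().
# Assumes ctypes maps column names to dtype strings such as 'float64'.
import csv


def find_bad_float_fields(csv_file, cnames, ctypes):
    bad_rows, bad_cols, bad_text = set(), set(), set()
    float_cols = [c for c in cnames if ctypes.get(c) == 'float64']
    with open(csv_file, 'r', newline='') as fh:
        reader = csv.DictReader(fh, fieldnames=cnames)
        next(reader)                      # skip the header row
        for i, row in enumerate(reader, start=1):
            for col in float_cols:
                val = row[col]
                if not val:
                    continue              # empty fields are allowed
                try:
                    float(val)
                except (TypeError, ValueError):
                    bad_rows.add(i)
                    bad_cols.add(col)
                    bad_text.add(val)
    return bad_rows, bad_cols, bad_text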
In [12]:
# =============================================================================
# Clean the identified bad fields (if any exist)
# =============================================================================
if uniq_bads['bad_rows']:
    print('Cleaning {} rows'.format(len(uniq_bads['bad_rows'])))
    num_bad_rows = cboe.clean_cboe_file(input_file=proc_file,
                                        output_file=clean_file,
                                        bad_rows=uniq_bads['bad_rows'],
                                        bad_text=uniq_bads['bad_text'])
else:
    print('No bad rows; Skipping <clean_cboe_file>')
Cleaning 777 rows
Reviewed / cleaned 0 rows ...
Reviewed / cleaned 500000 rows ...
Reviewed / cleaned 1000000 rows ...
Reviewed / cleaned 1500000 rows ...
Reviewed / cleaned 2000000 rows ...
Reviewed / cleaned 2500000 rows ...
Reviewed / cleaned 3000000 rows ...
Reviewed / cleaned 3500000 rows ...
Reviewed / cleaned 4000000 rows ...
Reviewed / cleaned 4500000 rows ...
Reviewed / cleaned 5000000 rows ...
Reviewed / cleaned 5500000 rows ...
Reviewed / cleaned 6000000 rows ...
Reviewed / cleaned 6500000 rows ...
Reviewed / cleaned 7000000 rows ...
Reviewed / cleaned 7500000 rows ...
Reviewed / cleaned 8000000 rows ...
Reviewed / cleaned 8500000 rows ...
Reviewed / cleaned 9000000 rows ...
Reviewed / cleaned 9500000 rows ...
Reviewed / cleaned 10000000 rows ...
Reviewed / cleaned 10500000 rows ...
Reviewed / cleaned 11000000 rows ...
Reviewed / cleaned 11500000 rows ...
Reviewed / cleaned 12000000 rows ...
Reviewed / cleaned 12500000 rows ...
Reviewed / cleaned 13000000 rows ...
--------------------------------------------------------------------------------
Number of modified rows 777
--------------------------------------------------------------------------------
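Conceptually, the cleaning pass streams the concatenated file and, on the flagged rows, replaces any field matching the bad text so that pandas can later read it as NaN. A rough sketch under that assumption (illustrative only; the actual clean_cboe_file may differ):

# Illustrative sketch: blank out flagged fields on flagged rows.
import csv


def clean_bad_fields(input_file, output_file, bad_rows, bad_text):
    n_modified = 0
    with open(input_file, 'r', newline='') as fin, \
         open(output_file, 'w', newline='') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(next(reader))     # copy the header row unchanged
        for i, row in enumerate(reader, start=1):
            if i in bad_rows:
                row = ['' if field in bad_text else field for field in row]
                n_modified += 1
            writer.writerow(row)
    print('Number of modified rows', n_modified)
    return n_modified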
In [13]:
# =============================================================================
# Create a pandas data frame using the cleaned data
# =============================================================================
# Define columns that hold dates
parse_dates = ['quote_date', 'expiration']

# Read the concatenated csv data into a pandas Dataframe
df = pd.read_csv(
        clean_file,
        skiprows=0,
        delimiter=",",
        header=0,
        names=cboe.col_names,
        dtype=cboe.col_types,
        parse_dates=parse_dates,
        infer_datetime_format=True,
        )

# The raw monthly data files are in reverse order by date, so the first rows
# are from the end of the month and the last rows are from the start of the
# month.  Therefore, sort the data by quote date, expiration, and strike.
df.sort_values(['quote_date', 'expiration', 'strike'],
               ascending=[True, True, True],
               inplace=True)

# Save the Dataframe to a pickle
df.to_pickle(clean_pkl_file)
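Not part of the original run, but a quick sanity check on the saved pickle could look like this:

# Optional check: confirm the row count and date columns survive the
# pickle round trip.
check = pd.read_pickle(clean_pkl_file)
print(check.shape)
print(check['quote_date'].min(), check['quote_date'].max())
print(check.dtypes[['quote_date', 'expiration', 'strike']])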

1911_SPX Data Preprocessing

These are the new data I downloaded in November after the CBOE published some corrections to the historical data.

A few observations you will notice below:

  1. There are fewer lines in these data (12,763,145 vs. 13,057,831).
  2. There are no instances of "bad text" in the columns that should parse as floats.

In the next script we will review the data in greater detail.

In [14]:
# =============================================================================
# Set-up
# =============================================================================

# Location of the raw csv files
csv_dir = r'/Users/alexstephens/data/cboe/csv/1911_SPX'

# Location of the processed CBOE csv files
proc_dir = r"/Users/alexstephens/data/cboe/proc/1911_SPX"

# Define the concatenated raw file
proc_csv = r"cboe_spx_1911.csv"
proc_file = os.path.join(proc_dir, proc_csv)

# Define the cleaned, concatenated csv file
clean_csv = r"cboe_spx_1911_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)

# Define the cleaned, concatenated pickle file
clean_pkl = r"cboe_spx_1911_clean.pkl"
clean_pkl_file = os.path.join(proc_dir, clean_pkl)
In [15]:
# =============================================================================
# Count files / lines
# =============================================================================

# Echo the number of files / lines to concatenate
csv_list = cboe.get_cboe_csv_file_list(path=csv_dir)
csv_count = cboe.count_cboe_csv_file_lines(file_list=csv_list)
Counted lines for 0 files of 173
Counted lines for 10 files of 173
Counted lines for 20 files of 173
Counted lines for 30 files of 173
Counted lines for 40 files of 173
Counted lines for 50 files of 173
Counted lines for 60 files of 173
Counted lines for 70 files of 173
Counted lines for 80 files of 173
Counted lines for 90 files of 173
Counted lines for 100 files of 173
Counted lines for 110 files of 173
Counted lines for 120 files of 173
Counted lines for 130 files of 173
Counted lines for 140 files of 173
Counted lines for 150 files of 173
Counted lines for 160 files of 173
Counted lines for 170 files of 173
Total lines in 173 files = 12763145
In [16]:
# =============================================================================
# Concatenate the raw monthly CBOE csv files
# =============================================================================
cboe.concat_cboe_csv_files(csv_list=csv_list, output_file=proc_file)
Concatenating 173 files ...
In [17]:
# =============================================================================
# Scan the concatenated file for unexpected values in non-string fields
# =============================================================================
raw_bads, uniq_bads = cboe.check_cboe_file(cboe_csv_file=proc_file,
                                           cnames=cboe.col_names,
                                           ctypes=cboe.col_types)
Scanned 500000 rows ...
Scanned 1000000 rows ...
Scanned 1500000 rows ...
Scanned 2000000 rows ...
Scanned 2500000 rows ...
Scanned 3000000 rows ...
Scanned 3500000 rows ...
Scanned 4000000 rows ...
Scanned 4500000 rows ...
Scanned 5000000 rows ...
Scanned 5500000 rows ...
Scanned 6000000 rows ...
Scanned 6500000 rows ...
Scanned 7000000 rows ...
Scanned 7500000 rows ...
Scanned 8000000 rows ...
Scanned 8500000 rows ...
Scanned 9000000 rows ...
Scanned 9500000 rows ...
Scanned 10000000 rows ...
Scanned 10500000 rows ...
Scanned 11000000 rows ...
Scanned 11500000 rows ...
Scanned 12000000 rows ...
Scanned 12500000 rows ...
--------------------------------------------------------------------------------
Number of unique affected rows =  0
Unique offending columns =  set()
Unique offending field entries =  set()
--------------------------------------------------------------------------------
In [18]:
# =============================================================================
# Clean the identified bad fields (if any exist)
# =============================================================================
if uniq_bads['bad_rows']:
    print('Cleaning {} rows'.format(len(uniq_bads['bad_rows'])))
    num_bad_rows = cboe.clean_cboe_file(input_file=proc_file,
                                        output_file=clean_file,
                                        bad_rows=uniq_bads['bad_rows'],
                                        bad_text=uniq_bads['bad_text'])
else:
    print('No bad rows; Skipping <clean_cboe_file>')
No bad rows; Skipping <clean_cboe_file>
In [19]:
# =============================================================================
# Create a pandas data frame using the cleaned data
# =============================================================================
# Define columns that hold dates
parse_dates = ['quote_date', 'expiration']

# Read the concatenated csv data into a pandas Dataframe
df = pd.read_csv(
        clean_file,
        skiprows=0,
        delimiter=",",
        header=0,
        names=cboe.col_names,
        dtype=cboe.col_types,
        parse_dates=parse_dates,
        infer_datetime_format=True,
        )

# The raw monthly data files are in reverse order by date, so the first rows
# are from the end of the month and the last rows are from the start of the
# month.  Therefore, sort the data by quote date, expiration, and strike.
df.sort_values(['quote_date', 'expiration', 'strike'],
               ascending=[True, True, True],
               inplace=True)

# Save the Dataframe to a pickle
df.to_pickle(clean_pkl_file)
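As a small preview of the next script's review, one could load both cleaned pickles and compare row counts and quote-date ranges; a rough sketch:

# Illustrative comparison of the two cleaned data sets.
df_1905 = pd.read_pickle('/Users/alexstephens/data/cboe/proc/1905_SPX/cboe_spx_1905_clean.pkl')
df_1911 = pd.read_pickle('/Users/alexstephens/data/cboe/proc/1911_SPX/cboe_spx_1911_clean.pkl')

for label, d in [('1905', df_1905), ('1911', df_1911)]:
    print(label, len(d), d['quote_date'].min().date(), d['quote_date'].max().date())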
In [22]:
# Check the first few rows
df.head(4).transpose()
Out[22]:
92 93 94 95
underlying_symbol ^SPX ^SPX ^SPX ^SPX
quote_date 2004-01-02 00:00:00 2004-01-02 00:00:00 2004-01-02 00:00:00 2004-01-02 00:00:00
root SYG SYG SYG SYG
expiration 2004-01-17 00:00:00 2004-01-17 00:00:00 2004-01-17 00:00:00 2004-01-17 00:00:00
strike 600 600 625 625
option_type C P C P
open 0 0 0 0
high 0 0 0 0
low 0 0 0 0
close 0 0 0 0
trade_volume 0 0 0 0
bid_size_1545 1 0 1 0
bid_1545 506.5 0 481.5 0
ask_size_1545 1 10 1 10
ask_1545 508.5 0.05 483.5 0.05
underlying_bid_1545 1107.61 1107.61 1107.61 1107.61
underlying_ask_1545 1108.06 1108.06 1108.06 1108.06
implied_underlying_price_1545 1107.25 1107.25 1107.25 1107.25
active_underlying_price_1545 1107.8 1107.8 1107.8 1107.8
implied_volatility_1545 1.0222 0.9665 0.9432 0.9056
delta_1545 0.9992 -0.0004 0.9993 -0.0005
gamma_1545 0 0 0 0
theta_1545 -0.0359 -0.0117 -0.0332 -0.0116
vega_1545 0.0059 0.0034 0.0054 0.0036
rho_1545 23.0216 -0.0195 23.9857 -0.0207
bid_size_eod 1 0 1 0
bid_eod 509 0 484 0
ask_size_eod 1 10 1 10
ask_eod 511 0.05 486 0.05
underlying_bid_eod 1107.65 1107.65 1107.65 1107.65
underlying_ask_eod 1108.41 1108.41 1108.41 1108.41
vwap 0 0 0 0
open_interest 1 131 0 100
delivery_code NaN NaN NaN NaN
In [23]:
# Check the last few rows
df.tail(4).transpose()
Out[23]:
12762968 12762969 12762970 12762971
underlying_symbol ^SPX ^SPX ^SPX ^SPX
quote_date 2018-05-29 00:00:00 2018-05-29 00:00:00 2018-05-29 00:00:00 2018-05-29 00:00:00
root SPX SPX SPX SPX
expiration 2020-12-18 00:00:00 2020-12-18 00:00:00 2020-12-18 00:00:00 2020-12-18 00:00:00
strike 4000 4000 4100 4100
option_type C P C P
open 0 0 0 0
high 0 0 0 0
low 0 0 0 0
close 0 0 0 0
trade_volume 0 0 0 0
bid_size_1545 100 60 28 60
bid_1545 3.5 1171.5 1.45 1264
ask_size_1545 99 11 99 11
ask_1545 4.5 1187.8 3.5 1280.2
underlying_bid_1545 2688.58 2688.58 2688.58 2688.58
underlying_ask_1545 2689.25 2689.25 2689.25 2689.25
implied_underlying_price_1545 2575.72 2575.72 2575.72 2575.72
active_underlying_price_1545 2688.89 2688.89 2688.89 2688.89
implied_volatility_1545 0.1145 0.02 0.1127 0.02
delta_1545 0.024 -1 0.0158 -1
gamma_1545 0.0001 0 0.0001 0
theta_1545 -0.018 0 -0.0124 0
vega_1545 2.3226 0 1.633 0
rho_1545 147.596 -9625.96 97.9013 -9866.61
bid_size_eod 100 60 28 60
bid_eod 3.5 1165 1.75 1257.3
ask_size_eod 99 11 99 11
ask_eod 4.6 1181.9 3.6 1274.2
underlying_bid_eod 2665.37 2665.37 2665.37 2665.37
underlying_ask_eod 2716.89 2716.89 2716.89 2716.89
vwap 0 0 0 0
open_interest 442 34 501 3
delivery_code NaN NaN NaN NaN
