Monday, November 11, 2019

010 - CBOE Data Preprocessing

Summary

The data preprocessing steps have been consolidated into a module called cboe. In addition, I now have two sets of data to process: one downloaded in May (housed in the 1905_SPX directory) and a corrected version downloaded in November (housed in the 1911_SPX directory).

A cursory review of the data (i.e., a line count) shows the two sets of historical data are different. In this script I preprocess both sets using the newly consolidated module.
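For context, here is a rough sketch of the kind of helpers that were folded into the module. The names and signatures below are illustrative only; the actual cboe functions (get_cboe_csv_file_list, count_cboe_csv_file_lines, etc.) may differ in detail.

# Illustrative sketch only -- not the actual cboe module code.
import os


def get_csv_file_list(path):
    """Return a sorted list of .csv file paths found directly under path."""
    return sorted(os.path.join(path, f)
                  for f in os.listdir(path)
                  if f.lower().endswith('.csv'))


def count_csv_file_lines(file_list, echo_every=10):
    """Count the total lines across file_list, echoing progress as we go."""
    total = 0
    for i, fname in enumerate(file_list):
        with open(fname, 'r') as fh:
            total += sum(1 for _ in fh)
        if i % echo_every == 0:
            print('Counted lines for {} files of {}'.format(i, len(file_list)))
    print('Total lines in {} files = {}'.format(len(file_list), total))
    return total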

1905_SPX Data Preprocessing

Nothing should change here: only the way the files are processed has changed, so the contents of the output should be identical.

To confirm this was the case, I saved copies of my previously concatenated and preprocessed data files. After running through the steps below, I did a SHA-256 checksum test on both sets of files. They were identical, so moving the steps to functions (and tweaking the functions) doesn't appear to have modified the output.
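A minimal sketch of that comparison, assuming the earlier copies were saved next to the new files with a .bak suffix (the suffix and file names are placeholders, not the actual backup locations):

# Sketch of the SHA-256 comparison; the .bak paths are placeholders.
import hashlib


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


pairs = [('cboe_spx_1905.csv', 'cboe_spx_1905.csv.bak'),
         ('cboe_spx_1905_clean.csv', 'cboe_spx_1905_clean.csv.bak')]
for new_file, old_file in pairs:
    print(new_file, sha256sum(new_file) == sha256sum(old_file))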

(An aside on configuring Jupyter)

The first time I attempted to run the import cboe call below, I received a ModuleNotFoundError. I've tried to modify the JUPYTER_PATH in my config file; however, that doesn't seem to work.

In [1]:
# Import modules
import os
import cboe
import pandas as pd

# =============================================================================
# Set-up
# =============================================================================

# Location of the raw csv files
csv_dir = r'/Users/alexstephens/data/cboe/csv/1905_SPX'

# Echo the number of files / lines to concatenate
csv_list = cboe.get_cboe_csv_file_list(path=csv_dir)
csv_count = cboe.count_cboe_csv_file_lines(file_list=csv_list)

# Location of the processed CBOE csv files
proc_dir = r"/Users/alexstephens/data/cboe/proc/1905_SPX"

# Define the concatenated raw file
proc_csv = r"cboe_spx_1905.csv"
proc_file = os.path.join(proc_dir, proc_csv)

# Define the cleaned, concatenated csv file
clean_csv = r"cboe_spx_1905_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)

# Define the cleaned, concatenated pickle file
clean_pkl = r"cboe_spx_1905_clean.pkl"
clean_pkl_file = os.path.join(proc_dir, clean_pkl)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-6e4f339c7e54> in <module>
      1 # Import modules
      2 import os
----> 3 import cboe
      4 import pandas as pd
      5 

ModuleNotFoundError: No module named 'cboe'

The following approach will work, but it is ugly.

In [6]:
# Import modules
import os
import sys
import pandas as pd

# Import local modules
local_module_path = '/Users/alexstephens/Development/Python/Option_Data_Analysis/modules'
sys.path.append(local_module_path)
import cboe

# =============================================================================
# Set-up
# =============================================================================

# Location of the raw csv files
csv_dir = r'/Users/alexstephens/data/cboe/csv/1905_SPX'

# Location of the processed CBOE csv files
proc_dir = r"/Users/alexstephens/data/cboe/proc/1905_SPX"

# Define the concatenated raw file
proc_csv = r"cboe_spx_1905.csv"
proc_file = os.path.join(proc_dir, proc_csv)

# Define the cleaned, concatenated csv file
clean_csv = r"cboe_spx_1905_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)

# Define the cleaned, concatenated pickle file
clean_pkl = r"cboe_spx_1905_clean.pkl"
clean_pkl_file = os.path.join(proc_dir, clean_pkl)
In [7]:
# Echo the number of files / lines to concatenate
csv_list = cboe.get_cboe_csv_file_list(path=csv_dir)
csv_count = cboe.count_cboe_csv_file_lines(file_list=csv_list)
Counted lines for 0 files of 173
Counted lines for 10 files of 173
Counted lines for 20 files of 173
Counted lines for 30 files of 173
Counted lines for 40 files of 173
Counted lines for 50 files of 173
Counted lines for 60 files of 173
Counted lines for 70 files of 173
Counted lines for 80 files of 173
Counted lines for 90 files of 173
Counted lines for 100 files of 173
Counted lines for 110 files of 173
Counted lines for 120 files of 173
Counted lines for 130 files of 173
Counted lines for 140 files of 173
Counted lines for 150 files of 173
Counted lines for 160 files of 173
Counted lines for 170 files of 173
Total lines in 173 files = 13057831
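For reference, the concatenation step amounts to keeping the header from the first monthly file and appending only the data rows from the rest. A rough sketch (illustrative only; the actual concat_cboe_csv_files may differ):

# Illustrative sketch of header-aware CSV concatenation.
def concat_csv_files(csv_list, output_file):
    print('Concatenating {} files ...'.format(len(csv_list)))
    with open(output_file, 'w') as fout:
        for i, fname in enumerate(csv_list):
            with open(fname, 'r') as fin:
                header = fin.readline()
                if i == 0:
                    fout.write(header)   # keep the header only once
                for line in fin:
                    fout.write(line)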
In [8]:
# =============================================================================
# Concatenate the raw monthly CBOE csv files
# =============================================================================
cboe.concat_cboe_csv_files(csv_list=csv_list, output_file=proc_file)
Concatenating 173 files ...
In [9]:
# =============================================================================
# Scan the concatenated file for unexpected values in non-string fields
# =============================================================================
raw_bads, uniq_bads = cboe.check_cboe_file(cboe_csv_file=proc_file,
                                           cnames=cboe.col_names,
                                           ctypes=cboe.col_types)
Scanned 500000 rows ...
Scanned 1000000 rows ...
Scanned 1500000 rows ...
Scanned 2000000 rows ...
Scanned 2500000 rows ...
Scanned 3000000 rows ...
Scanned 3500000 rows ...
Scanned 4000000 rows ...
Scanned 4500000 rows ...
Scanned 5000000 rows ...
Scanned 5500000 rows ...
Scanned 6000000 rows ...
Scanned 6500000 rows ...
Scanned 7000000 rows ...
Scanned 7500000 rows ...
Scanned 8000000 rows ...
Scanned 8500000 rows ...
Scanned 9000000 rows ...
Scanned 9500000 rows ...
Scanned 10000000 rows ...
Scanned 10500000 rows ...
Scanned 11000000 rows ...
Scanned 11500000 rows ...
Scanned 12000000 rows ...
Scanned 12500000 rows ...
Scanned 13000000 rows ...
--------------------------------------------------------------------------------
Number of unique affected rows =  777
Unique offending columns =  {'rho_1545', 'vega_1545', 'gamma_1545', 'delta_1545', 'theta_1545', 'vwap'}
Unique offending field entries =  {'1.#QNAN0', '-1.#J', '-1.#R', '-1.#IND00', '1.#QNB'}
--------------------------------------------------------------------------------
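The offending entries appear to be Windows-style renderings of non-finite floats (quiet NaN / indeterminate values, some of them truncated), which pandas cannot parse as float64. A rough sketch of this kind of scan, assuming it simply tries to parse every float-typed field row by row (illustrative only; the actual check_cboe_file may differ):

# Illustrative sketch: flag any float-typed field that fails float().
# Assumes ctypes maps column names to dtype strings such as 'float64'.
import csv


def find_bad_float_fields(csv_file, cnames, ctypes):
    bad_rows, bad_cols, bad_text = set(), set(), set()
    float_cols = [c for c in cnames if ctypes.get(c) == 'float64']
    with open(csv_file, 'r', newline='') as fh:
        reader = csv.DictReader(fh, fieldnames=cnames)
        next(reader)                      # skip the header row
        for i, row in enumerate(reader, start=1):
            for col in float_cols:
                val = row[col]
                if not val:
                    continue              # empty fields are allowed
                try:
                    float(val)
                except (TypeError, ValueError):
                    bad_rows.add(i)
                    bad_cols.add(col)
                    bad_text.add(val)
    return bad_rows, bad_cols, bad_text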
In [12]:
# =============================================================================
# Clean the identified bad fields (if any exist)
# =============================================================================
if uniq_bads['bad_rows']:
    print('Cleaning {} rows'.format(len(uniq_bads['bad_rows'])))
    num_bad_rows = cboe.clean_cboe_file(input_file=proc_file,
                                        output_file=clean_file,
                                        bad_rows=uniq_bads['bad_rows'],
                                        bad_text=uniq_bads['bad_text'])
else:
    print('No bad rows; Skipping <clean_cboe_file>')
Cleaning 777 rows
Reviewed / cleaned 0 rows ...
Reviewed / cleaned 500000 rows ...
Reviewed / cleaned 1000000 rows ...
Reviewed / cleaned 1500000 rows ...
Reviewed / cleaned 2000000 rows ...
Reviewed / cleaned 2500000 rows ...
Reviewed / cleaned 3000000 rows ...
Reviewed / cleaned 3500000 rows ...
Reviewed / cleaned 4000000 rows ...
Reviewed / cleaned 4500000 rows ...
Reviewed / cleaned 5000000 rows ...
Reviewed / cleaned 5500000 rows ...
Reviewed / cleaned 6000000 rows ...
Reviewed / cleaned 6500000 rows ...
Reviewed / cleaned 7000000 rows ...
Reviewed / cleaned 7500000 rows ...
Reviewed / cleaned 8000000 rows ...
Reviewed / cleaned 8500000 rows ...
Reviewed / cleaned 9000000 rows ...
Reviewed / cleaned 9500000 rows ...
Reviewed / cleaned 10000000 rows ...
Reviewed / cleaned 10500000 rows ...
Reviewed / cleaned 11000000 rows ...
Reviewed / cleaned 11500000 rows ...
Reviewed / cleaned 12000000 rows ...
Reviewed / cleaned 12500000 rows ...
Reviewed / cleaned 13000000 rows ...
--------------------------------------------------------------------------------
Number of modified rows 777
--------------------------------------------------------------------------------
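Conceptually, the cleaning pass streams the concatenated file and, on the flagged rows, replaces any field matching the bad text so that pandas can later read it as NaN. A rough sketch under that assumption (illustrative only; the actual clean_cboe_file may differ):

# Illustrative sketch: blank out flagged fields on flagged rows.
import csv


def clean_bad_fields(input_file, output_file, bad_rows, bad_text):
    n_modified = 0
    with open(input_file, 'r', newline='') as fin, \
         open(output_file, 'w', newline='') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(next(reader))     # copy the header row unchanged
        for i, row in enumerate(reader, start=1):
            if i in bad_rows:
                row = ['' if field in bad_text else field for field in row]
                n_modified += 1
            writer.writerow(row)
    print('Number of modified rows', n_modified)
    return n_modified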
In [13]:
# =============================================================================
# Create a pandas data frame using the cleaned data
# =============================================================================
# Define columns that hold dates
parse_dates = ['quote_date', 'expiration']

# Read the concatenated csv data into a pandas Dataframe
df = pd.read_csv(
        clean_file,
        skiprows=0,
        delimiter=",",
        header=0,
        names=cboe.col_names,
        dtype=cboe.col_types,
        parse_dates=parse_dates,
        infer_datetime_format=True,
        )

# The raw monthly data files are in reverse order by date, so the first rows
# are from the end of the month and the last rows are from the start of the
# month.  Therefore, sort the data by quote date, expiration, and strike.
df.sort_values(['quote_date', 'expiration', 'strike'],
               ascending=[True, True, True],
               inplace=True)

# Save the Dataframe to a pickle
df.to_pickle(clean_pkl_file)
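Not part of the original run, but a quick sanity check on the saved pickle could look like this:

# Optional check: confirm the row count and date columns survive the
# pickle round trip.
check = pd.read_pickle(clean_pkl_file)
print(check.shape)
print(check['quote_date'].min(), check['quote_date'].max())
print(check.dtypes[['quote_date', 'expiration', 'strike']])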

1911_SPX Data Preprocessing

These are the new data I downloaded in November after the CBOE published some corrections to the historical data.

A few observations you will notice below:

  1. There are fewer lines in these data (12,763,145 vs. 13,057,831).
  2. There are no instances of "bad text" in the columns that should parse as floats.

In the next script we will review the data in greater detail.

In [14]:
# =============================================================================
# Set-up
# =============================================================================

# Location of the raw csv files
csv_dir = r'/Users/alexstephens/data/cboe/csv/1911_SPX'

# Location of the processed CBOE csv files
proc_dir = r"/Users/alexstephens/data/cboe/proc/1911_SPX"

# Define the concatenated raw file
proc_csv = r"cboe_spx_1911.csv"
proc_file = os.path.join(proc_dir, proc_csv)

# Define the cleaned, concatenated csv file
clean_csv = r"cboe_spx_1911_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)

# Define the cleaned, concatenated pickle file
clean_pkl = r"cboe_spx_1911_clean.pkl"
clean_pkl_file = os.path.join(proc_dir, clean_pkl)
In [15]:
# =============================================================================
# Count files / lines
# =============================================================================

# Echo the number of files / lines to concatenate
csv_list = cboe.get_cboe_csv_file_list(path=csv_dir)
csv_count = cboe.count_cboe_csv_file_lines(file_list=csv_list)
Counted lines for 0 files of 173
Counted lines for 10 files of 173
Counted lines for 20 files of 173
Counted lines for 30 files of 173
Counted lines for 40 files of 173
Counted lines for 50 files of 173
Counted lines for 60 files of 173
Counted lines for 70 files of 173
Counted lines for 80 files of 173
Counted lines for 90 files of 173
Counted lines for 100 files of 173
Counted lines for 110 files of 173
Counted lines for 120 files of 173
Counted lines for 130 files of 173
Counted lines for 140 files of 173
Counted lines for 150 files of 173
Counted lines for 160 files of 173
Counted lines for 170 files of 173
Total lines in 173 files = 12763145
In [16]:
# =============================================================================
# Concatenate the raw monthly CBOE csv files
# =============================================================================
cboe.concat_cboe_csv_files(csv_list=csv_list, output_file=proc_file)
Concatenating 173 files ...
In [17]:
# =============================================================================
# Scan the concatenated file for unexpected values in non-string fields
# =============================================================================
raw_bads, uniq_bads = cboe.check_cboe_file(cboe_csv_file=proc_file,
                                           cnames=cboe.col_names,
                                           ctypes=cboe.col_types)
Scanned 500000 rows ...
Scanned 1000000 rows ...
Scanned 1500000 rows ...
Scanned 2000000 rows ...
Scanned 2500000 rows ...
Scanned 3000000 rows ...
Scanned 3500000 rows ...
Scanned 4000000 rows ...
Scanned 4500000 rows ...
Scanned 5000000 rows ...
Scanned 5500000 rows ...
Scanned 6000000 rows ...
Scanned 6500000 rows ...
Scanned 7000000 rows ...
Scanned 7500000 rows ...
Scanned 8000000 rows ...
Scanned 8500000 rows ...
Scanned 9000000 rows ...
Scanned 9500000 rows ...
Scanned 10000000 rows ...
Scanned 10500000 rows ...
Scanned 11000000 rows ...
Scanned 11500000 rows ...
Scanned 12000000 rows ...
Scanned 12500000 rows ...
--------------------------------------------------------------------------------
Number of unique affected rows =  0
Unique offending columns =  set()
Unique offending field entries =  set()
--------------------------------------------------------------------------------
In [18]:
# =============================================================================
# Clean the identified bad fields (if any exist)
# =============================================================================
if uniq_bads['bad_rows']:
    print('Cleaning {} rows'.format(len(uniq_bads['bad_rows'])))
    num_bad_rows = cboe.clean_cboe_file(input_file=proc_file,
                                        output_file=clean_file,
                                        bad_rows=uniq_bads['bad_rows'],
                                        bad_text=uniq_bads['bad_text'])
else:
    print('No bad rows; Skipping <clean_cboe_file>')
No bad rows; Skipping <clean_cboe_file>
In [19]:
# =============================================================================
# Create a pandas data frame using the cleaned data
# =============================================================================
# Define columns that hold dates
parse_dates = ['quote_date', 'expiration']

# Read the concatenated csv data into a pandas Dataframe
df = pd.read_csv(
        clean_file,
        skiprows=0,
        delimiter=",",
        header=0,
        names=cboe.col_names,
        dtype=cboe.col_types,
        parse_dates=parse_dates,
        infer_datetime_format=True,
        )

# The raw monthly data files are in reverse order by date, so the first rows
# are from the end of the month and the last rows are from the start of the
# month.  Therefore, sort the data by quote date, expiration, and strike.
df.sort_values(['quote_date', 'expiration', 'strike'],
               ascending=[True, True, True],
               inplace=True)

# Save the Dataframe to a pickle
df.to_pickle(clean_pkl_file)
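As a small preview of the next script's review, one could load both cleaned pickles and compare row counts and quote-date ranges; a rough sketch:

# Illustrative comparison of the two cleaned data sets.
df_1905 = pd.read_pickle('/Users/alexstephens/data/cboe/proc/1905_SPX/cboe_spx_1905_clean.pkl')
df_1911 = pd.read_pickle('/Users/alexstephens/data/cboe/proc/1911_SPX/cboe_spx_1911_clean.pkl')

for label, d in [('1905', df_1905), ('1911', df_1911)]:
    print(label, len(d), d['quote_date'].min().date(), d['quote_date'].max().date())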
In [22]:
# Check the first few rows
df.head(4).transpose()
Out[22]:
92 93 94 95
underlying_symbol ^SPX ^SPX ^SPX ^SPX
quote_date 2004-01-02 00:00:00 2004-01-02 00:00:00 2004-01-02 00:00:00 2004-01-02 00:00:00
root SYG SYG SYG SYG
expiration 2004-01-17 00:00:00 2004-01-17 00:00:00 2004-01-17 00:00:00 2004-01-17 00:00:00
strike 600 600 625 625
option_type C P C P
open 0 0 0 0
high 0 0 0 0
low 0 0 0 0
close 0 0 0 0
trade_volume 0 0 0 0
bid_size_1545 1 0 1 0
bid_1545 506.5 0 481.5 0
ask_size_1545 1 10 1 10
ask_1545 508.5 0.05 483.5 0.05
underlying_bid_1545 1107.61 1107.61 1107.61 1107.61
underlying_ask_1545 1108.06 1108.06 1108.06 1108.06
implied_underlying_price_1545 1107.25 1107.25 1107.25 1107.25
active_underlying_price_1545 1107.8 1107.8 1107.8 1107.8
implied_volatility_1545 1.0222 0.9665 0.9432 0.9056
delta_1545 0.9992 -0.0004 0.9993 -0.0005
gamma_1545 0 0 0 0
theta_1545 -0.0359 -0.0117 -0.0332 -0.0116
vega_1545 0.0059 0.0034 0.0054 0.0036
rho_1545 23.0216 -0.0195 23.9857 -0.0207
bid_size_eod 1 0 1 0
bid_eod 509 0 484 0
ask_size_eod 1 10 1 10
ask_eod 511 0.05 486 0.05
underlying_bid_eod 1107.65 1107.65 1107.65 1107.65
underlying_ask_eod 1108.41 1108.41 1108.41 1108.41
vwap 0 0 0 0
open_interest 1 131 0 100
delivery_code NaN NaN NaN NaN
In [23]:
# Check the last few rows
df.tail(4).transpose()
Out[23]:
12762968 12762969 12762970 12762971
underlying_symbol ^SPX ^SPX ^SPX ^SPX
quote_date 2018-05-29 00:00:00 2018-05-29 00:00:00 2018-05-29 00:00:00 2018-05-29 00:00:00
root SPX SPX SPX SPX
expiration 2020-12-18 00:00:00 2020-12-18 00:00:00 2020-12-18 00:00:00 2020-12-18 00:00:00
strike 4000 4000 4100 4100
option_type C P C P
open 0 0 0 0
high 0 0 0 0
low 0 0 0 0
close 0 0 0 0
trade_volume 0 0 0 0
bid_size_1545 100 60 28 60
bid_1545 3.5 1171.5 1.45 1264
ask_size_1545 99 11 99 11
ask_1545 4.5 1187.8 3.5 1280.2
underlying_bid_1545 2688.58 2688.58 2688.58 2688.58
underlying_ask_1545 2689.25 2689.25 2689.25 2689.25
implied_underlying_price_1545 2575.72 2575.72 2575.72 2575.72
active_underlying_price_1545 2688.89 2688.89 2688.89 2688.89
implied_volatility_1545 0.1145 0.02 0.1127 0.02
delta_1545 0.024 -1 0.0158 -1
gamma_1545 0.0001 0 0.0001 0
theta_1545 -0.018 0 -0.0124 0
vega_1545 2.3226 0 1.633 0
rho_1545 147.596 -9625.96 97.9013 -9866.61
bid_size_eod 100 60 28 60
bid_eod 3.5 1165 1.75 1257.3
ask_size_eod 99 11 99 11
ask_eod 4.6 1181.9 3.6 1274.2
underlying_bid_eod 2665.37 2665.37 2665.37 2665.37
underlying_ask_eod 2716.89 2716.89 2716.89 2716.89
vwap 0 0 0 0
open_interest 442 34 501 3
delivery_code NaN NaN NaN NaN
