CBOE Data Preprocessing¶
Summary¶
The data preprocessing steps have been consolidated into a module called cboe. In addition, I now have two sets of data to process: one downloaded in May (housed in the 1905_SPX directory) and another, corrected version downloaded in November (housed in the 1911_SPX directory).
A cursory review of the data (i.e., a line count) shows the two sets of historical data are different. In this script I will preprocess both using the newly consolidated functions.
1905_SPX Data Preprocessing¶
Nothing should change here: only how we process the files has changed, so the contents of the output should be identical.
To confirm this was the case, I saved copies of my previously concatenated and preprocessed data files. After running through the steps below, I did a SHA-256 checksum test on both sets of files. They were identical, so moving the steps to functions (and tweaking the functions) doesn't appear to have modified the output.
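The checksum comparison can be done from Python with hashlib. The `sha256_of` helper below is a hypothetical stand-in, not part of the cboe module; the paths in the commented usage are placeholders.

```python
# Compute a file's SHA-256 digest in chunks, so large concatenated CSVs
# don't have to fit in memory.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# old = sha256_of('/path/to/saved/cboe_spx_1905_clean.csv')
# new = sha256_of('/path/to/proc/cboe_spx_1905_clean.csv')
# assert old == new
```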
(An aside on configuring Jupyter)¶
The first time I attempted to run the import cboe call below, I received a ModuleNotFoundError. I tried modifying JUPYTER_PATH in my config file; however, that doesn't work, because JUPYTER_PATH controls where Jupyter looks for its own data files (kernelspecs, extensions), not where Python looks for importable modules.
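A less invasive alternative to appending to sys.path in every notebook is to put the modules directory on PYTHONPATH before launching Jupyter, since PYTHONPATH (not JUPYTER_PATH) feeds Python's import search path. A sketch, using the modules directory that appears later in this notebook:

```shell
# Export PYTHONPATH so the Python kernel can find the local cboe module.
# The directory below is the modules path used later in this notebook.
export PYTHONPATH="/Users/alexstephens/Development/Python/Option_Data_Analysis/modules:$PYTHONPATH"
jupyter notebook
```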
# Import modules
import os
import pandas as pd
import cboe
# =============================================================================
# Set-up
# =============================================================================
# Location of the raw csv files
csv_dir = r'/Users/alexstephens/data/cboe/csv/1905_SPX'
# Echo the number of files / lines to concatenate
csv_list = cboe.get_cboe_csv_file_list(path=csv_dir)
csv_count = cboe.count_cboe_csv_file_lines(file_list=csv_list)
# Location of the processed CBOE csv files
proc_dir = r"/Users/alexstephens/data/cboe/proc/1905_SPX"
# Define the concatenated raw file
proc_csv = r"cboe_spx_1905.csv"
proc_file = os.path.join(proc_dir, proc_csv)
# Define the cleaned, concatenated csv file
clean_csv = r"cboe_spx_1905_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)
# Define the cleaned, concatenated pickle file
clean_pkl = r"cboe_spx_1905_clean.pkl"
clean_pkl_file = os.path.join(proc_dir, clean_pkl)
The following approach will work, but it is ugly.
# Import modules
import os
import sys
import pandas as pd
# Import local modules
local_module_path = '/Users/alexstephens/Development/Python/Option_Data_Analysis/modules'
sys.path.append(local_module_path)
import cboe
# =============================================================================
# Set-up
# =============================================================================
# Location of the raw csv files
csv_dir = r'/Users/alexstephens/data/cboe/csv/1905_SPX'
# Location of the processed CBOE csv files
proc_dir = r"/Users/alexstephens/data/cboe/proc/1905_SPX"
# Define the concatenated raw file
proc_csv = r"cboe_spx_1905.csv"
proc_file = os.path.join(proc_dir, proc_csv)
# Define the cleaned, concatenated csv file
clean_csv = r"cboe_spx_1905_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)
# Define the cleaned, concatenated pickle file
clean_pkl = r"cboe_spx_1905_clean.pkl"
clean_pkl_file = os.path.join(proc_dir, clean_pkl)
# Echo the number of files / lines to concatenate
csv_list = cboe.get_cboe_csv_file_list(path=csv_dir)
csv_count = cboe.count_cboe_csv_file_lines(file_list=csv_list)
# =============================================================================
# Concatenate the raw monthly CBOE csv files
# =============================================================================
cboe.concat_cboe_csv_files(csv_list=csv_list, output_file=proc_file)
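cboe.concat_cboe_csv_files lives in the module and its implementation is not shown here. A minimal sketch of the idea, assuming each monthly file carries the same header row (the function name below is a hypothetical stand-in):

```python
# Sketch: concatenate CSV files, keeping the header line from the first
# file only so it isn't repeated mid-file in the output.
def concat_csv_files(csv_list, output_file):
    with open(output_file, 'w') as out:
        for i, path in enumerate(csv_list):
            with open(path) as f:
                header = f.readline()
                if i == 0:
                    out.write(header)  # keep the first header only
                for line in f:
                    out.write(line)
```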
# =============================================================================
# Scan the concatenated file for unexpected values in non-string fields
# =============================================================================
raw_bads, uniq_bads = cboe.check_cboe_file(cboe_csv_file=proc_file,
                                           cnames=cboe.col_names,
                                           ctypes=cboe.col_types)
# =============================================================================
# Clean the identified bad fields (if any exist)
# =============================================================================
if uniq_bads['bad_rows']:
    print('Cleaning {} rows'.format(len(uniq_bads['bad_rows'])))
    num_bad_rows = cboe.clean_cboe_file(input_file=proc_file,
                                        output_file=clean_file,
                                        bad_rows=uniq_bads['bad_rows'],
                                        bad_text=uniq_bads['bad_text'])
else:
    # No cleaning needed, but the read below still expects clean_file to exist
    print('No bad rows; copying the concatenated file unchanged')
    import shutil
    shutil.copyfile(proc_file, clean_file)
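The check above flags values that fail to parse in columns that should be numeric. As a hedged sketch (the real cboe.check_cboe_file is defined in the module and not shown here; the helper name below is hypothetical), the core test might look like:

```python
# Sketch: find fields where a column that should parse as float holds
# stray text. Returns (row_index, column, value) triples that fail float().
def find_bad_float_fields(rows, float_cols):
    """rows: list of dicts; float_cols: column names expected to be numeric."""
    bad = []
    for i, row in enumerate(rows):
        for col in float_cols:
            try:
                float(row[col])
            except (TypeError, ValueError):
                bad.append((i, col, row[col]))
    return bad
```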
# =============================================================================
# Create a pandas data frame using the cleaned data
# =============================================================================
# Define columns that hold dates
parse_dates = ['quote_date', 'expiration']
# Read the concatenated csv data into a pandas Dataframe
df = pd.read_csv(
    clean_file,
    skiprows=0,
    delimiter=",",
    header=0,
    names=cboe.col_names,
    dtype=cboe.col_types,
    parse_dates=parse_dates,
    infer_datetime_format=True,
)
# The raw monthly data files are in reverse order by date, so the first rows
# are from the end of the month and the last rows are from the start of the
# month. Therefore, sort and reindex the data.
df.sort_values(['quote_date', 'expiration', 'strike'],
               ascending=[True, True, True],
               inplace=True)
df.reset_index(drop=True, inplace=True)
# Save the Dataframe to a pickle
df.to_pickle(clean_pkl_file)
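A quick sanity check (a hypothetical helper, not part of the cboe module) can confirm the sort took effect and the pickle round-trips to an identical frame:

```python
# Verify the frame is sorted by quote_date and that the pickled copy
# on disk is identical to the in-memory frame.
import pandas as pd

def check_sorted_and_pickled(df, pkl_path):
    assert df['quote_date'].is_monotonic_increasing
    assert df.equals(pd.read_pickle(pkl_path))
```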
1911_SPX Data Preprocessing¶
These are the new data I downloaded in November after the CBOE published some corrections to the historical data.
Some observations that you will notice below:
- There are fewer lines in these data (12,763,145 vs. 13,057,831)
- There are no instances of "bad text" appearing in a column that should be formatted as float
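To localize where the difference of 294,686 lines comes from, the per-directory totals can be compared with a sketch like the following (hypothetical helpers; the cboe module's own counting function is what the cells below actually use):

```python
# Sketch: total the line counts of all CSV files in a directory, so two
# downloads can be compared file by file or in aggregate.
import os

def count_lines(path):
    with open(path, 'rb') as f:
        return sum(1 for _ in f)

def dir_line_count(csv_dir):
    return sum(count_lines(os.path.join(csv_dir, name))
               for name in sorted(os.listdir(csv_dir))
               if name.endswith('.csv'))
```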
In the next script we will review the data in greater detail.
# =============================================================================
# Set-up
# =============================================================================
# Location of the raw csv files
csv_dir = r'/Users/alexstephens/data/cboe/csv/1911_SPX'
# Location of the processed CBOE csv files
proc_dir = r"/Users/alexstephens/data/cboe/proc/1911_SPX"
# Define the concatenated raw file
proc_csv = r"cboe_spx_1911.csv"
proc_file = os.path.join(proc_dir, proc_csv)
# Define the cleaned, concatenated csv file
clean_csv = r"cboe_spx_1911_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)
# Define the cleaned, concatenated pickle file
clean_pkl = r"cboe_spx_1911_clean.pkl"
clean_pkl_file = os.path.join(proc_dir, clean_pkl)
# =============================================================================
# Count files / lines
# =============================================================================
# Echo the number of files / lines to concatenate
csv_list = cboe.get_cboe_csv_file_list(path=csv_dir)
csv_count = cboe.count_cboe_csv_file_lines(file_list=csv_list)
# =============================================================================
# Concatenate the raw monthly CBOE csv files
# =============================================================================
cboe.concat_cboe_csv_files(csv_list=csv_list, output_file=proc_file)
# =============================================================================
# Scan the concatenated file for unexpected values in non-string fields
# =============================================================================
raw_bads, uniq_bads = cboe.check_cboe_file(cboe_csv_file=proc_file,
                                           cnames=cboe.col_names,
                                           ctypes=cboe.col_types)
# =============================================================================
# Clean the identified bad fields (if any exist)
# =============================================================================
if uniq_bads['bad_rows']:
    print('Cleaning {} rows'.format(len(uniq_bads['bad_rows'])))
    num_bad_rows = cboe.clean_cboe_file(input_file=proc_file,
                                        output_file=clean_file,
                                        bad_rows=uniq_bads['bad_rows'],
                                        bad_text=uniq_bads['bad_text'])
else:
    # No cleaning needed, but the read below still expects clean_file to exist
    print('No bad rows; copying the concatenated file unchanged')
    import shutil
    shutil.copyfile(proc_file, clean_file)
# =============================================================================
# Create a pandas data frame using the cleaned data
# =============================================================================
# Define columns that hold dates
parse_dates = ['quote_date', 'expiration']
# Read the concatenated csv data into a pandas Dataframe
df = pd.read_csv(
    clean_file,
    skiprows=0,
    delimiter=",",
    header=0,
    names=cboe.col_names,
    dtype=cboe.col_types,
    parse_dates=parse_dates,
    infer_datetime_format=True,
)
# The raw monthly data files are in reverse order by date, so the first rows
# are from the end of the month and the last rows are from the start of the
# month. Therefore, sort and reindex the data.
df.sort_values(['quote_date', 'expiration', 'strike'],
               ascending=[True, True, True],
               inplace=True)
df.reset_index(drop=True, inplace=True)
# Save the Dataframe to a pickle
df.to_pickle(clean_pkl_file)
# Check the first few rows
df.head(4).transpose()
# Check the last few rows
df.tail(4).transpose()