Creating and exploring a pandas DataFrame of the CBOE data
Overview
The purpose of this post is to load the cleaned-up set of CBOE data into a pandas DataFrame, then poke around and see if anything anomalous appears in the data. Really, this is just an attempt to learn a little more about the pandas syntax.
I'll start by identifying where the cleaned data reside.
import os
import pandas as pd
# Alter display settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# Directory & file that houses the concatenated CBOE csv data
proc_dir = r"/Users/alexstephens/data/cboe/proc"
clean_csv = r"cboe_mmyy_all_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)
The next step highlights that I really need to define a module for common CBOE parameters.
# CBOE column names
col_names = (
'underlying_symbol',
'quote_date',
'root',
'expiration',
'strike',
'option_type',
'open',
'high',
'low',
'close',
'trade_volume',
'bid_size_1545',
'bid_1545',
'ask_size_1545',
'ask_1545',
'underlying_bid_1545',
'underlying_ask_1545',
'implied_underlying_price_1545',
'active_underlying_price_1545',
'implied_volatility_1545',
'delta_1545',
'gamma_1545',
'theta_1545',
'vega_1545',
'rho_1545',
'bid_size_eod',
'bid_eod',
'ask_size_eod',
'ask_eod',
'underlying_bid_eod',
'underlying_ask_eod',
'vwap',
'open_interest',
'delivery_code',
)
# Assumed CBOE column types
col_types = {
'underlying_symbol': str, # String
'quote_date': str, # Date
'root': str, # String
'expiration': str, # Date
'strike': float, # Float
'option_type': str, # String
'open': float, # Float
'high': float, # Float
'low': float, # Float
'close': float, # Float
'trade_volume': float, # Float
'bid_size_1545': float, # Float
'bid_1545': float, # Float
'ask_size_1545': float, # Float
'ask_1545': float, # Float
'underlying_bid_1545': float, # Float
'underlying_ask_1545': float, # Float
'implied_underlying_price_1545': float, # Float
'active_underlying_price_1545': float, # Float
'implied_volatility_1545': float, # Float
'delta_1545': float, # Float
'gamma_1545': float, # Float
'theta_1545': float, # Float
'vega_1545': float, # Float
'rho_1545': float, # Float
'bid_size_eod': float, # Float
'bid_eod': float, # Float
'ask_size_eod': float, # Float
'ask_eod': float, # Float
'underlying_bid_eod': float, # Float
'underlying_ask_eod': float, # Float
'vwap': float, # Float
'open_interest': float, # Float
'delivery_code': str, # String
}
# Define columns that hold dates
parse_dates = ['quote_date', 'expiration']
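As a stopgap until that module exists, a couple of derived column groupings are cheap to build from the constants above. These helper lists are my own additions, not CBOE definitions:
# Derived column groupings (my own helpers, not CBOE metadata)
greek_cols = ['delta_1545', 'gamma_1545', 'theta_1545', 'vega_1545', 'rho_1545']
snap_cols = [c for c in col_names if c.endswith('_1545')]  # 15:45 snapshot fields
eod_cols = [c for c in col_names if c.endswith('_eod')]    # end-of-day fields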
Now I will load the data into a DataFrame. Since we have already cleaned up some of the known error flags, I will not pass any additional na-type flags into the function. Also, I am going to parse the 'quote_date' and 'expiration' fields as datetimes.
# Read the concatenated csv data into a pandas DataFrame
df = pd.read_csv(
clean_file,
skiprows=0,
delimiter=",",
header=0,
names=col_names,
dtype=col_types,
parse_dates=parse_dates,
infer_datetime_format=True,
)
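Before digging in, here is a quick sanity check that the load behaved: just the row/column counts and a dtype breakdown, nothing rigorous.
# Quick sanity check: frame dimensions and dtype breakdown
print(df.shape)
print(df.dtypes.value_counts())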
Data Review
Now let's explore some of the column data to see if anything looks peculiar. I will treat date, string, and numeric data separately.
I will start with the dates. The historical data span 01/01/2004 through 05/31/2018, so the quote_date values had better fall within that range.
I'm using the describe() method (rotated using the transpose() method) to display the date ranges.
Overall, the date summaries look 'ok'. The quote_date range is consistent with our expectations. The expirations span a wider range because certain LEAP options extend approximately two years beyond the last quote date (here: May 2018). The last cycle in the data appears to be the Dec 2020 LEAP.
# Summarize date columns
df[['quote_date', 'expiration']].describe().transpose()
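Given the subscription window quoted above, an explicit bounds check on quote_date is cheap insurance. The bounds below simply restate that window:
# Explicit bounds check on quote_date (per the 01/2004-05/2018 window above)
assert df['quote_date'].min() >= pd.Timestamp('2004-01-01')
assert df['quote_date'].max() <= pd.Timestamp('2018-05-31')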
Next we will review the string objects. I'm using the include=['object'] keyword pair to isolate string column types.
There is only one underlying (SPX), and the underlying_symbol column confirms that is the case.
We would expect there to be only two option types: calls and puts. The value_counts() tabulation confirms this.
The root field will require additional research. I do not know how to translate each and every root into an option series. I am fairly certain that SPXW represents weekly options, SPX represents the standard (i.e., 3rd Friday) expiration cycle, and SPXQ represents quarterly expirations. However, I have no clue what, say, the JXE expirations represent. I will address that in a subsequent post.
The delivery_code field seems to be mostly populated by zeros, so I do not know what to make of it at the moment. The CBOE does not list it as a column in its metadata summary. I may just drop it from the data set in an upcoming post.
# Summarize string columns
df.describe(include=['object']).transpose()
# Tabulate the option types
df['option_type'].value_counts()
# Tabulate the delivery code
df['delivery_code'].value_counts()
# Tabulate the root
df['root'].value_counts()
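As a first stab at decoding the roots, tabulating the weekday of each root's expirations might help. This is a heuristic of mine, not a CBOE mapping, and note that some older series may carry Saturday expiration dates:
# Heuristic: weekday of each root's expirations (0=Mon ... 6=Sun)
# Fridays (4) should dominate for SPX/SPXW; other patterns may hint
# at what the unfamiliar roots represent
(df.drop_duplicates(['root', 'expiration'])
   .assign(exp_weekday=lambda d: d['expiration'].dt.dayofweek)
   .groupby('root')['exp_weekday']
   .value_counts())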
Now let's move on to the numeric data. In this case I'm using the include='number' keyword pair to isolate columns with a numeric type.
A couple of initial observations based on the data shown below:
The vwap field is clearly odd. It has infinite variance, and extrema spanning -1e299 to 1e294. We will need to review that column in detail. It is clearly screwed up.
In the greeks we see that the special value chosen during clean-up is popping up as the minimum. So the replacements appear to have cleared out all of the error codes (otherwise I'd expect a type error if we tried to compute statistics for those columns).
There are lots of zeros in the open/high/low/close fields. This is not surprising, since many of the away-from-the-money strikes will be illiquid (i.e., rarely traded).
It is not clear why the underlying would have a minimum value of zero. Perhaps these rows correspond to the periods where the greeks errored out? I will need to investigate further.
The greeks look superficially sensible, e.g., gamma and vega are > 0.
The open interest is odd as well; there should not be any negative values in that field.
One more observation: the monthly data files appear to be ordered with the most recent data at the top of the file and the oldest observations at the bottom. So the data frame as a whole is not ordered by date. This is not an issue, but at this point we cannot assume that the rows are sorted by quote date in either direction. A few quick checks on these points follow the numeric summary below.
# Summarize numeric columns
df.describe(include='number').transpose()
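Here are those quick checks. The 1e6 cutoff for flagging vwap values is an arbitrary choice of mine:
# Quick follow-ups on the anomalies noted above
print((df['vwap'].abs() > 1e6).sum())    # count of absurd vwap values (1e6 cutoff is arbitrary)
print((df['open_interest'] < 0).sum())   # count of negative open interest rows
print(df['quote_date'].is_monotonic_increasing,
      df['quote_date'].is_monotonic_decreasing)  # sorted by date in either direction?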
I want to do a couple of side-by-side comparisons as a quick check of the 1545 and EOD fields. These should have similar ranges, but it is worth verifying that assumption. If there is a material difference, I will need to research it further.
# Compare the 1545 and EOD bid quotes
df[['bid_1545','bid_eod']].describe(include='number').transpose()
# Compare the 1545 and EOD bid sizes
df[['bid_size_1545','bid_size_eod']].describe(include='number').transpose()
# Compare the 1545 and EOD ask quotes
df[['ask_1545','ask_eod']].describe(include='number').transpose()
# Compare the 1545 and EOD ask sizes
df[['ask_size_1545','ask_size_eod']].describe(include='number').transpose()
The ranges appear to be fairly similar. There isn't anything in these numbers that suggests a "vwap-like" anomaly.
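To go one step beyond the side-by-side ranges, the elementwise differences between the EOD and 15:45 quotes give a more direct read. This is just a rough check:
# Elementwise differences between the EOD and 15:45 bid/ask quotes
diffs = pd.DataFrame({
    'bid_diff': df['bid_eod'] - df['bid_1545'],
    'ask_diff': df['ask_eod'] - df['ask_1545'],
})
diffs.describe().transpose()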
Ultimately, I will probably drop the end-of-day (eod) columns, since the CBOE states that the 15:45 data are more reliable. But I'll do that at a later date.
For now, I will save the data frame to disk.
# Save the DataFrame to a pickle
proc_pkl = r"cboe_mmyy_all_clean_df.pkl"
df.to_pickle(os.path.join(proc_dir, proc_pkl))
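And a quick round-trip check that the pickle reloads faithfully, just for belt and suspenders:
# Verify the pickle round-trips without loss
df_check = pd.read_pickle(os.path.join(proc_dir, proc_pkl))
assert df_check.equals(df)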