Creating and exploring a pandas DataFrame of the CBOE data
Overview
The purpose of this post is to load the cleaned-up set of CBOE data into a pandas DataFrame, then poke around and see if anything anomalous appears in the data. Really, this is just an attempt to learn a little more about the pandas syntax.
I'll start by identifying where the cleaned data reside.
import os
import pandas as pd
# Alter display settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# Directory & file that houses the concatenated CBOE csv data
proc_dir = r"/Users/alexstephens/data/cboe/proc"
clean_csv = r"cboe_mmyy_all_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)
The next step highlights that I really need to define a module for common CBOE parameters.
# CBOE column names
col_names = (
'underlying_symbol',
'quote_date',
'root',
'expiration',
'strike',
'option_type',
'open',
'high',
'low',
'close',
'trade_volume',
'bid_size_1545',
'bid_1545',
'ask_size_1545',
'ask_1545',
'underlying_bid_1545',
'underlying_ask_1545',
'implied_underlying_price_1545',
'active_underlying_price_1545',
'implied_volatility_1545',
'delta_1545',
'gamma_1545',
'theta_1545',
'vega_1545',
'rho_1545',
'bid_size_eod',
'bid_eod',
'ask_size_eod',
'ask_eod',
'underlying_bid_eod',
'underlying_ask_eod',
'vwap',
'open_interest',
'delivery_code',
)
# Assumed CBOE column types
col_types = {
'underlying_symbol': str, # String
'quote_date': str, # Date
'root': str, # String
'expiration': str, # Date
'strike': float, # Float
'option_type': str, # String
'open': float, # Float
'high': float, # Float
'low': float, # Float
'close': float, # Float
'trade_volume': float, # Float
'bid_size_1545': float, # Float
'bid_1545': float, # Float
'ask_size_1545': float, # Float
'ask_1545': float, # Float
'underlying_bid_1545': float, # Float
'underlying_ask_1545': float, # Float
'implied_underlying_price_1545': float, # Float
'active_underlying_price_1545': float, # Float
'implied_volatility_1545': float, # Float
'delta_1545': float, # Float
'gamma_1545': float, # Float
'theta_1545': float, # Float
'vega_1545': float, # Float
'rho_1545': float, # Float
'bid_size_eod': float, # Float
'bid_eod': float, # Float
'ask_size_eod': float, # Float
'ask_eod': float, # Float
'underlying_bid_eod': float, # Float
'underlying_ask_eod': float, # Float
'vwap': float, # Float
'open_interest': float, # Float
'delivery_code': str, # String
}
# Define columns that hold dates
parse_dates = ['quote_date', 'expiration']
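As a stopgap until that module exists, a couple of derived column groupings are cheap to build from the constants above. These helper lists are my own additions, not CBOE definitions:
# Derived column groupings (my own helpers, not CBOE metadata)
greek_cols = ['delta_1545', 'gamma_1545', 'theta_1545', 'vega_1545', 'rho_1545']
snap_cols = [c for c in col_names if c.endswith('_1545')]  # 15:45 snapshot fields
eod_cols = [c for c in col_names if c.endswith('_eod')]    # end-of-day fields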
Now I will load the data into a DataFrame. Since we have already cleaned up some of the known error flags, I will not pass any additional na-type flags into the function. Also, I am going to parse the 'quote_date' and 'expiration' fields as datetimes.
# Read the concatenated csv data into a pandas DataFrame
df = pd.read_csv(
clean_file,
skiprows=0,
delimiter=",",
header=0,
names=col_names,
dtype=col_types,
parse_dates=parse_dates,
infer_datetime_format=True,
)
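Before digging in, here is a quick sanity check that the load behaved: just the row/column counts and a dtype breakdown, nothing rigorous.
# Quick sanity check: frame dimensions and dtype breakdown
print(df.shape)
print(df.dtypes.value_counts())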
Data Review
Now let's explore some of the column data to see if anything looks peculiar. I will treat date, string, and numeric data separately.
I will start with the dates. The historical data span 01/01/2004 through 05/31/2018, so the quote_date values had better fall within that range.
I'm using the describe() method (rotated using the transpose() method) to display the date ranges.
Overall, the date summaries look 'ok'. The quote_date range is consistent with our expectations. The expirations span a wider range because certain LEAP options extend approximately two years beyond the last quote date (here: May 2018). The last cycle in the data appears to be the Dec 2020 LEAP.
# Summarize date columns
df[['quote_date', 'expiration']].describe().transpose()
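Given the subscription window quoted above, an explicit bounds check on quote_date is cheap insurance. The bounds below simply restate that window:
# Explicit bounds check on quote_date (per the 01/2004-05/2018 window above)
assert df['quote_date'].min() >= pd.Timestamp('2004-01-01')
assert df['quote_date'].max() <= pd.Timestamp('2018-05-31')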
Next we will review the string objects. I'm using the include=['object'] keyword pair to isolate string column types.
There is only one underlying (SPX), and the underlying_symbol column confirms that is the case.
We would expect there to be only two option types: calls and puts. The value_counts() tabulation confirms this.
The root field will require additional research. I do not know how to translate each and every root into an option series. I am fairly certain that SPXW represents weekly options, SPX represents the standard (i.e., 3rd Friday) expiration cycle, and SPXQ represents quarterly expirations. However, I have no clue what, say, the JXE expirations represent. I will address that in a subsequent post.
The delivery_code field seems to be mostly populated by zeros, so I do not know what to make of it at the moment. The CBOE does not list it as a column in its metadata summary. I may just drop it from the data set in an upcoming post.
# Summarize string columns
df.describe(include=['object']).transpose()
# Tabulate the option types
df['option_type'].value_counts()
# Tabulate the delivery code
df['delivery_code'].value_counts()
# Tabulate the root
df['root'].value_counts()
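As a first stab at decoding the roots, tabulating the weekday of each root's expirations might help. This is a heuristic of mine, not a CBOE mapping, and note that some older series may carry Saturday expiration dates:
# Heuristic: weekday of each root's expirations (0=Mon ... 6=Sun)
# Fridays (4) should dominate for SPX/SPXW; other patterns may hint
# at what the unfamiliar roots represent
(df.drop_duplicates(['root', 'expiration'])
   .assign(exp_weekday=lambda d: d['expiration'].dt.dayofweek)
   .groupby('root')['exp_weekday']
   .value_counts())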
Now let's move on to the numeric data. In this case I'm using the include='number' keyword pair to isolate columns with a numeric type.
A couple of initial observations based on the data shown below:
The vwap field is clearly odd. It has infinite variance, and extrema spanning -1e299 to 1e294. We will need to review that column in detail. It is clearly screwed up.
In the greeks we see that the special value chosen during clean-up is popping up as the minimum. So the replacements appear to have cleared out all of the error codes (otherwise I'd expect a type error if we tried to compute statistics for those columns).
There are lots of zeros in the open/high/low/close fields. This is not surprising, since many of the away-from-the-money strikes will be illiquid (i.e., rarely traded).
It is not clear why the underlying would have a minimum value of zero. Perhaps these rows correspond to the periods where the greeks errored out? I will need to investigate further.
The greeks look superficially sensible, e.g., gamma and vega are > 0.
The open interest is odd as well; there should not be any negative values in that field.
One more observation: the monthly data files appear to be ordered with the most recent data at the top of the file and the oldest observations at the bottom. So the data frame as a whole is not ordered by date. This is not an issue, but at this point we cannot assume that the rows are sorted by quote date in either direction. A few quick checks on these points follow the numeric summary below.
# Summarize numeric columns
df.describe(include='number').transpose()
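Here are those quick checks. The 1e6 cutoff for flagging vwap values is an arbitrary choice of mine:
# Quick follow-ups on the anomalies noted above
print((df['vwap'].abs() > 1e6).sum())    # count of absurd vwap values (1e6 cutoff is arbitrary)
print((df['open_interest'] < 0).sum())   # count of negative open interest rows
print(df['quote_date'].is_monotonic_increasing,
      df['quote_date'].is_monotonic_decreasing)  # sorted by date in either direction?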
I want to do a couple of side-by-side comparisons as a quick check of the 1545 and EOD fields. These should have similar ranges, but it is worth verifying that assumption. If there is a material difference, I will need to research it further.
# Compare the 1545 and EOD bid quotes
df[['bid_1545','bid_eod']].describe(include='number').transpose()
# Compare the 1545 and EOD bid sizes
df[['bid_size_1545','bid_size_eod']].describe(include='number').transpose()
# Compare the 1545 and EOD ask quotes
df[['ask_1545','ask_eod']].describe(include='number').transpose()
# Compare the 1545 and EOD ask sizes
df[['ask_size_1545','ask_size_eod']].describe(include='number').transpose()
The ranges appear to be fairly similar. There isn't anything in these numbers that suggests a "vwap-like" anomaly.
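To go one step beyond the side-by-side ranges, the elementwise differences between the EOD and 15:45 quotes give a more direct read. This is just a rough check:
# Elementwise differences between the EOD and 15:45 bid/ask quotes
diffs = pd.DataFrame({
    'bid_diff': df['bid_eod'] - df['bid_1545'],
    'ask_diff': df['ask_eod'] - df['ask_1545'],
})
diffs.describe().transpose()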
Ultimately, I will probably drop the end-of-day (eod) columns, since the CBOE states that the 15:45 data are more reliable. But I'll do that at a later date.
For now, I will save the data frame to disk.
# Save the DataFrame to a pickle
proc_pkl = r"cboe_mmyy_all_clean_df.pkl"
df.to_pickle(os.path.join(proc_dir, proc_pkl))
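And a quick round-trip check that the pickle reloads faithfully, just for belt and suspenders:
# Verify the pickle round-trips without loss
df_check = pd.read_pickle(os.path.join(proc_dir, proc_pkl))
assert df_check.equals(df)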