Friday, November 1, 2019

004 - Creating and Exploring a Pandas Dataframe

04_cboe_create_df

Creating and exploring a pandas dataframe of the CBOE data

Overview

The purpose of this post is to load the cleaned-up set of CBOE data into a Pandas dataframe. Then I'll poke around and see if anything anomalous appears in the data. Really, this is just an attempt to learn a little more about the Pandas syntax.

I'll start by identifying where the cleaned data reside.

In [11]:
import os
import pandas as pd

# Alter display settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Directory & file that houses the concatenated CBOE csv data
proc_dir = r"/Users/alexstephens/data/cboe/proc"
clean_csv = r"cboe_mmyy_all_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)

The next step highlights that I really need to define a module for common CBOE parameters.

In [12]:
# CBOE column names
col_names = (
         'underlying_symbol',
         'quote_date',
         'root',
         'expiration',
         'strike',
         'option_type',
         'open',
         'high',
         'low',
         'close',
         'trade_volume',
         'bid_size_1545',
         'bid_1545',
         'ask_size_1545',
         'ask_1545',
         'underlying_bid_1545',
         'underlying_ask_1545',
         'implied_underlying_price_1545',
         'active_underlying_price_1545',
         'implied_volatility_1545',
         'delta_1545',
         'gamma_1545',
         'theta_1545',
         'vega_1545',
         'rho_1545',
         'bid_size_eod',
         'bid_eod',
         'ask_size_eod',
         'ask_eod',
         'underlying_bid_eod',
         'underlying_ask_eod',
         'vwap',
         'open_interest',
         'delivery_code',
)

# Assumed CBOE column types
col_types = {
         'underlying_symbol': str,                      # String
         'quote_date': str,                             # Date
         'root': str,                                   # String
         'expiration': str,                             # Date
         'strike': float,                               # Float
         'option_type': str,                            # String
         'open': float,                                 # Float
         'high': float,                                 # Float
         'low': float,                                  # Float
         'close': float,                                # Float
         'trade_volume': float,                         # Float
         'bid_size_1545': float,                        # Float
         'bid_1545': float,                             # Float
         'ask_size_1545': float,                        # Float
         'ask_1545': float,                             # Float
         'underlying_bid_1545': float,                  # Float
         'underlying_ask_1545': float,                  # Float
         'implied_underlying_price_1545': float,        # Float
         'active_underlying_price_1545': float,         # Float
         'implied_volatility_1545': float,              # Float
         'delta_1545': float,                           # Float
         'gamma_1545': float,                           # Float
         'theta_1545': float,                           # Float
         'vega_1545': float,                            # Float
         'rho_1545': float,                             # Float
         'bid_size_eod': float,                         # Float
         'bid_eod': float,                              # Float
         'ask_size_eod': float,                         # Float
         'ask_eod': float,                              # Float
         'underlying_bid_eod': float,                   # Float
         'underlying_ask_eod': float,                   # Float
         'vwap': float,                                 # Float
         'open_interest': float,                        # Float
         'delivery_code': str,                          # String
}


# Define columns that hold dates
parse_dates = ['quote_date', 'expiration']
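As noted above, these parameters really belong in a shared module so that other notebooks can import them rather than redefine them. A minimal sketch of what that might look like (the module name `cboe_params` and the constant names are assumptions, not existing code; the full tuples are elided here):

```python
# Hypothetical cboe_params.py -- a single home for the column names,
# dtypes, and date columns defined in the cell above, so other
# notebooks can do: from cboe_params import COL_NAMES, COL_TYPES, PARSE_DATES

COL_NAMES = (
    'underlying_symbol',
    'quote_date',
    'expiration',
    # ... remaining CBOE columns, as in the cell above ...
)

COL_TYPES = {
    'underlying_symbol': str,   # String
    'quote_date': str,          # Date (parsed separately)
    'expiration': str,          # Date (parsed separately)
    # ... remaining dtypes, as in the cell above ...
}

PARSE_DATES = ['quote_date', 'expiration']
```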

Now I will load the data into a dataframe. Since we have already cleaned up some of the known error flags, I will not pass any additional na-type flags into the function. Also, I am going to parse the 'quote_date' and 'expiration' fields as datetimes.

In [13]:
# Read the concatenated csv data into a pandas Dataframe
df = pd.read_csv(
        clean_file,
        skiprows=0,
        delimiter=",",
        header=0,
        names=col_names,
        dtype=col_types,
        parse_dates=parse_dates,
        infer_datetime_format=True,
        )

Data Review

Now let's explore some of the column data to see if anything looks peculiar. I will treat date, string, and numeric data separately.

I will start with the dates. The historical data span 01/01/2004 through 05/31/2018, so the quote_date values had better fall within that range.

I'm using the describe() method (rotated using the transpose() method) to display the date ranges.

Overall, the date summaries look 'ok'. The 'quote_date' range is consistent with our expectations. The expirations span a wider range because certain LEAP options will extend approximately 2 years out from the last quote date (here: May 2018). The last cycle in the data appears to be the Dec 2020 LEAP.

In [14]:
# Summarize date columns
df[['quote_date', 'expiration']].describe().transpose()
Out[14]:
count unique top freq first last
quote_date 13057658 3626 2018-03-26 00:00:00 13700 2004-01-02 00:00:00 2018-05-29 00:00:00
expiration 13057658 960 2017-12-15 00:00:00 206696 2004-01-17 00:00:00 2020-12-18 00:00:00
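The eyeball check above can also be done programmatically. A quick sketch on a toy frame standing in for the real one (the real dataframe isn't reproduced here):

```python
import pandas as pd

# Toy frame mimicking the date columns of the real dataframe.
df = pd.DataFrame({
    'quote_date': pd.to_datetime(['2004-01-02', '2018-05-29']),
    'expiration': pd.to_datetime(['2004-01-17', '2020-12-18']),
})

# quote_date must stay inside the historical window (inclusive bounds)
lo, hi = pd.Timestamp('2004-01-01'), pd.Timestamp('2018-05-31')
in_range = df['quote_date'].between(lo, hi).all()
print(bool(in_range))  # True
```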

Next we will review the string objects. I'm using the include=['object'] keyword pair to isolate string column types.

There is only one underlying (SPX), and the underlying_symbol column confirms that is the case.

We would expect there to be only two option types: calls and puts. The value_counts() tabulation confirms this.

The root field will require additional research. I do not know how to translate each-and-every root into an option series. I am fairly certain that SPXW represents weekly options, SPX represents the standard (i.e., 3rd Friday) expiration cycle, and SPXQ represents quarterly expirations. However, I have no clue what, say, the JXE expirations represent. I will address that in a subsequent post.

The delivery code seems to be mostly populated by zeros. So, I do not know what to think of this code at the moment. The CBOE does not list it as a column in its metadata summary. I may just drop it from the data set in an upcoming post.

In [15]:
# Summarize string columns
df.describe(include=['object']).transpose()
Out[15]:
count unique top freq
underlying_symbol 13057658 1 ^SPX 13057658
root 13057658 46 SPXW 6006050
option_type 13057658 2 c 6528843
delivery_code 13027824 2 0 12857578
In [16]:
# Tabulate the option types
df['option_type'].value_counts()
Out[16]:
c    6528843
p    6528815
Name: option_type, dtype: int64
In [17]:
# Tabulate the delivery code
df['delivery_code'].value_counts()
Out[17]:
0       12857578
$100      170246
Name: delivery_code, dtype: int64
In [18]:
# Tabulate the root
df['root'].value_counts()
Out[18]:
SPXW     6006050
SPX      4791910
SPXQ      372106
BSZ       190474
SZP       161795
SPT       158242
SXY       130466
SPQ       125913
SXB       108431
SXZ        98516
SPZ        78715
SXM        66582
SXG        63580
SYG        62684
SPXPM      59798
SZT        55672
BSK        47643
SPL        38712
SZJ        35213
SRO        35075
SPB        33886
SYU        31545
SZV        30052
SZQ        27039
SYZ        25924
SVP        24092
QSE        19059
SAQ        18069
SLQ        17414
SPV        16670
SQP        15591
QZQ        15581
QSZ        13313
SZD        12216
BSF        12078
SZU        12070
SYF         6568
SKQ         6163
JXB         5828
JXA         5468
JXD         5452
SYV         4910
SXJ         3558
VSTRP       2782
SQG         2760
JXE         1993
Name: root, dtype: int64

Now let's move onto the numeric data. In this case I'm using the include='number' keyword pair to isolate columns with a numeric type.

A couple of initial observations based on the data shown below:

  • The vwap field is clearly odd. It has infinite variance, and extrema spanning roughly -1.7e305 to 1.4e294. We will need to review that column in detail. It is clearly screwed-up.

  • In the greeks, we see that the chosen Special Value (-999999) is popping up as the minimum. So the replacements appear to have cleared out all of the error codes (otherwise I'd expect a type error if we tried to compute statistics for those columns).

  • There are lots of zeros in the open/high/low/close fields. This is not surprising, since a lot of the away-from-the-money strikes will be illiquid (i.e., rarely traded).

  • It is not clear why the underlying would have a minimum value of zero. Perhaps these correspond to the periods where the greeks errored-out? I will need to investigate further.

  • The greeks look superficially sensible. E.g., the gamma and vega quartiles are > 0.

  • The open interest is odd as well. There should not be any negative values in that field.

Another observation: the monthly data files appear to be ordered such that the most recent data are at the top of the file and the oldest observations are at the bottom. So the data frame as a whole is not sorted by date. This is not an issue, but at this point we cannot assume that the data are in any particular date order.
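A first pass at flagging the vwap and open_interest anomalies noted above might look like this. This is a sketch on toy rows that mimic the summary table, since the real frame isn't reproduced here; the 1e5 price band is an arbitrary "generous" bound, not a documented CBOE limit:

```python
import pandas as pd

# Toy rows mimicking the anomalies seen above: a wildly out-of-range
# vwap and a negative open_interest, mixed with plausible values.
df = pd.DataFrame({
    'vwap': [0.15, -1.66e305, 12.5],
    'open_interest': [924.0, 19.0, -1.3254e9],
})

# Flag rows whose vwap falls outside a generous plausible price band,
# or whose open interest is negative -- both should be impossible.
bad_vwap = ~df['vwap'].between(0, 1e5)
bad_oi = df['open_interest'] < 0
print(bad_vwap.sum(), bad_oi.sum())  # 1 1
```

The flagged rows (`df[bad_vwap | bad_oi]`) could then be inspected in detail in a follow-up post.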

In [19]:
# Summarize numeric columns
df.describe(include='number').transpose()
Out[19]:
count mean std min 25% 50% 75% max
strike 13057658.0 1.695322e+03 6.127082e+02 5.000000e+00 1250.000000 1700.000000 2150.000000 1.320000e+04
open 13057658.0 9.748849e+00 5.173222e+01 0.000000e+00 0.000000 0.000000 0.150000 4.482300e+03
high 13057658.0 1.005680e+01 5.276826e+01 0.000000e+00 0.000000 0.000000 0.150000 9.105000e+03
low 13057658.0 9.485772e+00 5.150621e+01 0.000000e+00 0.000000 0.000000 0.100000 2.751000e+03
close 13057658.0 9.752279e+00 5.175270e+01 0.000000e+00 0.000000 0.000000 0.150000 3.029000e+03
trade_volume 13057658.0 1.623325e+03 5.556798e+04 0.000000e+00 0.000000 0.000000 1.000000 1.576207e+07
bid_size_1545 13057658.0 1.221540e+02 1.907005e+02 0.000000e+00 11.000000 55.000000 150.000000 2.040400e+04
bid_1545 13057658.0 1.900723e+02 3.079355e+02 0.000000e+00 1.450000 46.100000 253.200000 2.748300e+03
ask_size_1545 13057658.0 1.649060e+02 3.783876e+02 0.000000e+00 24.000000 75.000000 176.000000 3.366500e+04
ask_1545 13057658.0 1.933352e+02 3.103242e+02 0.000000e+00 1.950000 48.500000 259.500000 2.756100e+03
underlying_bid_1545 13057658.0 1.947768e+03 5.156719e+02 0.000000e+00 1480.940000 2051.170000 2359.000000 2.869190e+03
underlying_ask_1545 13057658.0 1.947768e+03 5.156719e+02 0.000000e+00 1480.940000 2051.170000 2359.000000 2.869190e+03
implied_underlying_price_1545 13057658.0 1.901086e+03 5.954465e+02 0.000000e+00 1436.020000 2034.600000 2353.550000 3.014750e+03
active_underlying_price_1545 13057658.0 1.939062e+03 5.204705e+02 0.000000e+00 1465.940000 2037.530000 2356.000000 3.014750e+03
implied_volatility_1545 13057658.0 2.955677e-01 6.429183e-01 0.000000e+00 0.118800 0.202500 0.325000 2.078845e+02
delta_1545 13057658.0 -2.861949e+01 5.366050e+03 -9.999990e+05 -0.070400 0.000000 0.842700 1.011200e+00
gamma_1545 13057658.0 -5.436161e+00 2.331820e+03 -9.999990e+05 0.000070 0.000415 0.001415 6.224230e-01
theta_1545 13057658.0 -2.914570e+01 5.366050e+03 -9.999990e+05 -0.234927 -0.109010 -0.028687 6.741959e+00
vega_1545 13057658.0 -2.753056e+01 5.380309e+03 -9.999990e+05 0.072245 0.517580 1.976462 2.041189e+01
rho_1545 13057658.0 -2.788455e+00 5.393093e+03 -9.999990e+05 -23.493632 0.000000 116.692534 6.366142e+03
bid_size_eod 13057658.0 1.014582e+02 1.817101e+02 0.000000e+00 6.000000 45.000000 105.000000 2.040400e+04
bid_eod 13057658.0 1.896491e+02 3.077333e+02 0.000000e+00 1.350000 45.600000 252.400000 2.751500e+03
ask_size_eod 13057658.0 1.393529e+02 3.682652e+02 0.000000e+00 11.000000 51.000000 125.000000 3.366500e+04
ask_eod 13057658.0 1.931852e+02 3.102778e+02 0.000000e+00 1.900000 48.200000 259.300000 2.760400e+03
underlying_bid_eod 13057658.0 1.947523e+03 5.155860e+02 6.765300e+02 1481.050000 2051.310000 2357.030000 2.872870e+03
underlying_ask_eod 13057658.0 1.947523e+03 5.155860e+02 6.765300e+02 1481.050000 2051.310000 2357.030000 2.872870e+03
vwap 13057658.0 -9.684221e+298 inf -1.662828e+305 0.000000 0.000000 0.160000 1.437005e+294
open_interest 13057658.0 3.445178e+03 1.313170e+06 -1.325400e+09 0.000000 19.000000 924.000000 1.745682e+09

I want to do a couple of side-by-side comparisons just to do a quick check of the 1545 and EOD fields. These should have similar ranges, but it might be useful to verify that assumption. If there's a difference, then I will need to research that further.

In [20]:
# Summarize numeric columns
df[['bid_1545','bid_eod']].describe(include='number').transpose()
Out[20]:
count mean std min 25% 50% 75% max
bid_1545 13057658.0 190.072277 307.935496 0.0 1.45 46.1 253.2 2748.3
bid_eod 13057658.0 189.649149 307.733315 0.0 1.35 45.6 252.4 2751.5
In [21]:
# Summarize numeric columns
df[['bid_size_1545','bid_size_eod']].describe(include='number').transpose()
Out[21]:
count mean std min 25% 50% 75% max
bid_size_1545 13057658.0 122.153976 190.700466 0.0 11.0 55.0 150.0 20404.0
bid_size_eod 13057658.0 101.458220 181.710110 0.0 6.0 45.0 105.0 20404.0
In [22]:
# Summarize numeric columns
df[['ask_1545','ask_eod']].describe(include='number').transpose()
Out[22]:
count mean std min 25% 50% 75% max
ask_1545 13057658.0 193.335247 310.324193 0.0 1.95 48.5 259.5 2756.1
ask_eod 13057658.0 193.185156 310.277757 0.0 1.90 48.2 259.3 2760.4
In [23]:
# Summarize numeric columns
df[['ask_size_1545','ask_size_eod']].describe(include='number').transpose()
Out[23]:
count mean std min 25% 50% 75% max
ask_size_1545 13057658.0 164.906022 378.387645 0.0 24.0 75.0 176.0 33665.0
ask_size_eod 13057658.0 139.352922 368.265241 0.0 11.0 51.0 125.0 33665.0

The ranges appear to be fairly similar. There isn't anything in these numbers that suggests a "vwap-like" anomaly.

Ultimately, I will probably drop the end-of-day (eod) columns, since the CBOE states that the 15:45 snapshot data are more reliable. But I'll do that at a later date.

For now, I will save the data frame to disk.

In [24]:
# Save the Dataframe to a pickle
proc_pkl = r"cboe_mmyy_all_clean_df.pkl"
df.to_pickle(os.path.join(proc_dir, proc_pkl))
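One nice property of the pickle (versus re-reading the csv) is that it preserves dtypes and parsed dates exactly. That round trip is easy to verify; a sketch on a toy frame and a temporary file:

```python
import os
import tempfile
import pandas as pd

# Toy frame with the dtypes we care about preserving.
df = pd.DataFrame({
    'quote_date': pd.to_datetime(['2004-01-02']),
    'strike': [1250.0],
})

# Round-trip through a pickle and confirm nothing changed.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'df.pkl')
    df.to_pickle(path)
    df2 = pd.read_pickle(path)

print(df.equals(df2))  # True
```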
