Saturday, November 2, 2019

007 - Augmenting the Raw CBOE File

07_cboe_augment_df

Augmenting the CBOE file

Summary

There are a couple of additions I would like to make to the scrubbed CBOE data frame. These additions should (in theory) make it easier to isolate relevant data.

The additions I'd like to make include (this list may grow over time):

  • Capitalize the option type
  • Create a key using the root, a YYYYMMDD string, the put/call key, and the strike
  • Create a similar key using only the YYYYMMDD string, the put/call key, and the strike
  • Compute the number of days that will elapse from the quote_date to expiration
  • Add an expiration day of week field (i.e., Friday, Saturday, etc.)
  • Add an expiration week of month (i.e., the 3rd week in the month)

At this stage, I would like to retain as much data in its raw form as is practical. In the future I will start to prune the data frame.

To start, load the raw data frame.

In [1]:
import os
import pandas as pd

# Alter display settings
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 500)

# Directory & file that houses the concatenated CBOE csv data
proc_dir = r"/Users/alexstephens/data/cboe/proc"
pkl_file = os.path.join(proc_dir, r"cboe_mmyy_all_clean_df.pkl")

# Read the .pkl file
df = pd.read_pickle(pkl_file)

# Echo the columns
list(df.columns.values)
Out[1]:
['underlying_symbol',
 'quote_date',
 'root',
 'expiration',
 'strike',
 'option_type',
 'open',
 'high',
 'low',
 'close',
 'trade_volume',
 'bid_size_1545',
 'bid_1545',
 'ask_size_1545',
 'ask_1545',
 'underlying_bid_1545',
 'underlying_ask_1545',
 'implied_underlying_price_1545',
 'active_underlying_price_1545',
 'implied_volatility_1545',
 'delta_1545',
 'gamma_1545',
 'theta_1545',
 'vega_1545',
 'rho_1545',
 'bid_size_eod',
 'bid_eod',
 'ask_size_eod',
 'ask_eod',
 'underlying_bid_eod',
 'underlying_ask_eod',
 'vwap',
 'open_interest',
 'delivery_code']

Start to work through the punch-list. Here I upper-case the put/call field and use list comprehensions to create two-character day-number and month-number strings; that is, I want a day number like 9 to read as '09'. Finally, I concatenate the fields to create the YYYYMMDD string.

In [2]:
# Apply upper-case to the option_type field
df['option_type'] = df['option_type'].str.upper()

# Use a list comprehension to create a 2 character expiry day & month field
df['ex_dd'] = [str(day) if (day > 9) else ('0' + str(day))
               for day in df['expiration'].dt.day]

df['ex_mm'] = [str(month) if (month > 9) else ('0' + str(month))
               for month in df['expiration'].dt.month]

# Use a map method to create the four character expiry YYYY field
df['ex_yy'] = df['expiration'].dt.year.map(str)

# Create an expiration YYYYMMDD field
df['ex_yymmdd'] = df['ex_yy'] + df['ex_mm'] + df['ex_dd']

# Echo progress
df.head(2).transpose()
Out[2]:
0 1
underlying_symbol ^SPX ^SPX
quote_date 2004-01-30 00:00:00 2004-01-30 00:00:00
root SYG SYG
expiration 2004-02-21 00:00:00 2004-02-21 00:00:00
strike 600 600
option_type C P
open 0 0
high 0 0
low 0 0
close 0 0
trade_volume 0 0
bid_size_1545 50 0
bid_1545 529.6 0
ask_size_1545 50 10
ask_1545 531.6 0.05
underlying_bid_1545 1132.04 1132.04
underlying_ask_1545 1132.04 1132.04
implied_underlying_price_1545 1129.63 1129.63
active_underlying_price_1545 1129.63 1129.63
implied_volatility_1545 0 0.8085
delta_1545 1 -0.0004
gamma_1545 0 6e-06
theta_1545 -0.039398 -0.007394
vega_1545 0 0.003898
rho_1545 34.4217 -0.027385
bid_size_eod 50 0
bid_eod 529.9 0
ask_size_eod 50 10
ask_eod 531.9 0.05
underlying_bid_eod 1131.13 1131.13
underlying_ask_eod 1131.13 1131.13
vwap 0 0
open_interest 0 40
delivery_code 0 0
ex_dd 21 21
ex_mm 02 02
ex_yy 2004 2004
ex_yymmdd 20040221 20040221
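As an aside, the same YYYYMMDD string could likely be built in a single step with the `dt.strftime` accessor, which zero-pads the month and day automatically (a sketch on a toy frame, not run against the real data):

```python
import pandas as pd

# Toy frame with a datetime expiration column
toy = pd.DataFrame({'expiration': pd.to_datetime(['2004-02-21', '2019-11-09'])})

# strftime zero-pads month and day, so no list comprehensions are needed
toy['ex_yymmdd'] = toy['expiration'].dt.strftime('%Y%m%d')

print(toy['ex_yymmdd'].tolist())  # ['20040221', '20191109']
```

I keep the explicit list-comprehension version above since it makes the padding logic visible, but `strftime` is the more idiomatic route.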

Now I add the day name and the number of weeks into the month the expiration date represents.

I also need a string version of the strike, which I build by first converting the float to an integer; this is what 'strike_str' represents.

Then I create a couple of string-based keys. Lastly, I compute the number of days between expiration and the quote_date.
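The week-of-month formula used in the next cell can be sanity-checked in isolation: days 1-7 map to week 1, days 8-14 to week 2, and so on, up to week 5 for days 29-31. A quick standalone check:

```python
# Week-of-month: days 1-7 -> 1, 8-14 -> 2, 15-21 -> 3, 22-28 -> 4, 29-31 -> 5
week_of_month = lambda day: (day - 1) // 7 + 1

print([week_of_month(d) for d in (1, 7, 8, 14, 15, 21, 22, 28, 29, 31)])
# [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```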

In [3]:
# Add the day name to ease visualization of the expiration day
df['ex_dayname'] = df['expiration'].dt.day_name()

# Compute the number of weeks elapsed in the month by expiration
df['ex_weeknum'] = df['expiration'].apply(lambda d: (d.day-1) // 7 + 1)

# Create a strike string field; I have to convert the float to
# integer before converting to a string.  There is a single strike
# with a non-integer value (8.75)
df['strike_str'] = df.strike.astype(int).astype(str)

# Create an option series key
df['series_key'] = (df['root'] + df['ex_yymmdd'] + df['option_type'] +
                    df['strike_str'])

# Create strike and expiration date keys
df['dt_typ_stk_key'] = df['ex_yymmdd'] + df['option_type'] + df['strike_str']
df['dt_typ_key'] = df['ex_yymmdd'] + df['option_type']

# Compute the number of days-to-expiration (dte); use the .dt
# accessor on the timedelta series to extract the integer days to expiry
df['dte'] = (df['expiration'] - df['quote_date']).dt.days

# Keep only the date component of the 'quote_date' and 'expiration'
df['quote_date'] = df['quote_date'].dt.date
df['expiration'] = df['expiration'].dt.date

# Compute the underlying mid-price
df['underlying_mid_1545'] = 0.5*(df['underlying_bid_1545'] +
                                 df['underlying_ask_1545'])

# Turns out the underlying bid/ask spread is 0.0 in all cases
# df['underlying_dif_1545'] = abs(df['underlying_bid_1545'] -
#                                df['underlying_ask_1545'])

# Echo progress
df.head(2).transpose()
Out[3]:
0 1
underlying_symbol ^SPX ^SPX
quote_date 2004-01-30 2004-01-30
root SYG SYG
expiration 2004-02-21 2004-02-21
strike 600 600
option_type C P
open 0 0
high 0 0
low 0 0
close 0 0
trade_volume 0 0
bid_size_1545 50 0
bid_1545 529.6 0
ask_size_1545 50 10
ask_1545 531.6 0.05
underlying_bid_1545 1132.04 1132.04
underlying_ask_1545 1132.04 1132.04
implied_underlying_price_1545 1129.63 1129.63
active_underlying_price_1545 1129.63 1129.63
implied_volatility_1545 0 0.8085
delta_1545 1 -0.0004
gamma_1545 0 6e-06
theta_1545 -0.039398 -0.007394
vega_1545 0 0.003898
rho_1545 34.4217 -0.027385
bid_size_eod 50 0
bid_eod 529.9 0
ask_size_eod 50 10
ask_eod 531.9 0.05
underlying_bid_eod 1131.13 1131.13
underlying_ask_eod 1131.13 1131.13
vwap 0 0
open_interest 0 40
delivery_code 0 0
ex_dd 21 21
ex_mm 02 02
ex_yy 2004 2004
ex_yymmdd 20040221 20040221
ex_dayname Saturday Saturday
ex_weeknum 3 3
strike_str 600 600
series_key SYG20040221C600 SYG20040221P600
dt_typ_stk_key 20040221C600 20040221P600
dt_typ_key 20040221C 20040221P
dte 22 22
underlying_mid_1545 1132.04 1132.04
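With the keys in place, isolating a single option series should reduce to a one-line boolean filter. A sketch on a toy frame (the key values are illustrative, not drawn from the real data):

```python
import pandas as pd

# Toy frame with illustrative series_key values
toy = pd.DataFrame({
    'series_key': ['SYG20040221C600', 'SYG20040221P600', 'SPX20040221C1100'],
    'dte': [22, 22, 22],
})

# One equality test pulls every row belonging to a single option series
calls_600 = toy[toy['series_key'] == 'SYG20040221C600']
print(len(calls_600))  # 1
```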

I want to print out some diagnostics to better characterize the underlying data. In particular, I still want to review what the 'root' field contains, as I am not totally comfortable with the definitions of those root symbols.

In [4]:
# A sanity check of the week number calculation
pd.crosstab(df.ex_dd, df.ex_weeknum)
Out[4]:
ex_weeknum 1 2 3 4 5
ex_dd
01 139728 0 0 0 0
02 209944 0 0 0 0
03 128896 0 0 0 0
04 171556 0 0 0 0
05 173302 0 0 0 0
06 200170 0 0 0 0
07 150846 0 0 0 0
08 0 166732 0 0 0
09 0 168755 0 0 0
10 0 151210 0 0 0
11 0 182048 0 0 0
12 0 168318 0 0 0
13 0 204096 0 0 0
14 0 137114 0 0 0
15 0 0 698462 0 0
16 0 0 986936 0 0
17 0 0 967102 0 0
18 0 0 917650 0 0
19 0 0 1027930 0 0
20 0 0 1109663 0 0
21 0 0 1011259 0 0
22 0 0 0 735190 0
23 0 0 0 173718 0
24 0 0 0 208007 0
25 0 0 0 128994 0
26 0 0 0 170828 0
27 0 0 0 207723 0
28 0 0 0 315181 0
29 0 0 0 0 425000
30 0 0 0 0 799494
31 0 0 0 0 821806
In [5]:
# When do Saturday expirations occur?
pd.crosstab(df.ex_dd, df.ex_dayname)
Out[5]:
ex_dayname Friday Monday Saturday Thursday Tuesday Wednesday
ex_dd
01 117128 4706 0 130 0 17764
02 132108 11830 0 42862 5908 17236
03 87072 12204 0 9858 4806 14956
04 141466 10782 0 0 0 19308
05 129408 22584 0 684 5668 14958
06 160574 15566 0 0 2280 21750
07 106504 16234 0 0 0 28108
08 138190 10398 0 0 0 18144
09 133684 16070 0 0 0 19001
10 123662 12402 0 0 0 15146
11 147934 14878 0 0 8 19228
12 128526 24502 0 0 0 15290
13 154320 15614 0 13532 0 20630
14 93520 15920 0 0 0 27674
15 674738 4924 0 0 0 18800
16 517456 12492 432970 0 5752 18266
17 347266 12236 588744 0 4012 14844
18 341240 13642 542890 0 0 19878
19 319200 18050 675340 0 0 15340
20 492332 11030 578559 0 7116 20626
21 433064 16232 528461 480 4936 28086
22 131932 11258 572996 0 0 19004
23 138226 16842 0 356 0 18294
24 128409 12400 0 52948 0 14250
25 102322 6834 0 0 0 19838
26 130066 20508 0 0 5932 14322
27 167727 15542 0 0 4610 19844
28 207023 9378 0 28158 26892 43730
29 316516 40086 0 47606 6642 14150
30 295693 152336 0 123142 111099 117224
31 144524 170344 0 180454 178069 148415
In [6]:
# A cross table of the day name and week number
pd.crosstab(df.ex_dayname, df.ex_weeknum)
Out[6]:
ex_weeknum 1 2 3 4 5
ex_dayname
Friday 874260 919836 3125296 1005705 756733
Monday 93906 109784 88606 92762 362766
Saturday 0 0 3346964 572996 0
Thursday 53534 13532 480 81462 351202
Tuesday 18662 8 21816 37434 295810
Wednesday 134080 135113 135840 149282 279789
In [7]:
# A cross table of the root and the week number
pd.crosstab(df.root, df.ex_weeknum)
Out[7]:
ex_weeknum 1 2 3 4 5
root
BSF 0 0 11124 954 0
BSK 0 0 44763 2880 0
BSZ 0 0 174054 16420 0
JXA 5418 0 50 0 0
JXB 0 5790 38 0 0
JXD 0 0 0 5452 0
JXE 40 0 0 0 1953
QSE 0 0 1 0 19058
QSZ 0 0 0 0 13313
QZQ 0 0 0 0 15581
SAQ 0 0 16 1232 16821
SKQ 0 0 0 0 6163
SLQ 0 0 0 1270 16144
SPB 0 0 24554 9332 0
SPL 0 0 35968 2744 0
SPQ 0 0 111605 14308 0
SPT 0 0 137612 20630 0
SPV 0 0 15898 772 0
SPX 1680 1322 4422746 365596 566
SPXPM 0 0 0 0 59798
SPXQ 20 20 0 81638 290428
SPXW 1167284 1171141 830946 1274129 1562550
SPZ 0 0 70581 8134 0
SQG 0 0 0 0 2760
SQP 0 0 0 1248 14343
SRO 0 0 28186 6889 0
SVP 0 0 19706 4386 0
SXB 0 0 96817 11614 0
SXG 0 0 60828 2752 0
SXJ 0 0 2204 1354 0
SXM 0 0 54088 12494 0
SXY 0 0 109220 21246 0
SXZ 0 0 81345 17171 0
SYF 0 0 6442 126 0
SYG 0 0 56182 6502 0
SYU 0 0 28673 2872 0
SYV 0 0 4910 0 0
SYZ 0 0 24140 1784 0
SZD 0 0 12216 0 0
SZJ 0 0 30853 4360 0
SZP 0 0 138221 23574 0
SZQ 0 0 33 184 26822
SZT 0 0 44792 10880 0
SZU 0 0 11366 704 0
SZV 0 0 27122 2930 0
VSTRP 0 0 1702 1080 0
In [8]:
# A cross table of the root and the day name
pd.crosstab(df.root, df.ex_dayname)
Out[8]:
ex_dayname Friday Monday Saturday Thursday Tuesday Wednesday
root
BSF 0 0 12078 0 0 0
BSK 0 0 47643 0 0 0
BSZ 0 0 190474 0 0 0
JXA 5288 0 50 130 0 0
JXB 5790 0 38 0 0 0
JXD 5452 0 0 0 0 0
JXE 1993 0 0 0 0 0
QSE 789 264 1 5360 3869 8776
QSZ 733 0 0 3520 2464 6596
QZQ 745 0 0 4536 2792 7508
SAQ 2551 5028 16 1756 4252 4466
SKQ 260 1306 0 568 2021 2008
SLQ 2588 4894 0 1752 4282 3898
SPB 0 0 33886 0 0 0
SPL 0 0 38712 0 0 0
SPQ 0 0 125913 0 0 0
SPT 0 0 158242 0 0 0
SPV 0 0 16670 0 0 0
SPX 2546116 0 2245758 36 0 0
SPXPM 23378 0 0 33540 0 2880
SPXQ 123676 138462 0 33290 62398 14280
SPXW 3958771 588836 0 410200 280820 767423
SPZ 0 0 78715 0 0 0
SQG 0 0 0 0 864 1896
SQP 2639 4760 0 1246 3062 3884
SRO 0 0 35075 0 0 0
SVP 0 0 24092 0 0 0
SXB 0 0 108431 0 0 0
SXG 0 0 63580 0 0 0
SXJ 0 0 3558 0 0 0
SXM 0 0 66582 0 0 0
SXY 0 0 130466 0 0 0
SXZ 0 0 98516 0 0 0
SYF 0 0 6568 0 0 0
SYG 0 0 62684 0 0 0
SYU 0 0 31545 0 0 0
SYV 0 0 4910 0 0 0
SYZ 0 0 25924 0 0 0
SZD 0 0 12216 0 0 0
SZJ 0 0 35213 0 0 0
SZP 0 0 161795 0 0 0
SZQ 1061 4274 33 4276 6906 10489
SZT 0 0 55672 0 0 0
SZU 0 0 12070 0 0 0
SZV 0 0 30052 0 0 0
VSTRP 0 0 2782 0 0 0

I will absorb the above information and use it to help with the filtering step in the next post. For now, save this data frame to a file. (N.B. it has grown to roughly 5 GB.)

In [9]:
# Save the output 
out_file = os.path.join(proc_dir, r"cboe_mmyy_aug_clean_df.pkl")
df.to_pickle(out_file)
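One way to tame a ~5 GB pickle would be pandas' built-in compression support on `to_pickle`/`read_pickle` (an option I have not applied here; it trades disk space for CPU time). A minimal round-trip sketch on a toy frame:

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for the real (much larger) data frame
df = pd.DataFrame({'a': range(10)})
out_file = os.path.join(tempfile.gettempdir(), 'cboe_demo_df.pkl.gz')

# compression='infer' would pick gzip from the .gz suffix; being explicit here
df.to_pickle(out_file, compression='gzip')

# Round-trip check
df2 = pd.read_pickle(out_file, compression='gzip')
print(df.equals(df2))  # True
```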
