Saturday, November 9, 2019

008 - Reducing the CBOE data file

08_cboe_reduce_df

Reducing the size of the CBOE DataFrame

Summary

In theory, this should be the last step of the data preparation process. The goal of this script is to take the bloated pandas DataFrame I've constructed and winnow it down to the data I'm interested in using.

In reality, this will end the first attempt at this data prep process. As I've made my way through Learning Python, I want to use some of the accumulated knowledge to simplify and modularize the data read, scrubbing, and reduction steps.

In addition, the CBOE recently published corrections to their historical data, so I will apply the modified process to those new data files.

In [1]:
# =============================================================================
# Set-up
# =============================================================================

import os
import pandas as pd
from datetime import timedelta

# Alter display settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 500)


# =============================================================================
# Read the augmented file
# =============================================================================

# Directory & file that houses the concatenated CBOE csv data
# (Note: The change in directory reflects the fact that I now
# have two sets of CBOE data ... so I will segregate the two
# by the date downloaded and the underlying ticker symbol)
proc_dir = r"/Users/alexstephens/data/cboe/proc/1905_SPX"
pkl_file = os.path.join(proc_dir, r"cboe_mmyy_aug_clean_df.pkl")

# Read the .pkl file
df = pd.read_pickle(pkl_file)

# Echo the columns
list(df.columns.values)

# Echo the first two rows (transposed)
df.head(2).transpose()
Out[1]:
0 1
underlying_symbol ^SPX ^SPX
quote_date 2004-01-30 00:00:00 2004-01-30 00:00:00
root SYG SYG
expiration 2004-02-21 00:00:00 2004-02-21 00:00:00
strike 600 600
option_type C P
open 0 0
high 0 0
low 0 0
close 0 0
trade_volume 0 0
bid_size_1545 50 0
bid_1545 529.6 0
ask_size_1545 50 10
ask_1545 531.6 0.05
underlying_bid_1545 1132.04 1132.04
underlying_ask_1545 1132.04 1132.04
implied_underlying_price_1545 1129.63 1129.63
active_underlying_price_1545 1129.63 1129.63
implied_volatility_1545 0 0.8085
delta_1545 1 -0.0004
gamma_1545 0 6e-06
theta_1545 -0.039398 -0.007394
vega_1545 0 0.003898
rho_1545 34.4217 -0.027385
bid_size_eod 50 0
bid_eod 529.9 0
ask_size_eod 50 10
ask_eod 531.9 0.05
underlying_bid_eod 1131.13 1131.13
underlying_ask_eod 1131.13 1131.13
vwap 0 0
open_interest 0 40
delivery_code 0 0
ex_dd 21 21
ex_mm 02 02
ex_yy 2004 2004
ex_yymmdd 20040221 20040221
ex_dayname Saturday Saturday
ex_weeknum 3 3
strike_str 600 600
dt_ty_sk_key 20040221C600 20040221P600
dt_ty_key 20040221C 20040221P
dte 22 22
mid_underlying_price_1545 1132.04 1132.04
In [2]:
# =============================================================================
# Generate date vectors / cutoffs
# =============================================================================

# Collect min/max quote dates
min_date = df['quote_date'].min()
max_date = df['quote_date'].max()

# Since we may want to compute the VIX up to the last quote day in the
# history, pull expirations that extend out 37 days past the max_date
vix_date = max_date + timedelta(days=37)

# Retain only expiration cycles that happen before the vix_date
# Shape before: (13057658, 45)
# Shape after: (12471324, 45)
df = df[(df.expiration <= vix_date)]
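
The 37-day pad follows the VIX methodology, which draws on options with roughly 23 to 37 days to expiration; keeping one extra expiration cycle past the last quote date leaves enough runway for that calculation. A quick sanity check on the cutoff (a sketch using the variables above):

# Confirm the retained expirations extend at most 37 days past the
# final quote date in the history
assert df['expiration'].max() <= vix_date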
In [3]:
# =============================================================================
# Remove & rename columns
# =============================================================================

# Drop columns
drop_columns = ['open', 'high', 'low', 'close',
                'underlying_bid_1545', 'underlying_ask_1545',
                'bid_size_eod', 'bid_eod',
                'ask_size_eod', 'ask_eod',
                'underlying_bid_eod', 'underlying_ask_eod',
                'implied_underlying_price_1545',
                'vwap', 'delivery_code',
                'ex_weeknum', 'ex_yymmdd', 'strike_str',
                ]

df.drop(columns=drop_columns, inplace=True)

# Rename columns
df.rename(columns={'implied_volatility_1545': 'iv_1545',
                   'active_underlying_price_1545': 'aup_1545',
                   'mid_underlying_price_1545': 'mup_1545',
                   'open_interest': 'oi'}, inplace=True)

# Echo the first two rows (transposed)
df.head(2).transpose()
Out[3]:
0 1
underlying_symbol ^SPX ^SPX
quote_date 2004-01-30 00:00:00 2004-01-30 00:00:00
root SYG SYG
expiration 2004-02-21 00:00:00 2004-02-21 00:00:00
strike 600 600
option_type C P
trade_volume 0 0
bid_size_1545 50 0
bid_1545 529.6 0
ask_size_1545 50 10
ask_1545 531.6 0.05
aup_1545 1129.63 1129.63
iv_1545 0 0.8085
delta_1545 1 -0.0004
gamma_1545 0 6e-06
theta_1545 -0.039398 -0.007394
vega_1545 0 0.003898
rho_1545 34.4217 -0.027385
oi 0 40
ex_dd 21 21
ex_mm 02 02
ex_yy 2004 2004
ex_dayname Saturday Saturday
dt_ty_sk_key 20040221C600 20040221P600
dt_ty_key 20040221C 20040221P
dte 22 22
mup_1545 1132.04 1132.04
In [4]:
# =============================================================================
# Define a minimum set of print columns
# =============================================================================

# Print columns
print_columns = ['quote_date', 'root', 'expiration', 'ex_dayname',
                 'option_type', 'strike', 'mup_1545',
                 'bid_1545', 'ask_1545', 'trade_volume', 'oi',
                 'dte', 'iv_1545', 'delta_1545']


# =============================================================================
# Sort the option chain
# =============================================================================

# Sort by: quote_date > expiration > strike
df.sort_values(['quote_date', 'expiration', 'strike'],
               ascending=[True, True, True],
               inplace=True)

# Reindex the data (if desired)
# df.reset_index(drop=True, inplace=True)

# Echo df
df.head(2).transpose()
Out[4]:
10238 10239
underlying_symbol ^SPX ^SPX
quote_date 2004-01-02 00:00:00 2004-01-02 00:00:00
root SYG SYG
expiration 2004-01-17 00:00:00 2004-01-17 00:00:00
strike 600 600
option_type C P
trade_volume 0 0
bid_size_1545 1 0
bid_1545 506.5 0
ask_size_1545 1 10
ask_1545 508.5 0.05
aup_1545 1107.23 1107.23
iv_1545 1.4717 0.9688
delta_1545 0.9886 -0.0004
gamma_1545 9.4e-05 7e-06
theta_1545 -0.379398 -0.011504
vega_1545 0.064713 0.003389
rho_1545 22.4056 -0.019588
oi 1 131
ex_dd 17 17
ex_mm 01 01
ex_yy 2004 2004
ex_dayname Saturday Saturday
dt_ty_sk_key 20040117C600 20040117P600
dt_ty_key 20040117C 20040117P
dte 15 15
mup_1545 1107.8 1107.8
In [5]:
# =============================================================================
# Isolate Friday expirations (+/- 1 day)
# =============================================================================

# Daily date vector spanning the quote_date range
quote_range = pd.date_range(start=min_date, end=vix_date, freq='d')

# Monthly date vector spanning the quote_date range (but end-of-month)
eom_range = list(pd.date_range(start=min_date, end=vix_date, freq='M'))

# Create a pandas dataframe using the vector of dates
qr_df = pd.DataFrame(quote_range, columns=['date'])

# Append the day name for readability
qr_df['day_name'] = qr_df['date'].dt.day_name()

# Append the week-of-month number (days 1-7 -> 1, 8-14 -> 2, ...)
qr_df['week_num'] = qr_df['date'].apply(lambda d: (d.day-1) // 7 + 1)

# Isolate all Fridays of every month in the series (+/- 1 day)
all_fri = qr_df[(qr_df.day_name == 'Friday')].date
all_thu = all_fri + timedelta(-1)
all_sat = all_fri + timedelta(+1)

# Create a list combining all three date lists as a global 'end-of-week' filter
all_eow = sorted(list(all_thu) + list(all_fri) + list(all_sat))

# Identify the 3rd Friday of every month
third_fri = qr_df[(qr_df.day_name == 'Friday') & (qr_df.week_num == 3)].date
third_thu = third_fri + timedelta(-1)
third_sat = third_fri + timedelta(+1)

# Create a list combining all 3rd-week eow dates
third_eow = sorted(list(third_thu) + list(third_fri) + list(third_sat))
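
The one-day pad around Friday matters here: the standard SPX monthlies in this era carry Saturday expiration dates (visible in the ex_dayname column above), and holiday-shifted cycles can land on a Thursday, so filtering on Fridays alone would drop both. A quick check that the window contains only the intended weekdays:

# The end-of-week filter should only ever contain Thu/Fri/Sat dates
assert {d.day_name() for d in all_eow} <= {'Thursday', 'Friday', 'Saturday'}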
In [6]:
# =============================================================================
# Isolate unique expiration dates
# =============================================================================

# Create a pandas dataframe using the vector of expiration dates
ex_date = sorted(df.expiration.unique())
ex_df = pd.DataFrame(pd.to_datetime(ex_date), columns=['ex_date'])

# Create a day number field
ex_df['ex_day_num'] = ex_df['ex_date'].dt.day

# Append the day name for readability
ex_df['ex_day_name'] = ex_df['ex_date'].dt.day_name()

# Append the week-of-month number (days 1-7 -> 1, 8-14 -> 2, ...)
ex_df['ex_week_num'] = ex_df['ex_date'].apply(lambda d: (d.day-1) // 7 + 1)
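
Before applying any filters, a quick tally of expiration weekdays is a useful cross-check on the end-of-week logic below:

# Distribution of expiration dates across weekdays
print(ex_df['ex_day_name'].value_counts())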
In [8]:
# =============================================================================
# [!] Make a copy of the df DataFrame -- filter on tmp_df only
# =============================================================================

# shape = (12471324, 27)
tmp_df = df.copy()
tmp_df.shape
Out[8]:
(12471324, 27)
In [9]:
# =============================================================================
# Apply root filters
# =============================================================================

# Remove quarterlies
quarterly_roots = ['SPXQ', 'SPXPM',
                   'SAQ', 'SKQ', 'SLQ',
                   'QSE', 'QSZ', 'QZQ',
                   'SQG', 'SQP', 'SZQ',
                   ]

# shape = (11904431, 27)
excl_01 = tmp_df[(tmp_df.root.isin(quarterly_roots))]
tmp_df = tmp_df[~(tmp_df.root.isin(quarterly_roots))]


# Remove binary options (B*), Variance Strip (VSTRP), Range Option (SRO)
binary_options = ['BSZ', 'BSF', 'BSK', 'SRO', 'VSTRP']

# shape = (11616379, 27)
excl_02 = tmp_df[(tmp_df.root.isin(binary_options))]
tmp_df = tmp_df[~(tmp_df.root.isin(binary_options))]
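
The root lists above are specific to this SPX history; on a different download, a frequency count makes the candidate roots easy to spot:

# Tally rows per root symbol to see which series are present
print(tmp_df['root'].value_counts())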
In [10]:
# =============================================================================
# Isolate all Friday expirations (+/- 1 day)
# =============================================================================

# Isolate all Friday expirations (+/- 1 day)
# shape = (10015038, 28)
excl_03 = tmp_df[~tmp_df['expiration'].isin(all_eow)]
tmp_df = tmp_df[tmp_df['expiration'].isin(all_eow)]
In [11]:
# =============================================================================
# Exclude weeklies with a PM-settlement on the third Friday of the month
# =============================================================================

# Remove the PM expirations that occur on the third Friday of the month
weekly_roots = ['JXA', 'JXB', 'JXD', 'JXE', 'SPXW']


excl_04 = tmp_df[(tmp_df['expiration'].isin(third_eow) &
                  tmp_df['root'].isin(weekly_roots))]

# shape = (9555202, 28)
tmp_df = tmp_df[~(tmp_df['expiration'].isin(third_eow) &
                  tmp_df['root'].isin(weekly_roots))]
In [12]:
# =============================================================================
# Exclude weeklies with an end-of-month settlement date on a Thursday
# =============================================================================

excl_05 = tmp_df[((tmp_df['expiration'].isin(eom_range)) &
                  (tmp_df['ex_dayname'] == 'Thursday'))]

# shape = (9313292, 28)
tmp_df = tmp_df[~((tmp_df['expiration'].isin(eom_range)) &
                  (tmp_df['ex_dayname'] == 'Thursday'))]


# =============================================================================
# Apply numeric filters
# =============================================================================

# Remove rows with all zero greeks
zero_greek = ((tmp_df.delta_1545 == 0) &
              (tmp_df.gamma_1545 == 0) &
              (tmp_df.theta_1545 == 0) &
              (tmp_df.vega_1545 == 0) &
              (tmp_df.rho_1545 == 0))

excl_06 = tmp_df[(zero_greek)]
tmp_df = tmp_df[~(zero_greek)]


# Remove rows with artificial error code replacements
error_greek = ((tmp_df.delta_1545 == -999999) |
               (tmp_df.gamma_1545 == -999999) |
               (tmp_df.theta_1545 == -999999) |
               (tmp_df.vega_1545 == -999999) |
               (tmp_df.rho_1545 == -999999))

excl_07 = tmp_df[(error_greek)]
tmp_df = tmp_df[~(error_greek)]
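
The filter-and-negate pattern above repeats for every exclusion in this script. As a small step toward the modularization mentioned in the summary, a hypothetical helper could capture it:

# Hypothetical helper (not used above): partition a DataFrame by a mask
def split_mask(frame, mask):
    """Return (excluded, retained) slices of frame for a boolean mask."""
    return frame[mask], frame[~mask]

# e.g., the zero-greek filter would become:
# excl_06, tmp_df = split_mask(tmp_df, zero_greek)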
In [13]:
# =============================================================================
# Check for duplicate contract keys (ignoring root)
# =============================================================================

# First isolate the columns to group by
dd = tmp_df[['quote_date',
             'expiration',
             'option_type',
             'strike',
             'dte',
             'root']]

# Count rows per unique (non-root) contract key
uniq_dt = dd.groupby(['quote_date',
                      'expiration',
                      'option_type',
                      'strike',
                      'dte']).size()

# Flatten the group-by result and keep only keys appearing more than once
uniq_dt = uniq_dt.reset_index().rename(columns={0: 'cnt'})
uniq_dt = uniq_dt[uniq_dt.cnt > 1]
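
As a cross-check on the group-by count, the same rows can be flagged more tersely with DataFrame.duplicated on the key columns:

# Mark every row whose (non-root) contract key appears more than once
key_cols = ['quote_date', 'expiration', 'option_type', 'strike', 'dte']
dup_mask = tmp_df.duplicated(subset=key_cols, keep=False)
print(dup_mask.sum())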
In [14]:
print(uniq_dt.head(10))
      quote_date expiration option_type  strike  dte  cnt
63373 2004-06-23 2005-06-18           C   500.0  360    2
63374 2004-06-23 2005-06-18           C   600.0  360    2
63375 2004-06-23 2005-06-18           C   700.0  360    2
63376 2004-06-23 2005-06-18           C   750.0  360    2
63377 2004-06-23 2005-06-18           C   800.0  360    2
63378 2004-06-23 2005-06-18           C   850.0  360    2
63380 2004-06-23 2005-06-18           C   900.0  360    2
63382 2004-06-23 2005-06-18           C   950.0  360    2
63384 2004-06-23 2005-06-18           C   995.0  360    2
63385 2004-06-23 2005-06-18           C  1005.0  360    2
In [19]:
# Display the first two sets of duplicates (4 rows in total)
tmp_df[(tmp_df.quote_date == pd.to_datetime('2004-06-23')) &
       (tmp_df.expiration == pd.to_datetime('2005-06-18')) &
       (tmp_df.option_type == 'C')].head(4).transpose()
Out[19]:
61806 61808 61810 61812
underlying_symbol ^SPX ^SPX ^SPX ^SPX
quote_date 2004-06-23 00:00:00 2004-06-23 00:00:00 2004-06-23 00:00:00 2004-06-23 00:00:00
root SPL SYF SPL SYG
expiration 2005-06-18 00:00:00 2005-06-18 00:00:00 2005-06-18 00:00:00 2005-06-18 00:00:00
strike 500 500 600 600
option_type C C C C
trade_volume 0 0 0 0
bid_size_1545 50 50 50 50
bid_1545 636.7 635.6 545.9 538.7
ask_size_1545 50 50 50 50
ask_1545 640.7 637.6 549.9 540.7
aup_1545 1119.01 1119.01 1119.01 1119.01
iv_1545 0.5749 0.5749 0.3985 0.4484
delta_1545 0.9629 0.9629 0.9713 0.9587
gamma_1545 0.000124 0.000124 0.000145 0.000174
theta_1545 -0.106659 -0.106659 -0.082949 -0.1029
vega_1545 0.919736 0.919736 0.743699 1.00343
rho_1545 424.778 424.778 536.162 517.671
oi 0 0 0 0
ex_dd 18 18 18 18
ex_mm 06 06 06 06
ex_yy 2005 2005 2005 2005
ex_dayname Saturday Saturday Saturday Saturday
dt_ty_sk_key 20050618C500 20050618C500 20050618C600 20050618C600
dt_ty_key 20050618C 20050618C 20050618C 20050618C
dte 360 360 360 360
mup_1545 1144.2 1144.2 1144.2 1144.2
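
Each duplicate pair is the same contract quoted under two different roots (SPL alongside SYF or SYG here). One hypothetical way to collapse them, which I am not applying in this pass, is to keep a single row per contract key:

# Hypothetical de-duplication (not applied -- the duplicates are carried
# through to the saved file): keep the first row per contract key
key_cols = ['quote_date', 'expiration', 'option_type', 'strike', 'dte']
dedup_df = tmp_df.drop_duplicates(subset=key_cols, keep='first')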
In [20]:
# =============================================================================
# Save output to a file
# =============================================================================

# Save the output
df = tmp_df.copy()
out_file = os.path.join(proc_dir, r"cboe_mmyy_std_clean_df.pkl")
df.to_pickle(out_file)

These steps bring the file size down from ~5 GB to just under 2 GB.
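
To verify the reduction on a fresh run, the on-disk sizes can be checked directly (a quick sketch using the paths defined above):

# Compare on-disk sizes of the input and reduced pickles, in GB
print(os.path.getsize(pkl_file) / 1e9)
print(os.path.getsize(out_file) / 1e9)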

The above steps will serve as a sketch for a more modularized data processing script, where I combine all of the features of previous scripts.
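
As a rough preview of that layout, here is a minimal skeleton (the function names are hypothetical placeholders, not the final design):

import pandas as pd

def read_augmented(pkl_path):
    """Load the augmented CBOE pickle."""
    return pd.read_pickle(pkl_path)

def scrub(frame):
    """Placeholder: drop/rename columns as in the cells above."""
    return frame

def reduce_rows(frame):
    """Placeholder: apply the root, expiration, and greek filters."""
    return frame

def run(in_path, out_path):
    frame = reduce_rows(scrub(read_augmented(in_path)))
    frame.to_pickle(out_path)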
