Saturday, October 26, 2019

003 - Cleaning the Raw CBOE File

03_cboe_file_clean

Scrubbing the CBOE Historical Option Chain Data

Overview

Now that we know the entire universe of bad values in the concatenated file, I want to replace them with a special value that has the same format as the expected column data type (i.e., a float). I will use the csv module to simplify some of the reading / writing of the file.

Start by defining the input and output files that we will use.

In [5]:
# Import modules
import os
import csv

# Directory & files that house the csv data
proc_dir = r"/Users/alexstephens/data/cboe/proc"

proc_csv = r"cboe_mmyy_all.csv"
proc_file = os.path.join(proc_dir, proc_csv)

clean_csv = r"cboe_mmyy_all_clean.csv"
clean_file = os.path.join(proc_dir, clean_csv)

I want to scan through the original file and only do something when I know that a line has a bad value.

Since we know the universe of bad values, I can define them as a set and then use set logic to do a quick comparison between the bad values and each parsed csv line read from the file.
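
As a quick illustration of the disjoint test (with made-up line values, not actual CBOE rows):

# Toy example: isdisjoint() is True only when the parsed line
# shares no members with the bad set
demo_bad_set = {'1.#QNAN0', '-1.#IND00'}

good_line = "SPX,2011-01-21,1280.0,12.5,0.55".split(',')
bad_line = "SPX,2011-01-21,1280.0,1.#QNAN0,0.55".split(',')

print(set(good_line).isdisjoint(demo_bad_set))  # True  -> echo as-is
print(set(bad_line).isdisjoint(demo_bad_set))   # False -> needs replacement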

For each line in the file, I will split the line. The first line (i.e., the header) is written straight to the output file using the csv writer's writerow() method.

For all other lines, I will first build a set from the split line. That line-set is compared with the bad-set, and if the sets are disjoint (no overlapping members), I write the line to the file without modification.

(Note: we know there are only 777 bad lines out of ~13mm, so we want to minimize the number and complexity of operations on the good lines; a quick way to test that assumption appears below.)
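
If you want to verify that trade-off, a minimal timeit sketch (on a synthetic line, illustrative only) compares the cost of the disjoint pre-check against unconditionally running the replacement comprehension on every line:

import timeit

chk_bad_set = {'1.#QNAN0', '1.#QNB', '-1.#R', '-1.#J', '-1.#IND00'}
sample = "SPX,2011-01-21,1280.0,12.5,0.55,0.03,-0.02,0.11".split(',')

# Cost of the pre-check taken on every (mostly good) line
t_check = timeit.timeit(lambda: set(sample).isdisjoint(chk_bad_set),
                        number=100000)

# Cost of unconditionally rebuilding every line via the comprehension
t_rebuild = timeit.timeit(
    lambda: [-999999 if x in chk_bad_set else x for x in sample],
    number=100000)

print('pre-check: {:.3f}s, rebuild: {:.3f}s'.format(t_check, t_rebuild))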

When a split-line contains a bad value, I use a list comprehension with an embedded conditional to replace each bad value with a special value (i.e., -999999). Then I write the modified split-line to the new file.

In [6]:
# Define the set of bad entries
bad_set = {'1.#QNAN0', '1.#QNB', '-1.#R', '-1.#J', '-1.#IND00'}
bad_count = 0

with open(clean_file, 'w', newline="") as outfile:
    csv_writer = csv.writer(outfile)
    # scan through the concatenated file
    with open(proc_file, 'r') as infile:
        first_line = True
        for line in infile:
            s = line.rstrip().split(sep=',')
            # Echo the header to the outfile
            if first_line:
                first_line = False
                csv_writer.writerow(s)
            # Use set logic to check if the sets are disjoint; otherwise
            # we will need to replace the bad values in the line
            else:
                s_set = set(s)
                if s_set.isdisjoint(bad_set):
                    # Echo the original line on a no-hit
                    csv_writer.writerow(s)
                else:
                    # Replace the bad value(s): use a list comprehension where
                    # each element (s_i) of the split line (s) is tested for
                    # membership in bad_set
                    new_s = [-999999 if (s_i in bad_set) else s_i for s_i in s]
                    csv_writer.writerow(new_s)
                    bad_count += 1

We can confirm the number of modified rows using the counter.

In [7]:
# Echo results
print('-' * 80)
print('Number of modified rows {}'.format(bad_count))
print('-' * 80)
--------------------------------------------------------------------------------
Number of modified rows 777
--------------------------------------------------------------------------------

I also confirmed that the number of rows with the newly-replaced "-999999," matches the count above (via: grep -e -999999, cboe_mmyy_all_clean.csv | wc -l)
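
The same count can be reproduced in Python; a quick sketch that re-reads the cleaned file (reusing clean_file from above) and mirrors the grep pattern:

# Re-count rows containing the special value in the cleaned file
# Note: like the grep, this assumes the special value never lands
# in the last column (which has no trailing comma)
with open(clean_file, 'r') as f:
    n_special = sum(1 for line in f if '-999999,' in line)

print('Rows containing the special value: {}'.format(n_special))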

Update: I echoed the output of the above grep to a terminal window. A few quick generalizations from what I see (a sketch for tallying these follows the list):

  1. The last row with a special value appears on 02-03-2012.
  2. The bad greeks appear to be tied to the early weekly series, quarterly series, and some of the later weeklies (i.e., SPXW and SPXPM).
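
To firm up those two observations, one could tally the affected rows by date and by series. The sketch below assumes the header contains columns named quote_date and root; those names are hypothetical, so substitute the actual fields from the CBOE header:

import csv
from collections import Counter

by_date, by_root = Counter(), Counter()
with open(clean_file, 'r') as f:
    for row in csv.DictReader(f):
        if '-999999' in row.values():
            # 'quote_date' and 'root' are assumed column names --
            # replace with the actual fields from the CBOE header
            by_date[row['quote_date']] += 1
            by_root[row['root']] += 1

print('By series:', by_root.most_common())
print('By date  :', by_date.most_common())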
