Optiver Realized Volatility Prediction
Introduction
Getting this far in my machine learning journey as a student at Holberton School is one of the most significant accomplishments in my studies, and in this blog post I’ll try to walk through my experience practicing what I’ve learned.
I’ll try to apply many of the concepts I picked up over the last year in order to predict realized volatility in the competition released on Kaggle by Optiver.
The competition
Optiver Realized Volatility Prediction is a time series prediction competition released by Optiver in June 2021 that aims to predict one of the most prominent factors in the stock market: the volatility (fluctuation) of stocks. The competition provides a huge amount of data about 112 different stocks, covering both the book and the trade history.
The competition became one of the most popular and most contributed-to competitions, given the number of participants and the amount of entries involved, which made Optiver Realized Volatility Prediction the best candidate for me to start with. And I can say I was 100% right, judging by the valuable discussions and coding notebooks published in this competition.
Feature engineering
One of my first pitfalls, and the silliest thing I did, was to start building a prediction model based only on the few columns provided in the competition data, without any extra features, while expecting to get a reasonable result.
But I learned that I need to dedicate as much effort and time (or maybe more) to cleaning, understanding, and extracting new features as I planned for model tuning and selection. Understanding the features, and researching the business domain they come from, is a great benefit when you treat and interpret data.
Therefore, before I touched the data, I learned some financial market concepts and acronyms. I started with the Optiver tutorial notebook, which gave me great insight into what I was dealing with; it was a great starting point before I googled a lot of financial acronyms and terms.
With the help of code notebooks from competition contributors who published their ideas about potential solutions, I started to build my own ideas about feature engineering and came up with my own way of treating my features:
- First level: it’s similar to second-axis (column-wise) aggregation, where I create new features from mathematical relations between the existing columns. For example, I compute
spread = ask_price - bid_price

- Second level: it’s a way of windowing rows (first-axis aggregation) to produce new features. This also turns the data from a time series problem into a supervised one. Both levels are sketched in the code below.
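To make the two levels concrete, here is a minimal sketch in Python with pandas. The column names (time_id, bid_price1, ask_price1, bid_size1, ask_size1) follow the competition’s book data, but the tiny inline frame and the chosen aggregations are just illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy book snapshot; column names follow the competition's book data,
# the values are made up for illustration.
book = pd.DataFrame({
    "time_id":    [5, 5, 5, 16, 16, 16],
    "bid_price1": [1.0010, 1.0008, 1.0009, 0.9991, 0.9990, 0.9992],
    "ask_price1": [1.0020, 1.0021, 1.0019, 1.0002, 1.0003, 1.0001],
    "bid_size1":  [100, 120, 90, 200, 180, 210],
    "ask_size1":  [110, 100, 95, 190, 185, 205],
})

# First level: new columns from mathematical relations between existing ones.
book["spread"] = book["ask_price1"] - book["bid_price1"]
book["wap"] = (book["bid_price1"] * book["ask_size1"]
               + book["ask_price1"] * book["bid_size1"]) \
              / (book["bid_size1"] + book["ask_size1"])
book["log_return"] = np.log(book["wap"]).groupby(book["time_id"]).diff()

# Second level: window the rows (one row per time_id) so the time
# series becomes a supervised-learning table.
features = book.groupby("time_id").agg(
    spread_mean=("spread", "mean"),
    spread_max=("spread", "max"),
    realized_vol=("log_return", lambda r: np.sqrt(np.nansum(r ** 2))),
)
print(features)
```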

Architecture
To keep my journey on this first real-world project as productive as possible, I learned about a beautiful technique called the Super Learner. It’s a general loss-based learning method that was proposed and analyzed theoretically in van der Laan et al. (2007), and it’s an ensemble technique based on sharing knowledge between models, or even between different versions of the same model.

[Figure: the Super Learner workflow, taken from “Super Learner.”]
It’s based on three main parts: model selection, where you build a library of models that will contribute to the final decision; cross-validated learning, where a new dataset is composed of each model’s out-of-fold predictions; and, as the third part, a regression model that takes those cross-validated predictions as input and combines them into the final outcome.
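As a minimal sketch of those three parts with scikit-learn (on made-up data, since the real pipeline uses the engineered features above), the out-of-fold predictions can be built with cross_val_predict and combined by a simple linear meta-regression:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the engineered features and the volatility target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] * 0.5 + rng.normal(scale=0.1, size=500)

# Part 1: the model library.
library = [LinearRegression(),
           DecisionTreeRegressor(max_depth=5),
           RandomForestRegressor(n_estimators=50)]

# Part 2: cross-validated learning; each column holds one model's
# out-of-fold predictions, so the meta-learner never sees leaked labels.
Z = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in library])

# Part 3: a regression over the out-of-fold predictions combines the library.
meta = LinearRegression().fit(Z, y)
print("combination weights:", meta.coef_)

# To predict on new data, refit the base models on all of X first.
for m in library:
    m.fit(X, y)
```

scikit-learn also packages this same pattern as StackingRegressor, which handles the refitting of the base models for you.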
Models library
Following the Guide to SuperLearner, we can see the different efficient models available in the R package.
So the secret recipe is to keep the variation in your model library as high as possible, since each model contributes to the main prediction in proportion to how well it predicts. It’s highly recommended to use as many of the available models as you can to get your best result.
Model selection and fine tuning
In my case (due to my available computational resources) I picked 8 different algorithms:
- Linear regression
- ElasticNet
- Decision tree regressor
- Multi-layer perceptron regressor
- Light gradient boosting machine (LightGBM)
- Bagging regressor
- Random forest regressor
- Extra trees regressor
I need to mention here that some of these models are ensemble models themselves, built on either bagging or boosting over other component models.
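A sketch of this library with scikit-learn and LightGBM might look like the following; the hyperparameter values here are placeholders I made up, to be replaced by the tuned ones:

```python
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              ExtraTreesRegressor)
from lightgbm import LGBMRegressor

# The eight base models; hyperparameters are placeholders pending tuning.
library = {
    "linear": LinearRegression(),
    "elastic_net": ElasticNet(alpha=0.1),
    "decision_tree": DecisionTreeRegressor(max_depth=8),
    "mlp": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500),
    "lightgbm": LGBMRegressor(n_estimators=300, learning_rate=0.05),
    "bagging": BaggingRegressor(n_estimators=50),
    "random_forest": RandomForestRegressor(n_estimators=300),
    "extra_trees": ExtraTreesRegressor(n_estimators=300),
}
```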
My first step working with these models was to train each individual model in isolation from the ensemble and try to figure out the best set of parameters to reach its maximum performance. For that I used the scikit-optimize library with a Gaussian process as my tuning algorithm, and this is exactly where I found one of my knowledge gaps about models: I set a large domain for each parameter, which sometimes takes a lifetime to search for the best combination.
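For illustration, here is a minimal sketch of that tuning step with scikit-optimize’s BayesSearchCV (which drives the search with a Gaussian process surrogate by default), shown on LightGBM; the parameter bounds are my own assumptions and deliberately kept narrow:

```python
from lightgbm import LGBMRegressor
from skopt import BayesSearchCV
from skopt.space import Integer, Real

# Gaussian-process-driven search over a deliberately narrow domain;
# very wide bounds are what made my first attempts take forever.
search = BayesSearchCV(
    LGBMRegressor(),
    search_spaces={
        "num_leaves": Integer(16, 128),
        "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
        "n_estimators": Integer(100, 1000),
    },
    n_iter=30,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
# search.fit(X_train, y_train)  # X_train/y_train: your engineered features
# print(search.best_params_)
```

Narrower bounds keep the number of evaluations the Gaussian process needs manageable, which matters a lot on limited hardware.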
Results
The results here are based on only 20% of the data, and only a few of the previous eight models are well tuned.
The main observation here is how well the Super Learner performs against each individual model.
Next steps
The next steps are about how to improve these results:
- First, for each individual model in the library, I need to tune the parameters well and try to extract the best result and the best version of each.
- Second, I need to increase the variation in my model library by implementing more sets of parameters for the models that can potentially perform well.
- Also, in terms of library variation, I can apply feature sampling by replicating each model with a randomly sampled set of features. For example, sampling 60% of the features 5 times would provide 40 models instead of 8, all contributing to a better result; see the sketch after this list.
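A hypothetical helper for that feature-sampling idea could look like this (the function name and defaults are my own; with 8 base models, 60% sampling, and 5 draws it yields the 40 library members mentioned above):

```python
import numpy as np
from sklearn.base import clone

def sample_feature_library(base_models, n_features, frac=0.6, n_draws=5, seed=0):
    """Replicate each base model over random feature subsets.

    Returns (model, column_indices) pairs; with 8 base models and 5 draws
    this yields 40 library members instead of 8.
    """
    rng = np.random.default_rng(seed)
    k = max(1, int(frac * n_features))
    library = []
    for model in base_models:
        for _ in range(n_draws):
            cols = rng.choice(n_features, size=k, replace=False)
            library.append((clone(model), np.sort(cols)))
    return library

# Usage: fit each clone on its own column subset.
# for model, cols in sample_feature_library(list_of_models, X.shape[1]):
#     model.fit(X[:, cols], y)
```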
Final thoughts
I’m very happy with my learning journey at Holberton School. I got a lot of new things out of it, and I certainly learned many concepts and better habits for understanding, and for researching in order to learn. Also, my participation in this data science competition helped me a lot to put my knowledge to the test and get my hands dirty by getting involved in a real-world problem.