.. _sampling: ############# Data Sampling ############# In financial machine learning, samples are not independent. For the most part, traditional machine learning algorithms assume that samples are IID, in the case of financial machine learning samples are neither identically distributed nor independent. In this section we will tackle the problem of samples dependency. As you will remember, we mostly label our data sets using the triple-barrier method. Each label in triple-barrier event has a label index and a label end time (t1) which corresponds to time when one of barriers were touched. Sample Uniqueness ================= Let's look at an example of 3 samples: A, B, C. Imagine that: * A was generated at :math:`t_1` and triggered on :math:`t_8` * B was generated at :math:`t_3` and triggered on :math:`t_6` * C was generated on :math:`t_7` and triggered on :math:`t_9` In this case we see that A used information about returns on :math:`[t_1,t_8]` to generate label-endtime which overlaps with :math:`[t_3, t_6]` which was used by B, however C didn't use any returns information which was used by to label other samples. Here we would like to introduce the concept of concurrency. We say that labels :math:`y_i` and :math:`y_j` are concurrent at :math:`t` if they are a function of at least one common return at :math:`r_{t-1,t}` In terms of concurrency label C is the most 'pure' as it doesn't use any piece of information from other labels, while A is the 'dirtiest' as it uses information from both B and C. By understanding average label uniqueness you can measure how 'pure' your dataset is based on concurrency of labels. We can measure average label uniqueness using get_av_uniqueness_from_triple_barrier function from the **Mlfin.py** package. This function is the orchestrator to derive average sample uniqueness from a dateset labeled by the triple barrier method. Implementation -------------- .. py:currentmodule:: mlfinpy.sampling.concurrent .. autofunction:: get_av_uniqueness_from_triple_barrier Example ------- An example of calculating average uniqueness given that we have already have our barrier events can be seen below: .. code-block:: python import pandas as pd import numpy as np from mlfinpy.sampling.concurrent import get_av_uniqueness_from_triple_barrier barrier_events = pd.read_csv('FILE_PATH', index_col=0, parse_dates=[0,2]) close_prices = pd.read_csv('FILE_PATH', index_col=0, parse_dates=[0,2]) av_unique = get_av_uniqueness_from_triple_barrier(barrier_events, close_prices.close, num_threads=3) We would like to build our model in such a way that it takes into account label concurrency (overlapping samples). In order to do that we need to look at the bootstrapping algorithm of a Random Forest. Sequential Bootstrapping ======================== The key power of ensemble learning techniques is bagging (which is bootstrapping with replacement). The key idea behind bagging is to randomly choose samples for each decision tree. In this case trees become diverse and by averaging predictions of diverse trees built on randomly selected samples and random subset of features data scientists make the algorithm much less prone to overfit. However, in our case we would not only like to randomly choose samples but also choose samples which are unique and non-concurrent. But how can we solve this problem? Here comes Sequential Bootstrapping algorithm. The key idea behind Sequential Bootstrapping is to select samples in such a way that on each iteration we maximize average uniqueness of selected subsamples. Implementation -------------- The core functions behind Sequential Bootstrapping are implemented in **Mlfin.py** and can be seen below: .. py:currentmodule:: mlfinpy.sampling.bootstrapping .. autofunction:: get_ind_matrix .. autofunction:: get_ind_mat_average_uniqueness .. autofunction:: get_ind_mat_label_uniqueness .. autofunction:: seq_bootstrap Example ------- An example of Sequential Bootstrap using a a toy example from the book can be seen below. Consider a set of labels :math:`\left\{y_i\right\}_{i=0,1,2}` where: * label :math:`y_0` is a function of return :math:`r_{0,2}` * label :math:`y_1` is a function of return :math:`r_{2,3}` * label :math:`y_2` is a function of return :math:`r_{4,5}` The first thing we need to do is to build and indicator matrix. Columns of this matrix correspond to samples and rows correspond to price returns timestamps which were used during samples labelling. In our case indicator matrix is: .. code-block:: python ind_mat = pd.DataFrame(index = range(0,6), columns=range(0,3)) ind_mat.loc[:, 0] = [1, 1, 1, 0, 0, 0] ind_mat.loc[:, 1] = [0, 0, 1, 1, 0, 0] ind_mat.loc[:, 2] = [0, 0, 0, 0, 1, 1] One can use get_ind_matrix method from **Mlfin.py** to build indicator matrix from triple-barrier events. .. code-block:: python triple_barrier_ind_mat = get_ind_matrix(barrier_events) We can get average label uniqueness on indicator matrix using get_ind_mat_average_uniqueness function from **Mlfin.py**. .. code-block:: python ind_mat_uniqueness = get_ind_mat_average_uniqueness(triple_barrier_ind_mat) Let's get the first sample average uniqueness (we need to filter out zeros to get unbiased result). .. code-block:: python first_sample = ind_mat_uniqueness[0] first_sample[first_sample > 0].mean() >> 0.26886446886446885 av_unique.iloc[0] >> tW 0.238776 As you can see it is quite close to values generated by ``get_av_uniqueness_from_triple_barrier`` function call. Let's move back to our example. In Sequential Bootstrapping algorithm we start with an empty array of samples (:math:`\phi`) and loop through all samples to get the probability of chosing the sample based on average uniqueness of reduced indicator matrix constructed from [previously chosen columns] + sample. .. code-block:: python phi = [] while length(phi) < number of samples to bootstrap: average_uniqueness_array = [] for sample in samples: previous_columns = phi ind_mat_reduced = ind_mat[previous_columns + i] average_uniqueness_array[sample] = get_ind_mat_average_uniqueness(ind_mat_reduced) # Normalise so that probabilities sum up to 1 probability_array = average_uniqueness_array / sum(average_uniqueness_array) chosen_sample = random_choice(samples, probability = probability_array) phi.append(chosen_sample) For performance increase we optimized and parallesied for-loop using numba, which corresponds to ``bootstrap_loop_run`` function. Now let's finish the example: To be as close to the **Mlfin.py** implementation let's convert ind_mat to numpy matrix .. code-block:: python ind_mat = ind_mat.values **1st Iteration:** On the first step all labels will have equal probalities as average uniqueness of matrix with 1 column is 1. Say we have chosen 1 on the first step **2nd Iteration** .. code-block:: python phi = [1] # Sample chosen from the 2st step uniqueness_array = np.array([None, None, None]) for i in range(0, 3): ind_mat_reduced = ind_mat[:, phi + [i]] label_uniqueness = get_ind_mat_average_uniqueness(ind_mat_reduced)[-1] # The last value corresponds to appended i uniqueness_array[i] = (label_uniqueness[label_uniqueness > 0].mean()) prob_array = uniqueness_array / sum(uniqueness_array) prob_array >> array([0.35714285714285715, 0.21428571428571427, 0.42857142857142855], dtype=object) Probably the second chosen feature will be 2 (prob_array[2] = 0.42857 which is the largest probability). As you can see up till now the algorithm has chosen two the least concurrent labels (1 and 2). **3rd Iteration** .. code-block:: python phi = [1,2] uniqueness_array = np.array([None, None, None]) for i in range(0, 3): ind_mat_reduced = ind_mat[:, phi + [i]] label_uniqueness = get_ind_mat_average_uniqueness(ind_mat_reduced)[-1] uniqueness_array[i] = (label_uniqueness[label_uniqueness > 0].mean()) prob_array = uniqueness_array / sum(uniqueness_array) prob_array >> array([0.45454545454545453, 0.2727272727272727, 0.2727272727272727], dtype=object) Sequential Bootstrapping tries to minimise the probability of repeated samples so as you can see the most probable sample would be 0 with 1 and 2 already selected. **4th Iteration** .. code-block:: python phi = [1, 2, 0] uniqueness_array = np.array([None, None, None]) for i in range(0, 3): ind_mat_reduced = ind_mat[:, phi + [i]] label_uniqueness = get_ind_mat_average_uniqueness(ind_mat_reduced)[-1] uniqueness_array[i] = (label_uniqueness[label_uniqueness > 0].mean()) prob_array = uniqueness_array / sum(uniqueness_array) prob_array >> array([0.32653061224489793, 0.3061224489795918, 0.36734693877551017], dtype=object) The most probable sample would be 2 in this case. After 4 steps of sequential bootstrapping our drawn samples are [1, 2, 0, 2]. Let's see how this example is solved by the **Mlfin.py** implementation. To reproduce that: 1) we need to set warmup to [1], which corresponds to phi = [1] on the first step 2) verbose = True to print updated probabilities .. code-block:: python samples = seq_bootstrap(ind_mat, sample_length=4, warmup_samples=[1], verbose=True) >> [0.33333333 0.33333333 0.33333333] >> [0.35714286 0.21428571 0.42857143] >> [0.45454545 0.27272727 0.27272727] >> [0.32653061 0.30612245 0.36734694] samples >> [1, 2, 0, 2] As you can see the first 2 iterations of algorithm yield the same probabilities, however sometimes the algorithm randomly chooses not the 2 sample on 2nd iteration that is why further probabilities are different from the example above. However, if you repeat the process several times you'll see that on average drawn sample equal to the one from the example Monte-Carlo Experiment ---------------------- Let's see how sequential bootstrapping increases average label uniqueness on this example by generating 3 samples using sequential bootstrapping and 3 samples using standard random choise, repeat the experiment 10000 times and record corresponding label uniqueness in each experiment .. code-block:: python standard_unq_array = np.zeros(10000) * np.nan # Array of random sampling uniqueness seq_unq_array = np.zeros(10000) * np.nan # Array of Sequential Bootstapping uniqueness for i in range(0, 10000): bootstrapped_samples = seq_bootstrap(ind_mat, sample_length=3) random_samples = np.random.choice(ind_mat.shape[1], size=3) random_unq = get_ind_mat_average_uniqueness(ind_mat[:, random_samples]) random_unq_mean = random_unq[random_unq > 0].mean() sequential_unq = get_ind_mat_average_uniqueness(ind_mat[:, bootstrapped_samples]) sequential_unq_mean = sequential_unq[sequential_unq > 0].mean() standard_unq_array[i] = random_unq_mean seq_unq_array[i] = sequential_unq_mean KDE plots of label uniqueness support the fact that sequential bootstrapping gives higher average label uniqueness .. image:: media/monte_carlo_bootstrap.png :scale: 130 % :align: center We can compare average label uniqueness using sequential bootstrap vs label uniqueness using standard random sampling by setting compare parameter to True. We have massively increased the performance of Sequential Bootstrapping which was described in the book. For comparison generating 50 samples from 8000 barrier-events would take 3 days, we have reduced time to 10-12 seconds which decreases by increasing number of CPUs. Let's apply sequential bootstrapping to our full data set and draw 50 samples: .. code-block:: python Standard uniqueness: 0.9465875370919882 Sequential uniqueness: 0.9913169319826338 Sometimes you would see that standard bootstrapping gives higher uniqueness, however as it was shown in Monte-Carlo example, on average Sequential Bootstrapping algorithm has higher average uniqueness. Sample Weights ============== **Mlfin.py** supports two methods of applying sample weights. The first is weighting an observation based on its given return as well as average uniqueness. The second is weighting an observation based on a time decay. By Returns and Average Uniqueness --------------------------------- The following function utilizes a samples average uniqueness and its return to compute sample weights: .. py:currentmodule:: mlfinpy.sample_weights.attribution .. autofunction:: get_weights_by_return This function can be utilized as shown below assuming we have already found our barrier events .. code-block:: python import pandas as pd import numpy as np from mlfinpy.sample_weights.attribution import get_weights_by_return barrier_events = pd.read_csv('FILE_PATH', index_col=0, parse_dates=[0,2]) close_prices = pd.read_csv('FILE_PATH', index_col=0, parse_dates=[0,2]) sample_weights = get_weights_by_return(barrier_events, close_prices.close, num_threads=3) By Time Decay ------------- The following function assigns sample weights using a time decay factor .. autofunction:: get_weights_by_time_decay This function can be utilized as shown below assuming we have already found our barrier events .. code-block:: python import pandas as pd import numpy as np from mlfinpy.sample_weights.attribution import get_weights_by_time_decay barrier_events = pd.read_csv('FILE_PATH', index_col=0, parse_dates=[0,2]) close_prices = pd.read_csv('FILE_PATH', index_col=0, parse_dates=[0,2]) sample_weights = get_weights_by_time_decay(barrier_events, close_prices.close, num_threads=3, decay=0.4)