GridSearchCV#

Note

These examples are largely copies of the corresponding examples in tpcp, adapted to use gait algorithms. They are updated less often than the official tpcp examples, so it is worth cross-checking against the official examples.

When optimizing parameters for algorithms that have trainable components, the parameter search must be performed on a validation set that is separate from the test set used for the final evaluation. Even better is to use a cross-validation for this step. In gaitmap, this can be done using GridSearchCV.

This example explains how to use this method. To learn more about the concept, review the evaluation guide and the sklearn guide on tuning hyperparameters.

import random
from typing import Optional

import numpy as np
import pandas as pd


from gaitmap.data_transform import TrainableAbsMaxScaler
from gaitmap.utils.array_handling import iterate_region_data

random.seed(1)  # We set the random seed for repeatable results

Dataset#

As always, we need a dataset, a pipeline, and a scoring method for a parameter search. We reuse the dataset used in other pipeline examples.

from tpcp import Dataset

from gaitmap.example_data import get_healthy_example_imu_data, get_healthy_example_stride_borders


class MyDataset(Dataset):
    @property
    def sampling_rate_hz(self) -> float:
        return 204.8

    @property
    def data(self):
        self.assert_is_single(None, "data")
        return get_healthy_example_imu_data()[self.index.iloc[0]["foot"] + "_sensor"]

    @property
    def segmented_stride_list_(self):
        self.assert_is_single(None, "data")
        return get_healthy_example_stride_borders()[self.index.iloc[0]["foot"] + "_sensor"].set_index("s_id")

    def create_index(self) -> pd.DataFrame:
        return pd.DataFrame({"participant": ["test", "test"], "foot": ["left", "right"]})

The Pipeline#

We use a gait segmentation pipeline that is explained in more detail in the Optimizable Pipelines example. However, we modify this pipeline in one key way: we add an additional parameter n_train_strides that controls how many randomly selected strides are used during training. Modifying this parameter changes the result of the self_optimize step.

from tpcp import CloneFactory, HyperParameter, OptimizableParameter, OptimizablePipeline, PureParameter

from gaitmap.stride_segmentation import BarthDtw, InterpolatedDtwTemplate
from gaitmap.utils.coordinate_conversion import convert_left_foot_to_fbf, convert_right_foot_to_fbf
from gaitmap.utils.datatype_helper import SingleSensorStrideList


class MyPipeline(OptimizablePipeline):
    max_cost: PureParameter[float]
    template: OptimizableParameter[InterpolatedDtwTemplate]
    n_train_strides: HyperParameter[Optional[int]]

    segmented_stride_list_: SingleSensorStrideList
    cost_func_: np.ndarray

    def __init__(
        self,
        max_cost: float = 3,
        # We need to wrap the template in a `CloneFactory` call here to prevent issues with mutable defaults!
        template: InterpolatedDtwTemplate = CloneFactory(InterpolatedDtwTemplate(scaling=TrainableAbsMaxScaler())),
        n_train_strides: Optional[int] = None,
    ) -> None:
        self.max_cost = max_cost
        self.template = template
        self.n_train_strides = n_train_strides

    def self_optimize(self, dataset: MyDataset, **kwargs):
        # Our training consists of cutting all strides from the dataset and then creating a new template from all
        # strides in the dataset

        # We expect multiple datapoints in the dataset
        sampling_rate = dataset[0].sampling_rate_hz

        # We create a generator for the data and the stride labels
        data_sequences = (
            self._convert_cord_system(datapoint.data, datapoint.groups[0][1]).filter(like="gyr")
            for datapoint in dataset
        )
        stride_labels = (datapoint.segmented_stride_list_ for datapoint in dataset)

        stride_generator = iterate_region_data(data_sequences, stride_labels)

        # This is the new part:
        # Note that this is not really optimal, as we force all strides into memory before sampling from them,
        # but it shouldn't really matter here.
        all_strides = list(stride_generator)
        if self.n_train_strides:
            all_strides = random.sample(all_strides, self.n_train_strides)

        # Note that this will also retrain the scaling based on the new data
        self.template = self.template.self_optimize(all_strides, sampling_rate_hz=sampling_rate)

        return self

    def _convert_cord_system(self, data, foot):
        converter = {"left": convert_left_foot_to_fbf, "right": convert_right_foot_to_fbf}
        return converter[foot](data)

    def run(self, datapoint: MyDataset):
        # `datapoint.groups[0]` gives us the identifier of the datapoint (e.g. `("test", "left")`).
        # And `datapoint.groups[0][1]` is the foot.
        data = self._convert_cord_system(datapoint.data, datapoint.groups[0][1])

        dtw = BarthDtw(max_cost=self.max_cost, template=self.template)
        dtw.segment(data, datapoint.sampling_rate_hz)

        self.segmented_stride_list_ = dtw.stride_list_
        self.cost_func_ = dtw.cost_function_
        return self
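
The effect of n_train_strides can be looked at in isolation. The following is a minimal sketch that uses plain integers as placeholder strides instead of gyroscope data. The helper `select_training_strides` is hypothetical and only mirrors the sampling logic of self_optimize above; it is not part of gaitmap or tpcp.

```python
import random
from typing import List, Optional


def select_training_strides(all_strides: List[int], n_train_strides: Optional[int], seed: int) -> List[int]:
    """Return all strides, or a seeded random subset of size `n_train_strides`.

    Hypothetical helper for illustration only.
    """
    if not n_train_strides:
        # `None` means: use every available stride for training.
        return list(all_strides)
    # A local generator keeps the example independent of the global random state.
    rng = random.Random(seed)
    return rng.sample(all_strides, n_train_strides)


strides = list(range(10))  # placeholder "strides"
subset_a = select_training_strides(strides, 3, seed=1)
subset_b = select_training_strides(strides, 3, seed=1)
everything = select_training_strides(strides, None, seed=1)

print(subset_a == subset_b)  # same seed, same subset
print(len(everything))
```

With the same seed, the same strides are selected on every call, which is why the example fixes the seed before running the search. The pipeline above relies on the global seed, while this sketch passes the seed explicitly.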

The Scorer#

The scorer is identical to the scoring function used in the other examples. The F1-score is still the most important metric for our comparison.

from gaitmap.evaluation_utils import evaluate_segmented_stride_list, precision_recall_f1_score


def score(pipeline: MyPipeline, datapoint: MyDataset):
    pipeline.safe_run(datapoint)
    matches_df = evaluate_segmented_stride_list(
        ground_truth=datapoint.segmented_stride_list_, segmented_stride_list=pipeline.segmented_stride_list_
    )
    return precision_recall_f1_score(matches_df)
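
To make the three returned values concrete, here is a small sketch that computes them from match counts. The function `precision_recall_f1` and the counts are made up for illustration. The handling of empty denominators is an assumption and may differ from what gaitmap does internally.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall and F1 from true positives, false positives and false negatives.

    Illustrative helper; returning 0.0 for empty denominators is an assumption.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


# Assumed counts: 28 correctly detected strides, 1 false detection, 2 missed strides.
scores = precision_recall_f1(tp=28, fp=1, fn=2)
print(scores)
```

The F1-score is the harmonic mean of precision and recall, so it only becomes high if both false detections and missed strides are rare.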

Data Splitting#

Like with a normal cross validation, we need to decide on the number of folds and type of splits. In gaitmap we support all cross validation iterators provided in sklearn.

In this example, we only have two datapoints. This means we can only use a 2-fold cross-validation:

from sklearn.model_selection import KFold

cv = KFold(n_splits=2)
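
The following sketch shows, in plain Python, what an unshuffled K-fold split does with two datapoints. `k_fold_indices` is a hypothetical stand-in written for this illustration; in the example itself, the splitting is done by sklearn's KFold.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, n_splits: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be at least 2 and at most the number of samples.")
    # The first `n_samples % n_splits` folds get one extra sample.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0) for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i not in test]
        yield train, test
        start += size


datapoints = [("test", "left"), ("test", "right")]
splits = list(k_fold_indices(len(datapoints), n_splits=2))

for train, test in splits:
    print("train:", [datapoints[i] for i in train], "test:", [datapoints[i] for i in test])
```

Each datapoint is used for testing exactly once. Asking for more folds than datapoints raises an error, which is why two folds is the maximum here.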

The Parameters#

The pipeline above exposes a couple of parameters. The template is modified during training. The n_train_strides parameter controls how many strides are used during training and hence directly affects the outcome. The max_cost parameter is important for the actual DTW matching, but does not influence the optimization step. For our basic GridSearchCV, this doesn’t matter and we treat both types of parameters the same way. But if you have a similar case in your pipeline, make sure to read the section on Pure Parameters at the end of the example.

For the n_train_strides we test the values None (all strides) and 1 (single stride) to make sure that we will see a performance difference between the two options.

from sklearn.model_selection import ParameterGrid

parameters = ParameterGrid({"max_cost": [3, 5], "n_train_strides": [None, 1]})  # None means all strides.
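
A parameter grid is the Cartesian product of all listed values. The sketch below expands the same grid with the standard library only. `expand_grid` is a hypothetical helper, and sorting the keys alphabetically is an assumption made to get a stable order.

```python
from itertools import product


def expand_grid(grid: dict) -> list:
    """Expand a dict of value lists into a list of parameter dicts (illustrative helper)."""
    keys = sorted(grid)  # assumed ordering: alphabetical by parameter name
    return [dict(zip(keys, values)) for values in product(*(grid[key] for key in keys))]


combos = expand_grid({"max_cost": [3, 5], "n_train_strides": [None, 1]})
for combo in combos:
    print(combo)
```

Two values for each of the two parameters result in four candidate combinations, each of which is evaluated on every fold.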

GridSearchCV#

Setting up the GridSearchCV object is similar to the normal GridSearch, we just need to add the additional cv parameter. Then we can simply run the search using the optimize method.

from tpcp.optimize import GridSearchCV

gs = GridSearchCV(pipeline=MyPipeline(), parameter_grid=parameters, scoring=score, cv=cv, return_optimized="f1_score")
gs = gs.optimize(MyDataset())
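
Conceptually, a grid search with cross-validation trains and scores every parameter combination on every fold, selects the combination with the best mean score, and optionally retrains it on all data. The sketch below shows this loop on a toy problem. All names (`train`, `score_model`, `grid_search_cv`, `shrink`) are hypothetical and unrelated to the tpcp implementation.

```python
from statistics import mean


def train(train_values, shrink):
    # "Optimization": learn a shrunken mean from the training data only.
    return shrink * mean(train_values)


def score_model(model, test_values):
    # Higher is better: negative mean absolute error on the test data.
    return -mean(abs(value - model) for value in test_values)


def grid_search_cv(values, shrink_grid, n_splits):
    # Assumes len(values) is divisible by n_splits; folds are contiguous blocks.
    fold_size = len(values) // n_splits
    folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(n_splits)]
    results = []
    for shrink in shrink_grid:
        fold_scores = []
        for test_idx in folds:
            train_values = [v for i, v in enumerate(values) if i not in test_idx]
            test_values = [values[i] for i in test_idx]
            fold_scores.append(score_model(train(train_values, shrink), test_values))
        results.append({"shrink": shrink, "fold_scores": fold_scores, "mean_score": mean(fold_scores)})
    best = max(results, key=lambda result: result["mean_score"])
    # Final step: retrain the best candidate on all available data.
    final_model = train(values, best["shrink"])
    return results, best, final_model


values = [1.0, 2.0, 3.0, 4.0]
results, best, final_model = grid_search_cv(values, shrink_grid=[0.5, 1.0], n_splits=2)
print(best["shrink"], final_model)
```

In the gait example, four parameter combinations and two folds lead to eight train-and-score runs, followed by one final optimization on all data.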


Results#

The output is also comparable to the output of the GridSearch. The main results are stored in the cv_results_ attribute. But instead of just a single performance value per parameter combination, we get one value per fold as well as the mean and std over all folds.

results = gs.cv_results_
results_df = pd.DataFrame(results)

results_df

	mean__debug__optimize_time	std__debug__optimize_time	mean__debug__score_time	std__debug__score_time	split0__test__data_labels	split1__test__data_labels	split0__train__data_labels	split1__train__data_labels	param__max_cost	param__n_train_strides	params	split0__test__agg__precision	split1__test__agg__precision	mean__test__agg__precision	std__test__agg__precision	rank__test__agg__precision	split0__test__agg__recall	split1__test__agg__recall	mean__test__agg__recall	std__test__agg__recall	rank__test__agg__recall	split0__test__agg__f1_score	split1__test__agg__f1_score	mean__test__agg__f1_score	std__test__agg__f1_score	rank__test__agg__f1_score	split0__test__single__precision	split1__test__single__precision	split0__test__single__recall	split1__test__single__recall	split0__test__single__f1_score	split1__test__single__f1_score
0	0.066653	0.010117	0.104701	0.013533	[(test, left)]	[(test, right)]	[(test, right)]	[(test, left)]	3	None	{'max_cost': 3, 'n_train_strides': None}	1.0	1.000000	1.000000	0.000000	1	1.0	0.866667	0.933333	0.066667	3	1.0	0.928571	0.964286	0.035714	3	[1.0]	[1.0]	[1.0]	[0.8666666666666667]	[1.0]	[0.9285714285714286]
1	0.047887	0.000089	0.089664	0.000108	[(test, left)]	[(test, right)]	[(test, right)]	[(test, left)]	3	1	{'max_cost': 3, 'n_train_strides': 1}	1.0	1.000000	1.000000	0.000000	1	1.0	0.866667	0.933333	0.066667	3	1.0	0.928571	0.964286	0.035714	3	[1.0]	[1.0]	[1.0]	[0.8666666666666667]	[1.0]	[0.9285714285714286]
2	0.053957	0.000756	0.090297	0.000390	[(test, left)]	[(test, right)]	[(test, right)]	[(test, left)]	5	None	{'max_cost': 5, 'n_train_strides': None}	1.0	1.000000	1.000000	0.000000	1	1.0	0.933333	0.966667	0.033333	1	1.0	0.965517	0.982759	0.017241	1	[1.0]	[1.0]	[1.0]	[0.9333333333333333]	[1.0]	[0.9655172413793104]
3	0.048471	0.000563	0.092458	0.000976	[(test, left)]	[(test, right)]	[(test, right)]	[(test, left)]	5	1	{'max_cost': 5, 'n_train_strides': 1}	1.0	0.965517	0.982759	0.017241	4	1.0	0.933333	0.966667	0.033333	1	1.0	0.949153	0.974576	0.025424	2	[1.0]	[0.9655172413793104]	[1.0]	[0.9333333333333333]	[1.0]	[0.9491525423728815]
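
The aggregated columns can be reproduced by hand from the per-fold values. The sketch below recomputes the mean, standard deviation, and rank of the F1-score from the split0 and split1 values in the table above. It assumes a population standard deviation and a ranking in which tied candidates share the lowest rank. These conventions are assumptions that reproduce the numbers shown, not a statement about the internals of tpcp.

```python
from statistics import mean, pstdev

# Per-fold F1-scores copied from the table above (one entry per parameter combination).
split0_f1 = [1.0, 1.0, 1.0, 1.0]
split1_f1 = [0.9285714285714286, 0.9285714285714286, 0.9655172413793104, 0.9491525423728815]

fold_scores = list(zip(split0_f1, split1_f1))
means = [mean(scores) for scores in fold_scores]
stds = [pstdev(scores) for scores in fold_scores]  # population std over the folds
# Rank 1 is the best mean; tied candidates share the lowest rank.
ranks = [1 + sum(other > current for other in means) for current in means]

for m, s, r in zip(means, stds, ranks):
    print(f"mean={m:.6f} std={s:.6f} rank={r}")
```

With only two folds, the standard deviation is simply half the absolute difference between the two fold scores.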

The mean score is the primary value used to select the best parameter combination (if return_optimized is True or the name of a score). All other performance values are just there to provide further insight.

results_df[["mean__test__agg__precision", "mean__test__agg__recall", "mean__test__agg__f1_score"]]

	mean__test__agg__precision	mean__test__agg__recall	mean__test__agg__f1_score
0	1.000000	0.933333	0.964286
1	1.000000	0.933333	0.964286
2	1.000000	0.966667	0.982759
3	0.982759	0.966667	0.974576

For even more insight, you can inspect the scores per datapoint:

results_df.filter(like="test__single")

	split0__test__single__precision	split1__test__single__precision	split0__test__single__recall	split1__test__single__recall	split0__test__single__f1_score	split1__test__single__f1_score
0	[1.0]	[1.0]	[1.0]	[0.8666666666666667]	[1.0]	[0.9285714285714286]
1	[1.0]	[1.0]	[1.0]	[0.8666666666666667]	[1.0]	[0.9285714285714286]
2	[1.0]	[1.0]	[1.0]	[0.9333333333333333]	[1.0]	[0.9655172413793104]
3	[1.0]	[0.9655172413793104]	[1.0]	[0.9333333333333333]	[1.0]	[0.9491525423728815]

If return_optimized was set to True (or the name of a score), a final optimization is performed using the best set of parameters and all the available data. The resulting pipeline will be stored in optimized_pipeline_.

print("Best Para Combi:", gs.best_params_)
print("Paras of optimized Pipeline:", gs.optimized_pipeline_.get_params())

Best Para Combi: {'max_cost': 5, 'n_train_strides': None}
Paras of optimized Pipeline: {'max_cost': 5, 'n_train_strides': None, 'template__data':          gyr_pa      gyr_ml      gyr_si
 -203.142981 -512.256912  109.443353
 -166.092050 -442.302302   35.136148
 -134.636155 -314.549589    9.269076
  -69.096900 -184.730245  -53.273026
   18.316815  -69.020164  -82.914830
..          ...         ...         ...
-205.091070 -406.058835  139.618045
-204.008343 -412.149119  143.430504
-196.744087 -421.287768  140.890287
-193.801858 -434.551734  132.242778
-202.365642 -460.508649  121.266798

[225 rows x 3 columns], 'template__interpolation_method': 'linear', 'template__n_samples': None, 'template__sampling_rate_hz': 204.8, 'template__scaling__data_max': 512.2569124540979, 'template__scaling__out_max': 1, 'template__scaling': TrainableAbsMaxScaler(data_max=512.2569124540979, out_max=1), 'template__use_cols': None, 'template': InterpolatedDtwTemplate(data=         gyr_pa      gyr_ml      gyr_si
 -203.142981 -512.256912  109.443353
 -166.092050 -442.302302   35.136148
 -134.636155 -314.549589    9.269076
  -69.096900 -184.730245  -53.273026
   18.316815  -69.020164  -82.914830
..          ...         ...         ...
-205.091070 -406.058835  139.618045
-204.008343 -412.149119  143.430504
-196.744087 -421.287768  140.890287
-193.801858 -434.551734  132.242778
-202.365642 -460.508649  121.266798

[225 rows x 3 columns], interpolation_method='linear', n_samples=None, sampling_rate_hz=204.8, scaling=TrainableAbsMaxScaler(data_max=512.2569124540979, out_max=1), use_cols=None)}

To run the optimized pipeline, we can directly use the run/safe_run method on the GridSearch object. This makes it possible to use the GridSearch as a replacement for your pipeline object with minimal code changes.

If you try to call run/safe_run (or score for that matter) before the optimization, an error is raised.

segmented_stride_list = gs.safe_run(MyDataset()[0]).segmented_stride_list_
segmented_stride_list

	start	end
s_id
0	364	584
1	584	802
2	802	1023
3	1023	1242
4	1242	1458
5	1458	1672
6	1672	1887
7	1887	2104
8	2104	2327
9	2327	2546
10	2546	2773
11	2773	2998
12	2998	3231
13	3231	3453
14	3934	4163
15	4163	4382
16	4382	4603
17	4603	4822
18	4822	5043
19	5043	5267
20	5267	5489
21	5489	5713
22	5713	5936
23	5936	6167
24	6167	6395
25	6395	6628
26	6628	6858
27	6858	7091

Pure Parameters#

As mentioned above, some parameters in this search do not affect the outcome of the optimization step. We call these parameters pure parameters. In this example max_cost is a pure parameter. In contrast, n_train_strides is a Hyperparameter, as changing the parameter does change the outcome of the pipeline optimization step.

However, during our GridSearch we treat both types of parameters the same. This means self_optimize is called once for each parameter combination above (and for each fold), even though we expect the same output of self_optimize for, e.g., the parameter combinations {"max_cost": 3, "n_train_strides": None} and {"max_cost": 5, "n_train_strides": None}. In this example, this didn’t really matter because the optimization was fast, but in other cases it could be very wasteful to rerun the optimization multiple times even though the outcome would be identical.

A better approach would be to only run the training for all parameter combinations that are actually expected to change its output and set the rest of the parameters only during the run step. To learn more about this approach review the concept of Group 3 algorithms in the evaluation guide.

GridSearchCV has the option to make exactly this optimization. However, it cannot magically know which parameters should be considered “pure”. This information needs to be provided manually via the pure_parameters parameter. If provided, the output of the optimization will be cached and reused if only pure parameters are modified.

Warning

Setting the wrong parameters as pure can result in hard-to-debug issues. Make sure you fully understand your pipeline before using this option, and compare the results of your pipeline on a subset of your data with and without the option before using it!

In our case, max_cost is a pure parameter. We will rerun the search below with pure_parameters=True so that max_cost, which is annotated as PureParameter in the pipeline, is treated as pure. We will also set the verbosity to 2 to see the caching in action. Note that we also need to reset the random seed; otherwise, we would get different results than above.

random.seed(1)

gs_cached = GridSearchCV(
    pipeline=MyPipeline(),
    parameter_grid=parameters,
    scoring=score,
    pure_parameters=True,
    cv=cv,
    return_optimized="f1_score",
    verbose=2,
)
gs_cached = gs_cached.optimize(MyDataset())
cached_results = gs_cached.cv_results_

Split-Para Combos:   0%|          | 0/8 [00:00<?, ?it/s]________________________________________________________________________________
[Memory] Calling tpcp._utils._score._cached_optimize.<locals>.cachable_optimize...
cachable_optimize(<class 'tpcp.optimize._optimize.Optimize'>, {'pipeline__n_train_strides': None}, MyDataset [1 groups/rows]

     participant   foot
   0        test  right, {})
________________________________________________cachable_optimize - 0.1s, 0.0min


Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]

Datapoints: 100%|##########| 1/1 [00:00<00:00,  9.63it/s]
Datapoints: 100%|##########| 1/1 [00:00<00:00,  9.59it/s]

Split-Para Combos:  12%|#2        | 1/8 [00:00<00:01,  4.45it/s]________________________________________________________________________________
[Memory] Calling tpcp._utils._score._cached_optimize.<locals>.cachable_optimize...
cachable_optimize(<class 'tpcp.optimize._optimize.Optimize'>, {'pipeline__n_train_strides': None}, MyDataset [1 groups/rows]

     participant  foot
   0        test  left, {})
________________________________________________cachable_optimize - 0.1s, 0.0min


Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]
Datapoints: 100%|##########| 1/1 [00:00<00:00, 12.80it/s]

Split-Para Combos:  25%|##5       | 2/8 [00:00<00:01,  5.22it/s]________________________________________________________________________________
[Memory] Calling tpcp._utils._score._cached_optimize.<locals>.cachable_optimize...
cachable_optimize(<class 'tpcp.optimize._optimize.Optimize'>, {'pipeline__n_train_strides': 1}, MyDataset [1 groups/rows]

     participant   foot
   0        test  right, {})
________________________________________________cachable_optimize - 0.0s, 0.0min


Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]
Datapoints: 100%|##########| 1/1 [00:00<00:00, 12.91it/s]

Split-Para Combos:  38%|###7      | 3/8 [00:00<00:00,  5.64it/s]________________________________________________________________________________
[Memory] Calling tpcp._utils._score._cached_optimize.<locals>.cachable_optimize...
cachable_optimize(<class 'tpcp.optimize._optimize.Optimize'>, {'pipeline__n_train_strides': 1}, MyDataset [1 groups/rows]

     participant  foot
   0        test  left, {})
________________________________________________cachable_optimize - 0.0s, 0.0min


Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]
Datapoints: 100%|##########| 1/1 [00:00<00:00, 13.29it/s]

Split-Para Combos:  50%|#####     | 4/8 [00:00<00:00,  5.93it/s][Memory]0.7s, 0.0min    : Loading cachable_optimize...


Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]
Datapoints: 100%|##########| 1/1 [00:00<00:00, 12.68it/s]

Split-Para Combos:  62%|######2   | 5/8 [00:00<00:00,  6.87it/s][Memory]0.8s, 0.0min    : Loading cachable_optimize...


Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]
Datapoints: 100%|##########| 1/1 [00:00<00:00, 12.97it/s]

Split-Para Combos:  75%|#######5  | 6/8 [00:00<00:00,  7.65it/s][Memory]0.9s, 0.0min    : Loading cachable_optimize...


Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]
Datapoints: 100%|##########| 1/1 [00:00<00:00, 13.03it/s]

Split-Para Combos:  88%|########7 | 7/8 [00:01<00:00,  8.25it/s][Memory]1.0s, 0.0min    : Loading cachable_optimize...


Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]
Datapoints: 100%|##########| 1/1 [00:00<00:00, 13.35it/s]

Split-Para Combos: 100%|##########| 8/8 [00:01<00:00,  7.17it/s]

When inspecting the debug output above, we can see that the function cachable_optimize (which handles the optimization internally) was called 4 times, once for each combination of hyperparameter value and data fold. These cached results were then reused in 4 further cases, which correspond to the already run combinations, but with a different value for max_cost. This means we saved 50% of all calls to self_optimize of our pipeline.
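
The bookkeeping behind these numbers can be sketched with a plain dictionary as the cache. This is a simplified stand-in and not the actual tpcp caching mechanism. The function `expensive_optimize` and the fold labels are hypothetical. The key point is that the cache key contains the hyperparameter and the training fold, but not the pure parameter max_cost.

```python
from itertools import product

optimize_calls = []


def expensive_optimize(n_train_strides, train_fold):
    """Placeholder for a costly training step (hypothetical)."""
    optimize_calls.append((n_train_strides, train_fold))
    return f"template(n_train_strides={n_train_strides}, train_fold={train_fold})"


cache = {}
cache_hits = 0
for max_cost, n_train_strides in product([3, 5], [None, 1]):
    for train_fold in ("right", "left"):
        # max_cost is deliberately not part of the key, as it does not affect training.
        key = (n_train_strides, train_fold)
        if key in cache:
            cache_hits += 1
        else:
            cache[key] = expensive_optimize(n_train_strides, train_fold)
        template = cache[key]
        # The run step would now use `template` together with `max_cost`.

print(len(optimize_calls), cache_hits)
```

If a parameter that does affect training were left out of the key, the cache would silently return a template trained with the wrong settings. This is the kind of issue the warning above refers to.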

Just to make sure, you can see that the results below are still identical to our first run.

pd.DataFrame(cached_results).filter(like="mean")

	mean__debug__optimize_time	mean__debug__score_time	mean__test__agg__precision	mean__test__agg__recall	mean__test__agg__f1_score
0	0.082641	0.104395	1.000000	0.933333	0.964286
1	0.060059	0.088986	1.000000	0.933333	0.964286
2	0.003984	0.090722	1.000000	0.966667	0.982759
3	0.003724	0.088062	0.982759	0.966667	0.974576
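
Such a comparison can also be automated. The sketch below compares the mean F1-scores of both runs, copied from the tables in this example, within a small tolerance. `scores_match` is a hypothetical helper. Timing columns are left out on purpose, as they are expected to differ when results are loaded from the cache.

```python
import math

# Mean F1-scores copied from the result tables of the uncached and the cached run.
uncached_f1 = [0.964286, 0.964286, 0.982759, 0.974576]
cached_f1 = [0.964286, 0.964286, 0.982759, 0.974576]


def scores_match(first, second, abs_tol=1e-6):
    """Return True if both score lists have equal length and agree within `abs_tol`."""
    return len(first) == len(second) and all(
        math.isclose(a, b, abs_tol=abs_tol) for a, b in zip(first, second)
    )


print(scores_match(uncached_f1, cached_f1))
```

Running a check like this on a subset of the data is a cheap way to follow the advice in the warning above before enabling the option for a long-running search.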
