.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/datasets_and_pipelines/gridsearch_cv.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_datasets_and_pipelines_gridsearch_cv.py:

.. _gridsearch_cv:

GridSearchCV
============

.. note::
    These examples are essentially copies of the corresponding examples in tpcp, adapted to gait algorithms.
    They are updated less often than the official tpcp examples.
    Hence, it makes sense to cross-check the official examples.

When trying to optimize parameters for algorithms that have trainable components, it is required to perform
the parameter search on a validation set (that is separate from the test set used for the final validation).
Even better is to use a cross-validation for this step.
In gaitmap this can be done by using :class:`~tpcp.optimize.GridSearchCV`.

This example explains how to use this method.
To learn more about the concept, review the :ref:`evaluation guide ` and the
`sklearn guide on tuning hyperparameters `_.

.. GENERATED FROM PYTHON SOURCE LINES 21-34

.. code-block:: default

    import random
    from typing import Optional

    import numpy as np
    import pandas as pd

    from gaitmap.data_transform import TrainableAbsMaxScaler
    from gaitmap.utils.array_handling import iterate_region_data

    random.seed(1)  # We set the random seed for repeatable results

.. GENERATED FROM PYTHON SOURCE LINES 35-39

Dataset
-------
As always, we need a dataset, a pipeline, and a scoring method for a parameter search.
We reuse the dataset used in other pipeline examples.

.. GENERATED FROM PYTHON SOURCE LINES 39-63

.. code-block:: default

    from tpcp import Dataset

    from gaitmap.example_data import get_healthy_example_imu_data, get_healthy_example_stride_borders


    class MyDataset(Dataset):
        @property
        def sampling_rate_hz(self) -> float:
            return 204.8

        @property
        def data(self):
            self.assert_is_single(None, "data")
            return get_healthy_example_imu_data()[self.index.iloc[0]["foot"] + "_sensor"]

        @property
        def segmented_stride_list_(self):
            self.assert_is_single(None, "data")
            return get_healthy_example_stride_borders()[self.index.iloc[0]["foot"] + "_sensor"].set_index("s_id")

        def create_index(self) -> pd.DataFrame:
            return pd.DataFrame({"participant": ["test", "test"], "foot": ["left", "right"]})

.. GENERATED FROM PYTHON SOURCE LINES 64-71

The Pipeline
------------
We use a gait segmentation pipeline that is explained in more detail in the :ref:`optimize_pipelines` example.
However, we modify this pipeline in one key way:
we add an additional parameter `n_train_strides` that controls how many randomly selected strides are used
during training.
Modifying this parameter changes the result of the `self_optimize` step.

.. GENERATED FROM PYTHON SOURCE LINES 71-142

.. code-block:: default

    from tpcp import CloneFactory, HyperParameter, OptimizableParameter, OptimizablePipeline, PureParameter

    from gaitmap.stride_segmentation import BarthDtw, InterpolatedDtwTemplate
    from gaitmap.utils.coordinate_conversion import convert_left_foot_to_fbf, convert_right_foot_to_fbf
    from gaitmap.utils.datatype_helper import SingleSensorStrideList


    class MyPipeline(OptimizablePipeline):
        max_cost: PureParameter[float]
        template: OptimizableParameter[InterpolatedDtwTemplate]
        n_train_strides: HyperParameter[Optional[int]]

        segmented_stride_list_: SingleSensorStrideList
        cost_func_: np.ndarray

        def __init__(
            self,
            max_cost: float = 3,
            # We need to wrap the template in a `CloneFactory` call here to prevent issues with mutable defaults!
            template: InterpolatedDtwTemplate = CloneFactory(InterpolatedDtwTemplate(scaling=TrainableAbsMaxScaler())),
            n_train_strides: Optional[int] = None,
        ) -> None:
            self.max_cost = max_cost
            self.template = template
            self.n_train_strides = n_train_strides

        def self_optimize(self, dataset: MyDataset, **kwargs):
            # Our training consists of cutting all strides from the dataset and then creating a new template from all
            # strides in the dataset

            # We expect multiple datapoints in the dataset
            sampling_rate = dataset[0].sampling_rate_hz
            # We create a generator for the data and the stride labels
            data_sequences = (
                self._convert_cord_system(datapoint.data, datapoint.groups[0][1]).filter(like="gyr")
                for datapoint in dataset
            )
            stride_labels = (datapoint.segmented_stride_list_ for datapoint in dataset)
            stride_generator = iterate_region_data(data_sequences, stride_labels)
            # This is the new part:
            # Note that this is not optimal, as we force all strides into memory before sampling from them,
            # but for this small dataset it does not matter.
            all_strides = list(stride_generator)
            if self.n_train_strides:
                all_strides = random.sample(all_strides, self.n_train_strides)
            # Note that this will also retrain the scaling based on the new data
            self.template = self.template.self_optimize(all_strides, sampling_rate_hz=sampling_rate)
            return self

        def _convert_cord_system(self, data, foot):
            converter = {"left": convert_left_foot_to_fbf, "right": convert_right_foot_to_fbf}
            return converter[foot](data)

        def run(self, datapoint: MyDataset):
            # `datapoint.groups[0]` gives us the identifier of the datapoint (e.g. `("test", "left")`).
            # And `datapoint.groups[0][1]` is the foot.
            data = self._convert_cord_system(datapoint.data, datapoint.groups[0][1])

            dtw = BarthDtw(max_cost=self.max_cost, template=self.template)
            dtw.segment(data, datapoint.sampling_rate_hz)

            self.segmented_stride_list_ = dtw.stride_list_
            self.cost_func_ = dtw.cost_function_
            return self

.. GENERATED FROM PYTHON SOURCE LINES 143-147

The Scorer
----------
The scorer is identical to the scoring function used in the other examples.
The F1-score is still the most important metric for our comparison.

.. GENERATED FROM PYTHON SOURCE LINES 147-158

.. code-block:: default

    from gaitmap.evaluation_utils import evaluate_segmented_stride_list, precision_recall_f1_score


    def score(pipeline: MyPipeline, datapoint: MyDataset):
        pipeline.safe_run(datapoint)
        matches_df = evaluate_segmented_stride_list(
            ground_truth=datapoint.segmented_stride_list_, segmented_stride_list=pipeline.segmented_stride_list_
        )
        return precision_recall_f1_score(matches_df)

.. GENERATED FROM PYTHON SOURCE LINES 159-167

Data Splitting
--------------
As with a normal cross-validation, we need to decide on the number of folds and the type of splits.
In gaitmap we support all cross-validation iterators provided in :ref:`sklearn `.

In this example we only have two datapoints.
This means we can only use a 2-fold cross-validation:

.. GENERATED FROM PYTHON SOURCE LINES 167-171

.. code-block:: default

    from sklearn.model_selection import KFold

    cv = KFold(n_splits=2)

.. GENERATED FROM PYTHON SOURCE LINES 172-184

The Parameters
--------------
The pipeline above exposes a couple of parameters.
The `template` will be modified during training.
The `n_train_strides` parameter controls how many strides are used during training and hence directly affects
the outcome.
The `max_cost` parameter is important for the actual dtw-matching, but does not influence the optimization step.

For our basic `GridSearchCV` this doesn't matter and we treat both types of parameters the same way.
But if you have a similar case in your pipeline, make sure to read the section on *Pure Parameters* at the end of
the example.

For `n_train_strides` we test the values `None` (all strides) and 1 (single stride) to make sure that we will see
a performance difference between the two options.

.. GENERATED FROM PYTHON SOURCE LINES 184-188

.. code-block:: default

    from sklearn.model_selection import ParameterGrid

    parameters = ParameterGrid({"max_cost": [3, 5], "n_train_strides": [None, 1]})  # None means all strides.

.. GENERATED FROM PYTHON SOURCE LINES 189-194

GridSearchCV
------------
Setting up the GridSearchCV object is similar to the normal GridSearch; we just need to add the additional `cv`
parameter.
Then we can simply run the search using the `optimize` method.

.. GENERATED FROM PYTHON SOURCE LINES 194-199

.. code-block:: default

    from tpcp.optimize import GridSearchCV

    gs = GridSearchCV(pipeline=MyPipeline(), parameter_grid=parameters, scoring=score, cv=cv, return_optimized="f1_score")
    gs = gs.optimize(MyDataset())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Split-Para Combos: 0%| | 0/8 [00:00
mean__debug__optimize_time std__debug__optimize_time mean__debug__score_time std__debug__score_time split0__test__data_labels split1__test__data_labels split0__train__data_labels split1__train__data_labels param__max_cost param__n_train_strides params split0__test__agg__precision split1__test__agg__precision mean__test__agg__precision std__test__agg__precision rank__test__agg__precision split0__test__agg__recall split1__test__agg__recall mean__test__agg__recall std__test__agg__recall rank__test__agg__recall split0__test__agg__f1_score split1__test__agg__f1_score mean__test__agg__f1_score std__test__agg__f1_score rank__test__agg__f1_score split0__test__single__precision split1__test__single__precision split0__test__single__recall split1__test__single__recall split0__test__single__f1_score split1__test__single__f1_score
0 0.064745 0.007969 0.101779 0.010444 [(test, left)] [(test, right)] [(test, right)] [(test, left)] 3 None {'max_cost': 3, 'n_train_strides': None} 1.0 1.000000 1.000000 0.000000 1 1.0 0.866667 0.933333 0.066667 3 1.0 0.928571 0.964286 0.035714 3 [1.0] [1.0] [1.0] [0.8666666666666667] [1.0] [0.9285714285714286]
1 0.049803 0.000527 0.089396 0.000702 [(test, left)] [(test, right)] [(test, right)] [(test, left)] 3 1 {'max_cost': 3, 'n_train_strides': 1} 1.0 1.000000 1.000000 0.000000 1 1.0 0.866667 0.933333 0.066667 3 1.0 0.928571 0.964286 0.035714 3 [1.0] [1.0] [1.0] [0.8666666666666667] [1.0] [0.9285714285714286]
2 0.055776 0.000002 0.095733 0.000946 [(test, left)] [(test, right)] [(test, right)] [(test, left)] 5 None {'max_cost': 5, 'n_train_strides': None} 1.0 1.000000 1.000000 0.000000 1 1.0 0.933333 0.966667 0.033333 1 1.0 0.965517 0.982759 0.017241 1 [1.0] [1.0] [1.0] [0.9333333333333333] [1.0] [0.9655172413793104]
3 0.050634 0.001628 0.094837 0.000001 [(test, left)] [(test, right)] [(test, right)] [(test, left)] 5 1 {'max_cost': 5, 'n_train_strides': 1} 1.0 0.965517 0.982759 0.017241 4 1.0 0.933333 0.966667 0.033333 1 1.0 0.949153 0.974576 0.025424 2 [1.0] [0.9655172413793104] [1.0] [0.9333333333333333] [1.0] [0.9491525423728815]
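
The `mean__`, `std__` and `rank__` columns in the table above can be recomputed from the per-split columns. The following is a minimal sketch in plain pandas, not tpcp's own implementation. It assumes a population standard deviation (`ddof=0`) and "minimum" ranking of ties, because these choices reproduce the numbers shown above; the F1 values are copied (rounded) from the table.

```python
import pandas as pd

# Per-split F1-scores, copied (rounded) from the results table above.
folds = pd.DataFrame(
    {
        "split0__test__agg__f1_score": [1.0, 1.0, 1.0, 1.0],
        "split1__test__agg__f1_score": [0.928571, 0.928571, 0.965517, 0.949153],
    }
)

summary = pd.DataFrame(
    {
        "mean": folds.mean(axis=1),
        # Assumption: ddof=0 (population std) matches the values in the table.
        "std": folds.std(axis=1, ddof=0),
    }
)
# Highest mean gets rank 1; tied rows share the smallest rank.
summary["rank"] = summary["mean"].rank(method="min", ascending=False).astype(int)

print(summary.round(6))
```

Row 2 (`max_cost=5`, `n_train_strides=None`) ends up with rank 1, which agrees with the `rank__test__agg__f1_score` column.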


.. GENERATED FROM PYTHON SOURCE LINES 212-214

The mean score is the primary value used to select the best parameter combination (if `return_optimized` is True
or the name of a score, as in this example).
All other performance values are just there to provide further insight.

.. GENERATED FROM PYTHON SOURCE LINES 214-217

.. code-block:: default

    results_df[["mean__test__agg__precision", "mean__test__agg__recall", "mean__test__agg__f1_score"]]

.. raw:: html
mean__test__agg__precision mean__test__agg__recall mean__test__agg__f1_score
0 1.000000 0.933333 0.964286
1 1.000000 0.933333 0.964286
2 1.000000 0.966667 0.982759
3 0.982759 0.966667 0.974576


.. GENERATED FROM PYTHON SOURCE LINES 218-219

For even more insight, you can inspect the scores per datapoint:

.. GENERATED FROM PYTHON SOURCE LINES 219-222

.. code-block:: default

    results_df.filter(like="test__single")

.. raw:: html
split0__test__single__precision split1__test__single__precision split0__test__single__recall split1__test__single__recall split0__test__single__f1_score split1__test__single__f1_score
0 [1.0] [1.0] [1.0] [0.8666666666666667] [1.0] [0.9285714285714286]
1 [1.0] [1.0] [1.0] [0.8666666666666667] [1.0] [0.9285714285714286]
2 [1.0] [1.0] [1.0] [0.9333333333333333] [1.0] [0.9655172413793104]
3 [1.0] [0.9655172413793104] [1.0] [0.9333333333333333] [1.0] [0.9491525423728815]
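
The per-datapoint F1-scores above are consistent with the standard definition of F1 as the harmonic mean of precision and recall. The sketch below checks this with the values from split 1. The helper `f1_from_precision_recall` is a hypothetical name introduced for illustration and is not part of gaitmap.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper, returns 0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) of the single test datapoint in split 1, copied from the table above.
split1_scores = {
    "max_cost=3, n_train_strides=None": (1.0, 0.8666666666666667),
    "max_cost=5, n_train_strides=None": (1.0, 0.9333333333333333),
    "max_cost=5, n_train_strides=1": (0.9655172413793104, 0.9333333333333333),
}

f1_scores = {name: f1_from_precision_recall(p, r) for name, (p, r) in split1_scores.items()}
for name, f1 in f1_scores.items():
    print(name, round(f1, 6))
```

The last entry shows why training on a single stride ranks lower here: recall is unchanged, but the drop in precision pulls the F1-score down.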


.. GENERATED FROM PYTHON SOURCE LINES 223-226

If `return_optimized` was set to True (or the name of a score), a final optimization is performed using the best
set of parameters and **all** the available data.
The resulting pipeline will be stored in `optimized_pipeline_`.

.. GENERATED FROM PYTHON SOURCE LINES 226-229

.. code-block:: default

    print("Best Para Combi:", gs.best_params_)
    print("Paras of optimized Pipeline:", gs.optimized_pipeline_.get_params())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Best Para Combi: {'max_cost': 5, 'n_train_strides': None}
    Paras of optimized Pipeline: {'max_cost': 5, 'n_train_strides': None, 'template__data': gyr_pa gyr_ml gyr_si
    0 -203.142981 -512.256912 109.443353
    1 -166.092050 -442.302302 35.136148
    2 -134.636155 -314.549589 9.269076
    3 -69.096900 -184.730245 -53.273026
    4 18.316815 -69.020164 -82.914830
    .. ... ... ...
    220 -205.091070 -406.058835 139.618045
    221 -204.008343 -412.149119 143.430504
    222 -196.744087 -421.287768 140.890287
    223 -193.801858 -434.551734 132.242778
    224 -202.365642 -460.508649 121.266798

    [225 rows x 3 columns], 'template__interpolation_method': 'linear', 'template__n_samples': None, 'template__sampling_rate_hz': 204.8, 'template__scaling__data_max': 512.2569124540979, 'template__scaling__out_max': 1, 'template__scaling': TrainableAbsMaxScaler(data_max=512.2569124540979, out_max=1), 'template__use_cols': None, 'template': InterpolatedDtwTemplate(data= gyr_pa gyr_ml gyr_si
    0 -203.142981 -512.256912 109.443353
    1 -166.092050 -442.302302 35.136148
    2 -134.636155 -314.549589 9.269076
    3 -69.096900 -184.730245 -53.273026
    4 18.316815 -69.020164 -82.914830
    .. ... ... ...
    220 -205.091070 -406.058835 139.618045
    221 -204.008343 -412.149119 143.430504
    222 -196.744087 -421.287768 140.890287
    223 -193.801858 -434.551734 132.242778
    224 -202.365642 -460.508649 121.266798

    [225 rows x 3 columns], interpolation_method='linear', n_samples=None, sampling_rate_hz=204.8, scaling=TrainableAbsMaxScaler(data_max=512.2569124540979, out_max=1), use_cols=None)}

.. GENERATED FROM PYTHON SOURCE LINES 230-234

To run the optimized pipeline, we can directly use the `run`/`safe_run` method on the GridSearch object.
This makes it possible to use the `GridSearch` as a replacement for your pipeline object with minimal code changes.

If you try to call `run`/`safe_run` (or `score` for that matter) before the optimization, an error is raised.

.. GENERATED FROM PYTHON SOURCE LINES 234-238

.. code-block:: default

    segmented_stride_list = gs.safe_run(MyDataset()[0]).segmented_stride_list_
    segmented_stride_list

.. raw:: html
start end
s_id
0 364 584
1 584 802
2 802 1023
3 1023 1242
4 1242 1458
5 1458 1672
6 1672 1887
7 1887 2104
8 2104 2327
9 2327 2546
10 2546 2773
11 2773 2998
12 2998 3231
13 3231 3453
14 3934 4163
15 4163 4382
16 4382 4603
17 4603 4822
18 4822 5043
19 5043 5267
20 5267 5489
21 5489 5713
22 5713 5936
23 5936 6167
24 6167 6395
25 6395 6628
26 6628 6858
27 6858 7091
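
A stride list like the one above can be post-processed with plain pandas. The sketch below assumes that `start` and `end` are sample indices recorded at the dataset's sampling rate of 204.8 Hz; it converts stride lengths to seconds and flags gaps between consecutive strides. The column names `duration_s` and `gap_to_next_s` are illustrative choices, not gaitmap conventions.

```python
import pandas as pd

SAMPLING_RATE_HZ = 204.8  # value returned by `MyDataset.sampling_rate_hz`

# Strides 12 to 15, copied from the stride list above.
strides = pd.DataFrame(
    {"start": [2998, 3231, 3934, 4163], "end": [3231, 3453, 4163, 4382]},
    index=pd.Index([12, 13, 14, 15], name="s_id"),
)

# Assumption: start/end are sample indices, so dividing by the sampling rate yields seconds.
strides["duration_s"] = (strides["end"] - strides["start"]) / SAMPLING_RATE_HZ
# Time between the end of a stride and the start of the next one (NaN for the last stride).
strides["gap_to_next_s"] = (strides["start"].shift(-1) - strides["end"]) / SAMPLING_RATE_HZ

print(strides.round(3))
```

A gap of zero means two strides are directly consecutive. The non-zero gap after stride 13 marks a stretch of the recording where no stride was detected.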


.. GENERATED FROM PYTHON SOURCE LINES 239-272

Pure Parameters
---------------
As mentioned above, some parameters in this search do not affect the outcome of the optimization step.
We call these parameters *pure* parameters.
In this example `max_cost` is a *pure* parameter.
In contrast, `n_train_strides` is a *Hyperparameter*, as changing the parameter does change the outcome of the
pipeline optimization step.

However, during our GridSearch we treated both types of parameters the same.
This means `self_optimize` is called once for each parameter combination above, even though we expect the same
output of `self_optimize` for, e.g., the parameter combinations `{"max_cost": 3, "n_train_strides": None}` and
`{"max_cost": 5, "n_train_strides": None}`.
In this example this didn't really matter, because the optimization was fast, but in other cases it could be very
wasteful to rerun the optimization multiple times, even though the outcome would be identical.

A better approach would be to only run the training for the parameter combinations that are actually expected
to change its output and set the rest of the parameters only during the `run` step.
To learn more about this approach review the concept of *Group 3 algorithms* in the :ref:`evaluation guide `.

`GridSearchCV` has the option to make exactly this optimization.
However, it cannot magically know which parameters should be considered "pure".
This information needs to be provided manually via the `pure_parameters` parameter.
If provided, the output of the optimization will be cached and reused if only pure parameters are modified.

.. warning::
    Setting the wrong parameters as *pure* can result in hard-to-debug issues.
    Make sure you fully understand your pipeline before using this option, and compare the results of your
    pipeline on a subset of your data with and without the option before using it!

In our case, `max_cost` is a pure parameter.
We will rerun the grid search below with `pure_parameters=True`, so that `max_cost`, which is annotated as
`PureParameter` in the pipeline, is treated as pure.
We will also set the verbosity to 2 to see the caching in action.
Note that we also need to reset the random seed, otherwise we would get different results than above.

.. GENERATED FROM PYTHON SOURCE LINES 272-286

.. code-block:: default

    random.seed(1)

    gs_cached = GridSearchCV(
        pipeline=MyPipeline(),
        parameter_grid=parameters,
        scoring=score,
        pure_parameters=True,
        cv=cv,
        return_optimized="f1_score",
        verbose=2,
    )
    gs_cached = gs_cached.optimize(MyDataset())
    cached_results = gs_cached.cv_results_

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Split-Para Combos: 0%| | 0/8 [00:00.cachable_optimize...
    cachable_optimize(, {'pipeline__n_train_strides': None}, MyDataset [1 groups/rows]
    participant foot
    0 test right, {})
    ________________________________________________cachable_optimize - 0.1s, 0.0min
    Datapoints: 0%| | 0/1 [00:00.cachable_optimize...
    cachable_optimize(, {'pipeline__n_train_strides': None}, MyDataset [1 groups/rows]
    participant foot
    0 test left, {})
    ________________________________________________cachable_optimize - 0.1s, 0.0min
    Datapoints: 0%| | 0/1 [00:00.cachable_optimize...
    cachable_optimize(, {'pipeline__n_train_strides': 1}, MyDataset [1 groups/rows]
    participant foot
    0 test right, {})
    ________________________________________________cachable_optimize - 0.1s, 0.0min
    Datapoints: 0%| | 0/1 [00:00.cachable_optimize...
    cachable_optimize(, {'pipeline__n_train_strides': 1}, MyDataset [1 groups/rows]
    participant foot
    0 test left, {})
    ________________________________________________cachable_optimize - 0.1s, 0.0min
    Datapoints: 0%| | 0/1 [00:00
mean__debug__optimize_time mean__debug__score_time mean__test__agg__precision mean__test__agg__recall mean__test__agg__f1_score
0 0.079231 0.104237 1.000000 0.933333 0.964286
1 0.064390 0.092016 1.000000 0.933333 0.964286
2 0.004036 0.093557 1.000000 0.966667 0.982759
3 0.003866 0.090319 0.982759 0.966667 0.974576
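
The caching idea from the *Pure Parameters* section can be illustrated without tpcp. The following plain-Python sketch is not how `GridSearchCV` is implemented; `self_optimize` and `run` are hypothetical stand-ins for an expensive training step and a cheap inference step. It only shows the principle: the cache key contains the hyperparameter, while the pure parameter is applied afterwards.

```python
from itertools import product

optimize_calls = []  # records every (expensive) training call


def self_optimize(n_train_strides):
    """Hypothetical stand-in for an expensive training step."""
    optimize_calls.append(n_train_strides)
    return f"template(n_train_strides={n_train_strides})"


def run(template, max_cost):
    """Hypothetical stand-in for the cheap `run` step that consumes the pure parameter."""
    return (template, max_cost)


cache = {}
results = []
for max_cost, n_train_strides in product([3, 5], [None, 1]):
    # Only the hyperparameter is part of the cache key; `max_cost` is pure and
    # therefore cannot change the training result.
    if n_train_strides not in cache:
        cache[n_train_strides] = self_optimize(n_train_strides)
    results.append(run(cache[n_train_strides], max_cost))

print(len(results), "combinations,", len(optimize_calls), "training calls")
```

For one training set, four parameter combinations need only two training calls. This is consistent with the table above, where the optimize time of the last two rows is much smaller than that of the first two.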


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 7.961 seconds)

**Estimated memory usage:** 9 MB

.. _sphx_glr_download_auto_examples_datasets_and_pipelines_gridsearch_cv.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: gridsearch_cv.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: gridsearch_cv.ipynb `

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_