.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/datasets_and_pipelines/gridsearch_cv.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_datasets_and_pipelines_gridsearch_cv.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_datasets_and_pipelines_gridsearch_cv.py:


.. _gridsearch_cv:

GridSearchCV
============

.. note:: These examples are largely copies of the corresponding examples in tpcp, adapted to use gait algorithms.
          They are updated less often than the official tpcp examples.
          Hence, it makes sense to cross-check the official examples.

When optimizing parameters of algorithms that have trainable components, the parameter search must be performed
on a validation set that is separate from the test set used for the final evaluation.
Even better is to use a cross-validation for this step.
In gaitmap this can be done using :class:`~tpcp.optimize.GridSearchCV`.

This example explains how to use this method.
To learn more about the concept, review the :ref:`evaluation guide <algorithm_evaluation>` and the `sklearn guide on
tuning hyperparameters <https://scikit-learn.org/stable/modules/grid_search.html#grid-search>`_.

.. GENERATED FROM PYTHON SOURCE LINES 21-34

.. code-block:: default


    import random
    from typing import Optional

    import numpy as np
    import pandas as pd


    from gaitmap.data_transform import TrainableAbsMaxScaler
    from gaitmap.utils.array_handling import iterate_region_data

    random.seed(1)  # We set the random seed for repeatable results


.. GENERATED FROM PYTHON SOURCE LINES 35-39

Dataset
-------
As always, we need a dataset, a pipeline, and a scoring method for a parameter search.
We reuse the dataset used in other pipeline examples.

.. GENERATED FROM PYTHON SOURCE LINES 39-63

.. code-block:: default

    from tpcp import Dataset

    from gaitmap.example_data import get_healthy_example_imu_data, get_healthy_example_stride_borders


    class MyDataset(Dataset):
        @property
        def sampling_rate_hz(self) -> float:
            return 204.8

        @property
        def data(self):
            self.assert_is_single(None, "data")
            return get_healthy_example_imu_data()[self.index.iloc[0]["foot"] + "_sensor"]

        @property
        def segmented_stride_list_(self):
            self.assert_is_single(None, "data")
            return get_healthy_example_stride_borders()[self.index.iloc[0]["foot"] + "_sensor"].set_index("s_id")

        def create_index(self) -> pd.DataFrame:
            return pd.DataFrame({"participant": ["test", "test"], "foot": ["left", "right"]})


.. GENERATED FROM PYTHON SOURCE LINES 64-71

The Pipeline
------------
We use a gait segmentation pipeline that is explained in more detail in the :ref:`optimize_pipelines` example.
However, we modify this pipeline in one key way:
We add an additional parameter `n_train_strides` that controls how many randomly selected strides are used
during training.
Modifying this parameter will change the result of the `self_optimize` step.
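
The effect of `n_train_strides` can be pictured with a small, self-contained sketch.
It uses plain integers as stand-ins for strides, and the helper name `select_training_strides` is hypothetical and not part of gaitmap or tpcp.
It also shows why the random seed matters: the same seed yields the same sub-sample.

```python
import random
from typing import List, Optional


def select_training_strides(strides: List[int], n_train_strides: Optional[int]) -> List[int]:
    """Return all strides, or a random sub-sample of size ``n_train_strides``."""
    if n_train_strides:
        # Sample without replacement, as `random.sample` does.
        return random.sample(strides, n_train_strides)
    return list(strides)


strides = list(range(10))  # Stand-ins for the real stride data.

random.seed(1)
first = select_training_strides(strides, 3)
random.seed(1)
second = select_training_strides(strides, 3)

# With the same seed, the same strides are selected.
print(first == second)
# `None` keeps all strides.
print(select_training_strides(strides, None) == strides)
```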

.. GENERATED FROM PYTHON SOURCE LINES 71-142

.. code-block:: default

    from tpcp import CloneFactory, HyperParameter, OptimizableParameter, OptimizablePipeline, PureParameter

    from gaitmap.stride_segmentation import BarthDtw, InterpolatedDtwTemplate
    from gaitmap.utils.coordinate_conversion import convert_left_foot_to_fbf, convert_right_foot_to_fbf
    from gaitmap.utils.datatype_helper import SingleSensorStrideList


    class MyPipeline(OptimizablePipeline):
        max_cost: PureParameter[float]
        template: OptimizableParameter[InterpolatedDtwTemplate]
        n_train_strides: HyperParameter[Optional[int]]

        segmented_stride_list_: SingleSensorStrideList
        cost_func_: np.ndarray

        def __init__(
            self,
            max_cost: float = 3,
            # We need to wrap the template in a `CloneFactory` call here to prevent issues with mutable defaults!
            template: InterpolatedDtwTemplate = CloneFactory(InterpolatedDtwTemplate(scaling=TrainableAbsMaxScaler())),
            n_train_strides: Optional[int] = None,
        ) -> None:
            self.max_cost = max_cost
            self.template = template
            self.n_train_strides = n_train_strides

        def self_optimize(self, dataset: MyDataset, **kwargs):
            # Our training consists of cutting all strides from the dataset and then creating a new template from all
            # strides in the dataset

            # We expect multiple datapoints in the dataset
            sampling_rate = dataset[0].sampling_rate_hz

            # We create a generator for the data and the stride labels
            data_sequences = (
                self._convert_cord_system(datapoint.data, datapoint.groups[0][1]).filter(like="gyr")
                for datapoint in dataset
            )
            stride_labels = (datapoint.segmented_stride_list_ for datapoint in dataset)

            stride_generator = iterate_region_data(data_sequences, stride_labels)

            # This is the new part:
            # Note that this is not optimal, as we force all strides into memory before sampling,
            # but for this small example it does not really matter.
            all_strides = list(stride_generator)
            if self.n_train_strides:
                all_strides = random.sample(all_strides, self.n_train_strides)

            # Note that this will also retrain the scaling based on the new data
            self.template = self.template.self_optimize(all_strides, sampling_rate_hz=sampling_rate)

            return self

        def _convert_cord_system(self, data, foot):
            converter = {"left": convert_left_foot_to_fbf, "right": convert_right_foot_to_fbf}
            return converter[foot](data)

        def run(self, datapoint: MyDataset):
            # `datapoint.groups[0]` gives us the identifier of the datapoint (e.g. `("test", "left")`).
            # And `datapoint.groups[0][1]` is the foot.
            data = self._convert_cord_system(datapoint.data, datapoint.groups[0][1])

            dtw = BarthDtw(max_cost=self.max_cost, template=self.template)
            dtw.segment(data, datapoint.sampling_rate_hz)

            self.segmented_stride_list_ = dtw.stride_list_
            self.cost_func_ = dtw.cost_function_
            return self


.. GENERATED FROM PYTHON SOURCE LINES 143-147

The Scorer
----------
The scorer is identical to the scoring function used in the other examples.
The F1-score is still the most important metric for our comparison.

.. GENERATED FROM PYTHON SOURCE LINES 147-158

.. code-block:: default

    from gaitmap.evaluation_utils import evaluate_segmented_stride_list, precision_recall_f1_score


    def score(pipeline: MyPipeline, datapoint: MyDataset):
        pipeline.safe_run(datapoint)
        matches_df = evaluate_segmented_stride_list(
            ground_truth=datapoint.segmented_stride_list_, segmented_stride_list=pipeline.segmented_stride_list_
        )
        return precision_recall_f1_score(matches_df)
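
As a reminder of how the three returned values relate, here is a minimal sketch using the standard definitions of precision, recall, and F1-score.
The function `precision_recall_f1` and the counts are hypothetical and only for illustration.
gaitmap's `precision_recall_f1_score` may handle edge cases (such as empty stride lists) differently.

```python
from typing import Dict


def precision_recall_f1(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, and F1 from match counts (standard definitions)."""
    # Returning 0.0 for undefined ratios is a choice made for this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


# Hypothetical counts: 13 strides matched, none falsely detected, 2 missed.
scores = precision_recall_f1(tp=13, fp=0, fn=2)
print(scores)
```

A high precision with a lower recall, as in this hypothetical case, means that every detected stride was correct, but some ground-truth strides were missed.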


.. GENERATED FROM PYTHON SOURCE LINES 159-167

Data Splitting
--------------
Like with a normal cross-validation, we need to decide on the number of folds and the type of splits.
In gaitmap we support all cross-validation iterators provided in `sklearn
<https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators>`_.

In this example we only have two datapoints.
This means we can only use a 2-fold cross-validation:

.. GENERATED FROM PYTHON SOURCE LINES 167-171

.. code-block:: default

    from sklearn.model_selection import KFold

    cv = KFold(n_splits=2)
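
To make the resulting folds concrete, the following plain-Python sketch mimics what an unshuffled K-fold split does with two datapoints.
The helper `k_fold_indices` is hypothetical and not sklearn's implementation; it only illustrates the idea of contiguous, non-overlapping test folds.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, n_splits: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and the number of samples.")
    # Distribute samples as evenly as possible; the first folds take the remainder.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0) for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i not in test]
        yield train, test
        start += size


datapoints = [("test", "left"), ("test", "right")]
splits = [
    ([datapoints[i] for i in train], [datapoints[i] for i in test])
    for train, test in k_fold_indices(len(datapoints), n_splits=2)
]
for fold, (train, test) in enumerate(splits):
    print(f"fold {fold}: train={train}, test={test}")
```

With only two datapoints, each fold trains on one foot and tests on the other.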


.. GENERATED FROM PYTHON SOURCE LINES 172-184

The Parameters
--------------
The pipeline above exposes a couple of parameters.
The `template` will be modified during training.
The `n_train_strides` controls how many strides are used during training and hence directly affects the outcome.
The `max_cost` parameter is important for the actual dtw-matching, but does not influence the optimization step.
For our basic `GridSearchCV` this doesn't matter and we treat both types of parameters the same way.
But if you have a similar case in your pipeline, make sure to read the section on *Pure Parameters* at the end of the
example.

For the `n_train_strides` we test the values `None` (all strides) and 1 (single stride) to make sure that we will
see a performance difference between the two options.

.. GENERATED FROM PYTHON SOURCE LINES 184-188

.. code-block:: default

    from sklearn.model_selection import ParameterGrid

    parameters = ParameterGrid({"max_cost": [3, 5], "n_train_strides": [None, 1]})  # None means all strides.
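
To see which combinations such a grid covers, the expansion can be sketched with the standard library alone.
This is an illustration of the Cartesian product behind a parameter grid, not sklearn's implementation, and the variable names are chosen for this sketch only.

```python
from itertools import product

param_grid = {"max_cost": [3, 5], "n_train_strides": [None, 1]}

# Build every combination of the listed values (Cartesian product).
keys = sorted(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]

for combination in combinations:
    print(combination)

# Each combination is evaluated once per cross-validation fold.
n_splits = 2
n_fits = len(combinations) * n_splits
print(n_fits)
```

Four parameter combinations and two folds give eight split-parameter combinations in total.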


.. GENERATED FROM PYTHON SOURCE LINES 189-194

GridSearchCV
------------
Setting up the GridSearchCV object is similar to the normal GridSearch; we just need to add the additional `cv`
parameter.
Then we can simply run the search using the `optimize` method.

.. GENERATED FROM PYTHON SOURCE LINES 194-199

.. code-block:: default

    from tpcp.optimize import GridSearchCV

    gs = GridSearchCV(pipeline=MyPipeline(), parameter_grid=parameters, scoring=score, cv=cv, return_optimized="f1_score")
    gs = gs.optimize(MyDataset())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Split-Para Combos: 100%|##########| 8/8 [00:01<00:00,  6.26it/s]


.. GENERATED FROM PYTHON SOURCE LINES 200-206

Results
-------
The output is also comparable to the output of the GridSearch.
The main results are stored in the `cv_results_` attribute.
But instead of just a single performance value per parameter combination, we get one value per fold and the mean and
std over all folds.

.. GENERATED FROM PYTHON SOURCE LINES 206-211

.. code-block:: default

    results = gs.cv_results_
    results_df = pd.DataFrame(results)

    results_df


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>mean__debug__optimize_time</th>
          <th>std__debug__optimize_time</th>
          <th>mean__debug__score_time</th>
          <th>std__debug__score_time</th>
          <th>split0__test__data_labels</th>
          <th>split1__test__data_labels</th>
          <th>split0__train__data_labels</th>
          <th>split1__train__data_labels</th>
          <th>param__max_cost</th>
          <th>param__n_train_strides</th>
          <th>params</th>
          <th>split0__test__agg__precision</th>
          <th>split1__test__agg__precision</th>
          <th>mean__test__agg__precision</th>
          <th>std__test__agg__precision</th>
          <th>rank__test__agg__precision</th>
          <th>split0__test__agg__recall</th>
          <th>split1__test__agg__recall</th>
          <th>mean__test__agg__recall</th>
          <th>std__test__agg__recall</th>
          <th>rank__test__agg__recall</th>
          <th>split0__test__agg__f1_score</th>
          <th>split1__test__agg__f1_score</th>
          <th>mean__test__agg__f1_score</th>
          <th>std__test__agg__f1_score</th>
          <th>rank__test__agg__f1_score</th>
          <th>split0__test__single__precision</th>
          <th>split1__test__single__precision</th>
          <th>split0__test__single__recall</th>
          <th>split1__test__single__recall</th>
          <th>split0__test__single__f1_score</th>
          <th>split1__test__single__f1_score</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>0.064745</td>
          <td>0.007969</td>
          <td>0.101779</td>
          <td>0.010444</td>
          <td>[(test, left)]</td>
          <td>[(test, right)]</td>
          <td>[(test, right)]</td>
          <td>[(test, left)]</td>
          <td>3</td>
          <td>None</td>
          <td>{'max_cost': 3, 'n_train_strides': None}</td>
          <td>1.0</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>0.000000</td>
          <td>1</td>
          <td>1.0</td>
          <td>0.866667</td>
          <td>0.933333</td>
          <td>0.066667</td>
          <td>3</td>
          <td>1.0</td>
          <td>0.928571</td>
          <td>0.964286</td>
          <td>0.035714</td>
          <td>3</td>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[0.8666666666666667]</td>
          <td>[1.0]</td>
          <td>[0.9285714285714286]</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0.049803</td>
          <td>0.000527</td>
          <td>0.089396</td>
          <td>0.000702</td>
          <td>[(test, left)]</td>
          <td>[(test, right)]</td>
          <td>[(test, right)]</td>
          <td>[(test, left)]</td>
          <td>3</td>
          <td>1</td>
          <td>{'max_cost': 3, 'n_train_strides': 1}</td>
          <td>1.0</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>0.000000</td>
          <td>1</td>
          <td>1.0</td>
          <td>0.866667</td>
          <td>0.933333</td>
          <td>0.066667</td>
          <td>3</td>
          <td>1.0</td>
          <td>0.928571</td>
          <td>0.964286</td>
          <td>0.035714</td>
          <td>3</td>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[0.8666666666666667]</td>
          <td>[1.0]</td>
          <td>[0.9285714285714286]</td>
        </tr>
        <tr>
          <th>2</th>
          <td>0.055776</td>
          <td>0.000002</td>
          <td>0.095733</td>
          <td>0.000946</td>
          <td>[(test, left)]</td>
          <td>[(test, right)]</td>
          <td>[(test, right)]</td>
          <td>[(test, left)]</td>
          <td>5</td>
          <td>None</td>
          <td>{'max_cost': 5, 'n_train_strides': None}</td>
          <td>1.0</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>0.000000</td>
          <td>1</td>
          <td>1.0</td>
          <td>0.933333</td>
          <td>0.966667</td>
          <td>0.033333</td>
          <td>1</td>
          <td>1.0</td>
          <td>0.965517</td>
          <td>0.982759</td>
          <td>0.017241</td>
          <td>1</td>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[0.9333333333333333]</td>
          <td>[1.0]</td>
          <td>[0.9655172413793104]</td>
        </tr>
        <tr>
          <th>3</th>
          <td>0.050634</td>
          <td>0.001628</td>
          <td>0.094837</td>
          <td>0.000001</td>
          <td>[(test, left)]</td>
          <td>[(test, right)]</td>
          <td>[(test, right)]</td>
          <td>[(test, left)]</td>
          <td>5</td>
          <td>1</td>
          <td>{'max_cost': 5, 'n_train_strides': 1}</td>
          <td>1.0</td>
          <td>0.965517</td>
          <td>0.982759</td>
          <td>0.017241</td>
          <td>4</td>
          <td>1.0</td>
          <td>0.933333</td>
          <td>0.966667</td>
          <td>0.033333</td>
          <td>1</td>
          <td>1.0</td>
          <td>0.949153</td>
          <td>0.974576</td>
          <td>0.025424</td>
          <td>2</td>
          <td>[1.0]</td>
          <td>[0.9655172413793104]</td>
          <td>[1.0]</td>
          <td>[0.9333333333333333]</td>
          <td>[1.0]</td>
          <td>[0.9491525423728815]</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 212-214

The mean score is the primary value used to select the best parameter combination (if `return_optimized` is enabled).
All other performance values are just there to provide further insight.
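
The relation between the per-fold values and the aggregated columns can be reproduced by hand.
The sketch below uses the per-fold F1-scores from the results table above and assumes that the reported std is the population standard deviation, which is consistent with the tabulated numbers.
Picking the highest mean is a simplified stand-in for the ranking done by `GridSearchCV`.

```python
from statistics import fmean, pstdev

# Per-fold F1-scores (split0, split1) for each parameter combination, copied from the results table.
f1_per_fold = [
    ({"max_cost": 3, "n_train_strides": None}, [1.0, 0.9285714285714286]),
    ({"max_cost": 3, "n_train_strides": 1}, [1.0, 0.9285714285714286]),
    ({"max_cost": 5, "n_train_strides": None}, [1.0, 0.9655172413793104]),
    ({"max_cost": 5, "n_train_strides": 1}, [1.0, 0.9491525423728815]),
]

# Aggregate over folds: mean and (population) standard deviation.
summary = [(params, fmean(scores), pstdev(scores)) for params, scores in f1_per_fold]
for params, mean, std in summary:
    print(params, round(mean, 6), round(std, 6))

# The combination with the highest mean score is considered the best.
best_params, best_mean, _ = max(summary, key=lambda entry: entry[1])
print(best_params)
```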

.. GENERATED FROM PYTHON SOURCE LINES 214-217

.. code-block:: default


    results_df[["mean__test__agg__precision", "mean__test__agg__recall", "mean__test__agg__f1_score"]]


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>mean__test__agg__precision</th>
          <th>mean__test__agg__recall</th>
          <th>mean__test__agg__f1_score</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>1.000000</td>
          <td>0.933333</td>
          <td>0.964286</td>
        </tr>
        <tr>
          <th>1</th>
          <td>1.000000</td>
          <td>0.933333</td>
          <td>0.964286</td>
        </tr>
        <tr>
          <th>2</th>
          <td>1.000000</td>
          <td>0.966667</td>
          <td>0.982759</td>
        </tr>
        <tr>
          <th>3</th>
          <td>0.982759</td>
          <td>0.966667</td>
          <td>0.974576</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 218-219

For even more insight, you can inspect the scores per datapoint:

.. GENERATED FROM PYTHON SOURCE LINES 219-222

.. code-block:: default


    results_df.filter(like="test__single")


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>split0__test__single__precision</th>
          <th>split1__test__single__precision</th>
          <th>split0__test__single__recall</th>
          <th>split1__test__single__recall</th>
          <th>split0__test__single__f1_score</th>
          <th>split1__test__single__f1_score</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[0.8666666666666667]</td>
          <td>[1.0]</td>
          <td>[0.9285714285714286]</td>
        </tr>
        <tr>
          <th>1</th>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[0.8666666666666667]</td>
          <td>[1.0]</td>
          <td>[0.9285714285714286]</td>
        </tr>
        <tr>
          <th>2</th>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[1.0]</td>
          <td>[0.9333333333333333]</td>
          <td>[1.0]</td>
          <td>[0.9655172413793104]</td>
        </tr>
        <tr>
          <th>3</th>
          <td>[1.0]</td>
          <td>[0.9655172413793104]</td>
          <td>[1.0]</td>
          <td>[0.9333333333333333]</td>
          <td>[1.0]</td>
          <td>[0.9491525423728815]</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 223-226

If `return_optimized` is set to True (or the name of a score), a final optimization is performed using the best
set of parameters and **all** the available data.
The resulting pipeline will be stored in `optimized_pipeline_`.

.. GENERATED FROM PYTHON SOURCE LINES 226-229

.. code-block:: default

    print("Best Para Combi:", gs.best_params_)
    print("Paras of optimized Pipeline:", gs.optimized_pipeline_.get_params())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Best Para Combi: {'max_cost': 5, 'n_train_strides': None}
    Paras of optimized Pipeline: {'max_cost': 5, 'n_train_strides': None, 'template__data':          gyr_pa      gyr_ml      gyr_si
    0   -203.142981 -512.256912  109.443353
    1   -166.092050 -442.302302   35.136148
    2   -134.636155 -314.549589    9.269076
    3    -69.096900 -184.730245  -53.273026
    4     18.316815  -69.020164  -82.914830
    ..          ...         ...         ...
    220 -205.091070 -406.058835  139.618045
    221 -204.008343 -412.149119  143.430504
    222 -196.744087 -421.287768  140.890287
    223 -193.801858 -434.551734  132.242778
    224 -202.365642 -460.508649  121.266798

    [225 rows x 3 columns], 'template__interpolation_method': 'linear', 'template__n_samples': None, 'template__sampling_rate_hz': 204.8, 'template__scaling__data_max': 512.2569124540979, 'template__scaling__out_max': 1, 'template__scaling': TrainableAbsMaxScaler(data_max=512.2569124540979, out_max=1), 'template__use_cols': None, 'template': InterpolatedDtwTemplate(data=         gyr_pa      gyr_ml      gyr_si
    0   -203.142981 -512.256912  109.443353
    1   -166.092050 -442.302302   35.136148
    2   -134.636155 -314.549589    9.269076
    3    -69.096900 -184.730245  -53.273026
    4     18.316815  -69.020164  -82.914830
    ..          ...         ...         ...
    220 -205.091070 -406.058835  139.618045
    221 -204.008343 -412.149119  143.430504
    222 -196.744087 -421.287768  140.890287
    223 -193.801858 -434.551734  132.242778
    224 -202.365642 -460.508649  121.266798

    [225 rows x 3 columns], interpolation_method='linear', n_samples=None, sampling_rate_hz=204.8, scaling=TrainableAbsMaxScaler(data_max=512.2569124540979, out_max=1), use_cols=None)}


.. GENERATED FROM PYTHON SOURCE LINES 230-234

To run the optimized pipeline, we can directly use the `run`/`safe_run` method on the `GridSearchCV` object.
This makes it possible to use the `GridSearchCV` as a replacement for your pipeline object with minimal code changes.

If you try to call `run`/`safe_run` (or `score` for that matter) before the optimization, an error is raised.

.. GENERATED FROM PYTHON SOURCE LINES 234-238

.. code-block:: default

    segmented_stride_list = gs.safe_run(MyDataset()[0]).segmented_stride_list_
    segmented_stride_list


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>start</th>
          <th>end</th>
        </tr>
        <tr>
          <th>s_id</th>
          <th></th>
          <th></th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>364</td>
          <td>584</td>
        </tr>
        <tr>
          <th>1</th>
          <td>584</td>
          <td>802</td>
        </tr>
        <tr>
          <th>2</th>
          <td>802</td>
          <td>1023</td>
        </tr>
        <tr>
          <th>3</th>
          <td>1023</td>
          <td>1242</td>
        </tr>
        <tr>
          <th>4</th>
          <td>1242</td>
          <td>1458</td>
        </tr>
        <tr>
          <th>5</th>
          <td>1458</td>
          <td>1672</td>
        </tr>
        <tr>
          <th>6</th>
          <td>1672</td>
          <td>1887</td>
        </tr>
        <tr>
          <th>7</th>
          <td>1887</td>
          <td>2104</td>
        </tr>
        <tr>
          <th>8</th>
          <td>2104</td>
          <td>2327</td>
        </tr>
        <tr>
          <th>9</th>
          <td>2327</td>
          <td>2546</td>
        </tr>
        <tr>
          <th>10</th>
          <td>2546</td>
          <td>2773</td>
        </tr>
        <tr>
          <th>11</th>
          <td>2773</td>
          <td>2998</td>
        </tr>
        <tr>
          <th>12</th>
          <td>2998</td>
          <td>3231</td>
        </tr>
        <tr>
          <th>13</th>
          <td>3231</td>
          <td>3453</td>
        </tr>
        <tr>
          <th>14</th>
          <td>3934</td>
          <td>4163</td>
        </tr>
        <tr>
          <th>15</th>
          <td>4163</td>
          <td>4382</td>
        </tr>
        <tr>
          <th>16</th>
          <td>4382</td>
          <td>4603</td>
        </tr>
        <tr>
          <th>17</th>
          <td>4603</td>
          <td>4822</td>
        </tr>
        <tr>
          <th>18</th>
          <td>4822</td>
          <td>5043</td>
        </tr>
        <tr>
          <th>19</th>
          <td>5043</td>
          <td>5267</td>
        </tr>
        <tr>
          <th>20</th>
          <td>5267</td>
          <td>5489</td>
        </tr>
        <tr>
          <th>21</th>
          <td>5489</td>
          <td>5713</td>
        </tr>
        <tr>
          <th>22</th>
          <td>5713</td>
          <td>5936</td>
        </tr>
        <tr>
          <th>23</th>
          <td>5936</td>
          <td>6167</td>
        </tr>
        <tr>
          <th>24</th>
          <td>6167</td>
          <td>6395</td>
        </tr>
        <tr>
          <th>25</th>
          <td>6395</td>
          <td>6628</td>
        </tr>
        <tr>
          <th>26</th>
          <td>6628</td>
          <td>6858</td>
        </tr>
        <tr>
          <th>27</th>
          <td>6858</td>
          <td>7091</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 239-272

Pure Parameters
---------------
As mentioned above, some parameters in this search do not affect the outcome of the optimization step.
We call these parameters *pure* parameters.
In this example `max_cost` is a *pure* parameter.
In contrast, `n_train_strides` is a *Hyperparameter*, as changing the parameter does change the outcome of the
pipeline optimization step.

However, during our GridSearch we treat both types of parameters the same.
This means `self_optimize` is called once for each parameter combination above, even though we expect the same output
of `self_optimize` for, e.g., the parameter combinations `{"max_cost": 3, "n_train_strides": None}` and
`{"max_cost": 5, "n_train_strides": None}`.
In this example this didn't really matter, because the optimization was fast, but in other cases it could be very
wasteful to rerun the optimization multiple times, even though the outcome would be identical.

A better approach would be to only run the training for all parameter combinations that are actually expected to
change its output and set the rest of the parameters only during the `run` step.
To learn more about this approach review the concept of *Group 3 algorithms* in the
:ref:`evaluation guide <algorithm_evaluation>`.

`GridSearchCV` has the option to make exactly this optimization.
However, it cannot magically know which parameters should be considered "pure".
This information needs to be provided via the `pure_parameters` parameter.
If provided, the output of the optimization will be cached and reused if only pure parameters are modified.
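
The caching idea can be sketched independently of tpcp.
The following is a simplified illustration, not the actual implementation: the names `expensive_optimize` and `cached_optimize` are hypothetical, and the cache key is built only from the non-pure parameters and the training data.

```python
from itertools import product
from typing import Dict, Tuple

pure_parameter_names = {"max_cost"}
param_grid = {"max_cost": [3, 5], "n_train_strides": [None, 1]}
train_sets = [("right",), ("left",)]  # One training set per cross-validation fold.

stats = {"optimize_calls": 0, "evaluations": 0}
cache: Dict[Tuple, str] = {}


def expensive_optimize(hyper_params: dict, train_set: tuple) -> str:
    """Stand-in for the costly training step."""
    stats["optimize_calls"] += 1
    return f"model trained on {train_set} with {hyper_params}"


def cached_optimize(params: dict, train_set: tuple) -> str:
    # Pure parameters are excluded from the key, as they do not affect training.
    hyper = tuple(sorted((k, v) for k, v in params.items() if k not in pure_parameter_names))
    key = (hyper, train_set)
    if key not in cache:
        cache[key] = expensive_optimize(dict(hyper), train_set)
    return cache[key]


for max_cost, n_train_strides in product(param_grid["max_cost"], param_grid["n_train_strides"]):
    for train_set in train_sets:
        cached_optimize({"max_cost": max_cost, "n_train_strides": n_train_strides}, train_set)
        stats["evaluations"] += 1

print(stats)
```

In this sketch, eight evaluations require only four training runs, because changing `max_cost` alone reuses the cached result.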

.. warning:: Setting the wrong parameters as *pure* can result in hard-to-debug issues.
             Make sure you fully understand your pipeline before using this option, and compare the results of your
             pipeline on a subset of your data with and without the option before using it!

In our case, `max_cost` is a pure parameter.
We will rerun the pipeline below and mark `max_cost` as a pure parameter explicitly.
We will also set the verbosity to 2, to see the caching in action.
Note that we also need to reset the random seed; otherwise we would get different results than above.

.. GENERATED FROM PYTHON SOURCE LINES 272-286

.. code-block:: default

    random.seed(1)

    gs_cached = GridSearchCV(
        pipeline=MyPipeline(),
        parameter_grid=parameters,
        scoring=score,
        pure_parameters=True,
        cv=cv,
        return_optimized="f1_score",
        verbose=2,
    )
    gs_cached = gs_cached.optimize(MyDataset())
    cached_results = gs_cached.cv_results_


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Split-Para Combos:   0%|          | 0/8 [00:00<?, ?it/s]________________________________________________________________________________
    [Memory] Calling tpcp._utils._score._cached_optimize.<locals>.cachable_optimize...
    cachable_optimize(<class 'tpcp.optimize._optimize.Optimize'>, {'pipeline__n_train_strides': None}, MyDataset [1 groups/rows]

         participant   foot
       0        test  right, {})
    ________________________________________________cachable_optimize - 0.1s, 0.0min

    Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]
    Datapoints: 100%|##########| 1/1 [00:00<00:00,  9.95it/s]    Datapoints: 100%|##########| 1/1 [00:00<00:00,  9.91it/s]
    Split-Para Combos:  12%|#2        | 1/8 [00:00<00:01,  4.66it/s]________________________________________________________________________________
    [Memory] Calling tpcp._utils._score._cached_optimize.<locals>.cachable_optimize...
    cachable_optimize(<class 'tpcp.optimize._optimize.Optimize'>, {'pipeline__n_train_strides': None}, MyDataset [1 groups/rows]

         participant  foot
       0        test  left, {})
    ________________________________________________cachable_optimize - 0.1s, 0.0min

    Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]    Datapoints: 100%|##########| 1/1 [00:00<00:00, 12.53it/s]
    Split-Para Combos:  25%|##5       | 2/8 [00:00<00:01,  5.28it/s]________________________________________________________________________________
    [Memory] Calling tpcp._utils._score._cached_optimize.<locals>.cachable_optimize...
    cachable_optimize(<class 'tpcp.optimize._optimize.Optimize'>, {'pipeline__n_train_strides': 1}, MyDataset [1 groups/rows]

         participant   foot
       0        test  right, {})
    ________________________________________________cachable_optimize - 0.1s, 0.0min

    Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]    Datapoints: 100%|##########| 1/1 [00:00<00:00, 12.63it/s]
    Split-Para Combos:  38%|###7      | 3/8 [00:00<00:00,  5.60it/s]________________________________________________________________________________
    [Memory] Calling tpcp._utils._score._cached_optimize.<locals>.cachable_optimize...
    cachable_optimize(<class 'tpcp.optimize._optimize.Optimize'>, {'pipeline__n_train_strides': 1}, MyDataset [1 groups/rows]

         participant  foot
       0        test  left, {})
    ________________________________________________cachable_optimize - 0.1s, 0.0min

    Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]    Datapoints: 100%|##########| 1/1 [00:00<00:00, 12.72it/s]
    Split-Para Combos:  50%|#####     | 4/8 [00:00<00:00,  5.77it/s][Memory]0.7s, 0.0min    : Loading cachable_optimize...

    Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]    Datapoints: 100%|##########| 1/1 [00:00<00:00, 12.33it/s]
    Split-Para Combos:  62%|######2   | 5/8 [00:00<00:00,  6.70it/s][Memory]0.8s, 0.0min    : Loading cachable_optimize...

    Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]    Datapoints: 100%|##########| 1/1 [00:00<00:00, 12.42it/s]
    Split-Para Combos:  75%|#######5  | 6/8 [00:00<00:00,  7.44it/s][Memory]0.9s, 0.0min    : Loading cachable_optimize...

    Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]    Datapoints: 100%|##########| 1/1 [00:00<00:00, 12.96it/s]
    Split-Para Combos:  88%|########7 | 7/8 [00:01<00:00,  8.07it/s][Memory]1.0s, 0.0min    : Loading cachable_optimize...

    Datapoints:   0%|          | 0/1 [00:00<?, ?it/s]    Datapoints: 100%|##########| 1/1 [00:00<00:00, 12.79it/s]
    Split-Para Combos: 100%|##########| 8/8 [00:01<00:00,  8.52it/s]    Split-Para Combos: 100%|##########| 8/8 [00:01<00:00,  7.05it/s]


.. GENERATED FROM PYTHON SOURCE LINES 287-294

The debug output above shows that the function `cachable_optimize` (which handles the optimization internally)
was actually executed only 4 times, once for each combination of the hyperparameter `n_train_strides` and the data
folds.
In the remaining 4 cases, the cached results were loaded instead.
These cases correspond to the combinations that had already been run and differ only in the value of `max_cost`.
This means we saved 50% of all calls to `self_optimize` of our pipeline.
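
The counting behind this number can be reproduced with a small stand-alone sketch.
It does not use tpcp or joblib: `fake_self_optimize` is a hypothetical stand-in for the expensive training step,
`functools.lru_cache` replaces the disk cache, and the values chosen for `max_cost` are purely illustrative.
The only assumption taken from the example is that the training result depends on `n_train_strides` and the fold,
but not on `max_cost`.

```python
from functools import lru_cache
from itertools import product


# Hypothetical stand-in for the expensive training step of a pipeline.
# The cache key consists only of the arguments that influence training.
@lru_cache(maxsize=None)
def fake_self_optimize(n_train_strides, fold):
    return f"trained(n_train_strides={n_train_strides}, fold={fold})"


n_train_strides_values = (None, 1)  # hyperparameter: changes the training result
max_cost_values = (3, 5)  # pure parameter: illustrative values, not used for training
folds = (0, 1)  # two cross-validation folds

for n_train_strides, max_cost, fold in product(n_train_strides_values, max_cost_values, folds):
    # Training is looked up in the cache first; max_cost would only be applied afterwards.
    trained = fake_self_optimize(n_train_strides, fold)

info = fake_self_optimize.cache_info()
print(f"computed: {info.misses}, loaded from cache: {info.hits}")
```

Of the 8 split-parameter combinations, 4 trigger an actual computation and 4 are served from the cache, which
mirrors the 50% saving described above.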

As a sanity check, the results below are still identical to those of our first run.

.. GENERATED FROM PYTHON SOURCE LINES 294-295

.. code-block:: default

    pd.DataFrame(cached_results).filter(like="mean")


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>mean__debug__optimize_time</th>
          <th>mean__debug__score_time</th>
          <th>mean__test__agg__precision</th>
          <th>mean__test__agg__recall</th>
          <th>mean__test__agg__f1_score</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>0.079231</td>
          <td>0.104237</td>
          <td>1.000000</td>
          <td>0.933333</td>
          <td>0.964286</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0.064390</td>
          <td>0.092016</td>
          <td>1.000000</td>
          <td>0.933333</td>
          <td>0.964286</td>
        </tr>
        <tr>
          <th>2</th>
          <td>0.004036</td>
          <td>0.093557</td>
          <td>1.000000</td>
          <td>0.966667</td>
          <td>0.982759</td>
        </tr>
        <tr>
          <th>3</th>
          <td>0.003866</td>
          <td>0.090319</td>
          <td>0.982759</td>
          <td>0.966667</td>
          <td>0.974576</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  7.961 seconds)

**Estimated memory usage:**  9 MB


.. _sphx_glr_download_auto_examples_datasets_and_pipelines_gridsearch_cv.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: gridsearch_cv.py <gridsearch_cv.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: gridsearch_cv.ipynb <gridsearch_cv.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_