# Scope and Code Structure
(proj_struct)=

To ensure that the project is easy to use, easy to maintain, and easy to expand in the future, all developers should adhere to the guidelines outlined below.
Furthermore, developers should familiarize themselves with the aim and scope of the project to make better decisions when it comes to including new functionality.

If you are a new developer and need help setting up your development environment to work on this project, have a look at the [Development Guide](development_guide.md).

**Everything that follows is a recommendation.**
As for every project, you should:

- value your future self over your current self (don't use shortcuts or hacks that might become liabilities in the long term).
- think before you act.
- know the rules before you break them.
- ask if in doubt.
- ask for a third opinion if you have two competing ones.

## Aim and Scope

The aim of the project is to provide a **library** that contains all the necessary tools to run a state-of-the-art gait analysis.
This means this library is meant to be a toolbox and specifically **not** a single pipeline.
In the long run, this library should include multiple algorithms that allow users to build a multitude of different pipelines.

Following this reasoning, the library should not force users into a "certain way of doing things".
This means specifically:

1. Classes and functions should only require the bare minimum of input.
2. Variations in functionality should be made available to users either by providing appropriate keyword arguments, separate methods, or even separate functions or classes.
3. Following 2., each function/class should have one function and one function only.
   In the case where "one function" consists of multiple steps (e.g.
   event detection consists of HS-detection, TO-detection, MS-detection) and it makes sense to group them together to provide an easier interface, the individual functions should still be available to users to allow them to skip or modify steps as they desire.
4. The library should not prevent users from "intentional stupidity".
   For example, if users decide to apply an event detection method designed to only run on data from sensors attached to the shoe on data from a hip sensor, **let them**.
5. Whenever possible, the library should provide native datatypes as outputs.
   Native datatypes in this case include all the container objects Python supports (lists, dicts, etc.) and the base datatypes of `numpy` and `pandas` (`np.ndarray`, `pd.DataFrame`, `pd.Series`).
   Custom datatypes should only be used if they improve usability and add significant value (either for understanding or manipulating the output).
   One example of that would be the `tpcp.Dataset` datatype used for high-level pipelines.
6. The library should be agnostic to sensor systems and should not contain any code that is highly specific to a certain IMU system.
   This means that loading and preprocessing should be handled by the user or other libraries.
   However, gaitmap can provide utils to make preprocessing easier for users.
   (Loading is out of scope, as too many data formats exist and it is not feasible to support all of them.)

## Code Structure

### Library Structure

For the `gaitmap` vs. `gaitmap_mad` discussion, see [here](gaitmap_mad).

As the library aims to support multiple algorithms, algorithms with similar function should be grouped into individual modules/folders (e.g. Stride-Segmentation, Event Detection, Orientation Estimation, ...).
Each algorithm should then be implemented in a separate file, unless multiple algorithms are really tightly coupled, e.g. the scaler methods: `gaitmap/data_transform/_scaler.py`.
If an algorithm requires a large amount of code and multiple classes/functions, it can be refactored into its own submodule (for example, see `packages/gaitmap_mad/src/gaitmap_mad/stride_segmentation/hmm`).

### Import Paths

To ensure that users can use short import paths, we discourage importing from the Python files directly.
Instead, new algorithms/functions should be added to the `__init__` of the parent folder to be accessible from there.
The implementation file itself should have a leading underscore to mark it as private and discourage users from importing from it.

For example, the algorithm `HerzerEventDetection` is defined in `gaitmap/event_detection/_herzer_event_detection.py`.
This means the recommended import is `from gaitmap.event_detection import HerzerEventDetection`.
To make that possible, `HerzerEventDetection` is imported in `gaitmap/event_detection/__init__.py`.

In the respective `__init__` files, the `__all__` list should be used to allow for `*` imports without cluttering the namespace.

In general, we use absolute imports internally (instead of relative imports), even though it requires more typing.

.. warning::
   With absolute imports and the use of `__init__` imports, you can sometimes create circular dependencies.
   E.g. when you want to import a function `a()` defined in `mod/_a.py` in `mod/_b.py`, and `mod/_b.py` and `mod/_a.py` are both imported in `mod/__init__.py` (as by the convention explained above), you must import `a()` with its full internal path `from mod._a import a` instead of `from mod import a` to avoid circular imports.

In case a single function from an external package is used, just import this function.
In case multiple functions from an external package are used, import this package/module under a commonly used alias (e.g. `np` for numpy, `pd` for pandas, ...).

### Algorithms with specific dependencies

Some algorithms require specific dependencies that are not required by anything else.
In this case, they should be made optional dependencies of gaitmap to reduce the number of dependencies for users.
To make this possible, the algorithm should be implemented in a separate submodule (e.g. `packages/gaitmap_mad/src/gaitmap_mad/stride_segmentation/hmm`) that is not imported anywhere else in the library.

In cases where only a few methods from a different library are required, it might also be feasible to just copy the required methods into the submodule and add a comment pointing to the source of the code.
Make sure to check the license of the original code and provide sufficient attribution.
One example of this can be found in `packages/gaitmap_mad/src/gaitmap_mad/stride_segmentation/dtw/_vendored_tslearn.py`.

### Helper Functions and Utils

Functions that can be reused across multiple algorithms of similar type should be placed in a module-level `_utils.py` file (e.g. `stride_segmentation/_utils.py`).
Functions that are reusable across multiple modules should be placed in an appropriate file in the package-level `utils` module (e.g. `utils/_math_helper.py`).

Utils are generally "private", but when you can imagine use cases where users might reasonably need them when working with the library, you can expose them in the respective `__init__` files.

### Class Structure

All larger algorithms should be represented by classes and not by functions for a couple of reasons that are explained below.
See the [general guides in tpcp](https://tpcp.readthedocs.io/en/latest/guides/algorithms_pipelines_datasets.html).

From the guide:

- The `__init__` of each class should only be there to set parameters.
  No parameter validation or any functionality should be performed here.
- No actual data should be passed to the `__init__`.
  Think of it as configuring the algorithm.
- Defaults for **all** parameters should be provided in the `__init__`.
- The names of the class attributes should be identical to the parameter names used in the `__init__` (i.e.
  the init should contain statements like: `self.parameter_name = parameter_name`).
- All parameters that are not directly dependent on the input data should be set in the `__init__` and not in the *action* method (see below).
  This also includes parameters that should be adapted based on the data, but can theoretically be estimated without having the data available (e.g. the optimal threshold for DTW).
  All these parameters should be set in the `__init__`.
  The data and all other measured/directly data-dependent parameters are passed in the action method.
  This includes, for example, the raw IMU data, the sampling rate, demographic information, etc.
- Results and outputs are stored with a trailing underscore (e.g. `filtered_stride_list_`).
- All algorithms of the same type should have a consistent interface with (as far as possible) identical input parameters to allow drop-in replacements of different algorithms.
- Each type of algorithm has one (or multiple) "action" methods with a descriptive name.
  These *action* methods take the actual data as input and will produce results.
- All *action* methods just return `self` (the object itself).
- Multiple action methods might be required in one of the following cases:
    - Give users more granular functionality (e.g. a `detect_hs` method in addition to a general `detect` method)
    - Multiple steps are required for the algorithm (e.g. fit and predict for ML-based algos)
    - Alternative functionality that can/should not be handled by arguments (e.g. `predict` vs `predict_proba` in sklearn)

Additions to the guide:

- All algorithm classes must directly or indirectly inherit from `tpcp.Algorithm` and `gaitmap.base._BaseSerializable`.
- All classes should store the data (and other arguments) passed in the "action" step in the class object, unless the amount of data would result in an unreasonable performance issue.
  Ideally this should be a reference and not a copy of the data!
  This allows passing the final object as a whole to helper functions that, e.g., can visualize inputs and outputs.
  These parameters should be documented under "Other Parameters" to not clutter the docstring.
- Mutable defaults in the init are, as always, a bad idea, but in gaitmap we make specific exceptions.
  See the section on *Mutable defaults* below.
- All methods should take care that they do not modify the original data passed to the function.
  If required, a copy of the data can be created, but **not** stored in the object.
- All classes should validate their input parameters during the "action" (or whenever the parameters are first needed).
  Don't overdo the validation and focus on logical validation (e.g. a value cannot be larger than x) and not on type validation.
  For type validation, we should trust that Python provides the correct error message once an invalid step is performed.
- All classes should inherit from a BaseClass specific to their type that implements common functionality and enforces the interface.
  Remember to call the respective `super` methods when required.
  The resulting class structure should look like this:

```
tpcp.Algorithm/gaitmap.base._BaseSerializable -> Basic setting of parameters
|
Base -> Basic interface to ensure all algos of the same type use the same input and outputs for their
|       action methods
|
|
-> An improved version of an algorithm, when it does not make sense to toggle the improvement via an input parameter on the algorithm
```

#### Example class structure

Below you can find the simplified class structure of the `RamppEventDetection` algorithm.
It should serve as an example of how further algorithms should be implemented and documented.
Also review the actual implementation of the other algorithms for further inspiration and guidance.
```python
from typing import Dict, Optional, Tuple, Union

import numpy as np
import pandas as pd

from gaitmap.base import BaseEventDetection, BaseType
from gaitmap.utils.datatype_helper import SensorData, is_multi_sensor_data, is_single_sensor_data


class RamppEventDetection(BaseEventDetection):
    """Find gait events in the IMU raw signal based on signal characteristics.

    RamppEventDetection uses signal processing approaches to find temporal gait events by searching for characteristic
    features in the signals as described in Rampp et al. (2014) [1]_.
    For more details refer to the `Notes` section.

    Parameters
    ----------
    ic_search_region_ms
        ...
    min_vel_search_win_size_ms
        ...

    Attributes
    ----------
    min_vel_event_list_ : A stride list or dictionary with such values
        ...
    start_ : 1D array or dictionary with such values
        ...
    end_ : 1D array or dictionary with such values
        ...
    tc_ : 1D array or dictionary with such values
        ...
    min_vel_ : 1D array or dictionary with such values
        ...
    ic_ : 1D array or dictionary with such values
        ...
    pre_ic_ : 1D array or dictionary with such values
        ...

    Other Parameters
    ----------------
    data
        ...
    sampling_rate_hz
        ...
    segmented_stride_list
        ...

    """

    # Parameters (set in the __init__)
    ic_search_region_ms: Tuple[float, float]
    min_vel_search_win_size_ms: float

    # Results (trailing underscore, populated by the action method)
    start_: Optional[Union[np.ndarray, Dict[str, np.ndarray]]]
    end_: Optional[Union[np.ndarray, Dict[str, np.ndarray]]]
    tc_: Optional[Union[np.ndarray, Dict[str, np.ndarray]]]
    min_vel_: Optional[Union[np.ndarray, Dict[str, np.ndarray]]]
    ic_: Optional[Union[np.ndarray, Dict[str, np.ndarray]]]
    pre_ic_: Optional[Union[np.ndarray, Dict[str, np.ndarray]]]
    min_vel_event_list_: Optional[Union[pd.DataFrame, Dict[str, pd.DataFrame]]]

    # Other Parameters (inputs of the action method)
    data: SensorData
    sampling_rate_hz: float
    segmented_stride_list: pd.DataFrame

    def __init__(self, ic_search_region_ms: Tuple[float, float] = (80, 50), min_vel_search_win_size_ms: float = 100):
        # Just add the Parameters without any logic
        self.ic_search_region_ms = ic_search_region_ms
        self.min_vel_search_win_size_ms = min_vel_search_win_size_ms
        super().__init__()

    def detect(self: BaseType, data: SensorData, sampling_rate_hz: float, segmented_stride_list: pd.DataFrame) -> BaseType:
        """Find gait events in data within strides provided by segmented_stride_list.

        Parameters
        ----------
        data
            The data set holding the imu raw data
        sampling_rate_hz
            The sampling rate of the data
        segmented_stride_list
            A list of strides provided by a stride segmentation method

        Returns
        -------
        self
            The class instance with all result attributes populated

        Examples
        --------
        Get gait events from single sensor signal

        >>> event_detection = RamppEventDetection()
        >>> event_detection = event_detection.detect(
        ...     data=data, sampling_rate_hz=204.8, segmented_stride_list=stride_list
        ... )
        >>> event_detection.min_vel_event_list_
           s_id   start     end      ic      tc  min_vel  pre_ic
        0     0   519.0   710.0   651.0   584.0    519.0   498.0
        1     1   710.0   935.0   839.0   802.0    710.0   651.0
        2     2   935.0  1183.0  1089.0  1023.0    935.0   839.0
        ...

        """
        # Set all "Other Parameters"
        self.data = data
        self.sampling_rate_hz = sampling_rate_hz
        self.segmented_stride_list = segmented_stride_list

        # Potential validation of parameters should be performed here

        if is_single_sensor_data(data):
            # Handle single sensors
            ...
        elif is_multi_sensor_data(data):
            # Handle multiple sensors
            ...
        else:
            raise ValueError("Provided data set is not supported by gaitmap")

        return self
```

### Random and Initial State

If any algorithm relies on random processes/operations, the random state should be configurable via an optional kwarg in the `__init__` called `random_state`.
We follow the [`sklearn` recommendations](https://scikit-learn.org/stable/glossary.html#term-random-state) on this.

Algorithms that require an initial value for some optimization should expose this value via the `__init__`.
If the parameter is `None`, a random initial value should be used that is controlled by the additional `random_state` argument.

### Mutable Defaults

See [this guide in tpcp](https://tpcp.readthedocs.io/en/latest/guides/general_concepts.html#mutable-defaults).

## Code guidelines

All code should follow coherent best practices.
As far as possible, the adherence to these best practices should be tested using linters or a test suite that runs as part of the CI.

### General Codestyle

For general codestyle we follow [PEP8](https://www.python.org/dev/peps/pep-0008/) with a couple of exceptions (e.g. line length).
These are documented in the linter config (`.prospector.yml`).

### Naming

We follow the naming conventions outlined in [PEP8](https://www.python.org/dev/peps/pep-0008/#naming-conventions).

For algorithms (if no better name is available) we use `AuthorNameType` (e.g. `BarthEventDetection`).

### Documentation

For documentation, we follow [numpy's guidelines](https://numpydoc.readthedocs.io/en/latest/format.html).
If the datatype is already provided as a TypeHint (see below), it does not need to be specified in the docstring again.
However, it might be helpful to document additional type information (e.g.
the shape of an array that cannot be captured by the TypeHint).

All user-facing functions (all functions and methods that do not have a leading underscore) are expected to be properly and fully documented for potential users.
All private functions are expected to be documented in a way that other developers can understand them.
Additionally, each module should have a docstring explaining its content.
If a module contains only one class, this can be a single sentence/word (e.g. `"""Event detection based on ... ."""`).

### Typehints

To provide a better developer experience, the library should use [TypeHints](https://numpydoc.readthedocs.io/en/latest/format.html) wherever possible.
Remember to use `np.ndarray` instead of `np.array` as the type specification of numpy arrays.
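
As a minimal sketch of the type hint and docstring rules, consider the following helper. The function `moving_average` is purely hypothetical and not part of gaitmap; it only illustrates that `np.ndarray` (the array class) is used in the annotation, while the docstring adds the shape information that the TypeHint cannot express.

```python
import numpy as np


def moving_average(signal: np.ndarray, window_size: int = 3) -> np.ndarray:
    """Smooth a 1D signal with a simple moving average.

    Hypothetical helper, used for illustration only.

    Parameters
    ----------
    signal
        1D array of shape (n_samples,)
    window_size
        Number of samples in each averaging window

    Returns
    -------
    smoothed_signal
        1D array of shape (n_samples - window_size + 1,)

    """
    # Every sample in the window gets the same weight
    kernel = np.ones(window_size) / window_size
    # "valid" only returns values where the window fully overlaps the signal
    return np.convolve(signal, kernel, mode="valid")


smoothed = moving_average(np.array([1.0, 2.0, 3.0, 4.0]))
```

`np.array` is the factory function that creates arrays, whereas `np.ndarray` is the class of the resulting objects, which is why only the latter is meaningful as a type annotation.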