Contributing Guide#

This is an exhaustive guide to ease the contribution process for both novice and experienced contributors. Geomstats is a community effort, and everyone is welcome to contribute.

Development Setup#

The instructions in this section detail the step-by-step process for setting up your development environment before contributing. You can also use it as a checklist that gives an overview of the development steps.

Source control with Git#

The Geomstats project is available on GitHub and uses Git for source control to allow collaboration. Typical interaction with the project involves using git to pull/push code and submitting bugs/feature requests to the geomstats repository.

Be sure to follow the Git installation and configuration instructions for your operating system from the official Git documentation before moving on to the next section.

Note

For more basic information about the git command line client, refer to this detailed tutorial. You can also check out GUI options.

Getting the code#

For development, first get the geomstats source code onto your local machine from the project's GitHub repository, using the following instructions:

  1. Using your browser, go to github.com and create an account if you don’t have one.

  2. While there, navigate to the geomstats repository.

  3. Fork the repository to obtain your own copy. You can do this using the button at the top right corner of the page on GitHub, under your username. Your fork then becomes available at the following link:

    https://github.com/<username>/geomstats
    
  4. Clone your forked repository using:

    $ git clone https://github.com/<username>/geomstats
    

    or via ssh if you’ve set up SSH keys in your account:

    $ git clone git@github.com:<username>/geomstats.git
    
  5. It is recommended practice to add the main geomstats repository as the upstream remote repository:

    $ cd geomstats
    $ git remote add upstream https://github.com/geomstats/geomstats.git
    

    This is so that later you can bring the upstream updates locally by doing:

    $ git pull upstream main
    
  6. Verify your remote configuration:

    $ git remote -v
    upstream  https://github.com/geomstats/geomstats (fetch)
    upstream  https://github.com/geomstats/geomstats (push)
    origin    https://github.com/<username>/geomstats (fetch)
    origin    https://github.com/<username>/geomstats (push)
    
  7. At this point you have the geomstats code on your machine ready for development. Create a new development branch where the new changes will be committed:

    $ git checkout -b <branch-name>
    

    (main could have been used to develop new code. Nevertheless, the process is cleaner if you create a new branch - e.g. the merge from upstream is easier to handle when there are conflicts - and it allows you to develop several features independently, each in its own branch.)

  8. Verify that you are on the new branch:

    $ git branch
    * <branch-name>
      main
    

Dependencies and a virtual environment#

We recommend using conda virtual environments to separate your development environment from any other geomstats versions installed on your system (this simplifies e.g. requirements management).

From the geomstats folder, create a virtual environment:

$ conda create -n geomstats-3.11 python=3.11

This command will create a new environment named geomstats-3.11.

Then, activate the environment and install geomstats in editable mode:

$ conda activate geomstats-3.11
$ pip install -e .

Editable mode means that your changes in geomstats will be immediately reflected in any code that runs within this environment.
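
One way to confirm that the editable install points at your clone is to ask Python where it would import a package from. The following is a minimal sketch using only the standard library. The helper name find_module_origin is hypothetical, and json is used as a stand-in so that the snippet runs anywhere; replace it with geomstats once the package is installed.

```python
import importlib.util


def find_module_origin(module_name):
    """Return the file a top-level module would be imported from.

    Hypothetical helper for illustration; returns None when the module
    cannot be found in the current environment.
    """
    module_spec = importlib.util.find_spec(module_name)
    if module_spec is None:
        return None
    return module_spec.origin


# "json" is a stand-in; use "geomstats" inside your development environment.
print(find_module_origin("json"))
```

With an editable install, the path printed for geomstats should lie inside your cloned folder rather than in the environment's site-packages directory.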

Note

venv is an alternative for creating lightweight environments.

Note

See the pyproject.toml file for details on all project requirements.

Backends#

Geomstats supports several backends, namely: numpy, autograd, pytorch.

The default backend is numpy. Install the other backends using:

$ pip install -e .[<backend_name>]

Then use an environment variable to set the backend:

$ export GEOMSTATS_BACKEND=<backend_name>
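
The same variable can also be set from Python, for instance at the top of a script or notebook. This is a minimal sketch that assumes geomstats reads GEOMSTATS_BACKEND when it is first imported, so the variable has to be set before any geomstats import. The helper set_geomstats_backend and the SUPPORTED_BACKENDS tuple are hypothetical names introduced for illustration.

```python
import os

# Backend names as listed in this guide (hypothetical constant).
SUPPORTED_BACKENDS = ("numpy", "autograd", "pytorch")


def set_geomstats_backend(backend_name):
    """Set GEOMSTATS_BACKEND for the current process.

    Hypothetical helper; it only writes the environment variable.
    """
    if backend_name not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unknown backend: {backend_name}")
    os.environ["GEOMSTATS_BACKEND"] = backend_name
    return os.environ["GEOMSTATS_BACKEND"]


selected_backend = set_geomstats_backend("numpy")
# Import geomstats only after this point so that it sees the variable.
print(selected_backend)
```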

Run the tests#

Geomstats tests can be run using pytest. To run tests with pytest, first install the required packages:

$ pip install -e .[test]

Then run all tests using:

$ pytest tests

Optionally, run a particular test file using:

$ pytest tests/tests_geomstats/<test_filename.py>

Alternatively, run only the package tests using:

$ pytest tests/tests_geomstats

Or only the notebooks and scripts using:

$ pytest tests/tests_scripts

Build the docs#

Documentation in the geomstats project is implemented using sphinx. Install the sphinx dependencies using:

$ pip install -e .[doc]

Then while in the project root folder, build the docs using:

$ cd docs
$ make html

Note

The steps in this section are Unix-specific. Windows users should consult the official documentation on how to install and use make.

Folder Structure#

When you open the Geomstats GitHub page, you will see the top-level directories of the package. Below is a description of each directory.

geomstats

Has the core implementation of the geomstats package features like geometry, distributions, learning, visualization etc.

tests

Has unit tests for the core library features.

docs

Has the official documentation found at https://geomstats.github.io.

benchmarks

Has code for benchmarking several aspects of geomstats.

examples

Has sample code demonstrating different geomstats features.

notebooks

Has Jupyter notebooks with example code.

Testing#

Test Driven Development#

High-quality unit testing is a cornerstone of the geomstats development process. The tests consist of appropriately named classes, located in the tests subdirectory, that check the validity of the algorithms and the different options of the code.

TDD with pytest#

Geomstats uses the pytest Python tool for testing different functions and features. Install the test requirements using:

$ pip install -e .[test]

By convention all test functions should be located in files with file names that start with test_. For example a unit test that exercises the Python addition functionality can be written as follows:

# test_add.py

def add(x, y):
    return x + y


def test_add():
    assert add(4, 5) == 9

Use an assert statement to check that the function under test returns the correct output. Then run the test using:

$ pytest test_add.py

Writing tests for geomstats#

For each function my_fun that you implement in a given my_module.py, you should add the corresponding test function test_my_fun in the file test_my_module.py.

We expect code coverage of new features to be at least 90%, which is automatically verified by the codecov software when you submit a PR. You should also add test_my_fun_vect tests to ensure that your code is vectorized.
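
As an illustration of the test_my_fun and test_my_fun_vect naming, here is a minimal standalone sketch. The function squared_norm and the file name are hypothetical and not part of geomstats; in a real contribution the function would live in my_module.py and only the tests in test_my_module.py. The geomstats test suite itself is organised in classes, so treat this only as a picture of the idea: one test on a single input, and one on a batch of inputs to check vectorization.

```python
# test_my_module.py (hypothetical file name, for illustration only)
import numpy


def squared_norm(vector):
    """Compute the squared Euclidean norm along the last axis.

    Hypothetical function under test; accepts shape [..., dim].
    """
    return numpy.sum(vector**2, axis=-1)


def test_squared_norm():
    vector = numpy.array([3.0, 4.0])
    result = squared_norm(vector)
    assert numpy.allclose(result, 25.0)


def test_squared_norm_vect():
    # A batch of 3 vectors must give 3 results, one per vector.
    vectors = numpy.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
    result = squared_norm(vectors)
    assert result.shape == (3,)
    assert numpy.allclose(result, numpy.array([25.0, 1.0, 4.0]))
```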

Running tests#

First, run the tests related to your changes. For example, if you changed something in geomstats/spd_matrices_space.py, you can run tests by file name:

$ pytest tests/tests_geomstats/test_spd_matrices.py

Then run the tests of the whole codebase to check that your feature is not breaking anything:

$ pytest tests/tests_geomstats/

This way, further modifications on the code base are guaranteed to be consistent with the desired behavior. Merging your PR should not break any test.

Workflow of a contribution#

The best way to start contributing is by finding a part of the project that is familiar to you (e.g. a specific manifold or metric, a learning algorithm, etc). Alternatively, if these concepts are new to you and you would like to contribute while learning, look at some of the existing issues.

Create or choose an issue for new contributors#

New contributors should look for the following tags when searching for issues. We strongly recommend that new contributors tackle easy issues first. This helps the contributor become familiar with the contribution workflow and helps the core devs become acquainted with the contributor; besides, we frequently underestimate how easy an issue is to solve!

Making changes#

The preferred way to contribute to geomstats is to fork the main repository and submit a “pull request” (PR).

Follow the guidelines detailed in Getting the code to set up the development environment. Then, follow the next steps before submitting a PR:

  1. Synchronize your main branch with the upstream main branch:

    $ git checkout main
    $ git pull upstream main
    
  2. Create a feature branch to hold your development changes:

    $ git checkout -b <branch-name>
    
  3. Make changes.

  4. When you’re done editing, add changed files using git add and then git commit:

    $ git add <modified_files>
    $ git commit -m "Add my feature"
    

    to record your changes. Your commit message should respect the good commit messages guidelines. (How to Write a Git Commit Message also provides good advice.)

    Note

    Before committing, make sure you have run the black and flake8 tools for proper code formatting.

    Then push the changes to your GitHub account with:

    $ git push origin <branch-name>
    

    Use the -u flag if the branch does not exist yet remotely.

  5. Follow these instructions to create a pull request from your fork. This will send an email to the committers. You may want to consider sending an email to the mailing list hi@geomstats.ai for more visibility.

  6. Repeat 3. and 4. following the reviewers’ requests.

It is often helpful to keep your local feature branch synchronized with the latest changes of the main geomstats repository. Bring remote changes locally:

$ git checkout main
$ git pull upstream main

And then merge them into your branch:

$ git checkout <branch-name>
$ git merge main

Note

Refer to the Git documentation related to resolving merge conflicts using the command line. The Git documentation and http://try.github.io are excellent resources to get started with git and to understand all of the commands shown here.

Pull Request Checklist#

In order to ease the reviewing process, we recommend that your contribution complies with the following rules. The bolded ones are especially important:

  1. Give your pull request a helpful title. This summarises what your contribution does. This title will often become the commit message once merged so it should summarise your contribution for posterity. In some cases Fix <ISSUE TITLE> is enough. Fix #<ISSUE NUMBER> is never a good title.

  2. Make sure that your code is vectorized. For vectorized matrix operations we recommend using the methods of the Matrices class instead of lower level backend functions, as they are automatically vectorized.

  3. Submit your code with associated unit tests. High-quality unit testing is a cornerstone of the geomstats development process. The tests are appropriately named functions, located in the tests subdirectory, that check the validity of the algorithms and the different options of the code. For each function my_fun that you implement in a given my_module.py, you should add the corresponding test function test_my_fun in the file test_my_module.py. We expect code coverage of new features to be at least 90%, which is automatically verified by the codecov software when you submit a PR. You should also add test_my_fun_vect tests to ensure that your code is vectorized.

  4. Make sure your code passes all unit tests. First, run the tests related to your changes. For example, if you changed something in geomstats/spd_matrices_space.py:

    $ pytest tests/tests_geomstats/test_spd_matrices.py
    

    and then run the tests of the whole codebase to check that your feature is not breaking any of them:

    $ pytest tests/
    

    This way, further modifications on the code base are guaranteed to be consistent with the desired behavior. Merging your PR should not break any test in any backend.

  5. Make sure that your PR follows Python international style guidelines, PEP8. The flake8 package automatically checks for style violations when you submit your PR. We recommend installing flake8 with its plugins on your machine by running:

    $ pip install -e .[dev]
    

    Then before any commit, run:

    $ flake8 geomstats tests
    

    To prevent adding commits which fail to adhere to the PEP8 guidelines, we include a pre-commit config, which immediately invokes flake8 on all files staged for commit when running git commit. To enable the hook, simply run pre-commit install after installing pre-commit either manually via pip or as part of the development requirements.

    Please avoid reformatting parts of the file that your pull request doesn’t change, as it distracts during code reviews.

  6. Make sure that your PR follows geomstats coding style and API (see Coding Style Guidelines). Ensuring style consistency throughout geomstats allows using tools to automatically parse the codebase, for example searching all instances where a given function is used, or using automatic find-and-replace during refactoring. It also speeds up the code review and acceptance of PRs, as the maintainers do not spend time getting used to new conventions and coding preferences.

  7. Make sure your code is properly documented, and make sure the documentation renders properly. To build the documentation, please see our Documentation guidelines. The plugin flake8-docstrings automatically checks that your documentation follows our guidelines when you submit a PR.

  8. Often pull requests resolve one or more other issues (or pull requests). If merging your pull request means that some other issues/PRs should be closed, you should use keywords to create a link to them (e.g., fixes #1234; multiple issues/PRs are allowed as long as each one is preceded by a keyword). Upon merging, those issues/PRs will automatically be closed by GitHub. If your pull request is simply related to some other issues/PRs, create a link to them without using the keywords (e.g., see also #1234).

  9. PRs should often substantiate the change, through benchmarks of performance and efficiency or through examples of usage. Examples also illustrate the features and intricacies of the library to users. Have a look at other examples in the examples/ subdirectory for reference. Examples should demonstrate why the new functionality is useful in practice and, if possible, compare it to other methods available in geomstats.

  10. The user guide should also include expected time and space complexity of the algorithm and scalability, e.g. “this algorithm can scale to a large number of samples > 100000, but does not scale in dimensionality: n_features is expected to be lower than 100”.

  11. Each PR needs to be accepted by a core developer before being merged.

You can also check our Code Review Guidelines to get an idea of what reviewers will expect.

Bonus points for contributions that include a performance analysis with a benchmark script and profiling output (please report on the mailing list hi@geomstats.ai or on the GitHub issue).

Note

The current state of the geomstats code base is not compliant with all of those guidelines, but we expect that enforcing those constraints on all new contributions will get the overall code base quality in the right direction.

Stalled Pull Requests#

As contributing a feature can be a lengthy process, some pull requests appear inactive but unfinished. In such a case, taking them over is a great service for the project.

A good etiquette to take over is:

  • Determine if a PR is stalled

    • A pull request may have the label “stalled” or “help wanted” if we have already identified it as a candidate for other contributors.

    • To decide whether an inactive PR is stalled, ask the contributor whether they plan to continue working on the PR in the near future. Failure to respond within 2 weeks with an activity that moves the PR forward suggests that the PR is stalled and will result in tagging that PR with “help wanted”.

      Note that if a PR has received earlier comments on the contribution that have had no reply in a month, it is safe to assume that the PR is stalled and to shorten the wait time to one day.

  • Taking over a stalled PR: To take over a PR, it is important to comment on the stalled PR that you are taking over and to link from the new PR to the old one. The new PR should be created by pulling from the old one.

Coding Style Guidelines#

The following are some guidelines on how new code should be written. Of course, there are special cases and there will be exceptions to these rules. However, following these rules when submitting new code makes the review easier so new code can be integrated in less time. Uniformly formatted code makes it easier to share code ownership.

In addition to the PEP8 standards, geomstats follows the following guidelines:

  1. Use underscores to separate words in non class names: n_samples rather than nsamples.

  2. Avoid single-character variable names. They prevent using automatic tools to find-and-replace code, as searching for x in geomstats will return the whole codebase. At least 3 characters are advised for a variable name.

  3. Use meaningful function and variable names. The naming should help the maintainers read through your code faster. Thus, my_array, aaa, result, res are generally bad variable names, whereas rotation_vec or symmetric_mat read well.

  4. Avoid comments in the code; the documentation goes in the docstrings. This allows the explanations to be included in the documentation generated automatically on the website. Furthermore, forbidding comments forces us to write clean code and clean docstrings.

  5. Follow geomstats’ API. For example, points on manifolds are denoted point, tangent vectors tangent_vec, matrices mat, exponential exp and logarithms log.

  6. Avoid multiple statements on one line. Split complex computations over several lines. Prefer a line break after a control flow statement (if/for).

  7. Don’t use import * in any case. It is considered harmful by the official Python recommendations. It makes the code harder to read as the origin of symbols is no longer explicitly referenced and, more importantly, it prevents using a static analysis tool like pyflakes to automatically find bugs in geomstats.

  8. Avoid the use of import ... as and of from ... import foo, bar, i.e. do not rename modules or modules’ functions, because you would create objects living in several namespaces which creates confusion (see Language Constructs You Should Not Use). Keeping the original namespace ensures naming consistency in the codebase and speeds up the code reviews: co-developers and maintainers do not have to check if you are using the original module’s method or if you have overwritten it.

  9. Use double quotes and not single quotes for strings.

  10. If you need several lines for a function call, use the syntax:

    my_function_with_a_very_long_name(
       my_param_1=value_1, my_param_2=value_2)
    

    and not:

    my_function_with_a_very_long_name(my_param_1=value_1,
                                      my_param_2=value_2)
    

    as the indentation will break and raise a flake8 error if the name of the function is changed.
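
Several of the guidelines above can be seen together in a short function. The following is only a sketch: compute_rotation_mat is a hypothetical helper written for this guide, not a geomstats function. It uses underscore-separated descriptive names, a plain import statement, double quotes, and a docstring instead of inline comments.

```python
import math


def compute_rotation_mat(rotation_angle):
    """Compute the 2D rotation matrix of a given angle.

    Hypothetical helper written for this guide; not part of geomstats.

    Parameters
    ----------
    rotation_angle : float
        Rotation angle in radians.

    Returns
    -------
    rotation_mat : list, shape=[2, 2]
        Rotation matrix, stored as nested lists.
    """
    cos_angle = math.cos(rotation_angle)
    sin_angle = math.sin(rotation_angle)
    rotation_mat = [
        [cos_angle, -sin_angle],
        [sin_angle, cos_angle],
    ]
    return rotation_mat


quarter_turn_mat = compute_rotation_mat(rotation_angle=math.pi / 2)
print(quarter_turn_mat)
```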

These guidelines can be revised and modified at any time; the only constraint is that they should remain consistent through the codebase. To change geomstats style guidelines, submit a PR to this contributing file, together with the corresponding changes in the codebase.

Documentation#

We are glad to accept any sort of documentation: function docstrings, reStructuredText documents (like this one), tutorials, etc. reStructuredText documents live in the source code repository under the docs/ directory.

Building the Documentation#

Building the documentation requires installing specific requirements:

$ pip install -e .[doc]

Then follow the steps described in Build the docs to build the documentation.

Writing Docstrings#

Intro to Docstrings#

A docstring is a well-formatted description of your function/class/module which includes its purpose, usage, and other information.

There are different markup languages/formats used for docstrings in Python. The three most common are the reStructuredText, numpy, and google docstring styles. For geomstats, we are using the numpy docstring standard. When writing up your docstrings, please review the NumPy docstring guide to understand the role and syntax of each section. Following this syntax is important not only for readability; it is also required for automated parsing for inclusion into our generated API Reference.

You can look at the docstring of any object by printing out its __doc__ attribute. Try this out with the np.array and np.mean functions to see good examples:

>>> import numpy as np
>>> print(np.mean.__doc__)
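
The same works for your own functions. As a small sketch, the standard library function inspect.getdoc returns the docstring with its indentation cleaned up, which is convenient for looking at the summary line. The function compute_mean is a hypothetical example, not a geomstats function.

```python
import inspect


def compute_mean(values):
    """Compute the arithmetic mean of a sequence of numbers.

    Parameters
    ----------
    values : list of float
        Numbers to average.

    Returns
    -------
    mean : float
        Arithmetic mean of the values.
    """
    return sum(values) / len(values)


# The first line of the cleaned docstring is the summary line.
summary_line = inspect.getdoc(compute_mean).splitlines()[0]
print(summary_line)
```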

The Anatomy of a Docstring#

These are some of the most common elements for functions (and ones we’d like you to add where appropriate):

  1. Summary - a one-line (here <79 char) description of the object

    1. Begins immediately after the first """ with a capital letter, ends with a period

    2. If describing a function, use a verb with the imperative mood (e.g. Compute vs Computes)

    3. Use a verb which is as specific as possible, but default to Compute when uncertain (as opposed to Calculate or Evaluate, for example)

  2. Description - a more informative multi-line description of the function

    1. Separated from the summary line by a blank line

    2. Begins with a capital letter and ends with period

  3. Parameters - a formatted list of arguments with type information and description

    1. On the first line, state the parameter name, type, and shape when appropriate. The parameter name should be separated from the rest of the line by a : (with a space on either side). If a parameter is optional, write Optional, default: default_value. as a separate line in the description.

    2. On the next line, indent and write a summary of the parameter beginning with a capital letter and ending with a period.

    3. See Docstring Examples.

  4. Returns (esp. for functions) - a formatted list of returned objects with type information and description

    1. The syntax here is the same as in the parameters section above.

    2. See Docstring Examples.

If documenting a class, you would also want to include an Attributes section. There are many other optional sections you can include which are very helpful. For example: Raises, See Also, Notes, Examples, References, etc.

N.B. Within Notes, you can
  • include LaTeX code

  • cite references in text using ids placed in References
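
The summary-line rules listed above are mechanical enough to check in a few lines. The following is a rough sketch, not the check that flake8-docstrings performs; check_summary_line and the two example functions are hypothetical names introduced for illustration.

```python
import inspect


def check_summary_line(documented_object, max_length=79):
    """Return a list of problems found in a docstring summary line.

    Hypothetical helper; covers only the capital letter, period,
    length and position rules.
    """
    docstring = inspect.getdoc(documented_object)
    if not docstring:
        return ["missing docstring"]
    summary_line = docstring.splitlines()[0]
    problems = []
    if documented_object.__doc__.startswith("\n"):
        problems.append("summary should start right after the quotes")
    if not summary_line[0].isupper():
        problems.append("summary should start with a capital letter")
    if not summary_line.endswith("."):
        problems.append("summary should end with a period")
    if len(summary_line) >= max_length:
        problems.append("summary should be shorter than 79 characters")
    return problems


def compute_norm(vector):
    """Compute the Euclidean norm of a vector."""


def badly_documented(vector):
    """computes something"""


print(check_summary_line(compute_norm))
print(check_summary_line(badly_documented))
```

The check for the imperative mood is left out on purpose, since it cannot be done reliably with simple string tests.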

Docstring Examples#

Here’s a generic docstring template:

def my_method(self, my_param_1, my_param_2="vector"):
    r"""Write a one-line summary for the method.

    Write a description of the method, including "big O"
    (:math:`O\left(g\left(n\right)\right)`) complexities.

    Parameters
    ----------
    my_param_1 : array-like, shape=[..., dim]
        Write a short description of parameter my_param_1.
    my_param_2 : str, {"vector", "matrix"}
        Write a short description of parameter my_param_2.
        Optional, default: "vector".

    Returns
    -------
    my_result : array-like, shape=[..., dim, dim]
        Write a short description of the result returned by the method.

    Notes
    -----
    If relevant, provide equations with (:math:)
    describing computations performed in the method.

    Examples
    --------
    Provide code snippets showing how the method is used.
    You can link to scripts of the examples/ directory.

    References
    ----------
    If relevant, provide a reference with associated pdf or
    wikipedia page.
    """

And here’s a filled-in example from the Scikit-Learn project, modified to our syntax:

def fit_predict(self, X, y=None, sample_weight=None):
    """Compute cluster centers and predict cluster index for each sample.

    Convenience method; equivalent to calling fit(X) followed by
    predict(X).

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape=[..., n_features]
        New data to transform.
    y : Ignored
        Not used, present here for API consistency by convention.
    sample_weight : array-like, shape=[...,]
        The weights for each observation in X. If None, all observations
        are assigned equal weight.
        Optional, default: None.

    Returns
    -------
    labels : array, shape=[...,]
        Index of the cluster each sample belongs to.
    """
    return self.fit(X, sample_weight=sample_weight).labels_

In general, have the following in mind:

  1. Use built-in Python types. (bool instead of boolean)

  2. Use square brackets for defining shapes: array-like, shape=[..., dim]

  3. If a shape can vary, use a list-like notation: array-like, shape=[dimension[:axis], n, dimension[axis:]]

  4. For strings with multiple options, use curly braces: input: str, {"log", "squared", "multinomial"}

  5. 1D or 2D data can be a subset of {array-like, ndarray, sparse matrix, dataframe}. Note that array-like can also be a list, while ndarray is explicitly only a numpy.ndarray.

  6. Add “See Also” in docstrings for related classes/functions. “See Also” in docstrings should be one line per reference, with a colon and an explanation.
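
The list-like shape notation in item 3 can be read as ordinary tuple slicing. As a hedged illustration, assuming that dimension stands for the shape of an existing array and that n is the size of an axis inserted at position axis, the documented shape can be computed as follows. The helper insert_axis_in_shape is a hypothetical name.

```python
def insert_axis_in_shape(dimension, axis, n_samples):
    """Return the shape dimension[:axis] + (n_samples,) + dimension[axis:].

    Hypothetical helper that spells out the docstring shape notation.
    """
    return dimension[:axis] + (n_samples,) + dimension[axis:]


# Shape (2, 3, 4) with an axis of size 5 inserted at position 1.
print(insert_axis_in_shape((2, 3, 4), axis=1, n_samples=5))
```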

For Class and Module Examples see the scikit-learn _weight_boosting.py module. The class AdaBoost has a great example using the elements we’ve discussed here. Of course, these examples are rather verbose, but they’re good for understanding the components.

When editing reStructuredText (.rst) files, try to keep line length under 80 characters (exceptions include links and tables).

Code Review Guidelines#

Reviewing code contributed to the project as PRs is a crucial component of geomstats development. We encourage anyone to start reviewing code of other developers.

The code review process is often highly educational for everybody involved. Reviewing is particularly appropriate when the PR adds a feature you would like to use, because you can then respond critically about whether the PR meets your needs. While each pull request needs to be signed off by two core developers, you can speed up this process by providing your feedback.

Here are a few important aspects that need to be covered in any code review, from high-level questions to a more detailed check-list.

  • Do we want this in the library? Is it likely to be used? Do you, as a geomstats user, like the change and intend to use it? Is it in the scope of geomstats? Will the cost of maintaining a new feature be worth its benefits?

  • Is the code consistent with the API of geomstats? Are public functions/classes/parameters well named and intuitively designed?

  • Are all public functions/classes and their parameters, return types, and stored attributes named according to geomstats conventions and documented clearly?

  • Is every public function/class tested? Are a reasonable set of parameters, their values, value types, and combinations tested? Do the tests validate that the code is correct, i.e. doing what the documentation says it does? If the change is a bug-fix, is a non-regression test included? Look at this to get started with testing in Python.

  • Do the tests pass in the continuous integration build? If appropriate, help the contributor understand why tests failed.

  • Do the tests cover every line of code (see the coverage report in the build log)? If not, are the lines missing coverage good exceptions?

  • Is the code easy to read and low on redundancy? Should variable names be improved for clarity or consistency?

  • Could the code easily be rewritten to run much more efficiently for relevant settings?

  • Will the new code add any dependencies on other libraries? (this is unlikely to be accepted)

  • Does the documentation render properly (see the Documentation section for more details), and are the plots instructive?

  • Upon merging, use the Rebase and Merge option to keep git history clean.

Reporting bugs and features#

Sharing bugs and potential new features for the geomstats project is an equally significant contribution. We encourage reports for any module including documentation and missing tests.

Issue tracker#

The geomstats project uses the GitHub issue tracker for all bug reports and feature requests. To create an issue, navigate to the Issues tab of the project on GitHub and click the New issue button in the upper right corner.

Template of a bug/issue report#

We offer two templates for reporting issues, one for bug reports and another for issues about the documentation, as shown in the figure below:

[Figure: the issue report templates (template.png)]

If none of these suit your needs, feel free to open an issue with the default GitHub blank issue template.

Issue Triaging#

Other than reporting bugs, another important aspect of contribution is issue triaging. This is about issue management and includes the aspects described below.

Reproducing issues#

Sometimes reported issues need to be verified to ascertain if they are actual issues or false alarms. Part of triaging is trying to simulate the bugs in their reported environments and other relevant environments.

We encourage you to help with this and comment on the issue to say whether you can or cannot reproduce it as described. This allows core devs to close the issue if it does not require fixing.
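
When you report whether an issue reproduces, it helps to state the environment you tried it in. The following is a minimal sketch using only the standard library; describe_environment is a hypothetical helper, and the package names passed to it are just examples.

```python
import importlib.metadata
import platform


def describe_environment(package_names):
    """Collect version information that is useful in an issue comment.

    Hypothetical helper; packages that are not installed are reported
    as "not installed" instead of raising an error.
    """
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for package_name in package_names:
        try:
            report[package_name] = importlib.metadata.version(package_name)
        except importlib.metadata.PackageNotFoundError:
            report[package_name] = "not installed"
    return report


for entry_name, entry_value in describe_environment(["geomstats", "numpy"]).items():
    print(f"{entry_name}: {entry_value}")
```

Pasting this output into your comment lets others compare their setup with yours.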

Commenting on alternative solutions#

If an issue has been verified as valid by the author and/or a triager, you can share any information that helps solve it before a fix is merged. This helps the issue author and potential contributors open pull requests, even if you do not have time to work on a fix yourself.

Answering questions#

Some issues are questions about how different aspects of the project work. Part of triaging is to provide answers to these questions, which others in the community may also be facing.

Labelling and assigning the issue#

Part of triaging also involves labeling issues by their types, modules they belong to or even their priority. See Create or choose an issue for new contributors on what labels can be applied to issues.