# Learning¶

## Estimating the mean¶

Frechet mean.

class geomstats.learning.frechet_mean.FrechetMean(metric, max_iter=32, epsilon=0.0001, point_type=None, method='default', lr=0.001, tau=0.005, verbose=False)[source]

Empirical Frechet mean.

Parameters
• metric (RiemannianMetric) – Riemannian metric.

• max_iter (int) – Maximum number of iterations for gradient descent. Optional, default: 32.

• point_type (str, {'vector', 'matrix'}) – Point type. Optional, default: None.

• method (str, {'default', 'adaptive', 'ball'}) – Gradient descent method. Optional, default: 'default'.

• verbose (bool) – Verbose option. Optional, default: False.

fit(X, y=None, weights=None)[source]

Compute the empirical Frechet mean.

Parameters
• X ({array-like, sparse matrix}, shape=[…, n_features]) – Training input samples.

• y (array-like, shape=[…,] or […, n_outputs]) – Target values (class labels in classification, real numbers in regression). Ignored.

• weights (array-like, shape=[…,]) – Weights associated to the points. Optional, default: None.

Returns

self (object) – Returns self.

geomstats.learning.frechet_mean.linear_mean(points, weights=None, point_type='vector')[source]

Compute the weighted linear mean.

The linear mean is the Frechet mean when points: - lie in a Euclidean space with Euclidean metric, - lie in a Minkowski space with Minkowski metric.

Parameters
• points (array-like, shape=[…, dim]) – Points to be averaged.

• weights (array-like, shape=[…,]) – Weights associated to the points. Optional, default: None.

Returns

mean (array-like, shape=[dim,]) – Weighted linear mean of the points.

geomstats.learning.frechet_mean.variance(points, base_point, metric, weights=None, point_type='vector')[source]

Variance of (weighted) points wrt a base point.

Parameters
• points (array-like, shape=[…, dim]) – Points.

• weights (array-like, shape=[…,]) – Weights associated to the points. Optional, default: None.

Returns

var (float) – Weighted variance of the points.

Exponential barycenter.

class geomstats.learning.exponential_barycenter.ExponentialBarycenter(group, max_iter=32, epsilon=1e-06, step=1.0, point_type=None, verbose=False)[source]

Empirical Exponential Barycenter for Matrix groups.

Parameters
• group (LieGroup) – Lie group instance on which the data lie.

• max_iter (int) – Maximum number of iterations to perform in the gradient descent. Optional, default: 32.

• epsilon (float) – Tolerance to reach convergence. The exstrinsic norm of the gradient is used as criterion. Optional, default: 1e-6.

• step (float) – Learning rate in the gradient descent. Optional, default: 1.

• verbose (bool) – Level of verbosity to inform about convergence. Optional, default: 1.

estimate_
Type

array-like, shape=[dim, dim]

fit(X, y=None, weights=None)[source]

Compute the empirical Exponential Barycenter mean.

Parameters
• X ({array-like, sparse matrix}, shape=[…, n_features]) – Training input samples.

• y (array-like, shape=[…,] or […, n_outputs]) – Target values (class labels in classification, real numbers in regression). Ignored.

• weights (array-like, shape=[…,]) – Weights associated to the points. Optional, default: None.

Returns

self (object) – Returns self.

## Clustering¶

K-means clustering.

class geomstats.learning.kmeans.RiemannianKMeans(metric, n_clusters=8, init='random', tol=0.01, mean_method='default', verbose=0, point_type='vector')[source]

Class for k-means clustering on manifolds.

K-means algorithm using Riemannian manifolds.

Parameters
• n_clusters (int) – Number of clusters (k value of the k-means). Optional, default: 8.

• metric (object of class RiemannianMetric) – The geomstats Riemmanian metric associate to the space used.

• init (str) – How to initialize centroids at the beginning of the algorithm. The choice ‘random’ will select training points as initial centroids uniformly at random. Optional, default: ‘random’.

• tol (float) – Convergence factor. Convergence is achieved when the difference of mean distance between two steps is lower than tol. Optional, default: 1e-2.

• verbose (int) – If verbose > 0, information will be printed during learning. Optional, default: 0.

Example

Available example on the Poincaré Ball and Hypersphere manifolds examples.plot_kmeans_manifolds

fit(X, max_iter=100)[source]

Provide clusters centroids and data labels.

Alternate between computing the mean of each cluster and labelling data according to the new positions of the centroids.

Parameters
• X (array-like, shape=[…, n_features]) – Training data, where n_samples is the number of samples and n_features is the number of features.

• max_iter (int) – Maximum number of iterations. Optional, default: 100.

Returns

self (array-like, shape=[n_clusters,]) – Centroids.

predict(X)[source]

Predict the labels for each data point.

Label each data point with the cluster having the nearest centroid using metric distance.

Parameters

X (array-like, shape=[…, n_features]) – Input data.

Returns

self (array-like, shape=[…,]) – Array of predicted cluster indices for each sample.

Online kmeans algorithm on Manifolds.

class geomstats.learning.online_kmeans.OnlineKMeans(metric, n_clusters, n_repetitions=20, tolerance=1e-05, max_iter=50000.0, point_type='vector')[source]

Online k-means clustering.

Online k-means clustering seeks to divide a set of data points into a specified number of classes, while minimizing intra-class variance. It is closely linked to discrete quantization, which computes the closest approximation of the empirical distribution of the dataset by a discrete distribution supported by a smaller number of points with respect to the Wasserstein distance. The algorithm used can either be seen as an online version of the k-means algorithm or as Competitive Learning Riemannian Quantization (see [LBP2019]).

Parameters
• metric (object) – Metric of the space in which the data lives. At each iteration, one of the cluster centers is moved in the direction of the new datum, according the exponential map of the underlying space, which is a method of metric.

• n_clusters (int) – Number of clusters of the k-means clustering, or number of desired atoms of the quantized distribution.

• n_repetitions (int, default=20) – The cluster centers are updated using decreasing step sizes, each of which stays constant for n_repetitions iterations to allow a better exploration of the data points.

• max_iter (int, default=5e4) – Maximum number of iterations. If it is reached, the quantization may be inacurate.

cluster_centers_

Coordinates of cluster centers.

Type

array, [n_clusters, n_features]

labels_

Labels of each point.

Example

>>> from geomstats.geometry.hypersphere import Hypersphere
>>> from geomstats.learning.onlinekmeans import OnlineKmeans
>>> sphere = Hypersphere(dim=2)
>>> metric = sphere.metric
>>> X = sphere.random_von_mises_fisher(kappa=10, n_samples=50)
>>> clustering = OnlineKmeans(metric=metric,n_clusters=4).fit(X)
>>> clustering.cluster_centers_
>>> clustering.labels_


References

LBP2019

A. Le Brigant and S. Puechmorel, Optimal Riemannian quantization with an application to air traffic analysis. J. Multivar. Anal. 173 (2019), 685 - 703.

fit(X)[source]

Perform clustering.

Parameters

X (array-like, shape=[n_features, n_samples]) – Samples to cluster.

predict(point)[source]

Predict the closest cluster each sample in X belongs to.

Parameters

X (array-like, shape=[n_features]) – New data to predict.

Returns

labels (int) – Index of the cluster each sample belongs to.

geomstats.learning.online_kmeans.online_kmeans(X, metric, n_clusters, n_repetitions=20, tolerance=1e-05, max_iter=50000.0)[source]

Perform online K-means clustering.

Perform online version of k-means algorithm on data contained in X. The data points are treated sequentially and the cluster centers are updated one at a time. This version of k-means avoids computing the mean of each cluster at each iteration and is therefore less computationally intensive than the offline version.

In the setting of quantization of probability distributions, this algorithm is also known as Competitive Learning Riemannian Quantization. It computes the closest approximation of the empirical distribution of data by a discrete distribution supported by a smaller number of points with respect to the Wasserstein distance. This smaller number of points is n_clusters.

Parameters
• X (array-like, shape=[…, n_features]) – Input data. It is treated sequentially by the algorithm, i.e. one datum is chosen randomly at each iteration.

• metric (object) – Metric of the space in which the data lives. At each iteration, one of the cluster centers is moved in the direction of the new datum, according the exponential map of the underlying space, which is a method of metric.

• n_clusters (int) – Number of clusters of the k-means clustering, or number of desired atoms of the quantized distribution.

• n_repetitions (int, default=20) – The cluster centers are updated using decreasing step sizes, each of which stays constant for n_repetitions iterations to allow a better exploration of the data points.

• max_iter (int, default=5e4) – Maximum number of iterations. If it is reached, the quantization may be inacurate.

Returns

• cluster_centers (array, shape=[n_clusters, n_features]) – Coordinates of cluster centers.

• labels (array, shape=[n_samples]) – Cluster labels for each point.

The Agglomerative Hierarchical Clustering (AHC) on manifolds.

class geomstats.learning.agglomerative_hierarchical_clustering.AgglomerativeHierarchicalClustering(n_clusters=2, distance='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='average', distance_threshold=None)[source]

The Agglomerative Hierarchical Clustering on manifolds.

Recursively merges the pair of clusters that minimally increases a given linkage distance.

Parameters
• n_clusters (int or None, default=2) – The number of clusters to find. It must be None if distance_threshold is not None.

• distance (str or callable, default=’euclidean’) – Distance used to compute the linkage. Can be ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘cosine’, or ‘precomputed’. If linkage is ‘ward’, only ‘euclidean’ is accepted. If ‘precomputed’, a distance matrix (instead of a similarity matrix) is needed as input for the fit method.

• memory (str or object, default=None) – Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.

• connectivity (array-like or callable, default=None) – Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix. Default is None, i.e, the hierarchical clustering algorithm is unstructured.

• compute_full_tree (‘auto’ or bool, default=’auto’) – Stop early the construction of the tree at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. By default compute_full_tree is ‘auto’, which is equivalent to True when distance_threshold is not None or that n_clusters is inferior to the maximum between 100 or 0.02 * n_samples. Otherwise, ‘auto’ is equivalent to False.

• linkage ({‘ward’, ‘complete’, ‘average’, ‘single’}, default=’average’) – Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion. - average uses the average of the distances of each observation of

the two sets.

• complete or maximum linkage uses the maximum distances between all observations of the two sets.

• single uses the minimum of the distances between all observations of the two sets.

• ward minimizes the variance of the clusters being merged. It works for the ‘euclidean’ distance only.

• distance_threshold (float, default=None) – The linkage distance threshold above which, clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.

n_clusters_

The number of clusters found by the algorithm. If distance_threshold=None, it will be equal to the given n_clusters.

Type

int

labels_

Cluster labels for each point.

Type

ndarray, shape=[..,]

n_leaves_

Number of leaves in the hierarchical tree.

Type

int

n_connected_components_

The estimated number of connected components in the graph.

Type

int

children_

The children of each non-leaf node. Values less than n_samples correspond to leaves of the tree which are the original samples. A node i greater than or equal to n_samples is a non-leaf node and has children children_[i - n_samples]. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form node n_samples + i.

Type

array-like, shape=[n_samples-1, 2]

References

This algorithm uses the scikit-learn library: https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/ cluster/_agglomerative.py#L656

The KNN classifier on manifolds.

class geomstats.learning.knn.KNearestNeighborsClassifier(n_neighbors=5, weights='uniform', p=2, distance='minkowski', distance_params=None, n_jobs=None, **kwargs)[source]

Classifier implementing the k-nearest neighbors vote on manifolds.

Parameters
• n_neighbors (int, optional (default = 5)) – Number of neighbors to use by default.

• weights (string or callable, optional (default = ‘uniform’)) – Weight function used in prediction. Possible values: - ‘uniform’ : uniform weights. All points in each neighborhood

are weighted equally.

• ‘distance’ : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.

• [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.

• p (integer, optional (default = 2)) – Power parameter for the ‘minkowski’ string distance. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

• distance (string or callable, optional (default = ‘minkowski’)) – The distance metric to use. The default distance is minkowski, and with p=2 is equivalent to the standard Euclidean distance. See the documentation of the DistanceMetric class in the scikit-learn library for a list of available distances. If distance is “precomputed”, X is assumed to be a distance matrix and must be square during fit.

• distance_params (dict, optional (default = None)) – Additional keyword arguments for the distance function.

• n_jobs (int or None, optional (default = None)) – The number of parallel jobs to run for neighbors search. None means 1; -1 means using all processors.

classes_

Class labels known to the classifier

Type

array, shape=[n_classes,]

effective_metric_

The distance metric used. It will be same as the distance parameter or a synonym of it, e.g. ‘euclidean’ if the distance parameter set to ‘minkowski’ and p parameter set to 2.

Type

string or callable

effective_metric_params_

Additional keyword arguments for the distance function. For most distances will be same with distance_params parameter, but may also contain the p parameter value if the effective_metric_ attribute is set to ‘minkowski’.

Type

dict

outputs_2d_

False when y’s shape is (n_samples, ) or (n_samples, 1) during fit otherwise True.

Type

bool

References

This algorithm uses the scikit-learn library: https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/ neighbors/_classification.py#L25

## Dimension reduction¶

Principal Component Analysis on Manifolds.

class geomstats.learning.pca.TangentPCA(metric, n_components=None, copy=True, whiten=False, tol=0.0, iterated_power='auto', random_state=None)[source]

Tangent Principal component analysis (tPCA).

Linear dimensionality reduction using Singular Value Decomposition of the Riemannian Log of the data at the tangent space of the Frechet mean.

Parameters
• metric (RiemannianMetric) – Riemannian metric.

• n_components (int) – Number of principal components. Optional, default: None.

fit(X, y=None, base_point=None)[source]

Fit the model with X.

Parameters
• X (array-like, shape=[…, n_features]) – Training data, where n_samples is the number of samples and n_features is the number of features.

• y (Ignored (Compliance with scikit-learn interface))

• base_point (array-like, shape=[…, n_features], optional) – Point at which to perform the tangent PCA Optional, default to Frechet mean if None.

Returns

self (object) – Returns the instance itself.

fit_transform(X, y=None, base_point=None)[source]

Fit the model with X and apply the dimensionality reduction on X.

Parameters
• X (array-like, shape=[…, n_features]) – Training data, where n_samples is the number of samples and n_features is the number of features.

• y (Ignored (Compliance with scikit-learn interface))

• base_point (array-like, shape=[…, n_features]) – Point at which to perform the tangent PCA Optional, default to Frechet mean if None.

Returns

X_new (array-like, shape=[…, n_components]) – Projected data.

inverse_transform(X)[source]

Low-dimensional reconstruction of X.

The reconstruction will match X_original whose transform would be X if n_components=min(n_samples, n_features).

Parameters

X (array-like, shape=[…, n_components]) – New data, where n_samples is the number of samples and n_components is the number of components.

Returns

X_original (array-like, shape=[…, n_features]) – Original data.

transform(X, y=None)[source]

Project X on the principal components.

Parameters
• X (array-like, shape=[…, n_features]) – Data, where n_samples is the number of samples and n_features is the number of features.

• y (Ignored (Compliance with scikit-learn interface))

Returns

X_new (array-like, shape=[…, n_components]) – Projected data.