Datasets

Prepare EMG data

Pre-process time series into batched covariance matrices.

The user defines the number of time steps in each batch. The preprocessing first removes the transient signal by discarding a margin on each side of every sign change. It then splits the remaining data into batches, from which the covariance matrices are built. In practice, the batch size must be large enough to capture sufficient information, yet small enough for the online classifier to stay reactive.
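The batching idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the geomstats implementation: the helpers `batch_starts` and `covariance` are hypothetical, the signal is assumed to be a list of per-time-step samples with one value per electrode, and margin handling is reduced to trimming each constant-label segment on both sides.

```python
def batch_starts(labels, n_steps, margin=0):
    """Return start indices of batches that stay inside one label segment.

    Hypothetical helper: `margin` samples are dropped on each side of a
    label change before the segment is cut into batches of `n_steps`.
    """
    starts = []
    seg_start = 0
    for i in range(1, len(labels) + 1):
        # A segment ends at the end of the data or when the label changes.
        if i == len(labels) or labels[i] != labels[seg_start]:
            first, last = seg_start + margin, i - margin
            starts.extend(range(first, last - n_steps + 1, n_steps))
            seg_start = i
    return starts


def covariance(batch):
    """Sample covariance matrix of a batch shaped [n_steps][n_timeseries]."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]

    def cov_entry(a, b):
        total = sum((row[a] - means[a]) * (row[b] - means[b]) for row in batch)
        return total / (n - 1)

    return [[cov_entry(a, b) for b in range(dim)] for a in range(dim)]


# Toy recording (made-up values): 2 electrodes, 12 time steps, two hand signs.
signal = [[float(t), float(t % 3)] for t in range(12)]
labels = ["rock"] * 6 + ["paper"] * 6
starts = batch_starts(labels, n_steps=4, margin=1)
covs = [covariance(signal[s:s + 4]) for s in starts]
```

A larger `n_steps` gives better-conditioned covariance estimates but fewer, slower updates, which is the trade-off described above.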

class geomstats.datasets.prepare_emg_data.TimeSeriesCovariance(data, n_steps, n_timeseries, label_map, margin=0)[source]

Class for generating a list of covariance matrices from time series.

Prepare a TimeSeriesCovariance Object from time series in dictionary.

Parameters
  • data (dict) – Dictionary with ‘time’, ‘raw_data’, ‘label’ as keys and the corresponding arrays as values.

  • n_steps (int) – Size of the batches.

  • n_timeseries (int) – The number of electrodes used for the recording.

  • label_map (dictionary) – Encode the label into digits.

  • margin (int) – Number of indices to remove before and after a sign change (can help obtain a stationary signal).

label_map

Encode the label into digits.

Type

dictionary

data_dict

Dictionary with ‘time’, ‘raw_data’, ‘label’ as keys and the corresponding arrays as values.

Type

dict

n_steps

Size of the batches.

Type

int

n_timeseries

The number of electrodes used for the recording.

Type

int

batches

The start indexes of the batches to use to compute covariance matrices.

Type

array

margin

Number of indices to remove before and after a sign change (can help obtain a stationary signal).

Type

int

covs

The covariance matrices.

Type

array

labels

The digit labels corresponding to each batch.

Type

array

covec

The vectorized version of the covariance matrices.

Type

array

diags

The covariance matrices diagonals.

Type

array

transform()[source]

Transform the time series into batched covariance matrices.

We also compute the corresponding vectors, variance vector, labels, and experiments.

Prepare Graph Data

Prepare and process graph-structured data.

class geomstats.datasets.prepare_graph_data.Graph(graph_matrix_path, labels_path)[source]

Class for generating a graph object from a dataset.

Prepare Graph object from a dataset file.

Parameters
  • graph_matrix_path (string) – Path to graph adjacency matrix.

  • labels_path (string) – Path to labels of the nodes of the graph.

edges

Dictionary with node number as key and edge connected node numbers as values.

Type

dict

n_nodes

Number of nodes in the graph.

Type

int

labels

Dictionary with node number as key and the true label number as values.

Type

dict

random_walk(walk_length=5, n_walks_per_node=1)[source]

Compute a set of random walks on a graph.

For each node of the graph, generate a number of random walks of a specified length. Two consecutive nodes in a random walk are necessarily connected by an edge. The walks capture the structure of the graph.

Parameters
  • walk_length (int) – Length of a random walk in terms of number of edges.

  • n_walks_per_node (int) – Number of generated walks starting from each node of the graph.

Returns

self (array-like, shape=[n_walks_per_node*self.n_edges, walk_length]) – Array containing random walks.
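The walk generation can be illustrated with a short standalone sketch. It assumes the graph is given as a dictionary mapping each node to the list of its neighbors, as the edges attribute is described above. The function `random_walks` and its `seed` argument are hypothetical and not part of geomstats. In this sketch a walk of `walk_length` edges lists `walk_length + 1` nodes, which may differ from how the documented shape counts.

```python
import random


def random_walks(edges, walk_length=5, n_walks_per_node=1, seed=None):
    """Generate random walks on a graph given as {node: [neighbors]}.

    Hypothetical stand-in for Graph.random_walk: each walk starts at a
    node and follows `walk_length` edges.
    """
    rng = random.Random(seed)
    walks = []
    for start in edges:
        for _ in range(n_walks_per_node):
            walk = [start]
            for _ in range(walk_length):
                # The next node is always a neighbor of the current one.
                walk.append(rng.choice(edges[walk[-1]]))
            walks.append(walk)
    return walks


# Small made-up graph: node 3 hangs off node 2.
edges = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = random_walks(edges, walk_length=5, n_walks_per_node=2, seed=0)
```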

class geomstats.datasets.prepare_graph_data.HyperbolicEmbedding(dim=2, max_epochs=100, lr=0.05, n_context=1, n_negative=2)[source]

Class for learning embeddings of graphs on hyperbolic space.

Parameters
  • dim (int) – Dimension of the hyperbolic space used.

  • max_epochs (int) – Maximum number of iterations for embedding.

  • lr (float) – Learning rate for embedding.

  • n_context (int) – Number of nodes to consider from a neighborhood of nodes around a particular node.

  • n_negative (int) – Number of nodes to consider when searching for a set of nodes that are far from a particular node.

embed(graph)[source]

Compute embedding.

Optimize a loss function to obtain a representable embedding.

Parameters

graph (object) – An instance of the Graph class.

Returns

embeddings (array-like, shape=[n_samples, dim]) – Return the embedding of the data. Each data sample is represented as a point belonging to the manifold.

static grad_log_sigmoid(vector)[source]

Gradient of log sigmoid function.

Parameters

vector (array-like, shape=[n_samples, dim])

Returns

gradient (array-like, shape=[n_samples, dim])

grad_squared_distance(point_a, point_b)[source]

Gradient of squared hyperbolic distance.

Gradient of the squared distance in the Ball representation, taken with respect to point_a.

Parameters
  • point_a (array-like, shape=[n_samples, dim]) – First point in hyperbolic space.

  • point_b (array-like, shape=[n_samples, dim]) – Second point in hyperbolic space.

Returns

dist (array-like, shape=[n_samples, 1]) – Geodesic squared distance between the two points.
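For orientation, the quantity being differentiated can be sketched as follows. This assumes the standard Poincaré ball distance formula. The function names are hypothetical, and the gradient is approximated by central finite differences rather than by the closed form a library would use.

```python
import math


def poincare_sq_dist(point_a, point_b):
    """Squared geodesic distance in the Poincaré ball (norms < 1)."""
    sq_norm_a = sum(x * x for x in point_a)
    sq_norm_b = sum(x * x for x in point_b)
    sq_diff = sum((x - y) ** 2 for x, y in zip(point_a, point_b))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_a) * (1.0 - sq_norm_b))
    return math.acosh(arg) ** 2


def numerical_grad_sq_dist(point_a, point_b, eps=1e-6):
    """Finite-difference Euclidean gradient with respect to point_a."""
    grad = []
    for i in range(len(point_a)):
        plus = list(point_a)
        minus = list(point_a)
        plus[i] += eps
        minus[i] -= eps
        diff = poincare_sq_dist(plus, point_b) - poincare_sq_dist(minus, point_b)
        grad.append(diff / (2.0 * eps))
    return grad


# Illustrative points inside the unit disk.
point_a = [0.1, 0.2]
point_b = [-0.3, 0.4]
grad = numerical_grad_sq_dist(point_a, point_b)
```

A numerical gradient like this is mainly useful as a check against an analytic implementation.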

static log_sigmoid(vector)[source]

Logsigmoid function.

Apply log sigmoid function.

Parameters

vector (array-like, shape=[n_samples, dim])

Returns

result (array-like, shape=[n_samples, dim])
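The log sigmoid function and its gradient can be written out for a single scalar as below. This is a sketch only: the library versions operate elementwise on arrays, and the numerically stable formulation used here is an assumption about good practice, not a description of the geomstats code.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x))) for a scalar x."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def grad_log_sigmoid(x):
    """Derivative of log_sigmoid: 1 / (1 + exp(x)), i.e. sigmoid(-x)."""
    if x >= 0:
        z = math.exp(-x)
        return z / (1.0 + z)
    return 1.0 / (1.0 + math.exp(x))


values = [-2.0, 0.0, 3.0]
logs = [log_sigmoid(v) for v in values]
grads = [grad_log_sigmoid(v) for v in values]
```

Splitting on the sign of `x` keeps `math.exp` from overflowing for inputs of large magnitude.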

loss(example_embedding, context_embedding, negative_embedding)[source]

Compute loss and grad.

Compute loss and grad given embedding of the current example, embedding of the context and negative sampling embedding.

Parameters
  • example_embedding (array-like, shape=[dim]) – Current data sample embedding.

  • context_embedding (array-like, shape=[dim]) – Current context embedding.

  • negative_embedding (array-like, shape=[dim]) – Current negative sample embedding.

Returns

  • total_loss (float) – The current value of the loss function.

  • example_grad (array-like, shape=[dim]) – The gradient of the loss function at the embedding of the current data sample.
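One common way to combine these pieces is a negative-sampling objective that rewards a small distance to the context and a large distance to the negative samples. The sketch below assumes that form and the standard Poincaré ball distance. The names `sq_dist` and `sketch_loss` are hypothetical, and the exact loss used by the class may differ.

```python
import math


def sq_dist(a, b):
    """Squared Poincaré ball distance (standard formula, norms < 1)."""
    na = sum(x * x for x in a)
    nb = sum(x * x for x in b)
    diff = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - na) * (1.0 - nb))) ** 2


def log_sigmoid(x):
    """Numerically stable log sigmoid for a scalar."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def sketch_loss(example, context, negatives):
    """Assumed objective: pull the context close, push negatives away."""
    positive = log_sigmoid(-sq_dist(example, context))
    negative = sum(log_sigmoid(sq_dist(example, neg)) for neg in negatives)
    return -(positive + negative)


# Illustrative embeddings inside the unit disk.
example = [0.1, 0.0]
near_loss = sketch_loss(example, [0.12, 0.0], [[-0.8, 0.0]])
far_loss = sketch_loss(example, [-0.8, 0.0], [[0.12, 0.0]])
```

With this objective, `near_loss` (context close, negative far) is smaller than `far_loss` (roles swapped), which is the behavior an embedding optimizer would exploit.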

Utils

Loading toy datasets.

Refer to notebook: geomstats/notebooks/01_data_on_manifolds.ipynb to visualize these datasets.

geomstats.datasets.utils.load_cities()[source]

Load data from data/cities/cities.json.

Returns

  • data (array-like, shape=[50, 2]) – Array with each row representing one sample, i.e. latitude and longitude of a city. Angles are in radians.

  • name (list) – List of city names.

geomstats.datasets.utils.load_connectomes(as_vectors=False)[source]

Load data from brain connectomes.

Load the correlation data from the Kaggle MLSP 2014 Schizophrenia Challenge. The original data came as flattened vectors; unless as_vectors=True is passed, the correlation values are reshaped as symmetric matrices with ones on the diagonal.

Parameters

as_vectors (bool) – Whether to return raw data as vectors or as symmetric matrices. Optional, default: False

Returns

  • mat (array-like, shape=[86, {[28, 28], 378}]) – Connectomes.

  • patient_id (array-like, shape=[86,]) – Patient unique identifiers.

  • target (array-like, shape=[86,]) – Labels, whether patients belong to the diseased class (1) or control (0).
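The two shapes are related by 378 = 28 * 27 / 2, the number of off-diagonal entries in one triangle of a 28 x 28 symmetric matrix. A minimal sketch of the reshaping follows. The helper `vector_to_symmetric` is hypothetical, and the row-by-row ordering of the upper triangle is an assumption that may not match the dataset.

```python
def vector_to_symmetric(vec, dim):
    """Fill a dim x dim symmetric matrix with ones on the diagonal.

    Assumption: `vec` lists the strict upper triangle row by row; the
    ordering used by the actual dataset may differ.
    """
    if len(vec) != dim * (dim - 1) // 2:
        raise ValueError("vec must have dim * (dim - 1) / 2 entries")
    mat = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    k = 0
    for i in range(dim):
        for j in range(i + 1, dim):
            mat[i][j] = mat[j][i] = vec[k]
            k += 1
    return mat


# 28 regions give 28 * 27 / 2 = 378 off-diagonal correlations.
n_entries = 28 * 27 // 2
toy = vector_to_symmetric([0.5] * n_entries, 28)
small = vector_to_symmetric([0.1, 0.2, 0.3], 3)
```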

geomstats.datasets.utils.load_emg()[source]

Load data from data/emg/emg.csv.

Returns

data_emg (pandas.DataFrame, shape=[731682, 10]) – EMG time series for each of the 8 electrodes, with the time stamps and the label of the hand sign.

geomstats.datasets.utils.load_hands()[source]

Load data from data/hands/hands.txt and labels.txt.

Load the dataset of hand poses, where a hand is represented as a set of 22 landmarks (the hand joints) in 3D.

The dataset contains two different hand poses:

  • Label 0: hand is in the position “Grab”,

  • Label 1: hand is in the position “Expand”.

This is a subset of the SHREC 2017 dataset [SWVGLF2017].

References

SWVGLF2017

Q. De Smedt, H. Wannous, J.P. Vandeborre, J. Guerry, B. Le Saux, D. Filliat, SHREC’17 Track: 3D Hand Gesture Recognition Using a Depth and Skeletal Dataset, 10th Eurographics Workshop on 3D Object Retrieval, 2017. https://doi.org/10.2312/3dor.20171049

Returns

  • data (array-like, shape=[52, 22, 3]) – Hand data, represented as a list of 22 joints, specifically as the 3D coordinates of these joints.

  • labels (array-like, shape=[52,]) – Label representing hands poses. Label 0: “Grab”, Label 1: “Expand”

  • bone_list (array-like) – List of bones, as a list of connections between joints.

geomstats.datasets.utils.load_karate_graph()[source]

Load data from data/graph_karate.

Returns

graph (prepare_graph_data.Graph) – Graph containing nodes, edges, and labels from the karate dataset.

geomstats.datasets.utils.load_leaves()[source]

Load data from data/leaves/leaves.xlsx.

Returns

  • beta_param (array-like, shape=[172, 2]) – Beta parameters of the beta distributions fitted to each leaf orientation angle sample of 172 species of plants.

  • distrib_type (array-like, shape=[172, ]) – Leaf orientation angle distribution type for each of the 172 species.

geomstats.datasets.utils.load_optical_nerves()[source]

Load data from data/optical_nerves/optical_nerves.txt.

Load the dataset of sets of 5 landmarks, labelled S, T, I, N, V, in 3D on monkeys’ optical nerve heads:

  • 1st landmark (S): superior aspect of the retina,

  • 2nd landmark (T): side of the retina closest to the temporal bone of the skull,

  • 3rd landmark (N): nose side of the retina,

  • 4th landmark (I): inferior point,

  • 5th landmark (V): optical nerve head deepest point.

For each monkey, an experimental glaucoma was introduced in one eye, while the second eye was kept as control. This dataset can be used to investigate a significant difference between the glaucoma and the control eyes.

Label 0 refers to a normal eye, and Label 1 to an eye with glaucoma.

References

PE2015

V. Patrangenaru and L. Ellingson. Nonparametric Statistics on Manifolds and Their Applications to Object Data, 2015. https://doi.org/10.1201/b18969

Returns

  • data (array-like, shape=[22, 5, 3]) – Data representing the 5 landmarks, in 3D, for 11 different monkeys.

  • labels (array-like, shape=[22,]) – Labels in {0, 1} classifying the corresponding optical nerve as normal (label = 0) or glaucoma (label = 1).

  • monkeys (array-like, shape=[22,]) – Indices in 0…10 referencing the index of the monkey to which a given optical nerve belongs.

geomstats.datasets.utils.load_poses(only_rotations=True)[source]

Load data from data/poses/poses.csv.

Returns

  • data (array-like, shape=[5, 3] or shape=[5, 6]) – Array with each row representing one sample, i.e. one 3D rotation or one 3D rotation + 3D translation.

  • img_paths (list) – List of image paths.

geomstats.datasets.utils.load_random_graph()[source]

Load data from data/graph_random.

Returns

graph (prepare_graph_data.Graph) – Graph containing nodes, edges, and labels from the random dataset.