geomstats.datasets package#

Submodules#

geomstats.datasets.prepare_emg_data module#

Pre-process time series into batched covariance matrices.

The user sets the number of time steps per batch. The preprocessing first removes the transient signal by discarding a margin on each side of every sign change. It then creates the batches of data used to build the covariance matrices. In practice, the batch size must be large enough to carry sufficient information, yet small enough for the online classifier to remain reactive.
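The batching step can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper name batch_covariances, the non-overlapping batches, and the unbiased (n_steps - 1) normalisation are all assumptions made for illustration, and the margin removal is left out.

```python
def batch_covariances(signal, n_steps):
    """Split a multichannel signal into batches and return one covariance per batch.

    signal: list of rows, each row holding one sample per channel.
    n_steps: number of consecutive rows per batch.
    """
    n_channels = len(signal[0])
    covs = []
    # Only full batches are kept; a trailing partial batch is dropped.
    for start in range(0, len(signal) - n_steps + 1, n_steps):
        batch = signal[start:start + n_steps]
        means = [sum(row[c] for row in batch) / n_steps for c in range(n_channels)]
        cov = [
            [
                sum((row[i] - means[i]) * (row[j] - means[j]) for row in batch)
                / (n_steps - 1)
                for j in range(n_channels)
            ]
            for i in range(n_channels)
        ]
        covs.append(cov)
    return covs


# Toy signal: 6 time steps, 2 channels (hypothetical values).
toy_signal = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [0.0, 0.0], [2.0, -2.0], [4.0, -4.0]]
toy_covs = batch_covariances(toy_signal, n_steps=3)
```

With n_steps=3 the six rows yield two 2 x 2 covariance matrices. A larger n_steps gives fewer but better-estimated matrices, which is the trade-off described above.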

Lead author: Marius Guerard.

class geomstats.datasets.prepare_emg_data.TimeSeriesCovariance(data, n_steps, n_timeseries, label_map, margin=0)[source]#

Bases: object

Class for generating a list of covariance matrices from time series.

Prepare a TimeSeriesCovariance Object from time series in dictionary.

Parameters:
  • data (dict) – Dictionary with ‘time’, ‘raw_data’, ‘label’ as keys and the corresponding arrays as values.

  • n_steps (int) – Size of the batches.

  • n_timeseries (int) – The number of electrodes used for the recording.

  • label_map (dictionary) – Encode the label into digits.

  • margin (int) – Number of indices to remove before and after a sign change (can help obtain a stationary signal).

label_map#

Encode the label into digits.

Type:

dictionary

data_dict#

Dictionary with ‘time’, ‘raw_data’, ‘label’ as keys and the corresponding arrays as values.

Type:

dict

n_steps#

Size of the batches.

Type:

int

n_timeseries#

The number of electrodes used for the recording.

Type:

int

batches#

The start indexes of the batches to use to compute covariance matrices.

Type:

array

margin#

Number of indices to remove before and after a sign change (can help obtain a stationary signal).

Type:

int

covs#

The covariance matrices.

Type:

array

labels#

The digit labels corresponding to each batch.

Type:

array

covec#

The vectorized version of the covariance matrices.

Type:

array

diags#

The covariance matrices diagonals.

Type:

array

transform()[source]#

Transform the time series into batched covariance matrices.

This also computes the corresponding vectorized matrices, variance vectors (the diagonals), labels, and experiments.

geomstats.datasets.prepare_graph_data module#

Prepare and process graph-structured data.

Lead author: Hadi Zaatiti.

class geomstats.datasets.prepare_graph_data.Graph(graph_matrix_path, labels_path)[source]#

Bases: object

Class for generating a graph object from a dataset.

Prepare Graph object from a dataset file.

Parameters:
  • graph_matrix_path (string) – Path to graph adjacency matrix.

  • labels_path (string) – Path to labels of the nodes of the graph.

edges#

Dictionary with node number as key and edge connected node numbers as values.

Type:

dict

n_nodes#

Number of nodes in the graph.

Type:

int

labels#

Dictionary with node number as key and the true label number as values.

Type:

dict
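As a rough illustration of the edges attribute, the sketch below turns an adjacency matrix into a dictionary mapping each node number to its neighbours. The helper name adjacency_to_edges and the toy matrix are hypothetical; how the Graph class parses its input files is not shown here.

```python
def adjacency_to_edges(adjacency):
    """Convert a square 0/1 adjacency matrix into a neighbour dictionary."""
    return {
        node: [other for other, linked in enumerate(row) if linked]
        for node, row in enumerate(adjacency)
    }


# Hypothetical 3-node star graph centred on node 0.
toy_adjacency = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
toy_graph_edges = adjacency_to_edges(toy_adjacency)
toy_n_nodes = len(toy_graph_edges)
```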

random_walk(walk_length=5, n_walks_per_node=1)[source]#

Compute a set of random walks on a graph.

For each node of the graph, generate a number of random walks of a specified length. Two consecutive nodes in a random walk are necessarily connected by an edge. The walks capture the structure of the graph.

Parameters:
  • walk_length (int) – Length of a random walk in terms of number of edges.

  • n_walks_per_node (int) – Number of generated walks starting from each node of the graph.

Returns:

self (array-like, shape=[n_walks_per_node*self.n_nodes, walk_length]) – Array containing random walks.
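The idea of the random walks can be sketched in plain Python under a few assumptions: the graph is given as a neighbour dictionary, every node has at least one neighbour, and each step picks a neighbour uniformly at random. The function name random_walks and the seed argument are illustrative, and in this sketch a walk of walk_length edges stores walk_length + 1 nodes, which may differ from the array layout returned by the library.

```python
import random


def random_walks(edges, walk_length=5, n_walks_per_node=1, seed=0):
    """Generate random walks over a neighbour dictionary.

    edges maps each node to the list of its neighbours.
    Each walk holds walk_length + 1 nodes, i.e. walk_length edges.
    """
    rng = random.Random(seed)
    walks = []
    for start in edges:
        for _ in range(n_walks_per_node):
            walk = [start]
            for _ in range(walk_length):
                # Step to a uniformly chosen neighbour of the current node.
                walk.append(rng.choice(edges[walk[-1]]))
            walks.append(walk)
    return walks


# Hypothetical 4-node cycle graph: 0 - 1 - 2 - 3 - 0.
toy_cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
toy_walks = random_walks(toy_cycle, walk_length=5, n_walks_per_node=2)
```

Each consecutive pair of nodes in a walk is joined by an edge, so the set of walks samples the local neighbourhood structure around every node.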

class geomstats.datasets.prepare_graph_data.HyperbolicEmbedding(dim=2, max_epochs=100, lr=0.05, n_context=1, n_negative=2)[source]#

Bases: object

Class for learning embeddings of graphs on hyperbolic space.

Parameters:
  • dim (int) – Dimension of the hyperbolic space used.

  • max_epochs (int) – Maximum number of iterations for embedding.

  • lr (float) – Learning rate for embedding.

  • n_context (int) – Number of nodes to consider from a neighborhood of nodes around a particular node.

  • n_negative (int) – Number of nodes to consider when searching for a set of nodes that are far from a particular node.

embed(graph)[source]#

Compute embedding.

Optimize a loss function to obtain a representative embedding.

Parameters:

graph (object) – An instance of the Graph class.

Returns:

embeddings (array-like, shape=[n_samples, dim]) – Return the embedding of the data. Each data sample is represented as a point belonging to the manifold.

static grad_log_sigmoid(vector)[source]#

Gradient of log sigmoid function.

Parameters:

vector (array-like, shape=[n_samples, dim])

Returns:

gradient (array-like, shape=[n_samples, dim])

grad_squared_distance(point_a, point_b)[source]#

Gradient of squared hyperbolic distance.

Gradient of the squared distance, based on the ball representation, with respect to point_a.

Parameters:
  • point_a (array-like, shape=[n_samples, dim]) – First point in hyperbolic space.

  • point_b (array-like, shape=[n_samples, dim]) – Second point in hyperbolic space.

Returns:

dist (array-like, shape=[n_samples, 1]) – Geodesic squared distance between the two points.
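For intuition, the sketch below evaluates the squared geodesic distance in the Poincaré ball, using the standard formula d(a, b) = arccosh(1 + 2|a - b|² / ((1 - |a|²)(1 - |b|²))), and approximates its gradient with respect to the first point by central finite differences. This is only a numerical check, not the library's analytic gradient: the function names are hypothetical, points are plain lists, and the result is the Euclidean gradient in ball coordinates.

```python
import math


def squared_ball_distance(a, b):
    """Squared geodesic distance between two points of the open unit ball."""
    diff2 = sum((x - y) ** 2 for x, y in zip(a, b))
    norm_a2 = sum(x * x for x in a)
    norm_b2 = sum(y * y for y in b)
    arg = 1.0 + 2.0 * diff2 / ((1.0 - norm_a2) * (1.0 - norm_b2))
    return math.acosh(arg) ** 2


def numeric_grad_squared_distance(a, b, step=1e-6):
    """Central finite-difference gradient with respect to the first point."""
    grad = []
    for i in range(len(a)):
        plus = list(a)
        minus = list(a)
        plus[i] += step
        minus[i] -= step
        grad.append(
            (squared_ball_distance(plus, b) - squared_ball_distance(minus, b))
            / (2.0 * step)
        )
    return grad


# Two hypothetical points inside the unit disk.
point_a = [0.1, 0.2]
point_b = [-0.3, 0.4]
toy_grad = numeric_grad_squared_distance(point_a, point_b)
```

Moving point_a a small step against toy_grad brings it closer to point_b, which is the behaviour an embedding optimiser relies on.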

static log_sigmoid(vector)[source]#

Log-sigmoid function.

Apply log sigmoid function.

Parameters:

vector (array-like, shape=[n_samples, dim])

Returns:

result (array-like, shape=[n_samples, dim])
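A scalar sketch of the log-sigmoid and its derivative is given here for reference. The formulas log σ(x) = -log(1 + exp(-x)) and d/dx log σ(x) = 1 - σ(x) are standard; the numerically stable rearrangement is a choice made for this sketch, and whether the library uses the same form or handles arrays differently is not stated in this page.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    # min(x, 0) - log(1 + exp(-|x|)) avoids overflow for large |x|.
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def grad_log_sigmoid(x):
    """Derivative of log_sigmoid: 1 - sigmoid(x) = 1 / (1 + exp(x))."""
    if x >= 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))
```

The gradient stays between 0 and 1 and shrinks as x grows, so samples that are already well separated contribute little to an update.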

loss(example_embedding, context_embedding, negative_embedding)[source]#

Compute loss and grad.

Compute loss and grad given embedding of the current example, embedding of the context and negative sampling embedding.

Parameters:
  • example_embedding (array-like, shape=[dim]) – Current data sample embedding.

  • context_embedding (array-like, shape=[dim]) – Current context embedding.

  • negative_embedding (array-like, shape=[dim]) – Current negative sample embedding.

Returns:

  • total_loss (float) – The current value of the loss function.

  • example_grad (array-like, shape=[dim]) – The gradient of the loss function at the embedding of the current data sample.

geomstats.datasets.utils module#

Loading toy datasets.

Refer to notebook: geomstats/notebooks/01_data_on_manifolds.ipynb to visualize these datasets.

Lead author: Nina Miolane.

geomstats.datasets.utils.load_cells()[source]#

Load cell data.

This cell dataset contains cell boundaries of mouse osteosarcoma (bone cancer) cells. The dlm8 cell line is derived from dunn and is a more aggressive cancer. The cells have been treated with one of three treatments: control (no treatment), jasp (jasplakinolide), and cytd (cytochalasin D). These drugs perturb the cytoskeleton of the cells.

Returns:

  • cells (list of 650 planar discrete curves) – Each curve represents the boundary of a cell in counterclockwise order. Their lengths are not necessarily equal.

  • cell_lines (list of 650 strings) – List of the cell lines of each cell (dlm8 or dunn).

  • treatments (list of 650 strings) – List of the treatments given to each cell (control, cytd or jasp).

geomstats.datasets.utils.load_cities()[source]#

Load data from data/cities/cities.json.

Returns:

  • data (array-like, shape=[50, 2]) – Array with each row representing one sample, i.e. latitude and longitude of a city. Angles are in radians.

  • name (list) – List of city names.
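Since the angles are in radians, a row can be mapped to a point on the unit sphere with the usual spherical-coordinate formulas. The sketch below assumes the column order [latitude, longitude] described above; the helper name lat_lon_to_sphere and the sample rows are hypothetical and are not taken from cities.json.

```python
import math


def lat_lon_to_sphere(latitude, longitude):
    """Map latitude and longitude in radians to a point on the unit sphere."""
    return (
        math.cos(latitude) * math.cos(longitude),
        math.cos(latitude) * math.sin(longitude),
        math.sin(latitude),
    )


# Hypothetical rows in the documented layout: [latitude, longitude], in radians.
toy_rows = [[0.0, 0.0], [math.pi / 2, 0.0], [math.radians(48.85), math.radians(2.35)]]
toy_points = [lat_lon_to_sphere(lat, lon) for lat, lon in toy_rows]
```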

geomstats.datasets.utils.load_connectomes(as_vectors=False)[source]#

Load data from brain connectomes.

Load the correlation data from the Kaggle MLSP 2014 Schizophrenia Challenge. The original data came as flattened vectors; unless as_vectors=True is passed, the correlation values are reshaped as symmetric matrices with ones on the diagonal.

Parameters:

as_vectors (bool) – Whether to return raw data as vectors or as symmetric matrices. Optional, default: False

Returns:

  • mat (array-like, shape=[86, {[28, 28], 378}]) – Connectomes.

  • patient_id (array-like, shape=[86,]) – Patient unique identifiers.

  • target (array-like, shape=[86,]) – Labels, whether patients belong to the diseased class (1) or control (0).
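The two shapes are related by 28 × 27 / 2 = 378: a flattened vector holds the off-diagonal correlations of a 28 × 28 symmetric matrix. A minimal sketch of the reshaping follows, assuming the values are ordered along the strict upper triangle row by row. That ordering, the helper name, and the toy vector are assumptions made for illustration and may not match the library's convention.

```python
def vector_to_correlation_matrix(vector, n):
    """Rebuild an n x n symmetric matrix with unit diagonal from n(n-1)/2 values."""
    if len(vector) != n * (n - 1) // 2:
        raise ValueError("vector length must equal n * (n - 1) / 2")
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    k = 0
    # Assumed ordering: strict upper triangle, row by row.
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = vector[k]
            matrix[j][i] = vector[k]
            k += 1
    return matrix


# 28 regions give 28 * 27 / 2 = 378 off-diagonal correlations.
toy_vector = [0.001 * k for k in range(378)]
toy_matrix = vector_to_correlation_matrix(toy_vector, n=28)
```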

geomstats.datasets.utils.load_cube()[source]#

Load data from the cube mesh.

Returns:

  • vertices (gs.array, shape=[8, 3]) – Vertices of the cube.

  • faces (gs.array, shape=[12, 3]) – Faces of the cube. Each face contains the 3 indices of the vertices that compose it.

geomstats.datasets.utils.load_emg(file_path='/home/runner/work/geomstats/geomstats/geomstats/datasets/data/emg/emg.csv')[source]#

Load data from data/emg/emg.csv.

Returns:

data_emg (pandas.DataFrame, shape=[731682, 10]) – EMG time series for each of the 8 electrodes, with the time stamps and the label of the hand sign.

geomstats.datasets.utils.load_football()[source]#

Load data from data/graph_space/Footballs_scores.npy and footballs_ppn.npy.

Returns:

  • data_football (gs.array, shape=[128, 11, 11]) – Adjacency matrices of the player passing networks of all the matches and teams in FIFA 2018.

  • data_scores (gs.array, shape=[128, 1]) – Scores of the team during the match.

geomstats.datasets.utils.load_hands()[source]#

Load data from data/hands/hands.txt and labels.txt.

Load the dataset of hand poses, where a hand is represented as a set of 22 landmarks, the hand joints, in 3D.

The dataset contains two different hand poses:

  • Label 0: hand is in the position “Grab”

  • Label 1: hand is in the position “Expand”

This is a subset of the SHREC 2017 dataset [SWVGLF2017].

References

[SWVGLF2017]

Q. De Smedt, H. Wannous, J.P. Vandeborre, J. Guerry, B. Le Saux, D. Filliat, SHREC’17 Track: 3D Hand Gesture Recognition Using a Depth and Skeletal Dataset, 10th Eurographics Workshop on 3D Object Retrieval, 2017. https://doi.org/10.2312/3dor.20171049

Returns:

  • data (array-like, shape=[52, 22, 3]) – Hand data, represented as a list of 22 joints, specifically as the 3D coordinates of these joints.

  • labels (array-like, shape=[52,]) – Labels representing the hand poses. Label 0: “Grab”, Label 1: “Expand”.

  • bone_list (array-like) – List of bones, as a list of connections between joints.

geomstats.datasets.utils.load_karate_graph()[source]#

Load data from data/graph_karate.

Returns:

graph (prepare_graph_data.Graph) – Graph containing nodes, edges, and labels from the karate dataset.

geomstats.datasets.utils.load_leaves()[source]#

Load data from data/leaves/leaves.xlsx.

Returns:

  • beta_param (array-like, shape=[172, 2]) – Beta parameters of the beta distributions fitted to each leaf orientation angle sample of 172 species of plants.

  • distrib_type (array-like, shape=[172, ]) – Leaf orientation angle distribution type for each of the 172 species.

geomstats.datasets.utils.load_mammals(file_path='/home/runner/work/geomstats/geomstats/geomstats/datasets/data/graph_space/mammals_grooming.npy')[source]#

Load data from data/graph_space/mammals_grooming.npy.

Returns:

data_mammals (gs.array, shape=[26, 18, 18]) – Adjacency matrices of different groups of mammals, measuring grooming among the mammals.

References

[Franz2015]

Franz, M., Altmann, J., & Alberts, S. C. “Knockouts of high-ranking males have limited impact on baboon social networks.” Current Zoology, 61(1), 107-113, 2015.

[Rossi2015]

Rossi, R., & Ahmed, N. “The network data repository with interactive graph analytics and visualization.” In Twenty-ninth AAAI conference on artificial intelligence, 2015

geomstats.datasets.utils.load_optical_nerves()[source]#

Load data from data/optical_nerves/optical_nerves.txt.

Load the dataset of sets of 5 landmarks, labelled S, T, N, I, V, in 3D on monkeys’ optical nerve heads:

  • 1st landmark (S): superior aspect of the retina,

  • 2nd landmark (T): side of the retina closest to the temporal bone of the skull,

  • 3rd landmark (N): nose side of the retina,

  • 4th landmark (I): inferior point,

  • 5th landmark (V): optical nerve head deepest point.

For each monkey, an experimental glaucoma was introduced in one eye, while the second eye was kept as control. This dataset can be used to investigate a significant difference between the glaucoma and the control eyes.

Label 0 refers to a normal eye, and Label 1 to an eye with glaucoma.

References

[PE2015]

V. Patrangenaru and L. Ellingson. Nonparametric Statistics on Manifolds and Their Applications to Object Data, 2015. https://doi.org/10.1201/b18969

Returns:

  • data (array-like, shape=[22, 5, 3]) – Data representing the 5 landmarks, in 3D, for 11 different monkeys.

  • labels (array-like, shape=[22,]) – Labels in {0, 1} classifying the corresponding optical nerve as normal (label = 0) or glaucoma (label = 1).

  • monkeys (array-like, shape=[22,]) – Indices in 0…10 referencing the index of the monkey to which a given optical nerve belongs.

geomstats.datasets.utils.load_poses(only_rotations=True)[source]#

Load data from data/poses/poses.csv.

Returns:

  • data (array-like, shape=[5, 3] or shape=[5, 6]) – Array with each row representing one sample, i.e. one 3D rotation or one 3D rotation + 3D translation.

  • img_paths (list) – List of img paths.

geomstats.datasets.utils.load_random_graph()[source]#

Load data from data/graph_random.

Returns:

graph (prepare_graph_data.Graph) – Graph containing nodes, edges, and labels from the random dataset.

geomstats.datasets.utils.load_sao_paulo(dirname=None)[source]#

Load data from data/sao_paulo/jam_count.csv and data/sao_paulo/jam_table.csv.

Load the dataset of traffic jams in Sao Paulo from 2001 to 2019.

jam_count.csv lists the number of traffic jams for each road in Sao Paulo in that time span. jam_table.csv lists the dates, roads, and durations of all these traffic jams.

The dataset is accessible here: https://www.kaggle.com/datasets/danlessa/sao-paulo-traffic-jams-since-2001

Returns:

  • jam_table (pandas.DataFrame) – Columns: name (of the road), date (of traffic jam), duration.

  • jam_count (dictionary) – Keys: name of the road. Values: count of traffic jams between 2001 and 2019.

Module contents#

Module for exposing datasets to users.