Datasets
Prepare EMG data
Pre-process time series into batched covariance matrices.
The user sets the number of time steps per batch. Pre-processing first removes the transient signal by discarding a margin on each side of every sign change. It then cuts the remaining data into batches, each of which is used to build one covariance matrix. In practice, batches must be large enough to carry enough information, yet small enough for the online classifier to stay reactive.
Lead author: Marius Guerard.
- class geomstats.datasets.prepare_emg_data.TimeSeriesCovariance(data, n_steps, n_timeseries, label_map, margin=0)[source]
Class for generating a list of covariance matrices from time series.
Prepare a TimeSeriesCovariance object from time series stored in a dictionary.
- Parameters
data_dict (dict) – Dictionary with ‘time’, ‘raw_data’, ‘label’ as keys and the corresponding arrays as values.
n_steps (int) – Size of the batches.
n_timeseries (int) – The number of electrodes used for the recording.
label_map (dictionary) – Encode the label into digits.
margin (int) – Number of indices to remove before and after a sign change (can help obtain a stationary signal).
- label_map
Encode the label into digits.
- Type
dictionary
- data_dict
Dictionary with ‘time’, ‘raw_data’, ‘label’ as keys and the corresponding arrays as values.
- Type
dict
- n_steps
Size of the batches.
- Type
int
- n_timeseries
The number of electrodes used for the recording.
- Type
int
- batches
The start indexes of the batches to use to compute covariance matrices.
- Type
array
- margin
Number of indices to remove before and after a sign change (can help obtain a stationary signal).
- Type
int
- covs
The covariance matrices.
- Type
array
- labels
The digit labels corresponding to each batch.
- Type
array
- covec
The vectorized version of the covariance matrices.
- Type
array
- diags
The covariance matrices diagonals.
- Type
array
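The batching described for TimeSeriesCovariance can be illustrated with a minimal NumPy sketch. It is not the geomstats implementation: the helper name `batched_covariances`, the synthetic signal, and the choice to also trim the margin at both ends of the recording are assumptions made for illustration.

```python
import numpy as np


def batched_covariances(raw_data, labels, n_steps, margin=0):
    """Cut a multichannel recording into fixed-size batches and return one
    covariance matrix per batch (hypothetical helper, not the geomstats API)."""
    raw_data = np.asarray(raw_data, dtype=float)  # shape [n_samples, n_timeseries]
    labels = np.asarray(labels)

    # Indices where the label changes, plus both ends of the recording.
    changes = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    bounds = np.concatenate(([0], changes, [len(labels)]))

    covs, batch_labels = [], []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        # Drop `margin` samples on each side of the segment to skip transients
        # (assumption: the recording ends are trimmed like sign changes).
        first, last = start + margin, stop - margin
        for begin in range(first, last - n_steps + 1, n_steps):
            window = raw_data[begin:begin + n_steps]
            # Columns are the time series, rows are the time steps.
            covs.append(np.cov(window, rowvar=False))
            batch_labels.append(labels[begin])
    return np.array(covs), np.array(batch_labels)


rng = np.random.default_rng(0)
signal = rng.normal(size=(400, 8))         # 8 synthetic electrodes
signs = np.repeat(["rock", "paper"], 200)  # one sign change at index 200
covs, batch_labels = batched_covariances(signal, signs, n_steps=50, margin=10)
```

A larger `n_steps` gives better-conditioned covariance estimates but fewer batches per sign, which is the trade-off mentioned in the module description.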
Prepare Graph Data
Prepare and process graph-structured data.
Lead author: Hadi Zaatiti.
- class geomstats.datasets.prepare_graph_data.Graph(graph_matrix_path, labels_path)[source]
Class for generating a graph object from a dataset.
Prepare Graph object from a dataset file.
- Parameters
graph_matrix_path (string) – Path to graph adjacency matrix.
labels_path (string) – Path to labels of the nodes of the graph.
- edges
Dictionary with node number as key and the numbers of the nodes connected to it by an edge as values.
- Type
dict
- n_nodes
Number of nodes in the graph.
- Type
int
- labels
Dictionary with node number as key and the true label number as values.
- Type
dict
- random_walk(walk_length=5, n_walks_per_node=1)[source]
Compute a set of random walks on a graph.
For each node of the graph, generates a number of random walks of a specified length. Two consecutive nodes in a random walk are necessarily connected by an edge. The walks capture the structure of the graph.
- Parameters
walk_length (int) – Length of a random walk in terms of number of edges.
n_walks_per_node (int) – Number of generated walks starting from each node of the graph.
- Returns
self (array-like, shape=[n_walks_per_node*self.n_edges, walk_length]) – Array containing random walks.
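As a rough illustration of the idea, the sketch below generates uniform random walks over an adjacency dictionary shaped like the edges attribute. The function `random_walks` and the toy graph are hypothetical, and the layout of the array returned by Graph.random_walk may differ: here a walk of walk_length edges is stored as walk_length + 1 visited nodes.

```python
import random


def random_walks(edges, walk_length=5, n_walks_per_node=1, seed=None):
    """Hypothetical helper: uniform random walks over an adjacency dict."""
    rng = random.Random(seed)
    walks = []
    for start in edges:
        for _ in range(n_walks_per_node):
            walk = [start]
            for _ in range(walk_length):
                # Step to a uniformly chosen neighbour of the current node,
                # so consecutive nodes always share an edge.
                walk.append(rng.choice(edges[walk[-1]]))
            walks.append(walk)
    return walks


# Toy graph: a path 0 - 1 - 2 with an extra edge 1 - 3.
toy_edges = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
walks = random_walks(toy_edges, walk_length=4, n_walks_per_node=2, seed=0)
```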
- class geomstats.datasets.prepare_graph_data.HyperbolicEmbedding(dim=2, max_epochs=100, lr=0.05, n_context=1, n_negative=2)[source]
Class for learning embeddings of graphs on hyperbolic space.
- Parameters
dim (object) – Dimensions of the used hyperbolic space.
max_epochs (int) – Maximum number of iterations for embedding.
lr (float) – Learning rate for embedding.
n_context (int) – Number of nodes to consider from a neighborhood of nodes around a particular node.
n_negative (int) – Number of nodes to consider when searching for a set of nodes that are far from a particular node.
- embed(graph)[source]
Compute embedding.
Optimize a loss function to obtain a representable embedding.
- Parameters
graph (object) – An instance of the Graph class.
- Returns
embeddings (array-like, shape=[n_samples, dim]) – Return the embedding of the data. Each data sample is represented as a point belonging to the manifold.
- static grad_log_sigmoid(vector)[source]
Gradient of log sigmoid function.
- Parameters
vector (array-like, shape=[n_samples, dim])
- Returns
gradient (array-like, shape=[n_samples, dim])
- grad_squared_distance(point_a, point_b)[source]
Gradient of squared hyperbolic distance.
Gradient of the squared distance in the ball representation, taken with respect to point_a.
- Parameters
point_a (array-like, shape=[n_samples, dim]) – First point in hyperbolic space.
point_b (array-like, shape=[n_samples, dim]) – Second point in hyperbolic space.
- Returns
dist (array-like, shape=[n_samples, 1]) – Geodesic squared distance between the two points.
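For orientation, the squared geodesic distance in the Poincaré ball has a standard closed form, and its gradient can be checked numerically. The sketch below is not the library's code: both function names are hypothetical, points are plain Python sequences, and the finite-difference gradient is the Euclidean gradient in ball coordinates, which may differ from what grad_squared_distance returns.

```python
import math


def poincare_squared_distance(point_a, point_b):
    """Squared geodesic distance in the Poincare ball (curvature -1)."""
    sq_norm_a = sum(x * x for x in point_a)
    sq_norm_b = sum(x * x for x in point_b)
    sq_diff = sum((x - y) ** 2 for x, y in zip(point_a, point_b))
    # Closed form: d = arcosh(1 + 2 |a - b|^2 / ((1 - |a|^2)(1 - |b|^2))).
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_a) * (1.0 - sq_norm_b))
    return math.acosh(arg) ** 2


def numerical_gradient(point_a, point_b, step=1e-6):
    """Central finite-difference gradient with respect to point_a."""
    grad = []
    for i in range(len(point_a)):
        plus = list(point_a)
        minus = list(point_a)
        plus[i] += step
        minus[i] -= step
        diff = (poincare_squared_distance(plus, point_b)
                - poincare_squared_distance(minus, point_b))
        grad.append(diff / (2 * step))
    return grad


grad = numerical_gradient([0.0, 0.0], [0.5, 0.0])
```

The first component of `grad` is negative: moving point_a towards point_b decreases the squared distance.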
- static log_sigmoid(vector)[source]
Log sigmoid function.
Apply log sigmoid function.
- Parameters
vector (array-like, shape=[n_samples, dim])
- Returns
result (array-like, shape=[n_samples, dim])
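The log sigmoid and its derivative have simple closed forms: log σ(x) = -log(1 + exp(-x)) and d/dx log σ(x) = 1 / (1 + exp(x)). The scalar sketch below uses these textbook definitions in a numerically stable way. It is an illustration only: the library methods operate elementwise on arrays, and their exact implementation is not shown in this reference.

```python
import math


def log_sigmoid(x):
    """log(1 / (1 + exp(-x))), evaluated without overflow for large |x|."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    # For negative x, rewrite as x - log(1 + exp(x)) to avoid exp overflow.
    return x - math.log1p(math.exp(x))


def grad_log_sigmoid(x):
    """Derivative of log_sigmoid: 1 / (1 + exp(x)), i.e. sigmoid(-x)."""
    if x >= 0:
        return math.exp(-x) / (1.0 + math.exp(-x))
    return 1.0 / (1.0 + math.exp(x))


# Compare the analytic derivative with a central finite difference.
x0, h = 0.3, 1e-6
finite_diff = (log_sigmoid(x0 + h) - log_sigmoid(x0 - h)) / (2 * h)
```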
- loss(example_embedding, context_embedding, negative_embedding)[source]
Compute loss and grad.
Compute loss and grad given embedding of the current example, embedding of the context and negative sampling embedding.
- Parameters
example_embedding (array-like, shape=[dim]) – Current data sample embedding.
context_embedding (array-like, shape=[dim]) – Current context embedding.
negative_embedding (array-like, shape=[dim]) – Current negative sample embedding.
- Returns
total_loss (int) – The current value of the loss function.
example_grad (array-like, shape=[dim]) – The gradient of the loss function at the embedding of the current data sample.
Utils
Loading toy datasets.
Refer to notebook: geomstats/notebooks/01_data_on_manifolds.ipynb to visualize these datasets.
Lead author: Nina Miolane.
- geomstats.datasets.utils.load_cells()[source]
Load cell data.
This cell dataset contains cell boundaries of mouse osteosarcoma (bone cancer) cells. The dlm8 cell line is derived from dunn and is a more aggressive cancer. The cells have been treated with one of three treatments: control (no treatment), jasp (jasplakinolide) and cytd (cytochalasin D). These are drugs that perturb the cytoskeleton of the cells.
- Returns
cells (list of 650 planar discrete curves) – Each curve represents the boundary of a cell in counterclockwise order, their lengths are not necessarily equal.
cell_lines (list of 650 strings) – List of the cell lines of each cell (dlm8 or dunn).
treatments (list of 650 strings) – List of the treatments given to each cell (control, cytd or jasp).
- geomstats.datasets.utils.load_cities()[source]
Load data from data/cities/cities.json.
- Returns
data (array-like, shape=[50, 2]) – Array with each row representing one sample, i.e. latitude and longitude of a city. Angles are in radians.
name (list) – List of city names.
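Because the angles are in radians, each row can be mapped to a point on the unit sphere with the usual spherical-to-Cartesian formulas. The sketch below assumes the (latitude, longitude) column order stated above; the helper `to_unit_sphere` is hypothetical and the sample rows are illustrative values, not entries from cities.json.

```python
import math


def to_unit_sphere(latitude, longitude):
    """Map (latitude, longitude) in radians to a point on the unit sphere."""
    return (
        math.cos(latitude) * math.cos(longitude),
        math.cos(latitude) * math.sin(longitude),
        math.sin(latitude),
    )


# Illustrative rows in (latitude, longitude) order, in radians.
rows = [(0.0, 0.0), (math.radians(48.85), math.radians(2.35))]
points = [to_unit_sphere(lat, lon) for lat, lon in rows]
```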
- geomstats.datasets.utils.load_connectomes(as_vectors=False)[source]
Load data from brain connectomes.
Load the correlation data from the Kaggle MLSP 2014 Schizophrenia Challenge. The original data came as flattened vectors; unless as_vectors=True is passed, the correlation values are reshaped as symmetric matrices with ones on the diagonal.
- Parameters
as_vectors (bool) – Whether to return raw data as vectors or as symmetric matrices. Optional, default: False
- Returns
mat (array-like, shape=[86, {[28, 28], 378}]) – Connectomes.
patient_id (array-like, shape=[86,]) – Patient unique identifiers.
target (array-like, shape=[86,]) – Labels, whether patients belong to the diseased class (1) or control (0).
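The two shapes are related by 378 = 28 × 27 / 2: a flattened vector holds the off-diagonal entries of a symmetric 28 × 28 matrix. The sketch below shows one way to rebuild such a matrix with ones on the diagonal. The helper name and the ordering of the entries (row-major over the strict upper triangle) are assumptions, and the input is synthetic rather than real connectome data.

```python
import numpy as np


def vector_to_correlation_matrix(vector, n_regions=28):
    """Rebuild a symmetric matrix with unit diagonal from its off-diagonal
    entries (hypothetical helper; the entry ordering is an assumption)."""
    vector = np.asarray(vector, dtype=float)
    expected = n_regions * (n_regions - 1) // 2
    if vector.shape != (expected,):
        raise ValueError(f"expected {expected} values, got shape {vector.shape}")
    matrix = np.eye(n_regions)
    rows, cols = np.triu_indices(n_regions, k=1)
    matrix[rows, cols] = vector  # fill the strict upper triangle
    matrix[cols, rows] = vector  # mirror it below the diagonal
    return matrix


rng = np.random.default_rng(0)
fake_vector = rng.uniform(-1.0, 1.0, size=378)  # synthetic stand-in values
mat = vector_to_correlation_matrix(fake_vector)
```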
- geomstats.datasets.utils.load_emg()[source]
Load data from data/emg/emg.csv.
- Returns
data_emg (pandas.DataFrame, shape=[731682, 10]) – EMG time series for each of the 8 electrodes, with the time stamps and the label of the hand sign.
- geomstats.datasets.utils.load_hands()[source]
Load data from data/hands/hands.txt and labels.txt.
Load the dataset of hand poses, where a hand is represented as a set of 22 landmarks (the hand's joints) in 3D.
The dataset contains two different hand poses:
- Label 0: hand is in the position “Grab”,
- Label 1: hand is in the position “Expand”.
This is a subset of the SHREC 2017 dataset [SWVGLF2017].
References
- SWVGLF2017
De Smedt, H. Wannous, J.P. Vandeborre, J. Guerry, B. Le Saux, D. Filliat, SHREC’17 Track: 3D Hand Gesture Recognition Using a Depth and Skeletal Dataset, 10th Eurographics Workshop on 3D Object Retrieval, 2017. https://doi.org/10.2312/3dor.20171049
- Returns
data (array-like, shape=[52, 22, 3]) – Hand data, represented as a list of 22 joints, specifically as the 3D coordinates of these joints.
labels (array-like, shape=[52,]) – Label representing hands poses. Label 0: “Grab”, Label 1: “Expand”
bone_list (array-like) – List of bones, as a list of connections between joints.
- geomstats.datasets.utils.load_karate_graph()[source]
Load data from data/graph_karate.
- Returns
graph (prepare_graph_data.Graph) – Graph containing nodes, edges, and labels from the karate dataset.
- geomstats.datasets.utils.load_leaves()[source]
Load data from data/leaves/leaves.xlsx.
- Returns
beta_param (array-like, shape=[172, 2]) – Beta parameters of the beta distributions fitted to each leaf orientation angle sample of 172 species of plants.
distrib_type (array-like, shape=[172, ]) – Leaf orientation angle distribution type for each of the 172 species.
- geomstats.datasets.utils.load_optical_nerves()[source]
Load data from data/optical_nerves/optical_nerves.txt.
Load the dataset of sets of 5 landmarks, labelled S, T, I, N, V, in 3D on monkeys’ optical nerve heads:
- 1st landmark (S): superior aspect of the retina,
- 2nd landmark (T): side of the retina closest to the temporal bone of the skull,
- 3rd landmark (N): nose side of the retina,
- 4th landmark (I): inferior point,
- 5th landmark (V): optical nerve head deepest point.
For each monkey, an experimental glaucoma was introduced in one eye, while the second eye was kept as control. This dataset can be used to investigate a significant difference between the glaucoma and the control eyes.
Label 0 refers to a normal eye, and Label 1 to an eye with glaucoma.
References
- PE2015
V. Patrangenaru and L. Ellingson. Nonparametric Statistics on Manifolds and Their Applications to Object Data, 2015. https://doi.org/10.1201/b18969
- Returns
data (array-like, shape=[22, 5, 3]) – Data representing the 5 landmarks, in 3D, for 11 different monkeys.
labels (array-like, shape=[22,]) – Labels in {0, 1} classifying the corresponding optical nerve as normal (label = 0) or glaucoma (label = 1).
monkeys (array-like, shape=[22,]) – Indices in 0…10 referencing the index of the monkey to which a given optical nerve belongs.
- geomstats.datasets.utils.load_poses(only_rotations=True)[source]
Load data from data/poses/poses.csv.
- Returns
data (array-like, shape=[5, 3] or shape=[5, 6]) – Array with each row representing one sample, i.e. one 3D rotation or one 3D rotation + 3D translation.
img_paths (list) – List of img paths.