API

Preprocessing

pyemb.preprocessing.find_connected_components(A, attributes, n_components=None)

Find connected components of a multipartite graph.

Parameters:

A (scipy.sparse.csr_matrix) – The matrix of the graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
n_components (int) – The number of components to be found.

Returns:

The matrices of the connected components and their attributes.

Return type:

list of scipy.sparse.csr_matrix, list of lists
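
As a rough illustration of what a connected component means for a rectangular matrix, the sketch below treats rows and columns as two node sets and groups them by reachability. It is not pyemb's implementation: the helper name `bipartite_components`, the dense input, and the row/column interpretation are all assumptions made for illustration.

```python
import numpy as np


def bipartite_components(A):
    """Label rows and columns of a rectangular matrix by connected component.

    Hypothetical helper: rows and columns are treated as two node sets,
    and a nonzero entry A[i, j] is an edge between row i and column j.
    """
    A = np.asarray(A)
    n_rows, n_cols = A.shape
    labels = -np.ones(n_rows + n_cols, dtype=int)  # -1 means unvisited
    current = 0
    for start in range(n_rows + n_cols):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            if node < n_rows:
                # Row node: neighbours are columns with a nonzero entry.
                neighbours = n_rows + np.flatnonzero(A[node])
            else:
                # Column node: neighbours are rows with a nonzero entry.
                neighbours = np.flatnonzero(A[:, node - n_rows])
            for nb in neighbours:
                if labels[nb] == -1:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    return labels[:n_rows], labels[n_rows:]


A = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 0, 1]])
row_labels, col_labels = bipartite_components(A)
```

Here rows 0 and 1 share a component with columns 0 and 1, while row 2 and column 2 form a second component.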

pyemb.preprocessing.find_subgraph(A, attributes, subgraph_attributes)

Find a subgraph of a multipartite graph.

Parameters:

A (scipy.sparse.csr_matrix) – The matrix of the multipartite graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
subgraph_attributes (list of lists) – The attributes of the nodes wanted in the subgraph. The first list contains the attributes of the nodes wanted in the rows. The second list contains the attributes of the nodes wanted in the columns.

Returns:

The matrix and attributes of the subgraph.

Return type:

scipy.sparse.csr_matrix, list of lists

pyemb.preprocessing.graph_from_dataframes(tables, relationship_cols, same_attribute=False, dynamic_col=None, weight_col=None, join_token='::')

Create a graph from a list of tables and relationships.

Parameters:

tables (pandas.DataFrame or list of pandas.DataFrames) – Dataframe of relationships or list of dataframes. The column names of the dataframe(s) indicate the partition of the entities therein.
relationship_cols (list of lists) – The pairs of partitions we are interested in. This can be in one of two formats. Either a list of pairs of partitions, e.g. [['A','B'], ['C','B']], in which case each pair is looked for in each table; this allows for the case where the same relationships appear in different tables. Or len(relationship_cols) == len(tables), in which case the pairs of partitions to create relationships from for each table are given at the corresponding index of the list.
same_attribute (bool) – Whether the entities in the columns are from the same attribute. This allows for intra-partition relationships.
dynamic_col (str or list of str) – The name of the column containing the time information. If a list is given then dynamic_col[i] is the name of the time column for tables[i]. If None, the time information is not used.
weight_col (str or list of str) – The name of the column containing the edge weight information. If a list is given then weight_col[i] is the name of the weight column for tables[i]. If None, the weight information is not used.
join_token (str) – The token used to join the names of the partitions and the names of the entities to create a unique ID. Default is ::.

Returns:

A (scipy.sparse.csr_matrix) – The adjacency matrix of the graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in the rows. The second list contains the attributes of the nodes in the columns.

Examples

>>> import pandas as pd
>>> import pyemb as eb
>>> # Create dataframes
>>> df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> df2 = pd.DataFrame({'A': [1, 2, 3], 'C': [7, 8, 9]})
>>> # Create graph from dataframes
>>> A, attributes = eb.graph_from_dataframes([df1, df2], [['A', 'B'], ['A', 'C']])
>>> print(A.todense())
>>> print(attributes)
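
To show why a join token is useful, the sketch below builds node IDs by joining a partition name and an entity name, so that entity 1 in partition A cannot collide with an entity 1 in another partition. The ID format `<partition><join_token><entity>` and the helper `node_id` are assumptions for illustration; the sketch does not call pyemb.

```python
import numpy as np

join_token = "::"

# Edges taken from two columns, 'A' and 'B', of a small table.
edges = [("A", 1, "B", 4), ("A", 2, "B", 5), ("A", 3, "B", 4)]


def node_id(partition, entity):
    # Assumed ID format: '<partition><join_token><entity>'.
    return f"{partition}{join_token}{entity}"


nodes = sorted({node_id(p, e) for p1, e1, p2, e2 in edges
                for p, e in ((p1, e1), (p2, e2))})
index = {name: i for i, name in enumerate(nodes)}

# Symmetric adjacency matrix over all nodes from both partitions.
adj = np.zeros((len(nodes), len(nodes)), dtype=int)
for p1, e1, p2, e2 in edges:
    i, j = index[node_id(p1, e1)], index[node_id(p2, e2)]
    adj[i, j] = adj[j, i] = 1
```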

pyemb.preprocessing.largest_cc_of(A, attributes, partition, dynamic=False)

Find the connected component containing the most nodes from a partition.

Parameters:

A (scipy.sparse.csr_matrix) – The matrix of the graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
partition (str) – The partition to be searched.
dynamic (bool) – Whether to find the connected component containing the most nodes from the dynamic part.

Returns:

The matrix of the connected component and its attributes.

Return type:

scipy.sparse.csr_matrix, list of lists

pyemb.preprocessing.time_series_matrix_and_attributes(data, time_col, drop_nas=True)

Create a matrix from a time series.

Parameters:

data (pandas.DataFrame) – The data to be used to create the matrix.
time_col (str) – The name of the column containing the time information.
drop_nas (bool) – Whether to drop rows with missing values. Default is True.

Returns:

The matrix created from the time series and the attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.

Return type:

numpy.ndarray, list of lists

pyemb.preprocessing.to_networkx(A, attributes, symmetric=None)

Convert a multipartite graph to a networkx graph.

Embedding

pyemb.embedding.AUASE(As, Cs, d, alpha, norm=True, flat=True, sparse_matrix=False, return_left=False)

Computes the attributed unfolded adjacency spectral embedding (AUASE).

Parameters:

As (numpy.ndarray) – An adjacency matrix series of shape (T, n, n).
Cs (numpy.ndarray) – An attribute matrix series of shape (T, n, p).
d (int) – Embedding dimension.
alpha (float) – Weighting parameter between the adjacency and attribute matrices.
norm (bool, optional) – Whether to normalise the attributes. Default is True.
flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.
sparse_matrix (bool, optional) – Whether the adjacency matrices are sparse. Default is False.
return_left (bool, optional) – Whether to return the left (anchor) embedding as well as the right (dynamic) embedding. Default is False.

Returns:

numpy.ndarray – Dynamic embedding of shape (n*T, d) or (T, n, d).
numpy.ndarray, optional – Anchor embedding of shape (n, d) if return_left is True.

pyemb.embedding.ISE(As, d, flat=True, procrustes=False, consistent_orientation=True)

Computes the independent spectral embedding (ISE), i.e. a separate spectral embedding for each adjacency snapshot.

Parameters:

As (numpy.ndarray) – An adjacency matrix series of shape (T, n, n).
d (int) – Embedding dimension.
flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.
procrustes (bool, optional) – Whether to align each embedding with the previous embedding. Default is False.
consistent_orientation (bool, optional) – Whether to ensure the eigenvector orientation is consistent. Default is True.

Returns:

Dynamic embedding of shape (n*T, d) or (T, n, d).

Return type:

numpy.ndarray
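
Because each snapshot is embedded separately, consecutive embeddings can differ by an arbitrary rotation, which is what the procrustes option addresses. The sketch below shows one standard way to align an embedding with a reference, the orthogonal Procrustes solution via an SVD. Whether pyemb aligns snapshots in exactly this way is an assumption, and `procrustes_align` is a hypothetical helper.

```python
import numpy as np


def procrustes_align(X, X_ref):
    """Rotate X to best match X_ref in the least-squares sense.

    Solves the orthogonal Procrustes problem min_R ||X R - X_ref||_F
    over orthogonal R, using the SVD of X^T X_ref.
    """
    U, _, Vt = np.linalg.svd(X.T @ X_ref)
    R = U @ Vt
    return X @ R


rng = np.random.default_rng(0)
X_ref = rng.normal(size=(10, 2))

# Rotate the reference by 90 degrees, then recover it by alignment.
rotation = np.array([[0.0, -1.0], [1.0, 0.0]])
X = X_ref @ rotation
X_aligned = procrustes_align(X, X_ref)
```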

pyemb.embedding.OMNI(As, d, flat=True, sparse_matrix=False)

Computes the omnibus dynamic spectral embedding. For more details, see: https://arxiv.org/abs/1705.09355

Parameters:

As (numpy.ndarray) – Adjacency matrices of shape (T, n, n).
d (int) – Embedding dimension.
flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.
sparse_matrix (bool, optional) – Whether to use sparse matrices. Default is False.

Returns:

Dynamic embedding of shape (n*T, d) or (T, n, d).

Return type:

numpy.ndarray
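
As a minimal sketch of the omnibus construction described in the linked paper, the code below builds the block matrix whose (s, t) block is the average of snapshots s and t, then embeds it spectrally. The function names are hypothetical, and selecting eigenpairs by largest magnitude is a simplifying assumption rather than a statement about pyemb's internals.

```python
import numpy as np


def omnibus_matrix(As):
    """Build the (n*T, n*T) omnibus matrix from a (T, n, n) series.

    Block (s, t) is the average (As[s] + As[t]) / 2, so diagonal blocks
    are the snapshots themselves.
    """
    As = np.asarray(As, dtype=float)
    T, n, _ = As.shape
    M = np.zeros((n * T, n * T))
    for s in range(T):
        for t in range(T):
            M[s * n:(s + 1) * n, t * n:(t + 1) * n] = (As[s] + As[t]) / 2
    return M


def omni_sketch(As, d):
    """Embed the omnibus matrix with its d largest-magnitude eigenpairs."""
    M = omnibus_matrix(As)
    vals, vecs = np.linalg.eigh(M)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))  # shape (n*T, d)


As = np.array([[[0, 1], [1, 0]],
               [[0, 0], [0, 0]]])
M = omnibus_matrix(As)
Y = omni_sketch(As, d=1)
```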

pyemb.embedding.UASE(As, d, flat=True, sparse_matrix=False, return_left=False)

Computes the unfolded adjacency spectral embedding (UASE). For more details, see: https://arxiv.org/abs/2007.10455 and https://arxiv.org/abs/2106.01282

Parameters:

As (numpy.ndarray) – An adjacency matrix series of shape (T, n, n).
d (int) – Embedding dimension.
flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.
sparse_matrix (bool, optional) – Whether the adjacency matrices are sparse. Default is False.
return_left (bool, optional) – Whether to return the left (anchor) embedding as well as the right (dynamic) embedding. Default is False.

Returns:

numpy.ndarray – Dynamic embedding of shape (n*T, d) or (T, n, d).
numpy.ndarray, optional – Anchor embedding of shape (n, d) if return_left is True.

pyemb.embedding.dyn_embed(As, d=50, method='UASE', regulariser='auto', flat=True, **kwargs)

Computes the dynamic embedding using a specified method.

Parameters:

As (numpy.ndarray or list) – An adjacency matrix series which is either a numpy array of shape (T, n, n), a list of numpy arrays of shape (n, n), or a series of CSR matrices.
d (int, optional) – Embedding dimension. Default is 50.
method (str, optional) – The embedding method to use. Options are ISE, ISE PROCRUSTES, UASE, AUASE, OMNI, ULSE, URLSE, RANDOM. Default is UASE.
regulariser (float or auto, optional) – Regularisation parameter for the Laplacian matrix. If auto, the regulariser is set to the average node degree. Default is auto.
flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.

Returns:

Dynamic embedding of shape (n*T, d) or (T, n, d).

Return type:

numpy.ndarray

Raises:

Exception – If the specified method is not recognized.
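
The flat option recurs throughout this module, so a small sketch of the two layouts may help. It assumes that the flat (n*T, d) array stacks the T per-snapshot blocks on top of each other in time order; that ordering is an assumption, not something stated above.

```python
import numpy as np

T, n, d = 3, 4, 2

# A 3D embedding: one (n, d) block per time point.
embedding_3d = np.arange(T * n * d, dtype=float).reshape(T, n, d)

# Assumed flat layout: the T blocks stacked on top of each other.
embedding_flat = embedding_3d.reshape(n * T, d)

# Rows t*n to (t+1)*n of the flat array hold time point t.
t = 1
block = embedding_flat[t * n:(t + 1) * n]
```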

pyemb.embedding.eigen_decomp(A, dim=None)

Perform eigenvalue decomposition of a matrix.

Parameters:

A (numpy.ndarray) – The matrix to be decomposed.
dim (int) – The number of eigenvalues and eigenvectors to be returned. If None, all eigenvalues and eigenvectors are returned.

Returns:

eigenvalues (numpy.ndarray) – The eigenvalues.
eigenvectors (numpy.ndarray) – The eigenvectors.
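
A minimal sketch of a truncated eigendecomposition for a symmetric matrix follows. The helper `top_eigenpairs` is hypothetical, and ordering the eigenvalues by decreasing magnitude is an assumption about the convention, chosen because it is common in spectral embedding.

```python
import numpy as np


def top_eigenpairs(A, dim=None):
    """Return eigenpairs of a symmetric matrix, largest magnitude first.

    Hypothetical helper; the ordering convention is an assumption.
    """
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(np.abs(vals))[::-1]
    if dim is not None:
        order = order[:dim]
    return vals[order], vecs[:, order]


A = np.array([[2.0, 0.0, 0.0],
              [0.0, -3.0, 0.0],
              [0.0, 0.0, 1.0]])
vals, vecs = top_eigenpairs(A, dim=2)
```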

pyemb.embedding.embed(Y, d=50, version='sqrt', return_right=False, flat=True, make_laplacian=False, regulariser=0)

Embed a matrix.

Parameters:

Y (numpy.ndarray or list of numpy.ndarray) – The matrix to embed.
d (int) – The number of dimensions to embed into.
version (str) – Whether to take the square root of the singular values. Options are full or sqrt (default).
return_right (bool) – Whether to return the right embedding.
flat (bool) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d).
make_laplacian (bool) – Whether to use the Laplacian matrix.
regulariser (float) – The regulariser to be added to the degrees of the nodes. (only used if make_laplacian=True)

Returns:

numpy.ndarray – The left embedding.
numpy.ndarray, optional – The right embedding, if return_right is True.

pyemb.embedding.regularised_ULSE(As, d, regulariser='auto', flat=True, sparse_matrix=False, return_left=False)

Computes the regularised unfolded Laplacian spectral embedding (regularised ULSE).

Parameters:

As (numpy.ndarray) – An adjacency matrix series of shape (T, n, n).
d (int) – Embedding dimension.
regulariser (float, optional) – Regularisation parameter for the Laplacian matrix. By default, this is the average node degree.
flat (bool, optional) – Whether to return a flat embedding (n*T, d) or a 3D embedding (T, n, d). Default is True.
sparse_matrix (bool, optional) – Whether the adjacency matrices are sparse. Default is False.
return_left (bool, optional) – Whether to return the left (anchor) embedding as well as the right (dynamic) embedding. Default is False.

Returns:

numpy.ndarray – Dynamic embedding of shape (n*T, d) or (T, n, d).
numpy.ndarray, optional – Anchor embedding of shape (n, d) if return_left is True.

Plotting

pyemb.plotting.get_fig_legend_handles_labels(fig)

Get the legend handles and labels from a figure.

Parameters:

fig (matplotlib.figure.Figure) – The figure object.

Returns:

The handles and labels of the legend.

Return type:

list, list

pyemb.plotting.quick_plot(embedding, n, T=1, node_labels=None, **kwargs)

Produces an interactive plot of an embedding. If the embedding is dynamic (i.e. T > 1), then the embedding will be animated over time.

Parameters:

embedding (numpy.ndarray (n*T, d) or (T, n, d)) – The dynamic embedding.
n (int) – The number of nodes.
T (int (optional)) – The number of time points (> 1 animates the embedding).
node_labels (list of length n (optional)) – The labels of the nodes (time-invariant).
return_df (bool (optional)) – Option to return the plotting dataframe.
title (str (optional)) – The title of the plot.

pyemb.plotting.snapshot_plot(embedding, n=None, node_labels=None, c=None, idx_of_interest=None, max_cols=4, title=None, title_fontsize=20, sharex=False, sharey=False, tick_labels=False, xaxis_label='', yaxis_label='', axis_fontsize=12, figsize_scale=5, figsize=None, show_plot=True, add_legend=False, move_legend=(0.5, -0.1), loc='lower center', max_legend_cols=4, **kwargs)

Plot a snapshot of an embedding at a given time point.

Parameters:

embedding (np.ndarray or list of np.ndarray) – The embedding to plot.
n (int (optional)) – The number of nodes in the graph. Should be provided if the embedding is a single numpy array and n is not the first dimension of the array.
node_labels (list (optional)) – The labels of the nodes. Default is None.
c (list or dict (optional)) – The colors of the nodes. If a list is provided, it should be a list of length n. If a dictionary is provided, it should map each unique label to a colour.
idx_of_interest (list (optional)) – The indices to plot. For example, if embedding is a list, idx_of_interest can be used to plot only a subset of the embeddings. By default, all embeddings are plotted.
max_cols (int (optional)) – The maximum number of columns in the plot. Default is 4.
title (str or list of str (optional)) – The title of the plot. If a list is provided, each element will be the title of a subplot. Default is None.
title_fontsize (int (optional)) – The fontsize of the title. Default is 20.
sharex (bool (optional)) – Whether to share the x-axis across subplots. Default is False.
sharey (bool (optional)) – Whether to share the y-axis across subplots. Default is False.
tick_labels (bool (optional)) – Whether to show tick labels. Default is False.
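xaxis_label (str (optional)) – The x-axis label. Default is an empty string.
yaxis_label (str (optional)) – The y-axis label. Default is an empty string.
axis_fontsize (int (optional)) – The fontsize of the axis labels. Default is 12.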
figsize_scale (int (optional)) – The scale of the figure size. Default is 5.
figsize (tuple (optional)) – The figure size. Default is None.
show_plot (bool (optional)) – Whether to show the plot. Default is True.
add_legend (bool (optional)) – Whether to add a legend to the plot. Default is False.
loc (str (optional)) – The anchor point for where the legend will be placed. Default is lower center.
move_legend (tuple (optional)) – This adjusts the exact coordinates of the anchor point. Default is (0.5,-.1).
max_legend_cols (int (optional)) – The maximum number of columns in the legend. Default is 4.
kwargs (dict (optional)) – Additional keyword arguments for the scatter plot.

Returns:

The figure object.

Return type:

matplotlib.figure.Figure

Hierarchical Clustering

class pyemb.hc.ConstructTree(point_cloud=None, model=None, epsilon=0.25)

Bases: object

Construct a condensed tree from a hierarchical clustering model.

Parameters:

model (AgglomerativeClustering, optional) – The fitted model.
point_cloud (ndarray, optional) – The data points.
epsilon (float, optional) – The threshold for condensing the tree.
**kwargs (dict, optional) – Additional keyword arguments.

model

The fitted model.

Type: AgglomerativeClustering

point_cloud

The data points.

Type: ndarray

epsilon

The threshold for condensing the tree.

Type: float

linkage

The linkage matrix.

Type: ndarray

tree

The condensed tree.

Type: nx.Graph

collapsed_branches

The collapsed branches.

Type: dict

fit(**kwargs): Fit the condensed tree.

plot(**kwargs)

class pyemb.hc.DotProductAgglomerativeClustering(metric='dot_product', linkage='average', distance_threshold=0, n_clusters=None)

Bases: object

Perform hierarchical clustering using dot product as the metric. If a different metric is used, the AgglomerativeClustering class from scikit-learn is used.

Parameters:

metric (str, optional) – The metric to use for clustering. Default is dot_product.
linkage (str, optional) – The linkage criterion to use. Default is average.
distance_threshold (float, optional) – The linkage distance threshold above which clusters will not be merged. Default is 0.
n_clusters (int, optional) – The number of clusters to find. Default is None.

distances_

Distance between the corresponding nodes in children_.

Type: ndarray

children_

The children of each non-leaf node.

Type: ndarray

labels_

Cluster labels of each point.

Type: ndarray

n_clusters_

The number of clusters.

Type: int

n_connected_components_

The estimated number of connected components.

Type: int

n_leaves_

The number of leaves.

Type: int

n_features_in_

The number of features seen during fit.

Type: int

fit(X)

pyemb.hc.branch_lengths(Z, point_cloud=None)

Calculate branch lengths for a hierarchical clustering dendrogram.

Parameters:

Z (ndarray) – The linkage matrix.
point_cloud (ndarray, optional) – The data points. If not provided, the leaf heights are set to the maximum height.

Returns:

Matrix of branch lengths.

Return type:

ndarray

pyemb.hc.cophenetic_distances(Z)

Calculate the cophenetic distances between each observation and internal nodes.

Parameters:

Z (ndarray) – The linkage matrix.

Returns:

d – The full cophenetic distance matrix, of shape (2n-1) x (2n-1).

Return type:

ndarray

pyemb.hc.find_descendents(Z, node, desc=None, just_leaves=True)

Find all descendants of a given node in a hierarchical clustering tree.

Parameters:

Z (ndarray) – The linkage matrix.
node (int) – The node to find descendants of.
desc (dict, optional) – Dictionary to store descendants.
just_leaves (bool, optional) – Whether to include only leaf nodes.

Returns:

List of descendants.

Return type:

list
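
The following sketch shows how leaf descendants can be read off a linkage matrix, assuming Z follows the SciPy convention in which, with n leaves, node n+i is the cluster created by row i. The helper `leaf_descendants` is hypothetical and only covers the leaves-only case.

```python
import numpy as np


def leaf_descendants(Z, node):
    """Return the leaves below `node` in a SciPy-style linkage matrix.

    With n leaves, ids 0..n-1 are leaves and id n+i is the cluster
    formed by merging Z[i, 0] and Z[i, 1].
    """
    n = Z.shape[0] + 1
    leaves, stack = [], [int(node)]
    while stack:
        current = stack.pop()
        if current < n:
            leaves.append(current)
        else:
            left, right = Z[current - n, :2]
            stack.extend([int(left), int(right)])
    return sorted(leaves)


# Four leaves: (0, 1) merge into 4, (2, 3) into 5, then (4, 5) into 6.
Z = np.array([[0, 1, 1.0, 2],
              [2, 3, 1.5, 2],
              [4, 5, 3.0, 4]])
```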

pyemb.hc.get_ranking(model)

Get the ranking of the samples.

Parameters:

model (AgglomerativeClustering) – The fitted model.

Returns:

mh_rank – The ranking of the samples.

Return type:

numpy.ndarray

pyemb.hc.kendalltau_similarity(model, true_ranking)

Calculate the Kendall’s tau similarity between the model and true ranking.

Parameters:

model (AgglomerativeClustering) – The fitted model.
true_ranking (array-like, shape (n_samples, n_samples)) – The true ranking of the samples.

Returns:

The mean Kendall’s tau similarity between the model and true ranking.

Return type:

float
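
For readers unfamiliar with the statistic, here is a minimal pure-Python sketch of Kendall's tau between two rankings without ties (the tau-a variant). It only illustrates the quantity being averaged; which variant pyemb computes, and how it averages over samples, is not specified above, and `kendall_tau` is a hypothetical helper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length sequences without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


same = kendall_tau([1, 2, 3, 4], [10, 20, 30, 40])
reverse = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])
one_swap = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

Identical orderings give 1, fully reversed orderings give -1, and a single adjacent swap among four items gives 2/3.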

pyemb.hc.linkage_matrix(model)

Convert a hierarchical clustering model to a linkage matrix.

Parameters:

model (AgglomerativeClustering) – The fitted model.

Returns:

The linkage matrix.

Return type:

ndarray
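
A sketch of the usual conversion is given below, assuming the model exposes scikit-learn style children_ and distances_ arrays and that the target is a SciPy-style linkage matrix with columns (child, child, height, cluster size). The helper takes the two arrays directly so it runs without a fitted model; `linkage_from_children` is a hypothetical name.

```python
import numpy as np


def linkage_from_children(children, distances):
    """Assemble a SciPy-style linkage matrix from merge information.

    children[i] holds the two nodes merged at step i and distances[i]
    the merge height, as in scikit-learn's children_ and distances_.
    """
    n_samples = len(children) + 1
    counts = np.zeros(len(children))
    for i, merge in enumerate(children):
        for child in merge:
            if child < n_samples:
                counts[i] += 1                          # leaf node
            else:
                counts[i] += counts[child - n_samples]  # earlier merge
    return np.column_stack([children, distances, counts]).astype(float)


children = np.array([[0, 1], [2, 3], [4, 5]])
distances = np.array([1.0, 1.5, 3.0])
Z = linkage_from_children(children, distances)
```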

pyemb.hc.plot_dendrogram(model, dot_product_clustering=True, rescale=False, **kwargs)

Create the linkage matrix and then plot the dendrogram.

Parameters:

model (AgglomerativeClustering) – The fitted model to plot.
**kwargs (dict) – Keyword arguments for dendrogram function.

Return type:

None

pyemb.hc.sample_hyperbolicity(data, metric='dot_products', num_samples=5000)

Calculate the hyperbolicity of the data.

Parameters:

data (numpy.ndarray) – The data to calculate the hyperbolicity.
metric (str) – The metric to use. Options are dot_products, cosine_similarity, precomputed or any metric supported by scikit-learn.
num_samples (int) – The number of samples to calculate the hyperbolicity.

Returns:

The hyperbolicity of the data.

Return type:

float

Datasets

pyemb.datasets.load_lyon()

Load the Lyon dataset. Returns a dictionary with the following keys.

Returns:

data (numpy array of shape (n_edges, 3)) – The edges of the network. The first column is time and the second and third columns are the nodes. The nodes are indices from 0.
labels (numpy array of shape (n_nodes,)) – The labels of the nodes. The index of the label corresponds to the node index.

pyemb.datasets.load_newsgroup()

Load the Newsgroup dataset. Returns a pandas DataFrame with the following columns.

Returns:

data (str) – The text of the newsgroup post.
target (int) – The label of the newsgroup post.
target_names (str) – The label name of the newsgroup post.
layer1 (str) – The category of the newsgroup post.
layer2 (str) – The subcategory of the newsgroup post.

pyemb.datasets.load_planaria()

Load the Planaria dataset. Returns a dictionary with the following keys.

Returns:

Y (numpy array of shape (n_samples, n_features)) – The preprocessed data matrix.
labels (numpy array of shape (n_samples,)) – The cell type of each data point.
labels (numpy array) – The unique cell types.
colour_dict (dict) – A dictionary mapping cell types to colours.

Matrix and Graph Tools

pyemb.tools.degree_correction(embedding)

Perform degree correction.

Parameters:

embedding (numpy.ndarray) – The embedding of the graph, either 2D or 3D.

Returns:

The degree-corrected embedding.

Return type:

numpy.ndarray
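
One common form of degree correction in spectral embedding is to scale every point to unit length, which removes the effect of node degree on the norm. The sketch below shows that operation for a 2D embedding; that pyemb's degree_correction does exactly this is an assumption, and `normalise_rows` is a hypothetical helper.

```python
import numpy as np


def normalise_rows(embedding):
    """Scale each nonzero row of a 2D embedding to unit Euclidean norm."""
    embedding = np.asarray(embedding, dtype=float)
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    # Leave all-zero rows unchanged to avoid dividing by zero.
    safe = np.where(norms == 0, 1.0, norms)
    return embedding / safe


X = np.array([[3.0, 4.0], [0.0, 0.0], [0.5, 0.0]])
X_dc = normalise_rows(X)
```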

pyemb.tools.recover_subspaces(embedding, attributes)

Recover the subspaces for each partition from an embedding.

Parameters:

embedding (numpy.ndarray) – The embedding of the graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.

Returns:

The embeddings and attributes of the partitions.

Return type:

dict, dict

pyemb.tools.select(embedding, attributes, select_attributes)

Select portion of embedding and attributes associated with a set of attributes.

Parameters:

embedding (numpy.ndarray) – The embedding of the graph.
attributes (list of lists) – The attributes of the nodes. The first list contains the attributes of the nodes in rows. The second list contains the attributes of the nodes in the columns.
select_attributes (dict or list of dicts) – The attributes to select by. If a list of dicts is provided, the intersection of the nodes satisfying each dict is selected.

Returns:

The selected embedding and its attributes.

Return type:

numpy.ndarray, list of lists

pyemb.tools.to_laplacian(A, regulariser=0)

Convert an adjacency matrix to a Laplacian matrix.

Parameters:

A (scipy.sparse.csr_matrix) – The adjacency matrix.
regulariser (float or auto) – The regulariser to be added to the degrees of the nodes. If auto, the regulariser is set to the mean of the degrees. Default is 0.

Returns:

The Laplacian matrix.

Return type:

scipy.sparse.csr_matrix
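
To make the role of the regulariser concrete, the sketch below computes a symmetrically normalised matrix in which the regulariser is added to every degree before normalising. It uses a dense NumPy array rather than a CSR matrix for brevity, and the exact normalisation is an assumption rather than a description of pyemb's code; `regularised_laplacian` is a hypothetical name.

```python
import numpy as np


def regularised_laplacian(A, regulariser=0.0):
    """Return (D + r I)^{-1/2} A (D + r I)^{-1/2} for a dense matrix A.

    D is the diagonal matrix of row sums (degrees) and r the regulariser.
    Assumes every regularised degree is positive.
    """
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1) + regulariser
    inv_sqrt = 1.0 / np.sqrt(degrees)
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]


A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
L = regularised_laplacian(A)
# Mimic the 'auto' setting by using the mean degree as the regulariser.
L_reg = regularised_laplacian(A, regulariser=A.sum(axis=1).mean())
```

Adding a positive regulariser shrinks every nonzero entry, which damps the influence of low-degree nodes.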

pyemb.tools.varimax(Phi, gamma=1, q=20, tol=1e-06)

Perform varimax rotation.

Parameters:

Phi (numpy.ndarray) – The matrix to rotate.
gamma (float, optional) – The gamma parameter.
q (int, optional) – The number of iterations.
tol (float, optional) – The tolerance.

Returns:

The rotated matrix.

Return type:

numpy.ndarray
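
The argument names above match a widely circulated NumPy formulation of varimax, reproduced below as a sketch. That pyemb uses exactly this iteration is an assumption, as is reading q as the maximum number of iterations and tol as the relative-improvement threshold for stopping early.

```python
import numpy as np


def varimax_sketch(Phi, gamma=1.0, q=20, tol=1e-6):
    """Rotate the columns of Phi with the classic varimax iteration."""
    p, k = Phi.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(q):
        d_old = d
        Lam = Phi @ R
        # Gradient-like target whose SVD gives the next rotation.
        col_sq = np.diag(np.sum(Lam ** 2, axis=0))
        target = Lam ** 3 - (gamma / p) * (Lam @ col_sq)
        u, s, vt = np.linalg.svd(Phi.T @ target)
        R = u @ vt
        d = s.sum()
        # Stop once the objective no longer improves noticeably.
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return Phi @ R


rng = np.random.default_rng(2)
Phi = rng.normal(size=(20, 3))
rotated = varimax_sketch(Phi)
```

Because the result is Phi multiplied by an orthogonal matrix, row norms and pairwise inner products between rows are unchanged.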

Simulation

pyemb.simulation.SBM(n=200, B=array([[0.5, 0.5], [0.5, 0.4]]), pi=array([0.5, 0.5]))

Generate an adjacency matrix from a stochastic block model.

Parameters:

n (int, optional) – The number of nodes. Default is 200.
B (numpy.ndarray, optional) – The block matrix. Default is a 2-by-2 matrix.
pi (numpy.ndarray, optional) – The block probability vector. Default is a vector of 1/2.

Returns:

The adjacency matrix and the block assignment.

Return type:

tuple
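
A minimal sketch of sampling from a stochastic block model is shown below: each node draws a block from pi, and each pair of nodes is connected with the probability given by the corresponding entry of B. The `seed` argument, the undirected graph without self-loops, and the name `sbm_sketch` are assumptions for illustration and are not part of pyemb.

```python
import numpy as np


def sbm_sketch(n=200, B=None, pi=None, seed=0):
    """Sample a symmetric, hollow adjacency matrix from an SBM.

    Hypothetical helper; the `seed` argument is not part of pyemb.
    """
    B = np.array([[0.5, 0.5], [0.5, 0.4]]) if B is None else np.asarray(B)
    pi = np.array([0.5, 0.5]) if pi is None else np.asarray(pi)
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)    # block of each node
    P = B[z][:, z]                           # edge probabilities
    upper = np.triu(rng.random((n, n)) < P, k=1)
    A = (upper | upper.T).astype(int)        # mirror, no self-loops
    return A, z


A, z = sbm_sketch(n=50)
```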

pyemb.simulation.iid_SBM(n=200, T=2, B=array([[0.5, 0.5], [0.5, 0.4]]), pi=array([0.5, 0.5]))

Generate dynamic adjacency matrices from a stochastic block model.

Parameters:

n (int, optional) – The number of nodes. Default is 200.
T (int, optional) – The number of time steps. Default is 2.
B (numpy.ndarray, optional) – The block matrix. Default is a 2-by-2 matrix.
pi (numpy.ndarray, optional) – The block probability vector. Default is a vector of 1/2.

Returns:

The sequence of adjacency matrices and the block assignment.

Return type:

tuple

pyemb.simulation.symmetrises(A, diag=False)

Symmetrise a matrix.

Parameters:

A (numpy.ndarray) – The matrix to symmetrise.
diag (bool, optional) – Whether to include the diagonal. Default is False.

Returns:

The symmetrised matrix.

Return type:

numpy.ndarray
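
One plausible reading of this function is sketched below: the strict upper triangle is mirrored onto the lower triangle, and diag controls whether the original diagonal is kept. Both the use of the upper triangle and this interpretation of diag are assumptions; `symmetrise_sketch` is a hypothetical name.

```python
import numpy as np


def symmetrise_sketch(A, diag=False):
    """Mirror the upper triangle of A onto the lower triangle.

    With diag=False the diagonal of the result is zero; with diag=True
    the original diagonal is kept. This reading of `diag` is an assumption.
    """
    A = np.asarray(A)
    upper = np.triu(A, k=1)
    S = upper + upper.T
    if diag:
        S = S + np.diag(np.diag(A))
    return S


A = np.array([[1, 2, 3],
              [0, 4, 5],
              [9, 0, 6]])
S = symmetrise_sketch(A)
S_diag = symmetrise_sketch(A, diag=True)
```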