Overview

The package includes the following modules:
  • preprocessing

  • matrix and graph tools

  • embedding

  • visualisation

  • hierarchical clustering

The tutorials demonstrate the functionality of the package on a few real datasets. This page gives further details on each module.

Preprocessing

This module contains a variety of functions for preprocessing data. Each function returns two outputs:

  1. a matrix of the data,

  2. a list of two dictionaries, one for the rows and one for the columns, that contain the metadata of the data.

The types of data that can be processed include:

  • relational databases: graph_from_dataframes, where pairs of columns are specified so that the entries of those columns appearing in the same row are joined by an edge,

  • time series data: time_series_matrix_and_attributes (in progress),

  • text data: text_matrix_and_attributes, where a column of text data is converted to tf-idf features (columns).
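To make the tf-idf conversion concrete, here is a minimal pure-Python sketch. It is not the implementation behind text_matrix_and_attributes: the helper name tfidf_matrix, the whitespace tokenisation and the weighting (raw term count times log(N / document frequency)) are all assumptions chosen for illustration, and tf-idf libraries commonly use smoothed or normalised variants.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (matrix, vocabulary) for a list of text documents.

    Hypothetical helper: rows are documents, columns are terms.
    """
    # Assumption: lower-case and split on whitespace as a stand-in tokeniser.
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenised for term in tokens})
    n_docs = len(documents)

    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Unsmoothed tf-idf: term count * log(N / document frequency).
        row = [counts[term] * math.log(n_docs / doc_freq[term])
               for term in vocabulary]
        matrix.append(row)
    return matrix, vocabulary


docs = ["graph embedding", "graph clustering", "time series"]
matrix, vocabulary = tfidf_matrix(docs)
```

Each row of matrix corresponds to one document and each column to one term in vocabulary, so terms that occur in many documents (here "graph") receive a smaller weight than terms unique to one document.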

There is also functionality for finding connected components, subgraphs and converting to a networkx object.
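As a rough illustration of what finding connected components involves, the sketch below runs a breadth-first search over an undirected edge list. The function name connected_components and the edge-list input are assumptions made for this example; they are not the package's own interface.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return a list of node sets, one per connected component.

    Hypothetical helper: `edges` is an iterable of (u, v) pairs,
    treated as undirected.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects its component.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


edges = [("a", "b"), ("b", "c"), ("d", "e")]
components = connected_components(edges)
```

Here the nodes a, b and c end up in one component, and d and e in another.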

Relational Database

The graph_from_dataframes function takes several parameters; the two main ones discussed here are tables and relationship_cols. tables is a list of dataframes. relationship_cols is a list of lists of column names, indicating which pairs of columns we're interested in.

For example, suppose tables = [df1, df2], where the columns are cols(df1) = [A,B,C] and cols(df2) = [B,C,D,E]. There are two ways to use relationship_cols:

  1. Give a single list of pairs, relationship_cols = [[A,B], [B,C], [D,E]]:
    • the function looks for each of those pairs of columns in every table,

    • from df1 I'd get the relationships from A,B and B,C (df1 has no columns D,E), and from df2 I'd get the relationships from B,C and D,E (df2 has no column A).

  2. Be more specific and give one entry per table, relationship_cols = [[[A,B], [B,C]], [D,E]]:
    • relationship_cols[0] holds the column pairs we search for in df1 and relationship_cols[1] holds the column pair we search for in df2,

    • in this case, from df1 I'd get the relationships from A,B and B,C, and from df2 I'd only get the relationships from D,E.
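The difference between the two forms can be sketched in plain Python. This is an illustration of the behaviour described above, not the code behind graph_from_dataframes: the helpers is_pair, pairs_for_table and edges_from_tables are hypothetical, each table is represented as a list of row dictionaries standing in for a dataframe, and labelling nodes as (column, value) tuples is an assumption.

```python
def is_pair(item):
    """True if item is a single [col, col] pair of column names."""
    return len(item) == 2 and all(isinstance(c, str) for c in item)


def pairs_for_table(relationship_cols, table_index):
    """Column pairs to search for in the table at table_index."""
    if all(is_pair(item) for item in relationship_cols):
        # Form 1: one flat list of pairs, applied to every table.
        return relationship_cols
    # Form 2: one entry per table, either a single pair or a list of pairs.
    entry = relationship_cols[table_index]
    return [entry] if is_pair(entry) else entry


def edges_from_tables(tables, relationship_cols):
    """Collect an edge for each row that contains both columns of a pair."""
    edges = []
    for i, rows in enumerate(tables):
        for col_a, col_b in pairs_for_table(relationship_cols, i):
            for row in rows:
                if col_a in row and col_b in row:
                    # Assumption: nodes are labelled (column, value).
                    edges.append(((col_a, row[col_a]), (col_b, row[col_b])))
    return edges


# Stand-ins for df1 (columns A, B, C) and df2 (columns B, C, D, E).
df1 = [{"A": "a1", "B": "b1", "C": "c1"}]
df2 = [{"B": "b1", "C": "c2", "D": "d1", "E": "e1"}]

flat_edges = edges_from_tables(
    [df1, df2], [["A", "B"], ["B", "C"], ["D", "E"]])
nested_edges = edges_from_tables(
    [df1, df2], [[["A", "B"], ["B", "C"]], ["D", "E"]])
```

With the flat form, the B,C relationship in df2 is picked up; with the per-table form it is not, because only D,E is searched for in df2. Note that this sketch treats any flat list of pairs as form 1, so it cannot tell form 1 apart from a per-table list that happens to hold exactly one pair per table.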

A simplified version is described in this diagram:

[Figure: _images/graph_from_tables.png]

Time Series Data

[Figure: _images/timeseries.png]

Text Data

Matrix and Graph Tools

Embedding

Visualisation

Hierarchical Clustering

Simulation