20Newsgroup documents

This section applies the package's functionality to text data. This includes creating a matrix of tf-idf features, PCA and hierarchical clustering. We demonstrate on a sample of the 20Newsgroup data. Each document is associated with 1 of 20 newsgroup topics, organized at two hierarchical levels.

Data load

Import data and create dataframe.

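import numpy as np
import matplotlib.pyplot as plt
import pyemb as eb  # alias assumed from the `eb.` prefix used throughout

df = eb.load_newsgroup()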
Data loaded successfully

eb.text_matrix_and_attributes creates a matrix Y of tf-idf features. It takes a dataframe and the name of the column that contains the text. Further functionality includes: removing general stopwords, adding custom stopwords, removing email addresses, cleaning the text (lemmatizing, removing symbols and lowercasing), and setting thresholds on the minimum and maximum number of documents a word must appear in to be included.

Y, attributes = eb.text_matrix_and_attributes(df, 'data', remove_stopwords=True, clean_text=True,
                                    remove_email_addresses=True, update_stopwords=['subject'],
                                    min_df=5, max_df=len(df)-1000)
(n,p) = Y.shape
print("n = {}, p = {}".format(n,p))
n = 5000, p = 12804
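
To make the role of the document-frequency thresholds concrete, here is a minimal pure-Python sketch of a tf-idf matrix with min_df and max_df filtering. It is not the implementation behind eb.text_matrix_and_attributes: the function name tfidf_sketch, the whitespace tokeniser, the smoothed idf formula and the L2 row normalisation are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tfidf_sketch(docs, min_df=1, max_df=None):
    """Hypothetical helper: tf-idf rows for a list of strings.

    Terms are kept only if they appear in at least min_df and at most
    max_df documents. The idf uses one common smoothed variant,
    ln((1 + n) / (1 + df)) + 1, and rows are L2-normalised.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    max_df = n if max_df is None else max_df

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vocab = sorted(t for t, c in doc_freq.items() if min_df <= c <= max_df)
    index = {t: j for j, t in enumerate(vocab)}

    rows = []
    for tokens in tokenised:
        counts = Counter(t for t in tokens if t in index)
        row = [0.0] * len(vocab)
        for term, count in counts.items():
            idf = math.log((1 + n) / (1 + doc_freq[term])) + 1.0
            row[index[term]] = count * idf
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm > 0 else row)
    return rows, vocab

docs_demo = ["the cat sat", "the dog sat", "the bird flew"]
tfidf_rows, tfidf_vocab = tfidf_sketch(docs_demo, min_df=1, max_df=2)
print(tfidf_vocab)
```

In this toy corpus "the" appears in all three documents, so max_df=2 drops it from the vocabulary, in the same spirit as max_df=len(df)-1000 above.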

Perform dimension selection using Wasserstein distances; see Whiteley et al., 2022 for details.

ws, dim = eb.wasserstein_dimension_select(Y, range(40), split=0.5)
print("Selected dimension: {}".format(dim))
Selected dimension: 28
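
The idea behind this kind of dimension selection can be sketched as follows: split the rows in two, project one half onto its top d principal directions, and measure how far the projected points are from the held-out half in Wasserstein distance, keeping the d with the smallest distance. The code below is a rough illustration under stated assumptions, not the code of eb.wasserstein_dimension_select: the name wasserstein_dimension_sketch is hypothetical, the split is a plain equal-halves split without shuffling, and the Wasserstein distance is computed as an optimal assignment, which is only valid because both halves have the same size and uniform weights.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_dimension_sketch(Y, dims):
    """Hypothetical helper: pick d minimising the 1-Wasserstein distance
    between the rank-d reconstruction of one half of Y and the other half."""
    Y = np.asarray(Y, dtype=float)
    half = Y.shape[0] // 2
    # Equal-sized halves, so optimal transport reduces to an assignment problem.
    train, test = Y[:half], Y[half:2 * half]

    # Principal directions of the training half (no centring in this sketch).
    _, _, Vt = np.linalg.svd(train, full_matrices=False)

    distances = []
    for d in dims:
        projected = train @ Vt[:d].T @ Vt[:d]  # rank-d reconstruction
        cost = np.linalg.norm(projected[:, None, :] - test[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)
        distances.append(float(cost[rows, cols].mean()))

    best = list(dims)[int(np.argmin(distances))]
    return distances, best

# Toy data: a 2-dimensional signal embedded in 20 dimensions plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
Y_toy = latent @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(200, 20))
ws_toy, dim_toy = wasserstein_dimension_sketch(Y_toy, range(6))
print("Selected dimension:", dim_toy)
```

The pairwise cost matrix grows quadratically with the number of rows, so this sketch is only practical for small samples.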

PCA and tSNE

Now we perform PCA (Whiteley et al., 2022), keeping the selected number of dimensions and scaling the embedding by p^(-1/2).

zeta = p**-.5 * eb.embed(Y, d=dim, version='full')
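
As a rough picture of what such an embedding contains, the sketch below computes a rank-d embedding from the singular value decomposition, taking the left singular vectors scaled by the singular values. This is an assumption about what version='full' means, not a statement of how eb.embed works; the name pca_embed_sketch is hypothetical, a dense SVD is used for simplicity, and no centring is applied.

```python
import numpy as np

def pca_embed_sketch(Y, d):
    """Hypothetical helper: n x d embedding U_d * S_d from the SVD of Y."""
    U, S, _ = np.linalg.svd(np.asarray(Y, dtype=float), full_matrices=False)
    return U[:, :d] * S[:d]  # scale each retained column by its singular value

rng = np.random.default_rng(0)
Y_small = rng.normal(size=(6, 4))
p_small = Y_small.shape[1]
zeta_small = p_small ** -0.5 * pca_embed_sketch(Y_small, d=4)

# With d equal to the rank of Y, dot products between embedded points
# reproduce the scaled dot products between the original rows.
print(np.allclose(zeta_small @ zeta_small.T, Y_small @ Y_small.T / p_small))
```

Preserving dot products in this way is what makes the embedding a natural input for the dot-product clustering used later in this section.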

Apply t-SNE.

from sklearn.manifold import TSNE

tsne_zeta = TSNE(n_components=2, perplexity=30).fit_transform(zeta)

Define a colour dictionary in which topics from the same theme have different shades of the same colour.

target_colour = {'alt.atheism': 'goldenrod',
                 'comp.graphics': 'steelblue',
                 'comp.os.ms-windows.misc': 'skyblue',
                 'comp.sys.ibm.pc.hardware': 'lightblue',
                 'comp.sys.mac.hardware': 'powderblue',
                 'comp.windows.x': 'deepskyblue',
                 'misc.forsale': 'maroon',
                 'rec.autos': 'limegreen',
                 'rec.motorcycles': 'green',
                 'rec.sport.baseball': 'yellowgreen',
                 'rec.sport.hockey': 'olivedrab',
                 'sci.crypt': 'pink',
                 'sci.electronics': 'plum',
                 'sci.med': 'orchid',
                 'sci.space': 'palevioletred',
                 'soc.religion.christian': 'darkgoldenrod',
                 'talk.politics.guns': 'coral',
                 'talk.politics.mideast': 'tomato',
                 'talk.politics.misc': 'darksalmon',
                 'talk.religion.misc': 'gold'}

Plot the first two PCA dimensions on the left and PCA followed by t-SNE on the right.

pca_fig = eb.snapshot_plot(
    embedding=[zeta[:, :2], tsne_zeta],
    node_labels=df['target_names'].tolist(),
    c=target_colour,
    title=['PCA', 'tSNE'],
    add_legend=True,
    max_legend_cols=6,
    figsize=(15, 6),
    move_legend=(.5, -.15),
    # tick_labels=True,
    # Apply other matplotlib settings
    s=10,
)
plt.tight_layout()
../_images/newsgroup_19_0.png

Hierarchical clustering with dot products, Gray et al., 2024

First we perform hierarchical clustering (HC) on the centroids of each topic and plot the dendrogram. Then we perform HC on the whole dataset and visualise the output tree.
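
To give a feel for what clustering with dot products means, here is a naive pure-Python sketch that repeatedly merges the pair of clusters with the largest average pairwise dot product. It illustrates the general idea only and is not the implementation of eb.DotProductAgglomerativeClustering: the function name dot_product_linkage_sketch, the average-linkage choice and the merge-list output format are assumptions, and the cubic running time makes it suitable for toy inputs only.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def dot_product_linkage_sketch(points):
    """Hypothetical helper: agglomerative clustering with dot-product similarity.

    Returns a list of merges (cluster_a, cluster_b, similarity). Original
    points have ids 0..n-1; each merge creates a new cluster id n, n+1, ...
    """
    n = len(points)
    clusters = {i: [i] for i in range(n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Average dot product over all cross-cluster pairs.
                sims = [dot(points[i], points[j])
                        for i in clusters[a] for j in clusters[b]]
                sim = sum(sims) / len(sims)
                if best is None or sim > best[2]:
                    best = (a, b, sim)
        a, b, sim = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, sim))
        next_id += 1
    return merges

points_demo = [[2.0, 0.0], [1.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
merges_demo = dot_product_linkage_sketch(points_demo)
for a, b, sim in merges_demo:
    print(a, b, round(sim, 3))
```

Unlike distance-based linkage, merging happens at high similarity first, so the merge heights decrease as the tree is built.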

On centroids

Find centroids

idxs = [np.where(np.array(df['target']) == t)[0]
        for t in sorted(df['target'].unique())]
t_zeta = np.array([np.mean(zeta[idx, :], axis=0) for idx in idxs])

Topic HC clustering

t_dp_hc = eb.DotProductAgglomerativeClustering()
t_dp_hc.fit(t_zeta);

Plot dendrogram

plt.title("Hierarchical Clustering Dendrogram")
eb.plot_dendrogram(t_dp_hc, dot_product_clustering=True, orientation='left',
                   labels=sorted(df['target_names'].unique()))
plt.show()
../_images/newsgroup_28_0.png

On documents

dp_hc = eb.DotProductAgglomerativeClustering()
dp_hc.fit(zeta);

Construct a tree graph from the hierarchical clustering. epsilon is set to zero because we don't want to prune the tree.

tree = eb.ConstructTree(model=dp_hc, epsilon=0)
tree.fit()
Constructing tree...
<pyemb.hc.ConstructTree at 0x74ee20fbf280>
tree.plot(labels=list(df["target_names"]), colours=target_colour, node_size=25, forceatlas_iter=100)
../_images/newsgroup_33_2.png

References

  • Whiteley, N., Gray, A. and Rubin-Delanchy, P., 2022. Statistical exploration of the Manifold Hypothesis.

  • Gray, A., Modell, A., Rubin-Delanchy, P. and Whiteley, N., 2024. Hierarchical clustering with dot products recovers hidden tree structure. Advances in Neural Information Processing Systems, 36.