20Newsgroup documents
This section demonstrates the package's functionality on text data: creating a matrix of tf-idf features, PCA and hierarchical clustering. We use a sample of the 20Newsgroup data, in which each document is associated with one of 20 newsgroup topics, organized at two hierarchical levels.
Data load
Import data and create dataframe.
import numpy as np
import matplotlib.pyplot as plt
import pyemb as eb

df = eb.load_newsgroup()
Data loaded successfully
eb.text_matrix_and_attributes creates a matrix Y of tf-idf features. It takes a dataframe and the name of the column that contains the text. Further options include removing general stopwords, adding extra stopwords, removing email addresses, cleaning the text (lemmatising, removing symbols and lowercasing) and setting thresholds on the minimum and maximum number of documents a word must appear in to be included.
Y, attributes = eb.text_matrix_and_attributes(df, 'data', remove_stopwords=True, clean_text=True,
                                              remove_email_addresses=True, update_stopwords=['subject'],
                                              min_df=5, max_df=len(df)-1000)
(n,p) = Y.shape
print("n = {}, p = {}".format(n,p))
n = 5000, p = 12804
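To make the roles of min_df and max_df concrete, here is a minimal pure-Python sketch of a textbook tf-idf matrix with document-frequency thresholds. It is an illustration only: the function name tfidf_matrix, the whitespace tokenisation and the unsmoothed log(n / df) weighting are assumptions made for this example, and the weighting and normalisation that eb.text_matrix_and_attributes actually applies may differ.

```python
import math
from collections import Counter

def tfidf_matrix(docs, min_df=1, max_df=None):
    """Textbook tf-idf: term count times log(n / document frequency).

    Hypothetical helper for illustration; not part of pyemb.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    max_df = n if max_df is None else max_df

    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenised for word in set(tokens))

    # Keep only words whose document frequency lies in [min_df, max_df].
    vocab = sorted(w for w, c in df.items() if min_df <= c <= max_df)
    index = {w: j for j, w in enumerate(vocab)}

    Y = [[0.0] * len(vocab) for _ in range(n)]
    for i, tokens in enumerate(tokenised):
        for word, count in Counter(tokens).items():
            if word in index:
                Y[i][index[word]] = count * math.log(n / df[word])
    return Y, vocab

docs = ["the cat sat", "the dog sat", "the cat ran", "the bird flew"]
Y, vocab = tfidf_matrix(docs, min_df=2, max_df=3)
print(vocab)  # ['cat', 'sat']
```

In this toy corpus "the" appears in every document and is dropped by max_df, while words that appear only once are dropped by min_df, which is the same filtering idea as min_df=5 and max_df=len(df)-1000 in the call above.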
Perform dimension selection using Wasserstein distances; see Whiteley et al. (2022) for details.
ws, dim = eb.wasserstein_dimension_select(Y, range(40), split=0.5)
print("Selected dimension: {}".format(dim))
Selected dimension: 28
PCA and tSNE
Now we perform PCA (Whiteley et al., 2022).
zeta = p**-.5 * eb.embed(Y, d=dim, version='full')
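As a rough guide to what the scaled embedding looks like, the sketch below computes a rank-d embedding from a truncated SVD with NumPy. It assumes that version='full' corresponds to scaling the leading left singular vectors by their singular values, with no centering; that reading, and the helper name pca_embed, are assumptions for illustration rather than a description of how eb.embed is implemented.

```python
import numpy as np

def pca_embed(Y, d):
    """Rank-d embedding from a truncated SVD, scaled by p**-0.5.

    Hypothetical helper for illustration; not part of pyemb.
    """
    n, p = Y.shape
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    # Keep the d leading singular directions: U_d * S_d equals Y @ V_d.
    return p ** -0.5 * U[:, :d] * S[:d]

rng = np.random.default_rng(0)
Y_toy = rng.normal(size=(50, 200))   # toy data: n = 50, p = 200
zeta_toy = pca_embed(Y_toy, d=3)
print(zeta_toy.shape)  # (50, 3)
```

The p**-0.5 factor keeps dot products between embedded points on a scale that does not grow with the number of features.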
Apply t-SNE.
from sklearn.manifold import TSNE
tsne_zeta = TSNE(n_components=2, perplexity=30).fit_transform(zeta)
Define a colour dictionary in which topics from the same theme have different shades of the same colour.
target_colour = {'alt.atheism': 'goldenrod',
                 'comp.graphics': 'steelblue',
                 'comp.os.ms-windows.misc': 'skyblue',
                 'comp.sys.ibm.pc.hardware': 'lightblue',
                 'comp.sys.mac.hardware': 'powderblue',
                 'comp.windows.x': 'deepskyblue',
                 'misc.forsale': 'maroon',
                 'rec.autos': 'limegreen',
                 'rec.motorcycles': 'green',
                 'rec.sport.baseball': 'yellowgreen',
                 'rec.sport.hockey': 'olivedrab',
                 'sci.crypt': 'pink',
                 'sci.electronics': 'plum',
                 'sci.med': 'orchid',
                 'sci.space': 'palevioletred',
                 'soc.religion.christian': 'darkgoldenrod',
                 'talk.politics.guns': 'coral',
                 'talk.politics.mideast': 'tomato',
                 'talk.politics.misc': 'darksalmon',
                 'talk.religion.misc': 'gold'}
Plot PCA on the left-hand side and PCA + t-SNE on the right-hand side.
pca_fig = eb.snapshot_plot(
    embedding=[zeta[:, :2], tsne_zeta],
    node_labels=df['target_names'].tolist(),
    c=target_colour,
    title=['PCA', 'tSNE'],
    add_legend=True,
    max_legend_cols=6,
    figsize=(15, 6),
    move_legend=(.5, -.15),
    # tick_labels=True,
    # Apply other matplotlib settings
    s=10,
)
plt.tight_layout()
Hierarchical clustering with dot products (Gray et al., 2024)
First we apply hierarchical clustering (HC) to the centroids of each topic and plot the dendrogram. Then we apply HC to the whole dataset and visualise the output tree.
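For intuition about clustering with dot products, the following naive sketch repeatedly merges the pair of clusters with the largest average dot product between their members, which is the merge rule described by Gray et al. (2024). The function name dot_product_linkage and the toy data are assumptions for illustration; this is not the code behind eb.DotProductAgglomerativeClustering, which is presumably far more efficient than this cubic-time loop.

```python
import numpy as np

def dot_product_linkage(X):
    """Naive agglomerative clustering on average dot products.

    Hypothetical helper for illustration; not part of pyemb.
    Returns a list of (cluster_a, cluster_b, score) merges.
    """
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    next_id = len(X)
    while len(clusters) > 1:
        ids = list(clusters)
        best, best_pair = -np.inf, None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # The average dot product between two clusters equals
                # the dot product of their centroids.
                score = X[clusters[a]].mean(axis=0) @ X[clusters[b]].mean(axis=0)
                if score > best:
                    best, best_pair = score, (a, b)
        a, b = best_pair
        # Merged cluster gets a fresh id, as in a linkage matrix.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, float(best)))
        next_id += 1
    return merges

X_toy = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 0.5], [0.1, 0.5]])
merges = dot_product_linkage(X_toy)
print([m[:2] for m in merges])  # [(0, 1), (2, 3), (4, 5)]
```

Unlike distance-based linkage, larger scores mean more similar clusters, so the most similar pair is merged first and the merge score decreases towards the root in this example.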
On centroids
Find centroids
idxs = [np.where(np.array(df['target']) == t)[0]
        for t in sorted(df['target'].unique())]
t_zeta = np.array([np.mean(zeta[idx, :], axis=0) for idx in idxs])
Topic HC clustering
t_dp_hc = eb.DotProductAgglomerativeClustering()
t_dp_hc.fit(t_zeta);
Plot dendrogram
plt.title("Hierarchical Clustering Dendrogram")
eb.plot_dendrogram(t_dp_hc, dot_product_clustering=True, orientation='left',
                   labels=sorted(df['target_names'].unique()))
plt.show()
On documents
dp_hc = eb.DotProductAgglomerativeClustering()
dp_hc.fit(zeta);
Construct a tree graph from the hierarchical clustering. Here epsilon is set to zero because we do not want to prune the tree.
tree = eb.ConstructTree(model= dp_hc, epsilon=0)
tree.fit()
Constructing tree...
<pyemb.hc.ConstructTree at 0x74ee20fbf280>
tree.plot(labels = list(df["target_names"]), colours = target_colour, node_size=25, forceatlas_iter=100)
100%|██████████| 100/100 [00:11<00:00, 9.00it/s]
BarnesHut Approximation took 6.12 seconds
Repulsion forces took 4.49 seconds
Gravitational forces took 0.04 seconds
Attraction forces took 0.03 seconds
AdjustSpeedAndApplyForces step took 0.20 seconds
References
Whiteley, N., Gray, A. and Rubin-Delanchy, P., 2022. Statistical exploration of the Manifold Hypothesis.
Gray, A., Modell, A., Rubin-Delanchy, P. and Whiteley, N., 2024. Hierarchical clustering with dot products recovers hidden tree structure. Advances in Neural Information Processing Systems, 36.