20Newsgroup documents
=====================
This section demonstrates the package's functionality on text data. This
includes creating a matrix of tf-idf features, PCA and hierarchical
clustering. We demonstrate these steps on a sample of the
`20Newsgroup data `__. Each
document is associated with one of 20 newsgroup topics, organized at two
hierarchical levels.
Data load
---------
Import data and create dataframe.
.. code:: ipython3
import pyemb as eb  # assumption: the package is imported under the alias eb
df = eb.load_newsgroup()
.. parsed-literal::
Data loaded successfully
``eb.text_matrix_and_attributes`` creates a matrix ``Y`` of tf-idf
features. It takes a dataframe and the name of the column that contains
the text. Optional arguments remove general stopwords, add extra
stopwords, remove email addresses, clean the text (lemmatize, remove
symbols and convert to lowercase) and set the minimum and maximum number
of documents a word must appear in to be included.
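To make the role of the document-frequency thresholds concrete, here is a minimal pure-Python sketch of a tf-idf matrix with ``min_df`` and ``max_df`` filtering. The helper ``tfidf_matrix`` is hypothetical and is not the package's implementation; it uses an unsmoothed ``idf = log(n / df)`` and whitespace tokenisation, both of which are simplifying assumptions.

```python
import math
from collections import Counter

def tfidf_matrix(docs, min_df=1, max_df=None, stopwords=()):
    """Hypothetical helper: return (matrix, vocabulary) for raw text documents."""
    stop = set(stopwords)
    # Lowercase, split on whitespace and drop stopwords.
    tokenised = [[w for w in doc.lower().split() if w not in stop] for doc in docs]
    n = len(tokenised)
    if max_df is None:
        max_df = n
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(w for toks in tokenised for w in set(toks))
    # Keep only words whose document frequency lies in [min_df, max_df].
    vocab = sorted(w for w, c in doc_freq.items() if min_df <= c <= max_df)
    index = {w: j for j, w in enumerate(vocab)}
    matrix = []
    for toks in tokenised:
        counts = Counter(w for w in toks if w in index)
        row = [0.0] * len(vocab)
        for w, c in counts.items():
            # Term count times unsmoothed inverse document frequency.
            row[index[w]] = c * math.log(n / doc_freq[w])
        matrix.append(row)
    return matrix, vocab

docs = ["the cat sat", "the dog sat", "the bird flew"]
Y_toy, vocab = tfidf_matrix(docs, min_df=1, max_df=2)
print(vocab)  # ['bird', 'cat', 'dog', 'flew', 'sat']
```

In this toy example the word "the" appears in all three documents, so ``max_df=2`` removes it, much as an upper threshold removes very common words from the newsgroup vocabulary.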
.. code:: ipython3
Y, attributes = eb.text_matrix_and_attributes(df, 'data', remove_stopwords=True, clean_text=True,
remove_email_addresses=True, update_stopwords=['subject'],
min_df=5, max_df=len(df)-1000)
.. code:: ipython3
(n,p) = Y.shape
print("n = {}, p = {}".format(n,p))
.. parsed-literal::
n = 5000, p = 12804
Perform dimension selection using Wasserstein distances, see `Whiteley
et al., 2022 `__ for details.
.. code:: ipython3
ws, dim = eb.wasserstein_dimension_select(Y, range(40), split=0.5)
.. code:: ipython3
print("Selected dimension: {}".format(dim))
.. parsed-literal::
Selected dimension: 28
PCA and tSNE
------------
Now we perform PCA with the selected dimension, following
`Whiteley et al., 2022 `__. The embedding is scaled by ``p**-.5``.
.. code:: ipython3
zeta = p**-.5 * eb.embed(Y, d=dim, version='full')
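As a rough illustration of this step, the sketch below computes an uncentred PCA embedding from a truncated SVD using NumPy and applies the same ``p**-.5`` scaling. The function ``svd_embed`` and the toy data are hypothetical; whether ``eb.embed`` centres the data or uses exactly this convention is an assumption that we have not confirmed here.

```python
import numpy as np

def svd_embed(Y, d):
    """Hypothetical helper: project the rows of Y onto the top-d right singular vectors."""
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    # U[:, :d] * S[:d] equals Y @ Vt[:d].T, the coordinates in the top-d subspace.
    return U[:, :d] * S[:d]

rng = np.random.default_rng(0)
Y_toy = rng.normal(size=(50, 8))      # toy data: 50 points with 8 features
n_toy, p_toy = Y_toy.shape
zeta_toy = p_toy ** -0.5 * svd_embed(Y_toy, d=3)
print(zeta_toy.shape)  # (50, 3)
```

With this scaling, dot products between embedded points approximate the entries of ``Y @ Y.T / p``, so they behave like averages over the features rather than sums.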
Apply t-SNE.
.. code:: ipython3
from sklearn.manifold import TSNE
tsne_zeta = TSNE(n_components=2, perplexity=30).fit_transform(zeta)
Define a colour dictionary in which topics from the same theme are
assigned different shades of the same colour.
.. code:: ipython3
target_colour = {'alt.atheism': 'goldenrod',
'comp.graphics': 'steelblue',
'comp.os.ms-windows.misc': 'skyblue',
'comp.sys.ibm.pc.hardware': 'lightblue',
'comp.sys.mac.hardware': 'powderblue',
'comp.windows.x': 'deepskyblue',
'misc.forsale': 'maroon',
'rec.autos': 'limegreen',
'rec.motorcycles': 'green',
'rec.sport.baseball': 'yellowgreen',
'rec.sport.hockey': 'olivedrab',
'sci.crypt': 'pink',
'sci.electronics': 'plum',
'sci.med': 'orchid',
'sci.space': 'palevioletred',
'soc.religion.christian': 'darkgoldenrod',
'talk.politics.guns': 'coral',
'talk.politics.mideast': 'tomato',
'talk.politics.misc': 'darksalmon',
'talk.religion.misc': 'gold'}
Plot the first two PCA dimensions on the left and the t-SNE projection
of the PCA embedding on the right.
.. code:: ipython3
import matplotlib.pyplot as plt
pca_fig = eb.snapshot_plot(
embedding = [zeta[:, :2],tsne_zeta],
node_labels = df['target_names'].tolist(),
c = target_colour,
title = ['PCA','tSNE'],
add_legend=True,
max_legend_cols = 6,
figsize = (15,6),
move_legend = (.5,-.15),
# tick_labels = True,
# Apply other matplotlib settings
s=10,
)
plt.tight_layout()
.. image:: newsgroup_files/newsgroup_19_0.png
Hierarchical clustering with dot products, `Gray et al., 2024 `__
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
First we apply hierarchical clustering (HC) to the centroids of each
topic and plot the dendrogram. Then we apply HC to the whole dataset and
visualise the output tree.
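The idea of clustering with dot products can be sketched in a few lines of NumPy: instead of merging the two closest clusters, each step merges the pair with the largest average dot product between their members. The function ``dot_product_linkage`` below is a hypothetical, unoptimised illustration under that assumption, not the implementation behind ``eb.DotProductAgglomerativeClustering``.

```python
import numpy as np

def dot_product_linkage(X):
    """Hypothetical helper: greedily merge the pair of clusters with the largest
    average pairwise dot product. Returns a list of (a, b, score) merges."""
    G = X @ X.T                                  # Gram matrix of dot products
    clusters = {i: [i] for i in range(len(X))}   # cluster id -> member row indices
    next_id = len(X)
    merges = []
    while len(clusters) > 1:
        best = None
        ids = list(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # Average dot product between members of a and members of b.
                score = G[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or score > best[2]:
                    best = (a, b, score)
        a, b, score = best
        # The merged cluster gets a new id, as in a linkage matrix.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, float(score)))
        next_id += 1
    return merges

X_toy = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]])
merges = dot_product_linkage(X_toy)
print([m[:2] for m in merges])  # [(2, 3), (0, 1), (4, 5)]
```

In the toy data, rows 2 and 3 have the largest dot product and are merged first, then rows 0 and 1, and finally the two resulting clusters. This brute-force search is only meant to show the merge rule and would be too slow for thousands of documents.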
On centroids
------------
Find the centroid of each topic in the embedding.
.. code:: ipython3
import numpy as np
idxs = [np.where(np.array(df['target']) == t)[0]
for t in sorted(df['target'].unique())]
t_zeta = np.array([np.mean(zeta[idx, :], axis=0) for idx in idxs])
Hierarchical clustering of the topic centroids.
.. code:: ipython3
t_dp_hc = eb.DotProductAgglomerativeClustering()
t_dp_hc.fit(t_zeta);
Plot the dendrogram.
.. code:: ipython3
plt.title("Hierarchical Clustering Dendrogram")
eb.plot_dendrogram(t_dp_hc, dot_product_clustering=True, orientation='left',
labels=sorted(df['target_names'].unique()))
plt.show()
.. image:: newsgroup_files/newsgroup_28_0.png
On documents
------------
.. code:: ipython3
dp_hc = eb.DotProductAgglomerativeClustering()
dp_hc.fit(zeta);
Construct a tree graph from the hierarchical clustering. Here
``epsilon`` is set to zero because we do not want to prune the tree.
.. code:: ipython3
tree = eb.ConstructTree(model= dp_hc, epsilon=0)
tree.fit()
.. parsed-literal::
Constructing tree...
.. code:: ipython3
tree.plot(labels = list(df["target_names"]), colours = target_colour, node_size=25, forceatlas_iter=100)
.. parsed-literal::
BarnesHut Approximation took 6.12 seconds
Repulsion forces took 4.49 seconds
Gravitational forces took 0.04 seconds
Attraction forces took 0.03 seconds
AdjustSpeedAndApplyForces step took 0.20 seconds
.. image:: newsgroup_files/newsgroup_33_2.png
References
----------
- Whiteley, N., Gray, A. and Rubin-Delanchy, P., 2022. Statistical
exploration of the Manifold Hypothesis.
- Gray, A., Modell, A., Rubin-Delanchy, P. and Whiteley, N., 2024.
Hierarchical clustering with dot products recovers hidden tree
structure. Advances in Neural Information Processing Systems, 36.