20Newsgroup documents
=====================

This section applies the package's functionality to text data. This includes
creating a matrix of tf-idf features, PCA and hierarchical clustering.

We demonstrate on a sample of the 20Newsgroup data. Each document is associated
with 1 of 20 newsgroup topics, organized at two hierarchical levels.

Data load
---------

Import the data and create a dataframe.

.. code:: ipython3

    df = eb.load_newsgroup()

.. parsed-literal::

    Data loaded successfully

``eb.text_matrix_and_attributes`` creates a matrix ``Y`` of tf-idf features. It
takes a dataframe and the name of the column that contains the text. Further
options include:

- removing general stopwords,
- adding extra stopwords,
- removing email addresses,
- cleaning the text (lemmatizing, removing symbols and lowercasing),
- setting a threshold on the minimum and maximum number of documents a word
  must appear in to be included.

.. code:: ipython3

    Y, attributes = eb.text_matrix_and_attributes(df, 'data',
                                                  remove_stopwords=True,
                                                  clean_text=True,
                                                  remove_email_addresses=True,
                                                  update_stopwords=['subject'],
                                                  min_df=5,
                                                  max_df=len(df)-1000)

.. code:: ipython3

    (n,p) = Y.shape
    print("n = {}, p = {}".format(n,p))

.. parsed-literal::

    n = 5000, p = 12804

Perform dimension selection using Wasserstein distances; see Whiteley et al.,
2022 for details.

.. code:: ipython3

    ws, dim = eb.wasserstein_dimension_select(Y, range(40), split=0.5)

.. code:: ipython3

    print("Selected dimension: {}".format(dim))

.. parsed-literal::

    Selected dimension: 28

PCA and tSNE
------------

Now we perform PCA (Whiteley et al., 2022).

.. code:: ipython3

    zeta = p**-.5 * eb.embed(Y, d=dim, version='full')

Apply t-SNE.

.. code:: ipython3

    from sklearn.manifold import TSNE

    tsne_zeta = TSNE(n_components=2, perplexity=30).fit_transform(zeta)

Define a colour dictionary in which topics from the same theme have different
shades of the same colour.

.. code:: ipython3

    target_colour = {'alt.atheism': 'goldenrod',
                     'comp.graphics': 'steelblue',
                     'comp.os.ms-windows.misc': 'skyblue',
                     'comp.sys.ibm.pc.hardware': 'lightblue',
                     'comp.sys.mac.hardware': 'powderblue',
                     'comp.windows.x': 'deepskyblue',
                     'misc.forsale': 'maroon',
                     'rec.autos': 'limegreen',
                     'rec.motorcycles': 'green',
                     'rec.sport.baseball': 'yellowgreen',
                     'rec.sport.hockey': 'olivedrab',
                     'sci.crypt': 'pink',
                     'sci.electronics': 'plum',
                     'sci.med': 'orchid',
                     'sci.space': 'palevioletred',
                     'soc.religion.christian': 'darkgoldenrod',
                     'talk.politics.guns': 'coral',
                     'talk.politics.mideast': 'tomato',
                     'talk.politics.misc': 'darksalmon',
                     'talk.religion.misc': 'gold'}

Plot PCA on the left-hand side and PCA + t-SNE on the right-hand side.

.. code:: ipython3

    import matplotlib.pyplot as plt

    pca_fig = eb.snapshot_plot(
        embedding = [zeta[:, :2],tsne_zeta],
        node_labels = df['target_names'].tolist(),
        c = target_colour,
        title = ['PCA','tSNE'],
        add_legend=True,
        max_legend_cols = 6,
        figsize = (15,6),
        move_legend = (.5,-.15),
        # tick_labels = True,
        # Apply other matplotlib settings
        s=10,
    )
    plt.tight_layout()

.. image:: newsgroup_files/newsgroup_19_0.png

Hierarchical clustering with dot products (Gray et al., 2024)
-------------------------------------------------------------

First we perform hierarchical clustering (HC) on the centroids of each topic
and plot the dendrogram. Then we perform HC on the whole dataset and visualise
the output tree.

On centroids
------------

Find the centroid of each topic.

.. code:: ipython3

    import numpy as np

    idxs = [np.where(np.array(df['target']) == t)[0] for t in sorted(df['target'].unique())]
    t_zeta = np.array([np.mean(zeta[idx, :], axis=0) for idx in idxs])

Cluster the topic centroids.

.. code:: ipython3

    t_dp_hc = eb.DotProductAgglomerativeClustering()
    t_dp_hc.fit(t_zeta);

Plot the dendrogram.

.. code:: ipython3

    plt.title("Hierarchical Clustering Dendrogram")
    eb.plot_dendrogram(t_dp_hc, dot_product_clustering=True, orientation='left',
                       labels=sorted(df['target_names'].unique()))
    plt.show()

.. image:: newsgroup_files/newsgroup_28_0.png

On documents
------------

.. code:: ipython3

    dp_hc = eb.DotProductAgglomerativeClustering()
    dp_hc.fit(zeta);

Construct a tree graph from the hierarchical clustering. ``epsilon`` is set to
zero because we do not want to prune the tree.

.. code:: ipython3

    tree = eb.ConstructTree(model=dp_hc, epsilon=0)
    tree.fit()

.. parsed-literal::

    Constructing tree...

.. code:: ipython3

    tree.plot(labels = list(df["target_names"]), colours = target_colour,
              node_size=25, forceatlas_iter=100)

.. parsed-literal::

    100%|██████████| 100/100 [00:11<00:00,  9.00it/s]

.. parsed-literal::

    BarnesHut Approximation  took  6.12  seconds
    Repulsion forces  took  4.49  seconds
    Gravitational forces  took  0.04  seconds
    Attraction forces  took  0.03  seconds
    AdjustSpeedAndApplyForces step  took  0.20  seconds

.. image:: newsgroup_files/newsgroup_33_2.png

References
----------

- Whiteley, N., Gray, A. and Rubin-Delanchy, P., 2022. Statistical exploration
  of the Manifold Hypothesis.
- Gray, A., Modell, A., Rubin-Delanchy, P. and Whiteley, N., 2024. Hierarchical
  clustering with dot products recovers hidden tree structure. Advances in
  Neural Information Processing Systems, 36.
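
As a supplementary illustration of the tf-idf step used in the data load section, the following is a minimal standard-library sketch of how a tf-idf matrix with document-frequency thresholds can be built. It is not the package's implementation: the function name ``tfidf_matrix``, the toy documents, the smoothed idf formula and the row normalisation are all assumptions chosen for illustration (the idf convention matches scikit-learn's default), and ``min_df`` / ``max_df`` are treated here as absolute document counts.

```python
import math
from collections import Counter


def tfidf_matrix(docs, min_df=1, max_df=None):
    """Hypothetical helper: tf-idf rows for a list of tokenised documents."""
    n = len(docs)
    max_df = n if max_df is None else max_df

    # Document frequency: number of documents in which each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))

    # Keep only words whose document frequency lies in [min_df, max_df].
    vocab = sorted(w for w, c in doc_freq.items() if min_df <= c <= max_df)

    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Assumed weighting: raw term count times smoothed idf.
        row = [counts[w] * (math.log((1 + n) / (1 + doc_freq[w])) + 1)
               for w in vocab]
        # Scale each document to unit Euclidean length.
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm if norm else 0.0 for x in row])
    return rows, vocab


# Toy corpus (made up for this sketch, not taken from the newsgroup data).
docs = [
    "the rocket reached orbit".split(),
    "the team won the hockey game".split(),
    "the hockey season starts".split(),
]
Y, vocab = tfidf_matrix(docs, min_df=1, max_df=2)
print(len(Y), len(vocab))
```

With ``max_df=2`` the word "the", which occurs in all three toy documents, is dropped from the vocabulary, which mirrors the role of ``max_df=len(df)-1000`` in filtering out words that appear in nearly every document.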