Machine Learning

Spectral Clustering Explained: How Eigenvectors Reveal Complex Clustering Structures

Eigenvalues and eigenvectors are important concepts in linear algebra that also play an important role in data science and machine learning. Earlier, we discussed how dimensionality reduction can be done with the eigenvalues and eigenvectors of the covariance matrix.

Today, we will discuss another interesting application: how eigenvalues and eigenvectors can be used to perform spectral clustering, which works well with complex cluster structures.

In this article, we will explore how eigenvalues and eigenvectors make spectral clustering possible and why this method can outperform conventional K-means.

We will start with a simple visualization that shows the importance of spectral clustering and encourages you to keep reading to learn how spectral clustering can be done with eigenvalues and eigenvectors.

Motivation for Spectral Clustering

A good way to learn spectral clustering is to compare it with a standard clustering algorithm like K-means on a dataset where K-means struggles to perform well.

Here, we use a simulated two-moons dataset where the clusters are curved. Scikit-learn's make_moons function generates two interleaving half-moons in 2-dimensional space. Then, we use Scikit-learn's KMeans and SpectralClustering classes to apply the K-means and spectral clustering algorithms. Finally, we compare the resulting clusters.

Making the moon data

# Make moon data
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=400, noise=0.05,
                  random_state=0)

plt.figure(figsize=[4.2, 3])
plt.scatter(X[:,0], X[:,1], s=20)
plt.title("Original Moon Data")
plt.savefig("Moon data.png")
Original moon data (Image by author)

The original dataset has two curved cluster structures that look like moons. That's why we call it moon data.

Applying K-means to the moon data

# Apply K-means
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0)

# Predict cluster index for each data point
labels_kmeans = kmeans.fit_predict(X)

# Visualize Clusters
plt.figure(figsize=[4.2, 3])
plt.scatter(X[:,0], X[:,1], c=labels_kmeans, s=20)
plt.title("K-Means Clustering")
plt.savefig("K-means.png")
K-means clustering of the moon data (Image by author)

K-means clusters the moon data incorrectly: many data points are assigned to the wrong moon.

Applying spectral clustering to the moon data

# Apply spectral clustering
from sklearn.cluster import SpectralClustering

spectral = SpectralClustering(n_clusters=2,
                              affinity='nearest_neighbors',
                              random_state=0)

# Predict cluster index for each data point
labels_spectral = spectral.fit_predict(X)

# Visualize Clusters
plt.figure(figsize=[4.2, 3])
plt.scatter(X[:,0], X[:,1], c=labels_spectral, s=20)
plt.title("Spectral Clustering")
plt.savefig("Spectral.png")
Spectral clustering of the moon data (Image by author)

Now the data points are correctly assigned to the two moons, matching the original data. Spectral clustering works well here because the eigenvectors of the Laplacian matrix allow it to recognize complex cluster structures.

So far, we have performed spectral clustering using the built-in SpectralClustering algorithm in Scikit-learn. Next, you will learn how to implement spectral clustering from scratch. This will help you understand how eigenvalues and eigenvectors work behind the scenes in the algorithm.

What is Spectral Clustering?

Spectral clustering groups data points based on their similarity instead of raw distances. This allows it to detect non-linear, complex cluster structures without relying on the assumptions of plain K-means (roughly spherical, evenly sized clusters).

The intuition behind spectral clustering is as follows:

Steps to perform spectral clustering

  1. Get the data
  2. Build the similarity matrix
  3. Construct a degree matrix
  4. Construct the Laplacian matrix (graph Laplacian)
  5. Find the eigenvalues and eigenvectors of the Laplacian matrix. The eigenvectors reveal the cluster structure (how the data points are connected) and act as new features, while the eigenvalues indicate the strength of cluster separation.
  6. Select the most important eigenvectors to embed the data in lower dimensions (dimensionality reduction)
  7. Apply K-means to the new feature space (clustering)

Spectral clustering combines dimensionality reduction and K-means clustering. We embed the data in a low-dimensional space (where the clusters are easy to separate) and create K clusters in the new feature space. Briefly, K-means clustering works on the original feature space, while spectral clustering works on the new, reduced feature space.

Using Spectral Clustering – Step by Step

We have summarized the steps for performing spectral clustering with the eigenvalues and eigenvectors of the Laplacian matrix. Let's implement these steps in Python.

1. Get the data

We will use the same data that was used before.

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=400, noise=0.05,
                  random_state=0)

2. Construct a similarity (affinity) matrix

Spectral clustering groups data points based on their similarity. Therefore, we need to measure the similarity between the data points and enter these values into a matrix. This matrix is called the similarity matrix (W). Here, we measure similarity using a Gaussian kernel.

If you have n data points, the shape of W is (n, n). Each value represents the similarity between two data points. The higher a value in the matrix, the more similar the corresponding pair of points.

from sklearn.metrics.pairwise import rbf_kernel

W = rbf_kernel(X, gamma=20)
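
As a quick sanity check (a sketch added here, not part of the original walkthrough), we can verify the key properties of W: it is (n, n), symmetric, and each point has maximal similarity 1.0 to itself under the Gaussian kernel.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

# Rebuild the data and similarity matrix so this snippet is self-contained
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
W = rbf_kernel(X, gamma=20)

print(W.shape)                       # (400, 400)
print(np.allclose(W, W.T))           # True: similarity is symmetric
print(np.allclose(np.diag(W), 1.0))  # True: exp(-gamma * 0) = 1 on the diagonal
```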

3. Construct a degree matrix

The degree matrix (D) contains the total similarity of each node. It is a diagonal matrix, and each diagonal value indicates the total similarity of that point to all other points. All off-diagonal components are zero. The shape of the degree matrix is also (n, n).

import numpy as np

D = np.diag(np.sum(W, axis=1))

np.sum(W, axis=1) sums each row of the similarity matrix.
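
A short check (my addition, as a sketch) confirms that D is diagonal and that each diagonal entry equals the corresponding row sum of W:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
W = rbf_kernel(X, gamma=20)
D = np.diag(np.sum(W, axis=1))

# Off-diagonal entries of D are all zero
off_diagonal = D - np.diag(np.diag(D))
print(np.count_nonzero(off_diagonal))          # 0
# Each diagonal entry is that point's total similarity to all points
print(np.allclose(np.diag(D), W.sum(axis=1)))  # True
```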

4. Construct the Laplacian matrix

The Laplacian matrix (L) represents the structure of a similarity graph, where nodes represent the data points and edges connect similar points. Therefore, this matrix is also called the graph Laplacian, and it is defined as follows.

L = D − W

Calculating the Laplacian matrix (Image by author)

In Python, this is simply:

L = D - W

Defining L as D − W mathematically ensures that spectral clustering finds groups of data points that are strongly connected within a group but weakly connected to other groups.

The Laplacian matrix (L) is also an (n, n) square matrix. This is important since eigendecomposition is defined only for square matrices.
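
Two properties of L are worth checking numerically (a sketch of my own, not from the original article): because L = D − W, every row of L sums to zero, and L inherits symmetry from W. The zero row sums are why the constant vector is always an eigenvector of L with eigenvalue 0.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
W = rbf_kernel(X, gamma=20)
D = np.diag(W.sum(axis=1))
L = D - W

# Row sums are zero: the diagonal degree cancels the row of W
print(np.allclose(L.sum(axis=1), 0))  # True
# L is symmetric because W is symmetric
print(np.allclose(L, L.T))            # True
```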

5. Eigendecomposition of the Laplacian matrix

Eigendecomposition of the Laplacian matrix is the process of factorizing that matrix into its eigenvalues and eigenvectors [ref: Eigendecomposition of a Covariance Matrix with NumPy].

If the Laplacian matrix (L) has n eigenvectors, we can decompose it as:

L = XΛX⁻¹

Eigendecomposition of L (Image by author)

Where:

  • X = matrix of eigenvectors
  • Λ = diagonal matrix of eigenvalues

Matrices X and Λ can be represented as:

X = [x1 x2 ... xn],  Λ = diag(λ1, λ2, ..., λn)

Matrices of eigenvectors and eigenvalues (Image by author)

Vectors x1, x2, ..., xn are the eigenvectors, and λ1, λ2, ..., λn are the corresponding eigenvalues.

Eigenvalues and eigenvectors come in pairs. Such a pair is known as an eigenpair. So, the matrix L can have many eigenpairs [ref: Eigendecomposition of a Covariance Matrix with NumPy].

The following eigenvalue equation shows the relationship between L and one of its eigenpairs.

Lx = λx

Eigenvalue equation of L (Image by author)

Where:

  • L = Laplacian matrix (must be a square matrix)
  • x = eigenvector
  • λ = eigenvalue (scaling factor)

Let's calculate all the eigenpairs of the Laplacian matrix.

eigenvalues, eigenvectors = np.linalg.eigh(L)
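
We can verify numerically that eigh() returns genuine eigenpairs, i.e. that Lx = λx holds (a sanity check of my own, not part of the original article):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
W = rbf_kernel(X, gamma=20)
L = np.diag(W.sum(axis=1)) - W

eigenvalues, eigenvectors = np.linalg.eigh(L)

# Check the eigenvalue equation L @ x = lambda * x for the second eigenpair
x, lam = eigenvectors[:, 1], eigenvalues[1]
print(np.allclose(L @ x, lam * x))             # True
# eigh() sorts the eigenvalues in ascending order
print(np.all(np.diff(eigenvalues) >= -1e-12))  # True
```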

6. Select the most important eigenvectors

In spectral clustering, the algorithm uses the eigenvectors corresponding to the smallest eigenvalues of the Laplacian matrix. Therefore, we need to select those columns from the eigenvectors matrix.

The smallest eigenvalues identify the most important eigenvectors. The eigh() function returns eigenvalues and eigenvectors in ascending order of eigenvalue. Therefore, we need to look at the first few values of the eigenvalues vector.

print(eigenvalues)
The first few eigenvalues of L (Image by author)

We pay attention to the differences between successive eigenvalues. Such a difference is known as an eigengap. We look for the point where the eigengap is largest; it indicates the number of clusters. This method is called the eigengap heuristic.

According to the eigengap heuristic, the total number of clusters k is chosen at the point where a large jump occurs between successive eigenvalues.

If there are k very small eigenvalues, there are likely k clusters! In our example, the first two small eigenvalues suggest two clusters, which is exactly what we expect. This is the role of eigenvalues in spectral clustering: they help determine the number of clusters and identify the most important eigenvectors.
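
The eigengap heuristic described above can be sketched in a few lines (the variable names here are my own): compute the gaps between successive sorted eigenvalues and pick k at the largest jump among the first few.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
W = rbf_kernel(X, gamma=20)
L = np.diag(W.sum(axis=1)) - W
eigenvalues, _ = np.linalg.eigh(L)

# Gaps between successive eigenvalues among the first 10 (ascending order)
gaps = np.diff(eigenvalues[:10])
# The largest gap after the first k eigenvalues suggests k clusters
k = int(np.argmax(gaps)) + 1
print(k)
```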

We choose the first two eigenvectors corresponding to these minimum eigenvalues.

k = 2
U = eigenvectors[:, :k]
U contains the two selected eigenvectors (Image by author)

These two eigenvectors in the matrix U define a new feature space called the spectral embedding, where the clusters are linearly separable. Here is a visualization of the spectral embedding.

import matplotlib.pyplot as plt

plt.figure(figsize=[4.2, 3])
plt.scatter(U[:,0], U[:,1], s=20)
plt.title("Spectral Embedding")
plt.xlabel("Eigenvector 1")
plt.ylabel("Eigenvector 2")
plt.savefig("Spectral embedding.png")
Visualization of the spectral embedding (Image by author)

This plot shows how the eigenvectors transform the original data into a new space where the clusters are linearly separable.

7. Apply K-means to spectral embedding

Now, we can simply apply K-means to the spectral embedding (the new eigenvector space) to find the cluster labels and then assign those labels to the original data to form the clusters. K-means works well here because the clusters are separated linearly in the new eigenvector space.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=k)
labels_spectral = kmeans.fit_predict(U)
# U represents spectral embedding

plt.figure(figsize=[4.2, 3])
# Assign cluster labels to original data
plt.scatter(X[:,0], X[:,1], c=labels_spectral, s=20)
plt.title("Spectral Clustering")
plt.savefig("Spectral Manual.png")
Spectral clustering from eigendecomposition (Image by author)

This is similar to what we found with the built-in Scikit-learn version!

Choosing the right Gamma value

When creating the similarity matrix by measuring similarity with a Gaussian kernel, we need to choose an appropriate value for the gamma hyperparameter, which controls how quickly the similarity decreases with the distance between data points.

from sklearn.metrics.pairwise import rbf_kernel

W = rbf_kernel(X, gamma=?)

At small gamma values, the similarity decreases slowly with distance, and more points appear similar. This can merge distinct clusters and produce incorrect cluster assignments.

At large gamma values, the similarity decreases very quickly, and only very close points are connected. This can fragment the clusters into many small, disconnected pieces.

For mid-range values, you'll get balanced, well-separated clusters.

It is better to try several values, such as 0.1, 0.5, 1, 5, 10, 15, and visualize the clustering results to choose the best one.
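
As an illustration (a sketch using Scikit-learn's built-in SpectralClustering with the 'rbf' affinity; since this is simulated data, the true labels y are available), we can also score each candidate gamma with the adjusted Rand index instead of eyeballing plots:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

scores = {}
for gamma in [0.1, 0.5, 1, 5, 10, 15]:
    labels = SpectralClustering(n_clusters=2, affinity='rbf',
                                gamma=gamma,
                                random_state=0).fit_predict(X)
    # 1.0 means a perfect match with the true moon labels
    scores[gamma] = adjusted_rand_score(y, labels)

for gamma, score in scores.items():
    print(gamma, round(score, 3))
```

On real data the true labels are unknown, so visual inspection of the clusters (as the article suggests) remains the practical approach.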

Closing Thoughts

In spectral clustering, the dataset is represented as a graph instead of a collection of points. In that graph, each data point is a node, and the lines (edges) between the nodes describe how similar two points are.

Moon dataset as a graph (Image by author)

The spectral clustering algorithm requires this graph representation in mathematical form. That's why we created the affinity matrix (W). Each value in that matrix measures the similarity between two data points. Larger values in the matrix mean two points are more similar, while smaller values mean two points are more different.

Next, we create a degree matrix (D), which is a diagonal matrix where each diagonal value shows the total similarity of that point to all other points.

Using the degree matrix and the similarity matrix, we construct the graph Laplacian matrix, which captures the structure of the graph and is central to spectral clustering.

We calculated the eigenvalues and eigenvectors of the Laplacian matrix. The eigenvalues help choose the number of clusters and identify the most important eigenvectors; they also indicate the strength of cluster separation. The eigenvectors reveal the cluster structure (the cluster boundaries, or how the data points are connected) and define a new feature space in which points that are strongly connected in the graph lie close together. Clusters become easier to distinguish, and K-means works well in the new space.

Here is the complete workflow of spectral clustering.

Dataset → Similarity graph → Graph Laplacian → Eigenvectors → Clusters
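
The whole workflow above can be collected into one small function (my assembly of the article's steps; the function name and defaults are my own choices):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

def spectral_clustering(X, k, gamma=20, random_state=0):
    """From-scratch spectral clustering: similarity graph -> graph
    Laplacian -> eigenvectors -> K-means on the spectral embedding."""
    W = rbf_kernel(X, gamma=gamma)        # similarity (affinity) matrix
    D = np.diag(W.sum(axis=1))            # degree matrix
    L = D - W                             # unnormalized graph Laplacian
    _, eigenvectors = np.linalg.eigh(L)   # eigenvalues in ascending order
    U = eigenvectors[:, :k]               # spectral embedding
    return KMeans(n_clusters=k, n_init=10,
                  random_state=random_state).fit_predict(U)

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
labels = spectral_clustering(X, k=2)
print(labels.shape, np.unique(labels))    # (400,) [0 1]
```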


This is the end of today's article.

Please let me know if you have any questions or feedback.

See you in the next article. Happy learning!

Designed and written by:
Rukshan Pramoditha

2025–03–08

