Curse of Dimensionality: An Intuitive Exploration



Salih Salih

Towards Data Science
Photo by Mathew Schwartz on Unsplash

In the previous article, we discussed the surprising behavior of data in higher dimensions. We found that volume tends to accumulate in the corners of spaces in a strange way, and we simulated a hypersphere inscribed inside a hypercube to investigate this, observing an interesting decrease in their volume ratio as the dimensions grew. Examples that demonstrated the advantages of multi-dimensional thinking were the DVD-paper experiment and the kernel trick in support vector machines (SVMs).

Today, we will be discussing some of the difficult aspects of high-dimensional data, known collectively as the curse of dimensionality. Our goal is to build an intuitive understanding of this concept and its practical implications. The diagram below outlines how the article is structured.

Image by Author

Understanding the Curse of Dimensionality

“Curse of dimensionality” is a term that was first used by Richard E. Bellman back in the 1960s. It began as Bellman’s idea from dynamic optimization and turned out to be a fundamental concept for understanding complexity in high-dimensional spaces.

Good, but what is the “curse of dimensionality”?

At its core, it refers to the difficulties and unique characteristics one faces when working with data in high-dimensional spaces (in our case, this means having many features, columns, or attributes in a dataset). These spaces go far beyond our everyday experience of life in three-dimensional space.

When we increase the number of dimensions in a dataset, the volume it occupies expands exponentially. This might initially appear to be an advantage: more space could mean more data and perhaps more insights? However, that is not the case, because having many dimensions comes with a number of challenges that change how we need to handle and understand this high-dimensional data.

The shift from low-dimensional to high-dimensional data brings several harsh challenges. Two stand out because they have the most significant effects: 1) data sparsity and 2) issues with distance metrics. Each of them makes analysis in higher dimensions far more complicated.

Data Sparsity: Islands in an Ocean of Emptiness

Data sparsity in high-dimensional spaces is like a few small islands lost in a vast ocean. When dimensionality increases, data points that were close together in lower dimensions become increasingly separated. This is because the volume of the space expands exponentially with each added dimension. Just imagine a cube becoming a hypercube; its corners move further away from its center, making it emptier inside. This growing emptiness is what we refer to as data sparsity.
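
As a quick back-of-the-envelope check (a small addition to make the corner intuition concrete), consider the unit hypercube $[0,1]^d$ centered at $(\tfrac{1}{2}, \ldots, \tfrac{1}{2})$:

\text{center-to-corner distance} = \sqrt{d \cdot \left(\tfrac{1}{2}\right)^2} = \frac{\sqrt{d}}{2}, \qquad \text{center-to-nearest-face distance} = \frac{1}{2}

In 100 dimensions the corners sit 5 units away from the center while the faces remain only 0.5 away, so an ever-larger share of the cube lies far from its center.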

Many data analysis techniques struggle with sparsity. For example, many clustering algorithms depend on closely located data points to form meaningful clusters. However, when data points become too dispersed, these algorithms face difficulties.
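
Here is a minimal sketch of this effect (an illustrative addition, not from the original article; the function name fraction_within_radius and the chosen radius are assumptions). It samples points uniformly in the unit hypercube and estimates the fraction that falls within a fixed radius of the center, which collapses toward zero as the dimension grows:

import numpy as np

def fraction_within_radius(dimensions, num_points=10_000, radius=0.5):
    # Sample points uniformly in the unit hypercube and measure their distance to the center
    points = np.random.uniform(0, 1, (num_points, dimensions))
    center = np.full(dimensions, 0.5)
    distances = np.linalg.norm(points - center, axis=1)
    return np.mean(distances <= radius)

for dim in [1, 2, 3, 10, 50, 100]:
    print(f"{dim:>3}D: fraction within radius 0.5 of center = {fraction_within_radius(dim):.4f}")

A fixed-size neighborhood that captures everything in 1D captures essentially nothing by 50D, which is exactly why neighborhood-based methods start to struggle.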

Distance Metric Problems: When Proximity Loses Meaning

In high-dimensional spaces, distance metrics encounter significant challenges. Metrics like Euclidean or Manhattan distance, which are useful for measuring proximity between data points in lower dimensions, lose their effectiveness. In these expanded spaces, distances begin to converge. This means that most pairs of points become nearly equidistant from each other and from a reference point. This convergence makes it harder to distinguish between close neighbors and distant ones.

In tasks like classification, where distance measurements are crucial for categorizing new data points, these metrics become less effective. As a result, algorithm performance drops, leading to less accurate predictions and analyses.

To better understand how distance behavior changes in higher dimensions, let's perform a simple simulation. We will generate random points in both low- and high-dimensional spaces. This will allow us to observe and compare the distribution of distances, showing how they evolve as we move to higher dimensions.

import numpy as np
import matplotlib.pyplot as plt

def generate_points(dimensions, num_points, range_min, range_max):
    return np.random.uniform(range_min, range_max, (num_points, dimensions))

def calculate_pairwise_distances(points):
    distances = np.sqrt(((points[:, np.newaxis, :] - points[np.newaxis, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(distances, np.nan)  # Ignore self-distances by setting them to NaN
    return distances

def calculate_distances_from_reference(points, reference_point):
    distances = np.sqrt(((points - reference_point) ** 2).sum(axis=1))
    return distances

def calculate_stats_for_dimensions(num_points, dimensions_range, range_min, range_max):
    means_pairwise = []
    stds_pairwise = []
    means_ref = []
    stds_ref = []

    for dim in dimensions_range:
        points = generate_points(dim, num_points, range_min, range_max)
        pairwise_distances = calculate_pairwise_distances(points)
        reference_point = generate_points(dim, 1, range_min, range_max)
        distances_from_ref = calculate_distances_from_reference(points, reference_point)

        means_pairwise.append(np.nanmean(pairwise_distances))
        stds_pairwise.append(np.nanstd(pairwise_distances))
        means_ref.append(np.mean(distances_from_ref))
        stds_ref.append(np.std(distances_from_ref))

    return means_pairwise, stds_pairwise, means_ref, stds_ref

def plot_histograms_and_stats(num_points, dimensions_range, range_min, range_max):
    fig, axs = plt.subplots(2, 3, figsize=(12, 7), tight_layout=True)

    # Plot histograms for 3D and 100D
    for i, dim in enumerate([3, 100]):
        points = generate_points(dim, num_points, range_min, range_max)
        pairwise_distances = calculate_pairwise_distances(points)
        reference_point = generate_points(dim, 1, range_min, range_max)
        distances_from_ref = calculate_distances_from_reference(points, reference_point)

        axs[i, 0].hist(pairwise_distances[~np.isnan(pairwise_distances)], bins=50, alpha=0.7, color='blue', edgecolor='black')
        axs[i, 0].set_title(f'Pairwise Distances in {dim}D')
        axs[i, 1].hist(distances_from_ref, bins=30, alpha=0.7, color='green', edgecolor='black', range=(0, max(distances_from_ref)))
        axs[i, 1].set_title(f'Distances to Reference in {dim}D')

    # Calculate mean and standard deviation trends across dimensions
    means_pairwise, stds_pairwise, means_ref, stds_ref = calculate_stats_for_dimensions(num_points, dimensions_range, range_min, range_max)

    # Plot mean and standard deviation curves for pairwise distances
    axs[0, 2].plot(dimensions_range, means_pairwise, label='Mean Pairwise', marker='o', color='blue')
    axs[0, 2].plot(dimensions_range, stds_pairwise, label='Std Dev Pairwise', marker='x', color='cyan')
    axs[0, 2].set_title('Pairwise Distances Stats')

    # Plot mean and standard deviation curves for distances to the reference point
    axs[1, 2].plot(dimensions_range, means_ref, label='Mean Reference', marker='o', color='green')
    axs[1, 2].plot(dimensions_range, stds_ref, label='Std Dev Reference', marker='x', color='lime')
    axs[1, 2].set_title('Reference Point Distances Stats')

    axs[0, 2].legend()
    axs[1, 2].legend()

    plt.show()

plot_histograms_and_stats(1000, range(1, 101), 1, 100)

Image by Author

The code output shows how distances change across dimensions. In 3D, the distances between points vary widely. In 100D, distances between points tend to be similar. The graphs on the right also show that as the dimensions increase, the mean distance between points grows, while the standard deviation stays roughly the same as it was in 2D or 3D space.

Another note here is that as dimensions increase, the mean distance between points grows and approaches the maximum distance. This happens because most of the space is concentrated in the corners.

To better understand this, we can simulate random points in dimensions up to 100D. This will let us compare the average distance to the maximum distance.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

def generate_points(dimensions, num_points, range_min, range_max):
    return np.random.uniform(range_min, range_max, (num_points, dimensions))

def calculate_distances_stats(points):
    # Compute pairwise distances
    distances = pdist(points)

    # Calculate average and maximum distance
    average_distance = np.mean(distances)
    max_distance = np.max(distances)

    return average_distance, max_distance

def plot_normalized_difference(num_points, dimensions_range, range_min, range_max):
    normalized_differences = []

    for dim in dimensions_range:
        points = generate_points(dim, num_points, range_min, range_max)
        average_distance, max_distance = calculate_distances_stats(points)
        normalized_difference = (max_distance - average_distance) / max_distance
        normalized_differences.append(normalized_difference)

    plt.figure(figsize=(8, 6))
    plt.plot(dimensions_range, normalized_differences, label='Normalized Difference', marker='o', color='blue')
    plt.xlabel('Number of Dimensions')
    plt.ylabel('Normalized Difference')
    plt.title('Normalized Difference Between Max and Average Distances Across Dimensions')
    plt.legend()
    plt.show()

plot_normalized_difference(500, range(1, 101), 0, 1)

Image by Author

The graph shows that as we go into higher dimensions, the average distance gets closer to the maximum distance. We used normalization here to make sure the scales were comparable.

It's important to understand the difference between absolute and relative distances. While absolute distances generally increase with more dimensions, it's the relative differences that matter more. Clustering algorithms like K-means or DBSCAN work based on how points are positioned relative to one another, not on their exact distances. This lets us find patterns and relationships that we would miss if we only looked at the raw distances.

But this leads to an interesting question: why do pairs of points in high-dimensional spaces tend to be roughly the same distance apart as we add more dimensions? What causes this to happen?

Photo by Aakash Dhage on Unsplash

To understand why pairs of points in high-dimensional spaces become nearly equidistant, we can look at the Law of Large Numbers (LLN). This statistical principle says that as we increase our sample size (or, here, the number of dimensions), the average of our observations gets closer to the expected value.

Let's consider the example of rolling a fair six-sided die. The expected mean of a roll is 3.5, which is the average of all possible outcomes. Initially, with just a few rolls, like 5 or 10, the average might be considerably different from 3.5 due to randomness. But as we increase the number of rolls to hundreds or thousands, the average roll value gets closer to 3.5. This phenomenon, where the average of many trials aligns with the expected value, shows the essence of the LLN. It demonstrates that while individual outcomes are unpredictable, the average becomes highly predictable over many trials.
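
A tiny simulation (an illustrative addition, not from the original article) makes this convergence visible:

import numpy as np

# Simulate fair die rolls and watch the running average approach the expected value of 3.5
rng = np.random.default_rng(42)
for n_rolls in [5, 10, 100, 1_000, 100_000]:
    rolls = rng.integers(1, 7, size=n_rolls)  # integers drawn uniformly from 1..6
    print(f"{n_rolls:>7} rolls: average = {rolls.mean():.3f} (expected 3.5)")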

Now, how does this relate to distances in high-dimensional spaces?

The Euclidean distance between two points in an n-dimensional space is calculated by summing the squared differences across each dimension. We can think of each squared difference as a random variable, similar to a roll of a die. As the number of dimensions (or rolls) increases, the average of these 'rolls' gets closer to its expected value, so the total distance becomes increasingly predictable.
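
As a rough sketch of the argument (under the simplifying assumption that the coordinates of the two points are independent and identically distributed):

d(x, y)^2 = \sum_{i=1}^{n} (x_i - y_i)^2, \qquad \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2 \;\xrightarrow{\text{LLN}}\; \mathbb{E}\big[(X - Y)^2\big], \qquad \text{so}\quad d(x, y) \approx \sqrt{n \, \mathbb{E}\big[(X - Y)^2\big]}

The distance between any two random points therefore concentrates around the same value, growing roughly like the square root of the number of dimensions, which is exactly the behavior the histograms above showed.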

A crucial requirement for the LLN is the independence of the random variables. In high-dimensional vectors, this independence shows up through an interesting geometric property: random vectors tend to be almost orthogonal to each other.

import numpy as np

def test_orthogonality(dimensions, n_trials):
    for i in range(n_trials):
        # Generate two random vectors
        v1 = np.random.randn(dimensions)
        v2 = np.random.randn(dimensions)

        # Calculate dot product
        dot_product = np.dot(v1, v2)

        # Calculate magnitudes
        magnitude_v1 = np.linalg.norm(v1)
        magnitude_v2 = np.linalg.norm(v2)

        # Calculate the cosine of the angle
        cos_theta = dot_product / (magnitude_v1 * magnitude_v2)

        # Check whether the vectors are almost orthogonal
        if np.abs(cos_theta) < 0.1:  # Adjust this threshold as needed
            orthogonality = "Almost Orthogonal"
        else:
            orthogonality = "Not Orthogonal"

        # Calculate the angle in degrees
        theta = np.arccos(cos_theta) * (180 / np.pi)  # Convert to degrees

        print(f"Trial {i + 1}:")
        print(f"  Dot Product: {dot_product}")
        print(f"  Cosine of Angle: {cos_theta}")
        print(f"  Angle: {theta} degrees")
        print(f"  Status: {orthogonality}")
        print("--------------------------------")

# Try editing these values and notice the near-orthogonality of vectors in higher dimensions
dimensions = 100  # Number of dimensions
n_trials = 10  # Number of trials

test_orthogonality(dimensions, n_trials)

Try running the code above and modifying the number of dimensions/trials, and you will notice that vectors in higher dimensions are almost orthogonal.

The angle between two vectors, A and B, is determined by the cosine of the angle, which is derived from their dot product and magnitudes. The formula is expressed as:

cos(θ) = (A · B) / (∥A∥ ∥B∥)

Here, A · B represents the dot product of vectors A and B, and ∥A∥ and ∥B∥ are their respective magnitudes. For two vectors to be orthogonal, the angle between them must be 90 degrees, making cos(θ) equal to zero. Typically, this is achieved when the dot product A · B is zero, a condition familiar from lower dimensions.

However, in high-dimensional spaces, another phenomenon emerges. The ratio of the dot product to the product of the vectors' magnitudes becomes so small that we can consider the vectors to be 'almost orthogonal.'

But what does it mean for two vectors to be 'independent' in this context?

Navigating a Grid City: An Analogy for Independence in High Dimensions

Imagine you are in a city laid out as a grid, like Manhattan's streets. Picture yourself at an intersection, trying to reach another point in this city. In this analogy, each street represents a dimension in a high-dimensional space. Moving along a street is like changing the value in one dimension of a high-dimensional vector. Moving along one street doesn't affect your position on another street, just as changing one dimension doesn't affect the others.

To reach a specific intersection, you make a series of independent decisions, much like computing a distance in high-dimensional space. Each decision contributes independently but leads you toward your destination.

This analogy also applies to the concept of orthogonality in high-dimensional vectors. When vectors are almost orthogonal, they follow their own paths without significantly influencing each other. This condition complements the requirement of statistical independence for the LLN.

An important note: while this LLN analogy offers a helpful perspective, it may not capture the entire idea or all the reasons behind this behavior. Nevertheless, it serves as a useful proxy, providing an intuition for why pairs of points tend to be almost equidistant.

One way the problems of the curse of dimensionality show up is overfitting. Overfitting happens when a complex model learns noise instead of the patterns in the data. This is especially true in high-dimensional spaces, where there are many features. The model can pick up false connections or correlations and perform poorly when it sees new data (failing to generalize).
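
Here is a minimal sketch of that failure mode (an illustrative addition, not from the original article; the sample sizes and feature count are arbitrary assumptions). With far more features than samples, a plain linear model can fit pure noise perfectly on the training set yet fail completely on unseen data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_train, n_test, n_features = 30, 30, 200

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = rng.normal(size=n_train)  # pure noise: there is no real pattern to learn
y_test = rng.normal(size=n_test)

model = LinearRegression().fit(X_train, y_train)
print(f"Train R^2: {r2_score(y_train, model.predict(X_train)):.3f}")  # close to 1.0 (memorized noise)
print(f"Test  R^2: {r2_score(y_test, model.predict(X_test)):.3f}")    # around zero or negative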

The curse also makes it hard to find patterns in large datasets. High-dimensional data is spread out and sparse, so it's challenging for traditional analysis methods to extract meaningful insights. Modifications or specialized methods are needed to navigate and understand this kind of data.

Another implication is that processing high-dimensional data takes a lot of computational power and memory. Algorithms that work well in lower dimensions become far more complex and resource-heavy as the number of dimensions increases. This means either using more powerful hardware or optimizing algorithms to handle the increased computational load efficiently.

There are several ways to deal with the curse of dimensionality. One is to reduce the dimensionality while keeping the important information (e.g., the PCA algorithm). Another approach is manifold learning (which can be thought of as a type of dimensionality reduction), which uncovers the structure within the high-dimensional data. The key idea behind manifold learning is that many high-dimensional datasets actually lie on a lower-dimensional manifold within the high-dimensional space (e.g., Isomap).

Note here that, generally speaking, traditional dimensionality reduction techniques like PCA (Principal Component Analysis) focus on preserving global data structure and variance in a linear fashion. In contrast, manifold learning techniques like Isomap (Isometric Mapping) emphasize uncovering the underlying non-linear structure (manifold) of the data, aiming to preserve local relationships and geometric features.
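
A minimal sketch of both in action (an illustrative addition, not from the original article; the S-curve dataset and parameter choices are assumptions). PCA projects the data linearly, while Isomap tries to preserve distances measured along the curved surface itself:

import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# A 3D "S"-shaped surface that actually lives on a 2D manifold
X, color = make_s_curve(n_samples=1000, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                          # linear projection, preserves global variance
X_isomap = Isomap(n_neighbors=10, n_components=2).fit_transform(X)    # non-linear, "unrolls" the manifold

print("PCA embedding shape:   ", X_pca.shape)
print("Isomap embedding shape:", X_isomap.shape)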

Feature selection is also an option, where relevant features are chosen to improve model performance. Regularization techniques prevent overfitting by shrinking the weights of less important features. Increasing the sample size can also help, although it may not always be possible. These methods can help us analyze high-dimensional data more accurately and efficiently.
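
As a brief sketch of regularization acting as implicit feature selection (an illustrative addition, not from the original article; the data, coefficients, and alpha value are assumptions), L1 regularization (Lasso) tends to shrink the coefficients of uninformative features to zero:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 100, 50

X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [5.0, -3.0, 2.0]               # only the first three features actually matter
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("Features kept by Lasso:", selected)      # expected to be (roughly) the first three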

The curse of dimensionality is one of the most important problems in data science and machine learning. It arises when dealing with high-dimensional spaces. Two main challenges that come up are data sparsity and issues with distance metrics. These challenges can cause overfitting in machine learning models and make computations more complex. To address them, strategies like dimensionality reduction, feature selection, and regularization can be used.

If you have made it this far, I would like to thank you for spending time reading this! I hope you found the topic enjoyable and at least inspiring enough to delve deeper into the world of high-dimensional data. Please feel free to suggest any edits or point out any errors or inaccuracies.


