Curse of Dimensionality: a practical understanding
If you are a machine learning engineer or just an enthusiast, there is a good chance you have encountered the term “Curse of Dimensionality”. There are many excellent resources online that provide a great theoretical understanding, but I wanted a simple practical demonstration that drives the point home. So, with that in mind, I decided to write one.
Before jumping into the code, let us define some key terms that we will be using (you can skip directly to the code if you want).
- Features/Attributes: A feature is a characteristic property of the data that helps distinguish one item from another. For example, in a bag of fruits, an apple can easily be recognized by its color, or a banana by its shape. Color and shape are features of the data (the fruits).
- Dimensions: The dimensions of a dataset are its features/attributes; each feature corresponds to one axis of the feature space.
- Dimensionality: The dimensionality of a dataset is the number of such dimensions.
- High-dimensional vs. low-dimensional data: There is no fixed cutoff that classifies a dataset as high or low dimensional. As a rough rule of thumb, data with more than about 10 dimensions is often treated as high dimensional.
- Nearest neighbors: For a data point in space, its nearest neighbors are the data points closest to it under some distance measure (here, Euclidean distance).
With the definitions clear, let us now understand the “Curse of Dimensionality”. The term was first coined by Richard Bellman while working on dynamic programming (which Bellman also introduced). It refers to the various phenomena that occur when analyzing data in high-dimensional spaces and that do not occur in lower dimensions such as 2-D or 3-D.
These phenomena can be broadly grouped under three headings:
- Data sparsity
- Volume of an n-dimensional cube
- Distance concentration
Data sparsity
In simple terms, as the dimensionality of the data increases, we need more data samples for a model to generalize. Generalization of a model means its ability to perform correctly on unseen data.
A common recommendation is that if d is the dimensionality of the feature space of the data that will be fed to a machine learning algorithm, we need at least 10 × d samples for each class. An excellent example is provided by Arun K.
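One way to see the sparsity directly is to partition the unit cube into a grid and count how many cells actually receive a sample. The sketch below is my own (the function name, grid size, and dimension choices are not from the article): with a fixed budget of 1,000 samples, the fraction of occupied cells collapses as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def occupied_fraction(n_samples, n_dimensions, bins=10):
    """Fraction of the bins**d grid cells that contain at least one sample."""
    x = rng.uniform(size=(n_samples, n_dimensions))
    cells = np.floor(x * bins).astype(int)        # cell index along each axis
    occupied = len({tuple(c) for c in cells})     # number of distinct occupied cells
    return occupied / float(bins) ** n_dimensions

for d in (1, 2, 3, 6):
    print(d, occupied_fraction(1000, d))
```

Already at 6 dimensions there are a million cells, so 1,000 samples can cover at most 0.1% of them; the same data that densely covered the 1-D line is hopelessly sparse a few dimensions later.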
Volume of n-dimensional cube
Another interesting phenomenon is that when we randomly populate an n-dimensional cube with points, most of them end up on the surface of the cube. That is, most of the volume of the cube is concentrated in a thin layer around its surface. Let's dive into the code and see this.
We will first randomly generate n_samples points, each in an n_dimensions-dimensional space.
import numpy as np

n_samples, n_dimensions = 1000, 100
x = np.random.uniform(size=(n_samples, n_dimensions))  # points in the unit cube [0, 1]^n
In the following code, we will calculate how many of the 1000 points end up on the surface of the n-dimensional unit cube.
This function takes a set of m vectors and counts how many are located on the surface of the n-dimensional unit cube, i.e. within a thin shell just inside it. Each element of a vector is compared with a threshold value; if any element exceeds the threshold, the point is counted as lying on the surface.
def count_surface_points(x, threshold=0.9999):
    count = 0
    for x_r in x:
        if np.any(x_r > threshold):  # adding decimals to the threshold thins the shell and lowers the count
            count += 1
    return count
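A self-contained version of the whole experiment (generation plus counting) might look like this sketch; the function name surface_fraction is my own, and the threshold is left as a parameter because a thinner shell yields a smaller count:

```python
import numpy as np

rng = np.random.default_rng(0)

def surface_fraction(n_samples, n_dimensions, threshold=0.99):
    """Fraction of uniform random points in [0, 1]^d with at least one
    coordinate above the threshold, i.e. lying in a shell at the surface."""
    x = rng.uniform(size=(n_samples, n_dimensions))
    return np.mean(np.any(x > threshold, axis=1))

for d in (1, 10, 100, 1000):
    print(d, surface_fraction(1000, d))
```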
On running the code for 1000 points with dimensions ranging from 1 to 1000 we get the following result.
As discussed above, we see that as the number of dimensions increases, the probability of a point ending up on the surface of the cube increases. Even at around 200 dimensions, the count goes up to 850 out of 1000 points. This is because a point lies on the surface as soon as even one of its coordinates is (close to) 1. A more mathematical treatment of the same can be found here: https://www.math.ucdavis.edu/~strohmer/courses/180BigData/180lecture1.pdf
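The growth of that probability can be made precise with a one-line calculation (for the one-sided threshold check used in the code):

```latex
% Probability that a uniform random point in [0,1]^d has at least one
% coordinate above a threshold t, i.e. lies in the one-sided shell:
P(\text{surface}) = 1 - P(\text{all } d \text{ coordinates} \le t) = 1 - t^{d}
% For any fixed t < 1, this tends to 1 as d grows, so in high
% dimensions almost every random point lies near the surface.
```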
Distance concentration
Another interesting phenomenon that occurs at high dimensions is that points tend to lie at very similar distances from each other. This causes a major problem for algorithms that rely on nearest neighbors. For example, if we use a nearest-neighbour algorithm to classify an unknown point, we cannot confidently decide which point is closest to the query point, because all points appear at roughly the same distance from it.
This function calculates the Euclidean distance of a new point in the n-dimensional space to each of the m vectors, and returns the mean of those distances along with their variance.
def distance_stats(x, x_new):
    distances = []
    for x_r in x:
        distances.append(np.linalg.norm(x_r - x_new))  # Euclidean distance to each point
    return np.mean(distances), np.var(distances)
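To see the concentration directly, here is a self-contained sketch that measures the relative contrast, i.e. how far the farthest point is from the nearest one, relative to the nearest distance (the function name and the contrast measure are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_samples, n_dimensions):
    """(d_max - d_min) / d_min over Euclidean distances from one random
    query point to n_samples uniform points in the unit cube."""
    x = rng.uniform(size=(n_samples, n_dimensions))
    x_new = rng.uniform(size=n_dimensions)
    dists = np.linalg.norm(x - x_new, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(d, relative_contrast(1000, d))
```

In 2-D the nearest point is dramatically closer than the farthest; by 1000 dimensions the spread is a small fraction of the distance itself, so "nearest" barely means anything.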
As the number of dimensions increases, the variance of the Euclidean distances relative to their mean keeps decreasing; the standard deviation, being the square root of the variance, shrinks relative to the mean as well. This leads to a situation in which every point appears to sit at nearly the same distance from any query point. Here are the results for the same.
This is the reason nearest-neighbour algorithms are rarely applied directly to data with many dimensions without dimensionality reduction first. An excellent visualization of the distances is provided below.
Mathematical demonstration of the distance concentration in high dimensions
I hope these simple code examples help you understand the effects of working with data in high dimensions. But is higher-dimensional data always bad? There is a complementary concept called the “Blessing of Dimensionality”, popularized by Donoho around 2000. High dimensionality simplifies the expected geometry of point sets and gives great scope for linear separability of the data: even a simple Fisher discriminant can be enough to separate the data points correctly. This is closely related to what the kernel trick exploits, though we won't cover it in this tutorial. Here's a visual representation of the same.
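As a rough illustration of that separability claim, here is a sketch on synthetic data. Everything in it is my own assumption, not from the article: two Gaussian classes whose means differ by only 0.2 per coordinate, and a nearest-mean linear discriminant, which is a simplified stand-in for Fisher's discriminant when both classes share the same covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000  # high-dimensional space

# two classes that overlap heavily in any single coordinate
a = rng.normal(loc=0.0, scale=1.0, size=(200, d))
b = rng.normal(loc=0.2, scale=1.0, size=(200, d))

# project onto the direction between the class means and
# classify by which side of the midpoint a point falls on
w = b.mean(axis=0) - a.mean(axis=0)
midpoint = (a.mean(axis=0) + b.mean(axis=0)) / 2
scores_a = (a - midpoint) @ w   # should be negative for class a
scores_b = (b - midpoint) @ w   # should be positive for class b
print(np.mean(scores_a < 0), np.mean(scores_b > 0))
```

In any single coordinate the two classes are nearly indistinguishable, yet the tiny per-coordinate differences accumulate over 1000 dimensions, and the simple linear rule separates the classes almost perfectly.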
The code can be found here: https://github.com/Suyashail/Curse-of-Dimensionality. It contains additional experiments on finding a way around the generation of points on the surface of the cube.