Curse of Dimensionality: A Practical Understanding

  1. Dimensions: The dimension of a dataset refers to the number of features/attributes present in it.
  2. Dimensionality: The dimensionality of the data refers to the number of dimensions it has.
  3. High dimensional vs. low dimensional data: There is no single number that classifies a dataset as high or low dimensional. In general, we consider any dataset with more than 10 dimensions to be high dimensional.
  4. Nearest neighbors: For a data point in space, its nearest neighbors are the data points that lie closest to it (a short sketch follows the lists below).

After a note on data sparsity, we will explore two phenomena in code:

  1. Volume of an n-dimensional cube
  2. Distance concentration
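As a quick illustration of the nearest-neighbor definition above, here is a minimal sketch (the point set, the shapes, and the Euclidean metric are assumptions for illustration):

import numpy as np

# Hypothetical example: find the nearest neighbor of a query point
# among five random points in 3-D space, using Euclidean distance.
points = np.random.uniform(size=(5, 3))
query = np.random.uniform(size=3)
distances = np.linalg.norm(points - query, axis=1)
nearest = points[np.argmin(distances)]
print(nearest)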

Data Sparsity

In simple terms, as the dimensionality of the data increases, we need more data samples (in fact, exponentially more) for a model to generalize. Generalization means a model's ability to perform correctly on unseen data.
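To make this concrete, here is a minimal sketch (the choice of 10 bins per feature is an assumption for illustration): if we split each feature of the unit hypercube into 10 equal bins, the number of cells that each need at least one sample in order to cover the space grows exponentially with the dimension.

# Hypothetical illustration: cells needed to cover [0, 1]^d at a
# resolution of 0.1 per feature. The cell count is 10^d, so covering
# the space quickly becomes impossible as d grows.
for d in [1, 2, 3, 5, 10]:
    n_cells = 10 ** d
    print(f"{d} dimension(s): {n_cells:,} cells to cover")

With 10 dimensions we would already need ten billion cells, which is why the amount of data required to keep a model well supported explodes with dimensionality.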

Volume of an n-dimensional cube

Another interesting phenomenon is that when we randomly populate an n-dimensional cube with points, most of the points end up near the surface of the cube. That is, most of the volume of the cube is concentrated in a thin layer around its surface: for a unit cube, the inner region left after removing a shell of thickness ε from every face has volume (1 - 2ε)^n, which shrinks to zero as n grows. Let's dive into the code and see this.

import numpy as np

n_samples = 1000     # number of random points
n_dimensions = 100   # dimensionality of the cube

# Populate the unit n-dimensional cube with uniformly random points.
x = np.array([[np.random.uniform() for i in range(n_dimensions)]
              for j in range(n_samples)])

def get_count_cartesian(x):
    '''
    This function takes in a set of m vectors and counts how many of them
    lie on the surface of an n-dimensional cube.

    Each element of a vector is compared with a threshold value. If any
    element of the vector exceeds the threshold, the count is increased by 1.
    '''
    count = 0
    for x_r in x:
        d = x_r > 0.9999  # raising the threshold thins the surface layer and lowers the count
        d = d.astype(int)
        total = np.sum(d)  # renamed from `sum` to avoid shadowing the builtin
        if total > 0:
            count += 1
    return count
Figure: count of points landing on the surface vs. the number of dimensions.
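To reproduce the trend in the figure above, a minimal driver sketch (assumed, not from the original article) can run the count for increasing dimensions. For the one-sided threshold of 0.9999 used here, the expected fraction is 1 - 0.9999^n, which approaches 1 as n grows.

# Hypothetical driver: fraction of points flagged as "on the surface"
# as the dimensionality grows.
for n_dimensions in [1, 10, 100, 1000, 10000]:
    x = np.random.uniform(size=(1000, n_dimensions))
    frac = get_count_cartesian(x) / len(x)
    print(f"n = {n_dimensions:>5}: {frac:.3f} of points on the surface")

At n = 10,000 roughly 63% of the points already have at least one coordinate above the threshold, even though the threshold leaves only a sliver of width 0.0001 per face.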

Distance Concentration

Another interesting phenomenon at high dimensions is that pairwise distances concentrate: the difference between the nearest and the farthest point becomes small relative to the distances themselves. This causes a major problem for algorithms that rely on nearest neighbors. For example, if we use a nearest-neighbour algorithm to classify an unknown point, we cannot reliably decide which point is closest to the query point, as all points appear at roughly the same distance from it.

def get_distance(x, x_new):
    '''
    This function calculates the Euclidean distance of a new point in the
    n-dimensional space from each of the m vectors.

    Returns the mean distance of the point to every other point and the
    variance of those distances.
    '''
    distance = []
    for x_r in x:
        dist = np.linalg.norm(x_r - x_new)  # Euclidean distance
        distance.append(dist)
    mean = np.mean(distance)
    variance = np.var(distance)
    return mean, variance
Figure: separation of points in 2D vs. 3D. Image credits: Hachimi et al.
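A minimal driver sketch (assumed, not from the original article) shows the effect: the mean distance grows with dimension, but its spread grows far more slowly, so the relative spread std/mean collapses and every point looks roughly equidistant from the query.

# Hypothetical driver: relative spread of distances to a random query point.
rng = np.random.default_rng(0)
for n_dimensions in [2, 10, 100, 1000, 10000]:
    x = rng.uniform(size=(1000, n_dimensions))
    x_new = rng.uniform(size=n_dimensions)
    mean, variance = get_distance(x, x_new)
    print(f"n = {n_dimensions:>5}: mean = {mean:8.2f}, "
          f"std/mean = {np.sqrt(variance) / mean:.4f}")

For uniformly random points the relative spread shrinks roughly like 1/√n, which is exactly why nearest-neighbor distances stop being informative in high dimensions.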
