This page contains Clustering glossary terms. For all glossary terms, click here.

## A

## agglomerative clustering

## C

## centroid

The center of a cluster as determined by a **k-means** or
**k-median** algorithm. For instance, if k is 3,
then the k-means or k-median algorithm finds 3 centroids.

## centroid-based clustering

A category of **clustering** algorithms that organizes data
into nonhierarchical clusters. **k-means** is the most widely
used centroid-based clustering algorithm.

Contrast with **hierarchical clustering**
algorithms.

## clustering

Grouping related **examples**, particularly during
**unsupervised learning**. Once all the
examples are grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the **k-means**
algorithm clusters examples based on their proximity to a
**centroid**, as in the following diagram:

A human researcher could then review the clusters and, for example, label cluster 1 as "dwarf trees" and cluster 2 as "full-size trees."

As another example, consider a clustering algorithm based on an example's distance from a center point, illustrated as follows:

## D

## divisive clustering

## H

## hierarchical clustering

A category of **clustering** algorithms that create a tree
of clusters. Hierarchical clustering is well-suited to hierarchical data,
such as botanical taxonomies. There are two types of hierarchical
clustering algorithms:

**Agglomerative clustering**first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.**Divisive clustering**first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.

Contrast with **centroid-based clustering**.

## K

## k-means

A popular **clustering** algorithm that groups examples
in unsupervised learning. The k-means algorithm basically does the following:

- Iteratively determines the best k center points (known
as
**centroids**). - Assigns each example to the closest centroid. Those examples nearest the same centroid belong to the same group.

The k-means algorithm picks centroid locations to minimize the cumulative
*square* of the distances from each example to its closest centroid.

For example, consider the following plot of dog height to dog width:

If k=3, the k-means algorithm will determine three centroids. Each example is assigned to its closest centroid, yielding three groups:

Imagine that a manufacturer wants to determine the ideal sizes for small,
medium, and large sweaters for dogs. The three centroids identify the mean
height and mean width of each dog in that cluster. So, the manufacturer
should probably base sweater sizes on those three centroids. Note that
the centroid of a cluster is typically *not* an example in the cluster.

The preceding illustrations shows k-means for examples with only two features (height and width). Note that k-means can group examples across many features.

## k-median

A clustering algorithm closely related to **k-means**. The
practical difference between the two is as follows:

- In k-means, centroids are determined by minimizing the sum of the
*squares*of the distance between a centroid candidate and each of its examples. - In k-median, centroids are determined by minimizing the sum of the distance between a centroid candidate and each of its examples.

Note that the definitions of distance are also different:

- k-means relies on the Euclidean distance from the centroid to an example. (In two dimensions, the Euclidean distance means using the Pythagorean theorem to calculate the hypotenuse.) For example, the k-means distance between (2,2) and (5,-2) would be:

- k-median relies on the Manhattan distance from the centroid to an example. This distance is the sum of the absolute deltas in each dimension. For example, the k-median distance between (2,2) and (5,-2) would be:

## S

## similarity measure

In **clustering** algorithms, the metric used to determine
how alike (how similar) any two examples are.

## sketching

In **unsupervised machine learning**,
a category of algorithms that perform a preliminary similarity analysis
on examples. Sketching algorithms use a
locality-sensitive hash function
to identify points that are likely to be similar, and then group
them into buckets.

Sketching decreases the computation required for similarity calculations on large datasets. Instead of calculating similarity for every single pair of examples in the dataset, we calculate similarity only for each pair of points within each bucket.

## T

## time series analysis

A subfield of machine learning and statistics that analyzes
**temporal data**. Many types of machine learning
problems require time series analysis, including classification, clustering,
forecasting, and anomaly detection. For example, you could use
time series analysis to forecast the future sales of winter coats by month
based on historical sales data.

## U

## unsupervised machine learning

Training a **model** to find patterns in a dataset, typically an
unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs together based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can be helpful in domains where true labels are hard to obtain. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Another example of unsupervised machine learning is principal component analysis (PCA). For example, applying PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.

Compare with **supervised machine learning**.