Introduction to Clojure - K-means from scratch

Non Publié • 12 minutes de lecture

🚧 Article en cours de rédaction

Cet article est en cours de rédaction, son contenu peut évoluer sans préavis et les informations qu'il contient peuvent manquer de précisions.

Introduction to clojure - k-means from scratch

Introduction

K-means intuition

Before getting started with Clojure, let's have a look at the algorithm we're trying to build.

According to Wikipedia, the K-means algorithm is meant to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid).

To put it differently, the k-means algorithm aims at finding the best spot for each centroid to sit.

How come we do that?

In theory, it's not that hard. Here are the steps to get there:

First, we'll need to randomly initialize k centroids in our vector space.
Then, we'll need to compute the distance that separates each point from each centroid.
Each point will be associated to it's closest centroid.
At this point, we have a first rather random- classification.
We can now update our centroids by computing the mean point of each cluster.
Repeat (but obviously skip the first step) until convergence.

If you just want to play around, you'll find the full code at the end of the blog post.

Clojure basics

There are a few prerequisite for this article:

You should have Clojure installed on your machine. If not, you can refer to the official guide https://clojure.org/guides/getting_started
We'll be using IntelliJ IDEA with the Cursive extension (download it in Preference -> Plugins then type cursive in the search bar).

For us to implement our algorithm, we'll need to understand a thing or two about Clojure.

Shaunlebron did a good job on explaining basic clojure syntax I strongly suggest you give it a look here.

We'll get into more details during the implementation.

Terminology used in this article

centroid: central point of a cluster

cluster: aggregation of data

Prelude: Set up

First of all, we'll define a global variable using the def keyword that will hold an associative array of our configuration.

(def configuration {:k-clusters    2
                    :max-epochs    1000
                    :threshold     0.0001
                    :bounds        [[0 15] [0 15]]  ;; 2 dimensions
                    :current-epoch 0})

Let's define what's in this associative array:

:k-clusters: the amount of clusters we want our data to be classified in.
:max-epochs: the maximum loop we're allow to make before convergence.
:threshold: the threshold used to define whether our model converged or not.
:bounds: bounds is a vector of m-dimensional-vectors that defines the vector space of our dataset.
:current-epoch: as our model converges, it will return the current loop we're in.

Also, let's build a small data set that we'll use as a validation dataset.

(def points [[0 0] [0 1] [1 0] [1 1]
             [10 10] [10 11] [11 10] [11 11]])

By now you should be able to say "We defined a global variable called points that is a vector of 2 dimensional vectors".

Step one: Randomly initialize centroids

Our aim here is: given our vector space (:bounds in our previous associative array) we'd like to generate n random points that will be our initial centroids.

Do we have special tools for this?

Yes! We'll use mapv, rand-nthand range.

Before defining mapv let's define map.

The basic map usage allows you to pass two arguments to the map function: a function and a sequence of elements.

Example:

(map (fn [elt] (* elt 2)) [1 2 3])
;; the output will be (2 4 6)

In this example map applies a lambda function (fn [elt] (* elt 2)) and applies it to every element of the array[1 2 3]. Then it returns a new list of results.

Keep in mind that in Clojure, map is lazy by design. Also, you might be thinking "returning a new copy of each data every time we want to do something with it must be incredibly inefficient !" - well just like in java you usually return an updated view of the updated data. We'll discuss this more in another topic but for now you can assume this is copy-paste on steroids.

Getting back to mapv well basically it will return a vector instead of a list.

(mapv (fn [elt] (* elt 2)) [1 2 3])
;; the output will be [2 4 6]

rand-nthand range on the other hand are pretty straight forward:

(range 10 15)
;; outputs -> (10 11 12 13 14)
(rand-nth (range 10 15))
;; pick a random number in the range. outputs -> 13 (for instance)

So how do we generate our random centroid?

First we'll define a function that takes some bounds for argument remember bounds are defined as a vector of vectors in our configuration.

Then we can use the mapv function to iterate through the bounds of each dimension and return a centroid vector!

(defn generate-centroid [bounds]
  (mapv (fn [bound] (rand-nth (range (first bound) (second bound)))) bounds))

;;try it out like so
(generate-centroid  [[0 15] [0 15]] )

Step Two: Let's build our application state using our `configuration`

Now that we can initialize random centroids, let's build an initial state that we'll make converge throughout the algorithm.

For this, let us first introduce let.

let allow you to define scoped constants that you might want to use in your return value.

For instance let's make a function that returns a vector of computed values:

(defn fun [x y]
  (let [x-times-two (* 2 x)
        y-times-two (* 2 y)]
    [x-times-two y-times-two]))

(fun 1 2) ;; output [2 4]

Values like x-times-two (* 2 x) are called local bindings.

Let's now introduce Clojure sets.

Sets are a list of unique elements. In Clojure they can be defined like so: #{:bar 3.14 "hello"}.

The assoc operation allow you to add a new key-value pair in an associative array.

vec can turn a list into a vector and repeat returns a list of repeated element. For instance (vec (repeat 2 "hello"))will return ["hello" "hello"].

Here is how we build our initial state:

(defn initial-state [config]
  (let [centroids (mapv (fn [elt] (generate-centroid (get config :bounds))) (range (get config :k-clusters)))]
    (assoc config :groups (vec (repeat (count centroids) #{}))
                  :centroids centroids)))

Output:

=>
{:k-clusters 2,
 :max-epochs 1000,
 :threshold 1.0E-4,
 :bounds [[0 15] [0 15]],
 :current-epoch 0,
 :groups [#{} #{}], 			;; Vector of repeated empty sets
 :centroids [[3 11] [14 11]]}		;; Random centroids

Step Three: Euclidean distances & point classification

Given two points A(X1,Y1) and B(X2,Y2) the euclidean distance between these two points can be computed like so:

√(X1-X2)²+(Y1-Y2)²

In Clojure, Math basic function can be used like this (Math/sqrt 4)

Now let's build a function that compute the euclidean distance between 2 2D points.

(defn euclidean-distance [[x1 y1] [x2 y2]]
  (Math/sqrt (+ (Math/pow (- x1 x2) 2)
                (Math/pow (- y1 y2) 2))))

We used destructuration in the previous example by providing arguments in vectors [x1 y1] [x2 y2]. Indeed our euclidean distance can now take two 2-dimensional vectors but also we can access to every coordinate directly.

However, we might want to scale our program up by using points with more than 2 dimensions.

A hack to subtract every "Xs" of our points is to view our points as a transposed matrix and then reduce every vector.

For instance [[x1 y1] [x2 y2]] will become [[x1 x2] [y1 y2]].

We can then use a mapv - reduce combo to return a vector differences!

So let's build a function that compute the euclidean distance between 2 mD points.

(defn euclidean-distance [point-1 point-2]
  (Math/sqrt
    (reduce + 0
            (mapv (fn [diff] (Math/pow diff 2))
                  (mapv (fn [nth-elts] (reduce - (first nth-elts) (rest nth-elts)))
                        (transpose-matrix [point-1 point-2]))))))

Remember the output of the previous step ? Now we want to create a function that can add points in each set of our state.

For this we'll use update-in that allow us to update an associative array given a path.

(defn classify-point [state point]
  (let [centroids (get state :centroids) ;; we get the centroids from our state
        distances (mapv (fn [centroid] (euclidean-distance point centroid)) centroids) ;; we return the distances between a point and each centroid
        indexed-distances (map (fn [index value] [index value]) (range) distances) ;; we index the distances in order to classify the point in the correct group
        [index shortest-distance] (first (sort-by second indexed-distances))] ;; we keep the shortest "indexed distance"
    (update-in state [:groups index] (fn [set] (conj set point))))) ;; we add our point to the correct set

Step Four: Update centroids

Congratulations, you've made the first classification now we want to update our centroid with the mean point of each set.

Here is an example for a set of two dimensional points.

(defn mean-point-for-record [points]
  (when (seq points)
    (let [nb-points (count points)
          xs-sum (reduce + 0 (map first points))
          ys-sum (reduce + 0 (map second points))]
      [(/ xs-sum nb-points) (/ ys-sum nb-points)])))

And now a version that can take any dimension. Notice that we use whento return nil if points is en empty vector.

(defn mean-point [points]
  (when (seq points)                                       
    (let [nb-points (count points)]
      (mapv
        (fn [coordinate-vec] (/ (reduce + 0 coordinate-vec) nb-points))
        (transpose-matrix points)))))

We can now update our state with the new centroids and we can also update our iteration count :current-epoch.

(defn update-centroids [state]
  (let [new-centroids (mapv mean-point (:groups state))
        new-current-epoch (inc (:current-epoch state))]
    (assoc state :centroids new-centroids :current-epoch new-current-epoch)))

Step Five: Repeat until convergence

To say wether or not our model has converged, we want to either:

Have reached our max-epoch.
Have a centroid variation so slight that we can consider our job done.

(defn converged? [previous-state current-state]
  (or (> (get current-state :current-epoch) (get current-state :max-epochs))
      (let [distances (map euclidean-distance (:centroids current-state) (:centroids previous-state))]
        (every? (fn [distance] (< distance (:threshold current-state))) distances)))) ;; here we use every? to check is every variation is under the threshold

Let's wrap it up in a main function:

(defn main [config points]
  (loop [state (initial-state config)]
    (let [with-points (reduce classify-point state points)
          with-new-centroids (update-centroids with-points)]
      (if (converged? state with-new-centroids)
        with-new-centroids
        (recur with-new-centroids)))))

Full code

(ns test-clojure.clustering)

(def configuration {:k-clusters    2
                    :max-epochs    1000
                    :threshold     0.0001
                    :bounds        [[0 15] [0 15]]          ;; 2 dim
                    :current-epoch 0})

(def points [[0 0] [0 1] [1 0] [1 1]
             [10 10] [10 11] [11 10] [11 11]])

(defn generate-centroid [bounds]
  (mapv (fn [bound] (rand-nth (range (first bound) (second bound)))) bounds))

(defn initial-state [config]
  (let [centroids (mapv (fn [elt] (generate-centroid (get config :bounds))) (range (get config :k-clusters)))]
    (assoc config :groups (vec (repeat (count centroids) #{}))
                  :centroids centroids)))

(defn transpose-matrix [matrix]
  (apply mapv vector matrix))

(defn euclidean-distance [point-1 point-2]
  (Math/sqrt
    (reduce + 0
            (mapv (fn [diff] (Math/pow diff 2))
                  (mapv (fn [nth-elts] (reduce - (first nth-elts) (rest nth-elts)))
                        (transpose-matrix [point-1 point-2]))))))

(defn classify-point [state point]
  (let [centroids (get state :centroids)
        distances (mapv (fn [centroid] (euclidean-distance point centroid)) centroids)
        indexed-distances (map (fn [index value] [index value]) (range) distances)
        [index shortest-distance] (first (sort-by second indexed-distances))]
    (update-in state [:groups index] (fn [set] (conj set point)))))

(defn mean-point [points]
  (when (seq points)                                       
    (let [nb-points (count points)]
      (mapv
        (fn [coordinate-vec] (/ (reduce + 0 coordinate-vec) nb-points))
        (transpose-matrix points)))))

(defn update-centroids [state]
  (let [new-centroids (mapv mean-point (:groups state))
        new-current-epoch (inc (:current-epoch state))]
    (assoc state :centroids new-centroids :current-epoch new-current-epoch)))

(defn converged? [previous-state current-state]
  (or (> (get current-state :current-epoch) (get current-state :max-epochs))
      (let [distances (map euclidean-distance (:centroids current-state) (:centroids previous-state))]
        (every? (fn [distance] (< distance (:threshold current-state))) distances))))

(defn main [config points]
  (loop [state (initial-state config)]
    (let [with-points (reduce classify-point state points)
          with-new-centroids (update-centroids with-points)]
      (if (converged? state with-new-centroids)
        with-new-centroids
        (recur with-new-centroids)))))

(main configuration points)

Expected output:

{:k-clusters 2,
 :max-epochs 1000,
 :threshold 1.0E-4,
 :bounds [[0 15] [0 15]],
 :current-epoch 2,
 :groups [#{[0 0] [1 0] [1 1] [0 1]} #{[11 11] [11 10] [10 11] [10 10]}],
 :centroids [[1/2 1/2] [21/2 21/2]]}