
sparktk kmeans

Functions

def load(path, tc=<class 'sparktk.arguments.implicit'>)

Load a KMeansModel from the given path.

def train(frame, columns, k=2, scalings=None, max_iterations=20, convergence_tolerance=0.0001, seed=None, init_mode='k-means||')

Creates a KMeansModel by training on the given frame.

Parameters:
frame(Frame): frame of training data
columns(List[str]): names of the columns containing the observations for training
k(Optional(int)): number of clusters
scalings(Optional(List[float])): column scalings for each of the observation columns. Each value in an observation column is multiplied by the corresponding scaling value
max_iterations(Optional(int)): maximum number of iterations for which the algorithm should run
convergence_tolerance(Optional(float)): distance threshold within which k-means is considered to have converged. Default is 1e-4. If all centers move less than this Euclidean distance, iteration stops
seed(Optional(long)): seed for randomness
init_mode(Optional(str)): the initialization technique for the algorithm. It can be either "random", to choose random points as initial clusters, or "k-means||", to use a parallel variant of k-means++. Default is "k-means||"

Returns(KMeansModel): trained KMeans model
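The `scalings` parameter rescales each observation column before clustering: each value is multiplied by the scaling factor for its column. A minimal plain-Python sketch of that preprocessing (the rows and factors here are illustrative, not taken from the API):

```python
# Sketch of how `scalings` is applied: each observation value is
# multiplied by the scaling factor for its column before clustering.
rows = [[2.0, 100.0], [1.0, 250.0], [7.0, 50.0]]  # two observation columns
scalings = [1.0, 0.01]  # shrink the second column so it doesn't dominate distances

scaled = [[value * factor for value, factor in zip(row, scalings)]
          for row in rows]

print(scaled)  # [[2.0, 1.0], [1.0, 2.5], [7.0, 0.5]]
```

Because k-means clusters on Euclidean distance, a column with a much larger numeric range would otherwise dominate the cluster assignments.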

Classes

class KMeansModel

A trained KMeans model

Example:
>>> frame = tc.frame.create([[2, "ab"],
...                          [1, "cd"],
...                          [7, "ef"],
...                          [1, "gh"],
...                          [9, "ij"],
...                          [2, "kl"],
...                          [0, "mn"],
...                          [6, "op"],
...                          [5, "qr"]],
...                         [("data", float), ("name", str)])

>>> model = tc.models.clustering.kmeans.train(frame, ["data"], 3, seed=5)

>>> model.k
3

>>> sizes = model.compute_sizes(frame)

>>> sizes
[2, 2, 5]

>>> wsse = model.compute_wsse(frame)

>>> wsse
5.3
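The within-set sum of squared error (WSSE) is the sum, over all points, of the squared Euclidean distance from each point to its assigned cluster centroid. Using the one-dimensional data above and the cluster means implied by the example's assignments (a hand computation in plain Python, not an API call), the value 5.3 can be reproduced:

```python
# WSSE = sum of squared distances from each point to its assigned centroid.
# The groupings below match the example's cluster assignments.
clusters = {
    0: [7.0, 9.0],                  # centroid 8.0
    1: [2.0, 1.0, 1.0, 2.0, 0.0],   # centroid 1.2
    2: [6.0, 5.0],                  # centroid 5.5
}

wsse = 0.0
for points in clusters.values():
    centroid = sum(points) / len(points)
    wsse += sum((p - centroid) ** 2 for p in points)

print(round(wsse, 1))  # 5.3
```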

>>> predicted_frame = model.predict(frame)

>>> predicted_frame.inspect()
[#]  data  name  cluster
========================
[0]   2.0  ab          1
[1]   1.0  cd          1
[2]   7.0  ef          0
[3]   1.0  gh          1
[4]   9.0  ij          0
[5]   2.0  kl          1
[6]   0.0  mn          1
[7]   6.0  op          2
[8]   5.0  qr          2


>>> model.add_distance_columns(predicted_frame)

>>> predicted_frame.inspect()
[#]  data  name  cluster  distance0  distance1  distance2
=========================================================
[0]   2.0  ab          1       36.0       0.64      12.25
[1]   1.0  cd          1       49.0       0.04      20.25
[2]   7.0  ef          0        1.0      33.64       2.25
[3]   1.0  gh          1       49.0       0.04      20.25
[4]   9.0  ij          0        1.0      60.84      12.25
[5]   2.0  kl          1       36.0       0.64      12.25
[6]   0.0  mn          1       64.0       1.44      30.25
[7]   6.0  op          2        4.0      23.04       0.25
[8]   5.0  qr          2        9.0      14.44       0.25
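Each `distanceN` column holds the squared Euclidean distance from the row's observation to centroid N. With the centroids implied by the table above (8.0, 1.2, and 5.5 for clusters 0, 1, and 2; hand-derived from the example data, not read from the API), the first row's values can be reproduced:

```python
# distanceN = squared Euclidean distance from the observation to centroid N.
# Centroids hand-derived from the example's cluster assignments.
centroids = [8.0, 1.2, 5.5]
data = 2.0  # first row of the example frame

distances = [round((data - c) ** 2, 2) for c in centroids]
print(distances)  # [36.0, 0.64, 12.25] -- matches distance0..distance2 above
```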

>>> model.columns
[u'data']

>>> model.scalings  # None


>>> centroids = model.centroids

>>> model.save("sandbox/kmeans1")

>>> restored = tc.load("sandbox/kmeans1")

>>> restored.centroids == centroids
True

>>> restored_sizes = restored.compute_sizes(predicted_frame)

>>> restored_sizes == sizes
True


>>> predicted_frame2 = restored.predict(frame)

>>> predicted_frame2.inspect()
[#]  data  name  cluster
========================
[0]   2.0  ab          1
[1]   1.0  cd          1
[2]   7.0  ef          0
[3]   1.0  gh          1
[4]   9.0  ij          0
[5]   2.0  kl          1
[6]   0.0  mn          1
[7]   6.0  op          2
[8]   5.0  qr          2

>>> canonical_path = model.export_to_mar("sandbox/Kmeans.mar")

Ancestors (in MRO)

  • KMeansModel
  • sparktk.propobj.PropertiesObject
  • __builtin__.object

Instance variables

var centroids

var columns

var initialization_mode

var k

var max_iterations

var scalings

Methods

def __init__(self, tc, scala_model)

def add_distance_columns(self, frame, columns=None)

def compute_sizes(self, frame, columns=None)

def compute_wsse(self, frame, columns=None)

def export_to_mar(self, path)

Exports the trained model as a model archive (.mar) to the specified path.

Parameters:
path(str): path to save the trained model

Returns(str): full path to the saved .mar file

def predict(self, frame, columns=None)

Predicts the cluster labels for the observation columns in the given input frame. Creates a new frame with the existing columns and a new predicted column.

Parameters:
frame(Frame): frame used for predicting the values
columns(List[str]): names of the observation columns

Returns(Frame): a new frame containing the original frame's columns and a prediction column

def save(self, path)

def to_dict(self)

def to_json(self)