sparktk kmeans
Functions
def load(
path, tc=<class 'sparktk.arguments.implicit'>)
Loads a KMeansModel from the given path
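A saved model is typically restored through the generic tc.load, as shown in the class example below; a direct call to this module-level function would look roughly like the following sketch (the import path is inferred from the tc.models.clustering.kmeans namespace used below, and the path assumes a model was previously saved there):

>>> from sparktk.models.clustering.kmeans import load
>>> model = load("sandbox/kmeans1", tc)  # tc is the TkContext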
def train(
frame, columns, k=2, scalings=None, max_iterations=20, convergence_tolerance=0.0001, seed=None, init_mode='k-means||')
Creates a KMeansModel by training on the given frame
frame | (Frame): | frame of training data |
columns | (List[str]): | names of columns containing the observations for training |
k | (Optional(int)): | number of clusters |
scalings | (Optional(List[float])): | column scalings for each of the observation columns; each scaling value is multiplied by the corresponding value in the observation column (see the sketch following this table) |
max_iterations | (Optional(int)): | number of iterations for which the algorithm should run |
convergence_tolerance | (Optional(float)): | distance threshold within which k-means is considered to have converged. Default is 1e-4. If all centers move less than this Euclidean distance, iteration stops |
seed | (Optional(long)): | seed for randomness |
init_mode | (Optional(str)): | the initialization technique, either "k-means||" (the default) or "random" |
Returns | (KMeansModel): | trained KMeans model |
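Because each scaling factor is multiplied into its observation column before distances are computed, scalings can keep a wide-ranged column from dominating the clustering. A minimal sketch, with hypothetical data and column names not taken from the example below:

>>> f = tc.frame.create([[1.0, 250.0], [1.5, 300.0], [9.0, 2500.0]],
...                     [("width", float), ("weight", float)])
>>> m = tc.models.clustering.kmeans.train(f, ["width", "weight"], k=2,
...                                       scalings=[1.0, 0.01], seed=5)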
Classes
class KMeansModel
A trained KMeans model
>>> frame = tc.frame.create([[2, "ab"],
...                          [1, "cd"],
...                          [7, "ef"],
...                          [1, "gh"],
...                          [9, "ij"],
...                          [2, "kl"],
...                          [0, "mn"],
...                          [6, "op"],
...                          [5, "qr"]],
...                         [("data", float), ("name", str)])
>>> model = tc.models.clustering.kmeans.train(frame, ["data"], 3, seed=5)
>>> model.k
3
>>> sizes = model.compute_sizes(frame)
>>> sizes
[2, 2, 5]
>>> wsse = model.compute_wsse(frame)
>>> wsse
5.3
>>> predicted_frame = model.predict(frame)
>>> predicted_frame.inspect()
[#]  data  name  cluster
========================
[0]   2.0  ab          1
[1]   1.0  cd          1
[2]   7.0  ef          0
[3]   1.0  gh          1
[4]   9.0  ij          0
[5]   2.0  kl          1
[6]   0.0  mn          1
[7]   6.0  op          2
[8]   5.0  qr          2
>>> model.add_distance_columns(predicted_frame)
>>> predicted_frame.inspect()
[#]  data  name  cluster  distance0  distance1  distance2
=========================================================
[0]   2.0  ab          1       36.0       0.64      12.25
[1]   1.0  cd          1       49.0       0.04      20.25
[2]   7.0  ef          0        1.0      33.64       2.25
[3]   1.0  gh          1       49.0       0.04      20.25
[4]   9.0  ij          0        1.0      60.84      12.25
[5]   2.0  kl          1       36.0       0.64      12.25
[6]   0.0  mn          1       64.0       1.44      30.25
[7]   6.0  op          2        4.0      23.04       0.25
[8]   5.0  qr          2        9.0      14.44       0.25
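Each distanceN column above is the squared Euclidean distance from the row's observation to centroid N (row 0, for example, is 0.64 = (2.0 - 1.2)**2 away from its own cluster's centroid). That also makes the WSSE of 5.3 reported above easy to verify with plain Python; this is just a hand check, not part of the sparktk API:

>>> data = [2.0, 1.0, 7.0, 1.0, 9.0, 2.0, 0.0, 6.0, 5.0]
>>> labels = [1, 1, 0, 1, 0, 1, 1, 2, 2]  # the cluster column above
>>> centers = dict((c, sum(x for x, l in zip(data, labels) if l == c)
...                 / float(labels.count(c))) for c in set(labels))
>>> sorted(centers.items())  # each centroid is the mean of its members
[(0, 8.0), (1, 1.2), (2, 5.5)]
>>> round(sum((x - centers[l]) ** 2 for x, l in zip(data, labels)), 1)
5.3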
>>> model.columns
[u'data']
>>> model.scalings # None
>>> centroids = model.centroids
>>> model.save("sandbox/kmeans1")
>>> restored = tc.load("sandbox/kmeans1")
>>> restored.centroids == centroids
True
>>> restored_sizes = restored.compute_sizes(predicted_frame)
>>> restored_sizes == sizes
True
>>> predicted_frame2 = restored.predict(frame)
>>> predicted_frame2.inspect()
[#]  data  name  cluster
========================
[0]   2.0  ab          1
[1]   1.0  cd          1
[2]   7.0  ef          0
[3]   1.0  gh          1
[4]   9.0  ij          0
[5]   2.0  kl          1
[6]   0.0  mn          1
[7]   6.0  op          2
[8]   5.0  qr          2
>>> canonical_path = model.export_to_mar("sandbox/Kmeans.mar")
Ancestors (in MRO)
- KMeansModel
- sparktk.propobj.PropertiesObject
- __builtin__.object
Instance variables
var centroids
var columns
var initialization_mode
var k
var max_iterations
var scalings
Methods
def __init__(
self, tc, scala_model)
def add_distance_columns(
self, frame, columns=None)
Adds squared-distance columns (distance0 through distance(k-1)) to the given frame, one per cluster centroid
def compute_sizes(
self, frame, columns=None)
Computes the number of observations assigned to each cluster in the given frame, returned as a list of counts
def compute_wsse(
self, frame, columns=None)
Computes the within-set sum of squared error (WSSE) of the model over the given frame
def export_to_mar(
self, path)
Exports the trained model as a model archive (.mar) to the specified path
path | (str): | Path to save the trained model |
Returns | (str): | Full path to the saved .mar file |
def predict(
self, frame, columns=None)
Predicts the cluster for the observation columns in each row of the given frame. Creates a new frame with the existing columns plus a 'cluster' column holding the prediction.
frame | (Frame): | Frame used for predicting the values |
columns | (Optional(List[str])): | names of the observation columns; defaults to the columns the model was trained on |
Returns | (Frame): | A new frame containing the original frame's columns and a prediction column |
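If the frame being scored stores its observations under different column names than the training frame, pass them explicitly. A hypothetical sketch (new_frame and its value column are illustrative, not from the example above):

>>> new_frame = tc.frame.create([[3.5], [8.0]], [("value", float)])
>>> scored = model.predict(new_frame, ["value"])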
def save(
self, path)
Saves the model to the given path, from which tc.load can restore it
def to_dict(
self)
Returns the model's properties as a dict
def to_json(
self)
Returns the model's properties as a JSON string