sparktk pca
Functions
def load(
path, tc=<class 'sparktk.arguments.implicit'>)
Loads a PcaModel from the given path
def train(
frame, columns, mean_centered=True, k=None)
Creates a PcaModel by training on the given frame
frame | (Frame): | A frame of training data. |
columns | (str or list[str]): | Names of columns containing the observations for training. |
mean_centered | (bool): | Whether to mean-center the columns. Default is True. |
k | (int): | Principal component count. Default is the number of observation columns. |
Returns | (PcaModel): | The trained PCA model |
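The following NumPy sketch illustrates what the trained model's properties hold. It is not sparktk code: the function name pca_train_sketch and the dictionary it returns are hypothetical, and it assumes the model is obtained from a singular value decomposition of the (optionally mean-centered) observation matrix. That assumption is consistent with the column_means and singular_values printed in the PcaModel example, but sparktk's internal implementation may differ.

```python
import numpy as np

def pca_train_sketch(rows, mean_centered=True, k=None):
    """Illustrative stand-in for pca.train on an in-memory table (not sparktk code)."""
    x = np.asarray(rows, dtype=float)
    n_cols = x.shape[1]
    k = n_cols if k is None else k  # documented default: number of observation columns
    if not 1 <= k <= n_cols:
        raise ValueError("k must be between 1 and the number of columns")
    column_means = x.mean(axis=0)
    centered = x - column_means if mean_centered else x
    # Full SVD: centered = U @ diag(s) @ Vt, where Vt has shape (n_cols, n_cols).
    _, s, vt = np.linalg.svd(centered, full_matrices=True)
    # With fewer rows than columns, s is shorter than n_cols; pad it with zeros.
    s = np.concatenate([s, np.zeros(n_cols - s.size)])
    return {
        "column_means": column_means,
        "singular_values": s[:k],
        "right_singular_vectors": vt[:k].T,  # one column per principal component
    }

# Same data as the PcaModel example.
rows = [[2.6, 1.7, 0.3, 1.5, 0.8, 0.7],
        [3.3, 1.8, 0.4, 0.7, 0.9, 0.8],
        [3.5, 1.7, 0.3, 1.7, 0.6, 0.4],
        [3.7, 1.0, 0.5, 1.2, 0.6, 0.3],
        [1.5, 1.2, 0.5, 1.4, 0.6, 0.4]]
model = pca_train_sketch(rows, k=4)
print(model["column_means"])
print(model["singular_values"])
```

The sign of each singular vector is not unique, so a different SVD routine may return columns of right_singular_vectors with flipped signs; the singular values are unaffected.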
Classes
class PcaModel
Principal Component Analysis Model
>>> frame = tc.frame.create([[2.6,1.7,0.3,1.5,0.8,0.7],
... [3.3,1.8,0.4,0.7,0.9,0.8],
... [3.5,1.7,0.3,1.7,0.6,0.4],
... [3.7,1.0,0.5,1.2,0.6,0.3],
... [1.5,1.2,0.5,1.4,0.6,0.4]],
... [("1", float), ("2", float), ("3", float), ("4", float), ("5", float), ("6", float)])
-etc-
>>> frame.inspect()
[#] 1 2 3 4 5 6
=================================
[0] 2.6 1.7 0.3 1.5 0.8 0.7
[1] 3.3 1.8 0.4 0.7 0.9 0.8
[2] 3.5 1.7 0.3 1.7 0.6 0.4
[3] 3.7 1.0 0.5 1.2 0.6 0.3
[4] 1.5 1.2 0.5 1.4 0.6 0.4
>>> model = tc.models.dimreduction.pca.train(frame, ['1','2','3','4','5','6'], mean_centered=True, k=4)
>>> model.columns
[u'1', u'2', u'3', u'4', u'5', u'6']
>>> model.column_means
[2.92, 1.48, 0.4, 1.3, 0.7, 0.52]
>>> model.singular_values
[1.804817009663242, 0.8835344148403884, 0.7367461843294286, 0.15234027471064396]
>>> model.right_singular_vectors
[[-0.9906468642089336, 0.11801374544146298, 0.02564701035332026, 0.04852509627553534], [-0.07735139793384983, -0.6023104604841426, 0.6064054412059492, -0.4961696216881456], [0.028850639537397756, 0.07268697636708586, -0.24463936400591005, -0.17103491337994484], [0.10576208410025367, 0.5480329468552814, 0.7523059089872701, 0.2866144016081254], [-0.024072151446194616, -0.30472267167437644, -0.011259366445851784, 0.48934541040601887], [-0.00617295395184184, -0.47414707747028795, 0.0753345822621543, 0.6329307498105843]]
>>> predicted_frame = model.predict(frame, mean_centered=True, t_squared_index=True, columns=['1','2','3','4','5','6'], k=3)
-etc-
>>> predicted_frame.inspect()
[#] 1 2 3 4 5 6 p_1 p_2
===================================================================
[0] 1.5 1.2 0.5 1.4 0.6 0.4 1.44498618058 0.150509319195
[1] 2.6 1.7 0.3 1.5 0.8 0.7 0.314738695012 -0.183753549226
[2] 3.5 1.7 0.3 1.7 0.6 0.4 -0.549024749481 0.235254068619
[3] 3.3 1.8 0.4 0.7 0.9 0.8 -0.471198363594 -0.670419608227
[4] 3.7 1.0 0.5 1.2 0.6 0.3 -0.739501762517 0.468409769639
<BLANKLINE>
[#] p_3 t_squared_index
=====================================
[0] -0.163359836968 0.719188122813
[1] 0.312561560113 0.253649649849
[2] 0.465756549839 0.563086507007
[3] -0.228746130528 0.740327252782
[4] -0.386212142456 0.723748467549
>>> model.save('sandbox/pca1')
>>> model2 = tc.load('sandbox/pca1')
>>> model2.k
4
>>> predicted_frame2 = model2.predict(frame, mean_centered=True, t_squared_index=True, columns=['1','2','3','4','5','6'], k=3)
>>> predicted_frame2.inspect()
[#] 1 2 3 4 5 6 p_1 p_2
===================================================================
[0] 1.5 1.2 0.5 1.4 0.6 0.4 1.44498618058 0.150509319195
[1] 2.6 1.7 0.3 1.5 0.8 0.7 0.314738695012 -0.183753549226
[2] 3.5 1.7 0.3 1.7 0.6 0.4 -0.549024749481 0.235254068619
[3] 3.3 1.8 0.4 0.7 0.9 0.8 -0.471198363594 -0.670419608227
[4] 3.7 1.0 0.5 1.2 0.6 0.3 -0.739501762517 0.468409769639
<BLANKLINE>
[#] p_3 t_squared_index
=====================================
[0] -0.163359836968 0.719188122813
[1] 0.312561560113 0.253649649849
[2] 0.465756549839 0.563086507007
[3] -0.228746130528 0.740327252782
[4] -0.386212142456 0.723748467549
>>> canonical_path = model.export_to_mar("sandbox/pca.mar")
Ancestors (in MRO)
- PcaModel
- sparktk.propobj.PropertiesObject
- __builtin__.object
Instance variables
var column_means
var columns
var k
var mean_centered
var right_singular_vectors
var singular_values
Methods
def __init__(
self, tc, scala_model)
def export_to_mar(
self, path)
Exports the trained model as a model archive (.mar) to the specified path
path | (str): | Path to save the trained model |
Returns | (str): | Full path to the saved .mar file |
def predict(
self, frame, columns=None, mean_centered=None, k=None, t_squared_index=False)
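Projects the observation columns of the given input frame onto the principal components of the trained model. Creates a new frame with the existing columns plus one column per principal component (p_1 through p_k) and, if requested, a t_squared_index column.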
frame | (Frame): | Frame used for predicting the values |
columns | (List[str]): | Names of the observation columns. |
mean_centered | (bool): | Whether to mean-center the columns. Default is True. |
k | (int): | The number of principal components to compute; must be <= the k used in training. Default is the trained k. |
t_squared_index | (bool): | Whether to compute the t-squared index. Default is False. |
Returns | (Frame): | A new frame containing the original frame's columns, the principal component columns (p_1 through p_k) and, if requested, a t_squared_index column |
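The p_1 to p_3 and t_squared_index values in the example output can be reproduced by hand from the printed model properties. The sketch below is not sparktk code: the helper project_row is hypothetical, and it assumes that each p_j is the dot product of the mean-centered row with column j of right_singular_vectors, and that the t-squared index is the sum of (p_j / singular_value_j)^2 over the first k components. Both assumptions agree with the example numbers.

```python
# Model values copied from the PcaModel example (trained with k=4).
COLUMN_MEANS = [2.92, 1.48, 0.4, 1.3, 0.7, 0.52]
SINGULAR_VALUES = [1.804817009663242, 0.8835344148403884,
                   0.7367461843294286, 0.15234027471064396]
RIGHT_SINGULAR_VECTORS = [
    [-0.9906468642089336, 0.11801374544146298, 0.02564701035332026, 0.04852509627553534],
    [-0.07735139793384983, -0.6023104604841426, 0.6064054412059492, -0.4961696216881456],
    [0.028850639537397756, 0.07268697636708586, -0.24463936400591005, -0.17103491337994484],
    [0.10576208410025367, 0.5480329468552814, 0.7523059089872701, 0.2866144016081254],
    [-0.024072151446194616, -0.30472267167437644, -0.011259366445851784, 0.48934541040601887],
    [-0.00617295395184184, -0.47414707747028795, 0.0753345822621543, 0.6329307498105843],
]

def project_row(row, k, mean_centered=True, t_squared_index=False):
    """Hypothetical helper: project one observation onto the first k components."""
    if not 1 <= k <= len(SINGULAR_VALUES):
        raise ValueError("k must be between 1 and the k used in training")
    # Subtract the training column means when mean centering is requested.
    x = [v - m for v, m in zip(row, COLUMN_MEANS)] if mean_centered else list(row)
    # p_j is the dot product of the row with column j of the singular vectors.
    scores = [sum(x[i] * RIGHT_SINGULAR_VECTORS[i][j] for i in range(len(x)))
              for j in range(k)]
    if not t_squared_index:
        return scores, None
    # Assumed formula: squared scores, each scaled by its singular value, summed.
    t2 = sum((p / s) ** 2 for p, s in zip(scores, SINGULAR_VALUES))
    return scores, t2

# The row [1.5, 1.2, 0.5, 1.4, 0.6, 0.4] appears as row [0] of predicted_frame.
scores, t2 = project_row([1.5, 1.2, 0.5, 1.4, 0.6, 0.4], k=3, t_squared_index=True)
print(scores)
print(t2)
```

Because zip stops at the shorter sequence, only the first k singular values enter the t-squared sum, which mirrors the requirement that k not exceed the k used in training.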
def save(
self, path)
def to_dict(
self)
def to_json(
self)