sparktk logistic_regression

Functions

def train(

frame, observation_columns, label_column, frequency_column=None, num_classes=2, optimizer='LBFGS', compute_covariance=True, intercept=True, feature_scaling=False, threshold=0.5, reg_type='L2', reg_param=0.0, num_iterations=100, convergence_tolerance=0.0001, num_corrections=10, mini_batch_fraction=1.0, step_size=1.0)

Build logistic regression model.

Creates a logistic regression model using the observation columns and label column of the training frame.

Parameters:
frame(Frame):A frame to train the model on.
observation_columns(List[str]):Column(s) containing the observations.
label_column(str):Column name containing the label for each observation.

frequency_column(Option[str]):Optional column containing the frequency of observations.

num_classes(int):Number of classes
optimizer(str):Set type of optimizer. LBFGS - Limited-memory BFGS. LBFGS supports multinomial logistic regression. SGD - Stochastic Gradient Descent. SGD only supports binary logistic regression.
compute_covariance(bool):Compute covariance matrix for the model.
intercept(bool):Add intercept column to training data.
feature_scaling(bool):Perform feature scaling before training model.
threshold(double):Threshold for separating positive predictions from negative predictions.
reg_type(str):Set type of regularization. L1 - regularization with the sum of the absolute values of the coefficients. L2 - regularization with the sum of the squares of the coefficients.
reg_param(double):Regularization parameter
num_iterations(int):Maximum number of iterations
convergence_tolerance(double):Convergence tolerance of iterations for L-BFGS. Smaller value will lead to higher accuracy with the cost of more iterations.
num_corrections(int):Number of corrections used in LBFGS update. Default is 10. Values of less than 3 are not recommended; large values will result in excessive computing time.
mini_batch_fraction(double):Fraction of data to be used for each SGD iteration
step_size(double):Initial step size for SGD. In subsequent steps, the step size decreases by stepSize/sqrt(t)

Returns(LogisticRegressionModel): A LogisticRegressionModel with a summary of the trained model. The data returned is composed of multiple components:

int : numFeatures
    Number of features in the training data
int : numClasses
    Number of classes in the training data
table : summaryTable
    A summary table composed of:
    Frame : CovarianceMatrix (optional)
        Covariance matrix of the trained model. The covariance matrix is the inverse of the Hessian matrix for the trained model. The Hessian matrix is the second-order partial derivatives of the model's log-likelihood function.
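For intuition about what a trained multinomial model computes at prediction time, the underlying math can be sketched in plain Python. This is not sparktk's implementation, and the coefficients below are invented for illustration; sparktk's LogisticRegressionModel performs the equivalent scoring internally:

```python
import math

def softmax_predict(x, weights, intercepts):
    """Score each class with a linear model, then pick the argmax.

    weights: one coefficient list per class; intercepts: one bias per class.
    """
    # Linear score per class: w . x + b
    scores = [sum(w * xi for w, xi in zip(ws, x)) + b
              for ws, b in zip(weights, intercepts)]
    # Numerically stable softmax turns scores into probabilities
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

# Hypothetical 3-class model over two features (Sepal_Length, Petal_Length)
label, probs = softmax_predict([6.3, 4.9],
                               weights=[[-1.0, -2.0], [0.5, 1.0], [1.0, 2.0]],
                               intercepts=[10.0, 0.0, -12.0])
```

The LBFGS optimizer supports this multinomial case (num_classes > 2); SGD is restricted to the binary case.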

Classes

class LogisticRegressionModel

A trained logistic regression model

Example:
>>> rows = [[4.9,1.4,0], [4.7,1.3,0], [4.6,1.5,0], [6.3,4.9,1],[6.1,4.7,1], [6.4,4.3,1], [6.6,4.4,1],[7.2,6.0,2], [7.2,5.8,2], [7.4,6.1,2], [7.9,6.4,2]]
>>> schema = [('Sepal_Length', float),('Petal_Length', float), ('Class', int)]
>>> frame = tc.frame.create(rows, schema)
[===Job Progress===]

Consider the following frame containing three columns.

>>> frame.inspect()
[#]  Sepal_Length  Petal_Length  Class
======================================
[0]           4.9           1.4      0
[1]           4.7           1.3      0
[2]           4.6           1.5      0
[3]           6.3           4.9      1
[4]           6.1           4.7      1
[5]           6.4           4.3      1
[6]           6.6           4.4      1
[7]           7.2           6.0      2
[8]           7.2           5.8      2
[9]           7.4           6.1      2

>>> model = tc.models.classification.logistic_regression.train(frame, ['Sepal_Length', 'Petal_Length'], 'Class', num_classes=3, optimizer='LBFGS', compute_covariance=True)
[===Job Progress===]

>>> model.training_summary
                coefficients  degrees_freedom  standard_errors
intercept_0        -0.780153                1              NaN
Sepal_Length_1   -120.442165                1  28497036.888425
Sepal_Length_0    -63.683819                1  28504715.870243
intercept_1       -90.484405                1              NaN
Petal_Length_0    117.979824                1  36178481.415888
Petal_Length_1    206.339649                1  36172481.900910

                wald_statistic   p_value
intercept_0                NaN       NaN
Sepal_Length_1       -0.000004  1.000000
Sepal_Length_0       -0.000002  1.000000
intercept_1                NaN       NaN
Petal_Length_0        0.000003  0.998559
Petal_Length_1        0.000006  0.998094

>>> model.training_summary.covariance_matrix.inspect()
[#]  Sepal_Length_0      Petal_Length_0      intercept_0
===============================================================
[0]   8.12518826843e+14   -1050552809704907   5.66008788624e+14
[1]  -1.05055305606e+15   1.30888251756e+15   -3.5175956714e+14
[2]   5.66010683868e+14  -3.51761845892e+14  -2.52746479908e+15
[3]   8.12299962335e+14  -1.05039425964e+15   5.66614798332e+14
[4]  -1.05027789037e+15    1308665462990595    -352436215869081
[5]     566011198950063  -3.51665950639e+14   -2527929411221601

[#]  Sepal_Length_1      Petal_Length_1      intercept_1
===============================================================
[0]     812299962806401  -1.05027764456e+15   5.66009303434e+14
[1]  -1.05039450654e+15   1.30866546361e+15  -3.51663671537e+14
[2]     566616693386615   -3.5243849435e+14   -2.5279294114e+15
[3]    8.1208111142e+14   -1050119118230513   5.66615352448e+14
[4]  -1.05011936458e+15   1.30844844687e+15   -3.5234036349e+14
[5]     566617247774244  -3.52342642321e+14   -2528394057347494

>>> predict_frame = model.predict(frame, ['Sepal_Length', 'Petal_Length'])
[===Job Progress===]

>>> predict_frame.inspect()
[#]  Sepal_Length  Petal_Length  Class  predicted_label
=======================================================
[0]           4.9           1.4      0                0
[1]           4.7           1.3      0                0
[2]           4.6           1.5      0                0
[3]           6.3           4.9      1                1
[4]           6.1           4.7      1                1
[5]           6.4           4.3      1                1
[6]           6.6           4.4      1                1
[7]           7.2           6.0      2                2
[8]           7.2           5.8      2                2
[9]           7.4           6.1      2                2

>>> test_metrics = model.test(frame, 'Class', ['Sepal_Length', 'Petal_Length'])
[===Job Progress===]

>>> test_metrics
accuracy         = 1.0
confusion_matrix =             Predicted_0.0  Predicted_1.0  Predicted_2.0
Actual_0.0              3              0              0
Actual_1.0              0              4              0
Actual_2.0              0              0              4
f_measure        = 1.0
precision        = 1.0
recall           = 1.0

>>> model.save("sandbox/logistic_regression")

>>> restored = tc.load("sandbox/logistic_regression")

>>> restored.training_summary.num_features == model.training_summary.num_features
True

The trained model can also be exported to a .mar file, to be used with the scoring engine:

>>> canonical_path = model.export_to_mar("sandbox/logisticRegressionModel.mar")

Instance variables

var compute_covariance

Compute covariance matrix for the model.

var convergence_tolerance

Convergence tolerance of iterations for L-BFGS. Smaller value will lead to higher accuracy with the cost of more iterations.

var feature_scaling

Perform feature scaling before training model.

var frequency_column

Optional column containing the frequency of observations.

var intercept

Add intercept column to training data.

var label_column

Column name containing the label for each observation.

var mini_batch_fraction

Fraction of data to be used for each SGD iteration

var num_classes

Number of classes

var num_corrections

Number of corrections used in LBFGS update. Default is 10. Values of less than 3 are not recommended; large values will result in excessive computing time.

var num_iterations

Maximum number of iterations

var observation_columns

Column(s) containing the observations.

var optimizer

Set type of optimizer. LBFGS - Limited-memory BFGS. LBFGS supports multinomial logistic regression. SGD - Stochastic Gradient Descent. SGD only supports binary logistic regression.

var reg_param

Regularization parameter

var reg_type

Set type of regularization. L1 - regularization with the sum of the absolute values of the coefficients. L2 - regularization with the sum of the squares of the coefficients.
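The two penalty types can be illustrated with a hand computation, independent of sparktk (the weights and reg_param values below are made up for illustration):

```python
def reg_penalty(coefficients, reg_param, reg_type="L2"):
    """Regularization penalty added to the training loss.

    L1: reg_param * sum(|w|)  -- encourages sparse coefficients.
    L2: reg_param * sum(w^2)  -- shrinks coefficients toward zero.
    """
    if reg_type == "L1":
        return reg_param * sum(abs(w) for w in coefficients)
    return reg_param * sum(w * w for w in coefficients)

w = [3.0, -4.0]
l1 = reg_penalty(w, 0.1, "L1")   # 0.1 * (3 + 4)  = 0.7
l2 = reg_penalty(w, 0.1, "L2")   # 0.1 * (9 + 16) = 2.5
```

With reg_param=0.0 (the default) no penalty is applied at all.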

var step_size

Initial step size for SGD. In subsequent steps, the step size decreases by stepSize/sqrt(t)

var threshold

Threshold for separating positive predictions from negative predictions.
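For the binary case, the threshold's role can be sketched in plain Python. This is the underlying math, not sparktk's internals, and the weights below are invented for illustration:

```python
import math

def binary_predict(x, weights, intercept, threshold=0.5):
    """Logistic model: p = sigmoid(w . x + b); predict 1 iff p >= threshold.

    Raising the threshold makes positive predictions more conservative,
    trading recall for precision.
    """
    z = sum(w * xi for w, xi in zip(weights, x)) + intercept
    p = 1.0 / (1.0 + math.exp(-z))
    return (1 if p >= threshold else 0), p

label, p = binary_predict([6.3, 4.9], weights=[1.2, 0.8], intercept=-10.0)
```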

var training_summary

Logistic regression summary table

Methods

def __init__(

self, tc, scala_model)

def export_to_mar(

self, path)

Exports the trained model as a model archive (.mar) to the specified path.

Parameters:
path(str):Path to save the trained model

Returns(str): Full path to the saved .mar file

def predict(

self, frame, observation_columns_predict)

Predict labels for data points using the trained logistic regression model.

Predicts the labels for a test frame using the trained logistic regression model, creating a new frame with the existing columns and a new predicted-label column.

Parameters:
frame(Frame):A frame whose labels are to be predicted. By default, predict is run on the same columns over which the model is trained.
observation_columns_predict(None or list[str]):Column(s) containing the observations whose labels are to be predicted. Default is the columns the model was trained on.

Returns(Frame): Frame containing the original frame's columns and a column with the predicted label.

def save(

self, path)

Save the trained model to the given path.

Parameters:
path(str):Path to save

def test(

self, frame, label_column, observation_columns_test)

Test the trained model against a labeled frame and compute classification metrics.

Parameters:
frame(Frame):Frame whose labels are to be predicted.
label_column(str):Column containing the actual label for each observation.
observation_columns_test(None or list[str]):Column(s) containing the observations whose labels are to be predicted and tested. Default is to test over the columns the model was trained on.

Returns(ClassificationMetricsValue): Object with classification metrics (accuracy, precision, recall, f-measure, and confusion matrix)
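The returned metrics follow the standard confusion-matrix definitions; for the binary case they can be reproduced by hand (a sketch independent of sparktk, with made-up counts):

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)                 # of predicted positives, how many were right
    recall = tp / (tp + fn)                    # of actual positives, how many were found
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# A perfect classifier (no false positives or negatives) scores 1.0 everywhere,
# as in the test_metrics output above:
acc, prec, rec, f1 = binary_metrics(tp=4, fp=0, tn=6, fn=0)
```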

def to_dict(

self)

def to_json(

self)