sparktk logistic_regression
Functions
def train(
frame, observation_columns, label_column, frequency_column=None, num_classes=2, optimizer='LBFGS', compute_covariance=True, intercept=True, feature_scaling=False, threshold=0.5, reg_type='L2', reg_param=0.0, num_iterations=100, convergence_tolerance=0.0001, num_corrections=10, mini_batch_fraction=1.0, step_size=1.0)
Build logistic regression model.
Creates a logistic regression model using the observation columns and label column of the train frame.
frame | (Frame): | A frame to train the model on. |
observation_columns | (List[str]): | Column(s) containing the observations. |
label_column | (str): | Column name containing the label for each observation. |
frequency_column | (Option[str]): | Optional column containing the frequency of observations. |
num_classes | (int): | Number of classes |
optimizer | (str): | Type of optimizer. 'LBFGS' (Limited-memory BFGS) supports multinomial logistic regression; 'SGD' (Stochastic Gradient Descent) supports only binary logistic regression. |
compute_covariance | (bool): | Compute covariance matrix for the model. |
intercept | (bool): | Add intercept column to training data. |
feature_scaling | (bool): | Perform feature scaling before training model. |
threshold | (double): | Threshold for separating positive predictions from negative predictions. |
reg_type | (str): | Type of regularization. 'L1' penalizes the sum of the absolute values of the coefficients; 'L2' penalizes the sum of the squares of the coefficients. |
reg_param | (double): | Regularization parameter |
num_iterations | (int): | Maximum number of iterations |
convergence_tolerance | (double): | Convergence tolerance of iterations for L-BFGS. Smaller value will lead to higher accuracy with the cost of more iterations. |
num_corrections | (int): | Number of corrections used in LBFGS update. Default is 10. Values of less than 3 are not recommended; large values will result in excessive computing time. |
mini_batch_fraction | (double): | Fraction of data to be used for each SGD iteration |
step_size | (double): | Initial step size for SGD. In subsequent steps, the step size decreases by stepSize/sqrt(t) |
Returns | (LogisticRegressionModel): | A LogisticRegressionModel with a summary of the trained model. The data returned is composed of multiple components:
    int : numFeatures
        Number of features in the training data
    int : numClasses
        Number of classes in the training data
    table : summaryTable
        A summary table composed of:
        Frame : CovarianceMatrix (optional)
            Covariance matrix of the trained model. The covariance matrix is the inverse of the Hessian matrix for the trained model; the Hessian matrix is composed of the second-order partial derivatives of the model's log-likelihood function. |
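For a binary model (num_classes=2), the threshold parameter decides how a predicted probability is turned into a class label. Below is a minimal plain-Python sketch of that decision rule; the coefficients and intercept are hypothetical values for illustration, not sparktk output:

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, coefficients, intercept, threshold=0.5):
    """Label 1 when the predicted probability reaches the threshold, else 0."""
    score = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1 if sigmoid(score) >= threshold else 0

coefs, b = [0.8, -1.2], 0.3                      # hypothetical trained values
print(predict_label([2.0, 1.0], coefs, b))       # score 0.7, prob ~0.668 -> 1
print(predict_label([2.0, 1.0], coefs, b, 0.7))  # same prob, stricter threshold -> 0
```

Raising the threshold trades recall for precision on the positive class; 0.5 is the conventional default.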
Classes
class LogisticRegressionModel
A trained logistic regression model
>>> rows = [[4.9,1.4,0], [4.7,1.3,0], [4.6,1.5,0], [6.3,4.9,1],[6.1,4.7,1], [6.4,4.3,1], [6.6,4.4,1],[7.2,6.0,2], [7.2,5.8,2], [7.4,6.1,2], [7.9,6.4,2]]
>>> schema = [('Sepal_Length', float),('Petal_Length', float), ('Class', int)]
>>> frame = tc.frame.create(rows, schema)
[===Job Progress===]
Consider the following frame containing three columns.
>>> frame.inspect()
[#] Sepal_Length Petal_Length Class
======================================
[0] 4.9 1.4 0
[1] 4.7 1.3 0
[2] 4.6 1.5 0
[3] 6.3 4.9 1
[4] 6.1 4.7 1
[5] 6.4 4.3 1
[6] 6.6 4.4 1
[7] 7.2 6.0 2
[8] 7.2 5.8 2
[9] 7.4 6.1 2
>>> model = tc.models.classification.logistic_regression.train(frame, ['Sepal_Length', 'Petal_Length'], 'Class', num_classes=3, optimizer='LBFGS', compute_covariance=True)
[===Job Progress===]
>>> model.training_summary
                coefficients  degrees_freedom  standard_errors  wald_statistic   p_value
intercept_0        -0.780153                1              NaN             NaN       NaN
Sepal_Length_1   -120.442165                1  28497036.888425       -0.000004  1.000000
Sepal_Length_0    -63.683819                1  28504715.870243       -0.000002  1.000000
intercept_1       -90.484405                1              NaN             NaN       NaN
Petal_Length_0    117.979824                1  36178481.415888        0.000003  0.998559
Petal_Length_1    206.339649                1  36172481.900910        0.000006  0.998094
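The wald_statistic and p_value columns follow the standard definitions: each Wald statistic is the coefficient divided by its standard error, and the two-sided p-value is taken from the standard normal distribution. A small plain-Python sketch, checked against the Sepal_Length_1 row of the summary:

```python
import math

def wald_test(coefficient, standard_error):
    """Return the Wald z-statistic and its two-sided normal p-value."""
    z = coefficient / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Values from the Sepal_Length_1 row of the training summary
z, p = wald_test(-120.442165, 28497036.888425)
print("%.6f" % z)  # -> -0.000004, matching the summary table
print(p > 0.999)   # p-value is effectively 1: the coefficient is not significant
```

The huge standard errors here (driven by the near-singular covariance matrix) are why every p-value in this toy example is close to 1.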
>>> model.training_summary.covariance_matrix.inspect()
[#] Sepal_Length_0 Petal_Length_0 intercept_0
===============================================================
[0]  8.12518826843e+14  -1.05055280970e+15   5.66008788624e+14
[1]  -1.05055305606e+15   1.30888251756e+15  -3.51759567140e+14
[2]   5.66010683868e+14  -3.51761845892e+14  -2.52746479908e+15
[3]   8.12299962335e+14  -1.05039425964e+15   5.66614798332e+14
[4]  -1.05027789037e+15   1.30866546299e+15  -3.52436215869e+14
[5]   5.66011198950e+14  -3.51665950639e+14  -2.52792941122e+15
[#] Sepal_Length_1 Petal_Length_1 intercept_1
===============================================================
[0]   8.12299962806e+14  -1.05027764456e+15   5.66009303434e+14
[1]  -1.05039450654e+15   1.30866546361e+15  -3.51663671537e+14
[2]   5.66616693387e+14  -3.52438494350e+14  -2.52792941140e+15
[3]   8.12081111420e+14  -1.05011911823e+15   5.66615352448e+14
[4]  -1.05011936458e+15   1.30844844687e+15  -3.52340363490e+14
[5]   5.66617247774e+14  -3.52342642321e+14  -2.52839405735e+15
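The standard errors in the training summary are the square roots of the diagonal entries of this covariance matrix (the inverse Hessian). A quick plain-Python check against the first diagonal value printed above:

```python
import math

# First diagonal entry of the covariance matrix: variance of Sepal_Length_0
variance = 8.12518826843e+14
standard_error = math.sqrt(variance)
print(standard_error)  # ~2.85e+07, matching the 28504715.870243 reported for Sepal_Length_0
```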
>>> predict_frame = model.predict(frame, ['Sepal_Length', 'Petal_Length'])
[===Job Progress===]
>>> predict_frame.inspect()
[#] Sepal_Length Petal_Length Class predicted_label
=======================================================
[0] 4.9 1.4 0 0
[1] 4.7 1.3 0 0
[2] 4.6 1.5 0 0
[3] 6.3 4.9 1 1
[4] 6.1 4.7 1 1
[5] 6.4 4.3 1 1
[6] 6.6 4.4 1 1
[7] 7.2 6.0 2 2
[8] 7.2 5.8 2 2
[9] 7.4 6.1 2 2
>>> test_metrics = model.test(frame, 'Class', ['Sepal_Length', 'Petal_Length'])
[===Job Progress===]
>>> test_metrics
accuracy = 1.0
confusion_matrix = Predicted_0.0 Predicted_1.0 Predicted_2.0
Actual_0.0 3 0 0
Actual_1.0 0 4 0
Actual_2.0 0 0 4
f_measure = 1.0
precision = 1.0
recall = 1.0
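These metrics can be reproduced directly from the actual and predicted labels. A minimal plain-Python sketch (not the sparktk implementation) using the frame's values:

```python
from collections import Counter

actual    = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
predicted = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]  # the model predicted every label correctly

# Confusion matrix keyed by (actual, predicted) pairs
confusion = Counter(zip(actual, predicted))
accuracy = sum(a == p for a, p in zip(actual, predicted)) / float(len(actual))

print(accuracy)           # -> 1.0
print(confusion[(2, 2)])  # -> 4, the Actual_2.0 / Predicted_2.0 cell
```

With perfect predictions the confusion matrix is diagonal, so precision, recall, and f-measure are all 1.0 as well.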
>>> model.save("sandbox/logistic_regression")
>>> restored = tc.load("sandbox/logistic_regression")
>>> restored.training_summary.num_features == model.training_summary.num_features
True
The trained model can also be exported to a .mar file, to be used with the scoring engine:
>>> canonical_path = model.export_to_mar("sandbox/logisticRegressionModel.mar")
Ancestors (in MRO)
- LogisticRegressionModel
- sparktk.propobj.PropertiesObject
- __builtin__.object
Instance variables
var compute_covariance
Compute covariance matrix for the model.
var convergence_tolerance
Convergence tolerance of iterations for L-BFGS. Smaller value will lead to higher accuracy with the cost of more iterations.
var feature_scaling
Perform feature scaling before training model.
var frequency_column
Optional column containing the frequency of observations.
var intercept
Whether an intercept column was added to the training data.
var label_column
Column name containing the label for each observation.
var mini_batch_fraction
Fraction of data to be used for each SGD iteration
var num_classes
Number of classes
var num_corrections
Number of corrections used in LBFGS update. Default is 10. Values of less than 3 are not recommended; large values will result in excessive computing time.
var num_iterations
Maximum number of iterations
var observation_columns
Column(s) containing the observations.
var optimizer
Type of optimizer. 'LBFGS' (Limited-memory BFGS) supports multinomial logistic regression; 'SGD' (Stochastic Gradient Descent) supports only binary logistic regression.
var reg_param
Regularization parameter
var reg_type
Type of regularization. 'L1' penalizes the sum of the absolute values of the coefficients; 'L2' penalizes the sum of the squares of the coefficients.
var step_size
Initial step size for SGD. In subsequent steps, the step size decreases by stepSize/sqrt(t)
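That decay schedule means SGD takes progressively smaller steps as training proceeds. A one-line illustration of the effective step size at iteration t, mirroring the stepSize/sqrt(t) formula:

```python
import math

def effective_step_size(initial_step_size, t):
    """Step size used at SGD iteration t (t >= 1), per stepSize / sqrt(t)."""
    return initial_step_size / math.sqrt(t)

print(effective_step_size(1.0, 1))    # -> 1.0
print(effective_step_size(1.0, 100))  # -> 0.1
```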
var threshold
Threshold for separating positive predictions from negative predictions.
var training_summary
Logistic regression summary table
Methods
def __init__(
self, tc, scala_model)
def export_to_mar(
self, path)
Exports the trained model as a model archive (.mar) to the specified path.
path | (str): | Path to save the trained model |
Returns | (str): | Full path to the saved .mar file |
def predict(
self, frame, observation_columns_predict)
Predict labels for data points using trained logistic regression model.
Predicts the labels for a test frame using the trained logistic regression model, and creates a new frame containing the existing columns plus a predicted label column.
frame | (Frame): | A frame whose labels are to be predicted. By default, predict is run on the same columns over which the model is trained. |
observation_columns_predict | (None or list[str]): | Column(s) containing the observations whose labels are to be predicted. Default is the observation columns the model was trained on. |
Returns | (Frame): | Frame containing the original frame's columns and a column with the predicted label. |
def save(
self, path)
Save the trained model to path
path | (str): | Path to save |
def test(
self, frame, label_column, observation_columns_test)
Get the predictions for observations in a test frame
frame | (Frame): | Frame whose labels are to be predicted. |
label_column | (str): | Column containing the actual label for each observation. |
observation_columns_test | (None or list[str]): | Column(s) containing the observations whose labels are to be predicted and tested. Default is to test over the columns the logistic regression model was trained on. |
Returns | (ClassificationMetricsValue): | Object with classification metrics |
def to_dict(
self)
def to_json(
self)