sparktk random_forest_classifier
Functions
def load(
path, tc=<class 'sparktk.arguments.implicit'>)
Loads a RandomForestClassifierModel from the given path.
def train(
frame, label_column, observation_columns, num_classes=2, num_trees=1, impurity='gini', max_depth=4, max_bins=100, seed=None, categorical_features_info=None, feature_subset_category=None)
Creates a Random Forest Classifier Model by training on the given frame
frame | (Frame): | Frame of training data |
label_column | (str): | Column name containing the label for each observation |
observation_columns | (list(str)): | Column(s) containing the observations |
num_classes | (int): | Number of classes for classification. Default is 2 |
num_trees | (int): | Number of trees in the random forest. Default is 1 |
impurity | (str): | Criterion used for information gain calculation. Supported values "gini" or "entropy". Default is "gini" |
max_depth | (int): | Maximum depth of the tree. Default is 4 |
max_bins | (int): | Maximum number of bins used for splitting features. Default is 100 |
seed | (Optional(int)): | Random seed for bootstrapping and choosing feature subsets. Default is a randomly chosen seed |
categorical_features_info | (Optional(Dict(Int:Int))): | Arity of categorical features. Entry (n-> k) indicates that feature 'n' is categorical with 'k' categories indexed from 0:{0,1,...,k-1} |
feature_subset_category | (Optional(str)): | Number of features to consider for splits at each node. Supported values "auto","all","sqrt","log2","onethird". If "auto" is set, this is based on num_trees: if num_trees == 1, set to "all" ; if num_trees > 1, set to "sqrt" |
Returns | (RandomForestClassifierModel): | The trained random forest classifier model |
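Two of the parameters above can be illustrated in plain Python. The sketch below is not sparktk or MLlib code: the function names `gini`, `entropy`, and `resolve_feature_subset` are hypothetical and exist only for illustration. It shows the standard definitions of the two supported impurity criteria and the "auto" rule for feature_subset_category as described in the parameter table.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    proportions = [count / n for count in Counter(labels).values()]
    return sum(p * math.log(1.0 / p, 2) for p in proportions)


def resolve_feature_subset(feature_subset_category, num_trees):
    """Apply the documented "auto" rule; other supported values pass through."""
    supported = ("auto", "all", "sqrt", "log2", "onethird")
    if feature_subset_category not in supported:
        raise ValueError("unsupported value: %r" % (feature_subset_category,))
    if feature_subset_category == "auto":
        # Documented rule: a single tree uses all features, a forest uses sqrt.
        return "all" if num_trees == 1 else "sqrt"
    return feature_subset_category
```

A perfectly balanced binary label column gives the largest impurity under both criteria, and a pure column gives zero, which is why either criterion can drive the choice of splits.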
Random Forest is a supervised ensemble learning algorithm that can be used for binary and multi-class classification. The Random Forest Classifier model is initialized, trained on columns of a frame, used to predict the labels of observations in a frame, and tested by comparing the predicted labels against the true labels. This model runs the MLlib implementation of Random Forest. During training, the decision trees are trained in parallel. During prediction, each tree's prediction is counted as a vote for one class, and the predicted label is the class that receives the most votes. During testing, the labels of the observations are predicted and compared against the true labels using built-in binary and multi-class classification metrics.
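The voting step described above can be sketched in a few lines of plain Python. This is an illustration only, not the MLlib implementation: `majority_vote`, `predict_forest`, and the three threshold "stumps" are hypothetical, and the tie-breaking rule (smallest class label wins) is an assumption made for the sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class that receives the most votes.

    Ties go to the smallest class label; that rule is an assumption of
    this sketch, not documented behaviour.
    """
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    votes = Counter(tree_predictions)
    top = max(votes.values())
    return min(label for label, count in votes.items() if count == top)


def predict_forest(trees, observation):
    """Ask each tree (any callable) for a class, then take the majority."""
    return majority_vote([tree(observation) for tree in trees])


# Hypothetical one-split trees over (Dim_1, Dim_2) observations.
stumps = [
    lambda obs: 1 if obs[0] < 30.0 else 0,
    lambda obs: 1 if obs[1] < 3.0 else 0,
    lambda obs: 1 if obs[0] < 10.0 else 0,
]

print(predict_forest(stumps, [19.8446136104, 2.2985856384]))  # 1 (two votes to one)
```

Because each tree votes independently, a single tree that misclassifies an observation (the third stump here) is outvoted by the others.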
Classes
class RandomForestClassifierModel
A trained Random Forest Classifier model
>>> frame = tc.frame.create([[1,19.8446136104,2.2985856384],[1,16.8973559126,2.6933495054],
... [1,5.5548729596,2.7777687995],[0,46.1810010826,3.1611961917],
... [0,44.3117586448,3.3458963222],[0,34.6334526911,3.6429838715]],
... [('Class', int), ('Dim_1', float), ('Dim_2', float)])
>>> frame.inspect()
[#]  Class  Dim_1          Dim_2
=======================================
[0]      1  19.8446136104  2.2985856384
[1]      1  16.8973559126  2.6933495054
[2]      1   5.5548729596  2.7777687995
[3]      0  46.1810010826  3.1611961917
[4]      0  44.3117586448  3.3458963222
[5]      0  34.6334526911  3.6429838715
>>> model = tc.models.classification.random_forest_classifier.train(frame, 'Class', ['Dim_1', 'Dim_2'], num_classes=2, num_trees=1, impurity="entropy", max_depth=4, max_bins=100)
>>> predicted_frame = model.predict(frame, ['Dim_1', 'Dim_2'])
>>> predicted_frame.inspect()
[#]  Class  Dim_1          Dim_2         predicted_class
========================================================
[0]      1  19.8446136104  2.2985856384                1
[1]      1  16.8973559126  2.6933495054                1
[2]      1   5.5548729596  2.7777687995                1
[3]      0  46.1810010826  3.1611961917                0
[4]      0  44.3117586448  3.3458963222                0
[5]      0  34.6334526911  3.6429838715                0
>>> test_metrics = model.test(frame, ['Dim_1','Dim_2'])
>>> test_metrics
accuracy         = 1.0
confusion_matrix =             Predicted_Pos  Predicted_Neg
Actual_Pos              3              0
Actual_Neg              0              3
f_measure        = 1.0
precision        = 1.0
recall           = 1.0
>>> model.save("sandbox/randomforestclassifier")
>>> restored = tc.load("sandbox/randomforestclassifier")
>>> restored.label_column == model.label_column
True
>>> restored.seed == model.seed
True
>>> set(restored.observation_columns) == set(model.observation_columns)
True
The trained model can also be exported to a .mar file, to be used with the scoring engine:
>>> canonical_path = model.export_to_mar("sandbox/rfClassifier.mar")
Ancestors (in MRO)
- RandomForestClassifierModel
- sparktk.propobj.PropertiesObject
- __builtin__.object
Instance variables
var categorical_features_info
categorical feature dictionary used during model training
var feature_subset_category
feature subset category of the trained model
var impurity
impurity value of the trained model
var label_column
column containing the label used for model training
var max_bins
maximum bins in the trained model
var max_depth
maximum depth of the trained model
var num_classes
number of classes in the trained model
var num_trees
number of trees in the trained model
var observation_columns
observation columns used for model training
var seed
seed used during training of the model
Methods
def __init__(
self, tc, scala_model)
def export_to_mar(
self, path)
Exports the trained model as a model archive (.mar) to the specified path.
path | (str): | Path to save the trained model |
Returns | (str): | Full path to the saved .mar file |
def predict(
self, frame, columns=None)
Predicts the labels for a test frame using the trained Random Forest Classifier model and returns a new frame containing the existing columns plus a new column of predicted labels.
frame | (Frame): | A frame whose labels are to be predicted. By default, predict is run on the same columns over which the model is trained. |
columns | (Optional(list[str])): | Column(s) containing the observations whose labels are to be predicted. By default, labels are predicted over the columns the model was trained on. |
Returns | (Frame): | A new frame consisting of the existing columns of the frame and a new column with predicted label for each observation. |
def save(
self, path)
Save the trained model to path
path | (str): | Path to save |
def test(
self, frame, columns=None)
Predict test frame labels and return metrics.
frame | (Frame): | The frame whose labels are to be predicted |
columns | (Optional(list[str])): | Column(s) containing the observations whose labels are to be predicted. By default, labels are predicted over the columns the model was trained on. |
Returns | (ClassificationMetricsValue): | Binary classification metrics comprising: accuracy (double): the proportion of predictions that are correct; confusion_matrix (dictionary): a table describing the performance of the classification model; f_measure (double): the harmonic mean of precision and recall; precision (double): the proportion of predicted positive instances that are correctly identified; recall (double): the proportion of actual positive instances that are correctly identified. |
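The relationship between the confusion matrix and the other returned metrics can be checked by hand. The sketch below is plain Python, not the sparktk implementation: `binary_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption of the sketch. It uses the standard definitions given in the table above, with f_measure as the harmonic mean of precision and recall.

```python
def binary_metrics(tp, fn, fp, tn):
    """Derive accuracy, precision, recall and f_measure from confusion counts.

    tp/fn are the Actual_Pos row (Predicted_Pos, Predicted_Neg);
    fp/tn are the Actual_Neg row (Predicted_Pos, Predicted_Neg).
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / float(total) if total else 0.0
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2.0 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0  # assumption: undefined harmonic mean reported as 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Counts taken from the confusion matrix in the example earlier in this page.
metrics = binary_metrics(tp=3, fn=0, fp=0, tn=3)
```

With those counts every metric evaluates to 1.0, matching the test_metrics output shown in the example.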
def to_dict(
self)
def to_json(
self)