sparktk tkcontext
class TkContext
TK Context
The sparktk Python API centers around the TkContext object. This object holds the session's SparkContext, which is required to work with Spark, and serves as the entry point to the main APIs.
Instance variables
var agg
Convenient access to the aggregation function enumeration (See the group_by operation on sparktk Frames)
For the given frame, count the groups in column 'b':
>>> frame.inspect()
[#]  a  b        c
=====================
[0]  1  alpha    3.0
[1]  1  bravo    5.0
[2]  1  alpha    5.0
[3]  2  bravo    8.0
[4]  2  charlie  12.0
[5]  2  bravo    7.0
[6]  2  bravo    12.0
>>> b_count = frame.group_by('b', tc.agg.count)
>>> b_count.inspect()
[#]  b        count
===================
[0]  alpha    2
[1]  charlie  1
[2]  bravo    4
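The aggregation functions can also be applied per column; a brief sketch, assuming group_by accepts a dictionary mapping column names to aggregation functions (see the Frame group_by API for the exact signature):

>>> c_avg = frame.group_by('b', {'c': tc.agg.avg})    # average of column 'c' within each group of 'b' (dict form assumed)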
var dicom
Access to create or load sparktk Dicom objects (See the Dicom API)
>>> d = tc.dicom.import_dcm('path/to/dicom/images/')
>>> type(d)
sparktk.dicom.dicom.Dicom
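A Dicom object pairs the image metadata with the pixel data; a minimal sketch, assuming the metadata and pixeldata attributes expose those as frames (attribute names are assumptions here; see the Dicom API):

>>> d.metadata.count()     # number of metadata records (attribute name assumed)
>>> d.pixeldata.count()    # number of pixel-data records (attribute name assumed)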
var examples
Access to some example data structures
Get a small, built-in sparktk Frame object:
>>> cities = tc.examples.frames.get_cities_frame()
>>> cities.inspect(5)
[#]  rank  city       population_2013  population_2010  change  county
==========================================================================
[0]  1     Portland   609456           583776           4.40%   Multnomah
[1]  2     Salem      160614           154637           3.87%   Marion
[2]  3     Eugene     159190           156185           1.92%   Lane
[3]  4     Gresham    109397           105594           3.60%   Multnomah
[4]  5     Hillsboro  97368            91611            6.28%   Washington
var frame
Access to create or load sparktk Frames (See the Frame API)
>>> frame = tc.frame.create([[1, 3.14, 'blue'], [7, 1.61, 'red'], [4, 2.72, 'yellow']])
>>> frame.inspect()
[#]  C0  C1    C2
=====================
[0]  1   3.14  blue
[1]  7   1.61  red
[2]  4   2.72  yellow
>>> frame2 = tc.frame.import_csv("../datasets/basic.csv")
>>> frame2.inspect(5)
[#]  C0   C1     C2  C3
================================
[0]  132  75.4   0   correction
[1]  133  77.66  0   fitness
[2]  134  71.22  1   proposal
[3]  201  72.3   1   utilization
[4]  202  80.1   0   commission
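Column names and types can also be supplied explicitly rather than inferred; a minimal sketch, assuming import_csv accepts a schema of (name, type) tuples (the column names below are hypothetical):

>>> schema = [('id', int), ('weight', float), ('flag', int), ('label', str)]
>>> frame3 = tc.frame.import_csv("../datasets/basic.csv", schema=schema)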
var graph
Access to create or load sparktk Graphs (See the Graph API)
>>> g = tc.graph.load('sandbox/my_saved_graph')
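Graphs can also be built from vertex and edge frames; a hedged sketch, assuming create takes a vertex frame with an 'id' column and an edge frame with 'src' and 'dst' columns (see the Graph API for the exact requirements):

>>> v = tc.frame.create([[1, 'Anna'], [2, 'Bob']], [('id', int), ('name', str)])
>>> e = tc.frame.create([[1, 2, 'knows']], [('src', int), ('dst', int), ('relationship', str)])
>>> g2 = tc.graph.create(v, e)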
var jutils
Utilities for working with the remote JVM
var models
Access to create or load the various models available in sparktk (See the Models API)
Train an SVM model:
>>> svm_model = tc.models.classification.svm.train(frame, 'label', ['data'])
Train a Random Forest regression model:
>>> rf = tc.models.regression.random_forest_regressor.train(frame,
... 'Class',
... ['Dim_1', 'Dim_2'],
... num_trees=1,
... impurity="variance",
... max_depth=4,
... max_bins=100)
Train a KMeans clustering model:
>>> km = tc.models.clustering.kmeans.train(frame, ["data"], k=3)
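Once trained, a model can score a frame; a brief sketch, assuming the clustering model exposes a predict method that appends a prediction column to the frame (method name and column behavior assumed; see each model's API):

>>> km.predict(frame)    # appends a cluster-assignment column to the frame (behavior assumed)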
var sc
Access to the underlying SparkContext
>>> tc.sc.version
u'1.6.0'
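Because this is a standard pyspark SparkContext, its usual methods are available directly, for example:

>>> rdd = tc.sc.parallelize([1, 2, 3, 4])
>>> rdd.count()
4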
var sql_context
Access to the underlying Spark SQLContext
>>> tc.sql_context.registerDataFrameAsTable(frame.dataframe, "table1")
>>> df2 = tc.sql_context.sql("SELECT field1 AS f1, field2 as f2 from table1")
>>> df2.collect()
[Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
Methods
def __init__(
self, sc=None, master='local[4]', py_files=None, spark_home=None, sparktk_home=None, pyspark_submit_args=None, app_name='sparktk', other_libs=None, extra_conf=None, use_local_fs=False, debug=None)
Creates a TkContext object
sc | (SparkContext): | active SparkContext; if not provided, a new SparkContext is created using the rest of the arguments (see https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html) |
master | (str): | override the spark master setting; for ex. 'local[4]' or 'yarn-client' |
py_files | (list): | list of str paths to python dependencies; note that the current python package will be freshly zipped up, put in a tmp folder for shipping by spark, and then removed |
spark_home | (str): | override $SPARK_HOME, the location of spark |
sparktk_home | (str): | override $SPARKTK_HOME, the location of spark-tk |
pyspark_submit_args | (str): | extra args passed to the pyspark submit |
app_name | (str): | name of the spark app that will be created |
other_libs | (list): | other libraries (actual python packages or modules) that are compatible with spark-tk and need to be added to the spark context. These libraries must be developed for use with spark-tk and have particular methods implemented. (See sparkconf.py _validate_other_libs) |
extra_conf | (dict): | dict of any extra spark conf settings, for ex. {"spark.hadoop.fs.default.name": "file:///"} |
use_local_fs | (bool): | simpler way to specify using the local file system rather than hdfs or other |
debug | (int or str): | provide a port address to attach a debugger to the JVM that gets started |
Returns: | TkContext |
Creating a TkContext requires creating or obtaining a SparkContext object. It is usually recommended to let the TkContext create the SparkContext, since it can supply the proper locations of the sparktk-specific dependencies (i.e. jars); otherwise, specifying the classpath and jars arguments is left to the user.
Creating a TkContext using no arguments will cause a SparkContext to be created using default settings:
>>> import sparktk
>>> tc = sparktk.TkContext()
>>> print tc.sc._conf.toDebugString()
spark.app.name=sparktk
spark.driver.extraClassPath=/opt/lib/spark/lib/*:/opt/spark-tk/sparktk-core/*
spark.driver.extraLibraryPath=/opt/lib/hadoop/lib/native:/opt/lib/spark/lib:/opt/lib/hadoop/lib/native
spark.jars=file:/opt/lib/spark/lib/spark-examples-1.6.0-hadoop2.6.0.jar,file:/opt/lib/spark/lib/spark-assembly.jar,file:/opt/lib/spark/lib/spark-examples.jar,file:/opt/lib/spark-tk/sparktk-core/sparktk-core-1.0-SNAPSHOT.jar,file:/opt/lib/spark-tk/sparktk-core/dependencies/spark-mllib_2.10-1.6.0.jar, ...
spark.master=local[4]
spark.yarn.jar=local:/opt/lib/spark/lib/spark-assembly.jar
Another case with arguments to control some Spark Context settings:
>>> import sparktk
>>> tc = sparktk.TkContext(master='yarn-client',
...                        py_files='mylib.py',
...                        pyspark_submit_args='--jars /usr/lib/custom/extra.jar '
...                                            '--driver-class-path /usr/lib/custom/* '
...                                            '--executor-memory 6g',
...                        extra_conf={'spark.files.overwrite': 'true'},
...                        app_name='myapp')
>>> print tc.sc._conf.toDebugString()
spark.app.name=myapp
spark.driver.extraClassPath=/usr/lib/custom/*:/opt/lib/spark/lib/*:/opt/spark-tk/sparktk-core/*
spark.driver.extraLibraryPath=/opt/lib/hadoop/lib/native:/opt/lib/spark/lib:/opt/lib/hadoop/lib/native
spark.executor.memory=6g
spark.files.overwrite=true
spark.jars=file:/usr/lib/custom/extra.jar,file:/opt/lib/spark/lib/spark-examples-1.6.0-hadoop2.6.0.jar,file:/opt/lib/spark/lib/spark-assembly.jar,file:/opt/lib/spark/lib/spark-examples.jar,file:/opt/lib/spark-tk/sparktk-core/sparktk-core-1.0-SNAPSHOT.jar,file:/opt/lib/spark-tk/sparktk-core/dependencies/spark-mllib_2.10-1.6.0.jar, ...
spark.master=yarn-client
spark.yarn.isPython=true
spark.yarn.jar=local:/opt/lib/spark/lib/spark-assembly.jar
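An already-running SparkContext can also be passed in via the sc argument; a minimal sketch, assuming that context was configured with the sparktk jars on its classpath (as discussed above):

>>> from pyspark import SparkContext
>>> my_sc = SparkContext(master='local[2]', appName='myapp')
>>> tc = sparktk.TkContext(sc=my_sc)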
def load(
self, path, validate_type=None)
Loads an object from the given path
path | (str): | location of the object to load |
validate_type | (type): | if provided, a RuntimeError is raised if the loaded object is not of that type |
Returns | (object): | the loaded object |
>>> f = tc.load("/home/user/sandbox/superframe")
>>> type(f)
sparktk.frame.frame.Frame
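The validate_type argument can guard against loading an unexpected kind of object; a short sketch using the same saved path:

>>> from sparktk.frame.frame import Frame
>>> f = tc.load("/home/user/sandbox/superframe", validate_type=Frame)    # raises RuntimeError if the saved object is not a Frame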