
sparktk.tkcontext

class TkContext

TK Context

The sparktk Python API centers around the TkContext object. This object holds the session's SparkContext, which is required to work with Spark, and provides the entry point to the main APIs.
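For example, a typical session creates a TkContext and then works through its entry points (this sketch simply combines calls shown in the sections below):

>>> import sparktk

>>> tc = sparktk.TkContext()

>>> frame = tc.frame.create([[1, 3.14, 'blue'], [7, 1.61, 'red']])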

Instance variables

var agg

Convenient access to the aggregation function enumeration (See the group_by operation on sparktk Frames)

Example:

For the given frame, count the groups in column 'b':

>>> frame.inspect()
[#]  a  b        c
=====================
[0]  1  alpha     3.0
[1]  1  bravo     5.0
[2]  1  alpha     5.0
[3]  2  bravo     8.0
[4]  2  charlie  12.0
[5]  2  bravo     7.0
[6]  2  bravo    12.0

>>> b_count = frame.group_by('b', tc.agg.count)

>>> b_count.inspect()
[#]  b        count
===================
[0]  alpha        2
[1]  charlie      1
[2]  bravo        4
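Aggregations other than count follow the same pattern; per-column aggregations can typically be passed as a dictionary (see the group_by operation in the Frame API for the exact argument forms; this sketch assumes the dictionary form):

>>> c_avg = frame.group_by('b', {'c': tc.agg.avg})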

var dicom

Access to create or load sparktk Dicom objects (See the Dicom API)

Example:
>>> d = tc.dicom.import_dcm('path/to/dicom/images/')

>>> type(d)
sparktk.dicom.dicom.Dicom
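The resulting Dicom object exposes its DICOM metadata and pixel data as frames (assuming the metadata and pixeldata attributes described in the Dicom API):

>>> metadata_frame = d.metadata

>>> pixeldata_frame = d.pixeldata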

var examples

Access to some example data structures

Example:

Get a small, built-in sparktk Frame object:

>>> cities = tc.examples.frames.get_cities_frame()

>>> cities.inspect(5)
[#]  rank  city       population_2013  population_2010  change  county
==========================================================================
[0]  1     Portland   609456           583776           4.40%   Multnomah
[1]  2     Salem      160614           154637           3.87%   Marion
[2]  3     Eugene     159190           156185           1.92%   Lane
[3]  4     Gresham    109397           105594           3.60%   Multnomah
[4]  5     Hillsboro  97368            91611            6.28%   Washington

var frame

Access to create or load sparktk Frames (See the Frame API)

Example:
>>> frame = tc.frame.create([[1, 3.14, 'blue'], [7, 1.61, 'red'], [4, 2.72, 'yellow']])

>>> frame.inspect()
[#] C0  C1    C2
=====================
[0]  1  3.14  blue
[1]  7  1.61  red
[2]  4  2.72  yellow


>>> frame2 = tc.frame.import_csv("../datasets/basic.csv")

>>> frame2.inspect(5)
[#]  C0   C1     C2  C3
================================
[0]  132  75.4    0  correction
[1]  133  77.66   0  fitness
[2]  134  71.22   1  proposal
[3]  201  72.3    1  utilization
[4]  202  80.1    0  commission

var graph

Access to create or load sparktk Graphs (See the Graph API)

Example:
>>> g = tc.graph.load('sandbox/my_saved_graph')
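A graph can also be built from vertex and edge frames; a minimal sketch, assuming the create entry point and the GraphFrames-style 'id', 'src', and 'dst' column conventions described in the Graph API:

>>> v = tc.frame.create([[1, 'alice'], [2, 'bob']], [('id', int), ('name', str)])

>>> e = tc.frame.create([[1, 2, 'knows']], [('src', int), ('dst', int), ('relationship', str)])

>>> g2 = tc.graph.create(v, e)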

var jutils

Utilities for working with the remote JVM

var models

Access to create or load the various models available in sparktk (See the Models API)

Examples:

Train an SVM model:

>>> svm_model = tc.models.classification.svm.train(frame, 'label', ['data'])

Train a Random Forest regression model:

>>> rf = tc.models.regression.random_forest_regressor.train(frame,
...                                                         'Class',
...                                                         ['Dim_1', 'Dim_2'],
...                                                         num_trees=1,
...                                                         impurity="variance",
...                                                         max_depth=4,
...                                                         max_bins=100)

Train a KMeans clustering model:

>>> km = tc.models.clustering.kmeans.train(frame, ["data"], k=3)

var sc

Access to the underlying SparkContext

Example:
>>> tc.sc.version
u'1.6.0'
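Since this is a standard SparkContext, the usual Spark APIs are available directly on it, for example:

>>> rdd = tc.sc.parallelize([1, 2, 3, 4])

>>> rdd.count()
4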

var sql_context

Access to the underlying Spark SQLContext

Example:
>>> tc.sql_context.registerDataFrameAsTable(frame.dataframe, "table1")
>>> df2 = tc.sql_context.sql("SELECT field1 AS f1, field2 as f2 from table1")
>>> df2.collect()
[Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]

Methods

def __init__(

self, sc=None, master='local[4]', py_files=None, spark_home=None, sparktk_home=None, pyspark_submit_args=None, app_name='sparktk', other_libs=None, extra_conf=None, use_local_fs=False, debug=None)

Creates a TkContext object

sc(SparkContext):active SparkContext; if not provided, a new SparkContext is created with the rest of the args (see https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html)
master(str):override spark master setting; for ex. 'local[4]' or 'yarn-client'
py_files(list):list of str of paths to python dependencies; note that the current python package will be freshly zipped up and put in a tmp folder for shipping by spark, and then removed
spark_home(str):override $SPARK_HOME, the location of spark
sparktk_home(str):override $SPARKTK_HOME, the location of spark-tk
pyspark_submit_args(str):extra args passed to the pyspark submit
app_name(str):name of spark app that will be created
other_libs(list):other libraries (actual python packages or modules) that are compatible with spark-tk, which need to be added to the spark context. These libraries must be developed for use with spark-tk and have particular methods implemented. (See sparkconf.py _validate_other_libs)
extra_conf(dict):dict for any extra spark conf settings, for ex. {"spark.hadoop.fs.default.name": "file:///"}
use_local_fs(bool):simpler way to specify using the local file system rather than hdfs or other
debug(int or str):provide a port number to attach a debugger to the JVM that gets started

Returns: TkContext

Creating a TkContext requires creating or obtaining a SparkContext object. It is usually recommended to have the TkContext create the SparkContext, since it can provide the proper locations of the sparktk-specific dependencies (i.e. jars). Otherwise, specifying the classpath and jars arguments is left to the user.

Examples:

Creating a TkContext using no arguments will cause a SparkContext to be created using default settings:

>>> import sparktk

>>> tc = sparktk.TkContext()

>>> print tc.sc._conf.toDebugString()
spark.app.name=sparktk
spark.driver.extraClassPath=/opt/lib/spark/lib/*:/opt/spark-tk/sparktk-core/*
spark.driver.extraLibraryPath=/opt/lib/hadoop/lib/native:/opt/lib/spark/lib:/opt/lib/hadoop/lib/native
spark.jars=file:/opt/lib/spark/lib/spark-examples-1.6.0-hadoop2.6.0.jar,file:/opt/lib/spark/lib/spark-assembly.jar,file:/opt/lib/spark/lib/spark-examples.jar,file:/opt/lib/spark-tk/sparktk-core/sparktk-core-1.0-SNAPSHOT.jar,file:/opt/lib/spark-tk/sparktk-core/dependencies/spark-mllib_2.10-1.6.0.jar, ...
spark.master=local[4]
spark.yarn.jar=local:/opt/lib/spark/lib/spark-assembly.jar

Another case with arguments to control some Spark Context settings:

>>> import sparktk

>>> tc = sparktk.TkContext(master='yarn-client',
...                        py_files='mylib.py',
...                        pyspark_submit_args='--jars /usr/lib/custom/extra.jar '
...                                            '--driver-class-path /usr/lib/custom/* '
...                                            '--executor-memory 6g',
...                        extra_conf={'spark.files.overwrite': 'true'},
...                        app_name='myapp')

>>> print tc.sc._conf.toDebugString()
spark.app.name=myapp
spark.driver.extraClassPath=/usr/lib/custom/*:/opt/lib/spark/lib/*:/opt/spark-tk/sparktk-core/*
spark.driver.extraLibraryPath=/opt/lib/hadoop/lib/native:/opt/lib/spark/lib:/opt/lib/hadoop/lib/native
spark.executor.memory=6g
spark.files.overwrite=true
spark.jars=file:/usr/lib/custom/extra.jar,file:/opt/lib/spark/lib/spark-examples-1.6.0-hadoop2.6.0.jar,file:/opt/lib/spark/lib/spark-assembly.jar,file:/opt/lib/spark/lib/spark-examples.jar,file:/opt/lib/spark-tk/sparktk-core/sparktk-core-1.0-SNAPSHOT.jar,file:/opt/lib/spark-tk/sparktk-core/dependencies/spark-mllib_2.10-1.6.0.jar, ...
spark.master=yarn-client
spark.yarn.isPython=true
spark.yarn.jar=local:/opt/lib/spark/lib/spark-assembly.jar
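An already-running SparkContext can also be handed in through the sc parameter, in which case the other arguments are not needed and the classpath and jars setup is the caller's responsibility (my_existing_spark_context below is a placeholder for a SparkContext created elsewhere):

>>> import sparktk

>>> tc = sparktk.TkContext(sc=my_existing_spark_context)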

def load(

self, path, validate_type=None)

Loads an object from the given path

Parameters:
path(str):location of the object to load
validate_type(type):if provided, a RuntimeError is raised if the loaded object is not of that type

Returns(object): the loaded object

Example:
>>> f = tc.load("/home/user/sandbox/superframe")

>>> type(f)
sparktk.frame.frame.Frame
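If validate_type is given, the load fails with a RuntimeError when the object at the path is of a different type; for example (using the Frame class whose import path is shown above):

>>> from sparktk.frame.frame import Frame

>>> f = tc.load("/home/user/sandbox/superframe", validate_type=Frame)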