sparktk frame
Functions
def create(
data, schema=None, validate_schema=False, tc=<class 'sparktk.arguments.implicit'>)
Creates a frame from the given data and schema. If no schema data types are provided, the schema is inferred based on the data in the first 100 rows.
If schema validation is enabled, all data is checked to ensure that it matches the schema. If the data does not match the schema's data type, an attempt is made to cast it to the proper type. When the data cannot be cast to the schema's data type, the item will be missing (None) in the frame.
data | (List of row data or RDD): | Data source |
schema | (Optional[list[tuple(str, type)] or list[str]]): | Optionally specify a schema (list of tuples of string column names and data types), a list of column names (the column data types will be inferred), or None (column data types will be inferred and column names will be numbered like C0, C1, C2, etc.). |
validate_schema | (Optional(bool)): | When True, all data is checked to ensure that it matches the schema. If the data does not match the schema's data type, an attempt is made to cast it to the proper type. When the data cannot be cast to the schema's data type, a missing value (None) is inserted in its place. Defaults to False. |
tc | (TkContext): | TK context |
Returns | (Frame): | Frame loaded with the specified data |
Create a frame with the specified data.
>>> data = [["Bob", 30, 8], ["Jim", 45, 9.5], ["Sue", 25, 7], ["George", 15, 6], ["Jennifer", 18, 8.5]]
>>> frame = tc.frame.create(data)
Since no schema is provided, the schema will be inferred. Note that the data set has a mix of integers and floats in the third column. The schema will use the most general data type from the data that it sees, so in this example, the column is treated as a float.
>>> frame.schema
[('C0', <type 'str'>), ('C1', <type 'int'>), ('C2', <type 'float'>)]
>>> frame.inspect()
[#] C0 C1 C2
======================
[0] Bob 30 8
[1] Jim 45 9.5
[2] Sue 25 7
[3] George 15 6
[4] Jennifer 18 8.5
We could also enable schema validation, which checks the data against the schema. If the data does not match the schema's data type, it attempts to cast the data to the proper data type.
>>> frame = tc.frame.create(data, validate_schema=True)
In this example with schema validation enabled, the integers in column C2 get cast to floats:
>>> frame.inspect()
[#] C0 C1 C2
======================
[0] Bob 30 8.0
[1] Jim 45 9.5
[2] Sue 25 7.0
[3] George 15 6.0
[4] Jennifer 18 8.5
We could also provide a list of column names when creating the frame. When a list of column names is provided, the data types for the schema are still inferred, but the columns in the schema are labeled with the specified names.
>>> frame = tc.frame.create(data, schema=["name", "age", "shoe_size"], validate_schema=True)
>>> frame.schema
[('name', <type 'str'>), ('age', <type 'int'>), ('shoe_size', <type 'float'>)]
>>> frame.inspect()
[#] name age shoe_size
=============================
[0] Bob 30 8.0
[1] Jim 45 9.5
[2] Sue 25 7.0
[3] George 15 6.0
[4] Jennifer 18 8.5
Note that if validate_schema is enabled and a value cannot be parsed as the specified data type in the schema, it will show up as missing (None). For example, consider the following frame where columns are defined as integers, but the data specified has a string in the second row.
>>> data = [[1, 2, 3], [4, "five", 6]]
>>> schema = [("a", int), ("b", int), ("c", int)]
>>> frame = tc.frame.create(data, schema, validate_schema=True)
>>> frame.inspect()
[#] a b c
===============
[0] 1 2 3
[1] 4 None 6
Note that the spot where the string was located now has its value missing (None), since it couldn't be parsed to an integer. If validate_schema were disabled, no attempt would be made to parse the data to the data type specified by the schema, and further frame operations may fail due to the data type discrepancy.
def import_csv(
path, delimiter=',', header=False, infer_schema=True, schema=None, datetime_format="yyyy-MM-dd'T'HH:mm:ss.SSSX", tc=<class 'sparktk.arguments.implicit'>)
Creates a frame with data from a csv file.
path | (str): | Full path to the csv file |
delimiter | (Optional[str]): | A string which indicates the separation of data fields. This is usually a single character and could be a non-visible character, such as a tab. The default delimiter is a comma (,). |
header | (Optional[bool]): | Boolean value indicating if the first line of the file will be used to name columns, and not be included in the data. The default value is false. |
infer_schema | (Optional[bool]): | Boolean value indicating if the column types will be automatically inferred. It requires one extra pass over the data and is false by default. |
schema | (Optional[List[tuple(str, type)]]): | Optionally specify the schema for the dataset. Number of columns specified in the schema must match the number of columns in the csv file provided. If the value from the csv file cannot be converted to the data type specified by the schema (for example, if the csv file has a string, and the schema specifies an int), the value will show up as missing (None) in the frame. |
datetime_format | (str): | String specifying how date/time columns are formatted, using the java.text.SimpleDateFormat specified at https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html |
Returns | (Frame): | Frame that contains the data from the csv file |
Load a frame from a csv file by specifying the path to the file, delimiter, and options that specify that there is a header and to infer the schema based on the data.
>>> file_path = "../datasets/cities.csv"
>>> frame = tc.frame.import_csv(file_path, "|", header=True, infer_schema=True)
-etc-
>>> frame.inspect()
[#] rank city population_2013 population_2010 change county
============================================================================
[0] 1 Portland 609456 583776 4.40% Multnomah
[1] 2 Salem 160614 154637 3.87% Marion
[2] 3 Eugene 159190 156185 1.92% Lane
[3] 4 Gresham 109397 105594 3.60% Multnomah
[4] 5 Hillsboro 97368 91611 6.28% Washington
[5] 6 Beaverton 93542 89803 4.16% Washington
[6] 15 Grants Pass 35076 34533 1.57% Josephine
[7] 16 Oregon City 34622 31859 8.67% Clackamas
[8] 17 McMinnville 33131 32187 2.93% Yamhill
[9] 18 Redmond 27427 26215 4.62% Deschutes
>>> frame.schema
[('rank', <type 'int'>), ('city', <type 'str'>), ('population_2013', <type 'int'>), ('population_2010', <type 'int'>), ('change', <type 'str'>), ('county', <type 'str'>)]
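If the column types are already known, schema inference can be skipped by supplying the schema explicitly (a sketch, assuming the same cities.csv layout shown above):
>>> schema = [('rank', int), ('city', str), ('population_2013', int),
...           ('population_2010', int), ('change', str), ('county', str)]
>>> frame = tc.frame.import_csv(file_path, "|", header=True, infer_schema=False, schema=schema)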
def import_hbase(
table_name, schema, start_tag=None, end_tag=None, tc=<class 'sparktk.arguments.implicit'>)
Import data from an HBase table into a frame.
table_name | (str): | HBase table name |
schema | (list[list[str, str, type]]): | HBase schema as a list of lists (column family, column name, data type for the cell value) |
start_tag | (Optional(str)): | optional start tag for filtering |
end_tag | (Optional(str)): | optional end tag for filtering |
Returns | (Frame): | frame with data from the HBase table |
Load data into a frame from an HBase table.
>>> frame = tc.frame.import_hbase("demo_test_hbase", [["test_family", "a", int],["test_family", "b", float], ["test_family", "c", int],["test_family", "d", int]])
-etc-
>>> frame.inspect()
[#] test_family_a test_family_b test_family_c test_family_d
===============================================================
[0] 1 0.2 -2 5
[1] 2 0.4 -1 6
[2] 3 0.6 0 7
[3] 4 0.8 1 8
Use of start_tag and end_tag (HBase creates a unique row id for data in HBase tables):
start_tag: the unique row id from which the row scan should start
end_tag: the unique row id at which the row scan should end
Assume you already have data in the HBase table "test_startendtag" under the "startendtag" column family, with a single column named "number". The column contains values from 1 to 99; the row ids are generated by HBase. A few sample rows from the HBase table are shown below:
hbase(main):002:0> scan "test_startendtag"
ROW COLUMN+CELL
0 column=startendtag:number, timestamp=1465342524846, value=1
1 column=startendtag:number, timestamp=1465342524846, value=25
10 column=startendtag:number, timestamp=1465342524847, value=51
103 column=startendtag:number, timestamp=1465342524851, value=98
107 column=startendtag:number, timestamp=1465342524851, value=99
11 column=startendtag:number, timestamp=1465342524851, value=75
12 column=startendtag:number, timestamp=1465342524846, value=4
13 column=startendtag:number, timestamp=1465342524846, value=28
14 column=startendtag:number, timestamp=1465342524847, value=52
15 column=startendtag:number, timestamp=1465342524851, value=76
16 column=startendtag:number, timestamp=1465342524846, value=5
17 column=startendtag:number, timestamp=1465342524846, value=29
18 column=startendtag:number, timestamp=1465342524847, value=53
19 column=startendtag:number, timestamp=1465342524851, value=77
2 column=startendtag:number, timestamp=1465342524847, value=49
20 column=startendtag:number, timestamp=1465342524846, value=6
21 column=startendtag:number, timestamp=1465342524846, value=30
>>> frame = tc.frame.import_hbase("test_startendtag", [["startendtag", "number", int]], start_tag="20", end_tag="50")
-etc-
>>> frame.count()
33
>>> frame.inspect(33)
[##] startendtag_number
========================
[0] 6
[1] 30
[2] 54
[3] 78
[4] 7
[5] 31
[6] 55
[7] 79
[8] 8
[9] 32
[10] 73
[11] 56
[12] 80
[13] 9
[14] 33
[15] 57
[16] 81
[17] 10
[18] 34
[19] 58
[##] startendtag_number
========================
[20] 82
[21] 2
[22] 11
[23] 35
[24] 59
[25] 83
[26] 12
[27] 36
[28] 60
[29] 84
[30] 13
[31] 37
[32] 26
def import_hive(
hive_query, tc=<class 'sparktk.arguments.implicit'>)
Import data from a Hive table into a frame.
Define the SQL query to retrieve the data from a Hive table.
Only a subset of Hive data types are supported:
DataType Support
---------- ------------------------------------
boolean cast to int
bigint native support
int native support
tinyint cast to int
smallint cast to int
decimal cast to double, may lose precision
double native support
float native support
date cast to string
string native support
timestamp cast to string
varchar cast to string
arrays not supported
binary not supported
char not supported
maps not supported
structs not supported
union not supported
hive_query | (str): | Hive query to fetch data from the table |
tc | (TkContext): | TK context |
Returns | (Frame): | frame with Hive table data |
Load data into a frame from a Hive table based on a Hive query:
>>> h_query = "select * from demo_test"
>>> frame = tc.frame.import_hive(h_query)
-etc-
>>> frame.inspect()
[#] number strformat
======================
[0] 1 one
[1] 2 two
[2] 3 three
[3] 4 four
def import_jdbc(
connection_url, table_name, tc=<class 'sparktk.arguments.implicit'>)
Import data from a JDBC table into a frame.
connection_url | (str): | JDBC connection url to database server |
table_name | (str): | JDBC table name |
Returns | (Frame): | frame with JDBC table data |
Load a frame from a JDBC table by specifying the connection url to the database server.
>>> url = "jdbc:postgresql://localhost/postgres"
>>> tb_name = "demo_test"
>>> frame = tc.frame.import_jdbc(url, tb_name)
-etc-
>>> frame.inspect()
[#] a b c d
==================
[0] 1 0.2 -2 5
[1] 2 0.4 -1 6
[2] 3 0.6 0 7
[3] 4 0.8 1 8
>>> frame.schema
[(u'a', int), (u'b', float), (u'c', int), (u'd', int)]
def import_pandas(
pandas_frame, schema=None, row_index=True, validate_schema=False, tc=<class 'sparktk.arguments.implicit'>)
Imports data from the specified pandas data frame.
pandas_frame | (pandas.DataFrame): | pandas dataframe object |
schema | (Optional(list[tuples(string, type)])): | Schema description of the fields for a given line. It is a list of tuples that describe each field, (field name, field type), where the field name is a string and the field type is a supported type. If no schema is provided, the schema will be inferred based on the column names and types from the pandas_frame. |
row_index | (Optional(bool)): | Indicates if the row_index is present in the pandas dataframe and needs to be ignored when looking at the data values. Default value is True. |
validate_schema | (Optional(bool)): | If true, validates the data against the schema and attempts to cast the data to the specified type, if it does not match the schema. Defaults to False. |
Returns | (Frame): | spark-tk frame that contains data from the pandas_frame |
Create a pandas data frame:
>>> import pandas
>>> ratings_data = [[0, "invalid"], [1, "Very Poor"], [2, "Poor"], [3, "Average"], [4, "Good"], [5, "Very Good"]]
>>> df = pandas.DataFrame(ratings_data, columns=['rating_id', 'rating_text'])
>>> df
rating_id rating_text
0 0 invalid
1 1 Very Poor
2 2 Poor
3 3 Average
4 4 Good
5 5 Very Good
>>> df.columns.tolist()
['rating_id', 'rating_text']
>>> df.dtypes
rating_id int64
rating_text object
dtype: object
When using import_pandas by just passing the pandas data frame, it will use the column names and types from the pandas data frame to generate the schema.
>>> frame = tc.frame.import_pandas(df)
>>> frame.inspect()
[#] rating_id rating_text
===========================
[0] 0 invalid
[1] 1 Very Poor
[2] 2 Poor
[3] 3 Average
[4] 4 Good
[5] 5 Very Good
>>> frame.schema
[('rating_id', <type 'long'>), ('rating_text', <type 'str'>)]
Alternatively, you can specify a schema when importing the pandas data frame. There is also the option to validate the data against the schema. If this option is enabled, we will attempt to cast the data to the column's data type, if it does not match the schema.
For example, here we will specify a schema where the rating_id column will instead be called 'rating_float' and its data type will be a float. We will also enable the validate_schema option so that the rating_id value will get cast to a float:
>>> schema = [("rating_float", float), ("rating_str", unicode)]
>>> frame = tc.frame.import_pandas(df, schema, validate_schema=True)
>>> frame.inspect()
[#] rating_float rating_str
=============================
[0] 0.0 invalid
[1] 1.0 Very Poor
[2] 2.0 Poor
[3] 3.0 Average
[4] 4.0 Good
[5] 5.0 Very Good
>>> frame.schema
[('rating_float', <type 'float'>), ('rating_str', <type 'unicode'>)]
def load(
path, tc=<class 'sparktk.arguments.implicit'>)
Load a Frame from the given path.
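For example (a sketch; the path is hypothetical and assumes a frame was previously saved there):
>>> restored = tc.frame.load("sandbox/my_frame")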
Classes
class Frame
Ancestors (in MRO)
- Frame
- __builtin__.object
Instance variables
var column_names
Column names in the current frame.
Returns: | list of names of all the frame's columns |
Returns the names of the columns of the current frame.
>>> frame.column_names
[u'name', u'age', u'tenure', u'phone']
var dataframe
pyspark DataFrame (causes conversion through Scala)
var rdd
pyspark RDD (causes conversion if currently backed by a Scala RDD)
var schema
The schema of the current frame: a list of (column name, data type) tuples.
Methods
def __init__(
self, tc, source, schema=None, validate_schema=False)
(Private constructor -- use tc.frame.create or other methods available from the TkContext)
def add_columns(
self, func, schema)
Add columns to current frame.
Assigns data to column based on evaluating a function for each row.
- The row UDF ('func') must return a value in the same format as specified by the schema.
func | (UDF): | Function which takes the values in the row and produces a value, or collection of values, for the new cell(s). |
schema | (List[(str,type)]): | Schema for the column(s) being added. |
Given our frame, let's add a column which has how many years the person has been over 18
>>> frame = tc.frame.create([['Fred',39,16,'555-1234'],
... ['Susan',33,3,'555-0202'],
... ['Thurston',65,26,'555-4510'],
... ['Judy',44,14,'555-2183']],
... schema=[('name', str), ('age', int), ('tenure', int), ('phone', str)])
>>> frame.inspect()
[#] name age tenure phone
====================================
[0] Fred 39 16 555-1234
[1] Susan 33 3 555-0202
[2] Thurston 65 26 555-4510
[3] Judy 44 14 555-2183
>>> frame.add_columns(lambda row: row.age - 18, ('adult_years', int))
>>> frame.inspect()
[#] name age tenure phone adult_years
=================================================
[0] Fred 39 16 555-1234 21
[1] Susan 33 3 555-0202 15
[2] Thurston 65 26 555-4510 47
[3] Judy 44 14 555-2183 26
Multiple columns can be added at the same time. Let's add percentage of life and percentage of adult life in one call, which is more efficient.
>>> frame.add_columns(lambda row: [row.tenure / float(row.age), row.tenure / float(row.adult_years)],
... [("of_age", float), ("of_adult", float)])
>>> frame.inspect(round=2)
[#] name age tenure phone adult_years of_age of_adult
===================================================================
[0] Fred 39 16 555-1234 21 0.41 0.76
[1] Susan 33 3 555-0202 15 0.09 0.20
[2] Thurston 65 26 555-4510 47 0.40 0.55
[3] Judy 44 14 555-2183 26 0.32 0.54
Note that the function returns a list, and therefore the schema also needs to be a list.
It is not necessary to use lambda syntax, any function will do, as long as it takes a single row argument. We can also call other local functions within.
Let's add a column which shows a portion of each person's name, sized by their adult tenure percentage.
>>> def percentage_of_string(string, percentage):
... '''returns a substring of the given string according to the given percentage'''
... substring_len = int(percentage * len(string))
... return string[:substring_len]
>>> def add_name_by_adult_tenure(row):
... return percentage_of_string(row.name, row.of_adult)
>>> frame.add_columns(add_name_by_adult_tenure, ('tenured_name', unicode))
>>> frame.inspect(columns=['name', 'of_adult', 'tenured_name'], round=2)
[#] name of_adult tenured_name
=====================================
[0] Fred 0.76 Fre
[1] Susan 0.20 S
[2] Thurston 0.55 Thur
[3] Judy 0.54 Ju
Let's add a name based on tenure percentage of age.
>>> frame.add_columns(lambda row: percentage_of_string(row.name, row.of_age),
... ('tenured_name_age', unicode))
>>> frame.inspect(round=2)
[#] name age tenure phone adult_years of_age of_adult
===================================================================
[0] Fred 39 16 555-1234 21 0.41 0.76
[1] Susan 33 3 555-0202 15 0.09 0.20
[2] Thurston 65 26 555-4510 47 0.40 0.55
[3] Judy 44 14 555-2183 26 0.32 0.54
<BLANKLINE>
[#] tenured_name tenured_name_age
===================================
[0] Fre F
[1] S
[2] Thur Thu
[3] Ju J
def append(
self, frame)
Adds more data to the current frame.
frame | (Frame): | Frame of data to append to the current frame. |
In this example, we start off by creating a frame of animals.
>>> animals = tc.frame.create([['dog', 'snoopy'],['cat', 'tom'],['bear', 'yogi'],['mouse', 'jerry']],
... [('animal', str), ('name', str)])
[===Job Progress===]
>>> animals.inspect()
[#] animal name
===================
[0] dog snoopy
[1] cat tom
[2] bear yogi
[3] mouse jerry
Then, we append a frame that will add a few more animals to the original frame.
>>> animals.append(tc.frame.create([['donkey'],['elephant'], ['ostrich']], [('animal', str)]))
[===Job Progress===]
>>> animals.inspect()
[#] animal name
=====================
[0] dog snoopy
[1] cat tom
[2] bear yogi
[3] mouse jerry
[4] donkey None
[5] elephant None
[6] ostrich None
The data we added didn't have names, so None values were inserted for the new rows.
def assign_sample(
self, sample_percentages, sample_labels=None, output_column=None, seed=None)
Randomly group rows into user-defined classes.
sample_percentages | (List[float]): | Entries are non-negative and sum to 1. (See the note below.) If the *i*'th entry of the list is *p*, then each row receives label *i* with independent probability *p*. |
sample_labels | (Optional[List[str]]): | Names to be used for the split classes. Defaults to 'TR', 'TE', 'VA' when the length of *sample_percentages* is 3, and defaults to Sample_0, Sample_1, ... otherwise. |
output_column | (str): | Name of the new column which holds the labels generated by the function |
seed | (int): | Random seed used to generate the labels. Defaults to 0. |
Randomly assign classes to rows given a vector of percentages.
The table receives an additional column that contains a random label.
The random label is generated by a probability distribution function.
The distribution function is specified by the sample_percentages, a list of
floating point values, which add up to 1.
The labels are non-negative integers drawn from the range :math:`[0, len(S) - 1]`, where :math:`S` is the sample_percentages.
The sample percentages provided by the user are preserved to at least eight decimal places, but beyond this there may be small changes due to floating point imprecision.
In particular:
- The engine validates that the probabilities sum to 1.0 within eight decimal places and returns an error if the sum falls outside of this range.
- The probability of the final class is clamped so that each row receives a valid label with probability one.
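A minimal plain-Python sketch of the labeling idea (illustrative only; not the library's exact RNG, ordering, or seed handling):
>>> import random
>>> def draw_label(percentages, labels, rng):
...     r, cum = rng.random(), 0.0
...     for label, p in zip(labels, percentages):
...         cum += p
...         if r < cum:
...             return label
...     return labels[-1]   # final class is clamped so every row gets a label
>>> draw_label([0.4, 0.3, 0.3], ['TR', 'TE', 'VA'], random.Random(0))
'VA'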
Consider this simple frame.
>>> frame.inspect()
[#] blip id
=============
[0] abc 0
[1] def 1
[2] ghi 2
[3] jkl 3
[4] mno 4
[5] pqr 5
[6] stu 6
[7] vwx 7
[8] yza 8
[9] bcd 9
We'll assign labels to each row according to a rough 40-30-30 split, for "train", "test", and "validate".
>>> frame.assign_sample([0.4, 0.3, 0.3])
[===Job Progress===]
>>> frame.inspect()
[#] blip id sample_bin
=========================
[0] abc 0 VA
[1] def 1 TR
[2] ghi 2 TE
[3] jkl 3 TE
[4] mno 4 TE
[5] pqr 5 TR
[6] stu 6 TR
[7] vwx 7 VA
[8] yza 8 VA
[9] bcd 9 VA
Now the frame has a new column named "sample_bin" with a string label. Values in the other columns are unaffected.
Here it is again, this time specifying labels, an output column, and a random seed:
>>> frame.assign_sample([0.2, 0.2, 0.3, 0.3],
... ["cat1", "cat2", "cat3", "cat4"],
... output_column="cat",
... seed=12)
[===Job Progress===]
>>> frame.inspect()
[#] blip id sample_bin cat
===============================
[0] abc 0 VA cat4
[1] def 1 TR cat2
[2] ghi 2 TE cat3
[3] jkl 3 TE cat4
[4] mno 4 TE cat1
[5] pqr 5 TR cat3
[6] stu 6 TR cat2
[7] vwx 7 VA cat3
[8] yza 8 VA cat3
[9] bcd 9 VA cat4
def bin_column(
self, column_name, bins=None, include_lowest=True, strict_binning=False, bin_column_name=None)
Summarize rows of data based on the value in a single column by sorting them into bins, or groups, based on a list of bin cutoff points or a specified number of equal-width bins.
column_name | (str): | Name of the column to bin |
bins | (Optional[List[float]]): | Either a single value representing the number of equal-width bins to create, or an array of values containing bin cutoff points. The array can be a list or tuple. If an array is provided, values must be progressively increasing. All bin boundaries must be included, so, with N bins, you need N+1 values. Default (None or empty list) is equal-width bins where the maximum number of bins is the square-root choice :math:`\lfloor \sqrt{m} \rfloor`, where :math:`m` is the number of rows. |
include_lowest | (bool): | Specify how the boundary conditions are handled. ``True`` indicates that the lower bound of the bin is inclusive. ``False`` indicates that the upper bound is inclusive. Default is ``True``. |
strict_binning | (bool): | Specify how values outside of the cutoffs array should be binned. If set to ``True``, each value less than cutoffs[0] or greater than cutoffs[-1] will be assigned a bin value of -1. If set to ``False``, values less than cutoffs[0] will be included in the first bin while values greater than cutoffs[-1] will be included in the final bin. |
bin_column_name | (str): | The name for the new binned column. Default is ``<column_name>_binned`` |
Returns | (List[float]): | a list containing the edges of each bin |
- Bin IDs are 0-indexed; in other words, the lowest bin number is 0.
- The first and last cutoffs are always included in the bins.
When include_lowest is True, the last bin includes both cutoffs. When include_lowest is False, the first bin (bin 0) includes both cutoffs.
For these examples, we will use a frame with column 'a' accessed by a Frame object 'frame':
>>> frame.inspect(n=11)
[##] a
========
[0] 1
[1] 1
[2] 2
[3] 3
[4] 5
[5] 8
[6] 13
[7] 21
[8] 34
[9] 55
[10] 89
Modify the frame with a column showing what bin the data is in, by specifying cutoffs for the bin edges. The data values should use strict_binning:
>>> frame.bin_column('a', [5, 12, 25, 60], include_lowest=True,
... strict_binning=True, bin_column_name='binned_using_cutoffs')
[===Job Progress===]
>>> frame.inspect(n=11)
[##] a binned_using_cutoffs
==============================
[0] 1 -1
[1] 1 -1
[2] 2 -1
[3] 3 -1
[4] 5 0
[5] 8 0
[6] 13 1
[7] 21 1
[8] 34 2
[9] 55 2
[10] 89 -1
Modify the frame with a column showing what bin the data is in. The data value should not use strict_binning:
>>> frame.bin_column('a', [5, 12, 25, 60], include_lowest=True,
... strict_binning=False, bin_column_name='binned_using_cutoffs')
[===Job Progress===]
>>> frame.inspect(n=11)
[##] a binned_using_cutoffs
==============================
[0] 1 0
[1] 1 0
[2] 2 0
[3] 3 0
[4] 5 0
[5] 8 0
[6] 13 1
[7] 21 1
[8] 34 2
[9] 55 2
[10] 89 2
Modify the frame with a column showing what bin the data is in. The bins should be lower inclusive:
>>> frame.bin_column('a', [1,5,34,55,89], include_lowest=True,
... strict_binning=False, bin_column_name='binned_using_cutoffs')
[===Job Progress===]
>>> frame.inspect( n=11 )
[##] a binned_using_cutoffs
==============================
[0] 1 0
[1] 1 0
[2] 2 0
[3] 3 0
[4] 5 1
[5] 8 1
[6] 13 1
[7] 21 1
[8] 34 2
[9] 55 3
[10] 89 3
Modify the frame with a column showing what bin the data is in. The bins should be upper inclusive:
>>> frame.bin_column('a', [1,5,34,55,89], include_lowest=False,
... strict_binning=True, bin_column_name='binned_using_cutoffs')
[===Job Progress===]
>>> frame.inspect( n=11 )
[##] a binned_using_cutoffs
==============================
[0] 1 0
[1] 1 0
[2] 2 0
[3] 3 0
[4] 5 0
[5] 8 1
[6] 13 1
[7] 21 1
[8] 34 1
[9] 55 2
[10] 89 3
Modify the frame with a column of 3 equal-width bins. This also returns the cutoffs that were used for creating the bins.
>>> cutoffs = frame.bin_column('a', 3, bin_column_name='equal_width_bins')
>>> print cutoffs
[1.0, 30.333333333333332, 59.666666666666664, 89.0]
>>> frame.inspect(n=frame.count())
[##] a equal_width_bins
==========================
[0] 1 0
[1] 1 0
[2] 2 0
[3] 3 0
[4] 5 0
[5] 8 0
[6] 13 0
[7] 21 0
[8] 34 1
[9] 55 1
[10] 89 2
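The equal-width cutoffs above follow directly from the column's minimum and maximum; a quick sketch of the arithmetic:
>>> lo, hi, n = 1.0, 89.0, 3
>>> [lo + i * (hi - lo) / n for i in range(n + 1)]
[1.0, 30.333333333333332, 59.666666666666664, 89.0]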
def binary_classification_metrics(
self, label_column, pred_column, pos_label, beta=1.0, frequency_column=None)
Statistics of accuracy, precision, and others for a binary classification model.
label_column | (str): | The name of the column containing the correct label for each instance. |
pred_column | (str): | The name of the column containing the predicted label for each instance. |
pos_label | (Any): | The value to be interpreted as a positive instance for binary classification. |
beta | (Optional[float]): | This is the beta value to use for the :math:`F_{\beta}` measure (by default the F1 measure is computed); must be greater than zero. Default is 1. |
frequency_column | (Optional[str]): | The name of an optional column containing the frequency of observations. |
Returns | (ClassificationMetricsValue): | The data returned is composed of multiple components: <object>.accuracy : double, <object>.confusion_matrix : table, <object>.f_measure : double, <object>.precision : double, <object>.recall : double |
Calculate the accuracy, precision, confusion matrix, recall, and :math:`F_{\beta}` measure for a classification model.
- The f_measure result is the :math:`F_{\beta}` measure for a classification model. The :math:`F_{\beta}` measure of a binary classification model is the harmonic mean of precision and recall. If we let :math:`\beta` denote the beta parameter, :math:`T_{P}` the number of true positives, :math:`F_{P}` the number of false positives, and :math:`F_{N}` the number of false negatives, then:
.. math::
    F_{\beta} = (1 + \beta^2) \cdot \frac{\frac{T_{P}}{T_{P} + F_{P}} \cdot \frac{T_{P}}{T_{P} + F_{N}}}{\beta^2 \cdot \frac{T_{P}}{T_{P} + F_{P}} + \frac{T_{P}}{T_{P} + F_{N}}}
The :math:`F_{\beta}` measure for a multi-class classification model is computed as the weighted average of the :math:`F_{\beta}` measure for each label, where the weight is the number of instances of each label. The determination of binary vs. multi-class is automatically inferred from the data.
- The recall result of a binary classification model is the proportion of positive instances that are correctly identified. If we let :math:`T_{P}` denote the number of true positives and :math:`F_{N}` the number of false negatives, then the model recall is given by :math:`\frac{T_{P}}{T_{P} + F_{N}}`.
- The precision of a binary classification model is the proportion of predicted positive instances that are correctly identified. If we let :math:`T_{P}` denote the number of true positives and :math:`F_{P}` the number of false positives, then the model precision is given by :math:`\frac{T_{P}}{T_{P} + F_{P}}`.
- The accuracy of a classification model is the proportion of predictions that are correctly identified. If we let :math:`T_{P}` denote the number of true positives, :math:`T_{N}` the number of true negatives, and :math:`K` the total number of classified instances, then the model accuracy is given by :math:`\frac{T_{P} + T_{N}}{K}`.
- The confusion_matrix result is a confusion matrix for a binary classifier model, formatted for human readability.
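As a quick sanity check of the formulas, a plain-Python computation (hypothetical helper, not part of the API) reproduces the f_measure from the example below, where TP=1, FP=0, FN=1:
>>> def f_beta(tp, fp, fn, beta=1.0):
...     precision = tp / float(tp + fp)
...     recall = tp / float(tp + fn)
...     return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
>>> f_beta(tp=1, fp=0, fn=1)
0.6666666666666666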
Consider Frame my_frame, which contains the data
>>> my_frame.inspect()
[#] a b labels predictions
==================================
[0] red 1 0 0
[1] blue 3 1 0
[2] green 1 0 0
[3] green 0 1 1
>>> cm = my_frame.binary_classification_metrics('labels', 'predictions', 1, 1)
[===Job Progress===]
>>> cm.f_measure
0.6666666666666666
>>> cm.recall
0.5
>>> cm.accuracy
0.75
>>> cm.precision
1.0
>>> cm.confusion_matrix
Predicted_Pos Predicted_Neg
Actual_Pos 1 1
Actual_Neg 0 2
def box_cox(
self, column_name, lambda_value=0.0, box_cox_column_name=None)
Calculate the Box-Cox transformation for each row of a given column of the current frame.
column_name | (str): | Name of the column to perform the transformation on |
lambda_value | (Optional[float]): | Lambda power parameter. Default is 0.0 |
box_cox_column_name | (Optional[str]): | Optional column name for the Box-Cox value |
Returns | (Frame): | returns a frame with a new column storing the box-cox transformed value |
Calculate the Box-Cox transformation for each row in column 'column_name' of a frame using the lambda_value.
The Box-Cox transformation is computed by the following formula:
boxcox = log(y) if lambda == 0; boxcox = (y^lambda - 1) / lambda otherwise, where log is the natural log.
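Equivalently, in plain Python (a sketch to illustrate the formula, not the library's implementation):
>>> import math
>>> def box_cox(y, lmbda):
...     return math.log(y) if lmbda == 0 else (y ** lmbda - 1) / lmbda
>>> round(box_cox(7.7132064326674596, 0.3), 5)   # matches the first row below
2.81913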
>>> data = [[7.7132064326674596],[0.207519493594015],[6.336482349262754],[7.4880388253861181],[4.9850701230259045]]
>>> schema = [("input", float)]
>>> my_frame = tc.frame.create(data, schema)
>>> my_frame.inspect()
[#] input
===================
[0] 7.71320643267
[1] 0.207519493594
[2] 6.33648234926
[3] 7.48803882539
[4] 4.98507012303
Compute the box-cox transformation on the 'input' column
>>> my_frame.box_cox('input',0.3)
A new column gets added to the frame which stores the box-cox transformation for each row
>>> my_frame.inspect()
[#] input input_lambda_0.3
=====================================
[0] 7.71320643267 2.81913279907
[1] 0.207519493594 -1.25365381375
[2] 6.33648234926 2.46673638752
[3] 7.48803882539 2.76469126003
[4] 4.98507012303 2.06401101556
def categorical_summary(
self, columns, top_k=None, threshold=None)
Build summary of the data.
columns | (List[CategoricalSummaryInput]): | List of CategoricalSummaryInput consisting of column, topk and/or threshold |
top_k | (Optional[int]): | Displays levels which are in the top k most frequently occurring values for that column. Default is 10. |
threshold | (Optional[float]): | Displays levels which are above the threshold percentage with respect to the total row count. Default is 0.0. |
Returns | (List[CategoricalSummaryOutput]): | List of CategoricalSummaryOutput objects for specified column(s) consisting of levels with their frequency and percentage. |
Compute a summary of the data in one or more columns for categorical or numerical data types. The returned value is a Map containing a categorical summary for each specified column.
For each column, levels which satisfy the top k and/or threshold cutoffs are displayed along with their frequency and percentage occurrence with respect to the total rows in the dataset.
Performs level pruning first based on top k and then filters out levels which do not satisfy the threshold criterion.
Missing data is reported when a column value is empty ("") or null.
All remaining data is grouped together in the Other category and its frequency and percentage are reported as well.
User must specify the column name and can optionally specify top_k and/or threshold.
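A rough local sketch of the level-counting idea (illustrative only; the library computes this on the distributed frame):
>>> from collections import Counter
>>> values = ["thing", "abstraction", "thing", "", "thing", "abstraction"]
>>> counts = Counter(v for v in values if v not in ("", None))   # missing values excluded
>>> counts.most_common(1)                                        # top-k pruning with k=1
[('thing', 3)]
>>> sum(1 for v in values if v in ("", None))                    # reported as <Missing>
1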
Consider Frame my_frame, which contains the data
>>> my_frame.inspect()
[#] source target
=====================================
[0] entity thing
[1] entity physical_entity
[2] entity abstraction
[3] physical_entity entity
[4] physical_entity matter
[5] physical_entity process
[6] physical_entity thing
[7] physical_entity substance
[8] physical_entity object
[9] physical_entity causal_agent
>>> cm = my_frame.categorical_summary('source', top_k=2)
[===Job Progress===]
>>> cm
column_name = "source"
[#] level frequency percentage
===========================================
[0] thing 9 0.321428571429
[1] abstraction 9 0.321428571429
[2] <Missing> 0 0.0
[3] <Other> 10 0.357142857143
>>> cm = my_frame.categorical_summary('source', threshold = 0.5)
[===Job Progress===]
>>> cm
column_name = "source"
[#] level frequency percentage
=====================================
[0] <Missing> 0 0.0
[1] <Other> 28 1.0
>>> cm = my_frame.categorical_summary(['source', 'target'], top_k=[2, None], threshold=[None, 0.5])
[===Job Progress===]
>>> cm
column_name = "source"
[#] level frequency percentage
===========================================
[0] thing 9 0.321428571429
[1] abstraction 9 0.321428571429
[2] <Missing> 0 0.0
[3] <Other> 10 0.357142857143
<BLANKLINE>
column_name = "target"
[#] level frequency percentage
=====================================
[0] <Missing> 0 0.0
[1] <Other> 28 1.0
def collect(
self, columns=None)
Brings all the rows of data from the frame into a local python list of lists
(Use the 'take' operation for control over row count and offset of the collected data)
columns | (Optional[str or List[str]]): | If not None, only the given columns' data will be provided. By default, all columns are included. |
Returns | (List[List[*]]): | the frame data represented as a list of lists |
>>> schema = [('name',str), ('age', int), ('tenure', int), ('phone', str)]
>>> rows = [['Fred', 39, 16, '555-1234'], ['Susan', 33, 3, '555-0202'], ['Thurston', 65, 26, '555-4510'], ['Judy', 44, 14, '555-2183']]
>>> frame = tc.frame.create(rows, schema)
>>> frame.collect()
[['Fred', 39, 16, '555-1234'], ['Susan', 33, 3, '555-0202'], ['Thurston', 65, 26, '555-4510'], ['Judy', 44, 14, '555-2183']]
>>> frame.collect(['name', 'phone'])
[['Fred', '555-1234'], ['Susan', '555-0202'], ['Thurston', '555-4510'], ['Judy', '555-2183']]
def column_median(
self, data_column, weights_column=None)
Calculate the (weighted) median of a column.
The median is the least value X in the range of the distribution so that the cumulative weight of values strictly below X is strictly less than half of the total weight and the cumulative weight of values up to and including X is greater than or equal to one-half of the total weight.
All data elements of weight less than or equal to 0 are excluded from the calculation, as are all data elements whose weight is NaN or infinite. If a weight column is provided and no weights are finite numbers greater than 0, None is returned.
data_column | (str): | The column whose median is to be calculated. |
weights_column | (Optional[str]): | The column that provides weights (frequencies) for the median calculation. Must contain numerical data. Default is all items have a weight of 1. |
Returns | (varies): | The median of the values. If a weight column is provided and no weights are finite numbers greater than 0, None is returned. The type of the median returned is the same as the contents of the data column, so a column of Longs will result in a Long median and a column of Floats will result in a Float median. |
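A minimal local sketch of the weighted-median rule described above (illustrative only):
>>> def weighted_median(pairs):
...     valid = sorted((x, w) for x, w in pairs if w > 0)   # drop non-positive weights
...     total = sum(w for _, w in valid)
...     cum = 0.0
...     for x, w in valid:
...         cum += w
...         if cum >= total / 2.0:
...             return x
>>> weighted_median([(2, 1.7), (3, 0.5), (3, 1.2), (5, 0.8), (7, 1.1), (10, 0.8), (30, 0.1)])
3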
Given a frame with column 'a' accessed by a Frame object 'my_frame':
>>> data = [[2],[3],[3],[5],[7],[10],[30]]
>>> schema = [('a', int)]
>>> my_frame = tc.frame.create(data, schema)
[===Job Progress===]
Inspect my_frame
>>> my_frame.inspect()
[#] a
=======
[0] 2
[1] 3
[2] 3
[3] 5
[4] 7
[5] 10
[6] 30
Compute and return middle number of values in column a:
>>> median = my_frame.column_median('a')
[===Job Progress===]
>>> print median
5
Given a frame with column 'a' and column 'w' as weights accessed by a Frame object 'my_frame':
>>> data = [[2,1.7],[3,0.5],[3,1.2],[5,0.8],[7,1.1],[10,0.8],[30,0.1]]
>>> schema = [('a', int), ('w', float)]
>>> my_frame = tc.frame.create(data, schema)
[===Job Progress===]
Inspect my_frame
>>> my_frame.inspect()
[#] a w
============
[0] 2 1.7
[1] 3 0.5
[2] 3 1.2
[3] 5 0.8
[4] 7 1.1
[5] 10 0.8
[6] 30 0.1
Compute and return middle number of values in column 'a' with weights 'w':
>>> median = my_frame.column_median('a', weights_column='w')
[===Job Progress===]
>>> print median
3
def column_mode(
self, data_column, weights_column=None, max_modes_returned=None)
Evaluate the weights assigned to rows.
Calculate the modes of a column. A mode is a data element of maximum weight. All data elements of weight less than or equal to 0 are excluded from the calculation, as are all data elements whose weight is NaN or infinite. If there are no data elements of finite weight greater than 0, no mode is returned.
Because data distributions often have multiple modes, it is possible for a set of modes to be returned. By default, only one is returned, but by setting the optional parameter max_modes_returned, a larger number of modes can be returned.
data_column | (str): | Name of the column supplying the data. |
weights_column | (Optional[str]): | Name of the column supplying the weights. Default is all items have weight of 1. |
max_modes_returned | (Optional[int]): | Maximum number of modes returned. Default is 1. |
Returns | (ColumnMode): | ColumnMode object which includes multiple components (mode, weight_of_mode, total_weight, and mode_count). |
The data returned is composed of multiple components:
- mode : A mode is a data element of maximum net weight. A set of modes is returned. The empty set is returned when the sum of the weights is 0. If the number of modes is less than or equal to the parameter max_modes_returned, then all modes of the data are returned. If the number of modes is greater than the max_modes_returned parameter, only the first max_modes_returned many modes (per a canonical ordering) are returned.
- weight_of_mode : Weight of a mode. If there are no data elements of finite weight greater than 0, the weight of the mode is 0. If no weights column is given, this is the number of appearances of each mode.
- total_weight : Sum of all weights in the weight column. This is the row count if no weights are given. If no weights column is given, this is the number of rows in the table with non-zero weight.
- mode_count : The number of distinct modes in the data. In the case that the data is very multimodal, this number may exceed max_modes_returned.
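A minimal local sketch of the net-weight mode rule (illustrative only; tie ordering may differ from the library's canonical ordering):
>>> from collections import defaultdict
>>> def weighted_modes(pairs, max_modes=1):
...     net = defaultdict(float)
...     for x, w in pairs:
...         if w > 0:                      # non-positive weights are excluded
...             net[x] += w
...     top = max(net.values())
...     modes = sorted(x for x, w in net.items() if w == top)
...     return modes[:max_modes], top, len(modes)
>>> weighted_modes([(2, 1.7), (3, 0.5), (3, 1.2), (5, 0.8), (7, 1.1), (10, 0.8), (30, 0.1)])
([2], 1.7, 2)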
Given a frame with column 'a' accessed by a Frame object 'my_frame':
>>> data = [[2],[3],[3],[5],[7],[10],[30]]
>>> schema = [('a', int)]
>>> my_frame = tc.frame.create(data, schema)
[===Job Progress===]
Inspect my_frame
>>> my_frame.inspect()
[#] a
=======
[0] 2
[1] 3
[2] 3
[3] 5
[4] 7
[5] 10
[6] 30
Compute and return a ColumnMode object containing summary statistics of column a:
>>> mode = my_frame.column_mode('a')
[===Job Progress===]
>>> print mode
mode_count = 1
modes = [3]
total_weight = 7.0
weight_of_mode = 2.0
Given a frame with column 'a' and column 'w' as weights accessed by a Frame object 'my_frame':
>>> data = [[2,1.7],[3,0.5],[3,1.2],[5,0.8],[7,1.1],[10,0.8],[30,0.1]]
>>> schema = [('a', int), ('w', float)]
>>> my_frame = tc.frame.create(data, schema)
[===Job Progress===]
Inspect my_frame
>>> my_frame.inspect()
[#] a w
============
[0] 2 1.7
[1] 3 0.5
[2] 3 1.2
[3] 5 0.8
[4] 7 1.1
[5] 10 0.8
[6] 30 0.1
Compute and return ColumnMode object containing summary statistics of column 'a' with weights 'w':
>>> mode = my_frame.column_mode('a', weights_column='w')
[===Job Progress===]
>>> print mode
mode_count = 2
modes = [2]
total_weight = 6.2
weight_of_mode = 1.7
def column_summary_statistics(
self, data_column, weights_column=None, use_popultion_variance=False)
Calculate multiple statistics for a column.
data_column | (str): | The column to be statistically summarized. Must contain numerical data; all NaNs and infinite values are excluded from the calculation. |
weights_column | (Optional[str]): | Name of column holding weights of column values. |
use_popultion_variance | (Optional[bool]): | If true, the variance is calculated as the population variance. If false, the variance is calculated as the sample variance. Because this option affects the variance, it affects the standard deviation and the confidence intervals as well. Default is false. |
Returns | (ColumnSummaryStatistics): | ColumnSummaryStatistics object containing summary statistics. |
The data returned is composed of multiple components:
- mean : [double | None] Arithmetic mean of the data.
- geometric_mean : [double | None] Geometric mean of the data. None when there is a data element <= 0; 1.0 when there are no data elements.
- variance : [double | None] None when there are <= 1 many data elements. Sample variance is the weighted sum of the squared distance of each data element from the weighted mean, divided by the total weight minus 1. None when the sum of the weights is <= 1. Population variance is the weighted sum of the squared distance of each data element from the weighted mean, divided by the total weight.
- standard_deviation : [double | None] The square root of the variance. None when sample variance is being used and the sum of weights is <= 1.
- total_weight : long The count of all data elements that are finite numbers (in other words, after excluding NaNs and infinite values).
- minimum : [double | None] Minimum value in the data. None when there are no data elements.
- maximum : [double | None] Maximum value in the data. None when there are no data elements.
- mean_confidence_lower : [double | None] Lower limit of the 95% confidence interval about the mean. Assumes a Gaussian distribution. None when there are no elements of positive weight.
- mean_confidence_upper : [double | None] Upper limit of the 95% confidence interval about the mean. Assumes a Gaussian distribution. None when there are no elements of positive weight.
- bad_row_count : [double | None] The number of rows containing a NaN or infinite value in either the data or weights column.
- good_row_count : [double | None] The number of rows not containing a NaN or infinite value in either the data or weights column.
- positive_weight_count : [double | None] The number of valid data elements with weight > 0. This is the number of entries used in the statistical calculation.
- non_positive_weight_count : [double | None] The number of valid data elements with finite weight <= 0.
- Sample Variance
Sample variance is computed by the following formula:
.. math::
    \left( \frac{1}{W - 1} \right) \cdot \sum_{i} \left( x_{i} - M \right)^{2}
where :math:`W` is the sum of weights over valid elements of positive weight, and :math:`M` is the weighted mean.
- Population Variance
Population variance is computed by the following formula:
.. math::
    \left( \frac{1}{W} \right) \cdot \sum_{i} \left( x_{i} - M \right)^{2}
where :math:`W` is the sum of weights over valid elements of positive weight, and :math:`M` is the weighted mean.
- Standard Deviation
The square root of the variance.
- Logging Invalid Data
A row is bad when it contains a NaN or infinite value in either its data or weights column. In this case, it contributes to bad_row_count; otherwise it contributes to good_row_count. A good row can be skipped because the value in its weight column is less than or equal to 0. In this case, it contributes to non_positive_weight_count; otherwise (when the weight is greater than 0) it contributes to valid_data_weight_pair_count.
Equations
bad_row_count + good_row_count = # rows in the frame
positive_weight_count + non_positive_weight_count = good_row_count
In particular, when no weights column is provided and all weights are 1.0:
non_positive_weight_count = 0 and
positive_weight_count = good_row_count
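A quick plain-Python check of the unweighted sample-variance formula against the first example below (illustrative only):
>>> data = [2, 3, 3, 5, 7, 10, 30]
>>> w = len(data)                       # all weights are 1
>>> mean = sum(data) / float(w)
>>> sample_var = sum((x - mean) ** 2 for x in data) / (w - 1)
>>> round(mean, 6), round(sample_var, 6)
(8.571429, 96.952381)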
Given a frame with column 'a' accessed by a Frame object 'my_frame':
>>> data = [[2],[3],[3],[5],[7],[10],[30]]
>>> schema = [('a', int)]
>>> my_frame = tc.frame.create(data, schema)
[===Job Progress===]
Inspect my_frame
>>> my_frame.inspect()
[#] a
=======
[0] 2
[1] 3
[2] 3
[3] 5
[4] 7
[5] 10
[6] 30
Compute and return summary statistics for values in column a:
>>> summary_statistics = my_frame.column_summary_statistics('a')
[===Job Progress===]
>>> print summary_statistics
bad_row_count = 0
geometric_mean = 5.67257514519
good_row_count = 7
maximum = 30.0
mean = 8.57142857143
mean_confidence_lower = 1.27708372993
mean_confidence_upper = 15.8657734129
minimum = 2.0
non_positive_weight_count = 0
positive_weight_count = 7
standard_deviation = 9.84644001416
total_weight = 7.0
variance = 96.9523809524
Given a frame with column 'a' and column 'w' as weights accessed by a Frame object 'my_frame':
>>> data = [[2,1.7],[3,0.5],[3,1.2],[5,0.8],[7,1.1],[10,0.8],[30,0.1]]
>>> schema = [('a', int), ('w', float)]
>>> my_frame = tc.frame.create(data, schema)
[===Job Progress===]
Inspect my_frame
>>> my_frame.inspect()
[#] a w
============
[0] 2 1.7
[1] 3 0.5
[2] 3 1.2
[3] 5 0.8
[4] 7 1.1
[5] 10 0.8
[6] 30 0.1
Compute and return summary statistics values in column 'a' with weights 'w':
>>> summary_statistics = my_frame.column_summary_statistics('a', weights_column='w')
[===Job Progress===]
>>> print summary_statistics
bad_row_count = 0
geometric_mean = 4.03968288152
good_row_count = 7
maximum = 30.0
mean = 5.03225806452
mean_confidence_lower = 1.42847242276
mean_confidence_upper = 8.63604370627
minimum = 2.0
non_positive_weight_count = 0
positive_weight_count = 7
standard_deviation = 4.57824177679
total_weight = 6.2
variance = 20.9602977667
def copy(
self, columns=None, where=None)
New frame with copied columns.
columns | (str, List[str], or dictionary(str,str)): | If not None, the copy will only include the columns specified. If dict, the string pairs represent a column renaming { source_column_name : destination_column_name } |
where | (UDF): | Optionally provide a where function. If not None, only those rows for which the UDF evaluates to True will be copied. |
Returns | (Frame): | New Frame object. |
Copies specified columns into a new Frame object, optionally renaming them and/or filtering them. Useful for frame queries.
Consider the following frame of employee names, age, and years of service:
>>> frame.inspect()
[#] name age years
=========================
[0] Thurston 64 26
[1] Judy 44 14
[2] Emily 37 5
[3] Frank 50 18
[4] Joe 43 11
[5] Ruth 52 21
>>> frame.schema
[('name', <type 'str'>), ('age', <type 'int'>), ('years', <type 'int'>)]
To create a duplicate copy of the frame, use the copy operation with no parameters:
>>> duplicate = frame.copy()
[===Job Progress===]
>>> duplicate.inspect()
[#] name age years
=========================
[0] Thurston 64 26
[1] Judy 44 14
[2] Emily 37 5
[3] Frank 50 18
[4] Joe 43 11
[5] Ruth 52 21
Using the copy operation, we can also limit the new frame to just include the 'name' column:
>>> names = frame.copy("name")
[===Job Progress===]
>>> names.inspect()
[#] name
=============
[0] Thurston
[1] Judy
[2] Emily
[3] Frank
[4] Joe
[5] Ruth
We could also include a UDF to filter the data that is included in the new frame, and also provide a dictionary to rename the column(s) in the new frame. Here we will use copy to create a frame of names for the employees that have over 20 years of service, and also rename the 'name' column to 'first_name':
>>> names = frame.copy({"name" : "first_name"}, lambda row: row.years > 20)
[===Job Progress===]
>>> names.inspect()
[#] first_name
===============
[0] Thurston
[1] Ruth
def correlation(
self, column_a, column_b)
Calculate correlation for two columns of current frame.
column_a | (str): | The name of the column from which to compute the correlation. |
column_b | (str): | The name of the column from which to compute the correlation. |
Returns | (float): | Pearson correlation coefficient of the two columns. |
This method applies only to columns containing numerical data.
Consider Frame my_frame, which contains the data
>>> my_frame.inspect()
[#] idnum x1 x2 x3 x4
===============================
[0] 0 1.0 4.0 0.0 -1.0
[1] 1 2.0 3.0 0.0 -1.0
[2] 2 3.0 2.0 1.0 -1.0
[3] 3 4.0 1.0 2.0 -1.0
[4] 4 5.0 0.0 2.0 -1.0
my_frame.correlation computes the common correlation coefficient (Pearson's) on the pair of columns provided. In this example, the idnum and most of the columns have trivial correlations: -1, 0, or +1. Column x3 provides a contrasting coefficient of 3 / sqrt(10) = 0.948683298051.
>>> my_frame.correlation("x1", "x2")
-0.9999999999999998
>>> my_frame.correlation("x1", "x4")
nan
>>> my_frame.correlation("x2", "x3")
-0.9486832980505138
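For reference, the x1-x3 coefficient quoted above can be reproduced with a small plain-Python Pearson sketch (illustrative only):
>>> import math
>>> def pearson(xs, ys):
...     n = len(xs)
...     mx, my = sum(xs) / float(n), sum(ys) / float(n)
...     cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
...     sx = sum((x - mx) ** 2 for x in xs)
...     sy = sum((y - my) ** 2 for y in ys)
...     return cov / math.sqrt(sx * sy)
>>> pearson([1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 0.0, 1.0, 2.0, 2.0])
0.9486832980505138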
def correlation_matrix(
self, data_column_names)
Calculate correlation matrix for two or more columns.
data_column_names | (List[str]): | The names of the columns from which to compute the matrix. |
Returns | (Frame): | A Frame with the matrix of the correlation values for the columns. |
This method applies only to columns containing numerical data.
Consider Frame my_frame, which contains the data
>>> my_frame.inspect()
[#] idnum x1 x2 x3 x4
===============================
[0] 0 1.0 4.0 0.0 -1.0
[1] 1 2.0 3.0 0.0 -1.0
[2] 2 3.0 2.0 1.0 -1.0
[3] 3 4.0 1.0 2.0 -1.0
[4] 4 5.0 0.0 2.0 -1.0
my_frame.correlation_matrix computes the common correlation coefficient (Pearson's) on each pair of columns in the user-provided list. In this example, the idnum and most of the columns have trivial correlations: -1, 0, or +1. Column x3 provides a contrasting coefficient of 3 / sqrt(10) = 0.948683298051.
>>> corr_matrix = my_frame.correlation_matrix(my_frame.column_names)
[===Job Progress===]
The resulting table (specifying all columns) is:
>>> corr_matrix.inspect()
[#] idnum x1 x2 x3 x4
==========================================================================
[0] 1.0 1.0 -1.0 0.948683298051 nan
[1] 1.0 1.0 -1.0 0.948683298051 nan
[2] -1.0 -1.0 1.0 -0.948683298051 nan
[3] 0.948683298051 0.948683298051 -0.948683298051 1.0 nan
[4] nan nan nan nan 1.0
def count(
self, where=None)
Counts all rows or all qualified rows.
where | (UDF): | Optional function which evaluates a row to a boolean to determine if it should be counted |
Returns | (int): | Number of rows counted |
Counts all rows or all rows which meet criteria specified by a UDF predicate.
>>> frame = tc.frame.create([['Fred',39,16,'555-1234'],
... ['Susan',33,3,'555-0202'],
... ['Thurston',65,26,'555-4510'],
... ['Judy',44,14,'555-2183']],
... schema=[('name', str), ('age', int), ('tenure', int), ('phone', str)])
>>> frame.inspect()
[#] name age tenure phone
====================================
[0] Fred 39 16 555-1234
[1] Susan 33 3 555-0202
[2] Thurston 65 26 555-4510
[3] Judy 44 14 555-2183
>>> frame.count()
4
>>> frame.count(lambda row: row.age > 35)
3
def covariance(
self, column_a, column_b)
Calculate covariance for exactly two columns.
column_a | (str): | The name of the column from which to compute the covariance. |
column_b | (str): | The name of the column from which to compute the covariance. |
Returns | (float): | Covariance of the two columns. |
This method applies only to columns containing numerical data.
Consider Frame my_frame, which contains the data
>>> my_frame.inspect()
[#] idnum x1 x2 x3 x4
===============================
[0] 0 1.0 4.0 0.0 -1.0
[1] 1 2.0 3.0 0.0 -1.0
[2] 2 3.0 2.0 1.0 -1.0
[3] 3 4.0 1.0 2.0 -1.0
[4] 4 5.0 0.0 2.0 -1.0
my_frame.covariance computes the covariance on the pair of columns provided.
>>> my_frame.covariance("x1", "x2")
-2.5
>>> my_frame.covariance("x1", "x4")
0.0
>>> my_frame.covariance("x2", "x3")
-1.5
def covariance_matrix(
self, data_column_names)
Calculate covariance matrix for two or more columns.
data_column_names | (List[str]): | The names of the columns from which to compute the matrix. Names should refer to a single column of type vector, or two or more columns of numeric scalars. |
Returns | (Frame): | A matrix with the covariance values for the columns. |
This function applies only to columns containing numerical data.
Consider Frame my_frame, which contains the data
>>> my_frame.inspect()
[#] idnum x1 x2 x3 x4
===============================
[0] 0 1.0 4.0 0.0 -1.0
[1] 1 2.0 3.0 0.0 -1.0
[2] 2 3.0 2.0 1.0 -1.0
[3] 3 4.0 1.0 2.0 -1.0
[4] 4 5.0 0.0 2.0 -1.0
my_frame.covariance_matrix computes the covariance on each pair of columns in the user-provided list.
>>> cov_matrix = my_frame.covariance_matrix(my_frame.column_names)
[===Job Progress===]
The resulting table (specifying all columns) is:
>>> cov_matrix.inspect()
[#] idnum x1 x2 x3 x4
=================================
[0] 2.5 2.5 -2.5 1.5 0.0
[1] 2.5 2.5 -2.5 1.5 0.0
[2] -2.5 -2.5 2.5 -1.5 0.0
[3] 1.5 1.5 -1.5 1.0 0.0
[4] 0.0 0.0 0.0 0.0 0.0
def cumulative_percent(
self, sample_col)
Add column to frame with cumulative percent.
sample_col | (str): | The name of the column from which to compute the cumulative percent. |
A cumulative percent sum is computed by sequentially stepping through the rows, observing the column values and keeping track of the current percentage of the total sum accounted for at the current value.
This method applies only to columns containing numerical data. Although this method will execute for columns containing negative values, the interpretation of the result will change (for example, negative percentages).
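A minimal sketch of the running computation (illustrative only):
>>> vals = [0, 1, 2, 0, 1, 2]
>>> total, cum, out = float(sum(vals)), 0.0, []
>>> for v in vals:
...     cum += v
...     out.append(cum / total)
>>> [round(x, 4) for x in out]
[0.0, 0.1667, 0.5, 0.5, 0.6667, 1.0]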
Consider Frame my_frame accessing a frame that contains a single column named obs:
>>> my_frame.inspect()
[#] obs
========
[0] 0
[1] 1
[2] 2
[3] 0
[4] 1
[5] 2
The cumulative percent sum for column obs is obtained by:
>>> my_frame.cumulative_percent('obs')
[===Job Progress===]
The Frame my_frame now contains two columns, obs and obs_cumulative_percent. They contain the original data and the cumulative percent sum, respectively:
>>> my_frame.inspect()
[#] obs obs_cumulative_percent
================================
[0] 0 0.0
[1] 1 0.166666666667
[2] 2 0.5
[3] 0 0.5
[4] 1 0.666666666667
[5] 2 1.0
def cumulative_sum(
self, sample_col)
Add column to frame with cumulative sum.
sample_col | (str): | The name of the column from which to compute the cumulative sum. |
A cumulative sum is computed by sequentially stepping through the rows, observing the column values and keeping track of the cumulative sum for each value.
This method applies only to columns containing numerical data.
Consider Frame my_frame, which accesses a frame that contains a single column named obs:
>>> my_frame.inspect()
[#] obs
========
[0] 0
[1] 1
[2] 2
[3] 0
[4] 1
[5] 2
The cumulative sum for column obs is obtained by:
>>> my_frame.cumulative_sum('obs')
[===Job Progress===]
The Frame my_frame accesses the original frame, which now contains two columns: obs, which contains the original column values, and obs_cumulative_sum, which contains the cumulative sum:
>>> my_frame.inspect()
[#] obs obs_cumulative_sum
============================
[0] 0 0.0
[1] 1 1.0
[2] 2 3.0
[3] 0 3.0
[4] 1 4.0
[5] 2 6.0
def dot_product(
self, left_column_names, right_column_names, dot_product_column_name, default_left_values=None, default_right_values=None)
Calculate dot product for each row in current frame.
left_column_names | (List[str]): | Names of columns used to create the left vector (A) for each row. Names should refer to a single column of type vector, or two or more columns of numeric scalars. |
right_column_names | (List[str]): | Names of columns used to create right vector (B) for each row. Names should refer to a single column of type vector, or two or more columns of numeric scalars. |
dot_product_column_name | (str): | Name of column used to store the dot product. |
default_left_values | (Optional[List[float]]): | Default values used to substitute null values in the left vector. Default is None. |
default_right_values | (Optional[List[float]]): | Default values used to substitute null values in the right vector. Default is None. |
Returns | (Frame): | returns the frame with a new column holding the dot product, named per dot_product_column_name |
Calculate the dot product for each row in a frame using values from two equal-length sequences of columns.
The dot product is computed by the following formula:
The dot product of two vectors :math:`A = [a_1, a_2, ..., a_n]` and :math:`B = [b_1, b_2, ..., b_n]` is :math:`a_1 b_1 + a_2 b_2 + ... + a_n b_n`.
The dot product for each row is stored in a new column in the existing frame.
- If default_left_values or default_right_values are not specified, any null values will be replaced by zeros.
- This method applies only to columns containing numerical data.
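To make the formula concrete, here is the computation for the first row of the example below, in plain Python (not part of the sparktk API), where A is built from col_0 and col_1 and B from col_2 and col_3:
>>> a = [1, 0.2]   # col_0, col_1 of row 0
>>> b = [-2, 5]    # col_2, col_3 of row 0
>>> sum(x * y for x, y in zip(a, b))
-1.0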
>>> data = [[1, 0.2, -2, 5], [2, 0.4, -1, 6], [3, 0.6, 0, 7], [4, 0.8, 1, 8]]
>>> schema = [('col_0', int), ('col_1', float),('col_2', int) ,('col_3', int)]
>>> my_frame = tc.frame.create(data, schema)
[===Job Progress===]
Calculate the dot product for a sequence of columns in Frame object my_frame:
>>> my_frame.inspect()
[#] col_0 col_1 col_2 col_3
===============================
[0] 1 0.2 -2 5
[1] 2 0.4 -1 6
[2] 3 0.6 0 7
[3] 4 0.8 1 8
Modify the frame by computing the dot product for a sequence of columns:
>>> my_frame.dot_product(['col_0','col_1'], ['col_2', 'col_3'], 'dot_product')
[===Job Progress===]
>>> my_frame.inspect()
[#] col_0 col_1 col_2 col_3 dot_product
============================================
[0] 1 0.2 -2 5 -1.0
[1] 2 0.4 -1 6 0.4
[2] 3 0.6 0 7 4.2
[3] 4 0.8 1 8 10.4
def drop_columns(
self, columns)
Drops columns from the frame
columns | (str or List[str]): | names of the columns to drop |
For this example, the Frame object frame accesses a frame with four columns, column_a, column_b, column_c and column_d, and drops two columns, column_b and column_d, using drop_columns.
>>> print frame.schema
[('column_a', <type 'str'>), ('column_b', <type 'int'>), ('column_c', <type 'str'>), ('column_d', <type 'int'>)]
Eliminate columns column_b and column_d:
>>> frame.drop_columns(["column_b", "column_d"])
>>> print frame.schema
[('column_a', <type 'str'>), ('column_c', <type 'str'>)]
Now the frame only has the columns column_a and column_c.
For further examples, see :ref:`example_frame.drop_columns`.
def drop_duplicates(
self, unique_columns=None)
Modify the current frame, removing duplicate rows.
unique_columns | (Optional[List[str] or str]): | Column name(s) to identify duplicates. Default is the entire row is compared. |
Remove data rows which are the same as other rows. The entire row can be checked for duplication, or the search for duplicates can be limited to one or more columns. This modifies the current frame.
Given a frame with data:
>>> frame.inspect()
[#] a b c
===============
[0] 200 4 25
[1] 200 5 25
[2] 200 4 25
[3] 200 5 35
[4] 200 6 25
[5] 200 8 35
[6] 200 4 45
[7] 200 4 25
[8] 200 5 25
[9] 201 4 25
Remove any rows that are identical to a previous row. The result is a frame of unique rows. Note that row order may change.
>>> frame.drop_duplicates()
[===Job Progress===]
>>> frame.inspect()
[#] a b c
===============
[0] 200 8 35
[1] 200 6 25
[2] 200 5 35
[3] 200 4 45
[4] 200 4 25
[5] 200 5 25
[6] 201 4 25
Now remove any rows that have the same data in columns a and c as a previously checked row:
>>> frame.drop_duplicates([ "a", "c"])
[===Job Progress===]
The result is a frame with unique values for the combination of columns a and c.
>>> frame.inspect()
[#] a b c
===============
[0] 201 4 25
[1] 200 4 45
[2] 200 6 25
[3] 200 8 35
def drop_rows(
self, predicate)
Erase any row in the current frame which qualifies.
predicate | (UDF): | Function which evaluates a row to a boolean; rows that answer True are dropped from the frame. |
>>> frame = tc.frame.create([['Fred',39,16,'555-1234'],
... ['Susan',33,3,'555-0202'],
... ['Thurston',65,26,'555-4510'],
... ['Judy',44,14,'555-2183']],
... schema=[('name', str), ('age', int), ('tenure', int), ('phone', str)])
>>> frame.inspect()
[#] name age tenure phone
====================================
[0] Fred 39 16 555-1234
[1] Susan 33 3 555-0202
[2] Thurston 65 26 555-4510
[3] Judy 44 14 555-2183
>>> frame.drop_rows(lambda row: row.name[-1] == 'n') # drop people whose name ends in 'n'
>>> frame.inspect()
[#] name age tenure phone
================================
[0] Fred 39 16 555-1234
[1] Judy 44 14 555-2183
More information on a UDF can be found at :doc:`/ds_apir`.
def ecdf(
self, column)
Builds new frame with columns for data and distribution.
column | (str): | The name of the input column containing sample. |
Returns | (Frame): | A new Frame containing each distinct value in the sample and its corresponding ECDF value. |
Generates the empirical cumulative distribution for the input column.
Consider the following sample data set in frame 'frame' containing several numbers.
>>> frame.inspect()
[#] numbers
============
[0] 1
[1] 3
[2] 1
[3] 0
[4] 2
[5] 1
[6] 4
[7] 3
>>> ecdf_frame = frame.ecdf('numbers')
[===Job Progress===]
>>> ecdf_frame.inspect()
[#] numbers numbers_ecdf
==========================
[0] 0 0.125
[1] 1 0.5
[2] 2 0.625
[3] 3 0.875
[4] 4 1.0
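These values can be verified with a short plain-Python sketch (not part of the sparktk API): the ECDF at a value v is the fraction of samples less than or equal to v.
>>> data = [1, 3, 1, 0, 2, 1, 4, 3]
>>> [(v, sum(1 for x in data if x <= v) / float(len(data))) for v in sorted(set(data))]
[(0, 0.125), (1, 0.5), (2, 0.625), (3, 0.875), (4, 1.0)]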
def entropy(
self, data_column, weights_column=None)
Calculate the Shannon entropy of a column.
data_column | (str): | The column whose entropy is to be calculated. |
weights_column | (Optional[str]): | The column that provides weights (frequencies) for the entropy calculation. Must contain numerical data. Default is to use uniform weights of 1 for all items. |
Returns | (float): | Entropy. |
The data column is weighted via the weights column. All data elements of weight <= 0 are excluded from the calculation, as are all data elements whose weight is NaN or infinite. If there are no data elements with a finite weight greater than 0, the entropy is zero.
Consider the following sample data set in frame 'frame' containing several numbers.
>>> frame.inspect()
[#] data weight
=================
[0] 0 1
[1] 1 2
[2] 2 4
[3] 4 8
>>> entropy = frame.entropy("data", "weight")
[===Job Progress===]
>>> "%0.8f" % entropy
'1.13691659'
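The result can be checked with a short plain-Python sketch of weighted Shannon entropy (not part of the sparktk API): normalize the weights to probabilities and compute -sum(p * ln(p)).
>>> import math
>>> weights = [1.0, 2.0, 4.0, 8.0]
>>> probs = [w / sum(weights) for w in weights]
>>> "%0.8f" % (-sum(p * math.log(p) for p in probs))
'1.13691659'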
If we have more choices and weights, the computation is not as simple. An on-line search for "Shannon Entropy" will provide more detail.
Given a frame of coin flips, half heads and half tails, the entropy is simply ln(2):
>>> frame.inspect()
[#] data
=========
[0] H
[1] T
[2] H
[3] T
[4] H
[5] T
[6] H
[7] T
[8] H
[9] T
>>> entropy = frame.entropy("data")
[===Job Progress===]
>>> "%0.8f" % entropy
'0.69314718'
def export_to_csv(
self, file_name, separator=',')
Write current frame to disk as a CSV file
file_name | (str): | file destination |
separator | (str): | string to be used for delimiting the fields |
>>> frame = tc.frame.create([[1, 2, 3], [4, 5, 6]])
>>> frame.export_to_csv("sandbox/export_example.csv")
>>> frame2 = tc.frame.import_csv("sandbox/export_example.csv")
>>> frame2.inspect()
[#] C0 C1 C2
===============
[0] 1 2 3
[1] 4 5 6
def export_to_hbase(
self, table_name, key_column_name=None, family_name='familyColumn')
Write current frame to HBase table.
Table must exist in HBase.
table_name | (str): | The name of the HBase table that will contain the exported frame |
key_column_name | (Optional[str]): | The name of the column to be used as row key in hbase table |
family_name | (Optional[str]): | The family name of the HBase table that will contain the exported frame |
>>> data = [[1, 0.2, -2, 5], [2, 0.4, -1, 6], [3, 0.6, 0, 7], [4, 0.8, 1, 8]]
>>> schema = [('a', int), ('b', float),('c', int) ,('d', int)]
>>> my_frame = tc.frame.create(data, schema)
>>> my_frame.export_to_hbase("test_demo_hbase", family_name="test_family")
[===Job Progress===]
Verify exported frame in hbase
From bash shell
$hbase shell
hbase(main):001:0> list
You should see test_demo_hbase table.
Run hbase(main):001:0> scan 'test_demo_hbase' (to verify frame).
Output:
ROW COLUMN+CELL
0 column=test_family:a, timestamp=1464219662295, value=1
0 column=test_family:b, timestamp=1464219662295, value=0.2
0 column=test_family:c, timestamp=1464219662295, value=-2
0 column=test_family:d, timestamp=1464219662295, value=5
1 column=test_family:a, timestamp=1464219662295, value=2
1 column=test_family:b, timestamp=1464219662295, value=0.4
1 column=test_family:c, timestamp=1464219662295, value=-1
1 column=test_family:d, timestamp=1464219662295, value=6
2 column=test_family:a, timestamp=1464219662295, value=3
2 column=test_family:b, timestamp=1464219662295, value=0.6
2 column=test_family:c, timestamp=1464219662295, value=0
2 column=test_family:d, timestamp=1464219662295, value=7
3 column=test_family:a, timestamp=1464219662295, value=4
3 column=test_family:b, timestamp=1464219662295, value=0.8
3 column=test_family:c, timestamp=1464219662295, value=1
3 column=test_family:d, timestamp=1464219662295, value=8
4 row(s) in 0.1560 seconds
def export_to_hive(
self, hive_table_name)
Write current frame to Hive table.
Table must not exist in Hive. Hive does not support case-sensitive table names and column names. Hence, column names with uppercase letters will be converted to lower case by Hive.
hive_table_name | (str): | hive table name |
>>> data = [[1, 0.2, -2, 5], [2, 0.4, -1, 6], [3, 0.6, 0, 7], [4, 0.8, 1, 8]]
>>> schema = [('a', int), ('b', float),('c', int) ,('d', int)]
>>> my_frame = tc.frame.create(data, schema)
[===Job Progress===]
>>> my_frame.export_to_hive("demo_test_hive")
[===Job Progress===]
Verify exported frame in hive
From bash shell
$hive
hive> show tables
You should see demo_test_hive table.
Run hive> select * from demo_test_hive; (to verify frame).
def export_to_jdbc(
self, connection_url, table_name)
Write current frame to JDBC table
connection_url | (str): | JDBC connection url to database server |
table_name | (str): | JDBC table name |
>>> from sparktk import TkContext
>>> tc = TkContext(sc)
>>> data = [[1, 0.2, -2, 5], [2, 0.4, -1, 6], [3, 0.6, 0, 7], [4, 0.8, 1, 8]]
>>> schema = [('a', int), ('b', float),('c', int) ,('d', int)]
>>> my_frame = tc.frame.create(data, schema)
[===Job Progress===]
connection_url must follow the pattern "jdbc:{database_type}://{host}/{database_name}". For example, a standard connection string for the default 'postgres' database on localhost is: jdbc:postgresql://localhost/postgres
The table will be created if a table with the given name does not already exist.
>>> my_frame.export_to_jdbc("jdbc:postgresql://localhost/postgres", "demo_test")
[===Job Progress===]
Verify exported frame in postgres
From bash shell
$sudo -su postgres psql
postgres=#\d
You should see demo_test table.
Run postgres=#select * from demo_test (to verify frame).
def export_to_json(
self, path, count=0, offset=0)
Write current frame to HDFS in JSON format.
path | (str): | The HDFS folder path where the files will be created. |
count | (Optional[int]): | The number of records you want. Default (0), or a non-positive value, is the whole frame. |
offset | (Optional[int]): | The number of rows to skip before exporting to the file. Default is zero (0). |
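A minimal usage sketch (the 'sandbox' destination folder is an assumption, mirroring the export_to_csv example above):
>>> frame = tc.frame.create([[1, 2, 3], [4, 5, 6]])
>>> frame.export_to_json("sandbox/export_example_json")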
def filter(
self, predicate)
Select all rows which satisfy a predicate.
Modifies the current frame to save defined rows and delete everything else.
predicate | (UDF): | Function which evaluates a row to a boolean; rows that answer False are dropped from the frame. |
>>> frame = tc.frame.create([['Fred',39,16,'555-1234'],
... ['Susan',33,3,'555-0202'],
... ['Thurston',65,26,'555-4510'],
... ['Judy',44,14,'555-2183']],
... schema=[('name', str), ('age', int), ('tenure', int), ('phone', str)])
>>> frame.inspect()
[#] name age tenure phone
====================================
[0] Fred 39 16 555-1234
[1] Susan 33 3 555-0202
[2] Thurston 65 26 555-4510
[3] Judy 44 14 555-2183
>>> frame.filter(lambda row: row.tenure >= 15) # keep only people with 15 or more years tenure
>>> frame.inspect()
[#] name age tenure phone
====================================
[0] Fred 39 16 555-1234
[1] Thurston 65 26 555-4510
More information on a UDF can be found at :doc:`/ds_apir`.
def flatten_columns(
self, columns)
Spread data to multiple rows based on cell data.
columns | (str or tuple(str, str)): | The name of the column to be flattened, or a tuple with the column name and delimiter string. The default delimiter is a comma (,). |
Splits cells in the specified columns into multiple rows according to a string delimiter. New rows are a full copy of the original row, but the specified columns only contain one value. The original row is deleted.
Given a data file:
1-solo,mono,single-green,yellow,red
2-duo,double-orange,black
The commands to bring the data into a frame, where it can be worked on:
>>> frame.inspect()
[#] a b c
==========================================
[0] 1 solo,mono,single green|yellow|red
[1] 2 duo,double orange|black
Now, spread out those sub-strings in columns b and c by specifying the column names and delimiters:
>>> frame.flatten_columns([('b', ','), ('c', '|')])
[===Job Progress===]
Note that the delimiters parameter is optional, and if no delimiter is specified, the default is a comma (,). So, in the above example, the delimiter parameter for b could be omitted.
Check again:
>>> frame.inspect()
[#] a b c
======================
[0] 1 solo green
[1] 1 mono yellow
[2] 1 single red
[3] 2 duo orange
[4] 2 double black
Alternatively, we can flatten a single column b using the default comma delimiter:
>>> frame.flatten_columns('b')
[===Job Progress===]
Check again:
>>> frame.inspect()
[#] a b c
================================
[0] 1 solo green|yellow|red
[1] 1 mono green|yellow|red
[2] 1 single green|yellow|red
[3] 2 duo orange|black
[4] 2 double orange|black
def group_by(
self, group_by_columns, *aggregations)
Create a summarized frame with aggregations (Avg, Count, Max, Min, Mean, Sum, Stdev, ...).
group_by_columns | (List[str]): | list of columns to group on |
aggregations | (dict): | Aggregation function based on entire row, and/or dictionaries (one or more) of { column name str : aggregation function(s) }. |
Returns | (Frame): | Summarized Frame |
Creates a new frame and returns a Frame object to access it. Takes a column or group of columns, finds the unique combination of values, and creates unique rows with these column values. The other columns are combined according to the aggregation argument(s).
Aggregation currently supports using the following functions:
* avg
* count
* count_distinct
* max
* min
* stdev
* sum
* var
* histogram()
- Column order is not guaranteed when columns are added.
- The column names created by aggregation functions in the new frame are the original column name appended with the '_' character and the aggregation function. For example, if the original field is a and the function is avg, the resultant column is named a_AVG.
- An aggregation argument of count results in a column named count.
- The aggregation function agg.count is the only full row aggregation function supported at this time.
Consider this frame:
>>> frame.inspect()
[#] a b c d e f g
========================================
[0] 1 alpha 3.0 small 1 3.0 9
[1] 1 bravo 5.0 medium 1 4.0 9
[2] 1 alpha 5.0 large 1 8.0 8
[3] 2 bravo 8.0 large 1 5.0 7
[4] 2 charlie 12.0 medium 1 6.0 6
[5] 2 bravo 7.0 small 1 8.0 5
[6] 2 bravo 12.0 large 1 6.0 4
Count the groups in column 'b'
>>> b_count = frame.group_by('b', tc.agg.count)
[===Job Progress===]
>>> b_count.inspect()
[#] b count
===================
[0] alpha 2
[1] charlie 1
[2] bravo 4
Group by columns 'a' and 'b' and compute the average for column 'c'
>>> avg1 = frame.group_by(['a', 'b'], {'c' : tc.agg.avg})
>>> avg1.inspect()
[#] a b c_AVG
======================
[0] 2 charlie 12.0
[1] 2 bravo 9.0
[2] 1 bravo 5.0
[3] 1 alpha 4.0
Group by column 'a' and make a bunch of calculations for the grouped columns 'f' and 'g'
>>> mix_frame = frame.group_by('a', tc.agg.count, {'f': [tc.agg.avg, tc.agg.sum, tc.agg.min], 'g': tc.agg.max})
>>> mix_frame.inspect()
[#] a count g_MAX f_AVG f_SUM f_MIN
=========================================
[0] 2 4 7 6.25 25.0 5.0
[1] 1 3 9 5.0 15.0 3.0
Group by with histogram. The histogram aggregation argument is configured with these parameters:
cutoffs | (List[int or float or long or double]): | An array of values containing bin cutoff points. Array can be list or tuple. If an array is provided, values must be progressively increasing. All bin boundaries must be included, so, with N bins, you need N+1 values. For example, |
cutoffs=[1, 5, 8, 12] # creates three bins:
# bin0 holds values [1 inclusive - 5 exclusive]
# bin1 holds values [5 inclusive - 8 exclusive]
# bin2 holds values [8 inclusive - 12 inclusive]
include_lowest | (Optional[bool]): | Specify how the boundary conditions are handled. ``True`` indicates that the lower bound of the bin is inclusive. ``False`` indicates that the upper bound is inclusive. Default is ``True``. |
strict_binning | (Optional(bool)): | Specify how values outside of the cutoffs array should be binned. If set to ``True``, each value less than cutoffs[0] or greater than cutoffs[-1] will be assigned a bin value of -1. If set to ``False``, values less than cutoffs[0] will be included in the first bin while values greater than cutoffs[-1] will be included in the final bin. |
>>> hist = frame.group_by('a', {'g': tc.agg.histogram([1, 5, 8, 9])})
>>> hist.inspect()
[#] a g_HISTOGRAM
=========================
[0] 2 [0.25, 0.75, 0.0]
[1] 1 [0.0, 0.0, 1.0]
>>> hist = frame.group_by('a', {'g': tc.agg.histogram([1, 5, 8, 9], False)})
>>> hist.inspect()
[#] a g_HISTOGRAM
=============================================
[0] 2 [0.5, 0.5, 0.0]
[1] 1 [0.0, 0.333333333333, 0.666666666667]
def histogram(
self, column_name, num_bins=None, weight_column_name=None, bin_type='equalwidth')
Compute the histogram for a column in a frame.
The returned value is a Histogram object containing 3 lists, one each for: the cutoff points of the bins, the size of each bin, and the density of each bin.
column_name | (str): | Name of column to be evaluated. |
num_bins | (Optional[int]): | Number of bins in histogram. Default is the square-root choice, in other words math.floor(math.sqrt(frame.count())). |
weight_column_name | (Optional[str]): | Name of column containing weights. Default is all observations are weighted equally. |
bin_type | (str["equalwidth"|"equaldepth"]): | The type of binning algorithm to use: "equalwidth" or "equaldepth". Default is "equalwidth". |
Returns | (Histogram): | A Histogram object containing the result set. The data returned is composed of multiple components: cutoffs (array of float), a list containing the edges of each bin; hist (array of float), a list containing the count of the weighted observations found in each bin; density (array of float), a list containing the fraction of the total observations found in each bin. |
The num_bins parameter is considered to be the maximum permissible number of bins because the data may dictate fewer bins. With equal depth binning, for example, if the column to be binned has 10 elements with only 2 distinct values and the num_bins parameter is greater than 2, then the actual number of bins will only be 2. This is due to a restriction that elements with an identical value must belong to the same bin.
Consider the following sample data set:
>>> frame.inspect()
[#] a b
=========
[0] a 2
[1] b 7
[2] c 3
[3] d 9
[4] e 1
A simple call for 3 equal-width bins gives:
>>> hist = frame.histogram("b", num_bins=3)
>>> hist.cutoffs
[1.0, 3.6666666666666665, 6.333333333333333, 9.0]
>>> hist.hist
[3.0, 0.0, 2.0]
>>> hist.density
[0.6, 0.0, 0.4]
Switching to equal depth gives:
>>> hist = frame.histogram("b", num_bins=3, bin_type='equaldepth')
>>> hist.cutoffs
[1.0, 2.0, 7.0, 9.0]
>>> hist.hist
[1.0, 2.0, 2.0]
>>> hist.density
[0.2, 0.4, 0.4]
Plot hist as a bar chart using matplotlib:
>>> import matplotlib.pyplot as plt
>>> plt.bar(hist.cutoffs[:-1], hist.hist, width=hist.cutoffs[1] - hist.cutoffs[0])
def inspect(
self, n=10, offset=0, columns=None, wrap='inspect_settings', truncate='inspect_settings', round='inspect_settings', width='inspect_settings', margin='inspect_settings', with_types='inspect_settings')
Pretty-print of the frame data.
Essentially returns a string, but technically returns a RowsInspection object which renders a string. The RowsInspection object naturally converts to a str when needed, like when printed or when displayed by the python REPL (i.e. using the object's repr). If running in a script and you want the inspect output to be printed, it must be explicitly printed: print frame.inspect()
n | (Optional[int]): | The number of rows to print |
offset | (Optional[int]): | The number of rows to skip before printing. |
columns | (Optional[List[str]]): | Filter columns to be included. By default, all columns are included. |
wrap | (Optional[int or 'stripes']): | If set to 'stripes' then inspect prints rows in stripes; if set to an integer N, rows will be printed in clumps of N columns, where the columns are wrapped. |
truncate | (Optional[int]): | If set to integer N, all strings will be truncated to length N, including all tagged ellipses. |
round | (Optional[int]): | If set to integer N, all floating point numbers will be rounded and truncated to N digits. |
width | (Optional[int]): | If set to integer N, the print out will try to honor a max line width of N. |
margin | (Optional[int]): | Applies to 'stripes' mode only. If set to integer N, the margin for printing names in a stripe will be limited to N characters. |
with_types | (Optional[bool]): | If set to True, the header will include the data type of each column. |
Returns | (RowsInspection): | An object which naturally converts to a pretty-print string. |
To look at the first 4 rows of data in a frame:
>>> frame.inspect(4)
[#] animal name age weight
==================================
[0] human George 8 542.5
[1] human Ursula 6 495.0
[2] ape Ape 41 400.0
[3] elephant Shep 5 8630.0
For other examples, see :ref:`example_frame.inspect`.
Note: if the frame data contains unicode characters, this method may raise a Unicode exception when running in an interactive REPL or anywhere else that triggers the standard python repr(). To get around this problem, explicitly print the unicode of the returned object:
>>> print unicode(frame.inspect())
Global Settings
If not specified, the arguments that control formatting receive default values from 'sparktk.inspect_settings'. Make changes there to affect all calls to inspect.
>>> import sparktk
>>> sparktk.inspect_settings
wrap 20
truncate None
round None
width 80
margin None
with_types False
>>> sparktk.inspect_settings.width = 120 # changes inspect to use 120 width globally
>>> sparktk.inspect_settings.truncate = 16 # changes inspect to always truncate strings to 16 chars
>>> sparktk.inspect_settings
wrap 20
truncate 16
round None
width 120
margin None
with_types False
>>> sparktk.inspect_settings.width = None # return value back to default
>>> sparktk.inspect_settings
wrap 20
truncate 16
round None
width 80
margin None
with_types False
>>> sparktk.inspect_settings.reset() # set everything back to default
>>> sparktk.inspect_settings
wrap 20
truncate None
round None
width 80
margin None
with_types False
def join_inner(
self, right, left_on, right_on=None, use_broadcast=None)
join_inner performs an inner join operation on one or two frames, creating a new frame.
right | (Frame): | Another frame to join with |
left_on | (List[str]): | Names of the columns in the left frame used to match up the two frames. |
right_on | (Optional[List[str]]): | Names of the columns in the right frame used to match up the two frames. Default is the same as the left frame. |
use_broadcast | (Optional[str]): | If one of your tables is small enough to fit in the memory of a single machine, you can use a broadcast join. Specify that table to broadcast (left or right) to possibly improve performance. Default is None. |
Returns | (Frame): | A new frame with the results of the join |
Create a new frame from a SQL JOIN operation with another frame. The frame on the 'left' is the currently active frame. The frame on the 'right' is another frame. This method takes column(s) in the left frame and matches its values with column(s) in the right frame. An 'inner' join will only allow data in the resultant frame if both the left and right frames have the same value in the matching column(s).
When a column is named the same in both frames, it will result in two columns in the new frame. The column from the left frame (originally the current frame) will be copied and the column name will have the string "_L" added to it. The same thing will happen with the column from the right frame, except its name has the string "_R" appended. The order of columns after this method is called is not guaranteed.
It is recommended that you rename the columns to meaningful terms prior to using the join method.
Consider two frames: codes and colors
>>> codes.inspect()
[#] numbers
============
[0]        1
[1]        3
[2]        1
[3]        0
[4]        2
[5]        1
[6]        5
[7]        3
>>> colors.inspect()
[#] numbers  color
====================
[0]        1  red
[1]        2  yellow
[2]        3  green
[3]        4  blue
Inner join using hash joins:
>>> j = codes.join_inner(colors, 'numbers')
[===Job Progress===]
>>> j.inspect()
[#] numbers  color
====================
[0]        1  red
[1]        1  red
[2]        1  red
[3]        2  yellow
[4]        3  green
[5]        3  green
(The join adds an extra column *_R which is the join column from the right frame; it may be disregarded)
Consider two frames: country_codes_frame and country_names_frame
>>> country_codes_frame.inspect()
[#] country_code  area_code  test_str
======================================
[0]             1        354  a
[1]             2         91  a
[2]             2        100  b
[3]             3         47  a
[4]             4        968  c
[5]             5         50  c
>>> country_names_frame.inspect()
[#] country_code  country_name  test_str
=========================================
[0]             1  Iceland       a
[1]             1  Ice-land      a
[2]             2  India         b
[3]             3  Norway        a
[4]             4  Oman          c
[5]             6  Germany       c
Join them on the 'country_code' and 'test_str' columns:
>>> composite_join = country_codes_frame.join_inner(country_names_frame, ['country_code', 'test_str'])
[===Job Progress===]
>>> composite_join.inspect()
[#] country_code  area_code  test_str  country_name
====================================================
[0]             1        354  a         Iceland
[1]             1        354  a         Ice-land
[2]             2        100  b         India
[3]             3         47  a         Norway
[4]             4        968  c         Oman
Inner join broadcasting the left table
>>> j = codes.join_inner(colors, 'numbers', use_broadcast="left")
[===Job Progress===]
>>> j.inspect()
[#] numbers  color
====================
[0]        1  red
[1]        1  red
[2]        1  red
[3]        2  yellow
[4]        3  green
[5]        3  green
>>> composite_join_left = country_codes_frame.join_inner(country_names_frame, ['country_code', 'test_str'], use_broadcast="left")
[===Job Progress===]
>>> composite_join_left.inspect()
[#] country_code  area_code  test_str  country_name
====================================================
[0]             1        354  a         Iceland
[1]             1        354  a         Ice-land
[2]             2        100  b         India
[3]             3         47  a         Norway
[4]             4        968  c         Oman
Inner join broadcasting right table
>>> j = codes.join_inner(colors, 'numbers', use_broadcast="right")
[===Job Progress===]
>>> j.inspect()
[#] numbers  color
====================
[0]        1  red
[1]        3  green
[2]        1  red
[3]        2  yellow
[4]        1  red
[5]        3  green
>>> composite_join_right = country_codes_frame.join_inner(country_names_frame, ['country_code', 'test_str'], use_broadcast="right")
[===Job Progress===]
>>> composite_join_right.inspect()
[#] country_code  area_code  test_str  country_name
====================================================
[0]             1        354  a         Iceland
[1]             1        354  a         Ice-land
[2]             2        100  b         India
[3]             3         47  a         Norway
[4]             4        968  c         Oman
def join_left(
self, right, left_on, right_on=None, use_broadcast_right=False)
join_left performs a left (left outer) join operation on one or two frames, creating a new frame.
right | (Frame): | Another frame to join with |
left_on | (List[str]): | Names of the columns in the left frame used to match up the two frames. |
right_on | (Optional[List[str]]): | Names of the columns in the right frame used to match up the two frames. Default is the same as the left frame. |
use_broadcast_right | (bool): | If right table is small enough to fit in the memory of a single machine, you can set use_broadcast_right to True to possibly improve performance using broadcast join. Default is False. |
Returns | (Frame): | A new frame with the results of the join |
Create a new frame from a SQL JOIN operation with another frame. The frame on the 'left' is the currently active frame. The frame on the 'right' is another frame. This method takes column(s) in the left frame and matches its values with column(s) in the right frame. A 'left' join keeps all rows from the left frame, and includes data from the right frame only where its column value(s) match those in the left frame.
When a column is named the same in both frames, it will result in two columns in the new frame. The column from the left frame (originally the current frame) will be copied and the column name will have the string "_L" added to it. The same thing will happen with the column from the right frame, except its name has the string "_R" appended. The order of columns after this method is called is not guaranteed.
It is recommended that you rename the columns to meaningful terms prior to using the join method.
Consider two frames: codes and colors
>>> codes.inspect()
[#] numbers
============
[0]        1
[1]        3
[2]        1
[3]        0
[4]        2
[5]        1
[6]        5
[7]        3
>>> colors.inspect()
[#] numbers  color
====================
[0]        1  red
[1]        2  yellow
[2]        3  green
[3]        4  blue
>>> j_left = codes.join_left(colors, 'numbers')
[===Job Progress===]
>>> j_left.inspect()
[#] numbers_L  color
======================
[0]          0  None
[1]          1  red
[2]          1  red
[3]          1  red
[4]          2  yellow
[5]          3  green
[6]          3  green
[7]          5  None
(The join adds an extra column *_R which is the join column from the right frame; it may be disregarded)
Consider two frames: country_codes_frame and country_names_frame
>>> country_codes_frame.inspect()
[#] country_code  area_code  test_str
======================================
[0]             1        354  a
[1]             2         91  a
[2]             2        100  b
[3]             3         47  a
[4]             4        968  c
[5]             5         50  c
>>> country_names_frame.inspect()
[#] country_code  country_name  test_str
=========================================
[0]             1  Iceland       a
[1]             1  Ice-land      a
[2]             2  India         b
[3]             3  Norway        a
[4]             4  Oman          c
[5]             6  Germany       c
Join them on the 'country_code' and 'test_str' columns using a left join:
>>> composite_join_left = country_codes_frame.join_left(country_names_frame, ['country_code', 'test_str'])
[===Job Progress===]
>>> composite_join_left.inspect()
[#] country_code_L  area_code  test_str_L  country_name
========================================================
[0]               1        354  a           Iceland
[1]               1        354  a           Ice-land
[2]               2         91  a           None
[3]               2        100  b           India
[4]               3         47  a           Norway
[5]               4        968  c           Oman
[6]               5         50  c           None
Left join broadcasting right table
>>> j_left = codes.join_left(colors, 'numbers', use_broadcast_right=True)
[===Job Progress===]
>>> j_left.inspect()
[#] numbers_L  color
======================
[0]          1  red
[1]          3  green
[2]          1  red
[3]          0  None
[4]          2  yellow
[5]          1  red
[6]          5  None
[7]          3  green
>>> composite_join_left = country_codes_frame.join_left(country_names_frame, ['country_code', 'test_str'], use_broadcast_right=True)
[===Job Progress===]
>>> composite_join_left.inspect()
[#] country_code_L  area_code  test_str_L  country_name
========================================================
[0]               1        354  a           Iceland
[1]               1        354  a           Ice-land
[2]               2         91  a           None
[3]               2        100  b           India
[4]               3         47  a           Norway
[5]               4        968  c           Oman
[6]               5         50  c           None
def join_outer(
self, right, left_on, right_on=None)
join_outer performs a full outer join operation on one or two frames, creating a new frame.
right | (Frame): | Another frame to join with |
left_on | (List[str]): | Names of the columns in the left frame used to match up the two frames. |
right_on | (Optional[List[str]]): | Names of the columns in the right frame used to match up the two frames. Default is the same as the left frame. |
Returns | (Frame): | A new frame with the results of the join |
Create a new frame from a SQL JOIN operation with another frame. The frame on the 'left' is the currently active frame. The frame on the 'right' is another frame. This method takes column(s) in the left frame and matches its values with column(s) in the right frame. The 'outer' join keeps rows from both frames: rows are matched where the column value(s) agree, and rows with no match in the other frame are kept with missing values (None) filled in.
When a column is named the same in both frames, it will result in two columns in the new frame. The column from the left frame (originally the current frame) will be copied and the column name will have the string "_L" added to it. The same thing will happen with the column from the right frame, except its name has the string "_R" appended. The order of columns after this method is called is not guaranteed.
It is recommended that you rename the columns to meaningful terms prior to using the join method.
Consider two frames: codes and colors
>>> codes.inspect()
[#] numbers
============
[0]        1
[1]        3
[2]        1
[3]        0
[4]        2
[5]        1
[6]        5
[7]        3
>>> colors.inspect()
[#] numbers  color
====================
[0]        1  red
[1]        2  yellow
[2]        3  green
[3]        4  blue
Join them on the 'numbers' column using an outer join:
>>> j_outer = codes.join_outer(colors, 'numbers')
[===Job Progress===]
>>> j_outer.inspect()
[#] numbers_L  color
======================
[0]          0  None
[1]          1  red
[2]          1  red
[3]          1  red
[4]          2  yellow
[5]          3  green
[6]          3  green
[7]          4  blue
[8]          5  None
(The join adds an extra column *_R which is the join column from the right frame; it may be disregarded)
Consider two frames: country_codes_frame and country_names_frame
>>> country_codes_frame.inspect()
[#] country_code  area_code  test_str
======================================
[0]             1        354  a
[1]             2         91  a
[2]             2        100  b
[3]             3         47  a
[4]             4        968  c
[5]             5         50  c
>>> country_names_frame.inspect()
[#] country_code  country_name  test_str
=========================================
[0]             1  Iceland       a
[1]             1  Ice-land      a
[2]             2  India         b
[3]             3  Norway        a
[4]             4  Oman          c
[5]             6  Germany       c
Join them on the 'country_code' and 'test_str' columns using an outer join:
>>> composite_join_outer = country_codes_frame.join_outer(country_names_frame, ['country_code', 'test_str'])
[===Job Progress===]
>>> composite_join_outer.inspect()
[#] country_code_L  area_code  test_str_L  country_name
========================================================
[0]               6       None  c           Germany
[1]               1        354  a           Iceland
[2]               1        354  a           Ice-land
[3]               2         91  a           None
[4]               2        100  b           India
[5]               3         47  a           Norway
[6]               4        968  c           Oman
[7]               5         50  c           None
def join_right(
self, right, left_on, right_on=None, use_broadcast_left=False)
join_right performs a right (right outer) join operation on one or two frames, creating a new frame.
right | (Frame): | Another frame to join with |
left_on | (List[str]): | Names of the columns in the left frame used to match up the two frames. |
right_on | (Optional[List[str]]): | Names of the columns in the right frame used to match up the two frames. Default is the same as the left frame. |
use_broadcast_left | (bool): | If the left table is small enough to fit in the memory of a single machine, you can set use_broadcast_left to True to possibly improve performance using a broadcast join. Default is False. |
Returns | (Frame): | A new frame with the results of the join |
Create a new frame from a SQL JOIN operation with another frame. The frame on the 'left' is the currently active frame. The frame on the 'right' is another frame. This method takes column(s) in the left frame and matches its values with column(s) in the right frame. A 'right' join works similarly to join_left, except it keeps all the data from the right frame and only the data from the left frame when it matches.
When a column is named the same in both frames, it will result in two columns in the new frame. The column from the left frame (originally the current frame) will be copied and the column name will have the string "_L" added to it. The same thing will happen with the column from the right frame, except its name has the string "_R" appended. The order of columns after this method is called is not guaranteed.
It is recommended that you rename the columns to meaningful terms prior to using the join method.
Consider two frames: codes and colors
>>> codes.inspect()
[#] numbers
============
[0]        1
[1]        3
[2]        1
[3]        0
[4]        2
[5]        1
[6]        5
[7]        3
>>> colors.inspect()
[#] numbers  color
====================
[0]        1  red
[1]        2  yellow
[2]        3  green
[3]        4  blue
>>> j_right = codes.join_right(colors, 'numbers')
[===Job Progress===]
>>> j_right.inspect()
[#] numbers_R  color
======================
[0]          1  red
[1]          1  red
[2]          1  red
[3]          2  yellow
[4]          3  green
[5]          3  green
[6]          4  blue
(The join adds an extra column *_R which is the join column from the right frame; it may be disregarded)
Consider two frames: country_codes_frame and country_names_frame
>>> country_codes_frame.inspect()
[#] country_code  area_code  test_str
======================================
[0]             1        354  a
[1]             2         91  a
[2]             2        100  b
[3]             3         47  a
[4]             4        968  c
[5]             5         50  c
>>> country_names_frame.inspect()
[#] country_code  country_name  test_str
=========================================
[0]             1  Iceland       a
[1]             1  Ice-land      a
[2]             2  India         b
[3]             3  Norway        a
[4]             4  Oman          c
[5]             6  Germany       c
Join them on the 'country_code' and 'test_str' columns using a right join:
>>> composite_join_right = country_codes_frame.join_right(country_names_frame, ['country_code', 'test_str'])
[===Job Progress===]
>>> composite_join_right.inspect()
[#] area_code  country_code_R  country_name  test_str_R
========================================================
[0]       None               6  Germany       c
[1]        354               1  Iceland       a
[2]        354               1  Ice-land      a
[3]        100               2  India         b
[4]         47               3  Norway        a
[5]        968               4  Oman          c
Right join broadcasting left table
>>> j_right = codes.join_right(colors, 'numbers', use_broadcast_left=True)
[===Job Progress===]
>>> j_right.inspect()
[#] numbers_R  color
======================
[0]          1  red
[1]          1  red
[2]          1  red
[3]          2  yellow
[4]          3  green
[5]          3  green
[6]          4  blue
>>> composite_join_right = country_codes_frame.join_right(country_names_frame, ['country_code', 'test_str'], use_broadcast_left=True)
[===Job Progress===]
>>> composite_join_right.inspect()
[#] area_code  country_code_R  country_name  test_str_R
========================================================
[0]        354               1  Iceland       a
[1]        354               1  Ice-land      a
[2]        100               2  India         b
[3]         47               3  Norway        a
[4]        968               4  Oman          c
[5]       None               6  Germany       c
def map_columns(
self, func, schema)
Create a new frame from the output of a UDF applied to each row of the current frame.
- The row UDF ('func') must return a value in the same format as specified by the schema.
func | (UDF): | Function which takes the values in the row and produces a value, or collection of values, for the new cell(s). |
schema | (List[(str,type)]): | Schema for the column(s) being added. |
Given our frame, let's create a new frame containing the name and a column showing how many years each person has been over 18:
>>> frame = tc.frame.create([['Fred',39,16,'555-1234'],
... ['Susan',33,3,'555-0202'],
... ['Thurston',65,26,'555-4510'],
... ['Judy',44,14,'555-2183']],
... schema=[('name', str), ('age', int), ('tenure', int), ('phone', str)])
>>> frame.inspect()
[#] name age tenure phone
====================================
[0] Fred 39 16 555-1234
[1] Susan 33 3 555-0202
[2] Thurston 65 26 555-4510
[3] Judy 44 14 555-2183
>>> adult = frame.map_columns(lambda row: [row.name, row.age - 18], [('name', str), ('adult_years', int)])
>>> adult.inspect()
[#] name adult_years
==========================
[0] Fred 21
[1] Susan 15
[2] Thurston 47
[3] Judy 26
Note that the function returns a list, and therefore the schema also needs to be a list.
It is not necessary to use lambda syntax, any function will do, as long as it takes a single row argument. We can also call other local functions within.
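For example, a named function with a local helper works just like the lambda above (a sketch using the same frame):
>>> def years_over_18(age):
...     return age - 18
>>> def make_row(row):
...     return [row.name, years_over_18(row.age)]
>>> adult2 = frame.map_columns(make_row, [('name', str), ('adult_years', int)])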
(see also the 'add_columns' frame operation)
def matrix_covariance_matrix(
self, matrix_column_name)
Compute the Covariance Matrix of matrices stored in a frame
matrix_column_name | (str): | Name of the column to compute the covariance matrix on |
Returns | (Frame): | returns the frame with a new column storing the covariance matrix for the corresponding matrix |
Calculate the covariance matrix for each matrix in column 'matrix_column_name' of a frame.
Element (i,j) of the covariance matrix for a given matrix X is computed as :math:`E[(X_i - M_i)(X_j - M_j)]`, where :math:`M_i` is the mean of :math:`X_i`.
>>> from sparktk import dtypes
>>> data = [[1, [[1,2,3,5],[2,3,5,6],[4,6,7,3],[8,9,2,4]]]]
>>> schema = [('id', int),('pixeldata', dtypes.matrix)]
>>> my_frame = tc.frame.create(data, schema)
>>> my_frame.inspect()
[#] id pixeldata
============================
[0] 1 [[ 1. 2. 3. 5.]
[ 2. 3. 5. 6.]
[ 4. 6. 7. 3.]
[ 8. 9. 2. 4.]]
Compute the covariance matrix for the matrices in 'pixeldata' column of the frame
>>> my_frame.matrix_covariance_matrix('pixeldata')
A new column gets added to the existing frame storing the covariance matrix
>>> my_frame.inspect()
[#] id pixeldata
============================
[0] 1 [[ 1. 2. 3. 5.]
[ 2. 3. 5. 6.]
[ 4. 6. 7. 3.]
[ 8. 9. 2. 4.]]
<BLANKLINE>
[#] CovarianceMatrix_pixeldata
============================================================
[0] [[ 2.91666667 3. -1. -3.75 ]
[ 3. 3.33333333 -0.33333333 -5. ]
[ -1. -0.33333333 3.33333333 -1. ]
[ -3.75 -5. -1. 10.91666667]]
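The result above can be cross-checked with numpy, under the assumption (consistent with the numbers shown) that the rows of the stored matrix are treated as variables and the columns as observations, with an N-1 denominator, which matches np.cov's defaults:
>>> import numpy as np
>>> X = np.array([[1, 2, 3, 5], [2, 3, 5, 6], [4, 6, 7, 3], [8, 9, 2, 4]], dtype=float)
>>> round(np.cov(X)[0, 0], 8)    # the full matrix np.cov(X) reproduces CovarianceMatrix_pixeldata
2.91666667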
def matrix_pca(
self, matrix_column_name, v_matrix_column_name)
Compute the Principal Component Analysis of a matrix
matrix_column_name | (str): | Name of the column storing the matrices whose principal components are to be computed |
v_matrix_column_name | (str): | Name of the column storing the V matrix |
Returns | (Frame): | returns the frame with new column storing the principal components for the corresponding matrix |
Calculate the Principal Components for each matrix in column 'matrix_column_name' using the V matrix
>>> from sparktk import dtypes
>>> data = [[1, [[1,2,3,5],[2,3,5,6],[4,6,7,3],[8,9,2,4]]]]
>>> schema = [('id', int),('pixeldata', dtypes.matrix)]
>>> my_frame = tc.frame.create(data, schema)
>>> my_frame.inspect()
[#] id pixeldata
============================
[0] 1 [[ 1. 2. 3. 5.]
[ 2. 3. 5. 6.]
[ 4. 6. 7. 3.]
[ 8. 9. 2. 4.]]
Compute the singular value decomposition for the matrices in 'pixeldata' column of the frame
>>> my_frame.matrix_svd('pixeldata')
Three new columns get added storing the U matrix, V matrix and Singular Vectors
>>> my_frame.inspect()
[#] id pixeldata
============================
[0] 1 [[ 1. 2. 3. 5.]
[ 2. 3. 5. 6.]
[ 4. 6. 7. 3.]
[ 8. 9. 2. 4.]]
<BLANKLINE>
[#] U_pixeldata
========================================================
[0] [[-0.29128979 -0.43716238 -0.44530839 0.72507913]
[-0.42474933 -0.55066945 -0.26749936 -0.66692972]
[-0.55099141 -0.16785045 0.79986267 0.16868433]
[-0.65661765 0.69099814 -0.30060644 -0.0317899 ]]
<BLANKLINE>
[#] V_pixeldata
========================================================
[0] [[-0.47195872 0.50289367 -0.05244699 -0.72222035]
[-0.60780067 0.40702574 0.11313693 0.67239008]
[-0.44835972 -0.58469285 0.65644993 -0.16180641]
[-0.45476024 -0.48945099 -0.74399115 0.01039344]]
<BLANKLINE>
[#] SingularVectors_pixeldata
============================================================
[0] [[ 18.21704938 6.59797925 3.54086993 0.26080987]]
Compute the principal components using the V matrices computed for matrices in 'pixeldata'
>>> my_frame.matrix_pca('pixeldata', 'V_pixeldata')
A new column gets added storing the Principal components matrix
>>> my_frame.inspect()
[#] id pixeldata
============================
[0] 1 [[ 1. 2. 3. 5.]
[ 2. 3. 5. 6.]
[ 4. 6. 7. 3.]
[ 8. 9. 2. 4.]]
<BLANKLINE>
[#] U_pixeldata
========================================================
[0] [[-0.29128979 -0.43716238 -0.44530839 0.72507913]
[-0.42474933 -0.55066945 -0.26749936 -0.66692972]
[-0.55099141 -0.16785045 0.79986267 0.16868433]
[-0.65661765 0.69099814 -0.30060644 -0.0317899 ]]
<BLANKLINE>
[#] V_pixeldata
========================================================
[0] [[-0.47195872 0.50289367 -0.05244699 -0.72222035]
[-0.60780067 0.40702574 0.11313693 0.67239008]
[-0.44835972 -0.58469285 0.65644993 -0.16180641]
[-0.45476024 -0.48945099 -0.74399115 0.01039344]]
<BLANKLINE>
[#] SingularVectors_pixeldata
============================================================
[0] [[ 18.21704938 6.59797925 3.54086993 0.26080987]]
<BLANKLINE>
[#] PrincipalComponents_pixeldata
========================================================
[0] [[-0.47195872 1.00578734 -0.15734098 -3.61110176]
[-1.21560134 1.22107722 0.56568466 4.0343405 ]
[-1.79343888 -3.50815713 4.59514953 -0.48541923]
[-3.63808191 -4.40505888 -1.4879823 0.04157377]]
def matrix_svd(
self, matrix_column_name)
Compute the Singular Value Decomposition of a matrix
matrix_column_name | (str): | Name of the column to compute the SVD on |
Returns | (Frame): | returns the frame with three new columns storing the U matrix, V matrix and Singular Vectors |
Calculate the Singular Value Decomposition for each matrix in column 'matrix_column_name'
>>> from sparktk import dtypes
>>> data = [[1, [[1,2,3,5],[2,3,5,6],[4,6,7,3],[8,9,2,4]]]]
>>> schema = [('id', int),('pixeldata', dtypes.matrix)]
>>> my_frame = tc.frame.create(data, schema)
>>> my_frame.inspect()
[#] id pixeldata
============================
[0] 1 [[ 1. 2. 3. 5.]
[ 2. 3. 5. 6.]
[ 4. 6. 7. 3.]
[ 8. 9. 2. 4.]]
Compute the singular value decomposition for the matrices in 'pixeldata' column of the frame
>>> my_frame.matrix_svd('pixeldata')
Three new columns get added storing the U matrix, V matrix and Singular Vectors
>>> my_frame.inspect()
[#] id pixeldata
============================
[0] 1 [[ 1. 2. 3. 5.]
[ 2. 3. 5. 6.]
[ 4. 6. 7. 3.]
[ 8. 9. 2. 4.]]
<BLANKLINE>
[#] U_pixeldata
========================================================
[0] [[-0.29128979 -0.43716238 -0.44530839 0.72507913]
[-0.42474933 -0.55066945 -0.26749936 -0.66692972]
[-0.55099141 -0.16785045 0.79986267 0.16868433]
[-0.65661765 0.69099814 -0.30060644 -0.0317899 ]]
<BLANKLINE>
[#] V_pixeldata
========================================================
[0] [[-0.47195872 0.50289367 -0.05244699 -0.72222035]
[-0.60780067 0.40702574 0.11313693 0.67239008]
[-0.44835972 -0.58469285 0.65644993 -0.16180641]
[-0.45476024 -0.48945099 -0.74399115 0.01039344]]
<BLANKLINE>
[#] SingularVectors_pixeldata
============================================================
[0] [[ 18.21704938 6.59797925 3.54086993 0.26080987]]
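The singular values can be cross-checked with numpy (note that the signs of individual columns of U and V are not unique, so those may differ between implementations):
>>> import numpy as np
>>> X = np.array([[1, 2, 3, 5], [2, 3, 5, 6], [4, 6, 7, 3], [8, 9, 2, 4]], dtype=float)
>>> U, s, Vt = np.linalg.svd(X)
>>> [round(x, 8) for x in s]
[18.21704938, 6.59797925, 3.54086993, 0.26080987]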
def multiclass_classification_metrics(
self, label_column, pred_column, beta=1.0, frequency_column=None)
Statistics of accuracy, precision, and others for a multi-class classification model.
Parameters:
label_column | (str): | The name of the column containing the correct label for each instance. |
pred_column | (str): | The name of the column containing the predicted label for each instance. |
beta | (Optional[float]): | The beta value to use for the :math:`F_{\beta}` measure (by default the F1 measure is computed); must be greater than zero. Default is 1. |
frequency_column | (Optional[str]): | The name of an optional column containing the frequency of observations. |
Returns | (ClassificationMetricsValue): | The data returned is composed of multiple components: <object>.accuracy (double), <object>.confusion_matrix (table), <object>.f_measure (double), <object>.precision (double), <object>.recall (double) |
Calculate the accuracy, precision, confusion_matrix, recall and :math:`F_{\beta}` measure for a classification model.
- The f_measure result is the :math:`F_{\beta}` measure for a classification model. The :math:`F_{\beta}` measure of a binary classification model is the harmonic mean of precision and recall. If we let :math:`\beta` denote the beta parameter, :math:`T_{P}` the number of true positives, :math:`F_{P}` the number of false positives, and :math:`F_{N}` the number of false negatives, then:
.. math::
    F_{\beta} = (1 + \beta^2) \cdot \frac{\frac{T_{P}}{T_{P} + F_{P}} \cdot \frac{T_{P}}{T_{P} + F_{N}}}{\beta^2 \cdot \frac{T_{P}}{T_{P} + F_{P}} + \frac{T_{P}}{T_{P} + F_{N}}}
The :math:`F_{\beta}` measure for a multi-class classification model is computed as the weighted average of the :math:`F_{\beta}` measure for each label, where the weight is the number of instances of each label. The determination of binary vs. multi-class is automatically inferred from the data.
- For multi-class classification models, the recall measure is computed as the weighted average of the recall for each label, where the weight is the number of instances of each label.
- For multi-class classification models, the precision measure is computed as the weighted average of the precision for each label, where the weight is the number of instances of each label.
- The accuracy of a classification model is the proportion of predictions that are correctly identified. If we let :math:`T_{P}` denote the number of true positives, :math:`T_{N}` the number of true negatives, and :math:`K` the total number of classified instances, then the model accuracy is given by :math:`\frac{T_{P} + T_{N}}{K}`.
- The confusion_matrix result is a confusion matrix for a classifier model, formatted for human readability.
Consider Frame my_frame, which contains the data
>>> my_frame.inspect()
[#] a b labels predictions
==================================
[0] red 1 0 0
[1] blue 3 1 0
[2] green 1 0 0
[3] green 0 1 1
[4] red 0 5 4
>>> cm = my_frame.multiclass_classification_metrics('labels', 'predictions')
[===Job Progress===]
>>> cm.f_measure
0.5866666666666667
>>> cm.recall
0.6
>>> cm.accuracy
0.6
>>> cm.precision
0.6666666666666666
>>> cm.confusion_matrix
Predicted_0 Predicted_1 Predicted_4
Actual_0 2 0 0
Actual_1 1 1 0
Actual_5 0 0 1
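As a concrete check, the weighted precision above can be reproduced in a few lines of plain Python (following the weighted-average definition in the bullets above; not part of the sparktk API):
>>> labels      = [0, 1, 0, 1, 5]
>>> predictions = [0, 0, 0, 1, 4]
>>> def precision_for(c):
...     rows = [l for l, p in zip(labels, predictions) if p == c]
...     return sum(1 for l in rows if l == c) / float(len(rows)) if rows else 0.0
>>> "%0.4f" % (sum(labels.count(c) * precision_for(c) for c in set(labels)) / float(len(labels)))
'0.6667'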
def power_iteration_clustering(
self, source_column, destination_column, similarity_column, k=2, max_iterations=100, initialization_mode='random')
Power Iteration Clustering finds a low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
source_column | (str): | Name of the column containing the source node |
destination_column | (str): | Name of the column containing the destination node |
similarity_column | (str): | Name of the column containing the similarity |
k | (Optional(int)): | Number of clusters to cluster the graph into. Default is 2 |
max_iterations | (Optional(int)): | Maximum number of iterations of the power iteration loop. Default is 100 |
initialization_mode | (Optional(str)): | Initialization mode of power iteration clustering. This can be either "random" to use a random vector as vertex properties, or "degree" to use normalized sum similarities. Default is "random". |
Returns | (namedtuple): | Returns a namedtuple containing the results frame (node and cluster), k (number of clusters), and cluster_sizes (a map of clusters and their respective sizes) |
>>> frame = tc.frame.create([[1,2,1.0],
... [1,3,0.3],
... [2,3,0.3],
... [3,0,0.03],
... [0,5,0.01],
... [5,4,0.3],
... [5,6,1.0],
... [4,6,0.3]],
... [('Source', int), ('Destination', int), ('Similarity',float)])
>>> frame.inspect()
[#] Source Destination Similarity
====================================
[0] 1 2 1.0
[1] 1 3 0.3
[2] 2 3 0.3
[3] 3 0 0.03
[4] 0 5 0.01
[5] 5 4 0.3
[6] 5 6 1.0
[7] 4 6 0.3
>>> x = frame.power_iteration_clustering('Source', 'Destination', 'Similarity', k=3)
>>> x.frame.inspect()
[#] id cluster
================
[0] 4 2
[1] 0 3
[2] 6 2
[3] 2 1
[4] 1 1
[5] 3 1
[6] 5 2
>>> x.k
3
>>> x.cluster_sizes
{u'2': 3, u'3': 1, u'1': 3}
def quantile_bin_column(
self, column_name, num_bins=None, bin_column_name=None)
Classify column into groups with the same frequency.
Group rows of data based on the value in a single column and add a label to identify grouping.
Equal depth binning attempts to label rows such that each bin contains the same number of elements. For :math:`n` bins of a column :math:`C` of length :math:`m`, the bin number is determined by:
.. math::
    \lceil n \cdot \frac{f(C)}{m} \rceil
where :math:`f` is a tie-adjusted ranking function over values of :math:`C`. If there are multiples of the same value in :math:`C`, then their tie-adjusted rank is the average of their ordered rank values.
- The num_bins parameter is considered to be the maximum permissible number of bins because the data may dictate fewer bins. For example, if the column to be binned has :math:`X` elements with only 2 distinct values and the num_bins parameter is greater than 2, then the actual number of bins will only be 2. This is due to a restriction that elements with an identical value must belong to the same bin.
column_name | (str): | The column whose values are to be binned. |
num_bins | (Optional[int]): | The maximum number of quantiles. Default is the square-root choice :math:`\lfloor \sqrt{m} \rfloor`, where :math:`m` is the number of rows. |
bin_column_name | (Optional[str]): | The name for the new column holding the grouping labels. |
Returns | (List[float]): | A list containing the edges of each bin |
Given a frame with column a accessed by a Frame object my_frame:
>>> my_frame.inspect( n=11 )
[##] a
========
[0] 1
[1] 1
[2] 2
[3] 3
[4] 5
[5] 8
[6] 13
[7] 21
[8] 34
[9] 55
[10] 89
Modify the frame, adding a column showing what bin the data is in. The data should be grouped into a maximum of five bins. Note that each bin will have the same quantity of members (as much as possible):
>>> cutoffs = my_frame.quantile_bin_column('a', 5, 'aEDBinned')
[===Job Progress===]
>>> my_frame.inspect( n=11 )
[##] a aEDBinned
===================
[0] 1 0
[1] 1 0
[2] 2 1
[3] 3 1
[4] 5 2
[5] 8 2
[6] 13 3
[7] 21 3
[8] 34 4
[9] 55 4
[10] 89 4
>>> print cutoffs
[1.0, 2.0, 5.0, 13.0, 34.0, 89.0]
def quantiles(
self, column_name, quantiles)
Returns a new frame with Quantiles and their values.
column_name | (str): | The column to calculate quantiles on |
quantiles | (List[float]): | The quantiles being requested |
Returns | (Frame): | A new frame with two columns (float64): requested Quantiles and their respective values. |
Calculates quantiles on the given column.
Consider Frame my_frame, which accesses a frame that contains a single column final_sale_price:
>>> my_frame.inspect()
[#] final_sale_price
=====================
[0] 100
[1] 250
[2] 95
[3] 179
[4] 315
[5] 660
[6] 540
[7] 420
[8] 250
[9] 335
To calculate 10th, 50th, and 100th quantile:
>>> quantiles_frame = my_frame.quantiles('final_sale_price', [10, 50, 100])
[===Job Progress===]
A new Frame containing the requested Quantiles and their respective values will be returned:
>>> quantiles_frame.inspect()
[#] Quantiles  final_sale_price_QuantileValue
==============================================
[0]       10.0                            95.0
[1]       50.0                           250.0
[2]      100.0                           660.0
def rename_columns(
self, names)
Rename columns
names | (dict): | Dictionary of old names to new names. |
Start with a frame with columns Black and White.
>>> print my_frame.schema
[('Black', <type 'unicode'>), ('White', <type 'unicode'>)]
Rename the columns to Mercury and Venus:
>>> my_frame.rename_columns({"Black": "Mercury", "White": "Venus"})
>>> print my_frame.schema
[(u'Mercury', <type 'unicode'>), (u'Venus', <type 'unicode'>)]
def reverse_box_cox(
self, column_name, lambda_value=0.0, reverse_box_cox_column_name=None)
Calculate the reverse box-cox transformation for each row on a given column_name of the current frame
column_name | (str): | Name of the column to perform the reverse transformation on |
lambda_value | (float): | Lambda power parameter. Default is 0.0 |
reverse_box_cox_column_name | (Optional[str]): | Optional column name for the reverse box-cox value |
Returns | (Frame): | returns a frame with a new column storing the reverse box-cox transformed value |
Calculate the reverse box-cox transformation for each row in column 'column_name' of a frame using the lambda_value.
Reverse Box-Cox transformation is computed by the following formula:
reverse_box_cox = exp(boxcox),                        if lambda = 0
reverse_box_cox = (lambda * boxcox + 1)^(1/lambda),   otherwise
>>> data = [[7.7132064326674596, 2.81913279907],[0.207519493594015, -1.25365381375],[6.336482349262754, 2.46673638752], [7.4880388253861181, 2.76469126003],[4.9850701230259045, 2.06401101556]]
>>> schema = [("input", float), ("input_lambda_0.3", float)]
>>> my_frame = tc.frame.create(data, schema)
>>> my_frame.inspect()
[#] input input_lambda_0.3
=====================================
[0] 7.71320643267 2.81913279907
[1] 0.207519493594 -1.25365381375
[2] 6.33648234926 2.46673638752
[3] 7.48803882539 2.76469126003
[4] 4.98507012303 2.06401101556
Compute the reverse box-cox transformation on the 'input_lambda_0.3' column, which stores the box-cox transformed value of column 'input' with lambda 0.3:
>>> my_frame.reverse_box_cox('input_lambda_0.3',0.3)
A new column gets added to the frame which stores the reverse box-cox transformation for each row. This value is equal to the original values in the 'input' column.
>>> my_frame.inspect()
[#] input input_lambda_0.3 input_lambda_0.3_reverse_lambda_0.3
==========================================================================
[0] 7.71320643267 2.81913279907 7.71320643267
[1] 0.207519493594 -1.25365381375 0.207519493594
[2] 6.33648234926 2.46673638752 6.33648234926
[3] 7.48803882539 2.76469126003 7.4880388254
[4] 4.98507012303 2.06401101556 4.98507012301
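A quick plain-Python check of the formula for the first row (lambda = 0.3, boxcox value taken from the 'input_lambda_0.3' column):
>>> lam, boxcox = 0.3, 2.81913279907
>>> round((lam * boxcox + 1) ** (1.0 / lam), 8)
7.71320643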
def save(
self, path)
Persists the frame to the given file path
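A minimal usage sketch (the path is an assumption; a saved frame is typically restored through the TkContext's load method):
>>> frame = tc.frame.create([[1, 2, 3], [4, 5, 6]])
>>> frame.save("sandbox/saved_frame")
>>> restored = tc.load("sandbox/saved_frame")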
def sort(
self, columns, ascending=True)
Sort by one or more columns.
columns | (str or List[str]): | Either a column name, list of column names, or list of tuples where each tuple is a name and an ascending bool value. |
ascending | (Optional[bool]): | True for ascending (default), or False for descending. |
Sort a frame by column values either ascending or descending.
Consider the frame:
>>> frame.inspect()
[#] col1 col2
==================
[0] 3 foxtrot
[1] 1 charlie
[2] 3 bravo
[3] 2 echo
[4] 4 delta
[5] 3 alpha
Sort a single column:
>>> frame.sort('col1')
[===Job Progress===]
>>> frame.inspect()
[#] col1 col2
==================
[0] 1 charlie
[1] 2 echo
[2] 3 foxtrot
[3] 3 bravo
[4] 3 alpha
[5] 4 delta
Sort a single column descending:
>>> frame.sort('col2', False)
[===Job Progress===]
>>> frame.inspect()
[#] col1 col2
==================
[0] 3 foxtrot
[1] 2 echo
[2] 4 delta
[3] 1 charlie
[4] 3 bravo
[5] 3 alpha
Sort multiple columns:
>>> frame.sort(['col1', 'col2'])
[===Job Progress===]
>>> frame.inspect()
[#] col1 col2
==================
[0] 1 charlie
[1] 2 echo
[2] 3 alpha
[3] 3 bravo
[4] 3 foxtrot
[5] 4 delta
Sort multiple columns descending:
>>> frame.sort(['col1', 'col2'], False)
[===Job Progress===]
>>> frame.inspect()
[#] col1 col2
==================
[0] 4 delta
[1] 3 foxtrot
[2] 3 bravo
[3] 3 alpha
[4] 2 echo
[5] 1 charlie
Sort multiple columns: 'col1' decending and 'col2' ascending:
>>> frame.sort([ ('col1', False), ('col2', True) ])
[===Job Progress===]
>>> frame.inspect()
[#] col1 col2
==================
[0] 4 delta
[1] 3 alpha
[2] 3 bravo
[3] 3 foxtrot
[4] 2 echo
[5] 1 charlie
def sorted_k(
self, k, column_names_and_ascending, reduce_tree_depth=2)
Get a sorted subset of the data.
k | (int): | Number of sorted records to return. |
column_names_and_ascending | (List[tuple(str, bool)]): | Column names to sort by, and true to sort column by ascending order, or false for descending order. |
reduce_tree_depth | (int): | Advanced tuning parameter that determines the depth of the reduce-tree (uses Spark's treeReduce() for scalability). Default is 2. |
Returns | (Frame): | A new frame with a subset of sorted rows from the original frame. |
Take a number of rows and return them sorted in either ascending or descending order.
Sorting a subset of rows is more efficient than sorting the entire frame when the number of sorted rows is much smaller than the total number of rows in the frame.
In particular:
- The number of sorted rows returned should fit in Spark driver memory. The maximum size of serialized results that can fit in the Spark driver is set by the Spark configuration parameter spark.driver.maxResultSize.
- If you encounter a Kryo buffer overflow exception, increase the Spark configuration parameter spark.kryoserializer.buffer.max.mb (see the configuration sketch after this list).
- Use Frame.sort() instead if the number of sorted rows is very large (in other words, it cannot fit in Spark driver memory).
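For reference, a sketch of raising both limits through plain pyspark configuration; the values shown are placeholders, and how the configured context is handed to spark-tk depends on your deployment:
>>> from pyspark import SparkConf
>>> conf = (SparkConf()
...         .set("spark.driver.maxResultSize", "2g")             # placeholder value
...         .set("spark.kryoserializer.buffer.max.mb", "512"))   # placeholder value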
These examples deal with the most recently-released movies in a private collection. Consider the movie collection already stored in the frame below:
>>> my_frame.inspect()
[#] genre year title
========================================================
[0] Drama 1957 12 Angry Men
[1] Crime 1946 The Big Sleep
[2] Western 1969 Butch Cassidy and the Sundance Kid
[3] Drama 1971 A Clockwork Orange
[4] Drama 2008 The Dark Knight
[5] Animation 2013 Frozen
[6] Drama 1972 The Godfather
[7] Animation 1994 The Lion King
[8] Animation 2010 Tangled
[9] Fantasy 1939 The Wonderful Wizard of Oz
This example returns the top 3 rows sorted by a single column: 'year' descending:
>>> topk_frame = my_frame.sorted_k(3, [ ('year', False) ])
[===Job Progress===]
>>> topk_frame.inspect()
[#] genre year title
=====================================
[0] Animation 2013 Frozen
[1] Animation 2010 Tangled
[2] Drama 2008 The Dark Knight
This example returns the top 5 rows sorted by multiple columns: 'genre' ascending, then 'year' descending:
>>> topk_frame = my_frame.sorted_k(5, [ ('genre', True), ('year', False) ])
[===Job Progress===]
>>> topk_frame.inspect()
[#] genre year title
=====================================
[0] Animation 2013 Frozen
[1] Animation 2010 Tangled
[2] Animation 1994 The Lion King
[3] Crime 1946 The Big Sleep
[4] Drama 2008 The Dark Knight
This example returns the top 5 rows sorted by multiple columns: 'genre' ascending, then 'year' ascending. It also illustrates the optional tuning parameter for reduce-tree depth (which does not affect the final result).
>>> topk_frame = my_frame.sorted_k(5, [ ('genre', True), ('year', True) ], reduce_tree_depth=1)
[===Job Progress===]
>>> topk_frame.inspect()
[#] genre year title
===================================
[0] Animation 1994 The Lion King
[1] Animation 2010 Tangled
[2] Animation 2013 Frozen
[3] Crime 1946 The Big Sleep
[4] Drama 1957 12 Angry Men
def take(
self, n, offset=0, columns=None)
Get data subset.
Take a subset of the currently active Frame.
(See 'collect' operation to simply get all the data from the Frame)
n | (int): | The number of rows to get from the frame (warning: do not overwhelm the python session by taking too much) |
offset | (Optional[int]): | The number of rows to skip before starting to copy. |
columns | (Optional[str or list[str]]): | If not None, only the given columns' data will be provided. By default, all columns are included. |
Returns | (list[list[data]]): | raw frame data |
Consider the following frame:
>>> frame.inspect()
[#]  name      age  tenure  phone
====================================
[0]  Fred       39      16  555-1234
[1]  Susan      33       3  555-0202
[2]  Thurston   65      26  555-4510
[3]  Judy       44      14  555-2183
Use take to get the first two rows and look at the schema and data in the result:
>>> frame.take(2)
[['Fred', 39, 16, '555-1234'], ['Susan', 33, 3, '555-0202']]
Limit the columns in our result to just the name and age column:
>>> frame.take(2, columns=['name', 'age'])
[['Fred', 39], ['Susan', 33]]
def tally(
self, sample_col, count_val)
Count number of times a value is seen.
sample_col | (str): | The name of the column from which to compute the cumulative count. |
count_val | (str): | The column value to be used for the counts. |
A cumulative count is computed by sequentially stepping through the rows, observing the column values and keeping track of the number of times the specified count_val has been seen.
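Conceptually, this is a simple running count; a minimal plain-Python sketch (not the distributed implementation; tally_percent below is the same running count divided by the final total):
>>> def tally(values, count_val):
...     result, seen = [], 0
...     for v in values:
...         if v == count_val:
...             seen += 1
...         result.append(float(seen))   # running count so far
...     return result
>>> tally(["0", "1", "2", "0", "1", "2"], "1")
[0.0, 1.0, 1.0, 1.0, 2.0, 2.0]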
Consider Frame my_frame, which accesses a frame that contains a single column named obs:
>>> my_frame.inspect()
[#] obs
========
[0] 0
[1] 1
[2] 2
[3] 0
[4] 1
[5] 2
The cumulative count for column obs is obtained by:
>>> my_frame.tally("obs", "1")
[===Job Progress===]
The Frame my_frame accesses the original frame, which now contains two columns: obs, which contains the original column values, and obs_tally, which contains the cumulative count:
>>> my_frame.inspect()
[#] obs obs_tally
===================
[0] 0 0.0
[1] 1 1.0
[2] 2 1.0
[3] 0 1.0
[4] 1 2.0
[5] 2 2.0
def tally_percent(
self, sample_col, count_val)
Compute a cumulative percent count.
sample_col | (str): | The name of the column from which to compute the cumulative percent count. |
count_val | (str): | The column value to be used for the counts. |
A cumulative percent count is computed by sequentially stepping through the rows, observing the column values and keeping track of the percentage of the total number of times the specified count_val has been seen up to the current row.
Consider Frame my_frame, which accesses a frame that contains a single column named obs:
>>> my_frame.inspect()
[#] obs
========
[0] 0
[1] 1
[2] 2
[3] 0
[4] 1
[5] 2
The cumulative percent count for column obs is obtained by:
>>> my_frame.tally_percent("obs", "1")
[===Job Progress===]
The Frame my_frame accesses the original frame, which now contains two columns: obs, which contains the original column values, and obs_tally_percent, which contains the cumulative percent count:
>>> my_frame.inspect()
[#] obs obs_tally_percent
===========================
[0] 0 0.0
[1] 1 0.5
[2] 2 0.5
[3] 0 0.5
[4] 1 1.0
[5] 2 1.0
def timeseries_augmented_dickey_fuller_test(
self, ts_column, max_lag, regression='c')
Performs the Augmented Dickey-Fuller (ADF) Test, which tests the null hypothesis that a unit root is present in a time series sample. The test statistic returned is a negative number; the lower the value, the stronger the rejection of the hypothesis that there is a unit root at some level of confidence.
ts_column | (str): | Name of the column that contains the time series values to use with the ADF test. |
max_lag | (int): | The lag order to calculate the test statistic. |
regression | (Optional(str)): | The method of regression that was used. Following MacKinnon's notation, this can be "c" for constant, "nc" for no constant, "ct" for constant and trend, and "ctt" for constant, trend, and trend-squared. |
Returns | (AugmentedDickeyFullerTestResult): | Object contains the ADF test statistic and p-value. |
Consider the following frame of time series values:
>>> frame.inspect()
[#] timeseries_values
======================
[0] 3.201
[1] 3.3178
[2] 3.6279
[3] 3.5902
[4] 3.43
[5] 4.0546
[6] 3.7606
[7] 3.1231
[8] 3.2077
[9] 4.3383
Calculate the augmented Dickey-Fuller test statistic by specifying the name of the column that has the time series values and the max_lag. The function returns an object that has properties for the p-value and test statistic.
>>> frame.timeseries_augmented_dickey_fuller_test("timeseries_values", 0)
p_value = 0.0
test_stat = -9.93422373369
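As a local cross-check (this assumes statsmodels is installed; it is a separate library, and its lag handling can differ slightly, so small numerical differences are possible):
>>> from statsmodels.tsa.stattools import adfuller
>>> values = [3.201, 3.3178, 3.6279, 3.5902, 3.43, 4.0546, 3.7606, 3.1231, 3.2077, 4.3383]
>>> test_stat, p_value = adfuller(values, maxlag=0, regression='c', autolag=None)[:2]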
def timeseries_breusch_godfrey_test(
self, residuals, factors, max_lag)
Calculates the Breusch-Godfrey test statistic for serial correlation.
residuals | (str): | Name of the column that contains residual (y) values |
factors | (List[str]): | Name of the column(s) that contain factors (x) values |
max_lag | (int): | The lag order to calculate the test statistic. |
Returns | (BreuschGodfreyTestResult): | Object contains the Breusch-Godfrey test statistic and p-value. |
Consider the following frame that uses a snippet of air quality and sensor data from:
https://archive.ics.uci.edu/ml/datasets/Air+Quality.
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
>>> frame.inspect()
[#] Date Time CO_GT PT08_S1_CO NMHC_GT C6H6_GT Temp
====================================================================
[0] 10/03/2004 18.00.00 2.6 1360 150 11.9 13.6
[1] 10/03/2004 19.00.00 2.0 1292 112 9.4 13.3
[2] 10/03/2004 20.00.00 2.2 1402 88 9.0 11.9
[3] 10/03/2004 21.00.00 2.2 1376 80 9.2 11.0
[4] 10/03/2004 22.00.00 1.6 1272 51 6.5 11.2
[5] 10/03/2004 23.00.00 1.2 1197 38 4.7 11.2
[6] 11/03/2004 00.00.00 1.2 1185 31 3.6 11.3
[7] 11/03/2004 01.00.00 1.0 1136 31 3.3 10.7
[8] 11/03/2004 02.00.00 0.9 1094 24 2.3 10.7
[9] 11/03/2004 03.00.00 0.6 1010 19 1.7 10.3
Calculate the Breusch-Godfrey test result:
>>> y_column = "Temp"
>>> x_columns = ['CO_GT', 'PT08_S1_CO', 'NMHC_GT', 'C6H6_GT']
>>> max_lag = 1
>>> result = frame.timeseries_breusch_godfrey_test(y_column, x_columns, max_lag)
>>> result
p_value = 0.00353847462468
test_stat = 8.50666768455
def timeseries_breusch_pagan_test(
self, residuals, factors)
Performs the Breusch-Pagan test for heteroskedasticity.
residuals | (str): | Name of the column that contains residual (y) values |
factors | (List[str]): | Name of the column(s) that contain factors (x) values |
Returns | (BreuschPaganTestResult): | Object contains the Breusch-Pagan test statistic and p-value. |
Consider the following frame:
>>> frame.inspect()
[#] AT V AP RH PE
=========================================
[0] 8.34 40.77 1010.84 90.01 480.48
[1] 23.64 58.49 1011.4 74.2 445.75
[2] 29.74 56.9 1007.15 41.91 438.76
[3] 19.07 49.69 1007.22 76.79 453.09
[4] 11.8 40.66 1017.13 97.2 464.43
[5] 13.97 39.16 1016.05 84.6 470.96
[6] 22.1 71.29 1008.2 75.38 442.35
[7] 14.47 41.76 1021.98 78.41 464
[8] 31.25 69.51 1010.25 36.83 428.77
[9] 6.77 38.18 1017.8 81.13 484.3
Calculate the Breusch-Pagan test statistic, where the "AT" column contains residual values and the other columns are factors:
>>> result = frame.timeseries_breusch_pagan_test("AT", ["V", "AP", "RH", "PE"])
[===Job Progress===]
The result contains the test statistic and p-value:
>>> result
p_value = 0.000147089380721
test_stat = 22.6741588802
def timeseries_durbin_watson_test(
self, residuals)
Computes the Durbin-Watson test statistic used to determine the presence of serial correlation in the residuals. Serial correlation can show a relationship between values separated from each other by a given time lag. A value close to 0.0 gives evidence for positive serial correlation, a value close to 4.0 gives evidence for negative serial correlation, and a value close to 2.0 gives evidence for no serial correlation.
residuals | (str): | Name of the column that contains residual values |
Returns: | The Durbin-Watson test statistic. |
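The statistic itself is d = sum((e_t - e_{t-1})^2) / sum(e_t^2) over the residuals e. A minimal local sketch of that formula (assuming numpy is available):
>>> import numpy as np
>>> def durbin_watson(residuals):
...     e = np.asarray(residuals, dtype=float)
...     return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # near 2 means no serial correlation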
In this example, we have a frame that contains time series values. The inspect command below shows a snippet of what the data looks like:
>>> frame.inspect()
[#] timeseries_values
======================
[0] 3.201
[1] 3.3178
[2] 3.6279
[3] 3.5902
[4] 3.43
[5] 4.0546
[6] 3.7606
[7] 3.1231
[8] 3.2077
[9] 4.3383
Calculate the Durbin-Watson test statistic by specifying the name of the column that has the time series values:
>>> frame.timeseries_durbin_watson_test("timeseries_values")
0.02678674777710402
def timeseries_from_observations(
self, date_time_index, timestamp_column, key_column, value_column)
Returns a frame that has the observations formatted as a time series.
date_time_index: | List of date/time strings. DateTimeIndex to conform all series to. |
timestamp_column: | The name of the column telling when the observation occurred. |
key_column: | The name of the column that contains which string key the observation belongs to. |
value_column: | The name of the column that contains the observed value. |
Returns: | Frame formatted as a time series (with a column for key and a column for the vector of values). |
Uses the specified timestamp, key, and value columns and the date/time index provided to format the observations as a time series. The time series frame will have columns for the key and a vector of the observed values that correspond to the date/time index.
In this example, we will use a frame of observations of resting heart rate for three individuals over three days. The data is accessed from a Frame object called my_frame:
>>> my_frame.inspect(my_frame.count())
[#] name date resting_heart_rate
======================================================
[0] Edward 2016-01-01T12:00:00Z 62
[1] Stanley 2016-01-01T12:00:00Z 57
[2] Edward 2016-01-02T12:00:00Z 63
[3] Sarah 2016-01-02T12:00:00Z 64
[4] Stanley 2016-01-02T12:00:00Z 57
[5] Edward 2016-01-03T12:00:00Z 62
[6] Sarah 2016-01-03T12:00:00Z 64
[7] Stanley 2016-01-03T12:00:00Z 56
We then need to create an array that contains the date/time index, which will be used when creating the time series. Since our data is for three days, our date/time index will just contain those three dates:
>>> datetimeindex = ["2016-01-01T12:00:00.000Z","2016-01-02T12:00:00.000Z","2016-01-03T12:00:00.000Z"]
Then we can create our time series frame by specifying our date/time index along with the name of our timestamp column (in this example, it's "date"), key column (in this example, it's "name"), and value column (in this example, it's "resting_heart_rate").
>>> ts = my_frame.timeseries_from_observations(datetimeindex, "date", "name", "resting_heart_rate")
[===Job Progress===]
Take a look at the resulting time series frame schema and contents. Note that Sarah has no observation for the first date in the index, so that slot in her vector is missing (None):
>>> ts.schema
[(u'name', <type 'unicode'>), (u'resting_heart_rate', vector(3))]
>>> ts.inspect()
[#] name resting_heart_rate
================================
[0] Stanley [57.0, 57.0, 56.0]
[1] Edward [62.0, 63.0, 62.0]
[2] Sarah [None, 64.0, 64.0]
def timeseries_slice(
self, date_time_index, start, end)
Returns a frame split on the specified start and end date/times.
Splits a time series frame on the specified start and end date/times.
date_time_index: | List of date/time strings. DateTimeIndex to conform all series to. |
start: | The start date for the slice in the ISO 8601 format, like: yyyy-MM-dd'T'HH:mm:ss.SSSZ |
end: | The end date for the slice in the ISO 8601 format, like: yyyy-MM-dd'T'HH:mm:ss.SSSZ |
Returns: | Frame that contains a sub-slice of the current frame, based on the specified start/end date/times. |
For this example, we start with a frame that has already been formatted as a time series. This means that the frame has a string column for key and a vector column that contains a series of the observed values. We must also know the date/time index that corresponds to the time series.
The time series is in a Frame object called ts_frame.
>>> ts_frame.inspect()
[#] key series
==================================
[0] A [62, 55, 60, 61, 60, 59]
[1] B [60, 58, 61, 62, 60, 61]
[2] C [69, 68, 68, 70, 71, 69]
Next, we define the date/time index. In this example, it is one day intervals from 2016-01-01 to 2016-01-06:
>>> datetimeindex = ["2016-01-01T12:00:00.000Z","2016-01-02T12:00:00.000Z","2016-01-03T12:00:00.000Z","2016-01-04T12:00:00.000Z","2016-01-05T12:00:00.000Z","2016-01-06T12:00:00.000Z"]
Get a slice of our time series from 2016-01-02 to 2016-01-04:
>>> slice_start = "2016-01-02T12:00:00.000Z"
>>> slice_end = "2016-01-04T12:00:00.000Z"
>>> sliced_frame = ts_frame.timeseries_slice(datetimeindex, slice_start, slice_end)
[===Job Progress===]
Take a look at our sliced time series:
>>> sliced_frame.inspect()
[#] key series
============================
[0] A [55.0, 60.0, 61.0]
[1] B [58.0, 61.0, 62.0]
[2] C [68.0, 68.0, 70.0]
def to_pandas(
self, n=None, offset=0, columns=None)
Brings data into a local pandas dataframe.
Similar to the 'take' function, but puts the data into a pandas dataframe.
n | (Optional(int)): | The number of rows to get from the frame (warning: do not overwhelm the python session by taking too much) |
offset | (Optional(int)): | The number of rows to skip before copying. Defaults to 0. |
columns | (Optional(List[str])): | Column filter. The list of names to be included. Default is all columns. |
Returns | (pandas.DataFrame): | A new pandas dataframe object containing the taken frame data. |
Consider the following spark-tk frame, where we have columns for name and phone number:
>>> frame.inspect()
[#] name phone
=======================
[0] Fred 555-1234
[1] Susan 555-0202
[2] Thurston 555-4510
[3] Judy 555-2183
>>> frame.schema
[('name', <type 'str'>), ('phone', <type 'str'>)]
The frame to_pandas() method is used to get a pandas DataFrame that contains the data from the spark-tk frame. Note that since no parameters are provided when to_pandas() is called, the default values are used for the number of rows, the row offset, and the columns.
>>> pandas_frame = frame.to_pandas()
>>> pandas_frame
name phone
0 Fred 555-1234
1 Susan 555-0202
2 Thurston 555-4510
3 Judy 555-2183
def top_k(
self, column_name, k, weight_column=None)
Most or least frequent column values.
column_name | (str): | The column whose top (or bottom) K distinct values are to be calculated. |
k | (int): | Number of entries to return (If k is negative, return bottom k). |
weight_column | (Optional[str]): | The column that provides weights (frequencies) for the topK calculation. Must contain numerical data. Default is 1 for all items. |
Calculate the top (or bottom) K distinct values by count of a column. The column can be weighted. All data elements of weight <= 0 are excluded from the calculation, as are all data elements whose weight is NaN or infinite. If there are no data elements of finite weight > 0, then topK is empty.
For this example, we calculate the top 2 counties in the following frame:
>>> frame.inspect(frame.count())
[##] rank city population_2013 population_2010 change county
=============================================================================
[0] 1 Portland 609456 583776 4.40% Multnomah
[1] 2 Salem 160614 154637 3.87% Marion
[2] 3 Eugene 159190 156185 1.92% Lane
[3] 4 Gresham 109397 105594 3.60% Multnomah
[4] 5 Hillsboro 97368 91611 6.28% Washington
[5] 6 Beaverton 93542 89803 4.16% Washington
[6] 15 Grants Pass 35076 34533 1.57% Josephine
[7] 16 Oregon City 34622 31859 8.67% Clackamas
[8] 17 McMinnville 33131 32187 2.93% Yamhill
[9] 18 Redmond 27427 26215 4.62% Deschutes
[10] 19 Tualatin 26879 26054 4.17% Washington
[11] 20 West Linn 25992 25109 3.52% Clackamas
[12] 7 Bend 81236 76639 6.00% Deschutes
[13] 8 Medford 77677 74907 3.70% Jackson
[14] 9 Springfield 60177 59403 1.30% Lane
[15] 10 Corvallis 55298 54462 1.54% Benton
[16] 11 Albany 51583 50158 2.84% Linn
[17] 12 Tigard 50444 48035 5.02% Washington
[18] 13 Lake Oswego 37610 36619 2.71% Clackamas
[19] 14 Keizer 37064 36478 1.61% Marion
>>> top_frame = frame.top_k("county", 2)
[===Job Progress===]
>>> top_frame.inspect()
[#] county count
======================
[0] Washington 4.0
[1] Clackamas 3.0
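For the unweighted case, the computation is just a frequency count; a minimal local sketch in plain Python, where counties is a hypothetical list holding the values of the 'county' column:
>>> from collections import Counter
>>> Counter(counties).most_common(2)   # for bottom k, take the least common instead
[('Washington', 4), ('Clackamas', 3)]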
def unflatten_columns(
self, columns, delimiter=',')
Compacts data from multiple rows based on cell data.
columns | (str or List[str]): | Name of the column(s) to be used as keys for unflattening. |
delimiter | (Optional[str]): | Separator for the data in the result columns. Default is comma (,). |
Groups together cells in all columns (except the composite key columns), separating the values with the specified delimiter (comma by default). The original rows are deleted. The grouping takes place based on a composite key created from cell values. The column data types are changed to string.
Given a data file:
user1 1/1/2015 1 70
user1 1/1/2015 2 60
user2 1/1/2015 1 65
After bringing the data into a frame, it can be inspected:
>>> frame.inspect()
[#] a b c d
===========================
[0] user1 1/1/2015 1 70
[1] user1 1/1/2015 2 60
[2] user2 1/1/2015 1 65
Unflatten the data using columns a & b:
>>> frame.unflatten_columns(['a','b'])
[===Job Progress===]
Check again:
>>> frame.inspect()
[#] a b c d
================================
[0] user1 1/1/2015 1,2 70,60
[1] user2 1/1/2015 1 65
Alternatively, starting again from the original frame, unflatten_columns() also accepts a single column name:
>>> frame.unflatten_columns('a')
[===Job Progress===]
>>> frame.inspect()
[#] a b c d
=========================================
[0] user1 1/1/2015,1/1/2015 1,2 70,60
[1] user2 1/1/2015 1 65
def validate_pyrdd_schema(
self, pyrdd, schema)