
data8

My notes for Data8 - Fall 2019

Data 8 Cheat Sheet

don't sue us pls

This is a really condensed version of the docs that will keep updating throughout the semester to highlight the most crucial functions at any given time. It’s a happy medium between the Python reference and the textbook.

PSA: Something missing/unclear? Contribute to the cheat sheet!!! Edit the text directly or Create an issue!

Stats Notes

CLICK HERE!

Hot Functions

Stuff we just learned and is probably quite confuzzling at the moment.

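Minimize: minimize(f) takes in a function f with any number of numeric arguments and returns the argument value(s) that make the output of f as small as possible. With more than one argument, the result is an array with one value per argument.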

Linear Regression:

def standard_units(arr):
    """Converts an array of data points into standard units"""
    return (arr - np.mean(arr)) / np.std(arr)

def correlation(t, x, y):
    """Computes the correlation coefficient r given data for x and y in table t"""
    t_x = t.column(x)
    t_y = t.column(y)
    return np.mean(standard_units(t_x) * standard_units(t_y))

def slope(t, x, y):
    """Uses the formula r * (Sy / Sx) to calculate the slope of the regression line"""
    r = correlation(t, x, y)
    x_sd = np.std(t.column(x))
    y_sd = np.std(t.column(y))
    return r * y_sd / x_sd

def intercept(t, x, y):
    """Returns the intercept of the line using the formula y = mx + b, substituting ymean = m(xmean) + b as a point on the line"""
    return np.mean(t.column(y)) - slope(t, x, y)*np.mean(t.column(x))

def fitted_data(t, x, y):
    """Returns a table with a new column: the predictions of y for every value in the x column in that table."""
    predictions = slope(t, x, y) * t.column(x) + intercept(t, x, y)
    return t.with_column('Prediction', predictions)

def root_mean_square_error(t, y, predict):
    """Given a table with y and predictions, returns the RMS using the formula sqrt(mean((y - prediction)**2))"""
    return (np.mean((t.column(y) - t.column(predict))**2)) ** 0.5

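def minimize_rmse(t, x, y):
    """Returns the slope and intercept of the line that minimizes the RMSE when predicting y from x in table t"""
    def rmse(slope, intercept):
        return np.mean((t.column(y) - (slope * t.column(x) + intercept)) ** 2) ** 0.5
    return minimize(rmse)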
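
A quick sanity check of the regression formulas above, as a minimal sketch. It uses plain NumPy arrays instead of a Table, and the sample numbers and variable names are made up for illustration. The slope and intercept from r * (Sy / Sx) should agree with NumPy's own least-squares fit, np.polyfit.

```python
import numpy as np

def standard_units(arr):
    """Converts an array of data points into standard units"""
    return (arr - np.mean(arr)) / np.std(arr)

# Made-up sample data, not from the course
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

r = np.mean(standard_units(x) * standard_units(y))
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

# Predictions from the regression line and their root mean square error
predictions = slope * x + intercept
rmse = np.mean((y - predictions) ** 2) ** 0.5

# Least-squares line computed by NumPy, for comparison
np_slope, np_intercept = np.polyfit(x, y, 1)
print(slope, intercept, rmse)
```

If the two pairs of numbers disagree, the usual culprit is mixing up which column is x and which is y.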

Correlation Coefficient:

# x_st and y_st are arrays holding the values of the two variables x and y in standard units
t = t.with_columns('x_st', x_st, 'y_st', y_st)

#create a column of products
products = x_st * y_st
t = t.with_column('products', products)

#take mean of the products to get the correlation coefficient
r = np.mean(products)

Rows:

# For k nearest neighbors, drop the class column first, then iterate over the remaining rows.
# row_distance is a function that takes two row objects and computes the distance between them.
# example is the row being classified.
attributes_only = table.drop('Class')
distances = make_array()
for row in attributes_only.rows:
    distances = np.append(distances, row_distance(row, example))

# to read a row as an array, use np.array(t.row(i))
row_as_array = np.array(t.row(i))

# t.rows is an attribute that returns all the rows of a table. Rows are row objects.
for row in t.rows:
    pass  # do anything to the row here
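
One way row_distance might look, as a sketch: the Euclidean distance between two rows of numeric attributes. Plain lists stand in for row objects here, and the sample points are invented.

```python
import numpy as np

def row_distance(row1, row2):
    """Euclidean distance between two rows of numeric attributes"""
    return np.sqrt(np.sum((np.array(row1) - np.array(row2)) ** 2))

# Invented data: each inner list plays the role of one row of attributes
attribute_rows = [[0, 0], [3, 4], [6, 8]]
example = [0, 0]

distances = np.array([])
for row in attribute_rows:
    distances = np.append(distances, row_distance(row, example))

nearest_index = np.argmin(distances)  # index of the closest row
```

Sorting the distances and taking the first k gives the k nearest neighbors.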

**Standardized Units (aka Z score):** 

`def standard_units(x): return (x - np.mean(x))/np.std(x)`
    - Converts values in an array to their standard unit values and returns an array
    - Example: `standard_units(make_array(1, 2, 3, 4, 5))` returns `array([-1.41421356, -0.70710678,  0. , 0.70710678, 1.41421356])`

**Standard Deviation:** 

`np.std(array)` calculates the (population) standard deviation of a given array.
    - Example: `np.std(make_array(1, 2, 3, 4, 5))` returns `1.4142135623730951`
        - Explanation: mean is 3
        - Differences are [-2, -1, 0, 1, 2]
        - Squares of differences are [4, 1, 0, 1, 4]
        - Sum of squares of differences is 10
        - Variance: 10 / (num points) = 10 / 5 = 2
        - Standard deviation: sqrt(2) = 1.41421356...
    - If you want **sample standard deviation** use `np.std(array, ddof=1)`, which divides by (num points - 1) instead
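
The steps above, written out by hand as a small sketch so the two versions can be compared side by side (variable names are just for illustration):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

deviations = values - np.mean(values)   # [-2, -1, 0, 1, 2]
variance = np.mean(deviations ** 2)     # 10 / 5
population_sd = variance ** 0.5         # same as np.std(values)

# Sample version: divide by (n - 1) instead of n, same as np.std(values, ddof=1)
sample_sd = (np.sum(deviations ** 2) / (len(values) - 1)) ** 0.5
```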

**Creating prefilled arrays:** 

`np.full(length, value)` creates an array of length `length` filled with `value`
    - Example: `np.full(4, 2)` returns `[2, 2, 2, 2]`
    
**Percentile:** The Xth percentile is the smallest value in the array that is at least as large as X% of the elements. When X% of the array length is not a whole number, always round up. The array does not need to be sorted.
 - Example:
 ```python
  ## percentile(x, s): x is a percentage (0 to 100) and s is an array
  ## returns the item of the array that is at the specified percentile
  s = make_array(9, 3, 5, 7, 1)
  percentile(80, s)
  ## returns 7 (the 4th smallest element)
  ## 80th percentile: 80/100 * 5 = 4, so it is the 4th element of the sorted array
  ## 75th percentile is also 7, because 75/100 * 5 = 3.75 rounds up to 4
 ```
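
A minimal pure-Python sketch of the rule described above. This is not the library's implementation, and `percentile_sketch` is a hypothetical name; it only mirrors the "sort, multiply, round up" recipe.

```python
import math

def percentile_sketch(p, values):
    """Smallest value that is at least as large as p% of the values (0 < p <= 100)"""
    position = math.ceil(p / 100 * len(values))  # always round up
    return sorted(values)[position - 1]          # positions start at 1, indices at 0

s = [9, 3, 5, 7, 1]
print(percentile_sketch(80, s))  # 7
print(percentile_sketch(75, s))  # 7
print(percentile_sketch(50, s))  # 5
```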

Sampling and Distributions

NO replacement

tbl.sample(n, with_replacement=False)

 - Categorical Sampling
 ```python
 # Out of a random sample of size sample_size, the proportion of times each value in probability_distribution was picked
 sample_proportions(sample_size, probability_distribution)
 # Probability_distribution is an array that has probability values.
 # Returns an array of simulated proportions.

 # Example:
 probability_distribution = make_array(.26, .74)
 sample_proportions(100, probability_distribution)
 # Output should be around [.26, .74] but with some variability
 ```
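
A rough NumPy stand-in for the idea, assuming the behavior described above: draw sample_size items at random from categories with the given probabilities, then report the proportion that landed in each category. `sample_proportions_sketch` is a hypothetical name, not the course function.

```python
import numpy as np

def sample_proportions_sketch(sample_size, probability_distribution):
    """Proportions of a random sample of sample_size that fell into each category"""
    counts = np.random.multinomial(sample_size, probability_distribution)
    return counts / sample_size

proportions = sample_proportions_sketch(100, np.array([.26, .74]))
print(proportions)  # different every run, but the entries always add up to 1
```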

Comparisons

Maps

A table with 3 columns: latitude, longitude, and name of location is required as a parameter.

# Makes a map with pins
Marker.map_table(table.select('lat', 'long', 'name'))
# Makes a map with green circles of radius 10
Circle.map_table(table.select('lat', 'long', 'name'), color='green', radius=10)
# Makes a map with variable colors and sizes
Circle.map_table(table.select('lat', 'long', 'name', 'color', 'size'))

Pivot Tables

Joining two tables

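The general call is table1.join('Table1 Column', table2, 'Table2 Column'), which matches up rows where the two columns have equal values. If Table1 Column and Table2 Column have the same name, the call instead is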

table1.join('Shared Column', table2)


**Grouping a Table**
 - The default `func_to_apply` is to count the number of rows for each category in `Column 1`.
 - If `func_to_apply` is undefined on a specific column (e.g. trying to `sum` a bunch of strings) then the column will still exist, but will be empty.
```python
# Makes a table with one row for each combination of values that appears in the two columns, then applies func_to_apply to the REMAINING columns in the table.
table.group(['Column 1', 'Column 2'], func_to_apply)
```
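
To see what grouping by two columns with the default count works out to, here is a small sketch in plain Python. The rows are invented and `Counter` is only a stand-in for the table method.

```python
from collections import Counter

# Invented rows: (flavor, color) pairs standing in for two table columns
rows = [
    ('chocolate', 'dark'),
    ('chocolate', 'light'),
    ('chocolate', 'dark'),
    ('vanilla', 'light'),
]

# One entry per combination that actually appears, with its row count
counts = Counter(rows)
for (flavor, color), count in sorted(counts.items()):
    print(flavor, color, count)
```

Note that ('vanilla', 'dark') never shows up in the output because no row has that combination.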

Table Visualizations

Plotting a table’s column as a line

# Uses the column as the x axis. Every other column in the table is drawn as its own line in the line graph.
table.plot('column_name')

#can also plot by x,y. Makes a single line
table.plot('x_column','y_column')

For example, calling table.plot('AGE') on a table with three columns (AGE, 2010, and 2014) puts AGE on the x axis and draws two lines, one for 2010 and one for 2014.

Plotting a table’s column as a histogram

# Creates a histogram of column values vs. percent per unit.
tbl.hist('Column')

# Optional bins value: can be either one number (specifying number of bins) or an array (specifying the bin start/stop values)
tbl.hist('Column', bins=100) # 100 bins
tbl.hist('Column', bins=np.arange(1, 10)) # 8 bins of equal width 1 (bin edges are 1, 2, ..., 9)

# Optional unit value: Instead of percent per unit, it's percent per second or percent per mile, etc.
tbl.hist('Column', unit='second')
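
A NumPy-only sketch of how bin edges turn into bars, assuming bar height means "percent of values in the bin, divided by the bin width". The data is invented and np.histogram stands in for the table method.

```python
import numpy as np

values = np.array([1, 1, 2, 3, 3, 3, 5, 8])   # invented data
edges = np.arange(1, 10)                      # 1, 2, ..., 9

counts, _ = np.histogram(values, bins=edges)
widths = np.diff(edges)

# Height of each bar: percent of values in the bin, divided by the bin width
percent_per_unit = 100 * counts / len(values) / widths

print(len(counts))                      # 9 edges make 8 bins
print(np.sum(percent_per_unit * widths))  # bar areas add up to 100 percent
```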

Table Operations

Creating a table:

# both are equivalent
tbl = Table.read_table('tablename.csv')
tbl = Table().read_table('tablename.csv')

Converting a table column into an array:

arr = tbl.column('column name')

Creating a table column from an array:

# Remember to re-assign the original table if you want the new table to be saved!
tbl = tbl.with_column('column name', arr)

# Make as many columns as you want in a single function call!
tbl = tbl.with_columns('first column', arr1, 'second column', arr2, ... ,'nth column', arr_n)

Getting specific rows or columns from a table

# ROW: Pass in either an array/list of 0-indexed row numbers, OR a single number.
first_row_only = tbl.take(0)
every_third_row = tbl.take(np.arange(0, tbl.num_rows, 3))

# COLUMN: Pass in a single column name or a single 0-indexed column number. Returns the column's values as an array. (To keep one or more columns as a table, use tbl.select instead.)
first_col_only = tbl.column(0)
dank_memes_only = tbl.column('Deep Fried Memes')

Sorting a Table

Used to easily find max/min of a table, or check if a table has duplicate entries.

# Default: starts at the lowest value and repeats are allowed. With descending=True, starts at the highest value instead. With distinct=True, rows with a repeated value in the column are dropped.
table.sort(column, descending=False, distinct=False)

np aka “numpy”

Creating an array which is a range of numbers

# Makes a range that goes from n1 up to, but NOT including, n2. Can optionally increment by n3 (default is 1).
# n1, n2, n3 are usually integers
arr_range = np.arange(n1, n2, n3)
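
A few concrete calls, as a quick sketch (the variable names are just for illustration):

```python
import numpy as np

by_threes = np.arange(0, 10, 3)   # [0, 3, 6, 9]
no_step = np.arange(2, 6)         # [2, 3, 4, 5], the end value 6 is not included
from_zero = np.arange(5)          # [0, 1, 2, 3, 4], one argument means "start at 0"
```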

Calculating an average value / mean from a given array

np.average(array)

Creating a new array that is the difference between sequential elements in a given array

#len of returned array is len(array)-1
# result[0] = array[1] - array[0]
# result[1] = array[2] - array[1]
# ...
np.diff(array)

Creating a new array with elements that are in the form of (current element + all previous elements)

np.cumsum(array)
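
np.diff and np.cumsum on the same small array, as a quick sketch (the numbers are made up):

```python
import numpy as np

squares = np.array([1, 4, 9, 16])

gaps = np.diff(squares)             # [3, 5, 7]: each element minus the one before it
running_total = np.cumsum(squares)  # [1, 5, 14, 30]: each element plus everything before it
```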

Random Number: For returning a random result from an array. Can specify to return multiple results equal to num_times. If num_times is specified, it returns an array. Selects with replacement. (Clarification: num_times is OPTIONAL. If it is left out, a single element is returned instead of an array.)

np.random.choice(array, num_times) # Get an array of num_times random elements from array.

sum(np.random.choice(array, num_times) == 'value in array') # Returns total number of times 'value in array' appeared inside the random selection
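
A small coin-flip sketch of the calls above. The array and names are invented, and np.count_nonzero is used here as an alternative to sum for counting matches.

```python
import numpy as np

coin = np.array(['heads', 'tails'])

one_flip = np.random.choice(coin)        # a single element
ten_flips = np.random.choice(coin, 10)   # an array of 10 elements, chosen with replacement

# Number of times 'heads' came up in the ten flips
num_heads = np.count_nonzero(ten_flips == 'heads')
```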

Appending: For adding to an existing array. Can either add a single value or another array.
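
A minimal sketch of both cases with np.append (the arrays are made up):

```python
import numpy as np

arr = np.array([1, 2, 3])

with_value = np.append(arr, 4)                 # [1, 2, 3, 4]
with_array = np.append(arr, np.array([4, 5]))  # [1, 2, 3, 4, 5]

# np.append returns a new array; arr itself is still [1, 2, 3]
```

Because the original array is not changed, remember to assign the result to a name if you want to keep it.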

For Statements: Used for various applications where you know how many times you want to do something.

for var in array:
    (something)
# Runs the (something) once for every element of your array. Each time through, var is temporarily assigned the value of the current element. Ex: the first time the (something) is executed, var is the first element of the array.
for pet in make_array('cat','rabbit', 'dragon'):
    print(pet)
##In this case, every element of the array is printed.