# Table of Contents

1.  [Data](#org498e524)
    1.  [Spreadsheet](#org4c5f4c8)
    2.  [Matrix](#org9cc5358)
        1.  [Sparse matrix](#orga79cfc9)
2.  [Learning](#orgc404075)
    1.  [Types](#orged9701e)
        1.  [Statistical](#orgc164f78)
        2.  [Programming](#org852d7b0)
        3.  [Parametric](#orgd6675a6)
        4.  [Non parametric](#orgb202e12)
        5.  [Supervised](#org58ecc0d)
        6.  [Unsupervised](#org9de217e)
        7.  [Semi supervised](#org5190b3c)
        8.  [Classification vs Regression](#orge3439c9)
    2.  [Errors](#org934b56a)
        1.  [Error Y=f(x) + e](#orgb4f57fc)
        2.  [Bias Variance](#orgd65d546)
        3.  [Overfitting](#orga7d5c68)
3.  [Map reduce](#org2f4f03f)
    1.  [Map](#org57afa8f)
    2.  [Reduce](#orgd61227c)
4.  [Algorithms](#orge247545)
    1.  [Gradient Descent](#org3457ed6)
        1.  [Stochastic Gradient Descent](#orgbb5e342)
5.  [Hadoop](#org10efd28)
    1.  [Install](#orgbf2b7cd)
        1.  [From source](#orgeaf39ad)
        2.  [core-site.xml](#orgc224f48)
        3.  [hdfs-site.xml](#orgca71df5)
        4.  [Format hdfs](#org761f8e6)
        5.  [Start](#org7dc5796)
        6.  [Create hdfs folders](#orge8a6492)
        7.  [start yarn](#orgdf870eb)
    2.  [urls](#org2c5b3ba)
        1.  [hdfs fs](#org0a6a998)
        2.  [yarn](#org47cbd04)
        3.  [jobtracker](#orgd78bb47)
    3.  [run test](#orgedf4232)
        1.  [yarn jar somejob.jar args](#org6e5fa4c)
    4.  [hdfs](#orgcc3a213)
        1.  [roles](#orge848108)
        2.  [commands](#org361d0d3)
        3.  [programming](#org6a085ec)
        4.  [HA](#org970c9c2)
        5.  [misc](#org9dbffbd)
    5.  [debug](#org3faa487)
        1.  [/var/log/hadoop](#org43e4c3e)
        2.  [kill](#orgcbd191d)
    6.  [map reduce](#org46f8617)
        1.  [grep | wc -l](#orgdfbff05)
        2.  [helloworld](#org65e0742)
        3.  [shuffle](#orgccd0f12)
        4.  [reduce](#org37150c2)
        5.  [combiner](#org57576cc)
        6.  [streaming](#org08fdac5)
        7.  [pipes](#org5ecd28d)
    7.  [YARN](#org50d92e1)
6.  [Spark](#orga8dadcf)
    1.  [General Ideas](#orgc599f8f)
7.  [Code Examples](#orga9a09c0)
    1.  [Libraries](#orga432155)
        1.  [Graphx](#org56005c1)
    2.  [Operations](#orga51877b)
        1.  [Transformations](#org4e1743e)
        2.  [Actions](#org2f9c501)
    3.  [Data structures](#org6d83be5)
        1.  [RDD](#orga784b50)
        2.  [DF](#org93ce8c2)
    4.  [Fast](#org3751caf)
    5.  [Run](#org50b1dce)
    6.  [Hdfs](#orga2f7ba5)
8.  [Hive](#org62bf64c)
    1.  [Install](#orgd2ebfa4)
        1.  [derby](#orgb21af8f)
9.  [Oozie](#orgc513291)
10. [AWS](#org3b96987)
    1.  [considerations](#orgd5aa938)
        1.  [develop](#orgc2e2713)
        2.  [deploy](#orgc2464d3)
        3.  [iteration time](#org6e0a99d)
        4.  [lower scale](#org83f1ea0)
        5.  [processing time](#orgd22e813)
    2.  [key technologies](#orgae7a1c9)
        1.  [S3](#org2cbd681)
        2.  [redshift](#orgbd4c527)
        3.  [data pipelines](#orgd99a0ae)
        4.  [kinesis](#orgfba103a)
        5.  [ec2](#org854f7ad)
    3.  [resources](#orged854c2)
    4.  [process](#orgdbb663f)
    5.  [ec2](#org33766ea)
    6.  [EMR](#org673323a)
        1.  [s3](#orgcd99fa9)
        2.  [JobFlow](#org49c27e2)
        3.  [Hive](#org60a146e)
        4.  [cli](#org269105e)
    7.  [awscli](#org3d659e3)
        1.  [install](#org8449a75)
        2.  [configure](#org3b84864)
11. [python](#orgceb6ec1)
    1.  [urllib2](#orgea8804a)
        1.  [getfile](#org6dfdae5)
    2.  [matplotlib](#orge693329)
    3.  [pandas](#org7750f54)
        1.  [data](#orge4c5a97)
        2.  [plot](#orga52d38d)
        3.  [build model](#orgae299fe)
12. [Split-out validation dataset](#org468572f)
13. [Test options and evaluation metric](#org65d329c)
14. [Spot Check Algorithms](#org1385d2a)
15. [evaluate each model in turn](#orgdf79ef3)
16. [Compare Algorithms](#org5cc1568)
17. [Make predictions on validation dataset](#org019c684)
        1.  [resources](#orgefae294)
18. [Amazon](#org239029a)

<a id="org498e524"></a>

# Data

<a id="org4c5f4c8"></a>

## Spreadsheet

Think of data as a spreadsheet, that is, as a table.

<a id="org9cc5358"></a>

## Matrix

Rows are observations (our data); columns are features.  Get used to it.

<a id="orga79cfc9"></a>

### Sparse matrix

A matrix in which most entries are zero.
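
As a rough illustration (the `to_sparse` helper and the sample matrix are made up for this note, not taken from any library), a sparse matrix can be stored as just its non-zero entries:

```python
def to_sparse(dense):
    """Keep only the non-zero cells as {(row, col): value}."""
    return {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != 0
    }


dense = [
    [0, 0, 3],
    [0, 0, 0],
    [5, 0, 0],
]
sparse = to_sparse(dense)
print(sparse)  # only the non-zero cells are stored
```

Real libraries use more compact layouts, but the idea is the same: do not pay for the zeros.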

<a id="orgc404075"></a>

# Learning

<a id="orged9701e"></a>

## Types

<a id="orgc164f78"></a>

### Statistical

`Output = f(input)` => also written `f(inputVariable)`, `f(inputVector)`, `f(independent variables)`, or `Y = f(X)` with inputs X1, X2, ...

<a id="org852d7b0"></a>

### Programming

OutputAttributes = Program(InputAttributes) or Program(InputFeatures) or Model = Algorithm(Data)

<a id="orgd6675a6"></a>

### Parametric

A parametric model has a fixed set of parameters: no matter how much data you throw at it, it still needs only those parameters, like the line `Y = ax + b` (logistic regression, linear discriminant analysis, perceptron).

<a id="orgb202e12"></a>

### Non parametric

No fixed set of parameters is assumed up front: the model's complexity can grow with the amount of data (k-nearest neighbors, decision trees, support vector machines).

<a id="org58ecc0d"></a>

### Supervised

You have a teacher who knows the answer: classification, regression.

<a id="org9de217e"></a>

### Unsupervised

No teacher, clustering, association

<a id="org5190b3c"></a>

### Semi supervised

Only some of the data comes with a teacher (labels).

<a id="orge3439c9"></a>

### Classification vs Regression

classification(input) => spam/notspam (categorical outcome)<br />regression(input) => bitcoin price (continuous outcome)

<a id="org934b56a"></a>

## Errors

<a id="orgb4f57fc"></a>

### Error Y=f(x) + e

Y = f(X) + e \* => You learn a function!

<a id="orgd65d546"></a>

### Bias Variance

Three kinds of error: bias error (from model assumptions), variance error, and irreducible error. Increasing bias reduces variance, and increasing variance reduces bias.

<a id="orga7d5c68"></a>

### Overfitting

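Overfitting means the model has learned the training data too well and generalizes poorly. Ways to limit it: resampling to estimate model accuracy, holding back a validation dataset, cross validation.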
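
A minimal sketch of how cross validation carves up the data, assuming plain integer indices stand in for the rows (`k_fold_indices` is a hypothetical helper written for this note, not a library function):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        stop = n_samples if fold == k - 1 else start + fold_size
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation


folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every row is used for validation exactly once, so the accuracy estimate does not depend on one lucky split.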

<a id="org2f4f03f"></a>

# Map reduce

    grep something | wc -l * => grep is map wc -l is the reduce!

Based on simple [key, value] pairs.
Moving computation is cheaper than moving data; our data is big, ain't it?

<a id="org57afa8f"></a>

## Map

List(input) => List(output) \* => like grep

<a id="orgd61227c"></a>

## Reduce

List(input) => Output(value) \* => like wc -l
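
The grep and `wc -l` analogy can be sketched with Python's built-in `map` and `functools.reduce`. The sample lines are invented for illustration, and this runs on one machine only; it shows the shape of the two steps, not Hadoop itself:

```python
from functools import reduce

lines = [
    "Samuel went home",
    "nothing here",
    "Samuel again",
]

# Map step (like grep): every input line becomes 1 if it matches, else 0.
mapped = list(map(lambda line: 1 if "Samuel" in line else 0, lines))

# Reduce step (like wc -l): fold the whole list into one value.
match_count = reduce(lambda total, flag: total + flag, mapped, 0)

print(mapped)
print(match_count)
```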

<a id="orge247545"></a>

# Algorithms

<a id="org3457ed6"></a>

## Gradient Descent

Almost every machine learning algorithm uses optimisation at its core: optimising the target function, that is, finding a (local) minimum of the cost.  Start with the coefficient at zero, `coefficient = 0.0`.  Evaluate the cost, `cost = evaluate(f(coefficient))`.  Then move the coefficient downhill using the derivative `delta` of the cost: `coefficient = coefficient - (alpha * delta)`, where `alpha` is the learning rate.
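
A minimal sketch of that update rule, assuming a made-up cost function `(c - 3)^2` whose derivative is `2 * (c - 3)`; the function name and the default values for `alpha` and `steps` are illustrative choices, not from the notes:

```python
def gradient_descent(derivative, start=0.0, alpha=0.1, steps=100):
    """Repeatedly step downhill: coefficient = coefficient - alpha * delta."""
    coefficient = start
    for _ in range(steps):
        delta = derivative(coefficient)
        coefficient = coefficient - (alpha * delta)
    return coefficient


# Cost (c - 3)^2 has its minimum at c = 3; its derivative is 2 * (c - 3).
minimum = gradient_descent(lambda c: 2 * (c - 3))
print(minimum)
```

With a larger `alpha` the steps get bigger and can overshoot the minimum; with a smaller one more steps are needed.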

<a id="orgbb5e342"></a>

### Stochastic Gradient Descent

When you have large amounts of data, update the coefficients for each training instance rather than once per batch; because the instances are visited in random order, we move quickly.
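
A rough sketch of the per-instance update for fitting the line `Y = ax + b`. The data points, `alpha`, `epochs` and `seed` are assumptions picked for this example, and `sgd_line_fit` is a hypothetical helper:

```python
import random


def sgd_line_fit(points, alpha=0.01, epochs=500, seed=7):
    """Fit y = a*x + b, updating a and b after every single training instance."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the instances in random order
        for x, y in data:
            error = (a * x + b) - y
            a -= alpha * error * x  # derivative of the squared error w.r.t. a (up to a factor of 2)
            b -= alpha * error      # derivative of the squared error w.r.t. b (up to a factor of 2)
    return a, b


points = [(x, 2 * x + 1) for x in range(10)]  # noise-free samples of y = 2x + 1
a, b = sgd_line_fit(points)
print(a, b)
```

The coefficients change after every point instead of after a full pass, which is what makes the method usable when the data does not fit in one batch.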

<a id="org10efd28"></a>

# Hadoop

<a id="orgbf2b7cd"></a>

## Install

In general, for Hadoop, Hive and Pig installations you download the tar.gz, set an environment variable for its home, and add folders in HDFS if needed.

<a id="orgeaf39ad"></a>

### From source

Extract the hadoop tar.gz, make sure `JAVA_HOME` is in the path and `HADOOP_HOME` is configured, add the yarn, hdfs and mapred users, and make the directories: `/var/data/hadoop/hadfs/[nn,snn]` and a log directory.

<a id="orgc224f48"></a>

### core-site.xml

`hdfs://localhost:9000` => sets the HDFS host and port.

<a id="orgca71df5"></a>

### hdfs-site.xml

HDFS parameters: `dfs.replication: 1`, dfs. directory…

<a id="org761f8e6"></a>

### Format hdfs

    su - hdfs
    cd /opt/hadoop-2.8.1/bin
    ./hdfs namenode -format

<a id="org7dc5796"></a>

### Start

    cd /opt/hadoop-2.8.1/sbin
    ./ start namenode
    ./ start secondarynamenode
    ./ start datanode
    jps * => java processes status the above are all java processes.

<a id="orge8a6492"></a>

### Create hdfs folders

Run `hdfs dfs -mkdir -p /mr-history/tmp /mr-history/done`, then chown the folders to `yarn:hadoop`.

<a id="orgdf870eb"></a>

### start yarn

    su - yarn
    ./ start resourcemanager
    ./ start nodemanager
    ./ start historyserver

<a id="org2c5b3ba"></a>

## urls

<a id="org0a6a998"></a>

### hdfs fs

1.  <http://localhost:50070>

    hdfs file system

<a id="org47cbd04"></a>

### yarn

1.  <http://localhost:8088>

    YARN ResourceManager web UI

<a id="orgd78bb47"></a>

### jobtracker

1.  <http://headnode:50030>

<a id="orgedf4232"></a>

## run test

<a id="org6e5fa4c"></a>

### yarn jar somejob.jar args

run a test mr jar with yarn

<a id="orgcc3a213"></a>

## hdfs

<a id="orge848108"></a>

### roles

1.  namenode

    Acts like a **traffic cop**, telling us where to find or write data. It also handles data node failures: if a data node does not report back with its status before a timeout, the namenode removes it. We see one namespace across all the data.  The client contacts the namenode first and then, for the actual data, the datanode that the namenode returned.
    1.  inmemory
        Stores the HDFS metadata in memory; at startup it reads it from the `fsimage` file. Writes are appended to an edits log, and on startup the namenode merges the log with the fsimage.
    2.  secondary namenode
        1.  bad title
            1.  checkpoint node
                Better named checkpoint node, because it merges the edits log into the fsimage while the namenode is running, so startup will be fast.
    3.  backup node
        Does the same work as the checkpoint node but is synchronized with the namenode through a real-time stream from it.  Still no redundancy with this.

2.  datanode

3.  hdfs-client

    1.  calls namenode then datanode
        You do operations on the hdfsClient; it does all the work of communicating with the namenode and then sending the operations to the correct data nodes.

<a id="org361d0d3"></a>

### commands

1.  hdfs dfsadmin -report

2.  dfs -put file.txt

    hdfs dfs -put war-and-peace.txt

3.  dfs -cp file1.txt file2.txt

    copy a file inside hdfs

4.  mount hdfs /mnt/hdfs

    as a local file system!

<a id="org6a085ec"></a>

### programming

1.  java

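    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem; // much the same API as the Java file system.
    import org.apache.hadoop.fs.Path;
    Configuration conf = new Configuration();
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
    FileSystem fileSystem = FileSystem.get(conf);
    // Create new file and write data to it.
    FSDataOutputStream out = fileSystem.create(path);
    InputStream in = new BufferedInputStream(new FileInputStream(
      new File(source)));
    byte[] b = new byte[1024]; // copy buffer
    int numBytes = 0;
    // Copy the local file to HDFS one buffer at a time.
    while ((numBytes = in.read(b)) > 0)
      out.write(b, 0, numBytes);
    in.close();
    out.close();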
    1.  compile
        echo "Main-Class: org/myorg.HDFSClient" > manifest.txt
        javac -classpath /usr/lib/hadoop/hadoop-core.jar -d HDFSClient-classes \* => Note we needed to include hadoop core jar.
        jar -cvfe HDFSClient.jar org/myorg.HDFSClient -C HDFSClient-classes/ .
        hadoop jar ./HDFSClient.jar add sometextfile.txt /user/tomer \* => run with program arguments.
    2.  classpath
        export CLASSPATH=$(hadoop classpath)

<a id="org970c9c2"></a>

### HA

1.  namenode

    1.  standby namenode
        Acts like a checkpoint node, so it has the fsimage file; it takes over in case of failure.
    2.  federation
        Split the namespace across several namenodes:
        namenode1: /research/marketing
        namenode2: /data/project
    3.  snapshots
        Read-only point-in-time copies of the file system.  A snapshot can cover a subtree.  No data is copied, only the block list and file size.  Think of it as a snapshot of a file directory.  You can take one on a daily basis; it does not slow things down.

<a id="org9dbffbd"></a>

### misc

1.  nfsv3

    The NFS gateway allows you to access HDFS as if it were a local file system; it's still not random access, but it's convenient.

2.  host:5700

    web gui for nfs is at <http://host:5700>

<a id="org3faa487"></a>

## debug

<a id="org43e4c3e"></a>

### /var/log/hadoop

These are the logs on the headnode. You can also ssh to the worker nodes and look at /var/log/hadoop/mapred; there you will see the task tracker logs.

<a id="orgcbd191d"></a>

### kill

    hadoop job -list
    hadoop job -kill job_2016982347928_0042

<a id="org46f8617"></a>

## map reduce

    map    => banana, 1
              banana, 1
              banana, 1
    reduce => banana, 3

<a id="orgdfbff05"></a>

### grep | wc -l

`grep "Samuel" somebook.txt | wc -l`

-   grep => map
-   wc -l => reduce

<a id="org65e0742"></a>

### helloworld

The mapper is a string tokenizer that emits (word, 1); the reducer sums the values (`sum += value`). In addition you write the "driver", which runs the mapper and the reducer: you say which class is the mapper with `conf.setMapperClass(MapClass.class);` and likewise call `conf.setCombinerClass` and `conf.setReducerClass`.

`hadoop jar wordcount.jar org.myorg.WordCount /user/myuser/inputdir /user/myuser/outputdir`
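
The same word count can be simulated on one machine. This is only a sketch of the data flow: `mapper`, `shuffle` and `reducer` are hypothetical stand-ins written for this note, not Hadoop APIs, and the input lines are invented:

```python
from collections import defaultdict


def mapper(line):
    """Tokenize the line and emit (word, 1) for every word."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group all values that share a key, as the shuffle step does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Sum the values collected for one key."""
    return key, sum(values)


lines = ["banana apple banana", "banana"]
pairs = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle(pairs).items())
print(counts)
```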

<a id="orgccd0f12"></a>

### shuffle

Shuffle is the only step where data is transferred between nodes.

<a id="org37150c2"></a>

### reduce

Reduce can run on multiple hosts. It depends on shuffle: shuffle puts the same keys on the same host, so each reducer works on a group of identical keys and knows it has all the values for those keys.

<a id="org57576cc"></a>

### combiner

Instead of the mapper reporting earth,1 and earth,1, the combiner has the mapper report earth,2 from a given node. This pre-aggregates the mapper output so the reducer has less work.
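
A small sketch of the difference, with invented words standing in for what one node's mapper sees (both function names are hypothetical):

```python
from collections import Counter


def map_without_combiner(words):
    """Emit one (word, 1) record per occurrence."""
    return [(word, 1) for word in words]


def map_with_combiner(words):
    """Pre-aggregate on the mapper's node, so each word is emitted once."""
    return list(Counter(words).items())


node_words = ["earth", "earth", "moon"]
plain = map_without_combiner(node_words)
combined = map_with_combiner(node_words)
print(plain)
print(combined)
```

Fewer records leave the node, so less data crosses the network during shuffle.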

<a id="org08fdac5"></a>

### streaming

**Streaming interface for hadoop jobs**
You can write a program that reads stdin and just run it, and amazingly you can also run it on hadoop.  In the java map reduce interface we got the input line by line; here we get stdin and can do anything we want.
Then you run it with:
`/usr/lib/hadoop/contrib/streaming/hadoop-streaming- -file ./ -mapper ./ -file ./ -reducer ./ …`
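
A sketch of what such stdin-driven programs could look like. It assumes the usual streaming convention of one tab-separated `key<TAB>value` record per line, and that the reducer receives its input sorted by key; `run_mapper` and `run_reducer` are hypothetical names, and in real scripts you would pass `sys.stdin` and `sys.stdout`:

```python
import io


def run_mapper(stream, out):
    """Read raw lines, write one 'word<TAB>1' record per word."""
    for line in stream:
        for word in line.split():
            out.write(f"{word}\t1\n")


def run_reducer(stream, out):
    """Read key-sorted 'word<TAB>count' records and write one total per word."""
    current, total = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                out.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        out.write(f"{current}\t{total}\n")


# Local dry run: io.StringIO stands in for stdin/stdout, sorted() for the shuffle.
mapped = io.StringIO()
run_mapper(io.StringIO("earth moon\nearth\n"), mapped)
sorted_records = sorted(mapped.getvalue().splitlines(keepends=True))
reduced = io.StringIO()
run_reducer(sorted_records, reduced)
result = reduced.getvalue()
print(result)
```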

<a id="org5ecd28d"></a>

### pipes

**Pipes interface to mapreduce**
It's a clean interface for doing map reduce.

<a id="org50d92e1"></a>


## YARN

YARN does not care that it is map reduce it is running; it could be any job.  The previous job tracker and task tracker ran only map reduce: the jobTracker manages jobs and a taskTracker runs on each local node.

<a id="orga8dadcf"></a>

# Spark

<a id="orgc599f8f"></a>

## General Ideas

<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">

<colgroup>
<col  class="org-left" />

<col  class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Idea</th>
<th scope="col" class="org-left">Description</th>
</tr>
</thead>

<tbody>
<tr>
<td class="org-left">Transformation</td>
<td class="org-left">`transformation(RDD): RDD`</td>
</tr>

<tr>
<td class="org-left">Action</td>
<td class="org-left">`action(RDD): Value`</td>
</tr>
</tbody>
</table>
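
The difference between the two can be mimicked in plain Python with generators. This is only an analogy, not Spark code: `transformation` and `action` are made-up functions, and the `calls` list exists just to show when work actually happens:

```python
calls = []  # records which elements were actually processed


def transformation(records):
    """Lazy: returns a new generator; nothing is computed yet."""
    def generate():
        for record in records:
            calls.append(record)
            yield record * 2
    return generate()


def action(records):
    """Eager: consumes the records and returns a plain value."""
    return sum(records)


doubled = transformation([1, 2, 3])
evaluated_before_action = len(calls)  # still nothing processed
total = action(doubled)               # the action triggers the work
print(evaluated_before_action, total, calls)
```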

<a id="orga9a09c0"></a>

# Code Examples

<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">

<colgroup>
<col  class="org-left" />

<col  class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">action</th>
<th scope="col" class="org-left">example</th>
</tr>
</thead>

<tbody>
<tr>
<td class="org-left">Read text file</td>
<td class="org-left">`sc.textFile("file..")`</td>
</tr>

<tr>
<td class="org-left">Count</td>
<td class="org-left">`rdd.count()`</td>
</tr>
</tbody>
</table>

<a id="orga432155"></a>

## Libraries

<a id="org56005c1"></a>

### Graphx

Spark has GraphX, a library for graph computations (in addition to MLlib).

<a id="orga51877b"></a>

## Operations

<a id="org4e1743e"></a>

### Transformations

<a id="org2f9c501"></a>

### Actions

<a id="org6d83be5"></a>

## Data structures

<a id="orga784b50"></a>

### RDD

1.  Blind data

<a id="org93ce8c2"></a>

### DF


1.  Schema

    Think of it as distributed database table.

2.  Read json element

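        object SparkDFOnlineJson extends App {
          override def main(args: Array[String]): Unit = {
            // Assumed reconstruction: the fetch call was lost in these notes and the URL is elided ("…").
            val jsonString = scala.io.Source.fromURL("…").mkString
            val spark = org.apache.spark.sql.SparkSession.builder().appName("someapp").master("local[*]").getOrCreate()
            import spark.implicits._
            import org.apache.spark.sql.functions._
            // Assumed reconstruction: parse the JSON string into a DataFrame.
            val df = spark.read.json(Seq(jsonString).toDS)
            df.show()
            df.printSchema()
            df.select($"Data.close".as("close_price")).show(2) // <-- HERE reading Data.close from the json!
            val jsonExplodedDF = df.select($"Aggregated", $"ConversionType", explode($"Data").as("prices")) // <-- HERE exploding the Data array into one row per element.
            jsonExplodedDF.printSchema()
            jsonExplodedDF.select($"Aggregated", $"ConversionType", $"prices".getItem("close")).show(10) // Then getItem instead of explode to objects!!
          }
        }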
        // +----------+--------------+--------------------+-----------------+--------+----------+----------+----+
        // |Aggregated|ConversionType|                Data|FirstValueInArray|Response|  TimeFrom|    TimeTo|Type|
        // +----------+--------------+--------------------+-----------------+--------+----------+----------+----+
        // |     false|     [,invert]|[[23.91,25.06,21....|             true| Success|1513209600|1515801600| 100|
        // +----------+--------------+--------------------+-----------------+--------+----------+----------+----+
        // [false,[,invert],WrappedArray([23.91,25.06,21.87,23.39,1513209600,62691.53,1452942.54], [25.87,29.03,23.88,23.91,1513296000,50825.4,1342967.63], [28.11,28.62,24.53,25.87,1513382400,38155.01,1013078.48], [26.72,28.11,25.93,28.11,1513468800,36242.76,979762.25], [24.08,26.86,23.29,26.72,1513555200,46712.69,1186390.62], [21.63,24.41,21.29,24.08,1513641600,65125.17,1449434.45], [20.67,22.29,20.42,21.63,1513728000,64539.45,1372742.27], [19.79,20.94,19.4,20.67,1513814400,61802.62,1244602.57], [20.93,21.98,19.47,19.79,1513900800,80230.91,1656134.49], [20.78,20.97,20.42,20.93,1513987200,42893.35,887428.82], [20.53,20.97,20.36,20.77,1514073600,41294.18,855012.67], [19.18,20.53,18.67,20.53,1514160000,48165.25,929653.57], [20.91,21.55,18.75,19.18,1514246400,46999.33,956924.92], [20.88,21.57,20.45,20.91,1514332800,36759.37,769083.49], [20.04,20.95,19.7,20.88,1514419200,40883.16,828193.82], [19.58,20.25,19.32,20.04,1514505600,43487.34,857520.42], [18.14,19.77,18.09,19.58,1514592000,66161.84,1246949.13], [18.68,19.07,18.05,18.14,1514678400,48718.02,902419.05], [17.76,18.7,17.54,18.67,1514764800,50703.72,910875.63], [17.16,18.94,15.25,17.76,1514851200,96092.61,1574640.02], [16.01,17.68,15.62,17.16,1514937600,75289.68,1266911.61], [16.06,16.59,14.43,16.03,1515024000,80755.25,1258516.2], [17.59,18.29,14.54,16.07,1515110400,104693.19,1682729.53], [17.03,17.91,16.25,17.59,1515196800,58014.94,975679.49], [14.49,17.06,14.47,17.03,1515283200,64620.79,994739.35], [13.2,14.5,12.73,14.49,1515369600,102880.99,1380565.72], [11.18,13.21,10.93,13.2,1515456000,95751.66,1168583.78], [11.95,12.06,10.16,11.18,1515542400,143351.13,1546032.52], [11.66,11.96,10.93,11.95,1515628800,97380.62,1100658.4], [10.96,11.8,10.89,11.66,1515715200,63382.56,710582.11], [10.27,11.12,10.24,10.96,1515801600,58214.24,625184.97]),true,Success,1513209600,1515801600,100]
        // root
        //  |-- Aggregated: boolean (nullable = true)
        //  |-- ConversionType: struct (nullable = true)
        //  |    |-- conversionSymbol: string (nullable = true)
        //  |    |-- type: string (nullable = true)
        //  |-- Data: array (nullable = true)
        //  |    |-- element: struct (containsNull = true)
        //  |    |    |-- close: double (nullable = true)
        //  |    |    |-- high: double (nullable = true)
        //  |    |    |-- low: double (nullable = true)
        //  |    |    |-- open: double (nullable = true)
        //  |    |    |-- time: long (nullable = true)
        //  |    |    |-- volumefrom: double (nullable = true)
        //  |    |    |-- volumeto: double (nullable = true)
        //  |-- FirstValueInArray: boolean (nullable = true)
        //  |-- Response: string (nullable = true)
        //  |-- TimeFrom: long (nullable = true)
        //  |-- TimeTo: long (nullable = true)
        //  |-- Type: long (nullable = true)
        // +--------------------+
        // |         close_price|
        // +--------------------+
        // |[23.91, 25.87, 28...|
        // +--------------------+
        // root
        //  |-- Aggregated: boolean (nullable = true)
        //  |-- ConversionType: struct (nullable = true)
        //  |    |-- conversionSymbol: string (nullable = true)
        //  |    |-- type: string (nullable = true)
        //  |-- prices: struct (nullable = true)
        //  |    |-- close: double (nullable = true)
        //  |    |-- high: double (nullable = true)
        //  |    |-- low: double (nullable = true)
        //  |    |-- open: double (nullable = true)
        //  |    |-- time: long (nullable = true)
        //  |    |-- volumefrom: double (nullable = true)
        //  |    |-- volumeto: double (nullable = true)
        // +----------+--------------+------------+
        // |Aggregated|ConversionType|prices.close|
        // +----------+--------------+------------+
        // |     false|     [,invert]|       23.91|
        // |     false|     [,invert]|       25.87|
        // |     false|     [,invert]|       28.11|
        // |     false|     [,invert]|       26.72|
        // |     false|     [,invert]|       24.08|
        // |     false|     [,invert]|       21.63|
        // |     false|     [,invert]|       20.67|
        // |     false|     [,invert]|       19.79|
        // |     false|     [,invert]|       20.93|
        // |     false|     [,invert]|       20.78|
        // +----------+--------------+------------+
        // only showing top 10 rows
        // jsonString: String = "Response":"Success","Type":100,"Aggregated":false,"Data":["time":1513209600,"high":25.06,"low":21.87,"open":23.39,"volumefrom":62691.53,"volumeto":1452942.54,"close":23.91,"time":1513296000,"high":29.03,"low":23.88,"open":23.91,"volumefrom":50825.4,"volumeto":1342967.63,"close":25.87,"time":1513382400,"high":28.62,"low":24.53,"open":25.87,"volumefrom":38155.01,"volumeto":1013078.48,"close":28.11,"time":1513468800,"high":28.11,"low":25.93,"open":28.11,"volumefrom":36242.76,"volumeto":979762.25,"close":26.72,"time":1513555200,"high":26.86,"low":23.29,"open":26.72,"volumefrom":46712.69,"volumeto":1186390.62,"close":24.08,"time":1513641600,"high":24.41,"low":21.29,"open":24.08,"volumefrom":65125.17,"volumeto":1449434.45,"close":21.63,"time":1513728000,"high":22.29,"low":20.42,"open":21.63,"volumefrom":64539.45,"volumeto":1372742.27,"close":20.67,"time":1513814400,"high":20.94,"low":19.4,"open":20.67,"volumefrom":61802.62,"volumeto":1244602.57,"close":19.79,"time":1513900800,"high":21.98,"low":19.47,"open":19.79,"volumefrom":80230.91,"volumeto":1656134.49,"close":20.93,"time":1513987200,"high":20.97,"low":20.42,"open":20.93,"volumefrom":42893.35,"volumeto":887428.82,"close":20.78,"time":1514073600,"high":20.97,"low":20.36,"open":20.77,"volumefrom":41294.18,"volumeto":855012.67,"close":20.53,"time":1514160000,"high":20.53,"low":18.67,"open":20.53,"volumefrom":48165.25,"volumeto":929653.57,"close":19.18,"time":1514246400,"high":21.55,"low":18.75,"open":19.18,"volumefrom":46999.33,"volumeto":956924.92,"close":20.91,"time":1514332800,"high":21.57,"low":20.45,"open":20.91,"volumefrom":36759.37,"volumeto":769083.49,"close":20.88,"time":1514419200,"high":20.95,"low":19.7,"open":20.88,"volumefrom":40883.16,"volumeto":828193.82,"close":20.04,"time":1514505600,"high":20.25,"low":19.32,"open":20.04,"volumefrom":43487.34,"volumeto":857520.42,"close":19.58,"time":1514592000,"high":19.77,"low":18.09,"open":19.58,"volumefrom":66161.84,"volumeto":1246949
.13,"close":18.14,"time":1514678400,"high":19.07,"low":18.05,"open":18.14,"volumefrom":48718.02,"volumeto":902419.05,"close":18.68,"time":1514764800,"high":18.7,"low":17.54,"open":18.67,"volumefrom":50703.72,"volumeto":910875.63,"close":17.76,"time":1514851200,"high":18.94,"low":15.25,"open":17.76,"volumefrom":96092.61,"volumeto":1574640.02,"close":17.16,"time":1514937600,"high":17.68,"low":15.62,"open":17.16,"volumefrom":75289.68,"volumeto":1266911.61,"close":16.01,"time":1515024000,"high":16.59,"low":14.43,"open":16.03,"volumefrom":80755.25,"volumeto":1258516.2,"close":16.06,"time":1515110400,"high":18.29,"low":14.54,"open":16.07,"volumefrom":104693.19,"volumeto":1682729.53,"close":17.59,"time":1515196800,"high":17.91,"low":16.25,"open":17.59,"volumefrom":58014.94,"volumeto":975679.49,"close":17.03,"time":1515283200,"high":17.06,"low":14.47,"open":17.03,"volumefrom":64620.79,"volumeto":994739.35,"close":14.49,"time":1515369600,"high":14.5,"low":12.73,"open":14.49,"volumefrom":102880.99,"volumeto":1380565.72,"close":13.2,"time":1515456000,"high":13.21,"low":10.93,"open":13.2,"volumefrom":95751.66,"volumeto":1168583.78,"close":11.18,"time":1515542400,"high":12.06,"low":10.16,"open":11.18,"volumefrom":143351.13,"volumeto":1546032.52,"close":11.95,"time":1515628800,"high":11.96,"low":10.93,"open":11.95,"volumefrom":97380.62,"volumeto":1100658.4,"close":11.66,"time":1515715200,"high":11.8,"low":10.89,"open":11.66,"volumefrom":63382.56,"volumeto":710582.11,"close":10.96,"time":1515801600,"high":11.12,"low":10.24,"open":10.96,"volumefrom":58214.24,"volumeto":625184.97,"close":10.27],"TimeTo":1515801600,"TimeFrom":1513209600,"FirstValueInArray":true,"ConversionType":"type":"invert","conversionSymbol":""
        // spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@3fb8bf89
        // import spark.implicits._
        // import org.apache.spark.sql.functions._
        // df: org.apache.spark.sql.DataFrame = [Aggregated: boolean, ConversionType: struct<conversionSymbol: string, type: string> ... 6 more fields]
        // jsonExplodedDF: org.apache.spark.sql.DataFrame = [Aggregated: boolean, ConversionType: struct<conversionSymbol: string, type: string> ... 1 more field]

<a id="org3751caf"></a>

## Fast

1.  Memory
2.  Results of mappers go to shared memory across the cluster and not to disk.
3.  In reality, Hadoop MapReduce optimized with Tez also keeps values in memory, like Spark.
4.  In reality, if Spark runs out of memory, intermediate results go to disk.

<a id="org50b1dce"></a>

## Run

    ./bin/pyspark --master local[1] * start spark shell.
    ./bin/pyspark-submit 1 2 just args
    ./bin/sparkR --master local * => (r spark shell)

<a id="orga2f7ba5"></a>

## Hdfs

    val textFile = sc.textFile("hdfs://localhost:9000/user/hdfs/somefile.txt")

<a id="org62bf64c"></a>

# Hive

    CREATE TABLE mytable (a INT, b STRING); -- Hive created that table in hadoop!
    DROP TABLE mytable;
    -- Log file - you could just load a file and query it with SQL!

<a id="orgd2ebfa4"></a>

## Install


<a id="orgb21af8f"></a>

### derby

Hive uses Apache Derby, a simple database, for its metastore, so you need to install it.

<a id="orgc513291"></a>

# Oozie

1.  Glue Hadoop jobs together and run them as one big job.
2.  An Oozie workflow is a DAG.
3.  Oozie coordinator jobs are repetitive, scheduled jobs, for example jobs that start each day at 2am.
4.  When a job is done the system calls Oozie to tell it the job has stopped. A workflow is made of control flow nodes and action nodes (nodes of the DAG, not hosts).

    <workflow myapp>
          <map reduce>

                 /---> MR -\
        Start --/           \
                \            /--> join --> finish
                 \---> MR --/

Note that in a DAG we do not go back; it flows in one direction.

Installation and run:

1.  core-site.xml

      <value>hadoop</value> <!-- run oozie as hadoop user -->

- params to workflow.xml
-   oozie (workflow.xml,
-   `oozie job -oozie http://ooziehost:11000/oozie -config … -run` => returns job id.
-   `oozie job -info job:<jobid>`
-   <http://ooziehost:11000/oozie> => oozie web console.

<a id="org3b96987"></a>

# AWS


<a id="orgd5aa938"></a>

## considerations

<a id="orgc2e2713"></a>

### develop

<a id="orgc2464d3"></a>

### deploy

<a id="org6e0a99d"></a>

### iteration time

<a id="org83f1ea0"></a>

### lower scale

<a id="orgd22e813"></a>

### processing time

<a id="orgae7a1c9"></a>

## key technologies

<a id="org2cbd681"></a>

### S3

bucket name:

1.  No underscores: the name has to be a valid hostname so Hadoop can use it in a URL.


1.  ACL

<a id="orgbd4c527"></a>

### redshift

relational database

<a id="orgd99a0ae"></a>

### data pipelines

ETL for data, for example from S3 into Redshift to view results; it can apply a complex series of transformations.  It uses EC2 for the compute power that moves the data.

<a id="orgfba103a"></a>

### kinesis

like kafka

<a id="org854f7ad"></a>

### ec2

<a id="orged854c2"></a>

## resources


<a id="orgdbb663f"></a>

## process


1.  use data-pipelines to ingest data (copy from one place maybe from s3 to s3)
2.  run machine learning algorithm on ec2 or emr.


<a id="org33766ea"></a>

## ec2

create keypair public/private key in order to be able to connect

<a id="org673323a"></a>

## EMR

EMR is Elastic MapReduce. Everything goes through an S3 bucket: we create folders there for the jar to run, for the logs, for the results and for the input data.

<a id="orgcd99fa9"></a>

### s3

EMR uses S3 for input and output data; you need to create buckets for your jar files, input and output.

1.  `bucketname/folder` => for specifying the jar in the aws console.
2.  `s3n://bucket/path` => for hadoop args.
3.  `s3://bucket/path` => for the aws command line tools.


<a id="org49c27e2"></a>

### JobFlow

Then create a job flow: you tell it where your jar file is and the jar run arguments.
If you set keepAlive to no, the EMR cluster is stopped once the job finishes.

<a id="org60a146e"></a>

### Hive

mybucket/scripts/myhive.hql \* => I put there my hive script.
mybucket/data/mydata.csv \* => I put there my data


<a id="org269105e"></a>

### cli

1.  create spark cluster

        aws emr create-cluster --name "Spark cluster" --release-label emr-5.13.0 --applications Name=Spark \
          --ec2-attributes KeyName=tomer-key-pair --instance-type m4.large --instance-count 2 --use-default-roles

2.  list emr clusters

    aws emr list-clusters

3.  terminate clusters

        aws emr terminate-clusters --cluster-ids="j-W25BXM9TCOGX"

<a id="org3d659e3"></a>

## awscli

<a id="org8449a75"></a>

### install

    pip3 install awscli --upgrade --user

then add `/Users/tomer.bendavid/.local/bin` to PATH in `bash_profile`

<a id="org3b84864"></a>

### configure


1.  `aws configure`
2.  take security credentials from [here](<*/security_credential>)
3.  for the default region I entered `us-east-1`


<a id="orgceb6ec1"></a>

# python

<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">

<colgroup>
<col  class="org-left" />

<col  class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">command</th>
<th scope="col" class="org-left">description</th>
</tr>
</thead>

<tbody>
<tr>
<td class="org-left">`conda create --name testenv`</td>
<td class="org-left">&#xa0;</td>
</tr>

<tr>
<td class="org-left">`conda activate testenv`</td>
<td class="org-left">&#xa0;</td>
</tr>

<tr>
<td class="org-left">`conda env list`</td>
<td class="org-left">&#xa0;</td>
</tr>

<tr>
<td class="org-left">`conda install spyder`</td>
<td class="org-left">&#xa0;</td>
</tr>

<tr>
<td class="org-left">`conda activate testenv`</td>
<td class="org-left">&#xa0;</td>
</tr>

<tr>
<td class="org-left">`conda install -c conda-forge pyspark`</td>
<td class="org-left">install pyspark</td>
</tr>
</tbody>
</table>

<a id="orgea8804a"></a>

## urllib2

<a id="org6dfdae5"></a>

### getfile

    import urllib.request
    url = "<>"
    accesslog = urllib.request.urlopen(url).read().decode('utf-8')
    print("accesslog: " + accesslog)

<a id="orge693329"></a>

## matplotlib

<a id="org7750f54"></a>

## pandas

    from pandas import read_csv

<a id="orge4c5a97"></a>

### data

    url = "<>"
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = read_csv(url, names=names)  # names are the column names listed above.







5.  `print(dataset.groupby('class').size())`

6.  `pandas.set_option('expand_frame_repr', False)`

    Don't break wide table output (for example from `.head()`) onto new lines; print it all in one line as a wide table.

<a id="orga52d38d"></a>

### plot

1.  `dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)`

2.  `dataset.hist()`

3.  `scatter_matrix(dataset)`

<a id="orgae299fe"></a>

### build model

1.  validation dataset

    Separate out a validation dataset:
    80% for training data, 20% for validation.

<a id="org468572f"></a>

# Split-out validation dataset

    from sklearn import model_selection
    array = dataset.values
    X = array[:,0:4]  # the four measurement columns
    Y = array[:,4]    # the class column
    validation_size = 0.20
    seed = 7
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

1.  cross validation

    10 fold cross validation for accuracy.

<a id="org65d329c"></a>

# Test options and evaluation metric

    seed = 7
    scoring = 'accuracy'

1.  build choose models

    evaluate 6 models:
    1.  Logistic Regression (LR)
    2.  Linear Discriminant Analysis (LDA)
    3.  K-Nearest Neighbors (KNN).
    4.  Classification and Regression Trees (CART).
    5.  Gaussian Naive Bayes (NB).
    6.  Support Vector Machines (SVM).
    This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms.

<a id="org1385d2a"></a>

# Spot Check Algorithms

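    from sklearn.linear_model import LogisticRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))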

<a id="orgdf79ef3"></a>

# evaluate each model in turn

    results = []
    names = []
    for name, model in models:
        # Recent scikit-learn versions require shuffle=True when random_state is set.
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)


    LR: 0.966667 (0.040825)
    LDA: 0.975000 (0.038188)
    KNN: 0.983333 (0.033333)
    CART: 0.975000 (0.038188)
    NB: 0.975000 (0.053359)
    SVM: 0.981667 (0.025000)

Plot the models comparison:


<a id="org5cc1568"></a>

# Compare Algorithms

    import matplotlib.pyplot as plt
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(111)

1.  make predictions


<a id="org019c684"></a>

# Make predictions on validation dataset

    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
    knn = KNeighborsClassifier()
    knn.fit(X_train, Y_train)
    predictions = knn.predict(X_validation)
    print(accuracy_score(Y_validation, predictions))
    print(confusion_matrix(Y_validation, predictions))
    print(classification_report(Y_validation, predictions))

1.  errors f1 score

    We can see that the accuracy is 0.9 or 90%. The confusion matrix provides an indication of the three errors made. Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).

        [[ 7  0  0]
         [ 0 11  1]
         [ 0  2  9]]
                         precision    recall  f1-score   support
        Iris-setosa           1.00      1.00      1.00         7
        Iris-versicolor       0.85      0.92      0.88        12
        Iris-virginica        0.90      0.82      0.86        11
        avg / total           0.90      0.90      0.90        30
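
The accuracy, precision and recall figures can be recomputed by hand from the confusion matrix. The sketch below assumes the usual scikit-learn layout (rows are the true classes, columns the predicted ones); the variable names are chosen for this note:

```python
confusion = [
    [7, 0, 0],
    [0, 11, 1],
    [0, 2, 9],
]
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total  # the diagonal holds the correct predictions

metrics = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in confusion)  # column sum
    actually_label = sum(confusion[i])                     # row sum
    precision = confusion[i][i] / predicted_as_label
    recall = confusion[i][i] / actually_label
    metrics[label] = (round(precision, 2), round(recall, 2))

print(accuracy)
print(metrics)
```

The three off-diagonal entries (1 + 2) are the three errors mentioned above.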

<a id="orgefae294"></a>

### resources


<a id="org239029a"></a>

# Amazon










