Machine learning introduction (code snippets)



# Table of Contents

1.  [Data](#org498e524)
    1.  [Spreadsheet](#org4c5f4c8)
    2.  [Matrix](#org9cc5358)
        1.  [Sparse matrix](#orga79cfc9)
2.  [Learning](#orgc404075)
    1.  [Types](#orged9701e)
        1.  [Statistical](#orgc164f78)
        2.  [Programming](#org852d7b0)
        3.  [Parametric](#orgd6675a6)
        4.  [Non parametric](#orgb202e12)
        5.  [Supervised](#org58ecc0d)
        6.  [Unsupervised](#org9de217e)
        7.  [Semi supervised](#org5190b3c)
        8.  [Classification vs Regression](#orge3439c9)
    2.  [Errors](#org934b56a)
        1.  [Error Y=f(x) + e](#orgb4f57fc)
        2.  [Bias Variance](#orgd65d546)
        3.  [Overfitting](#orga7d5c68)
3.  [Map reduce](#org2f4f03f)
    1.  [Map](#org57afa8f)
    2.  [Reduce](#orgd61227c)
4.  [Algorithms](#orge247545)
    1.  [Gradient Descent](#org3457ed6)
        1.  [Stochastic Gradient Descent](#orgbb5e342)
5.  [Hadoop](#org10efd28)
    1.  [Install](#orgbf2b7cd)
        1.  [From source](#orgeaf39ad)
        2.  [core-site.xml](#orgc224f48)
        3.  [hdfs-site.xml](#orgca71df5)
        4.  [Format hdfs](#org761f8e6)
        5.  [Start](#org7dc5796)
        6.  [Create hdfs folders](#orge8a6492)
        7.  [start yarn](#orgdf870eb)
    2.  [urls](#org2c5b3ba)
        1.  [hdfs fs](#org0a6a998)
        2.  [yarn](#org47cbd04)
        3.  [jobtracker](#orgd78bb47)
    3.  [run test](#orgedf4232)
        1.  [yarn jar somejob.jar args](#org6e5fa4c)
    4.  [hdfs](#orgcc3a213)
        1.  [roles](#orge848108)
        2.  [commands](#org361d0d3)
        3.  [programming](#org6a085ec)
        4.  [HA](#org970c9c2)
        5.  [misc](#org9dbffbd)
    5.  [debug](#org3faa487)
        1.  [/var/log/hadoop](#org43e4c3e)
        2.  [kill](#orgcbd191d)
    6.  [map reduce](#org46f8617)
        1.  [grep | wc -l](#orgdfbff05)
        2.  [helloworld](#org65e0742)
        3.  [shuffle](#orgccd0f12)
        4.  [reduce](#org37150c2)
        5.  [combiner](#org57576cc)
        6.  [streaming](#org08fdac5)
        7.  [pipes](#org5ecd28d)
    7.  [YARN](#org50d92e1)
6.  [Spark](#orga8dadcf)
    1.  [General Ideas](#orgc599f8f)
7.  [Code Examples](#orga9a09c0)
    1.  [Libraries](#orga432155)
        1.  [Graphx](#org56005c1)
    2.  [Operations](#orga51877b)
        1.  [Transformations](#org4e1743e)
        2.  [Actions](#org2f9c501)
    3.  [Data structures](#org6d83be5)
        1.  [RDD](#orga784b50)
        2.  [DF](#org93ce8c2)
    4.  [Fast](#org3751caf)
    5.  [Run](#org50b1dce)
    6.  [Hdfs](#orga2f7ba5)
8.  [Hive](#org62bf64c)
    1.  [Install](#orgd2ebfa4)
        1.  [derby](#orgb21af8f)
9.  [Oozie](#orgc513291)
10. [AWS](#org3b96987)
    1.  [considerations](#orgd5aa938)
        1.  [develop](#orgc2e2713)
        2.  [deploy](#orgc2464d3)
        3.  [iteration time](#org6e0a99d)
        4.  [lower scale](#org83f1ea0)
        5.  [processing time](#orgd22e813)
    2.  [key technologies](#orgae7a1c9)
        1.  [S3](#org2cbd681)
        2.  [redshift](#orgbd4c527)
        3.  [data pipelines](#orgd99a0ae)
        4.  [kinesis](#orgfba103a)
        5.  [ec2](#org854f7ad)
    3.  [resources](#orged854c2)
    4.  [process](#orgdbb663f)
    5.  [ec2](#org33766ea)
    6.  [EMR](#org673323a)
        1.  [s3](#orgcd99fa9)
        2.  [JobFlow](#org49c27e2)
        3.  [Hive](#org60a146e)
        4.  [cli](#org269105e)
    7.  [awscli](#org3d659e3)
        1.  [install](#org8449a75)
        2.  [configure](#org3b84864)
11. [python](#orgceb6ec1)
    1.  [urllib2](#orgea8804a)
        1.  [getfile](#org6dfdae5)
    2.  [matplotlib](#orge693329)
    3.  [pandas](#org7750f54)
        1.  [data](#orge4c5a97)
        2.  [plot](#orga52d38d)
        3.  [build model](#orgae299fe)
12. [Split-out validation dataset](#org468572f)
13. [Test options and evaluation metric](#org65d329c)
14. [Spot Check Algorithms](#org1385d2a)
15. [evaluate each model in turn](#orgdf79ef3)
16. [Compare Algorithms](#org5cc1568)
17. [Make predictions on validation dataset](#org019c684)
        1.  [resources](#orgefae294)
18. [Amazon](#org239029a)



<a id="org498e524"></a>

# Data


<a id="org4c5f4c8"></a>

## Spreadsheet

Think of the data as a spreadsheet, i.e. as a table.


<a id="org9cc5358"></a>

## Matrix

rows: observations, our data. columns: features.  Get used to it.


<a id="orga79cfc9"></a>

### Sparse matrix

a matrix most of whose entries are zeros
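
A minimal Python sketch of the idea (assuming numpy and scipy are installed; the matrix here is illustrative):

```python
# Sketch: store a mostly-zero matrix sparsely, keeping only the non-zero entries.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((1000, 1000))
dense[0, 1] = 3.0
dense[42, 7] = 5.0

sparse = csr_matrix(dense)      # only the 2 non-zero entries are stored
print(sparse.nnz)               # -> 2
print(sparse.toarray()[0, 1])   # -> 3.0
```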


<a id="orgc404075"></a>

# Learning


<a id="orged9701e"></a>

## Types


<a id="orgc164f78"></a>

### Statistical

Output = f(input) \* => f(inputVariable) or f(inputVector), or f(independent variables) or Y = F(X) // X1,X2,..


<a id="org852d7b0"></a>

### Programming

OutputAttributes = Program(InputAttributes) or Program(InputFeatures) or Model = Algorithm(Data)


<a id="orgd6675a6"></a>

### Parametric

No matter how much data you throw on it, it will still need these parameters like a line \`Y = ax + b\` (logistic regression, linear discriminant analysis, perceptron)


<a id="orgb202e12"></a>

### Non parametric

Makes no strong assumption about the form of the mapping function; the more data you throw at it, the more complex the model can become (k-nearest neighbors, decision trees, support vector machines).


<a id="org58ecc0d"></a>

### Supervised

You have a teacher who knows the answers: classification, regression


<a id="org9de217e"></a>

### Unsupervised

No teacher, clustering, association


<a id="org5190b3c"></a>

### Semi supervised

Only some of the data is labeled (has a teacher)


<a id="orge3439c9"></a>

### Classification vs Regression

classification(input) => spam/not-spam (categorical outcome)<br />regression(input) => bitcoin price (continuous outcome)


<a id="org934b56a"></a>

## Errors


<a id="orgb4f57fc"></a>

### Error Y=f(x) + e

Y = f(X) + e, where e is an error term \* => You learn (approximate) the function f!


<a id="orgd65d546"></a>

### Bias Variance

Bias error (model assumptions), variance error, irreducible error. Increasing bias reduces variance; increasing variance reduces bias (the bias-variance trade-off).
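
For reference, the standard decomposition of the expected squared error at a point x (a textbook identity, not from the original notes) is:

    E[(Y - f̂(x))^2] = Bias[f̂(x)]^2 + Var[f̂(x)] + Var(e)

where Var(e) is the irreducible error.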


<a id="orga7d5c68"></a>

### Overfitting

Protect against it by resampling to estimate model accuracy: hold back a validation dataset, use cross validation.
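
A minimal scikit-learn sketch of cross validation (scikit-learn is assumed to be installed; the model and dataset are just placeholders):

```python
# Sketch: estimate model accuracy with 5-fold cross validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```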


<a id="org2f4f03f"></a>

# Map reduce

    grep something | wc -l * => grep is the map, wc -l is the reduce!

Based on simple [key, value] pairs.
Moving computation is cheaper than moving data; our data is big, ain't it?


<a id="org57afa8f"></a>

## Map

List(input) => List(output) \* => like grep


<a id="orgd61227c"></a>

## Reduce

List(input) => Output(value) \* => like wc -l
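
A minimal Python sketch of both ideas together, counting words with a map step that emits pairs and a reduce step that collapses them into a single value (the input lines are illustrative):

```python
# Sketch: map emits (word, 1) pairs, reduce folds them into counts.
from functools import reduce

lines = ["banana apple", "banana banana"]

# map: List(input) => List(output), like grep
pairs = [(word, 1) for line in lines for word in line.split()]

# reduce: List(input) => Output(value), like wc -l
def add_pair(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

print(reduce(add_pair, pairs, {}))   # {'banana': 3, 'apple': 1}
```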


<a id="orge247545"></a>

# Algorithms


<a id="org3457ed6"></a>

## Gradient Descent

Almost every machine learning algorithm uses optimisation at its core, optimising the target function.  You can get stuck in a local minimum.  Start with `coefficient = 0.0`.  Evaluate the cost: `cost = evaluate(f(coefficient))`.  Update the coefficient downhill using the derivative: `coefficient = coefficient - (alpha * delta)`, where alpha is the learning rate.
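
A minimal sketch of this update loop on a one-dimensional quadratic cost (the cost function and numbers are made up for illustration):

```python
# Sketch: gradient descent on cost(c) = (c - 3)^2, whose minimum is at c = 3.
def derivative(c):
    return 2 * (c - 3)

coefficient = 0.0   # start at 0.0
alpha = 0.1         # learning rate
for _ in range(100):
    delta = derivative(coefficient)
    coefficient = coefficient - (alpha * delta)   # step downhill

print(coefficient)  # close to 3.0
```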


<a id="orgbb5e342"></a>

### Stochastic Gradient Descent

When you have large amounts of data, the update to the coefficients is done for each training instance rather than in batch; because we visit the data in random order we move quickly.
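
A minimal sketch of the per-instance update for a simple linear model y ≈ b0 + b1*x (the data and learning rate are illustrative):

```python
# Sketch: stochastic gradient descent, one coefficient update per training instance.
import random

data = [(1, 3), (2, 5), (3, 7), (4, 9)]   # roughly y = 2x + 1
b0, b1, alpha = 0.0, 0.0, 0.05

for epoch in range(200):
    random.shuffle(data)                  # random order, hence "stochastic"
    for x, y in data:
        error = (b0 + b1 * x) - y         # error for this single instance
        b0 = b0 - alpha * error           # update immediately,
        b1 = b1 - alpha * error * x       # not after a full batch

print(b0, b1)   # approaches 1.0 and 2.0
```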


<a id="org10efd28"></a>

# Hadoop


<a id="orgbf2b7cd"></a>

## Install

In general for Hadoop, Hive, and Pig installations you download the tar.gz, set environment variables for its home, and add folders in HDFS if needed.


<a id="orgeaf39ad"></a>

### From source

<https://www.safaribooksonline.com/library/view/hadoop-and-spark/9780134770871/HASF_01_02_02_01.html>
extract the hadoop tar.gz, make sure JAVA_HOME is in the path and HADOOP_HOME is configured, add yarn, hdfs, and mapred users, make directories: /var/data/hadoop/hdfs/[nn,snn], and a log directory.


<a id="orgc224f48"></a>

### core-site.xml

fs.default.name: hdfs://localhost:9000 => sets the hdfs host and port.


<a id="orgca71df5"></a>

### hdfs-site.xml

hdfs parameters, e.g. dfs.replication: 1, dfs.namenode.name.dir: /var/data/hadoop/hdfs/nn, &#x2026;


<a id="org761f8e6"></a>

### Format hdfs

    su - hdfs
    cd /opt/hadoop-2.8.1/bin
    ./hdfs namenode -format


<a id="org7dc5796"></a>

### Start

    cd /opt/hadoop-2.8.1/sbin
    ./hadoop-daemon.sh start namenode
    ./hadoop-daemon.sh start secondarynamenode
    ./hadoop-daemon.sh start datanode
    jps * => java process status; the above are all java processes.


<a id="orge8a6492"></a>

### Create hdfs folders

    hdfs dfs -mkdir -p /mr-history/tmp /mr-history/done
    hdfs dfs -chown -R yarn:hadoop /mr-history


<a id="orgdf870eb"></a>

### start yarn

```bash
su - yarn
./yarn-daemon.sh start resourcemanager
./yarn-daemon.sh start nodemanager
./mr-jobhistory-daemon.sh start historyserver
jps
```


<a id="org2c5b3ba"></a>

## urls


<a id="org0a6a998"></a>

### hdfs fs

1.  <http://localhost:50070>

    HDFS NameNode web UI (browse the file system)


<a id="org47cbd04"></a>

### yarn

1.  <http://localhost:8088>

    YARN ResourceManager web UI


<a id="orgd78bb47"></a>

### jobtracker

1.  <http://headnode:50030>


<a id="orgedf4232"></a>

## run test


<a id="org6e5fa4c"></a>

### yarn jar somejob.jar args

run a test mr jar with yarn


<a id="orgcc3a213"></a>

## hdfs


<a id="orge848108"></a>

### roles

1.  namenode

    like a **traffic cop**, telling us where to find or write data; it also handles failures of data nodes: if a data node does not report back with its status, it times out and the namenode will remove it. We see one namespace across the whole data.  The client contacts the namenode and then the datanode returned by the namenode for the actual data.
    
    1.  inmemory
    
        stores HDFS metadata in memory; at startup it reads it from the file `fsimage`. Writes are added to a log file; on startup it merges the log with fsimage.
    
    2.  secondary namenode
    
        1.  bad title
        
            1.  checkpoint node
            
                better named checkpoint node because it merges the edits log into the fsimage while the namenode is running, so startup will be fast.
    
    3.  backup node
    
        same work as the checkpoint node but it is synchronized to the namenode using a real-time stream from the namenode.  Still no redundancy with this.

2.  datanode

3.  hdfs-client

    1.  calls namenode then datanode
    
        you do operations on the hdfsClient; it does all the work of communicating with the namenode and then sending the operations to the correct data nodes.


<a id="org361d0d3"></a>

### commands

1.  hdfs dfsadmin -report

2.  dfs -put file.txt

    hdfs dfs -put war-and-peace.txt

3.  dfs -cp file1.txt file2.txt

    copy a file inside hdfs

4.  mount hdfs /mnt/hdfs

    as a local file system!


<a id="org6a085ec"></a>

### programming

1.  java

    ```java
    // Same API style as the Java FileSystem (imports added to make this compile).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.*;
    
    Configuration conf = new Configuration();
    
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
    
    FileSystem fileSystem = FileSystem.get(conf);
    
    fileSystem.exists(new Path("/users/tomer/test.txt"));
    
    // Create a new file and write data to it (source and path are example values).
    String source = "sometextfile.txt";
    Path path = new Path("/user/tomer/sometextfile.txt");
    FSDataOutputStream out = fileSystem.create(path);
    InputStream in = new BufferedInputStream(new FileInputStream(new File(source)));
    byte[] b = new byte[4096];
    int numBytes = 0;
    while ((numBytes = in.read(b)) > 0)
      out.write(b, 0, numBytes);
    in.close();
    out.close();
    ```
    
    1.  compile
    
        ```bash
        echo "Main-Class: org.myorg.HDFSClient" > manifest.txt
        javac -classpath /usr/lib/hadoop/hadoop-core.jar -d HDFSClient-classes HDFSClient.java   # Note we needed to include the hadoop core jar.
        jar -cvfe HDFSClient.jar org.myorg.HDFSClient -C HDFSClient-classes .
        hadoop jar ./HDFSClient.jar add sometextfile.txt /user/tomer   # run with program arguments.
        ```
    
    2.  classpath
    
        export CLASSPATH=$(hadoop classpath)


<a id="org970c9c2"></a>

### HA

1.  namenode

    1.  standby namenode
    
        acting like checkpoint node so it has the fsimage file, it will take over in case of failure.
    
    2.  federation
    
        Break the namespace across multiple namenodes:
        namenode1: /research/marketing
        namenode2: /data/project
    
    3.  snapshots
    
        read-only point-in-time copies of the file system.  Can be of a subtree.  No data is copied, only the block list and file sizes.  Think of a snapshot of a file directory.  Can be done on a daily basis and does not slow things down.


<a id="org9dbffbd"></a>

### misc

1.  nfsv3

    The NFS gateway allows you to access hdfs as if it were a local file system; it's still not random access but it's convenient.

2.  host:5700

    web gui for nfs is at <http://host:5700>


<a id="org3faa487"></a>

## debug


<a id="org43e4c3e"></a>

### /var/log/hadoop

these are the logs on the headnode; you can also ssh to worker nodes and similarly look at /var/log/hadoop/mapred, where you will see the task tracker logs.


<a id="orgcbd191d"></a>

### kill

```bash
hadoop job -list
hadoop job -kill job_2016982347928_0042
```


<a id="org46f8617"></a>

## map reduce

    map    => banana, 1
              banana, 1
              banana, 1
    reduce => banana, 3


<a id="orgdfbff05"></a>

### grep | wc -l

\`grep "Samuel" somebook.txt | wc -l\`
grep => map
wc -l => reduce


<a id="org65e0742"></a>

### helloworld

mapper: a string tokenizer that emits (word, 1); reducer: sum += value.  In addition you write the "driver" that is going to run the mapper and reducer: you say which class is the mapper with conf.setMapperClass(MapClass.class); you also call conf.setCombinerClass and conf.setReducerClass.
`hadoop jar wordcount.jar org.myorg.WordCount /user/myuser/inputdir /user/myuser/outputdir`


<a id="orgccd0f12"></a>

### shuffle

Shuffle is the only step where we have communication (transfer of data) between nodes.

![shuffle](https://www.todaysoftmag.com/images/articles/tsm33/large/a11.png)


<a id="org37150c2"></a>

### reduce

Reduce can run on multiple hosts, depending on the shuffle: the shuffle puts the same keys on the same hosts, so a reducer can work on a group of identical keys and know it has all of them on the same host.


<a id="org57576cc"></a>

### combiner

Instead of the mapper saying "I found earth,1 and earth,1", the combiner has the mapper report earth,2 from a given node, optimizing the mapper output so the reducer has less work.
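
A minimal Python sketch of the effect: each node's (word, 1) pairs are pre-summed locally before being sent to the reducer (the node contents are made up):

```python
# Sketch: a combiner locally sums (word, 1) pairs per node before the shuffle.
from collections import Counter

node_outputs = [
    [("earth", 1), ("earth", 1), ("mars", 1)],   # mapper output on node 1
    [("earth", 1), ("mars", 1)],                 # mapper output on node 2
]

# Combiner: per-node partial sums, e.g. ("earth", 2) instead of two ("earth", 1).
combined = []
for pairs in node_outputs:
    partial = Counter()
    for word, n in pairs:
        partial[word] += n
    combined.append(partial)

# Reducer: merge the already-reduced partial counts.
total = Counter()
for partial in combined:
    total.update(partial)
print(total)   # Counter({'earth': 3, 'mars': 2})
```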


<a id="org08fdac5"></a>

### streaming

**Streaming interface for hadoop jobs**

You can write a mapper.py that expects stdin and just run it locally, and amazingly you can also run it on hadoop.  In the java map reduce interface we got line by line; here we get stdin and can do anything we want. <https://www.safaribooksonline.com/library/view/hadoop-and-spark/9780134770871/HASF_01_05_01.html?autoStart=True>

Then you run it with:

```bash
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.1.2.21.jar -file ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py ...
```
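
A minimal sketch of what such a mapper.py and reducer.py pair could look like for word count (illustrative, not taken from the course): the mapper emits tab-separated (word, 1) lines, and the reducer relies on Hadoop sorting the keys between the two steps.

```python
# mapper.py (sketch): read stdin, emit "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py (sketch): input arrives sorted by key, so each word's pairs are contiguous.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```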


<a id="org5ecd28d"></a>

### pipes

**Pipes interface to mapreduce**

it's a clean interface to do map reduce.


<a id="org50d92e1"></a>

## YARN

YARN does not care that what it runs is map reduce; it could be any job.  The previous job manager and task manager ran only map reduce: the jobTracker manages jobs and the taskTracker runs on the local nodes.


<a id="orga8dadcf"></a>

# Spark


<a id="orgc599f8f"></a>

## General Ideas

<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">


<colgroup>
<col  class="org-left" />

<col  class="org-left" />
</colgroup>
<tbody>
<tr>
<td class="org-left">Idea</td>
<td class="org-left">Description</td>
</tr>


<tr>
<td class="org-left">Transformation</td>
<td class="org-left">`transformation(RDD): RDD`</td>
</tr>


<tr>
<td class="org-left">Action</td>
<td class="org-left">`action(RDD): Value`</td>
</tr>
</tbody>
</table>
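
A minimal PySpark sketch of the distinction (assuming a local PySpark installation; the data is illustrative): transformations build new RDDs lazily, while actions return a value and actually run the job.

```python
# Sketch: transformation(RDD): RDD  vs  action(RDD): Value
from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-vs-actions")

rdd = sc.parallelize(["banana apple", "banana banana"])

words  = rdd.flatMap(lambda line: line.split())    # transformation: RDD -> RDD (lazy)
pairs  = words.map(lambda w: (w, 1))               # transformation: RDD -> RDD (lazy)
counts = pairs.reduceByKey(lambda a, b: a + b)     # transformation: RDD -> RDD (lazy)

print(counts.collect())   # action: RDD -> value, triggers the computation
print(words.count())      # action: RDD -> value

sc.stop()
```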


<a id="orga9a09c0"></a>

# Code Examples

<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">


<colgroup>
<col  class="org-left" />

<col  class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">action</th>
<th scope="col" class="org-left">example</th>
</tr>
</thead>

<tbody>
<tr>
<td class="org-left">Read text file</td>
<td class="org-left">`sc.textFile("file..")`</td>
</tr>


<tr>
<td class="org-left">Count</td>
<td class="org-left">`rdd.count()`</td>
</tr>
</tbody>
</table>


<a id="orga432155"></a>

## Libraries


<a id="org56005c1"></a>

### Graphx

Spark has a library for graph computations, GraphX (in addition to MLlib).


<a id="orga51877b"></a>

## Operations


<a id="org4e1743e"></a>

### Transformations


<a id="org2f9c501"></a>

### Actions


<a id="org6d83be5"></a>

## Data structures


<a id="orga784b50"></a>

### RDD

1.  Blind data

    An RDD has no schema; Spark treats the records as opaque data (unlike a DataFrame).


<a id="org93ce8c2"></a>

### DF


1.  Scheme

    Think of it as distributed database table.

2.  Read json element

        object SparkDFOnlineJson {

          def main(args: Array[String]): Unit = {

            val jsonString = scala.io.Source.fromURL("https://min-api.cryptocompare.com/data/histoday?fsym=BTC&tsym=ETH&limit=30&aggregate=1&e=CCCAGG").mkString

            val spark = org.apache.spark.sql.SparkSession.builder().appName("someapp").master("local[*]").getOrCreate()

            import spark.implicits._
            import org.apache.spark.sql.functions._
            val df = spark.read.json(Seq(jsonString).toDS())

            df.show()

            df.take(10).foreach(println)
            df.printSchema()

            df.select($"Data.close".as("close_price")).show(2) // <-- HERE reading Data.close from the json!

            val jsonExplodedDF = df.select($"Aggregated", $"ConversionType", explode($"Data").as("prices")) // <-- HERE exploding the Data array!
            jsonExplodedDF.printSchema()
            jsonExplodedDF.select($"Aggregated", $"ConversionType", $"prices".getItem("close")).show(10) // Then getItem instead of explode to objects!!
          }
        }
        
        // +----------+--------------+--------------------+-----------------+--------+----------+----------+----+
        // |Aggregated|ConversionType|                Data|FirstValueInArray|Response|  TimeFrom|    TimeTo|Type|
        // +----------+--------------+--------------------+-----------------+--------+----------+----------+----+
        // |     false|     [,invert]|[[23.91,25.06,21....|             true| Success|1513209600|1515801600| 100|
        // +----------+--------------+--------------------+-----------------+--------+----------+----------+----+
        
        // [false,[,invert],WrappedArray([23.91,25.06,21.87,23.39,1513209600,62691.53,1452942.54], [25.87,29.03,23.88,23.91,1513296000,50825.4,1342967.63], [28.11,28.62,24.53,25.87,1513382400,38155.01,1013078.48], [26.72,28.11,25.93,28.11,1513468800,36242.76,979762.25], [24.08,26.86,23.29,26.72,1513555200,46712.69,1186390.62], [21.63,24.41,21.29,24.08,1513641600,65125.17,1449434.45], [20.67,22.29,20.42,21.63,1513728000,64539.45,1372742.27], [19.79,20.94,19.4,20.67,1513814400,61802.62,1244602.57], [20.93,21.98,19.47,19.79,1513900800,80230.91,1656134.49], [20.78,20.97,20.42,20.93,1513987200,42893.35,887428.82], [20.53,20.97,20.36,20.77,1514073600,41294.18,855012.67], [19.18,20.53,18.67,20.53,1514160000,48165.25,929653.57], [20.91,21.55,18.75,19.18,1514246400,46999.33,956924.92], [20.88,21.57,20.45,20.91,1514332800,36759.37,769083.49], [20.04,20.95,19.7,20.88,1514419200,40883.16,828193.82], [19.58,20.25,19.32,20.04,1514505600,43487.34,857520.42], [18.14,19.77,18.09,19.58,1514592000,66161.84,1246949.13], [18.68,19.07,18.05,18.14,1514678400,48718.02,902419.05], [17.76,18.7,17.54,18.67,1514764800,50703.72,910875.63], [17.16,18.94,15.25,17.76,1514851200,96092.61,1574640.02], [16.01,17.68,15.62,17.16,1514937600,75289.68,1266911.61], [16.06,16.59,14.43,16.03,1515024000,80755.25,1258516.2], [17.59,18.29,14.54,16.07,1515110400,104693.19,1682729.53], [17.03,17.91,16.25,17.59,1515196800,58014.94,975679.49], [14.49,17.06,14.47,17.03,1515283200,64620.79,994739.35], [13.2,14.5,12.73,14.49,1515369600,102880.99,1380565.72], [11.18,13.21,10.93,13.2,1515456000,95751.66,1168583.78], [11.95,12.06,10.16,11.18,1515542400,143351.13,1546032.52], [11.66,11.96,10.93,11.95,1515628800,97380.62,1100658.4], [10.96,11.8,10.89,11.66,1515715200,63382.56,710582.11], [10.27,11.12,10.24,10.96,1515801600,58214.24,625184.97]),true,Success,1513209600,1515801600,100]
        // root
        //  |-- Aggregated: boolean (nullable = true)
        //  |-- ConversionType: struct (nullable = true)
        //  |    |-- conversionSymbol: string (nullable = true)
        //  |    |-- type: string (nullable = true)
        //  |-- Data: array (nullable = true)
        //  |    |-- element: struct (containsNull = true)
        //  |    |    |-- close: double (nullable = true)
        //  |    |    |-- high: double (nullable = true)
        //  |    |    |-- low: double (nullable = true)
        //  |    |    |-- open: double (nullable = true)
        //  |    |    |-- time: long (nullable = true)
        //  |    |    |-- volumefrom: double (nullable = true)
        //  |    |    |-- volumeto: double (nullable = true)
        //  |-- FirstValueInArray: boolean (nullable = true)
        //  |-- Response: string (nullable = true)
        //  |-- TimeFrom: long (nullable = true)
        //  |-- TimeTo: long (nullable = true)
        //  |-- Type: long (nullable = true)
        
        // +--------------------+
        // |         close_price|
        // +--------------------+
        // |[23.91, 25.87, 28...|
        // +--------------------+
        
        // root
        //  |-- Aggregated: boolean (nullable = true)
        //  |-- ConversionType: struct (nullable = true)
        //  |    |-- conversionSymbol: string (nullable = true)
        //  |    |-- type: string (nullable = true)
        //  |-- prices: struct (nullable = true)
        //  |    |-- close: double (nullable = true)
        //  |    |-- high: double (nullable = true)
        //  |    |-- low: double (nullable = true)
        //  |    |-- open: double (nullable = true)
        //  |    |-- time: long (nullable = true)
        //  |    |-- volumefrom: double (nullable = true)
        //  |    |-- volumeto: double (nullable = true)
        
        // +----------+--------------+------------+
        // |Aggregated|ConversionType|prices.close|
        // +----------+--------------+------------+
        // |     false|     [,invert]|       23.91|
        // |     false|     [,invert]|       25.87|
        // |     false|     [,invert]|       28.11|
        // |     false|     [,invert]|       26.72|
        // |     false|     [,invert]|       24.08|
        // |     false|     [,invert]|       21.63|
        // |     false|     [,invert]|       20.67|
        // |     false|     [,invert]|       19.79|
        // |     false|     [,invert]|       20.93|
        // |     false|     [,invert]|       20.78|
        // +----------+--------------+------------+
        // only showing top 10 rows
        
        // jsonString: String = "Response":"Success","Type":100,"Aggregated":false,"Data":["time":1513209600,"high":25.06,"low":21.87,"open":23.39,"volumefrom":62691.53,"volumeto":1452942.54,"close":23.91,"time":1513296000,"high":29.03,"low":23.88,"open":23.91,"volumefrom":50825.4,"volumeto":1342967.63,"close":25.87,"time":1513382400,"high":28.62,"low":24.53,"open":25.87,"volumefrom":38155.01,"volumeto":1013078.48,"close":28.11,"time":1513468800,"high":28.11,"low":25.93,"open":28.11,"volumefrom":36242.76,"volumeto":979762.25,"close":26.72,"time":1513555200,"high":26.86,"low":23.29,"open":26.72,"volumefrom":46712.69,"volumeto":1186390.62,"close":24.08,"time":1513641600,"high":24.41,"low":21.29,"open":24.08,"volumefrom":65125.17,"volumeto":1449434.45,"close":21.63,"time":1513728000,"high":22.29,"low":20.42,"open":21.63,"volumefrom":64539.45,"volumeto":1372742.27,"close":20.67,"time":1513814400,"high":20.94,"low":19.4,"open":20.67,"volumefrom":61802.62,"volumeto":1244602.57,"close":19.79,"time":1513900800,"high":21.98,"low":19.47,"open":19.79,"volumefrom":80230.91,"volumeto":1656134.49,"close":20.93,"time":1513987200,"high":20.97,"low":20.42,"open":20.93,"volumefrom":42893.35,"volumeto":887428.82,"close":20.78,"time":1514073600,"high":20.97,"low":20.36,"open":20.77,"volumefrom":41294.18,"volumeto":855012.67,"close":20.53,"time":1514160000,"high":20.53,"low":18.67,"open":20.53,"volumefrom":48165.25,"volumeto":929653.57,"close":19.18,"time":1514246400,"high":21.55,"low":18.75,"open":19.18,"volumefrom":46999.33,"volumeto":956924.92,"close":20.91,"time":1514332800,"high":21.57,"low":20.45,"open":20.91,"volumefrom":36759.37,"volumeto":769083.49,"close":20.88,"time":1514419200,"high":20.95,"low":19.7,"open":20.88,"volumefrom":40883.16,"volumeto":828193.82,"close":20.04,"time":1514505600,"high":20.25,"low":19.32,"open":20.04,"volumefrom":43487.34,"volumeto":857520.42,"close":19.58,"time":1514592000,"high":19.77,"low":18.09,"open":19.58,"volumefrom":66161.84,"volumeto":1246949.13,"close":18.14,"time":1514678400,"high":19.07,"low":18.05,"open":18.14,"volumefrom":48718.02,"volumeto":902419.05,"close":18.68,"time":1514764800,"high":18.7,"low":17.54,"open":18.67,"volumefrom":50703.72,"volumeto":910875.63,"close":17.76,"time":1514851200,"high":18.94,"low":15.25,"open":17.76,"volumefrom":96092.61,"volumeto":1574640.02,"close":17.16,"time":1514937600,"high":17.68,"low":15.62,"open":17.16,"volumefrom":75289.68,"volumeto":1266911.61,"close":16.01,"time":1515024000,"high":16.59,"low":14.43,"open":16.03,"volumefrom":80755.25,"volumeto":1258516.2,"close":16.06,"time":1515110400,"high":18.29,"low":14.54,"open":16.07,"volumefrom":104693.19,"volumeto":1682729.53,"close":17.59,"time":1515196800,"high":17.91,"low":16.25,"open":17.59,"volumefrom":58014.94,"volumeto":975679.49,"close":17.03,"time":1515283200,"high":17.06,"low":14.47,"open":17.03,"volumefrom":64620.79,"volumeto":994739.35,"close":14.49,"time":1515369600,"high":14.5,"low":12.73,"open":14.49,"volumefrom":102880.99,"volumeto":1380565.72,"close":13.2,"time":1515456000,"high":13.21,"low":10.93,"open":13.2,"volumefrom":95751.66,"volumeto":1168583.78,"close":11.18,"time":1515542400,"high":12.06,"low":10.16,"open":11.18,"volumefrom":143351.13,"volumeto":1546032.52,"close":11.95,"time":1515628800,"high":11.96,"low":10.93,"open":11.95,"volumefrom":97380.62,"volumeto":1100658.4,"close":11.66,"time":1515715200,"high":11.8,"low":10.89,"open":11.66,"volumefrom":63382.56,"volumeto":710582.11,"close":10.96,"time":1515801600,"high":11.12,"low":10.24,"open":10.96,"volumef
rom":58214.24,"volumeto":625184.97,"close":10.27],"TimeTo":1515801600,"TimeFrom":1513209600,"FirstValueInArray":true,"ConversionType":"type":"invert","conversionSymbol":""
        // spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@3fb8bf89
        // import spark.implicits._
        // import org.apache.spark.sql.functions._
        // df: org.apache.spark.sql.DataFrame = [Aggregated: boolean, ConversionType: struct<conversionSymbol: string, type: string> ... 6 more fields]
        // jsonExplodedDF: org.apache.spark.sql.DataFrame = [Aggregated: boolean, ConversionType: struct<conversionSymbol: string, type: string> ... 1 more field]


<a id="org3751caf"></a>

## Fast

1.  In memory.
2.  Results of mappers go to shared memory across the cluster and not to disk.
3.  In reality Hadoop MapReduce is optimized with Tez, which means it keeps values in memory like Spark.
4.  In reality, if Spark runs out of memory, intermediate results go to disk.


<a id="org50b1dce"></a>

## Run

    ./bin/pyspark --master local[1] * => start the pyspark shell.
    ./bin/spark-submit myprog.py 1 2 * => 1 2 are just args.
    ./bin/sparkR --master local * => (R spark shell)


<a id="orga2f7ba5"></a>

## Hdfs

    val textFile = sc.textFile("hdfs://localhost:9000/user/hdfs/somefile.txt")
    textFile.count


<a id="org62bf64c"></a>

# Hive

    CREATE TABLE mytable (a INT, b STRING); -- Hive created that table in hadoop!
    SHOW TABLES;
    DROP TABLE mytable;
    -- Log file - you could just load a file and query it with SQL!
    CREATE TABLE mylog(t1 STRING, t2 STRING, ...) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
    LOAD DATA LOCAL INPATH 'mylog.log' OVERWRITE INTO TABLE mylog;


<a id="orgd2ebfa4"></a>

## Install

<https://www.safaribooksonline.com/library/view/hadoop-and-spark/9780134770871/HASF_01_02_02_02.html>


<a id="orgb21af8f"></a>

### derby

Hive uses Apache Derby (a simple database) for its metastore, so you need to install it.


<a id="orgc513291"></a>

# Oozie

1.  Glue hadoop jobs and run them as one big job.
2.  An Oozie workflow is a DAG.
3.  Oozie coordinator jobs: repetitive, scheduled jobs, e.g. start each day at 2am.
4.  When a job is done the system calls oozie to tell it the job has stopped; control flow nodes and action nodes (not hosts) form the DAG.

    <workflow myapp>
      <start>
        <action>
          <map reduce>
    </workflow>

             /---> MR ---\
    Start ---+            +---> join ---> finish
             \---> MR ---/

Note that in a DAG we do not go back; it's one direction.

Installation and run:

1.  core-site.xml

    <property>
      <name>hadoop.proxyuser.oozie.groups</name>
      <value>hadoop</value> <!-- let the oozie user act on behalf of users in the hadoop group -->
    </property>

-   job.properties: params to workflow.xml
-   oozie (workflow.xml, job.properties)
-   `oozie job -run -oozie http://ooziehost:11000/oozie -config job.properties` => returns a job id.
-   `oozie job -info <jobid>`
-   <http://ooziehost:11000/oozie> => oozie web console.


<a id="org3b96987"></a>

# AWS


<a id="orgd5aa938"></a>

## considerations


<a id="orgc2e2713"></a>

### develop


<a id="orgc2464d3"></a>

### deploy


<a id="org6e0a99d"></a>

### iteration time


<a id="org83f1ea0"></a>

### lower scale


<a id="orgd22e813"></a>

### processing time


<a id="orgae7a1c9"></a>

## key technologies


<a id="org2cbd681"></a>

### S3

Bucket name:

1.  no underscores; it has to be a valid hostname for hadoop usage in a URL

1.  ACL


<a id="orgbd4c527"></a>

### redshift

a relational data warehouse (queried with SQL)


<a id="orgd99a0ae"></a>

### data pipelines

ETL for data, for example from S3 into Redshift to view results; it can apply a complex series of transformations.  It uses EC2 for the compute power to do the moving of the data.


<a id="orgfba103a"></a>

### kinesis

like kafka


<a id="org854f7ad"></a>

### ec2


<a id="orged854c2"></a>

## resources

<https://www.safaribooksonline.com/library/view/learn-how-to/9781491985632/video312545.html>


<a id="orgdbb663f"></a>

## process

1.  use data-pipelines to ingest data (copy from one place to another, maybe from s3 to s3)
2.  run the machine learning algorithm on ec2 or emr.


<a id="org33766ea"></a>

## ec2

create a public/private keypair in order to be able to connect.


<a id="org673323a"></a>

## EMR

Elastic MapReduce.  Everything goes through an S3 bucket: we create folders there for the jar to run, for the logs, for the results, and for the input data.

Resources:

1.  <https://www.youtube.com/watch?v=cAZur5maWZE&index=3&list=PLB5E99B925DBE79FF>


<a id="orgcd99fa9"></a>

### s3

EMR uses S3 for input and output data: you need to create buckets to put your jar files, input, and output.

1.  bucketname/folder for specifying the jar to the aws console
2.  s3n://bucket/path => for hadoop args
3.  s3://bucket/path => for aws command line tools.


<a id="org49c27e2"></a>

### JobFlow

Then create a job flow: you tell it where your jar file is and the jar's run arguments.
If you choose keepAlive = no, the EMR cluster is stopped once the job finishes.


<a id="org60a146e"></a>

### Hive

```bash
mybucket/scripts/myhive.hql   # => I put my hive script there.
mybucket/data/mydata.csv      # => I put my data there.
```


<a id="org269105e"></a>

### cli

1.  create spark cluster

    ```bash
    aws emr create-cluster --name "Spark cluster" --release-label emr-5.13.0 --applications Name=Spark \
      --ec2-attributes KeyName=tomer-key-pair --instance-type m4.small --instance-count 2 --use-default-roles
    ```

2.  list emr clusters

    ```bash
    aws emr list-clusters
    ```

3.  terminate clusters

    ```bash
    aws emr terminate-clusters --cluster-ids="j-W25BXM9TCOGX"
    ```


<a id="org3d659e3"></a>

## awscli


<a id="org8449a75"></a>

### install

```bash
pip3 install awscli --upgrade --user
```

Then add /Users/tomer.bendavid/.local/bin to PATH in .bash_profile.


<a id="org3b84864"></a>

### configure

1.  `aws configure`
2.  take the security credentials from [here](https://console.aws.amazon.com/iam/home?region=us-east-1#/security_credential)
3.  for the default region I entered `us-east-1`


<a id="orgceb6ec1"></a>

# python

<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">


<colgroup>
<col  class="org-left" />

<col  class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">command</th>
<th scope="col" class="org-left">description</th>
</tr>
</thead>

<tbody>
<tr>
<td class="org-left">`conda create --name testenv`</td>
<td class="org-left">create a new environment</td>
</tr>


<tr>
<td class="org-left">`conda activate testenv`</td>
<td class="org-left">activate the environment</td>
</tr>


<tr>
<td class="org-left">`conda env list`</td>
<td class="org-left">list environments</td>
</tr>


<tr>
<td class="org-left">`conda install spyder`</td>
<td class="org-left">install spyder into the active environment</td>
</tr>


<tr>
<td class="org-left">`conda install -c conda-forge pyspark`</td>
<td class="org-left">install pyspark</td>
</tr>
</tbody>
</table>


<a id="orgea8804a"></a>

## urllib2


<a id="org6dfdae5"></a>

### getfile

```python
import urllib.request
url = "http://www.cs.tufts.edu/comp/116/access.log"
accesslog = urllib.request.urlopen(url).read().decode('utf-8')
print("accesslog: " + accesslog)
```


<a id="orge693329"></a>

## matplotlib


<a id="org7750f54"></a>

## pandas

```python
from pandas import read_csv
```


<a id="orge4c5a97"></a>

### data

1.  pandas.read_csv

    ```python
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pandas.read_csv(url, names=names)  # names: the column names defined above
    ```

2.  

    dataset.shape

3.  

    dataset.head(20)

4.  

    dataset.describe()

5.  print(dataset.groupby('class').size())

6.  pandas.set_option('expand_frame_repr', False)

    Don't break the table output to new lines when printing (e.g. with `.head()`); keep it all on one line, as a wide table.


<a id="orga52d38d"></a>

### plot

1.  dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)

2.  dataset.hist()

3.  scatter_matrix(dataset)

    plt.show()


<a id="orgae299fe"></a>

### build model

1.  validation dataset

    Separate out a validation dataset: 80% for training, 20% for validation.  The code is under the next heading.

<a id="org468572f"></a>

# Split-out validation dataset

```python
from sklearn import model_selection   # added: needed for train_test_split

array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
```

1.  cross validation

    10-fold cross validation for accuracy; the settings are under the next heading.

<a id="org65d329c"></a>

# Test options and evaluation metric

```python
seed = 7
scoring = 'accuracy'
```

1.  build choose models

    Evaluate 6 models:

    1.  Logistic Regression (LR)
    2.  Linear Discriminant Analysis (LDA)
    3.  K-Nearest Neighbors (KNN)
    4.  Classification and Regression Trees (CART)
    5.  Gaussian Naive Bayes (NB)
    6.  Support Vector Machines (SVM)

    This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms.  The code is under the next two headings.

<a id="org1385d2a"></a>

# Spot Check Algorithms

```python
from sklearn.linear_model import LogisticRegression   # imports added for the models below
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
```


<a id="orgdf79ef3"></a>

# evaluate each model in turn

```python
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
```

results:

```bash
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.981667 (0.025000)
```

plot models comparison (code under the next heading):

<a id="org5cc1568"></a>

# Compare Algorithms

```python
import matplotlib.pyplot as plt   # added: needed for the plot

fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
```

1.  make predictions (code under the next heading)

<a id="org019c684"></a>

# Make predictions on validation dataset

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report   # added

knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
```

1.  errors f1 score

    We can see that the accuracy is 0.9 or 90%. The confusion matrix provides an indication of the three errors made. Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support, showing excellent results (granted, the validation dataset was small).
    
    ```bash
    0.9
    
    [[ 7  0  0]
     [ 0 11  1]
     [ 0  2  9]]
    
    precision    recall  f1-score   support
    
    Iris-setosa       1.00      1.00      1.00         7
    Iris-versicolor   0.85      0.92      0.88        12
    Iris-virginica    0.90      0.82      0.86        11
    
    avg / total       0.90      0.90      0.90        30
    ```


<a id="orgefae294"></a>

### resources

1.  <https://machinelearningmastery.com/machine-learning-in-python-step-by-step/>


<a id="org239029a"></a>

# Amazon
