Machine Learning | Andrew Ng | Coursera Machine Learning Course Notes

Limitlessun     2022-10-21


Week1:

Machine Learning:

 

  • A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

 

  • Supervised Learning:We already know what our correct output should look like.
  1. Regression:Try to map input variables to some continuous function.
  2. Classification:Try to map input variables into discrete categories.
  • Unsupervised Learning:We have little or no idea what our results should look like.
  1. Clustering:Find a way to automatically group data into groups that are somehow similar or related by different variables.
  2. Non-clustering:Find structure in a chaotic environment,like the "Cocktail Party Algorithm".

Model Representation:

 

  • x(i):Input features
  • y(i):Target variable
  • (x(i),y(i)):Training example
  • (x(i),y(i));i=1,...,m:Training set
  • m:Number of training examples
  • h(x):Hypothesis,hθ(x) = θ0 + θ1x
Cost Function:
  • This measures the average (squared) difference between the hypothesis's outputs for the x's and the actual outputs y's.
  • Algorithm: J(θ0,θ1) = (1/2m) Σ_{i=1..m} (hθ(x(i)) - y(i))^2 (The mean is halved as a convenience for the computation of gradient descent, because the derivative of the squared term cancels out the 1/2.)
  • We use a contour plot to show how to minimize the cost function.
 
Gradient Descent:
  • Help us to estimate the parameters in the hypothesis function.
  • Algorithm: repeat until convergence { θj := θj - α · ∂/∂θj J(θ0,θ1), for j = 0 and j = 1 }
  • j=0,1:Feature index number
  • α:Learning rate, or the size of each step. If α is too small, gradient descent can be slow. If α is too large, gradient descent can overshoot the minimum.
  • Partial Derivative of J:Direction of each step
  • At each iteration j, one should simultaneously update all of the parameters.
Gradient Descent For Linear Regression:
  • Algorithm: repeat until convergence { θ0 := θ0 - α(1/m) Σ_{i=1..m} (hθ(x(i)) - y(i)); θ1 := θ1 - α(1/m) Σ_{i=1..m} (hθ(x(i)) - y(i)) x(i) } (see the Octave sketch below)
  • This method looks at every example in the entire training set on every step, and is called batch gradient descent.
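
A minimal Octave sketch of the batch update above, using made-up data; the variable names (X, y, alpha, num_iters) are illustrative and not from the original notes.

```octave
% Batch gradient descent for linear regression, h(x) = X * theta.
% Toy data roughly follows y = 2 + 3x, so theta should approach [2; 3].
x = (1:10)';
y = 2 + 3 * x + 0.1 * randn(10, 1);
m = length(y);                    % number of training examples
X = [ones(m, 1) x];               % add the x0 = 1 column
theta = zeros(2, 1);
alpha = 0.01;                     % learning rate
num_iters = 1500;

for iter = 1:num_iters
  % Simultaneous update: theta_j := theta_j - alpha*(1/m)*sum((h - y).*x_j)
  theta = theta - (alpha / m) * X' * (X * theta - y);
end

J = (1 / (2 * m)) * sum((X * theta - y) .^ 2);   % final cost
```

The same vectorized update works unchanged with multiple features, since X simply gains more columns.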
Linear Algebra:
  • I learned linear algebra in college, so I will skip this part in my notes.
 
Week2:
Multiple Features:
  • n:number of features
  • x(i):input of ith training example
  • x(i)j:value of feature j in ith training example
  • hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + ... + θnxn = θ^T x (assume x0 = 1)
Gradient Descent for Multiple Variables:
  • Algorithm: repeat { θj := θj - α(1/m) Σ_{i=1..m} (hθ(x(i)) - y(i)) xj(i), simultaneously updating θj for j = 0,...,n }
  • Feature Scaling:
  1. Feature Scaling:Divide the input values by the range (max - min) of the input variable, to get every feature into approximately a -1 <= xi <= 1 range.
  2. Mean Normalization:Subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
  3. Combined: xi := (xi - μi) / si, where μi is the average of all the values for feature i and si is the range of values (max - min), or si is the standard deviation (see the normalization sketch below).
  • Learning Rate:Make a plot with the number of iterations on the x-axis and J(θ) on the y-axis. If J(θ) ever increases, then you probably need to decrease α. It has been proven that if the learning rate α is sufficiently small, then J(θ) will decrease on every iteration. To choose α, try 0.001, 0.003, 0.01, ...
  • Features and Polynomial Regression:We can improve our features and the form of our hypothesis function in a couple different ways
  1. We can combine multiple features into one. We can get a new feature x3 by taking x1 * x2.
  2. We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
  3. If you choose your features this way, then feature scaling becomes very important.
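
A short Octave sketch of the mean normalization and scaling step described above, on a made-up housing-style data set; std is used here for si, though the range (max - min) works just as well.

```octave
% Mean normalization and feature scaling: xi := (xi - mu_i) / s_i per feature.
X = [2104 3; 1600 3; 2400 3; 1416 2; 3000 4];   % toy data: size, #bedrooms
mu = mean(X);                 % 1 x n vector of feature means
sigma = std(X);               % 1 x n vector of standard deviations
X_norm = (X - mu) ./ sigma;   % each feature now has mean 0 and comparable scale
X_norm = [ones(rows(X), 1) X_norm];   % re-attach the x0 = 1 column afterwards
```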
Normal Equation:
  • Formula: θ = (X^T X)^(-1) X^T y (see the sketch below)
  • Example: build the m × (n+1) design matrix X from the training examples (including the x0 = 1 column) and the m-vector y of targets, then evaluate the formula once; no iterations are needed.
  • There is no need to do feature scaling with the normal equation.
  • Gradient descent vs. normal equation: gradient descent needs a choice of α and many iterations, but works well even when n is large; the normal equation needs no α and no iterations, but computing (X^T X)^(-1) is roughly O(n^3), so it becomes slow when n is very large.
  • If (X^TX) is non-invertible:
  1. Delete redundant features such as x1 = size in feet^2 and x2 = size in m^2.
  2. Delete features to make sure that m > n or use regularization.
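
A tiny Octave sketch of the normal equation on made-up data; pinv is used so that a non-invertible X'X still yields a usable solution.

```octave
% Closed-form solution: no learning rate, no iterations, no feature scaling.
x = (1:10)';
y = 2 + 3 * x;
m = length(y);
X = [ones(m, 1) x];
theta = pinv(X' * X) * X' * y;   % should recover approximately [2; 3]
```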
Octave:
 
Week3:
Classification:
  • The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values.
  • x(i):Feature
  • y(i):Label for the training example
Logistic Regression:
  • We change the form of our hypotheses to satisfy 0 <= hθ(x) <= 1 by plugging θ^T x into the Logistic Function.
  • Formula: hθ(x) = g(θ^T x), where g(z) = 1 / (1 + e^(-z)) is the sigmoid (logistic) function.
  • Interpretation: hθ(x) = P(y = 1 | x; θ), the estimated probability that y = 1 on input x.
  • Decision Boundary:The line that separates the area where y = 0 and where y = 1. It is given by the hypothesis function, θ^T x = 0.
  • Cost Function: Cost(hθ(x), y) = -log(hθ(x)) if y = 1, and -log(1 - hθ(x)) if y = 0.
We can compress our cost function's two conditional cases into one case: J(θ) = -(1/m) Σ_{i=1..m} [ y(i) log(hθ(x(i))) + (1 - y(i)) log(1 - hθ(x(i))) ].
  • Gradient Descent: θj := θj - (α/m) Σ_{i=1..m} (hθ(x(i)) - y(i)) xj(i). This algorithm is identical to the one we used in linear regression, but hθ(x) has changed (see the sketch below).
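
A compact Octave sketch of logistic regression with the sigmoid hypothesis and the gradient step above; the function names sigmoid and logisticCost and the toy data are illustrative only.

```octave
1;  % script marker so the functions below can live in this file

function g = sigmoid(z)
  % Logistic function g(z) = 1 / (1 + e^-z), applied elementwise.
  g = 1 ./ (1 + exp(-z));
end

function [J, grad] = logisticCost(theta, X, y)
  % Unregularized logistic regression cost and gradient.
  m = length(y);
  h = sigmoid(X * theta);
  J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));
  grad = (1 / m) * X' * (h - y);   % same form as linear regression, different h
end

% Toy data: one feature, labels switch from 0 to 1 around x = 3.5.
X = [ones(6, 1) (1:6)'];
y = [0; 0; 0; 1; 1; 1];
theta = zeros(2, 1);
alpha = 0.1;
for iter = 1:5000
  [J, grad] = logisticCost(theta, X, y);
  theta = theta - alpha * grad;    % plain gradient descent on the logistic cost
end
```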
Optimization Algorithms:
  • Conjugate gradient
  • BFGS
  • L-BFGS
  • We can write code like the sketch below to use Octave's "fminunc()".
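
The code in the original post was an image; the sketch below follows the standard fminunc pattern shown in the lecture, with a toy cost J(θ) = (θ1 - 5)^2 + (θ2 - 5)^2 whose minimum is at θ = [5; 5].

```octave
1;  % script marker

function [jVal, gradient] = costFunction(theta)
  % Return both the cost and its gradient, as fminunc expects with GradObj on.
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
% optTheta should be close to [5; 5] and functionVal close to 0.
```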
Multiclass Classification:
  • One-vs-all: hθ^(i)(x) = P(y = i | x; θ) for each class i.
  • Train a logistic regression classifier hθ^(i)(x) for each class i to predict the probability that y = i. To make a prediction on a new x, pick the class i that maximizes hθ^(i)(x).
Overfitting:
  • (Figure: the same training data fit by an underfit straight line, a reasonable quadratic, and an overfit high-order polynomial.)
  • Even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor.
  • Options to address overfitting:
  1. Reduce the number of features.
  2. Regularization.
  • Regularized Linear Regression:

  1. Cost Function: J(θ) = (1/2m) [ Σ_{i=1..m} (hθ(x(i)) - y(i))^2 + λ Σ_{j=1..n} θj^2 ] (λ is the regularization parameter; θ0 is not regularized.)
  2. Gradient Descent: repeat { θ0 := θ0 - (α/m) Σ_{i=1..m} (hθ(x(i)) - y(i)) x0(i); θj := θj - α [ (1/m) Σ_{i=1..m} (hθ(x(i)) - y(i)) xj(i) + (λ/m) θj ] for j = 1,...,n }, which can also be written θj := θj(1 - αλ/m) - (α/m) Σ_{i=1..m} (hθ(x(i)) - y(i)) xj(i).
  3. Normal Equation: θ = (X^T X + λ·L)^(-1) X^T y, where L is the (n+1)×(n+1) identity matrix with its top-left entry set to 0.
  • Regularized Logistic Regression:
  1. Cost Function: J(θ) = -(1/m) Σ_{i=1..m} [ y(i) log(hθ(x(i))) + (1 - y(i)) log(1 - hθ(x(i))) ] + (λ/2m) Σ_{j=1..n} θj^2.
  2. Gradient Descent: the same update as for regularized linear regression, but with the logistic hθ(x) = 1/(1 + e^(-θ^T x)) (see the sketch below).
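
A minimal Octave sketch of the regularized logistic regression cost and gradient above; costFunctionReg and the three-example data set are illustrative, and θ0 (theta(1) in Octave's 1-based indexing) is deliberately left unregularized.

```octave
1;  % script marker

function [J, grad] = costFunctionReg(theta, X, y, lambda)
  % Regularized logistic regression cost and gradient (theta(1) not regularized).
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));
  J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
      + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
  grad = (1 / m) * X' * (h - y);
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);
end

% Example call on a made-up 3-example, 2-feature set (first column is x0 = 1):
X = [1 1 1; 1 2 0; 1 0 2];
y = [1; 0; 1];
[J, grad] = costFunctionReg(zeros(3, 1), X, y, 1);
```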
 
Week4:
Neural Network:Representation:
  • A single neuron with a sigmoid activation computes hθ(x) = 1/(1 + e^(-θ^T x)); x0 = 1 is the bias unit.
  • If we had one hidden layer, it would look like: [x0 x1 x2 x3] → [a1(2) a2(2) a3(2)] → hΘ(x).
  • The values for each of the "activation" nodes: aj(2) = g(Θj0(1)x0 + Θj1(1)x1 + ... + Θjn(1)xn), and hΘ(x) = a1(3) = g(Θ10(2)a0(2) + Θ11(2)a1(2) + ...).
  • Each layer gets its own matrix of weights: if layer j has sj units and layer j+1 has s(j+1) units, then Θ(j) has dimension s(j+1) × (sj + 1). (The '+1' comes from the 'bias nodes': the inputs include the bias node while the outputs do not.)
  • Vectorized: a(1) = x; z(j+1) = Θ(j) a(j); a(j+1) = g(z(j+1)) with the bias unit a0(j+1) = 1 added; hΘ(x) = a(L) (see the forward-propagation sketch below).
  • We can choose different Θ matrices to implement basic logical operations (such as AND, OR and NOT) with a small neural network.
  • We can construct more complex operations (such as XNOR) by using hidden layers.
  • Multiclass Classification:We use one-vs-all method and let hypothesis function return a vector of values.
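
A small Octave sketch of vectorized forward propagation through one hidden layer, using the XNOR weights from the lecture (an AND unit, a (NOT x1) AND (NOT x2) unit, and an OR output unit).

```octave
% Forward propagation: a(1) = x, z(j+1) = Theta(j)*a(j), a(j+1) = g(z(j+1)).
Theta1 = [-30 20 20;    % hidden unit 1: x1 AND x2
           10 -20 -20]; % hidden unit 2: (NOT x1) AND (NOT x2)
Theta2 = [-10 20 20];   % output unit: OR of the two hidden units => XNOR

x  = [1; 0];                        % one input example
a1 = [1; x];                        % add bias unit
z2 = Theta1 * a1;
a2 = [1; 1 ./ (1 + exp(-z2))];      % sigmoid activation, add bias
z3 = Theta2 * a2;
h  = 1 ./ (1 + exp(-z3));           % h ~ 0 here, since XNOR(1, 0) = 0
```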
 
Week 5:
Neural Network:Learning:
 
Cost Function:
  • L:Total number of layers in the network
  • Sl:Number of units (not counting bias unit) in layer l
  • K:number of output units/classes
  • J(Θ) = -(1/m) Σ_{i=1..m} Σ_{k=1..K} [ yk(i) log((hΘ(x(i)))k) + (1 - yk(i)) log(1 - (hΘ(x(i)))k) ] + (λ/2m) Σ_{l=1..L-1} Σ_{i=1..sl} Σ_{j=1..s(l+1)} (Θji(l))^2
Backpropagation Algorithm:
  • "Backpropagation" is neural-network terminology for minimizing our cost function.
  • Algorithm: set Δ(l) = 0 for all l, then for t = 1 to m:
  1. Set a(1) := x(t) and perform forward propagation to compute a(l) for l = 2, 3, ..., L.
  2. Compute the output error δ(L) = a(L) - y(t).
  3. Propagate it backwards: δ(l) = ((Θ(l))^T δ(l+1)) .* a(l) .* (1 - a(l)) for l = L-1, ..., 2.
  4. Accumulate Δ(l) := Δ(l) + δ(l+1) (a(l))^T. We then get D(l) = (1/m)(Δ(l) + λΘ(l)) for the non-bias columns and D(l) = (1/m)Δ(l) for the bias column, with ∂J(Θ)/∂Θij(l) = Dij(l).
  • Unroll all the parameter matrices into one long vector (e.g. thetaVec = [Theta1(:); Theta2(:)]) before handing them to an optimizer, and use reshape to get the original matrices back; see the sketch after this list.
  • Gradient Checking: we can approximate the derivative with respect to θj as ∂J/∂θj ≈ (J(θ1, ..., θj + ε, ..., θn) - J(θ1, ..., θj - ε, ..., θn)) / (2ε), with ε ≈ 10^(-4), and compare it with the backpropagation gradient; turn the check off before training.
  • Training: randomly initialize the weights to small values, implement forward propagation to get hΘ(x), compute the cost, implement backpropagation to get the partial derivatives, use gradient checking once to confirm them, then minimize J(Θ) with gradient descent or an advanced optimizer.
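
The unrolling/reshape code and the gradient check mentioned above were images in the original post; the sketch below follows the standard pattern from the lecture, with made-up layer sizes (10-10-1) and a stand-in cost function in place of the real nnCostFunction.

```octave
% Unroll all weight matrices into one long vector (what fminunc/fmincg expect).
Theta1 = rand(10, 11); Theta2 = rand(10, 11); Theta3 = rand(1, 11);
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];

% Get the original matrices back from the long vector.
Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);

% Gradient checking: two-sided numerical approximation of dJ/dtheta_j.
EPSILON = 1e-4;
J = @(t) sum(t .^ 2);        % stand-in cost; use your real nnCostFunction here
n = numel(thetaVec);
gradApprox = zeros(n, 1);
for j = 1:n
  thetaPlus = thetaVec;   thetaPlus(j)  = thetaPlus(j)  + EPSILON;
  thetaMinus = thetaVec;  thetaMinus(j) = thetaMinus(j) - EPSILON;
  gradApprox(j) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end
% In practice, compare gradApprox with the unrolled backpropagation gradient,
% then disable this check before training (it is very slow).
```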
Week 6:
Applying Machine Learning:
 
Evaluating a Hypothesis:
  • Set 70% of the data to be the training set and the remaining 30% to be the test set.
  • Test set error: for linear regression, Jtest(θ) = (1/2mtest) Σ (hθ(xtest(i)) - ytest(i))^2; for classification, the misclassification (0/1) error averaged over the test set.
  • In order to choose the model (e.g. the degree of polynomial) for your hypothesis, we can test each degree using a cross validation set (60% training set, 20% cross validation set, 20% test set).
Learn θ on the training set, pick the model with the lowest cross validation error Jcv(θ), and report the generalization error with Jtest(θ).
Bias vs. Variance:
  • High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between these two.
  • High Bias (underfitting): Jtrain(θ) is high and Jcv(θ) ≈ Jtrain(θ).
  • High Variance (overfitting): Jtrain(θ) is low and Jcv(θ) is much greater than Jtrain(θ).
  • (Plot of error vs. degree of polynomial d: the training error keeps decreasing as d grows, while the cross validation error first falls and then rises again.)
  • In order to choose the model and the regularization term λ, we need to: create a list of λ values (e.g. 0, 0.01, 0.02, 0.04, ..., 10.24), learn a θ for each λ, compute the cross validation error Jcv(θ) without regularization for each learned θ, pick the combination with the lowest Jcv(θ), and finally check its performance on the test set.
  • If a learning algorithm is suffering from high bias, getting more training data will not help much.
  • If a learning algorithm is suffering from high variance, getting more training data is likely to help. (Learning curves: with high bias, both errors quickly flatten out at a high value; with high variance, there is a large gap between training and cross validation error that slowly shrinks as m grows.)
  • What to try next: getting more training examples and trying a smaller set of features fix high variance; adding features and adding polynomial features fix high bias; decreasing λ fixes high bias; increasing λ fixes high variance.
  • A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
  • A large neural network with more parameters is prone to overfitting. It is also computationally expensive. 
 
Machine Learning System Design:
  • The recommended approach:
  1. Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
  2. Plot learning curves to decide if more data, more features, etc. are likely to help.
  3. Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
  • It is very important to get error results as a single, numerical value.
  • Precision and recall (covered below under Handling Skewed Data) are better error metrics than plain accuracy when the classes are skewed.
Handling Skewed Data:
  • Skewed Classes:The ratio of positive to negative examples is very close to one of two extremes.
  • Confusion matrix: comparing predictions with actual values gives true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). (y = 1 in presence of the rare class that we want to detect)
  • Precision Rate:TP / (TP + FP)
  • Recall Rate:TP / (TP + FN)
  • F1 Score:(2 * P * R) / (P + R) (see the sketch below)
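
A tiny Octave sketch computing precision, recall and the F1 score from made-up predictions; y = 1 marks the rare class.

```octave
yActual = [1 0 0 1 0 0 0 1 0 0]';   % ground truth
yPred   = [1 0 0 0 0 1 0 1 0 0]';   % classifier output

tp = sum((yPred == 1) & (yActual == 1));   % true positives
fp = sum((yPred == 1) & (yActual == 0));   % false positives
fn = sum((yPred == 0) & (yActual == 1));   % false negatives

precision = tp / (tp + fp);
recall    = tp / (tp + fn);
F1        = 2 * precision * recall / (precision + recall);
```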
 
Week 7:
Support Vector Machines:
 
Optimization Objective:
  • Objective: min over θ of C Σ_{i=1..m} [ y(i) cost1(θ^T x(i)) + (1 - y(i)) cost0(θ^T x(i)) ] + (1/2) Σ_{j=1..n} θj^2, where cost1 and cost0 are piecewise-linear approximations of the logistic cost terms.
  • Because multiplying the objective by a constant does not change the θ that achieves the minimum, we can drop the 1/m factor that logistic regression uses.
  • We can use either (A + λB) or (CA + B) to control the relative weighting of the two terms; the SVM convention uses C, which plays a role similar to 1/λ.
  • Hypothesis: hθ(x) = 1 if θ^T x >= 0, and 0 otherwise.
  • A support vector machine just makes a prediction of y being equal to one or zero directly, rather than outputting a probability.
Large Margin Intuition:
  • The SVM decision boundary separates the positive and negative examples with as large a margin as possible.
  • The large-margin ("black line") boundary gives the SVM robustness, because it stays as far as possible from the nearest examples of both classes.
Kernels:
  • Given the training examples (x(i), y(i)), we choose landmarks l(i) = x(i), then let fi = sim(x, l(i)).
  • We compute new features depending on proximity to landmarks, so our hypothesis becomes θ0 + θ1f1 + θ2f2 + ...
  • Gaussian Kernels: fi = sim(x, l(i)) = exp(-||x - l(i)||^2 / (2σ^2)); fi ≈ 1 when x is close to l(i) and fi ≈ 0 when x is far from it (see the sketch below).
  • C and σ: a large C (small λ) gives lower bias and higher variance, a small C the opposite; a large σ^2 makes the features vary more smoothly (higher bias, lower variance), a small σ^2 the opposite.
  • Do perform feature scaling before using the Gaussian kernel.
  • Linear kernel: meaning no kernel, i.e. predict y = 1 if θ^T x >= 0.
  • Choosing between logistic regression and SVMs: if n is large relative to m, use logistic regression or an SVM with a linear kernel; if n is small and m is intermediate, use an SVM with a Gaussian kernel; if n is small and m is very large, add features and use logistic regression or a linear kernel.
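
A minimal Octave sketch of the Gaussian kernel (similarity) function described above; the function name gaussianKernel and the sample points are illustrative.

```octave
1;  % script marker

function sim = gaussianKernel(x1, x2, sigma)
  % Similarity of x1 to the landmark x2: near 1 when close, near 0 when far.
  sim = exp(-sum((x1 - x2) .^ 2) / (2 * sigma ^ 2));
end

x  = [1; 2];
l1 = [1; 2];                        % identical to x  -> similarity ~ 1
l2 = [5; 9];                        % far from x      -> similarity ~ 0
f1 = gaussianKernel(x, l1, 1.0);
f2 = gaussianKernel(x, l2, 1.0);
```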
Week8:

Unsupervised Learning:

 

Clustering:

  • We give an unlabeled training set to an algorithm and ask the algorithm to find some structure in the data for us.
  • K-means Algorithm: repeat { cluster assignment step: c(i) := index of the cluster centroid closest to x(i); move centroid step: μk := mean of the points assigned to cluster k } (see the sketch below).
  • Cost Function (distortion): J(c(1), ..., c(m), μ1, ..., μK) = (1/m) Σ_{i=1..m} ||x(i) - μc(i)||^2.
  • Random Initialization: randomly pick K training examples and set μ1, ..., μK equal to these K examples.
  • Elbow Method: plot the distortion J against the number of clusters K and pick the K at the "elbow" of the curve.
  • A better way to choose the number of clusters is to ask for what purpose you are running K-means.
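
A short Octave sketch of the two K-means steps (cluster assignment and centroid move) on made-up 2-D data with two obvious clusters; in practice you would also repeat the whole run with several random initializations.

```octave
X = [1 1; 1.2 0.8; 0.9 1.1; 5 5; 5.2 4.9; 4.8 5.1];   % toy 2-D data
K = 2;
centroids = X(randperm(rows(X), K), :);   % random init: pick K training examples

for iter = 1:10
  % Cluster assignment step: c(i) = index of the closest centroid to x(i).
  c = zeros(rows(X), 1);
  for i = 1:rows(X)
    d = sum((centroids - X(i, :)) .^ 2, 2);   % squared distance to each centroid
    [~, c(i)] = min(d);
  end
  % Move centroid step: mu_k = mean of the points assigned to cluster k.
  for k = 1:K
    centroids(k, :) = mean(X(c == k, :), 1);
  end
end
```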
Dimensionality Reduction:
  • Reason:Data compression or speed up our learning algorithm.
  • Visualization:We can use dimensionality reduction to reduce data from high dimensions down to 2 or 3 dimensions,so that we can plot it and understand our data better.
Principal Component Analysis:

 

  • PCA:Find a lower dimensional surface onto which to project the data, so as to minimize the squared distance between each point and the location where it gets projected.
  • Reduce from 2D to 1D:Find a vector onto which to project the data to minimize the projection error.
  • Reduce from nD to kD:Find k vectors onto which to project the data to minimize the projection error.
  • Data preprocessing:Feature scaling/Mean normalization
  • Algorithm:
  1. Compute the covariance matrix: Σ = (1/m) Σ_{i=1..m} x(i) (x(i))^T = (1/m) X^T X.
  2. Compute its eigenvectors with [U, S, V] = svd(Sigma).
  3. If we want to reduce the data from n dimensions down to k dimensions, all we need to do is take the first k columns of U (n × n) as Ureduce (n × k).
  4. z = Ureduce' * x (see the sketch below).
  • Reconstruction from Compressed Representation:Xapprox = Ureduce * z.
  • Choosing k: pick the smallest k such that the average squared projection error divided by the total variation is <= 0.01, i.e. 99% of the variance is retained; equivalently, using the svd output, Σ_{i=1..k} Sii / Σ_{i=1..n} Sii >= 0.99.
  • Applying PCA: define the mapping from x to z using only the training set, then apply it to the cross validation and test sets; do not use PCA to prevent overfitting (use regularization instead). Only if running your algorithm on the original data doesn't do what you want should you implement PCA.
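
A compact Octave sketch of the PCA steps above (covariance matrix, svd, projection, reconstruction, variance retained) on made-up data; feature scaling is omitted because the toy features are already on similar scales.

```octave
X = [2 3; 3 4; 4 6; 5 7; 6 9];          % m = 5 examples, n = 2 features
mu = mean(X);
X_norm = X - mu;                        % mean normalization
m = rows(X);

Sigma = (1 / m) * (X_norm' * X_norm);   % n x n covariance matrix
[U, S, V] = svd(Sigma);
k = 1;
Ureduce = U(:, 1:k);                    % first k principal directions

Z = X_norm * Ureduce;                   % projected data (m x k)
X_approx = Z * Ureduce' + mu;           % reconstruction in the original space

s = diag(S);
retained = sum(s(1:k)) / sum(s);        % fraction of variance retained (aim >= 0.99)
```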
Week 9:

Anomaly Detection:

Density Estimation:

  • We build a model of the probability p(x); if p(xtest) is less than some epsilon ε, we flag it as an anomaly.
  • Gaussian Distribution (Normal Distribution): x ~ N(μ, σ^2), with p(x; μ, σ^2) = (1/(√(2π)·σ)) exp(-(x - μ)^2 / (2σ^2)).
  • Parameter Estimation: μj = (1/m) Σ_{i=1..m} xj(i), σj^2 = (1/m) Σ_{i=1..m} (xj(i) - μj)^2.
  • Algorithm: p(x) = Π_{j=1..n} p(xj; μj, σj^2); flag an anomaly if p(x) < ε (see the sketch below).
  • Evaluation: assume we have some labeled data of anomalous and non-anomalous examples. Use a training set (unlabeled, assumed normal), a cross validation set and a test set; on the labeled sets compute precision/recall or the F1 score, and use the cross validation set to choose ε.
  • Anomaly Detection vs. Supervised Learning: use anomaly detection when there are very few positive (anomalous) examples and many negative examples, or when future anomalies may look nothing like the ones seen so far; use supervised learning when there are enough positive examples for the algorithm to learn what they look like.
  • Non-Gaussian Features: transform them, e.g. xNew = log(x) (for log-normal looking data) or xNew = x^(0.1), so the histogram looks more Gaussian.
  • Choose Features: choose features that might take on unusually large or small values in the event of an anomaly.
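
A small Octave sketch of the density-estimation algorithm above on made-up data: fit a per-feature Gaussian, compute p(x), and flag examples below ε; the ε value here is arbitrary, since in practice it is chosen on a labeled cross validation set.

```octave
X = [0.5 * randn(50, 2) + 3; 8 8];   % 50 normal-looking examples plus one outlier
mu = mean(X);                         % 1 x n feature means
sigma2 = var(X, 1);                   % 1 x n variances (1/m normalization)

% p(x) = product over features of the univariate Gaussian densities.
p = prod((1 ./ sqrt(2 * pi * sigma2)) .* exp(-((X - mu) .^ 2) ./ (2 * sigma2)), 2);

epsilon = 1e-4;
anomalies = find(p < epsilon);        % the last example should be flagged
```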

Multivariate Gaussian Distribution:

  • p(x; μ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp(-(1/2)(x - μ)^T Σ^(-1) (x - μ)), with parameters fit as μ = (1/m) Σ_{i=1..m} x(i) and Σ = (1/m) Σ_{i=1..m} (x(i) - μ)(x(i) - μ)^T.

  • The original (per-feature) model corresponds to the special case where Σ is diagonal. Use the original model when n is large or m is small, manually adding combined features to capture correlations; use the multivariate model to capture correlations between features automatically, but it is computationally more expensive and requires m > n so that Σ is invertible.

 

Recommender Systems:

 

  • nu = number of users
  • nm = number of movies
  • r(i,j) = 1 if user j has rated movie i
  • y(i,j) = rating given by user j to movie i (only defined if r(i,j) = 1)
  • theta(j) = parameter vector for user j
  • x(i) = feature vector for movie i

Content Based Recommendations:

  • We assume we have features for different movies.
  • For each user j, learn a parameter vector θ(j). Predict that user j rates movie i with (θ(j))^T x(i) stars.
  • Optimization Objective: min over θ(1), ..., θ(nu) of (1/2) Σ_j Σ_{i: r(i,j)=1} ((θ(j))^T x(i) - y(i,j))^2 + (λ/2) Σ_j Σ_{k=1..n} (θk(j))^2.
  • Gradient Descent: θk(j) := θk(j) - α [ Σ_{i: r(i,j)=1} ((θ(j))^T x(i) - y(i,j)) xk(i) + λ θk(j) ] for k ≠ 0, and without the λ term for k = 0.

Collaborative Filtering:

 

  • We assume that each of our users has told us how much they like the romantic movies and how much they like action packed movies.
  • Optimization Algorithm: given θ(1), ..., θ(nu), learn the features by minimizing over x(1), ..., x(nm) the sum (1/2) Σ_i Σ_{j: r(i,j)=1} ((θ(j))^T x(i) - y(i,j))^2 + (λ/2) Σ_i Σ_k (xk(i))^2.
  • Given x and movie ratings can estimate theta.
  • Given theta and movie ratings can estimate x.
  • Optimization Objective: minimize over x(1), ..., x(nm) and θ(1), ..., θ(nu) simultaneously J = (1/2) Σ_{(i,j): r(i,j)=1} ((θ(j))^T x(i) - y(i,j))^2 + (λ/2) Σ_i Σ_k (xk(i))^2 + (λ/2) Σ_j Σ_k (θk(j))^2.
  • Algorithm: initialize x(1), ..., x(nm) and θ(1), ..., θ(nu) to small random values, minimize J with gradient descent (or an advanced optimizer), and predict user j's rating of movie i as (θ(j))^T x(i).
  • Mean Normalization: compute the average rating μi that each movie obtained and subtract it from that movie's ratings before learning, so the predicted rating becomes (θ(j))^T x(i) + μi.

Week 10:

 

Large Scale Machine Learning:

Stochastic Gradient Descent:

 

  • Algorithm:

 

  1. Randomly shuffle the data set.
  2. For i = 1...m: θj := θj - α (hθ(x(i)) - y(i)) xj(i), for every j = 0, ..., n (see the sketch below).
  • SGD will only try to fit one training example at a time. This way we can make progress in gradient descent without having to scan all m training examples first.
  • We will usually take 1-10 passes through the data set to get near the global minimum.
  • Convergence: plot the cost cost(θ, (x(i), y(i))) = (1/2)(hθ(x(i)) - y(i))^2 averaged over the last 1000 or so training examples against the number of iterations. We can compute and save these costs during the gradient descent iterations, just before each update.
  • One strategy for trying to actually converge at the global minimum is to slowly decrease α over time, e.g. α = const1 / (iterationNumber + const2).
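
A minimal Octave sketch of stochastic gradient descent for linear regression on made-up data, following the two steps above (shuffle, then update on one example at a time).

```octave
m = 1000;
x = rand(m, 1) * 10;
y = 2 + 3 * x + 0.5 * randn(m, 1);    % toy data around y = 2 + 3x
X = [ones(m, 1) x];

theta = zeros(2, 1);
alpha = 0.01;
num_passes = 5;                       % 1-10 passes are usually enough

for pass = 1:num_passes
  idx = randperm(m);                  % step 1: randomly shuffle the data set
  X = X(idx, :);  y = y(idx);
  for i = 1:m                         % step 2: update theta using example i only
    h = X(i, :) * theta;
    theta = theta - alpha * (h - y(i)) * X(i, :)';
  end
end
```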

Mini-Batch Gradient Descent:

 

  • Use b examples in each iteration.(b = mini-batch size)
  • Algorithm: say b = 10 and m = 1000; repeat { for i = 1, 11, 21, ..., 991: θj := θj - (α/b) Σ_{k=i..i+9} (hθ(x(k)) - y(k)) xj(k), for every j = 0, ..., n }.
  • The advantage is that we can use vectorized implementations over the b examples.

Online Learning:

 

  • With a continuous stream of users to a website, we can run an endless loop that gets (x,y), where we collect some user actions for the features in x to predict some behavior y.
  • You can update θ for each individual (x,y) pair as you collect them. This way, you can adapt to new pools of users, since you are continuously updating theta.

Map Reduce and Data Parallelism:

 

  • Many learning algorithms can be expressed as computing sums of functions over the training set.
  • We can divide up batch gradient descent and dispatch the cost function for a subset of the data to many different machines so that we can train our algorithm in parallel.
  • Split the training set across several machines, let each machine compute the partial sum of the gradient over its own subset, and have a master server combine the partial sums and perform the parameter update.

Week 11:

Photo OCR:

 

  • Pipeline:
  1. Text detection
  2. Character segmentation
  3. Character classification
  • Use sliding windows and expansion for text detection and character segmentation.
  • Ceiling analysis: manually give one pipeline stage perfect output at a time and measure how much the overall accuracy improves, to decide which stage is most worth working on.

Artificial Data Synthesis:

 

  • Creating new data from scratch (for example, synthesizing character images from font libraries).
  • Taking existing labeled examples and introducing distortions to them, to create extra labeled examples.

 







