week1
Tom Mitchell (1998), well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.
Example: playing checkers.
E = the experience of playing many games of checkers
T = the task of playing checkers.
P = the probability that the program will win the next game.
E | experience |
T | task |
P | performance |
categories:
In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
examples of regression
Example 2:
(a) Regression - Given a picture of a person, we have to predict their age on the basis of the given picture
(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
example: Google News, clustering
applications:
example 2 cocktail party
Non-clustering: The "Cocktail Party Algorithm" allows you to find structure in a chaotic environment (i.e., identifying individual voices and music from a mesh of sounds).
wrong answer for the following Q
desc: Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.
univariate = one variable. x^(i) – the i-th input variable (feature); y^(i) – the i-th output (target) variable.
house predicting, regression vs classification
target variable | type |
continuous | regression |
discrete values | classification |
When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.
Linear regression predicts a real-valued output based on an input value. We discuss the application of linear regression to housing price prediction, present the notion of a cost function, and introduce the gradient descent method for learning.
fig/ octave code: J = sum((X*theta - y).^2) / (2*m)
An example is the least-squares error (least-squares estimation).
J(θ) | cost function |
goal: minimize the cost function. We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from x's and the actual output y's.
\[ J(\theta_0, \theta_1) = \dfrac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2 = \dfrac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2 \]
To break it apart, it is \( \frac{1}{2}\bar{x} \), where \( \bar{x} \) is the mean of the squares of \( h_\theta(x_i) - y_i \), that is, of the differences between the predicted values and the actual values.
This function is otherwise called the "squared error function" or "mean squared error". The mean is halved \( \left(\frac{1}{2}\right) \) as a convenience for the computation of gradient descent, as the derivative of the square function cancels out the \( \frac{1}{2} \) term. The following image summarizes what the cost function does:
goal: minimize the cost function J(θ1)
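A small worked example may help here. The sketch below evaluates the cost for the simplified one-parameter hypothesis h(x) = θ1·x at a few values of θ1. The three data points are an assumption made up for illustration (they lie exactly on y = x), so the cost should be zero at θ1 = 1 and grow as θ1 moves away from 1.

```octave
% Toy data (assumed for illustration): points lying exactly on y = x
x = [1; 2; 3];
y = [1; 2; 3];
m = length(y);

% Cost for the simplified hypothesis h(x) = theta1 * x
cost = @(theta1) sum((theta1 * x - y) .^ 2) / (2 * m);

J_at_1  = cost(1);    % perfect fit, every error is zero
J_at_05 = cost(0.5);  % errors are 0.5, 1, 1.5
J_at_0  = cost(0);    % errors are 1, 2, 3

fprintf("J(1) = %g, J(0.5) = %g, J(0) = %g\n", J_at_1, J_at_05, J_at_0);
```

Plotting J(θ1) over a range of θ1 values for this data gives a bowl-shaped curve whose lowest point is at θ1 = 1.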
gradient descent: an algorithm for automatically finding the values of θ0 and θ1 that minimize the cost function
We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis.
https://www.hackerearth.com/blog/developers/gradient-descent-algorithm-linear-regression/
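A minimal Octave sketch of batch gradient descent for the hypothesis h(x) = θ0 + θ1·x. These notes do not spell out the update rule, so the code uses the standard one, theta := theta - alpha * (1/m) * X' * (X*theta - y). The data, the learning rate alpha, and the iteration count num_iters are all assumptions chosen for illustration.

```octave
% Toy data (assumed for illustration): points on the line y = 1 + 2x
x = [1; 2; 3; 4];
y = [3; 5; 7; 9];
m = length(y);

X = [ones(m, 1), x];   % column of ones so theta(1) acts as the intercept theta0
theta = zeros(2, 1);   % start from theta0 = 0, theta1 = 0
alpha = 0.05;          % learning rate (assumed value)
num_iters = 5000;      % number of iterations (assumed value)

for iter = 1:num_iters
  % Gradient of J; the 1/2 in J cancels the 2 from differentiating the square
  grad = (X' * (X * theta - y)) / m;
  % Both parameters are updated simultaneously
  theta = theta - alpha * grad;
end

% Cost at the final parameters, same formula as the cost function above
J = sum((X * theta - y) .^ 2) / (2 * m);
```

With this data theta should approach [1; 2] and J should approach 0. A learning rate that is too large makes the iterates diverge instead, so alpha usually needs some tuning.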
This optional module provides a refresher on linear algebra concepts. Basic understanding of linear algebra is necessary for the rest of the course, especially as we begin to cover models with multiple variables.
% The ; denotes we are going back to a new row.
> A = [1, 2, 3; 4, 5, 6; 7, 8, 9; 10, 11, 12]
A =
1 2 3
4 5 6
7 8 9
10 11 12
% Get the dimension of the matrix A where m = rows and n = columns
> [m,n] = size(A)
m = 4
n = 3
% You could also store it this way
> dim_A = size(A)
dim_A =
4 3
% Let's index into the 2nd row, 3rd column of matrix A
> A_23 = A(2,3)
A_23 = 6
% Initialize a vector
> v = [1;2;3]
v =
1
2
3
% Get the dimension of the vector v
> dim_v = size(v)
dim_v =
3 1
+ | addition |
- | subtraction |
note: To add or subtract two matrices, their dimensions must be the same.
[a b; c d] + [w x; y z] = [a+w b+x; c+y d+z]
[a b; c d] - [w x; y z] = [a-w b-x; c-y d-z]
% Initialize matrix A and B
> A = [1, 2, 4; 5, 3, 2]
A =
1 2 4
5 3 2
> B = [1, 3, 4; 1, 1, 1]
% Initialize constant s
> s = 2
% See how element-wise addition works
> addAB = A + B
% See how element-wise subtraction works
> subAB = A - B
% See how scalar multiplication works
> multAs = A * s
% Divide A by s
> divAs = A / s
% What happens if we have a Matrix + scalar?
> addAs = A + s
addAs =
3 4 6
7 5 4
We map the column of the vector onto each row of the matrix, multiplying each element and summing the result:
[a b; c d; e f] * [x; y] = [a*x + b*y; c*x + d*y; e*x + f*y]
The result is a vector. The number of columns of the matrix must equal the number of rows of the vector.
An m x n matrix multiplied by an n x 1 vector results in an m x 1 vector.
exercise
% Initialize matrix A
A = [1, 2, 3; 4, 5, 6; 7, 8, 9]
% Initialize vector v
v = [1; 1; 1]
% Multiply A * v
Av = A * v
example of an application: house prediction
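One way to read this application: evaluating a linear hypothesis on several houses at once is a single matrix-vector product. In the sketch below, the house sizes and the hypothesis h(x) = -40 + 0.25·x are assumed values for illustration only and are not taken from these notes.

```octave
% House sizes (assumed example values)
sizes = [2104; 1416; 1534; 852];

% Parameters of a hypothetical hypothesis h(x) = -40 + 0.25 * x
theta = [-40; 0.25];

% Prepend a column of ones so the intercept is multiplied by 1
X = [ones(length(sizes), 1), sizes];   % 4 x 2

% (4 x 2) * (2 x 1) = (4 x 1): one predicted price per house
predictions = X * theta
```

This replaces a loop over houses with one product, which matches the rule that an m x n matrix times an n x 1 vector gives an m x 1 vector.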
% Initialize a 3 by 2 matrix
A = [1, 2; 3, 4; 5, 6]
% Initialize a 2 by 1 matrix
B = [1; 2]
% We expect a resulting matrix of (3 by 2)*(2 by 1) = (3 by 1)
multAB = A*B
% Make sure you understand why we got that result
example of an application: house prediction
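With matrix-matrix multiplication the same idea extends to several competing hypotheses at once: each column of the parameter matrix holds one hypothesis, and each column of the result holds that hypothesis's predictions. The house sizes and the three hypotheses below are assumed values for illustration, not taken from these notes.

```octave
% House sizes (assumed example values)
sizes = [2104; 1416; 1534; 852];
X = [ones(length(sizes), 1), sizes];   % 4 x 2, column of ones for the intercept

% Three hypothetical hypotheses, one per column:
%   h1(x) = -40 + 0.25x,  h2(x) = 200 + 0.1x,  h3(x) = -150 + 0.4x
Theta = [-40, 200, -150;
         0.25, 0.1, 0.4];              % 2 x 3

% (4 x 2) * (2 x 3) = (4 x 3): row i is house i, column j is hypothesis j
all_predictions = X * Theta
```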
exercise
% Initialize random matrices A and B
A = [1,2;4,5]
B = [1,1;0,2]
% Initialize a 2 by 2 identity matrix
I = eye(2)
% The above notation is the same as I = [1,0;0,1]
% What happens when we multiply I*A ?
IA = I*A
% How about A*I ?
AI = A*I
% Compute A*B
AB = A*B
% Is it equal to B*A?
BA = B*A
% Note that IA = AI but AB != BA
The inverse of a matrix A is denoted A^-1. Multiplying a matrix by its inverse results in the identity matrix. A non-square matrix does not have an inverse. Matrices that don't have an inverse are singular or degenerate.
> A = [1 2 3; 4 5 6];
> pinv(A) % pseudo-inverse; defined even for this non-square A
> inv(A) % inverse; raises an error here because A is not square
> transpose(A);
Octave and MATLAB both provide inv(A), which computes the inverse of a square non-singular matrix, and pinv(A), which computes the pseudo-inverse and gives the same result as inv(A) when A is invertible.
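A short check of these statements, using a square matrix so that the inverse exists. The matrices A and S are assumed values picked for illustration.

```octave
% Square, non-singular matrix (assumed example; det(A) = 4*6 - 7*2 = 10)
A = [4, 7; 2, 6];

A_inv   = inv(A);
I_check = A_inv * A;    % numerically the 2 x 2 identity matrix

% For an invertible matrix the pseudo-inverse agrees with the inverse
A_pinv = pinv(A);

% Singular matrix: the second row is twice the first, so det(S) = 0
% and S has no inverse
S = [1, 2; 2, 4];
det_S = det(S);
```

Comparing I_check with eye(2) should be done with a small tolerance rather than exact equality, because the inverse is computed in floating point.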
Transposing a matrix is like rotating it 90° clockwise and then reversing it (mirroring it left to right). We can compute the transpose of a matrix in MATLAB or Octave with the transpose(A) function or A'.
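A brief sketch of the transpose on a non-square matrix; the matrix A is an assumed example. Element (i, j) of A becomes element (j, i) of the result, so a 2 x 3 matrix turns into a 3 x 2 matrix.

```octave
% 2 x 3 matrix (assumed example values)
A = [1, 2, 3; 4, 5, 6];

A_t  = A';             % 3 x 2: rows of A become columns
A_t2 = transpose(A);   % same result for a real-valued matrix

size_A   = size(A);    % [2 3]
size_A_t = size(A_t);  % [3 2]
```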
Created: 2021-12-06 Mon 22:14