You may enjoy this article more if you know Differential Calculus.

That was my feeling during the first 7-minute video lecture.

For years I had been hearing about machine learning and AI as something remarkable, still being developed by researchers at the best universities in the world and aimed at PhDs. Still carrying that idea, I decided to attend Stanford’s Machine Learning course offered on Coursera.

For me, one of its popular definitions reinforces this idea of it being something magical:

Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel (1959)

Knowing enough about machines, I used to ask myself: “How can a computer do something I haven’t explicitly programmed?”. The answer (at least in the scenario presented by Andrew Ng in the course) is: it doesn’t. Even without fully understanding all the models presented, I can already come up with similar, if probably less efficient, solutions to the same problems.

Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Tom Mitchell (1998)

Putting aside the letters, which may make it harder to follow, this definition makes Machine Learning something more tangible to programmers:

By recording and analyzing the results of predefined tasks, the software groups them into satisfactory or not. The satisfactory ones may go through another round of analysis, attempting to improve and reach the expected performance.

Irio Musskopf (2014)

Andrew classifies Machine Learning algorithms between supervised and unsupervised learning. Both need to be fed with data to be analyzed.

Supervised learning is when we give the right answers together with the data. One of its examples, and the most explored this week, is housing price prediction. Based on a database of house characteristics (starting with size in square feet), the algorithm should try to answer the question of how much a given house is worth.

In this case, it’s supervised because we already know the answers for some prior data. Say two houses of 1,000 square feet were sold for $280,000 and $300,000. For a house of 1,500 square feet, we might guess something like $400,000. The more data we feed in, the more trustworthy the guesses become. It’s also a regression analysis problem, because we’re trying to build a continuous line over all possible square footage values, even though we start with answers for only a few of them.
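As a taste of the simple, probably inefficient, but workable solutions a programmer might come up with, here is a naive Python sketch: guess a price by averaging the k known houses nearest in size. The first two data points are the ones from the text; the third is hypothetical, just to make the example less trivial.

```python
# Naive regression sketch (not the course's algorithm): average the prices
# of the k houses closest in size to the one we want to price.
# The (2000, 450_000) example is hypothetical.
training = [(1000, 280_000), (1000, 300_000), (2000, 450_000)]

def predict(size, k=2):
    # pick the k training examples nearest in size, then average their prices
    nearest = sorted(training, key=lambda p: abs(p[0] - size))[:k]
    return sum(price for _, price in nearest) / len(nearest)

print(predict(1000))  # averages the two 1,000 sq ft sales -> 290000.0
```

It isn’t the course’s method, but it already “learns” from examples without an explicit pricing formula.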

Another possible case for supervised learning is the one called classification. Given a yes/no question, it sorts the data into the distinct groups.

In the course example, breast cancer is categorized as malignant or benign, using the question “Is it malignant?”, according to the tumor size.
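A minimal sketch of that idea: learn a size threshold from labeled examples and answer the yes/no question with it. The sizes and labels below are hypothetical, not the course dataset.

```python
# Toy classifier: label a tumor malignant when its size exceeds a threshold
# learned from labeled examples. Data is hypothetical: (size_cm, malignant).
samples = [(0.5, False), (1.0, False), (2.5, True), (3.0, True)]

def learn_threshold(data):
    largest_benign = max(size for size, malignant in data if not malignant)
    smallest_malignant = min(size for size, malignant in data if malignant)
    # split the two classes at the midpoint between them
    return (largest_benign + smallest_malignant) / 2

def is_malignant(size, threshold):
    return size > threshold

t = learn_threshold(samples)       # 1.75 for this data
print(is_malignant(2.8, t))        # True
```

Real classifiers are more robust than a single midpoint, but the shape of the problem (labeled answers in, yes/no out) is the same.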

My most exciting thought about supervised learning is that we don’t need to build a fancy formula relating each characteristic to the results (sometimes we should, but it isn’t required). You may build software that analyzes existing data and finds the relationships between variables and results for you. And there is no magic involved in the process.

Be careful: section with explicit Math.

[Figure: two-dimensional graph with blue dots scattered around a red line]

Imagining this (axes and blue dots) as a plot of the housing prices from our existing data, one possible tool for doing regression analysis is Linear Regression. It gives us a straight line that matches the training examples, represented by the blue dots, as closely as possible.

$$ h_\theta(x) = \theta_0 + \theta_1x $$

Hypotheses for the red line must follow the format of a line equation, where $\theta_0$ and $\theta_1$ are constants.

To find the best solution for the equation, we define a function that measures how good a candidate solution is:

$$ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 $$

This function weighs how far the line formed by given $\theta_0$ and $\theta_1$ is from the list of training examples. Unless every one of the blue dots lies exactly on the red line, the function will always return a positive number. We aim to minimize this so-called “cost function” as much as possible. To accomplish this job, we are introduced to the gradient descent algorithm.

$$ \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1) $$

The colon-equals operator means assignment: you give a value to the “name” $\theta_j$.

To accomplish the algorithm’s goal, we run the above statement twice per step (once for each $\theta$ whose value we are trying to discover). The first gives us the value for $\theta_0$ and the second for $\theta_1$.

$$ \theta_0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0} J(\theta_0, \theta_1) $$

$$ \theta_1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1} J(\theta_0, \theta_1) $$

We consider these two assignments a single iteration, and they must happen simultaneously: compute both new values before assigning either. To start, take initial values of 0 for $\theta_0$ and 1 for $\theta_1$.

The weird $\frac{\partial}{\partial\theta_j}$ symbol says we should take the partial derivative of the cost function. In other words, on each assignment of $\theta_j$, we follow the slope and “walk” a little further toward the local minimum (the place with 0 slope).

Running one iteration after another changes the values of both $\theta$’s, bringing them closer to the local minimum. The time to stop the assignments is when the results stay unchanged between two iterations (convergence). At that point, we will have the best possible values for $\theta_0$ and $\theta_1$. Or, if you prefer, the equation for that red line above.
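The whole loop fits in a few lines of Python. A sketch, with rescaled, partly hypothetical data (sizes in thousands of square feet, prices in units of $100k) so that a simple fixed learning rate converges:

```python
# Gradient descent for linear regression with one feature.
# Data: sizes in 1000 sq ft, prices in $100k; the third point is hypothetical.
xs = [1.0, 1.0, 2.0]
ys = [2.8, 3.0, 4.5]
m = len(xs)
alpha = 0.1                # learning rate

theta0, theta1 = 0.0, 1.0  # the starting values suggested in the text
for _ in range(5000):
    # partial derivatives of J with respect to theta0 and theta1
    d0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
    d1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
    # simultaneous update: both new values are computed before either is assigned
    theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1

print(round(theta0 + theta1 * 1.5, 2))  # predicted price (in $100k) for 1,500 sq ft
```

Note the tuple assignment on the last line of the loop: it is what makes the update simultaneous, which the algorithm requires.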

With unsupervised learning, you feed an algorithm data and expect it to be grouped in some way: I don’t know how, but I want it split into meaningful groups. It’s the kind of learning used in applications like Google News to populate a “related articles” section, or to put an article into the “business” rather than the “sports” category.
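To make “group it in some way, I don’t know how” concrete, here is a toy sketch of one clustering approach (a simple one-dimensional k-means; I’m not claiming this is what Google News uses). Notice that no right answers are given, only the data:

```python
# Toy 1-D k-means: split numbers into k groups with no labels provided.
def kmeans_1d(points, k=2, iterations=20):
    centers = points[:k]  # naive initialization: first k points
    for _ in range(iterations):
        # assign each point to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # move each center to the mean of its group
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

print(kmeans_1d([1, 2, 3, 10, 11, 12]))  # [[1, 2, 3], [10, 11, 12]]
```

The algorithm discovers the two clumps on its own, which is the essence of unsupervised learning.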

So far, this kind of learning hasn’t been explored in depth in the course, but the Eureka! moment comes from the possibilities. Tagging articles is useful, and may help unsupervised machine learning, but most people don’t know how to do it well, and it isn’t a reasonable choice for systems the size of Google’s. For the majority of cases, a computer could do a better job (considering cost-benefit) than a human.