Sort2013 Part III: Machine Learning in Python

I haven’t done much with machine learning since graduating from school years ago. Recently, however, there have been a number of projects where machine learning could bring a significant benefit. This lecture was a great refresher and an introduction to how these tasks can be accomplished using Python.

Why should we focus on machine learning now?

The power of a machine learning algorithm is its ability to GENERALIZE from a finite set of examples.


Clustering is grouping items so that items in the same cluster are more similar to each other than to items in other clusters.

There are a few types of clustering, and the lecture showed how they compare. Below is a list of the different cluster types.

In the lecture he specifically covered K-Means clustering.

K-Means clustering

It lets you take a set of feature vectors and figure out how the data points should group together (which points are similar to each other).

Given a training data set and a number of clusters, find the positions of the centroids. The weakness of k-means, however, is that you need to specify the number of clusters, k, up front.
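The centroid search described above can be sketched in a few lines of plain Python. This is only a minimal illustration of the usual assign-then-update loop (Lloyd's algorithm), not the code shown in the lecture; the function names `kmeans` and `sq_dist`, the sample points, and the choice of starting from randomly picked training points are all assumptions made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of feature vectors (tuples)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k training points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # nothing moved, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, labels


# Hypothetical data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Note that `k` has to be passed in by the caller, which is exactly the weakness mentioned above: the algorithm will happily return two clusters even if the data really contains three.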

It is often used for image color compression (converting a 16-bit image to a 6-bit image), which is accomplished by replacing each pixel color in the original with the color of its nearest k-means centroid.
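The pixel replacement step can be sketched as follows. The example assumes the centroids have already been found by a k-means run; the tiny two-color `palette`, the sample `pixels`, and the helper names `nearest_color` and `quantize` are hypothetical and exist only for illustration. A real 6-bit result would use a palette of 64 centroids.

```python
def nearest_color(pixel, palette):
    """Return the palette color closest to pixel (squared distance in RGB)."""
    return min(
        palette,
        key=lambda color: sum((a - b) ** 2 for a, b in zip(pixel, color)),
    )


def quantize(pixels, palette):
    """Replace every pixel with its nearest palette color."""
    return [nearest_color(p, palette) for p in pixels]


# Hypothetical palette standing in for the k-means centroids (k = 2 here).
palette = [(250, 10, 10), (10, 10, 240)]
# Hypothetical image, flattened to a list of (R, G, B) pixels.
pixels = [(255, 0, 0), (200, 30, 20), (0, 0, 255), (30, 20, 220)]

compressed = quantize(pixels, palette)
print(compressed)
```

After this step the image contains only palette colors, so each pixel can be stored as a small index into the palette instead of a full color value.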



Regression Prediction (Intuition)


Gradient Descent



Precision: of all the objects classified as A, the percentage that really are A
Recall: of all the objects that really are A, the percentage that we actually classified as A
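As a quick worked example of the two definitions, here is a small sketch that counts the cases directly. The function name `precision_recall` and the label lists are made up for illustration.

```python
def precision_recall(y_true, y_pred, positive="A"):
    """Compute precision and recall for one class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)  # correctly called A
    fp = sum(1 for t, p in pairs if p == positive and t != positive)  # called A, but not A
    fn = sum(1 for t, p in pairs if p != positive and t == positive)  # A, but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and classifier output.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "B", "B", "A", "B"]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here three objects were classified as A and two of them really are A, so precision is 2/3. Four objects really are A and two of them were found, so recall is 1/2.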

Decision Tree

Fruit & Vegetable Example

A decision tree can be constructed so that every path from the root to a leaf becomes a rule.

The splits in the tree can be chosen using entropy, a measure of unpredictability. A fair coin toss has an entropy of 1 bit. The split with the highest information gain leaves the least entropy in the resulting groups.

Overfitting is a disadvantage of decision trees: a tree that follows the training data too closely does not produce the clean separation needed on new data. This can be mitigated using “Random Forests”, which combine multiple decision trees.
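To make the entropy idea concrete, here is a small sketch that computes entropy in bits and the information gain of a candidate split. The fruit and vegetable labels and the two candidate splits are invented for this example, and the helper names `entropy` and `information_gain` are assumptions, not anything from the lecture.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    remainder = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder


# A fair coin: two equally likely outcomes.
coin_entropy = entropy(["heads", "tails"])

# Hypothetical training set: 4 fruits and 4 vegetables.
parent = ["fruit"] * 4 + ["vegetable"] * 4
# Candidate split 1 separates the classes perfectly.
clean_split = [["fruit"] * 4, ["vegetable"] * 4]
# Candidate split 2 leaves both groups as mixed as the parent.
mixed_split = [["fruit", "fruit", "vegetable", "vegetable"]] * 2

gain_clean = information_gain(parent, clean_split)
gain_mixed = information_gain(parent, mixed_split)
print(coin_entropy, gain_clean, gain_mixed)
```

The clean split removes all the unpredictability (a gain of 1 bit), while the mixed split removes none, so a tree builder that maximizes information gain would pick the clean one.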


The main question one might ask is why to look at Python as the language for machine learning. Well, it turns out that Python has basically become the de facto standard language for scientific tools.



Resources is a machine learning competition problem.

Coursera classes