A Few Useful Things to Know About Machine Learning

$$ %---- MACROS FOR SETS ----% \newcommand{\znz}[1]{\mathbb{Z} / #1 \mathbb{Z}} \newcommand{\twoheadrightarrowtail}{\mapsto\mathrel{\mspace{-15mu}}\rightarrow} % popular set names \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\I}{\mathbb{I}} % popular vector space notation \newcommand{\V}{\mathbb{V}} \newcommand{\W}{\mathbb{W}} \newcommand{\B}{\mathbb{B}} \newcommand{\D}{\mathbb{D}} %---- MACROS FOR FUNCTIONS ----% % linear algebra \newcommand{\T}{\mathrm{T}} \renewcommand{\ker}{\mathrm{ker}} \newcommand{\range}{\mathrm{range}} \renewcommand{\span}{\mathrm{span}} \newcommand{\rref}{\mathrm{rref}} \renewcommand{\dim}{\mathrm{dim}} \newcommand{\col}{\mathrm{col}} \newcommand{\nullspace}{\mathrm{null}} \newcommand{\row}{\mathrm{row}} \newcommand{\rank}{\mathrm{rank}} \newcommand{\nullity}{\mathrm{nullity}} \renewcommand{\det}{\mathrm{det}} \newcommand{\proj}{\mathrm{proj}} \renewcommand{\H}{\mathrm{H}} \newcommand{\trace}{\mathrm{trace}} \newcommand{\diag}{\mathrm{diag}} \newcommand{\card}{\mathrm{card}} \newcommand\norm[1]{\left\lVert#1\right\rVert} % differential equations \newcommand{\laplace}[1]{\mathcal{L}\{#1\}} \newcommand{\F}{\mathrm{F}} % misc \newcommand{\sign}{\mathrm{sign}} \newcommand{\softmax}{\mathrm{softmax}} \renewcommand{\th}{\mathrm{th}} \newcommand{\adj}{\mathrm{adj}} \newcommand{\hyp}{\mathrm{hyp}} \renewcommand{\max}{\mathrm{max}} \renewcommand{\min}{\mathrm{min}} \newcommand{\where}{\mathrm{\ where\ }} \newcommand{\abs}[1]{\vert #1 \vert} \newcommand{\bigabs}[1]{\big\vert #1 \big\vert} \newcommand{\biggerabs}[1]{\Bigg\vert #1 \Bigg\vert} \newcommand{\equivalent}{\equiv} \newcommand{\cross}{\times} % statistics \newcommand{\cov}{\mathrm{cov}} \newcommand{\var}{\mathrm{var}} \newcommand{\bias}{\mathrm{bias}} \newcommand{\E}{\mathrm{E}} \newcommand{\prob}{\mathrm{prob}} \newcommand{\unif}{\mathrm{unif}} \newcommand{\invNorm}{\mathrm{invNorm}} \newcommand{\invT}{\mathrm{invT}} % real analysis \renewcommand{\sup}{\mathrm{sup}} \renewcommand{\inf}{\mathrm{inf}} %---- MACROS FOR ALIASES AND REFORMATTING ----% % logic \newcommand{\forevery}{\ \forall\ } \newcommand{\OR}{\lor} \newcommand{\AND}{\land} \newcommand{\then}{\implies} % set theory \newcommand{\impropersubset}{\subseteq} \newcommand{\notimpropersubset}{\nsubseteq} \newcommand{\propersubset}{\subset} \newcommand{\notpropersubset}{\not\subset} \newcommand{\union}{\cup} \newcommand{\Union}[2]{\bigcup\limits_{#1}^{#2}} \newcommand{\intersect}{\cap} \newcommand{\Intersect}[2]{\bigcap\limits_{#1}^{#2}} \newcommand{\intersection}[2]{\bigcap\limits_{#1}^{#2}} \newcommand{\Intersection}[2]{\bigcap\limits_{#1}^{#2}} \newcommand{\closure}{\overline} \newcommand{\compose}{\circ} % linear algebra \newcommand{\subspace}{\le} \newcommand{\angles}[1]{\langle #1 \rangle} \newcommand{\identity}{\mathbb{1}} \newcommand{\orthogonal}{\perp} \renewcommand{\parallel}[1]{#1^{||}} % calculus \newcommand{\integral}[2]{\int\limits_{#1}^{#2}} \newcommand{\limit}[1]{\lim\limits_{#1}} \newcommand{\approaches}{\rightarrow} \renewcommand{\to}{\rightarrow} \newcommand{\convergesto}{\rightarrow} % algebra \newcommand{\summation}[2]{\sum\limits_{#1}^{#2}} \newcommand{\product}[2]{\prod\limits_{#1}^{#2}} \newcommand{\by}{\times} \newcommand{\integral}[2]{\int_{#1}^{#2}} % exists commands \newcommand{\notexist}{\nexists\ } \newcommand{\existsatleastone}{\exists\ } \newcommand{\existsonlyone}{\exists!} \newcommand{\existsunique}{\exists!} 
\let\oldexists\exists \renewcommand{\exists}{\ \oldexists\ } % statistics \newcommand{\distributed}{\sim} \newcommand{\onetoonecorresp}{\sim} \newcommand{\independent}{\perp\!\!\!\perp} \newcommand{\conditionedon}{\ |\ } \newcommand{\given}{\ |\ } \newcommand{\notg}{\ngtr} \newcommand{\yhat}{\hat{y}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\muhat}{\hat{\mu}} \newcommand{\transmatrix}{\mathrm{P}} \renewcommand{\choose}{\binom} % misc \newcommand{\infinity}{\infty} \renewcommand{\bold}{\textbf} \newcommand{\italics}{\textit} $$

A Few Useful Things to Know about Machine Learning is a high-level machine learning paper written by Pedro Domingos of the Computer Science and Engineering department at the University of Washington. The paper details some useful machine learning guidelines; the following are some highlights I took from it.

We’re after generalization when creating models, so there are a few things to note:

  • Cross-validation is a must; that is, randomly dividing your training data into $k$ subsets (folds), holding out each fold in turn while training on the rest, validating the model on the held-out fold, and then averaging the results.
  • Although cross-validation helps you choose the best hyperparameters, your model is still biased toward the validation data. As a final unbiased check, you need to test your model on a held-out subset that was used for neither training nor validation (see the sketch after this list).
  • The objective function is only a proxy for the true goal, so we may not even need a global optimum.
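
To make the first two points concrete, here is a minimal sketch (synthetic data and scikit-learn, not code from the paper): $k$-fold cross-validation picks a hyperparameter, and a held-out test set, touched exactly once at the end, provides the final unbiased estimate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a final test set that is used exactly once, at the very end.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the development data to pick a hyperparameter.
best_C, best_score = None, -np.inf
for C in [0.01, 0.1, 1.0, 10.0]:
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X_dev, y_dev,
                             cv=KFold(n_splits=5, shuffle=True, random_state=0))
    if scores.mean() > best_score:
        best_C, best_score = C, scores.mean()

# Refit on all development data; report the one number the test set gives you.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_dev, y_dev)
print(f"cross-val: {best_score:.3f}, held-out test: {final_model.score(X_test, y_test):.3f}")
```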

Problems with high-dimensional datasets:

  • Choosing to increase the number of dimensions of your dataset grows the input space exponentially. For example, if you have 100 variables each taking a binary value, then your input space contains $2^{100} \approx 10^{30}$ points. Even if you have a million samples, there are still $2^{100}-10^6$ points whose classes you don’t know; your data covers only about a $10^{-24}$ fraction of the space. This makes generalization a lot harder.
  • The more irrelevant variables you have, the more they drown out the relevant ones, and your model effectively makes random predictions because it was trained on noise.
  • The similarity-based reasoning that models rely on to learn breaks down in high dimensions: samples become sparse and all start to look alike, with a point’s nearest neighbor barely closer than its farthest (see the sketch after this list).
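
A quick way to see the last point (my own illustration, not the paper’s): draw random points in the unit hypercube and compare a query’s nearest and farthest neighbors. As the dimension grows, the ratio approaches 1 and “nearest” stops meaning much.

```python
import numpy as np

rng = np.random.default_rng(0)

# In high dimensions the nearest and farthest neighbors of a query point
# are almost equally far away, so similarity-based reasoning degrades.
for d in [2, 10, 100, 1000]:
    points = rng.uniform(size=(1000, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"d={d:4d}  nearest/farthest distance ratio: {dists.min() / dists.max():.3f}")
```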

Feature engineering is really important:

  • Automate as much of it as you can
  • Features that look irrelevant in isolation may be relevant in combination (see the XOR sketch after this list)
  • But brute-forcing every feature combination can be intractable, so use your smarts when feature engineering
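
The classic example of “irrelevant in isolation, relevant in combination” is XOR; here is a small sketch (my own, on made-up data): neither feature predicts the label on its own, yet together they determine it exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two binary features; the label is their XOR.
X = rng.integers(0, 2, size=(1000, 2))
y = X[:, 0] ^ X[:, 1]

# Each feature alone is useless: conditioning on it leaves y at roughly 50/50.
for j in range(2):
    print(f"P(y=1 | x{j}=0) = {y[X[:, j] == 0].mean():.2f}, "
          f"P(y=1 | x{j}=1) = {y[X[:, j] == 1].mean():.2f}")

# Together they determine y exactly; a depth-2 tree separates them perfectly.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(f"accuracy using both features together: {tree.score(X, y):.2f}")
```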

On models:

  • A less sophisticated algorithm with lots of data can beat a clever algorithm with only a moderate amount of data
  • Although model accuracy is important, insights and reduction in human labor are usually more important; this makes decision trees and other rule-based learners attractive
  • Model ensembles are awesome; this includes bagging, boosting, and stacking (a bagging sketch appears after this list)
  • There isn’t a necessary connection between the number of parameters a model has and its tendency to overfit. SVMs can effectively have a very large number of parameters yet avoid overfitting. Conversely, even though $\sign(\sin(ax))$ has only one parameter, it can discriminate an arbitrarily large, arbitrarily labeled set of points on the $x$ axis (demonstrated in the last sketch after this list).
  • A model with a larger hypothesis space that tries fewer hypotheses from it is less likely to overfit than a model that tries more hypotheses from a smaller space
  • Simpler models should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy
  • Just because a function can be represented by a model doesn’t mean the model can learn the function. This is because:
    • The number of training samples you have might not be enough
    • Most models can only learn a tiny subset of all possible functions, and these subsets differ from model to model; in practice the model also often gets stuck in a local optimum
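
As a sketch of the simplest ensemble method mentioned above, here is hand-rolled bagging (my own illustration on synthetic data): train each tree on a bootstrap resample and majority-vote the predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: each tree sees a bootstrap resample; predictions are majority-voted.
trees = []
for _ in range(50):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # sample with replacement
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

votes = np.mean([t.predict(X_test) for t in trees], axis=0)
bagged_acc = np.mean((votes > 0.5) == y_test)
single_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
print(f"single tree: {single_acc:.3f}, bagged ensemble: {bagged_acc:.3f}")
```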
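
And to verify the $\sign(\sin(ax))$ claim, here is a construction that appears in learning-theory texts (it is not spelled out in the paper): place the points at $x_j = 10^{-j}$; then for any labeling there is a single value of $a$ that realizes it.

```python
import math
from itertools import product

# Place points at x_j = 10^(-j). For labels y_1..y_n in {0, 1}, the choice
# a = pi * (1 + sum_i (1 - y_i) * 10^i) makes sin(a * x_j) > 0 exactly when
# y_j = 1, so one parameter shatters all 2^n labelings.
n = 5
xs = [10.0 ** -(j + 1) for j in range(n)]

for labels in product([0, 1], repeat=n):  # every possible labeling
    a = math.pi * (1 + sum((1 - y) * 10 ** (i + 1) for i, y in enumerate(labels)))
    realized = tuple(1 if math.sin(a * x) > 0 else 0 for x in xs)
    assert realized == labels, (labels, realized)

print(f"one parameter realized all {2 ** n} labelings of {n} points")
```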