darkmighty 18 hours ago [-]

If you're looking for a bit of an unconventional entry point, I recommend the seminal text 'Elements of Information Theory' by T. Cover (skipping chapters like Network Information/Gaussian channel should be fine), paired with David MacKay's 'Information Theory, Inference and Learning Algorithms'. Both seem available online:

They cover some fundamentals of what optimal inference looks like, why current methods work, etc. (in a very abstract way through Kolmogorov complexity and its theorems, and in a more concrete way in MacKay's text). Another good theoretical companion could be the 'Learning from data' course, though it is a little more applied: (also available for free)

Excellent lecturer/material (to give a glimpse, take lecture 6: 'Theory of Generalization -- how an infinite model can learn from a finite sample').

Afterward I would move to modern developments (deep learning, or whatever interests you), but you'll be well equipped.




Tree-based methods

Topological data analysis


What works well where, and lists of top algorithms

Test sets


Vision tools

Object detection:




Case studies / examples

Reinforcement learning case studies


NLP techniques



Try to combine all (or many) of the classifiers (not via ensemble methods, but by combining their core concepts to make a new algorithm).

e.g. combine random forest, neural net, SVM, genetic algorithm, stochastic local search, nearest neighbor

e.g. deep neural decision forests seem to be an attempt to combine the first two

Perhaps some of the combinations would only amount to a linear combination of the output scores, but it would be better to find 'the key idea(s)' of each one.
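The 'baseline' case above, a weighted linear blend of per-classifier scores, can be sketched in a few lines (the classifier names, scores, and weights here are all invented for illustration):

```python
import numpy as np

# Hypothetical positive-class probability scores from three trained
# classifiers, for the same 4 samples.
scores = {
    "random_forest": np.array([0.9, 0.2, 0.6, 0.4]),
    "neural_net":    np.array([0.8, 0.1, 0.7, 0.5]),
    "svm":           np.array([0.7, 0.3, 0.5, 0.6]),
}

# Linear combination of output scores: fixed (made-up) weights summing to 1.
weights = {"random_forest": 0.5, "neural_net": 0.3, "svm": 0.2}
blended = sum(w * scores[name] for name, w in weights.items())

# Threshold the blended score to get final labels.
labels = (blended >= 0.5).astype(int)
```

In practice the weights would be fit on held-out data; the deeper combinations the note asks for (merging the algorithms' core ideas, as deep neural decision forests do) require redesigning the models themselves, not just their outputs.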

Intros to and notes on classic algorithms



deep learning


evolution strategies

imitation learning

reinforcement learning of collaboration


what works well where

natural language processing (NLP)

case studies / examples / instances


When CPPNs are used to generate the connectivity patterns of evolving ANNs, the resulting algorithm, also from my lab, is called HyperNEAT (Hypercube-based NEAT, co-invented with David D'Ambrosio and Jason Gauci) because under one mathematical interpretation, the CPPN can be conceived as painting the inside of a hypercube that represents the connectivity of an ANN. Through this technique, we began to evolve ANNs with hundreds of thousands to millions of connections. Indirectly encoded ANNs have proven useful, in particular, for evolving robot gaits because their regular connectivity patterns tend to support the regularity of motions involved in walking or running. Researchers like Jeff Clune have helped to highlight the advantages of CPPNs and HyperNEAT through rigorous studies of their various properties. Other labs have also explored different indirect encodings in neuroevolution, such as the compressed networks of Jan Koutník, Giuseppe Cuccu, Jürgen Schmidhuber, and Faustino Gomez."


moultano 5 hours ago [-]

A practical issue for Naive Bayes that also infects linear models is bias w.r.t. document length. Typically when you are detecting a rare, relatively compact class such as sports articles (or spam) you will tend to have a strongly negative prior, many positive features, and few negative ones. As a consequence, as the length of your text increases, not only does the variance of your prediction increase, but the mean tends to as well. This leads to all very long documents being classified as positive, regardless of their text. You can observe this by training your model and then classifying /usr/dict/words.

This is the most common mistake I've seen in production use of linear models on document text. Invariably, they'll misfire on any unusually long document.
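A toy sketch of the failure described above (the vocabulary size, weight distribution, and document lengths are all made up, not moultano's actual setup): a negative prior plus mostly-positive weights makes the expected score drift upward with length.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear text classifier: strongly negative prior b, and weights whose
# mean is slightly positive ("many positive features, few negative ones").
b = -3.0
w = rng.normal(loc=0.05, scale=1.0, size=10_000)  # E[w] > 0

def doc_score(n_tokens):
    # Score of a document made of n random tokens: b + sum of token weights.
    tokens = rng.integers(0, w.size, size=n_tokens)
    return b + w[tokens].sum()

short_mean = np.mean([doc_score(20) for _ in range(200)])
long_mean = np.mean([doc_score(2_000) for _ in range(200)])
# Long documents score positive even though their tokens are random noise,
# which is the /usr/dict/words failure mode described above.
```

The documents here are pure noise, yet length alone flips the classification, matching the observation that any unusually long document misfires.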


mooman219 3 hours ago [-]

I agree, there are issues with NB such as the ones you brought up, but I don't think document length is the real offender here. This really boils down to noise and how well you filter and devalue it. Stacking more filters like stemming, stopword removal, and high-frequency feature pruning definitely helps, to the point where longer documents can actually improve accuracy. Additionally, tuning your n-gram lengths (or using variable lengths), choosing between word and character n-grams, and limiting your distribution size will all help, depending on what you're trying to categorize.
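A minimal sketch of two of those knobs, stopword filtering and character n-grams (the stopword list and n=3 are placeholder choices, not recommendations):

```python
STOPWORDS = {"the", "a", "an", "of", "and"}  # illustrative, not a real list

def char_ngrams(token, n=3):
    # Character n-grams: one of the tuning knobs mentioned above.
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def preprocess(text, n=3):
    # Drop stopwords, then emit character n-grams for each remaining token.
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return [g for t in tokens for g in char_ngrams(t, n)]
```

The output of `preprocess` would then feed the NB feature counts in place of raw words.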


intune 4 hours ago [-]

Is there some way to normalize the document length?


moultano 4 hours ago [-]

Lots of reasonable hacks.

1. Use only the beginning of the document, as that's probably the most important part anyways, and it's fast.

2. Divide the sum of your feature scores by sqrt(n) to give it constant variance, and hopefully keep it comparable with your prior.

3. Split the doc into reasonably sized chunks, and average their scores rather than adding them.
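Hacks 2 and 3 might look like this (a sketch; `feature_scores` stands for the per-token weights already looked up for a document, and the chunk size is arbitrary):

```python
import numpy as np

def score_sqrt(feature_scores, b):
    # Hack 2: divide the summed feature scores by sqrt(n),
    # giving the sum constant variance regardless of length.
    s = np.asarray(feature_scores, dtype=float)
    return b + s.sum() / np.sqrt(len(s))

def score_chunked(feature_scores, b, chunk=200):
    # Hack 3: score fixed-size chunks and average the chunk scores
    # instead of summing over the whole document.
    s = np.asarray(feature_scores, dtype=float)
    chunks = [s[i:i + chunk] for i in range(0, len(s), chunk)]
    return b + np.mean([c.sum() for c in chunks])
```

With chunk averaging, a 200-token and a 2000-token document of the same per-token evidence get the same score, so the prior stays comparable at any length.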


Houshalter 3 hours ago [-]

You can use term frequency instead of binary features. This is invariant to the size of the document. This is called multinomial naive Bayes:


moultano 2 hours ago [-]

This is not invariant to the size of the document (though agreed, generally better). It doesn't solve the problem of having mostly positive features and a negative prior.

Stated more formally, your model is b + wᵀx. Generally, b < 0 and E[wᵀx] > 0. As the document grows, wᵀx tends to dominate b. You'll have bias with length as long as E[wᵀx] ≠ 0 and there are no constraints on w that would force it to zero.
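A numeric illustration of that argument (toy numbers; the vocabulary, weight distribution, and lengths are invented): with raw term counts as x, wᵀx is a sum over all tokens and grows roughly linearly with document length, while b stays fixed.

```python
import numpy as np

rng = np.random.default_rng(1)
b = -3.0
w = rng.normal(loc=0.02, scale=0.5, size=5_000)  # E[w] > 0, as in the comment

def count_feature_score(n_tokens):
    # x = raw term counts (multinomial-style features), so
    # w @ x is a sum of n_tokens weights and scales with length.
    x = np.bincount(rng.integers(0, w.size, size=n_tokens), minlength=w.size)
    return b + w @ x

short = np.mean([count_feature_score(50) for _ in range(100)])
long_ = np.mean([count_feature_score(5_000) for _ in range(100)])
# wᵀx dominates b for the long documents: the mean score goes positive.
```

Switching from binary to count features changes the feature values, not the sign of E[wᵀx], so the length bias persists.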


" There is a temptation to use just the word pair counts, skipping SVD, but it won't yield the best results. Creating vectors not only compresses the data, but also finds general patterns. This compression is especially important for less frequent words (otherwise we get a lot of overfitting). See "Why do low dimensional embeddings work better than high-dimensional ones?" from "
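A toy version of that pipeline (the co-occurrence counts and the rank k=2 are made up for illustration): truncated SVD turns raw word-pair counts into low-dimensional vectors, and the rank constraint is the compression that forces shared structure across words.

```python
import numpy as np

# Toy word-pair count matrix: rows = words, columns = context words.
counts = np.array([
    [10.0, 2.0, 0.0, 1.0],
    [ 9.0, 3.0, 0.0, 0.0],
    [ 0.0, 1.0, 8.0, 7.0],
    [ 1.0, 0.0, 9.0, 6.0],
])

# Truncated SVD: keeping only k dimensions is the compression step that
# makes similar rows share structure, helping rare words generalize.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
embeddings = U[:, :k] * s[:k]   # dense, low-dimensional word vectors
approx = embeddings @ Vt[:k]    # rank-k reconstruction of the counts
```

Words with similar contexts (rows 0 and 1 here) end up with nearby embeddings even where their raw counts differ, which raw counts alone would not give you.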

maps and glossaries

meta learning deep learning architectures

standard datasets and tasks and benchmarks and contests


to read (for me)