Intro
If you're looking for a bit of an unconventional entry point, I recommend the seminal text 'Elements of Information Theory' by T. Cover (skipping chapters like Network Information/Gaussian channel should be fine), paired with David MacKay's 'Information Theory, Inference and Learning Algorithms'. Both seem available online:
http://www.cs-114.org/wp-content/uploads/2015/01/Elements_of_Information_Theory_Elements.pdf
http://www.inference.org.uk/itprnn/book.pdf
They cover some fundamentals of what optimal inference looks like, why current methods work, etc. (in a very abstract way via Kolmogorov complexity and its theorems, and more concretely in MacKay's text). Another good theoretical companion, though a little more applied, could be the 'Learning from Data' course (also available for free):
https://work.caltech.edu/telecourse.html
Excellent lecturer/material (to give a glimpse, take lecture 6: 'Theory of Generalization -- how an infinite model can learn from a finite sample').
Afterward I would move to modern developments (deep learning, or whatever interests you), but you'll be well equipped.
Books
Survey
CCM
https://en.m.wikipedia.org/wiki/Convergent_cross_mapping
Tree-based methods
Topological data analysis
Features
What works well where, and lists of top algorithms
Test sets
Tools
I have been using Spacy3 nightly for a while now. This is game changing.
Spacy3 practically covers 90% of NLP use-cases with near SOTA performance. The only reason to not use it would be if you are literally pushing the boundaries of NLP or building something super specialized.
Hugging Face and Spacy (also Pytorch, but duh) are saving millions of dollars in man hours for companies around the world. They've been a revelation.
JPKab:
Everything in the above paragraph sounds like a hyped overstatement. None of it is.
As someone who's worked on some rather intensive NLP implementations, Spacy 3.0 and HuggingFace both represent the culmination of a technological leap in NLP that started a few years ago with the advent of transfer learning in NLP. The level of accessibility to the masses these libraries offer is game-changing and democratizing.
-- [2]
binarymax:
I have lots of experience with both, and I use both together for different use cases. SpaCy fills the need of predictable/explainable pattern matching and NER - and is very fast and reasonably accurate on a CPU. Huggingface fills the need for task based prediction when you have a GPU.
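(A small sketch of the spaCy side described above: the pretrained pipeline gives CPU-friendly NER, and the rule-based Matcher gives fully explainable pattern matching. It assumes spaCy 3.x and that the small English model has been installed via `python -m spacy download en_core_web_sm`; the example sentence and pattern are made up.)

```python
# spaCy on CPU: pretrained NER plus explicit, rule-based pattern matching.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hugging Face raised $40 million; spaCy version 3.0 added transformer pipelines.")

# Statistical NER from the pretrained pipeline.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Explainable pattern matching: "version" followed by a number.
matcher = Matcher(nlp.vocab)
matcher.add("VERSION", [[{"LOWER": "version"}, {"LIKE_NUM": True}]])
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```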
danieldk:
> Huggingface fills the need for task based prediction when you have a GPU.
With model distillation, you can make models that annotate hundreds of sentences per second on a single CPU with a library like Huggingface Transformers.
For instance, one of my distilled Dutch multi-task syntax models (UD POS, language-specific POS, lemmatization, morphology, dependency parsing) annotates 316 sentences per second with 4 threads on a Ryzen 3700X. This distilled model has virtually no loss in accuracy compared to the finetuned XLM-RoBERTa base model.
I don't use Huggingface Transformers myself, but I ported some of their implementations to Rust [1]; that should not make a big difference, since all the heavy lifting happens in C++ in libtorch anyway.
tl;dr: it is not true that transformers are only useful for GPU prediction. You can get high CPU prediction speeds with some tricks (distillation, length-based bucketing in batches, using MKL, etc.).
[1] https://github.com/tensordot/syntaxdot/tree/main/syntaxdot-t...
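(A rough sketch of the CPU tricks mentioned above, using Hugging Face Transformers rather than the Rust port: a distilled model, a capped thread count, and length-sorted inputs so each batch pads to similar lengths. The model name, thread count, and batch size are illustrative choices, not anything from the post.)

```python
# CPU inference with a distilled transformer: cap the thread count and feed
# the pipeline length-sorted batches ("length-based bucketing") to cut padding.
import torch
from transformers import pipeline

torch.set_num_threads(4)  # use a handful of CPU threads, no GPU involved

# Distilled English sentiment model (illustrative choice); device=-1 means CPU.
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=-1,
)

texts = [
    "Great library.",
    "The distilled model loses almost no accuracy versus the full model.",
    "CPU prediction is fast enough for hundreds of sentences per second.",
]

# Length-based bucketing: run texts in length order, then restore the
# original order of the results.
order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
results = [None] * len(texts)
for i, out in zip(order, clf([texts[i] for i in order], batch_size=32)):
    results[i] = out
print(results)
```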
ZeroCool2u:
SpaCy and HuggingFace fulfill practically 99% of all our needs for NLP projects at work. Really incredible bodies of work.
Also, my team chat is currently filled with people being extremely stoked about the SpaCy + FastAPI support! Really hope FastAPI replaces Flask sooner rather than later.
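(A bare-bones sketch of putting a spaCy pipeline behind a FastAPI endpoint; this shows only the general shape, not the official spaCy project template, and the route name and response format are made up for illustration. Run with `uvicorn app:app`, assuming the file is called app.py.)

```python
# Minimal FastAPI service wrapping a spaCy pipeline for NER.
import spacy
from fastapi import FastAPI
from pydantic import BaseModel

nlp = spacy.load("en_core_web_sm")  # load the model once, at startup
app = FastAPI()

class TextIn(BaseModel):
    text: str

@app.post("/ner")
def ner(req: TextIn):
    doc = nlp(req.text)
    return {"entities": [{"text": e.text, "label": e.label_} for e in doc.ents]}
```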
langitbiru:
So with SpaCy 3.0 and HuggingFace, do we still have a reason to use NLTK? Or do they complement each other? Right now, I've lost track of the progress in NLP.
gillesjacobs:
NLTK is showing its age. In my information extraction pipelines, the heavy lifting for modelling is done by SpaCy, AllenNLP, and Huggingface (and Pytorch or TF ofc).
I only use NLTK because it has some base tools for low-resource languages for which no one has pretrained a transformer model, and for specific NLP-related tasks. I still use their agreement metrics module, for instance. But that's about it. Dep parsing, NER, lemmatising and stemming are all better with the above-mentioned packages.
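(The NLTK agreement-metrics module mentioned above in a nutshell; the toy annotations are made up just to show the expected input format of (coder, item, label) triples.)

```python
# Inter-annotator agreement with NLTK's agreement metrics module.
from nltk.metrics.agreement import AnnotationTask

data = [
    ("ann1", "sent1", "POS"), ("ann2", "sent1", "POS"),
    ("ann1", "sent2", "NEG"), ("ann2", "sent2", "POS"),
    ("ann1", "sent3", "NEG"), ("ann2", "sent3", "NEG"),
]
task = AnnotationTask(data=data)
print("Cohen's kappa:       ", task.kappa())
print("Krippendorff's alpha:", task.alpha())
```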
Vision tools
Object detection:
Books
Courses
Contests
Case studies / examples
Reinforcement learning case studies
SVMs
time series
EDM
NLP techniques
Tips
Idea
Try combining all (or many) of the classifiers (not via ensemble methods, but by combining their core concepts to make a new algorithm).
e.g. combine random forest, neural net, SVM, genetic algorithm, stochastic local search, nearest neighbor
e.g. deep neural decision forests seem to be an attempt to combine the first two
Perhaps some of the combinations would only amount to a linear combination of the output scores (see the sketch below), but it would be better to find 'the key idea(s)' of each one.
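(A minimal sketch of the weak version of this idea, i.e. a fixed linear combination of the classifiers' output scores rather than a genuine fusion of their core concepts; the weights and the synthetic dataset are arbitrary, for illustration only.)

```python
# Combine several classifiers by a fixed linear combination of their
# predicted class probabilities (the "output scores" fallback above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [
    (0.5, RandomForestClassifier(n_estimators=100, random_state=0)),
    (0.3, SVC(probability=True, random_state=0)),
    (0.2, KNeighborsClassifier(n_neighbors=5)),
]

proba = np.zeros((len(X_te), 2))
for weight, model in models:
    model.fit(X_tr, y_tr)
    proba += weight * model.predict_proba(X_te)

pred = proba.argmax(axis=1)
print("accuracy of the combined scores:", (pred == y_te).mean())
```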
Intros to and notes on classic algorithms
Reviews
games
deep learning
KR
evolution strategies
imitation learning
https://blog.openai.com/robots-that-learn/
reinforcement learning
RL tools
reinforcement learning of collaboration
datasets
news
what works well where
- linear/logistic regression vs ANNs vs SVMs
- in the NetFlix contest
- "Neural network models are highly expressive and flexible, and if we are able to find a suitable set of model parameters, we can use neural nets to solve many challenging problems....However, there are many problems where the backpropagation algorithm cannot be used. For example, in reinforcement learning (RL) problems, we can also a train a neural network to make decisions to perform a sequence of actions to accomplish some task in an environment. However, it is not trivial to estimate the gradient of reward signals given to the agent in the future to an action performed by the agent right now, especially if the reward is realised many timesteps in the future. Even if we are able to calculate accurate gradients, there is also the issue of being stuck in a local optimum, which exists many for RL tasks. A whole area within RL is devoted to studying this credit-assignment problem, and great progress has been made in recent years. However, credit assignment is still difficult when the reward signals are sparse. In the real world, rewards can be sparse and noisy. Sometimes we are given just a single reward, like a bonus check at the end of the year, and depending on our employer, it may be difficult to figure out exactly why it is so low. For these problems, rather than rely on a very noisy and possibly meaningless gradient estimate of the future to our policy, we might as well just ignore any gradient information, and attempt to use black-box optimisation techniques such as genetic algorithms (GA) or ES...OpenAI? published a paper called Evolution Strategies as a Scalable Alternative to Reinforcement Learning where they showed that evolution strategies, while being less data efficient than RL, offer many benefits. The ability to abandon gradient calculation allows such algorithms to be evaluated more efficiently. It is also easy to distribute the computation for an ES algorithm to thousands of machines for parallel computation. By running the algorithm from scratch many times, they also showed that policies discovered using ES tend to be more diverse compared to policies discovered by RL algorithms...Although ES might be a way to search for more novel solutions that are difficult for gradient-based methods to find, it still vastly underperforms gradient-based methods on many problems where we can calculate high quality gradients...CMA-ES is my algorithm of choice when the search space is less than a thousand parameters. I found it still usable up to ~ 10K parameters if I’m willing to be patient....I use PEPG if the performance of CMA-ES becomes an issue. I usually use PEPG when the number of model parameters exceed several thousand." [6]
- on fitness shaping, an optional add-on to evolutionary learning: "I find fitness shaping to be very useful for RL tasks if the objective function is non-deterministic for a given policy network, which is often the case in RL environments where maps are randomly generated and various opponents have random policies. It is less useful for optimising for well-behaved functions that are deterministic, and the use of fitness shaping can sometimes slow down the time it takes to find a good solution..." [7]
- "One thing I don't see mentioned here is flexibility. You can plug any ugly old thing into an evolutionary algorithm and let it run. Give me a working computational model at breakfast time and I'll have runs going by lunch. Cranky ancient FORTRAN that requires you to write a new input file for every evaluation? No problem. You have to compile the inputs into the model to make it run? Fine. Badly scaled inputs or outputs? EA doesn't care. More than one objective? Great! Population-based search is a natural fit for multiple objective optimization. As long as you're clear on what the decisions and objectives are, and you're able to run the model yourself, layering evolutionary optimization on top is easy." [8]
- "I think the multiobjective-part is very important. In other algorithms, you often have to specify the solution-space before hand. For instance, when optimizing between lightweight and strength, one would have to beforehand say how to weight those two properties. (f = 100s - 5w for instance). This throws away a whole dimension in your search space. For EAs, you can let it roam free, and select the bests tradeoffs from the pareto-set after the algorithm is done." [9]
- in collaborative filtering, "SVD can struggle when some users have many more “likes” than others"; instead, try "multi-step co-occurrence" (see https://www.quora.com/What-is-a-co-occurrence-matrix ) -- [10]
- Classifier Technology and the Illusion of Progress
- (my note: in that paper, linear discriminant analysis is given as an example of a simple classifier that is often quite good; likewise decision trees with few leaves, neural nets with few hidden nodes, and nearest neighbor)
- Why do tree-based models still outperform deep learning on tabular data?
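(A toy numpy sketch of the ES and fitness-shaping ideas quoted in [6] and [7] above: a simple Gaussian evolution strategy that updates its mean from centred ranks of the rewards instead of the raw, noisy rewards. The objective function is a made-up stand-in for a noisy RL return; none of the constants come from the quoted posts.)

```python
# Toy evolution strategy with rank-based fitness shaping (no gradients used).
import numpy as np

rng = np.random.default_rng(0)

def noisy_reward(params):
    # Stand-in for a noisy RL return: maximal at params == 2 in every coordinate.
    return -np.sum((params - 2.0) ** 2) + rng.normal(scale=0.5)

dim, pop, sigma, lr = 5, 50, 0.3, 0.2
mean = np.zeros(dim)

for generation in range(300):
    noise = rng.normal(size=(pop, dim))   # population of Gaussian perturbations
    rewards = np.array([noisy_reward(mean + sigma * n) for n in noise])

    # Fitness shaping: replace raw rewards by centred ranks in [-0.5, 0.5],
    # which makes the update insensitive to reward scale and outliers.
    ranks = rewards.argsort().argsort()
    shaped = ranks / (pop - 1) - 0.5

    # ES-style update of the search distribution's mean.
    mean += lr / (pop * sigma) * (noise.T @ shaped)

print("final mean (true optimum is 2 in every coordinate):", mean.round(2))
```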
natural language processing (NLP)