What's Paper of the Week?

Part of the fun of working in the data field is reading all the new research coming out of various academic and corporate AI labs. I spent part of the afternoon browsing arxiv-sanity -- a very nice bit of work that puts a good UI on top of machine learning papers published on arXiv. There's so much to learn!

My mid-year resolution is to write about 20 new papers by the end of the year. Note that the year has 25 weeks left, and I set a target of 20 papers. That's called having realistic expectations of yourself haha!

I'm feeling fairly omnivorous at the moment. While most of these will probably be machine learning papers, I'm putting a few computer science papers on the list too.

The First Paper of the Week

This week's paper is Bag of Tricks for Efficient Text Classification. Key idea -- it extends the word2vec architecture to sentence and document classification, achieving accuracy comparable to deep learning models while training leagues faster.

Better than Deep Learning?

The model is called fastText. It can be trained in minutes on a standard CPU, and compares favorably with deep learning classifiers trained on GPUs. Given how pricey AWS GPU instances are, and all the additional devops setup time involved there, that seems like a winner.

Model Nitty Gritty

Input: sequence of words
Output: probability distribution over predefined classes (softmax)

fastText is a model with one hidden layer. It's trained with stochastic gradient descent and backpropagation, with a linearly decaying learning rate. The authors combined a bunch of existing computational tricks for linear models to create fastText. It's more a very clever synthesis of existing ideas than a brand-new one.
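To make that concrete, here's a minimal sketch of my reading of the architecture (not the authors' code): average the word embeddings (that's the one hidden layer), feed the result through a linear layer, and softmax over the classes, training with SGD and a linearly decaying learning rate. All sizes and the toy document are made up for illustration.

```python
import numpy as np

# Toy dimensions, purely illustrative.
rng = np.random.default_rng(0)
vocab_size, embed_dim, n_classes = 1000, 10, 3

E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))  # word embeddings
W = rng.normal(scale=0.1, size=(embed_dim, n_classes))   # output layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(word_ids):
    h = E[word_ids].mean(axis=0)   # hidden state: mean of the word embeddings
    return h, softmax(h @ W)       # probability distribution over classes

def sgd_step(word_ids, label, lr):
    h, p = forward(word_ids)
    grad_logits = p.copy()
    grad_logits[label] -= 1.0                 # cross-entropy gradient w.r.t. logits
    grad_h = W @ grad_logits
    W[...] -= lr * np.outer(h, grad_logits)   # in-place updates to the globals
    E[word_ids] -= lr * grad_h / len(word_ids)

# Linearly decaying learning rate over the training run.
n_steps, lr0 = 200, 0.5
doc, label = [1, 2, 3, 4], 2    # a toy "document" of word ids
for t in range(n_steps):
    sgd_step(doc, label, lr0 * (1 - t / n_steps))

_, probs = forward(doc)
```

After training on the toy example, `probs` concentrates on the target class -- the whole model is just an embedding lookup, a mean, and one matrix multiply, which is why it's so cheap.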

Neat Tricks:

  • Using hierarchical softmax to reduce the computational cost of the linear classifier.
  • Since a node's probability is always lower than its parent's by construction, exploring the Huffman coding tree via depth-first search lets us discard branches with small probabilities early. Further reduces complexity.
  • Using a binary heap to track the top-T targets (an O(log T) cost seems too good to be true at first, but if you think about it, it's the logical outcome of building the tree!)
  • Preserving sequential information w/o using RNNs, which is cool enough to deserve its own section
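Here's my own toy sketch of how those two tricks fit together: a best-first traversal with a heap over a probability tree, so we find the top-T classes without scoring every leaf. The hand-built three-level tree below is an assumption for illustration -- the paper actually builds a Huffman tree over label frequencies.

```python
import heapq

# Each node is either ('leaf', class_id) or ('node', left, right, p_left),
# where p_left is the probability of taking the left branch.
tree = ('node',
        ('node', ('leaf', 0), ('leaf', 1), 0.9),
        ('node', ('leaf', 2), ('leaf', 3), 0.5),
        0.7)  # p(left) = 0.7 at the root

def top_t(root, T):
    # Max-heap on probability (negated, since heapq is a min-heap).
    # Because a child's probability never exceeds its parent's, once T
    # leaves have been popped, everything left in the heap is provably worse.
    heap = [(-1.0, 0, root)]   # (negated prob, tiebreak counter, node)
    counter = 0
    out = []
    while heap and len(out) < T:
        neg_p, _, node = heapq.heappop(heap)
        if node[0] == 'leaf':
            out.append((node[1], -neg_p))
        else:
            _, left, right, p_left = node
            counter += 1
            heapq.heappush(heap, (neg_p * p_left, counter, left))
            counter += 1
            heapq.heappush(heap, (neg_p * (1 - p_left), counter, right))
    return out

best = top_t(tree, 2)
```

For this toy tree the traversal returns class 0 (probability 0.7 × 0.9 = 0.63) and class 2 (0.3 × 0.5 = 0.15) while never expanding the losing subtrees' leaves it doesn't need.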

Preserving Sequential Information

Bag of words doesn't preserve word order, only word occurrence. Taking word order into account adds incredible amounts of complexity, which is a strong reason for using a recurrent neural network that tracks sequential data.
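A quick toy illustration of what gets lost (my example, not the paper's): a bag-of-words representation maps both orderings of a sentence to the exact same feature counts.

```python
from collections import Counter

def bag_of_words(sentence):
    # Count word occurrences; all positional information is discarded.
    return Counter(sentence.split())

a = bag_of_words("man bites dog")
b = bag_of_words("dog bites man")
# Two sentences with very different meanings, identical representations.
```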

Sequential Data Tricks (How to get things done w/o using an RNN):

  • Using bag of n-gram features to capture partial information about local word order. Efficient in practice, which is my favorite type of efficiency.
  • Memory efficient n-gram mapping done using a hashing trick with 10M bins for bigrams, 100M bins otherwise. (The paper references other papers for the hashing trick and doesn't describe this trick at all.)
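Since the paper doesn't spell the hashing trick out, here's my own sketch of the idea: instead of keeping a dictionary of every bigram ever seen, hash each bigram into a fixed number of bins and use the bin index as the feature id. The bin count matches the paper's setting (10M for bigrams); the simple rolling hash is an assumption, and collisions are just accepted as noise.

```python
N_BIGRAM_BINS = 10_000_000  # the paper's 10M bins for bigrams

def bigram_ids(tokens, n_bins=N_BIGRAM_BINS):
    ids = []
    for a, b in zip(tokens, tokens[1:]):
        # Python's built-in hash() is randomized per process, so use a
        # stable rolling hash for reproducibility across runs.
        h = 0
        for ch in a + " " + b:
            h = (h * 31 + ord(ch)) % n_bins
        ids.append(h)
    return ids

ids = bigram_ids("the cat sat on the mat".split())
```

Memory is now bounded by the number of bins rather than the vocabulary of n-grams, which is what makes training on 100M-scale data feasible.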

Works well for sentiment analysis and tag prediction. Better yet, trains FAST FAST FAST.

This trained model performs better than SVM+TF and CNNs, and is as good as Conv-GRNN, on sentiment analysis for Yelp reviews. More impressively, on the YFCC100M dataset (100M images with captions, titles, and tags), it predicted the tags using only title and caption data, beating Tagspace in terms of both speed and accuracy.

6-15 mins of training time for fastText vs 3-5.5 hours for Tagspace! No waiting all night by my computer or training on AWS overnight before realizing that I coded something utterly dumb into the model.

Confusion Points!!!

So is this a neural network? There's one hidden layer, but fastText is introduced as "closely related to standard linear text classifiers"!? I feel like this is a question that'll get answered when I read the code.

"We will publish our code so that the research community can easily build on top of our work."

Future tense? NOOO!!!! It's not public yet! Okay, it was only published 4 days ago, so I'm not going to start freaking out, but it's fairly disappointing not to have this. Seems like a very exciting piece of work that I'd like to add to my personal bag of tricks.