Aran Komatsuzaki

Some Notable Recent ML Papers and Future Trends

I have aggregated some of the notable papers released recently, esp. ICLR 2021 submissions, with concise summaries, visualizations and my comments. The development in each field is summarized, and future trends are speculated upon.

Caveats: I have omitted some very well-known recent papers such as GPT-3, as most readers should be adequately familiar with them. Admittedly, the coverage is far from exhaustive and heavily biased toward the areas of my interest (e.g. language models), and the amount of detail I have written varies from paper to paper.

Table of Contents with Summary & Conclusion

General Scaling Method

Summary:

Contents:

Future Trends:

NLP

Summary:

Contents:

Future Trends:

CV

Summary:

Contents:

Future Trends:

RL

Summary:

Contents:

Future Trends:

Unsupervised pre-training for RL, world models, sequence/video modeling and the notion of optimization of data will improve further, both individually and synergistically, so that RL will increasingly be approached as (un-)supervised learning on optimized data.

Optimizer

Summary:

Contents:

Optimization of Data

What is it?:

Optimization of data, as surveyed in this linked post listing many relevant papers, refers to the idea of treating RL as (un-)supervised learning on “good data” that the model finds through its interaction with the environment, and therefore as a joint optimization of the model and the data. In this section, it is argued that this joint optimization also applies to ML as a whole.

Optimization of data for ML in general:

I believe it is natural to argue that the notion of optimization of data is also applicable to ML in general. Let us consider two examples:

Summary:

Thus, ML, including RL, can be broadly thought of as a joint optimization of the model and the data.

Conclusion

Recent developments can be summarized as approaching problems according to the paradigm of (efficient) Transformers, scaling, pre-training, retrieval and joint optimization of model and data. This trend will continue and simplify ML research toward a unified model.

Acknowledgement: I would like to thank Madison May for his valuable feedback and his blog posts that inspired this blog post!

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

SGMoE replaces every other FFN layer of the Transformer.

tl;dr:

Details:

(Shazeer, 2017) An SGMoE layer. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network.
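To make the routing concrete, here is a minimal PyTorch sketch of a top-2 sparsely-gated MoE layer. The dimensions and module structure are illustrative rather than taken from the GShard implementation, which also adds expert capacity limits, load-balancing losses and sharding of experts across devices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Minimal sketch of a sparsely-gated mixture-of-experts FFN layer.

    Each token is routed to its top-2 experts; their outputs are combined
    using the (renormalized) gate probabilities.
    """
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)          # (tokens, experts)
        top_p, top_idx = probs.topk(2, dim=-1)           # top-2 experts per token
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize the two gates
        out = torch.zeros_like(x)
        for k in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += top_p[mask, k:k + 1] * expert(x[mask])
        return out
```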

Comments:

Relevant links:

Training Large Neural Networks with Constant Memory using a New Execution Algorithm

tl;dr: Proposes L2L, with which GPU/TPU memory usage becomes constant w.r.t. the number of layers. Able to fit 50B parameters with a single V100 and 500GB CPU memory with no speed loss.
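The core idea can be illustrated with a rough forward-pass-only sketch; the actual L2L algorithm also handles the backward pass and overlaps the CPU-GPU transfers with compute. A CUDA device is assumed here.

```python
import torch
import torch.nn as nn

def l2l_forward(layers, x, device="cuda"):
    """Rough sketch of layer-to-layer execution: parameters live in CPU memory,
    and only the layer currently being executed is resident on the accelerator,
    so device memory stays roughly constant in the number of layers."""
    x = x.to(device)
    for layer in layers:          # layers: an iterable of nn.Modules kept on CPU
        layer.to(device)          # stream this layer's weights to the accelerator
        x = layer(x)
        layer.to("cpu")           # evict it before loading the next layer
    return x
```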

Comments:

Relevant links:

DeepSpeed

tl;dr: A library with various scaling tools, notably the following:

Comments:

Relevant links:

Pre-training via Paraphrasing

tl;dr: By training a language model and a retriever jointly and modeling a passage from the retrieved similar passages, MARGE achieves:

Details:

Performance of zero-shot translation

Comments:

Relevant links:

Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval

tl;dr: Achieves the SOTA performance-compute trade-off in multi-hop open-domain QA (better than Fusion-in-Decoder). Best published accuracy on HotpotQA with 10x faster inference.
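As a rough illustration of the multi-hop idea (not the paper's actual implementation), each hop retrieves passages from a dense index and conditions the next hop's query on the best hit. `encode_query` is a hypothetical text-to-vector encoder.

```python
import numpy as np
import faiss  # dense similarity search library

def multi_hop_retrieve(encode_query, passage_vecs, passages, question, hops=2, k=5):
    """Minimal sketch of iterative (multi-hop) dense retrieval.

    `passage_vecs` is an (N, d) float32 matrix of pre-computed passage
    embeddings and `passages` their texts."""
    index = faiss.IndexFlatIP(passage_vecs.shape[1])   # inner-product index
    index.add(passage_vecs)

    query, retrieved = question, []
    for _ in range(hops):
        q = encode_query(query).astype("float32").reshape(1, -1)
        _, ids = index.search(q, k)                    # top-k passage ids
        best = passages[ids[0][0]]
        retrieved.append(best)
        # condition the next hop's query on what was just retrieved
        query = question + " " + best
    return retrieved
```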

Details:

Efficiency-performance trade-off comparison with published HotpotQA systems. The curve is plotted by varying the number of top-k (k = 1, 5, 10, 20, 50, 100, 200) passage sequences fed into the reader model. seq/Q denotes the time required per query.

Relevant links:

Cross-lingual Retrieval for Iterative Self-Supervised Training

tl;dr: Achieves SotA (unsupervised NMT) BLEU on 9 language directions (+2.4 BLEU on avg.) without back-translation by retrieving the target with faiss.

Comments:

Relevant links:

Long Range Arena: A Benchmark for Efficient Transformers

tl;dr: Various Transformer variants are benchmarked over various tasks. The performance-compute trade-off for each model is obtained as shown above.

Details:

Comments:

Relevant links:

Rethinking Attention with Performers

tl;dr: O(N)-Transformer with competitive performance that approximates regular attention with provable accuracy. Performer outperforms Reformer and Linformer.
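A stripped-down sketch of the idea (non-causal, with a single fresh Gaussian projection instead of the paper's redrawn orthogonal features): the softmax kernel exp(q·k) is approximated with positive random features so that attention can be computed as phi(Q) (phi(K)^T V), which is linear in sequence length.

```python
import torch

def softmax_kernel_features(x, proj, eps=1e-6):
    """Positive random features approximating exp(q.k / sqrt(d)).
    `proj` is an (m, d) matrix of random Gaussian directions; x is (N, d)."""
    x = x / x.shape[-1] ** 0.25                 # fold in the 1/sqrt(d) attention scaling
    xw = x @ proj.T                             # (N, m)
    sq = (x ** 2).sum(dim=-1, keepdim=True) / 2
    return torch.exp(xw - sq) / proj.shape[0] ** 0.5 + eps

def linear_attention(q, k, v, num_features=256):
    """Sketch of O(N) attention: compute phi(K)^T V once, then multiply by phi(Q).
    q, k, v are (N, d) tensors."""
    proj = torch.randn(num_features, q.shape[-1])
    q_f = softmax_kernel_features(q, proj)                      # (N, m)
    k_f = softmax_kernel_features(k, proj)                      # (N, m)
    kv = k_f.transpose(-2, -1) @ v                              # (m, d_v), linear in N
    z = q_f @ k_f.sum(dim=-2, keepdim=True).transpose(-2, -1)   # softmax normalizer
    return (q_f @ kv) / z
```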

Comments:

Relevant links:

Scaling Laws for Neural Language Models

tl;dr:

Details:

The optimal allocation of a fixed compute budget, assuming a sufficiently large dataset is available. B stands for the batch size.
For a billion-fold increase in compute under optimally compute-efficient training, most of the increase should go toward increased model size, with a relatively small increase in batch size and almost none toward serial step count.
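As a back-of-the-envelope illustration, the allocation can be expressed as simple power laws of the compute increase. The exponents below are roughly those reported in the paper; treat them as approximate.

```python
def optimal_allocation(compute_ratio):
    """Rough scaling-law heuristic: when compute grows by `compute_ratio`,
    model size, batch size and serial steps should grow roughly as the
    powers below (exponents approximately as reported by the paper)."""
    return {
        "model_size":   compute_ratio ** 0.73,
        "batch_size":   compute_ratio ** 0.24,
        "serial_steps": compute_ratio ** 0.03,
    }

# e.g. a billion-fold increase in compute: almost all of it goes to model size.
print(optimal_allocation(1e9))
```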

Comments:

Relevant links:

Learning to Summarize From Human Feedback

tl;dr: Achieves superhuman-level summarization on the TL;DR dataset by training a reward model on human feedback and fine-tuning a pre-trained generator (GPT-3 variants) with PPO.

Performance of various training procedures for different model sizes on the TL;DR dataset. Supervised models are fine-tuned on an additional dataset on top of the pre-trained GPT-3 variants.
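The heart of the method is a reward model trained on pairwise human preferences. A minimal sketch of that loss is below; the PPO fine-tuning and the KL penalty against the supervised policy are omitted, and the toy scalars stand in for outputs of a hypothetical reward head.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_preferred, r_rejected):
    """Pairwise loss for training a reward model on human comparisons:
    maximize the log-probability that the human-preferred summary scores
    higher than the rejected one (a Bradley-Terry-style objective)."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Toy usage with scalar rewards for three comparison pairs:
r_pref = torch.tensor([1.2, 0.3, 0.8])
r_rej  = torch.tensor([0.4, 0.5, -0.1])
print(reward_model_loss(r_pref, r_rej))
```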

Comments:

Relevant links:

Measuring Massive Multitask Language Understanding

tl;dr: A smaller, fine-tuned T5-like model (UnifiedQA) outperforms few-shot GPT-3 at solving various academic problems, ranging from elementary mathematics to US history.

Details:

Relevant links:

Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data

tl;dr: Proposes a multi-task fine-tuning approach that is more performant and memory-efficient than conventional per-task fine-tuning.

Details:

ST = single-task fine-tuning

Comments:

Relevant links:

Current Limitations of Language Models: What You Need is Retrieval

tl;dr:

Details:

Comments:

Relevant links:

More Future Trends of NLP

This section expands on and elaborates what is described in the Future Trends of NLP section of the Table of Contents.

An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

The architecture is essentially the same as vanilla BERT's, except that the input embedding is a linear projection of each patch into a dense vector.

tl;dr: When pre-trained and transferred to CV tasks, Vision Transformer, which needs only a minimal amount of inductive bias at preprocessing, attains excellent results compared to SOTA CNNs while requiring substantially fewer computational resources to train.
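A minimal sketch of that input layer, with illustrative hyperparameters (patch size 16, width 768): a stride-16 convolution with a 16x16 kernel is exactly the per-patch linear projection described above.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of the ViT input layer: split the image into 16x16 patches and
    linearly project each patch to a d_model-dimensional token embedding."""
    def __init__(self, patch=16, in_ch=3, d_model=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch, stride=patch)

    def forward(self, images):                 # (B, 3, H, W)
        x = self.proj(images)                  # (B, d_model, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, d_model)

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                            # torch.Size([2, 196, 768])
```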

Performance versus cost for Vision Transformers and ResNets. Vision Transformers generally outperform ResNets with the same computational budget.

Comments:

Relevant links:

Generative Pretraining from Pixels

tl;dr: A GPT-2-scale image model learns strong image representations, as measured by linear probing, fine-tuning and low-data classification, despite having minimal 2D inductive bias and being trained on low-resolution ImageNet without labels. The generated images are stunning!
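Linear probing, one of the main evaluations here, trains only a linear classifier on frozen features from an intermediate layer of the pre-trained model. A minimal sketch, assuming the features have already been extracted:

```python
import torch
import torch.nn as nn

def linear_probe(features, labels, num_classes, epochs=10, lr=1e-2):
    """Sketch of linear probing: the pre-trained model is frozen, its
    intermediate activations are used as fixed features, and only a linear
    classifier is trained on top. `features` is an (N, d) tensor."""
    clf = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(clf(features), labels)
        loss.backward()
        opt.step()
    return clf
```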

Details:

Comments:

Relevant links:

VideoGen: Generative Modeling of Videos using VQ-VAE and Transformers

There are two sequential stages in training: training VQ-VAE (Left) and training an autoregressive transformer in the latent space (Right).

tl;dr: VQ-VAE-based, GPT-like model with 3D convolutions and axial self-attention improves the SotA bits/dim on BAIR dataset from 3.94 (Axial Attention) to 3.62.
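A minimal sketch of the quantization step in stage one (straight-through gradient only; the commitment and codebook losses of a full VQ-VAE are omitted). The resulting discrete codes are what the autoregressive Transformer of stage two models.

```python
import torch

def vector_quantize(z, codebook):
    """Sketch of the VQ step: map each encoder output vector to its nearest
    codebook entry, with a straight-through estimator for the gradient.
    z: (..., d), codebook: (K, d)."""
    dists = torch.cdist(z.reshape(-1, z.shape[-1]), codebook)   # (N, K)
    idx = dists.argmin(dim=-1)                                  # discrete codes
    zq = codebook[idx].reshape(z.shape)
    zq = z + (zq - z).detach()     # straight-through: copy gradients to encoder
    return zq, idx.reshape(z.shape[:-1])
```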

Comments:

Relevant links:

Mastering Atari with Discrete World Models


Components of Dreamer. Taken from DreamerV1 paper (Hafner, 2019)

tl;dr: Proposes DreamerV2, the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model.

Performance on the Atari benchmark of 55 games over 200M steps
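Schematically, "learning behaviors inside the world model" means rolling latent states forward with the learned dynamics rather than the real environment and training the actor-critic on the imagined trajectories. The interface below (`world_model.step`, `world_model.reward`) is hypothetical, not DreamerV2's actual API.

```python
def imagine_rollout(world_model, policy, start_state, horizon=15):
    """Rough sketch of behavior learning in imagination: roll the latent
    dynamics forward with the policy's actions and collect imagined rewards;
    the actor and critic are then trained purely on these trajectories."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        state = world_model.step(state, action)      # latent transition, no env
        trajectory.append((state, action, world_model.reward(state)))
    return trajectory
```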

Details:

Relevant links:

Unsupervised Active Pre-Training for Reinforcement Learning

tl;dr: A novel unsupervised active pre-training method achieves highly competitive performance compared to canonical RL algorithms using only 100k environment steps on Atari.

Evaluation on Atari games
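The pre-training signal is an intrinsic reward based on a particle (k-nearest-neighbor) estimate of state entropy. A simplified sketch of such a reward is below; the constants and exact form are illustrative rather than the paper's.

```python
import torch

def knn_intrinsic_reward(z, k=10, c=1.0):
    """Sketch of a particle-based entropy bonus: reward each state for being
    far from its k nearest neighbors in representation space, which encourages
    the agent to maximize the entropy of visited states. z: (N, d)."""
    dists = torch.cdist(z, z)                          # pairwise distances (N, N)
    knn, _ = dists.topk(k + 1, largest=False)          # includes the zero self-distance
    return torch.log(c + knn[:, 1:].mean(dim=-1))      # drop self, average, log
```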

Details:

Relevant links:

Using a thousand optimization tasks to learn hyperparameter search strategies

A 2D t-SNE embedding of all 1162 tasks used for training the learned optimizer.

tl;dr: An optimizer learned from a dataset of hyperparameters from a thousand tasks yields substantial improvements on various tasks, including large-scale problems such as LM1B and ImageNet.

Details:

The learned optimizer outperforms learning-rate-tuned Adam by a large margin and matches the default.
The learned optimizer outperforms learning-rate-tuned Adam with both a constant learning rate and a fixed learning-rate schedule on a 53M-parameter Transformer trained on LM1B.

Comments:

Relevant links:

Second Order Optimization Made Practical

Preconditioner statistics are computed at each step by the accelerators. Preconditioners are only computed every N steps and this computation is distributed to all available CPU cores.

tl;dr:

Test log-perplexity of a Transformer-Big model on WMT’14 en→fr with a batch size of 1536. Wall-clock time to convergence improves by 41%.
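The scheduling trick described in the caption above can be illustrated with a simplified full-matrix-AdaGrad-style sketch; the actual optimizer uses Kronecker-factored preconditioners and computes the inverse roots asynchronously on CPU hosts.

```python
import numpy as np

def preconditioned_updates(grads, lr=0.1, refresh_every=20, eps=1e-4):
    """Illustrative schedule: second-moment statistics are accumulated at every
    step, but the expensive inverse-root preconditioner is only recomputed
    every `refresh_every` steps. `grads` is a list of (d,) gradient vectors."""
    d = grads[0].shape[0]
    stats = eps * np.eye(d)                 # accumulated sum of g g^T
    precond = np.eye(d)
    updates = []
    for step, g in enumerate(grads):
        stats += np.outer(g, g)             # cheap, done every step
        if step % refresh_every == 0:       # expensive eigendecomposition,
            w, v = np.linalg.eigh(stats)    # done only occasionally
            precond = v @ np.diag(w ** -0.5) @ v.T   # stats^{-1/2}
        updates.append(-lr * precond @ g)
    return updates
```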

Relevant links:
