Big picture:

  • Most state-of-the-art NLP models today are based on attention mechanisms, specifically multi-layer self-attention, also known as “Transformer” architectures.
  • Landmark models include
    • The original Transformer model (Google Brain, original paper)
      • The first competitive natural language model for multiple tasks (translation, English constituency parsing) that uses neither convolutions nor recurrence (the previous state of the art). Instead, a multi-layer self-attention mechanism generates increasingly powerful contextual embedding representations for every token in the input, which can be used for any task (a minimal self-attention sketch follows after this list).
    • BERT (developed by Google, original paper, blog post). Key components include:
      • General natural language encoding model; any decoding head can be added and fine-tuned for a specific task (e.g. next-sentence prediction, word prediction, question answering)
      • Multi-layer self-attention mechanism
      • Pre-training of the encoder is done on extremely large corpora of natural language -> general model (see the masked-LM sketch after this list)
    • Generative Pre-trained Transformer (GPT, developed by OpenAI)
      • Standard language-modeling objective (next-token prediction) as pre-training for a powerful Transformer-based language model
      • Task conditioning is supplied to the model as auxiliary input in natural-language form (e.g. naming the task and providing a few examples)
      • This enables few-shot or zero-shot learning, i.e. the model can essentially be applied directly to any new task, without fine-tuning or changing parameters, by providing a description of the task as part of the input (see the few-shot prompting sketch after this list)
      • GPT-2 and GPT-3 build on this by employing larger text corpora and larger models with longer training
      • GPT-3 performs well on seemingly unrelated tasks such as writing code from natural-language descriptions, or generating subject-specific text that reads as if written by a human
      • Parameter counts: GPT: ~117M; GPT-2 (paper, blog): 1.5B; GPT-3 (blog): 175B (!!!)
      • Limitations:
        • Handling long contexts and summarization remain difficult
        • Unidirectional (left-to-right) training creates limitations
        • The models inherit the biases of the corpora they were trained on
        • Inference is resource-intensive and costly
        • GPT-2 and GPT-3 are extremely large models that are difficult to train without the computational resources of a cash-rich company
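
A minimal sketch of the self-attention computation mentioned for the original Transformer: single-head scaled dot-product attention in NumPy. All names, shapes, and the toy input are illustrative assumptions; real Transformer layers add multiple heads, residual connections, layer normalization, and feed-forward sublayers.

```python
# Minimal single-head scaled dot-product self-attention (illustrative sketch).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                            # contextual embedding per token

# Toy usage: 4 tokens, model/head dimension 8 (hypothetical sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)        # -> (4, 8)
```

Each output row mixes information from every input token, weighted by learned relevance, which is what makes the resulting representations contextual.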
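The BERT bullets on pre-training plus a task-specific head can be illustrated by querying a pre-trained masked-language-model head. The use of the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint below are assumptions made for this sketch, not something the notes specify.

```python
# Illustrative sketch: a pre-trained BERT encoder with its masked-LM head.
# Requires the `transformers` package (pip install transformers).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT was pre-trained to recover [MASK] tokens from bidirectional context;
# fine-tuning replaces this head with a task-specific one on the same encoder.
for prediction in fill_mask("The Transformer relies on [MASK] mechanisms."):
    print(prediction["token_str"], round(prediction["score"], 3))
```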
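Finally, the GPT-style task conditioning and few-shot learning bullets boil down to formatting the task description and a few examples as plain input text, with no parameter updates. The sketch below uses the publicly available GPT-2 checkpoint through Hugging Face `transformers` as a stand-in for the GPT family; the translation prompt mirrors the few-shot example from the GPT-3 paper.

```python
# Sketch of in-context (few-shot) task conditioning: the task and examples are
# given purely as text, and the model's weights are never updated.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in model

prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

completion = generator(prompt, max_new_tokens=5, do_sample=False)
print(completion[0]["generated_text"])
```

GPT-2 is far weaker than GPT-3 on such prompts; the point here is the prompt format, not the quality of the completion.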