A Survey of Large Language Model - Wayne Xin Zhao et al (2024)

by 아르카눔 2025. 3. 17.


This is a survey paper I read to get a sense of the overall landscape while studying LLMs.

 

As of March 18, 2025, it has over 4,000 citations on Google Scholar, and it has been continuously updated since 2023.

 

In my view, survey papers are useful both when you are first encountering a field and when you want to organize the big picture after having already studied it.

 

For the reasons above, and because the paper runs over 90 pages even excluding references, I will only summarize the overall outline, key terms, and a few figures and tables.

 

For details, refer to the survey paper and its references.

 

Outline of the Paper

1. Introduction: 

Statistical Language Model (SLM), Pre-trained LM, Large LM 

 

2. Overview: 

Scaling Law, Instruction, In-Context Learning (ICL), Chain-of-Thought (CoT), Human Alignment, GPT series

 

3. Resources:

Checkpoints, APIs of LLMs, Data for Pre-Training, Fine-Tuning, and Alignment Learning, Libraries

 

4. Pre-Training:

Model Architectures, Data, Preprocessing, Normalization Methods, Positional Embeddings, Various Attentions,

Decoding Strategies - including top-k sampling

 

5. Post-Training

Instruction Tuning, In-Context Learning, Alignment Learning, Reinforcement Learning from Human Feedback (RLHF),

Proximal Policy Optimization (PPO), Supervised fine-tuning (SFT), Prefix Tuning, Adapter Tuning (including LoRA)

 

6. Utilization

Prompting, In-Context Learning (ICL), Chain-of-Thought (CoT) Prompting, Planning

 

7. Capacity and Evaluation

 

8. Applications:

Tasks, Domains, Multimodal Models, etc.

 

9. Advanced Topics:

RAG, Hallucination, etc

 

10. Conclusion and Future Directions

 

 

1. Introduction

 

 

Statistical language models (SLM)

Neural language models (NLM)

Pre-trained language models (PLM)

Large language models (LLM)

 

2. Overview

Timeline of LLMs

 

 

2.1. Background of LLMs

 

KM scaling law

Chinchilla scaling law

Predictable scaling

Task-level predictability
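
As a rough illustration of how such scaling laws are used, the sketch below allocates a fixed compute budget between parameters and tokens. It assumes the common approximation C ≈ 6ND (FLOPs ≈ 6 × params × tokens) and the Chinchilla rule of thumb of roughly 20 training tokens per parameter; the function name and the budget figure are illustrative, not from the survey.

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget C between model size N and training tokens D.

    Uses the approximation C = 6 * N * D together with the Chinchilla
    rule of thumb D = 20 * N, so C = 6 * 20 * N**2.
    """
    n = (compute_flops / (6.0 * tokens_per_param)) ** 0.5  # parameters
    d = tokens_per_param * n                               # training tokens
    return n, d

# Roughly Chinchilla's own budget: about 70B params and 1.4T tokens come out.
n, d = chinchilla_optimal(5.76e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

The point of the law is that, compute-optimally, data should grow in proportion to model size rather than lag behind it.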

 

Emergent abilities of LLMs

  • In-context learning (ICL)
  • Instruction following
  • Step-by-step reasoning
  • Chain-of-thought (CoT)

Key Techniques for LLMs

  • Scaling
  • Training
  • Ability eliciting
  • Alignment learning
  • Tools manipulation

 

2.2. Technical Evolution of GPT-series Models

 

 

GPT-1

GPT-2

GPT-3

  • Capacity Enhancement
  • Training on code data
  • Human alignment
  • The Milestones of Language Models

ChatGPT

GPT-4 and beyond

 

3. Resources of LLMs

 

3.1. Publicly Available Model Checkpoints or APIs

 

Publicly Available Model Checkpoints

  • LLaMA 
  • Mistral
  • Gemma
  • Qwen
  • GLM
  • Baichuan

LLaMA Model Family

Alpaca, Alpaca-LoRA, Koala, BELLE, Vicuna, LLaVA, MiniGPT-4, InstructBLIP, PandaGPT

 

Public API of LLMs

GPT-series

 

 

3.2. Commonly Used Corpora for Pre-training

Web pages

  • CommonCrawl
  • C4, The Colossal Clean Crawled Corpus
  • RedPajama-Data
  • RefinedWeb
  • WebText

 

Books & Academic Data

  • Book Data - Book Corpus
  • Academic Data - arXiv Dataset, S2ORC

 

Wikipedia

 

Code

 

Mixed Data

  • The Pile
  • Dolma

 

3.3. Commonly Used Datasets for Fine-tuning

 

Instruction Tuning Datasets

 

NLP Task Datasets

P3

FLAN

 

Daily Chat Datasets

ShareGPT

OpenAssistant

Dolly

 

Synthetic Datasets

Self-Instruct-52K

Alpaca

Baize

 

3.3.2. Alignment Datasets

HH-RLHF

SHP

PKU-SafeRLHF

Stack Exchange Preferences

Sandbox Alignment Data

 

Library Resources

  • Transformers
  • DeepSpeed
  • Megatron-LM
  • JAX
  • Colossal-AI
  • BMTrain
  • FastMoE
  • vLLM
  • DeepSpeed-MII
  • DeepSpeed-Chat

 

 

4. Pre-Training

 

4.1. Data Collection and Preparation

 

Data Source

Webpage

Conversation text

Books

Multilingual text

Scientific text

Code

 

4.1.2. Data Preprocessing

Filtering and selection

Privacy reduction

Tokenization

BPE (Byte-Pair Encoding)

WordPiece

Unigram tokenization
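
To make the tokenization keywords concrete, here is a minimal BPE merge loop on a toy corpus; the corpus, frequencies, and merge count are made up for illustration.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across a {word-as-tuple: freq} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words pre-split into characters, with frequencies.
vocab = {tuple("lower"): 2, tuple("lowest"): 1, tuple("low"): 5}
for _ in range(3):  # learn three merges
    best_pair = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best_pair, vocab)
print(vocab)  # the frequent substring "low" has become a single token
```

WordPiece and Unigram differ mainly in how the next merge (or the kept vocabulary) is scored, not in this overall loop structure.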

 

 

 

4.1.3. Data Scheduling

Data Mixture

Increasing the diversity of data sources

Optimizing data mixtures

Specializing the targeted abilities
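
A data mixture can be sketched as weighted sampling over sources; the weights below are invented for this sketch, not taken from any actual training recipe.

```python
import random

# Illustrative mixture weights over pre-training sources (made up for this
# sketch; real recipes tune these proportions empirically).
mixture = {"web": 0.67, "code": 0.10, "books": 0.08,
           "wikipedia": 0.05, "academic": 0.10}

def sample_source(rng):
    """Choose the source of the next training document by mixture weight."""
    sources, weights = zip(*mixture.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
batch = [sample_source(rng) for _ in range(10)]
print(batch)  # a mix dominated by web documents
```

A data curriculum then amounts to changing these weights over the course of training, e.g. upweighting code or long documents in later stages.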

 

Data curriculum

 

Coding

Mathematics

Long context

 

 

 

4.2. Architecture

Encoder-Decoder Architecture

T5, BART, Flan-T5

 

Causal Decoder Architecture

GPT-series

OPT, BLOOM, Gopher

 

Prefix Decoder Architecture

U-PaLM

GLM

 

Mixture-of-Experts (MoE)

Switch-Transformer

GLaM

Mixtral

 

Emergent Architectures

SSM (State-space Model) - Mamba, RetNet, RWKV

 

 

 

4.2.2. Detailed Configuration

Normalization Methods

  • LayerNorm
  • RMSNorm
  • DeepNorm
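
As an example of these normalization methods, here is a plain-Python sketch of RMSNorm (used by LLaMA), which drops LayerNorm's mean subtraction and bias.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square only. Unlike LayerNorm there
    is no mean subtraction and no bias term, which makes it cheaper."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

out = rms_norm([1.0, -2.0, 3.0], [1.0, 1.0, 1.0])
print(out)  # output has unit root-mean-square
```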

 

 

 

Normalization position

  • Post-LN
  • Pre-LN
  • Sandwich-LN

 

Position Embeddings

  • Absolute position embedding
  • Relative position embedding
  • Rotary position embedding (RoPE)
  • ALiBi
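
A minimal sketch of rotary position embedding (RoPE): each consecutive pair of dimensions is rotated by an angle proportional to the token position, with a per-pair frequency. The vector values are illustrative.

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs (2i, 2i+1) of `x` by pos * theta_i,
    where theta_i = base^(-2i/d), as in rotary position embedding (RoPE)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)       # per-pair rotation frequency
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

q = [1.0, 0.0, 1.0, 0.0]
print(rope_rotate(q, pos=0))  # position 0 applies no rotation
```

Because rotations preserve inner-product structure, the attention score between two rotated vectors depends only on their relative position.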

 

Attention

  • Full Attention
  • Sparse Attention
  • Multi-query / grouped-query attention
  • FlashAttention
  • PagedAttention
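
The multi-query/grouped-query idea can be shown by how query heads map onto shared key/value heads; the head counts below are illustrative.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int = 8, n_kv_heads: int = 2) -> int:
    """Map a query head to the key/value head it shares under grouped-query
    attention (GQA). n_kv_heads == n_q_heads recovers multi-head attention,
    n_kv_heads == 1 is multi-query attention; anything between is GQA."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

mapping = {h: kv_head_for_query_head(h) for h in range(8)}
print(mapping)  # heads 0-3 share KV head 0, heads 4-7 share KV head 1
```

Sharing KV heads this way shrinks the KV cache (here 4x), which is the main inference-time benefit.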

 

4.2.3. Pre-training Tasks

Denoising Autoencoding

Mixture-of-Denoisers (MoD)

 

4.2.4. Decoding Strategy

 

Improvement for Greedy Search

  • Beam Search
  • Length penalty

 

Improvement for Random Sampling

  • Temperature sampling
  • Top-k sampling
  • Top-p sampling
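
These sampling strategies can be combined in a single decoding step, sketched below over a toy logit list (the logits and settings are made up for illustration).

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """One decoding step combining temperature, top-k, and top-p sampling."""
    rng = rng or random.Random()
    scaled = [l / temperature for l in logits]  # temperature reshapes the distribution
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]    # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:                       # keep only the k most likely tokens
        order = order[:top_k]
    if top_p is not None:                       # smallest nucleus with mass >= p
        kept, cum = [], 0.0
        for i in order:
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept
    kept_mass = sum(probs[i] for i in order)
    weights = [probs[i] / kept_mass for i in order]
    return rng.choices(order, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
token = sample_next_token(logits, temperature=0.8, top_k=2, rng=random.Random(0))
print(token)  # with top_k=2, only tokens 0 and 1 can be drawn
```

Lower temperature sharpens the distribution toward greedy search; top-k and top-p both truncate the low-probability tail before renormalizing.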

 

4.3. Model Training

 

4.3.1. Optimization Setting

 

Batch Training

2048 examples or 4M tokens

dynamic schedule of batch size

 

Learning Rate

from 5 x 10^-5 to 1 x 10^-4.

 

Optimizer

Adam, AdamW, AdaFactor

 

Stabilizing the Training

Gradient clipping to 1.0 and weight decay to 0.1

 

4.3.2. Scalable Training Techniques

Data parallelism

Pipeline parallelism

Tensor parallelism

 

Mixed Precision Training

 

5. Post-Training of LLMs

 

5.1. Instruction Tuning

 

5.1.1. Formatted Instance Construction

 

Formatting Daily Chat Data

Formatting Synthetic Data

Formatting design

Instruction quality improvement

Instruction selection

 

5.1.2. Instruction Tuning Strategies

Balancing the Data Distribution

Combining Instruction Tuning and Pre-training

Multi-stage Instruction Tuning

 

Efficient training for multi-turn chat data

Establishing self-identification for LLMs

 

5.1.3. The Effect of Instruction Tuning

 

Performance Improvement

Task Generalization

Domain Specialization

 

5.1.4. Empirical Analysis for Instruction Tuning

 

Instruction datasets

Task-specific instructions

Daily chat instructions

Synthetic instructions

 

Improvement Strategies

Enhancing the instruction complexity

Increasing the topic diversity

Scaling the instruction number

Balancing the instruction difficulty

 

5.2. Alignment Tuning

 

5.2.1. Background and Criteria

Alignment Criteria

Helpfulness

Honesty

Harmlessness

 

5.2.2. Collecting Human Feedback

Human Labeler Selection

Human Feedback Collection

Ranking-based approach

Question-based approach

Rule-based approach

 

5.2.3. Reinforcement Learning from Human Feedback (RLHF)

Proximal Policy Optimization (PPO)

 

Supervised fine-tuning (SFT)

Reward model training

RL fine-tuning
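
The reward model in the second step is typically trained with a pairwise ranking loss over human preference pairs; a minimal sketch (the reward values are illustrative):

```python
import math

def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss for reward model training:
    -log sigmoid(r_chosen - r_rejected). Small when the human-preferred
    response gets the higher reward, large when the order is wrong."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(reward_ranking_loss(2.0, -1.0))  # correctly ordered pair: small loss
print(reward_ranking_loss(-1.0, 2.0))  # mis-ordered pair: large loss
```

The RL fine-tuning step then uses the trained reward as the optimization signal for PPO, with a KL penalty against the SFT model to keep the policy from drifting.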

 

Alignment without RLHF

Supervised Alignment Tuning

 

5.3. Parameter-Efficient Model Adaptation

 

5.3.1. Parameter-Efficient Fine-Tuning Methods

 

Adapter Tuning

Prefix Tuning

Prompt Tuning

Low-Rank Adaptation (LoRA)
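
A minimal LoRA sketch: the pretrained weight W stays frozen and only the low-rank factors A and B are trained, with B initialized to zero so that training starts exactly from the base model. All shapes and values below are illustrative.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

d_out, d_in, r, alpha = 3, 4, 2, 2.0
scale = alpha / r
W = [[0.1] * d_in for _ in range(d_out)]  # frozen pretrained weight
A = [[0.05] * d_in for _ in range(r)]     # trainable down-projection (small init)
B = [[0.0] * r for _ in range(d_out)]     # trainable up-projection (zero init)

def lora_forward(x):
    """h = W x + (alpha / r) * B (A x); with B = 0 this reproduces the base model."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * dv for b, dv in zip(base, delta)]

x = [1.0, 2.0, 3.0, 4.0]
print(lora_forward(x))  # identical to W x while B is still zero
```

Only the 2 × (d_in + d_out) × r factor parameters are updated, which is why LoRA adapters are cheap to train and store.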

 

 

 

6. Utilization

 

6.1. Prompting

 

6.2. In-Context Learning (ICL)

 

 

6.3. Chain-of-Thought (CoT) Prompting  

The concept of Tree-of-Thought (ToT) is also introduced here.
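
A few-shot chain-of-thought prompt looks like the following; the arithmetic examples are the well-known ones from the CoT literature, lightly paraphrased, and the prompt wording is illustrative.

```python
# A hypothetical few-shot chain-of-thought prompt. The worked example spells
# out intermediate reasoning steps, which the model is encouraged to imitate
# when answering the new question.
cot_prompt = """\
Q: A cafeteria had 23 apples. They used 20 and bought 6 more. How many now?
A: The cafeteria started with 23 apples. Using 20 leaves 23 - 20 = 3.
Buying 6 more gives 3 + 6 = 9. The answer is 9.

Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many now?
A:"""
print(cot_prompt)
```

Tree-of-Thought generalizes this by branching over multiple candidate reasoning steps and searching the resulting tree instead of committing to a single chain.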

 

 

6.4. Planning

 

7. Capacity and Evaluation

 

7.1. Basic Ability

 

Language Generation

Text Generation

Code Synthesis

Closed-Book QA

Open-Book QA

Knowledge Completion

Knowledge Reasoning

Symbolic Reasoning

Mathematical Reasoning

 

7.2. Advanced Ability

Human Alignment

Interaction with External Environment

Tool Manipulation - Agent

 

7.3. Benchmarks and Evaluation Approaches

MMLU

BIG-bench

HELM

Human-level test benchmarks - AGIEval, MMCU, M3KE, C-Eval, Xiezhi

 

Evaluation on Base LLMs and Fine-tuned LLMs

 

7.4. Empirical Evaluation

Open-source models

Closed-source models

 

8. Applications

 

 

Classic NLP Tasks

  • Word/Sentence-level Tasks
  • Sequence Tagging
  • Information Extraction
  • Text Generation
  • Summarization

 

LLM for Information Retrieval

  • LLMs as IR Models
  • LLM-Enhanced IR Models

 

LLM for Recommender Systems

  • LLM as Recommender Systems
  • LLM-enhanced Recommender Systems
  • LLM as Recommender Simulator

 

Multimodal LLMs - Text + Vision, Audio, etc.

 

LLM for Evaluation

 

LLM for Specific Domains

Healthcare

Education

Law

Finance

Scientific research

 

9. Advanced Topics

  • Long Context Modeling
  • LLM-empowered Agent
  • Analysis and Optimization for Model Training
  • Analysis and Optimization for Model Inference
  • Model Compression
  • Retrieval-Augmented Generation (RAG)
  • Hallucination

 

10. Conclusion and Future Directions

Basics and Principles

Model Architecture

Model Training

Model Utilization

Safety and Alignment

Application and Ecosystem