Mistral 7B (2023) Paper Review

The paper that introduces Mistral 7B is itself titled "Mistral 7B". (link)

The authors are Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed.

GitHub:

Mistral common (link)
Mistral Cookbook (link)

Mistral 7B is an LLM developed in France. The company now also offers a chatbot product, Le Chat.

Abstract

Mistral 7B outperforms the Llama 2 13B model on all evaluated benchmarks, and surpasses Llama 1 34B, the best released 34B model, in reasoning, mathematics, and code generation. It adopts GQA (Grouped Query Attention) for faster inference and SWA (Sliding Window Attention) to handle sequences of arbitrary length at a reduced inference cost. A version fine-tuned to follow instructions, Mistral 7B - Instruct, outperforms Llama 2 13B - Chat on both human and automated benchmarks.

2. Architectural details

The architecture of Mistral 7B is summarized in Table 1 below.
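Since the table itself is an image, here is a rough sketch of its contents as a Python config object. The field names and values are as I recall them from Table 1 of the paper, and the class name `MistralConfig` is just an illustrative choice, so treat this as an assumption and check it against the paper before relying on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MistralConfig:
    """Hyperparameters as recalled from Table 1 (assumed, verify against the paper)."""
    dim: int = 4096          # model (embedding) dimension
    n_layers: int = 32       # number of transformer layers
    head_dim: int = 128      # dimension of each attention head
    hidden_dim: int = 14336  # feed-forward hidden dimension
    n_heads: int = 32        # number of query heads
    n_kv_heads: int = 8      # number of key/value heads (GQA)
    window_size: int = 4096  # sliding window size W
    context_len: int = 8192  # context length
    vocab_size: int = 32000  # vocabulary size


config = MistralConfig()
# Query heads per shared K/V head under GQA.
queries_per_kv_head = config.n_heads // config.n_kv_heads
```

Note that `n_heads * head_dim` equals `dim`, and that each K/V head is shared by `queries_per_kv_head` query heads.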

GQA (Grouped Query Attention)

We start with GQA. The figure in this Medium blog post (link) explains it well and makes it easy to understand.

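I previously reviewed MQA (link). Whereas MQA shares a single set of K and V across all query heads, GQA divides the query heads into g groups and shares K and V only within each group. This is the same GQA that Llama 2 uses in its larger models.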
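The grouping described above can be sketched as a mapping from query heads to K/V heads. This is a minimal illustration, not the paper's code; the function names `kv_head_index` and `kv_groups` are hypothetical, and it assumes consecutive query heads share a K/V head.

```python
def kv_head_index(query_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Return which K/V head a given query head reads from."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    group_size = n_heads // n_kv_heads  # query heads per K/V head
    return query_head // group_size


def kv_groups(n_heads: int, n_kv_heads: int) -> dict:
    """Map each K/V head to the list of query heads that share it."""
    groups = {}
    for h in range(n_heads):
        groups.setdefault(kv_head_index(h, n_heads, n_kv_heads), []).append(h)
    return groups


mha = kv_groups(8, 8)  # multi-head attention: every query head has its own K/V
gqa = kv_groups(8, 2)  # grouped-query attention with g = 2 groups
mqa = kv_groups(8, 1)  # multi-query attention: one K/V shared by all query heads
```

Seen this way, MHA and MQA are the two extremes of GQA, with g equal to the number of query heads and g equal to 1 respectively.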

SWA (Sliding Window Attention)

Figure 1 illustrates sliding window attention. It is the same structure introduced in Longformer and Sparse Transformer.

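With a window size of $W$, the hidden state $h_i$ at position $i$ of layer $k$ attends to all hidden states of the previous layer at positions between $i - W$ and $i$. Applied recursively, $h_i$ can access input tokens at a distance of up to $W \times k$, as shown in Figure 1.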

With a window size of $W = 4096$, the theoretical attention span at the last layer is approximately 131K tokens ($4096 \times 32$ layers $= 131{,}072$).
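The windowing rule and the span estimate above can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, and the window is taken to include the current token, which is my reading of the picture in Figure 1.

```python
def sliding_window_mask(seq_len: int, window: int) -> list:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumption: attention is causal (j <= i) and the window of size `window`
    includes the current token, so i - j < window.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def theoretical_span(window: int, n_layers: int) -> int:
    """Upper bound on how far back information can flow: W tokens per layer."""
    return window * n_layers


mask = sliding_window_mask(seq_len=5, window=3)
span = theoretical_span(window=4096, n_layers=32)
```

Each row of `mask` has at most `window` true entries, so the per-token attention cost no longer grows with the sequence length.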

In practice, for a sequence length of 16K and a window size of 4096, the changes made to FlashAttention and xFormers yield a 2x speedup over a vanilla attention baseline.

Flash Attention

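FlashAttention is a method for optimizing the attention mechanism at the hardware level.

For details, see Flash Attention 1 (link), 2 (link), and 3 (link).

In short, it combines a trick in the softmax computation (computing softmax incrementally) with tiling, so that frequent reads from and writes to GPU memory are avoided.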
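To make the softmax trick concrete, here is a minimal sketch of an incrementally computed ("online") softmax over tiles, for a single query with scalar values. It is plain Python for illustration only, not the FlashAttention kernel; the function name and the scalar-value simplification are my assumptions.

```python
import math


def online_softmax_weighted_sum(scores, values, tile_size):
    """Compute sum_j softmax(scores)[j] * values[j] one tile at a time.

    Only a running max, a running normalizer, and a running output are kept,
    so the full row of attention weights is never materialized.
    """
    running_max = -math.inf
    running_sum = 0.0  # sum of exp(score - running_max) seen so far
    running_out = 0.0  # sum of exp(score - running_max) * value seen so far
    for start in range(0, len(scores), tile_size):
        tile_scores = scores[start:start + tile_size]
        tile_values = values[start:start + tile_size]
        new_max = max(running_max, max(tile_scores))
        # Rescale what was accumulated under the previous max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(tile_scores, tile_values):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


result = online_softmax_weighted_sum(
    scores=[1.0, 3.0, -2.0, 0.5, 2.0],
    values=[10.0, 20.0, 30.0, 40.0, 50.0],
    tile_size=2,
)
```

The result matches an ordinary softmax-weighted sum regardless of `tile_size`, which is what allows the computation to be tiled to fit in fast on-chip memory.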

Rolling Buffer Cache

First, a brief look at the KV cache.

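[ A KV cache stores the keys and values produced during attention instead of discarding them, and reuses them at later decoding steps to speed up inference.

The trade-off is that it requires a large amount of memory: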

2 (K and V) * batch_size * n_layers * n_heads * d_head * sequence_length * precision (bytes per element)

Here d_model = n_heads * d_head.

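Assuming FP16 (2 bytes per element) and a batch size of 1, and substituting the window size for sequence_length because of the sliding window,

the cache needs 2 * 32 * 4096 * 4096 * 2 bytes = about 2 GB. This count uses all 32 heads (d_model = 4096); with GQA only the K/V heads are cached, so the actual figure is smaller. ]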
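The formula above is easy to check numerically. The sketch below is illustrative: the function name is hypothetical, and the second call assumes 8 K/V heads for GQA, which is the value I recall from Table 1 of the paper.

```python
def kv_cache_bytes(batch_size, n_layers, n_kv_heads, d_head,
                   cached_tokens, bytes_per_element):
    """Memory needed to cache keys and values: the leading 2 is for K and V."""
    return (2 * batch_size * n_layers * n_kv_heads * d_head
            * cached_tokens * bytes_per_element)


# Numbers used in the text: 32 layers, d_model = 32 heads * 128, W = 4096, FP16.
full_heads = kv_cache_bytes(1, 32, 32, 128, 4096, 2)
# Assumed GQA setting: only 8 K/V heads are cached.
gqa_heads = kv_cache_bytes(1, 32, 8, 128, 4096, 2)

full_heads_gib = full_heads / 2**30
gqa_heads_gib = gqa_heads / 2**30
```

Under these assumptions the cache is 2 GiB when every head has its own K/V and 0.5 GiB with 8 shared K/V heads.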

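Now back to the rolling buffer cache.

Because the attention span is fixed, the cache size can be bounded by using a rolling buffer cache.

The cache has a fixed size of $W$, and the keys and values for timestep $i$ are stored at position $i \bmod W$ of the cache.

As a result, once the position $i$ exceeds $W$, older entries in the cache are overwritten and the cache stops growing.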

The process is illustrated in Figure 2 below, which shows an example with $W = 3$.
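The $i \bmod W$ rule can be sketched as a small class. This is a toy illustration rather than the reference implementation: the class name, the method names, and the use of strings in place of key/value tensors are all assumptions made for readability.

```python
class RollingBufferCache:
    """Fixed-size KV cache where timestep i is written to slot i mod W."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window  # each slot holds (position, key, value)
        self.next_position = 0

    def append(self, key, value):
        # Overwrites the oldest entry once more than `window` tokens were seen.
        slot = self.next_position % self.window
        self.slots[slot] = (self.next_position, key, value)
        self.next_position += 1

    def contents(self):
        """Cached entries in chronological order."""
        entries = [e for e in self.slots if e is not None]
        return sorted(entries, key=lambda e: e[0])


cache = RollingBufferCache(window=3)
for i in range(5):
    cache.append(f"k{i}", f"v{i}")
cached_positions = [pos for pos, _, _ in cache.contents()]
```

After five tokens with a window of 3, only the last three positions remain and the buffer still has exactly three slots.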

For a sequence length of 32K tokens, this reduces cache memory usage by 8x (32K / 4096 = 8) without affecting model quality.

Pre-fill and Chunking

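When generating a sequence, tokens have to be predicted one by one, because each token is conditioned on the previously generated ones.

The prompt, however, is known in advance, so the (k, v) cache can be pre-filled with it.

If the prompt is very large, it can be split into smaller chunks, and the cache is pre-filled one chunk at a time.

The window size is used as the chunk size, and for each chunk, attention is computed over both the cache and the chunk itself.

Figure 3 below illustrates this.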
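The chunked pre-fill described above can be sketched by tracking only token positions. This is an illustrative simplification, not the paper's code: the function name is hypothetical, no real keys or values are computed, and the window is assumed to include the current token.

```python
def chunked_prefill_attention(prompt_len: int, window: int) -> list:
    """For each prompt position, list the key positions it attends to when the
    prompt is processed in chunks of size `window`."""
    cache = []     # positions currently held in the KV cache (at most `window`)
    attended = []
    for start in range(0, prompt_len, window):
        chunk = list(range(start, min(start + window, prompt_len)))
        for i in chunk:
            # Keys come from the cache plus the causal part of the current chunk.
            visible = cache + [j for j in chunk if j <= i]
            # The sliding window then drops keys that are too far back.
            attended.append([j for j in visible if i - j < window])
        # Keep only the most recent `window` positions for the next chunk.
        cache = (cache + chunk)[-window:]
    return attended


attended = chunked_prefill_attention(prompt_len=10, window=4)
```

Processing the prompt chunk by chunk gives every token the same set of visible keys as a single pass with a sliding window mask, while the cache never holds more than `window` positions.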

3. Results

The models were evaluated on the following benchmarks.

Commonsense Reasoning (0-shot): Hellaswag, Winogrande, PIQA, SIQA, OpenbookQA, ARC-Easy, ARC-Challenge, CommonsenseQA
World Knowledge (5-shot): NaturalQuestions, TriviaQA
Reading Comprehension (0-shot): BoolQ, QuAC
Math: GSM8K (8-shot) with maj@8 and MATH (4-shot) with maj@4
Code: HumanEval (0-shot) and MBPP (3-shot)
Popular aggregated results: MMLU (5-shot), BBH (3-shot), and AGI Eval (3-5-shot, English multiple-choice questions only)

Mistral 7B outperforms the Llama models on most of these benchmarks.

4. Instruction Finetuning

The model was fine-tuned on instruction datasets that are publicly available on the Hugging Face repository.

On MT-Bench, it outperforms all 7B models and is comparable to 13B chat models.

References:

https://devocean.sk.com/blog/techBoardDetail.do?ID=165192

https://verticalserve.medium.com/group-query-attention-58283b337c65

https://velog.io/@jpseo99/Flash-Attention

https://taewan2002.medium.com/%EC%84%B1%EB%8A%A5-%EC%B5%9C%EC%A0%81%ED%99%94%EB%A5%BC-%EC%9C%84%ED%95%9C-flash-attention-2-41a345808005

https://pytorch.kr/blog/2024/flashattention-3/

https://moon-walker.medium.com/long-context%EB%A1%9C-%EC%9D%B8%ED%95%9C-large-kv-cache%EC%9D%98-%EB%AC%B8%EC%A0%9C%EC%A0%90%EA%B3%BC-%ED%95%B4%EA%B2%B0-%EB%B0%A9%EC%95%88-part-i-kv-cache%EC%9D%98-%EB%A9%94%EB%AA%A8%EB%A6%AC-%EC%9A%94%EA%B5%AC%EB%9F%89-025f3d5dea93

https://medium.com/@joaolages/kv-caching-explained-276520203249

https://dytis.tistory.com/54
