Why Memory I/O Efficiency Matters for AI Model Performance

by Batching, 5 min read, 2025/02/25

Too Long; Didn't Read

Bifurcated attention improves AI performance by reducing latency and memory I/O cost, which benefits applications such as code generation, chatbots, and long-context processing.

Authors:

(1) Ben Athiwaratkun, AWS AI Labs;

(2) Sujan Kumar Gonugondla, AWS AI Labs;

(3) Sanjay Krishna Gouda, AWS AI Labs;

(4) Haifeng Qian, AWS AI Labs;

(5) Sanjay Krishna Gouda, AWS AI Labs;

(6) Hantian Ding, AWS AI Labs;

(7) Qing Sun, AWS AI Labs;

(8) Jun Wang, AWS AI Labs;

(9) Jiacheng Guo, AWS AI Labs;

(10) Liangfu Chen, AWS AI Labs;

(11) Parminder Bhatia, GE HealthCare (work done at AWS);

(12) Ramesh Nallapati, Amazon AGI (work done at AWS);

(13) Sudipta Sengupta, AWS AI Labs;

(14) Bing Xiang, Goldman Sachs (work done at AWS).

Table of Links

Abstract and 1. Introduction

2. Related Work

3. Background

3.1. Notation and 3.2. Language Model Inference

3.3. Multi-Query, Multi-Head and the Generalized Multi-Query Attention

4. Context-Aware Bifurcated Attention and 4.1. Motivation

4.2. Formulation and 4.3. Memory IO Complexity

5. Experiments

5.1. Comparing Capabilities of Multi-Head, Multi-Query, and Multi-Group Attention

5.2. Latencies of Capabilities-Equivalent Models

5.3. Applications

6. Conclusion and References


A. FAQs

B. Related Work

C. Setup

D. Multi-Group Attention Family

E. Context-Aware Bifurcated Attention

F. Applications: Additional Results

G. Compatibility with Speculative Decoding and Fast Decoding Techniques

B. Related Work

B.1. Applications of Single-Context Batch Sampling

The latency reduction we observe can have a profound impact on many applications. Some of these applications include:


• Code Generation: In software development, AI-assisted code generation can benefit greatly from reduced latency, especially when generating multiple code snippets or suggestions for a given context. This can lead to a more responsive and efficient user experience for developers using AI-powered Integrated Development Environments (IDEs) or code completion tools (Nijkamp et al., 2023; 2022; Chen et al., 2021; Le et al., 2022; Fried et al., 20; Et al. et al., 2023; Li et al., 2023; Ahmad et al., 2021).


• Machine Translation: In situations where multiple translations are needed for a single input, such as generating translations with varying degrees of formality or for different dialects, context-aware bifurcated attention can provide more efficient computation, resulting in faster and more scalable machine translation services (Costa-jussà et al., 2022; Farhad et al., Trarhan et al. 2021; Yee et al., 2019).


• Chatbots and Conversational AI: Conversational agents often need to generate multiple responses to handle different interpretations of a user's input or to provide multiple suggestions. The reduced latency offered by the proposed method can significantly improve the responsiveness of chatbots, leading to a more natural and fluid conversation with users (Google, 2023).


• Creative Content Generation: In applications such as poetry, story, or advertisement generation, the ability to generate multiple variations for a given prompt is crucial. The proposed method enables more efficient generation of diverse content, making it more feasible for real-time or large-scale applications (Lin and Riedl, 2021; Mirowski et al., 2023; Team, 2023; Yuan et al., 2022).


• Data Augmentation: In data augmentation for machine learning, generating multiple alternative examples for a given input can help improve model robustness and generalization. With the reduced latency provided by context-aware bifurcated attention, augmented data can be generated faster, enabling more efficient use of computational resources during training.


• General Large Scale Evaluation: In addition to the use cases above, there are many niche use cases where LLMs and other open-ended generation models are examined for toxicity (Dathathri et al., 2019; Gehman et al., 2020; Nadeem et al., 2020), detection of vulnerable code in generations, performance-improving code edit generation (Madaan et al., 2023), programming language translation (Roziere et al., 2020), and many others. In all these scenarios, many generations per prompt are collected to understand the models in depth, and bifurcated attention can drastically speed up the generation process.


In conclusion, the proposed context-aware bifurcated attention method can significantly reduce memory I/O cost and improve latency in a variety of applications, leading to greater efficiency and scalability. The method has the potential to enable new use cases and improve the user experience in many AI-powered systems, making them more practical for real-world deployment.
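To make the memory I/O argument concrete, here is a minimal NumPy sketch of the idea behind single-context batch sampling: the context keys and values are stored once and shared by all samples, while each sample keeps its own decoded keys and values. It is an illustration under simplifying assumptions (a single attention head, one decoding step, no masking), not the authors' implementation; the function names and the element-count model `kv_elements_read` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(q, k_ctx, v_ctx, k_dec, v_dec):
    # q: (b, d); k_ctx, v_ctx: (m_c, d); k_dec, v_dec: (b, m_d, d)
    b = q.shape[0]
    # Baseline: replicate the shared context for every sample in the batch.
    k = np.concatenate([np.broadcast_to(k_ctx, (b,) + k_ctx.shape), k_dec], axis=1)
    v = np.concatenate([np.broadcast_to(v_ctx, (b,) + v_ctx.shape), v_dec], axis=1)
    logits = np.einsum("bd,bmd->bm", q, k) / np.sqrt(q.shape[-1])
    return np.einsum("bm,bmd->bd", softmax(logits), v)

def bifurcated_attention(q, k_ctx, v_ctx, k_dec, v_dec):
    # Assumed sketch: attend to the single shared context and to the
    # per-sample decoded tokens separately, then join under one softmax.
    scale = 1.0 / np.sqrt(q.shape[-1])
    m_c = k_ctx.shape[0]
    logits_ctx = np.einsum("bd,md->bm", q, k_ctx) * scale   # context read once
    logits_dec = np.einsum("bd,bmd->bm", q, k_dec) * scale  # per-sample part
    w = softmax(np.concatenate([logits_ctx, logits_dec], axis=1))
    return (np.einsum("bm,md->bd", w[:, :m_c], v_ctx)
            + np.einsum("bm,bmd->bd", w[:, m_c:], v_dec))

def kv_elements_read(b, m_c, m_d, d, shared_context):
    # Simplified count of key/value elements loaded in one decoding step.
    ctx = m_c * d if shared_context else b * m_c * d
    return 2 * (ctx + b * m_d * d)
```

Both functions produce the same output; the difference is that the second never replicates the context across the batch, so in this simplified model the context term of the key/value traffic no longer grows with the batch size.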

B.2. Supporting Long Context Requires IO-Efficient Attention

As language models become general purpose and highly capable, the demand for them to handle longer context sequences has grown significantly. Recently, there has been a continued focus on models that can handle even longer context sequences (Bulatov et al., 2023; OpenAI, 2023; Team, 2023). As of today, GPT-4 (OpenAI, 2023) supports a context length of 32k tokens and MPT-7B (Team, 2023) extends it to 64k, while Anthropic's Claude [3] supports input lengths as long as 100k. Most recently, Bulatov et al. (2023) proposed a 1M-token input context length for transformers. These models push the boundaries of context understanding and generation capability, enabling more comprehensive discourse understanding and contextually informed responses.


This trend is driven by the need for comprehensive discourse understanding in applications such as Retrieval-Augmented Generation (RAG), as well as many complex prompting methods. Applications such as RAG (Guu et al., 2020; Izacard et al., 2022; Menick et al., 2022; Zhen et al., 2022) retrieve extensive passages or documents from external corpora, providing rich and grounded context for generating responses. Additionally, models such as Toolformer (Schick et al., 2023) and WebGPT (Nakano et al., 2021) leverage external tools, such as APIs and search engines, to expand the context and enhance generation.


Long context is disproportionately expensive for transformer-family models because, for vanilla self-attention, both memory and time complexity are quadratic in the sequence length. To handle longer context sequences effectively, optimizing memory I/O and reducing computational overhead are critical. Currently, the dominant approaches to this challenge have been to make the attention computation less expensive. Beltagy et al. (2020) proposed sparsifying self-attention using various attention patterns. Wang et al. (2020) explore a low-rank approximation of self-attention. Beyond these compute-bound improvements, advances in memory-efficient attention mechanisms and techniques for reducing memory I/O will continue to propel the field forward, facilitating the handling of longer context sequences in large language models. FlashAttention (Dao et al., 2022) was proposed to speed up self-attention and reduce the memory footprint without any approximation. It uses a fused kernel for the matrix multiplication and softmax operations, which greatly reduces memory IO during training.
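The quadratic growth can be made tangible with a short calculation. The sketch below estimates the size of the full attention score matrix that vanilla self-attention would materialize for a single head; the 2-byte element size (half precision) and the helper name `attention_scores_bytes` are assumptions chosen for illustration.

```python
def attention_scores_bytes(seq_len, num_heads=1, bytes_per_element=2):
    # A full score matrix has seq_len x seq_len entries per head.
    return num_heads * seq_len * seq_len * bytes_per_element

if __name__ == "__main__":
    for n in (32_000, 64_000, 100_000):
        gib = attention_scores_bytes(n) / 2**30
        print(f"{n:>7} tokens: {gib:.2f} GiB per head")
```

Doubling the sequence length quadruples this figure, which is why approaches that avoid writing the full matrix to memory, or that shrink the key/value traffic, matter more as contexts grow.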


This paper is available on arxiv under the CC BY 4.0 DEED license.


[3] https://www.anthropic.com/index/100k-context-windows
