Authors:
(1) David Raposo, Google DeepMind, with equal contribution;
(2) Sam Ritter, Google DeepMind;
(3) Blake Richards, Google DeepMind and McGill University & Mila;
(4) Timothy Lillicrap, Google DeepMind;
(5) Peter Conway Humphreys, Google DeepMind;
(6) Adam Santoro, Google DeepMind, with equal contribution.
Editor's note: this is part 1 of 5 of a study detailing a way to make transformer-based language models more efficient by dynamically allocating computational resources. Read the rest below.
3.1. Defining a compute budget
3.2. Routing around transformer blocks
3.3. Routing schemes
3.4. Routing implementation
3.5. Sampling and 3.6. Training methods
Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens (𝑘) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-𝑘 routing mechanism. Since 𝑘 is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the 𝑘 tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPs and wall-clock training times, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
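To make the fixed-capacity idea concrete, here is a minimal sketch of top-𝑘 token selection in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `top_k_token_indices`, the per-token `router_scores`, and the tie-breaking rule are all hypothetical. The point it shows is that the number of selected tokens is always exactly `k`, so downstream tensor sizes are known in advance, while which tokens are selected depends on the input.

```python
from typing import List


def top_k_token_indices(router_scores: List[float], k: int) -> List[int]:
    """Return the positions of the k highest-scoring tokens, in sequence order.

    `router_scores` is a hypothetical per-token scalar produced by a router;
    the name and this function are illustrative, not taken from the paper.
    """
    if not 0 <= k <= len(router_scores):
        raise ValueError("k must be between 0 and the sequence length")
    # Rank positions by score (highest first); ties go to the earlier position.
    ranked = sorted(range(len(router_scores)), key=lambda i: (-router_scores[i], i))
    # Keep exactly k positions, restored to their original sequence order.
    return sorted(ranked[:k])


# Two sequences with different scores: the chosen positions differ,
# but the number of processed tokens (k) is the same for both.
scores_a = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]
scores_b = [0.6, 0.1, 0.5, 0.2, 0.9, 0.4]
selected_a = top_k_token_indices(scores_a, k=3)
selected_b = top_k_token_indices(scores_b, k=3)
print(selected_a)  # [1, 3, 5]
print(selected_b)  # [0, 2, 4]
```

Because the output always has length `k`, any computation applied to the selected tokens has a fixed shape regardless of the routing decisions.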
Not all problems require the same amount of time or effort to solve. Analogously, in language modeling not all tokens and sequences require the same time or effort to accurately make a prediction. And yet, transformer models expend the same amount of compute per token in a forward pass. Ideally, transformers would use smaller total compute budgets by not spending compute unnecessarily.
Conditional computation is a technique that tries to reduce total compute by expending it only when needed (Bengio et al., 2016; Bengio, 2013; Bengio et al., 2013). Various algorithms offer solutions to when and how much compute should be used (Ainslie et al., 2023; Bapna et al., 2020; Fedus et al., 2022). However, general formulations of this challenging problem may not work well with existing hardware constraints, since they tend to introduce dynamic computation graphs (Dehghani et al., 2018; Graves, 2016). The most promising conditional computation methods may instead be those that are harmonious with our current hardware stack, which prioritizes static computation graphs and known tensor sizes that are selected to maximize hardware utilization.
Here we consider the problem of language modeling using a static compute budget that can be made less than that used by a vanilla transformer. The network must learn how to dynamically allocate the available compute by making decisions per token, in each layer, about where to spend compute from the available budget. In our implementation, total compute is user defined and unchanging prior to training, rather than being a function of the network's on-the-fly decisions. Thus, hardware efficiency gains (such as a reduced memory footprint, or reduced FLOPs per forward pass) can be anticipated and exploited ahead of time. As we will show, these gains can be had without sacrificing overall performance.
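As a rough illustration of why a user-defined budget makes savings predictable ahead of time, the sketch below estimates the cost of a capacity-limited block relative to a full block. It rests on a simplifying assumption that is ours, not a measured result: per-token work (projections and the MLP) scales linearly with the number of processed tokens, while the attention-score matrix scales quadratically. The function name and the example values `seq_len=2048` and `capacity=1024` are hypothetical.

```python
def relative_block_cost(seq_len: int, capacity: int) -> dict:
    """Rough cost of a capacity-limited block relative to a full block.

    Assumption (illustrative only): per-token work scales linearly with the
    number of processed tokens, and the attention-score matrix quadratically.
    """
    if not 0 < capacity <= seq_len:
        raise ValueError("capacity must be in the range 1..seq_len")
    ratio = capacity / seq_len
    return {"per_token": ratio, "attention_scores": ratio ** 2}


# Hypothetical setting: process half of the tokens in a block.
cost = relative_block_cost(seq_len=2048, capacity=1024)
print(cost)  # {'per_token': 0.5, 'attention_scores': 0.25}
```

Since `capacity` is fixed before training, this estimate can be computed once and does not depend on which tokens the network chooses to route.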
We leverage an approach akin to Mixture of Experts (MoE) transformers, in which dynamic token-level routing decisions are made across the network depth. Departing from MoE, we choose to either apply a computation to a token (as would be the case for a standard transformer), or pass it through a residual connection (so that it remains unchanged and compute is saved). Also in contrast to MoE, we apply this routing to both the feed-forward MLPs and multi-head attention. Since this also impacts the keys and queries we process, the routing makes decisions not only about which tokens to update, but also about which tokens are made available to attend to. We refer to this strategy as Mixture-of-Depths (MoD) to emphasize how individual tokens pass through different numbers of layers, or blocks, through the depth of the transformer (see figure 1).
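The choice between computing on a token and passing it along the residual connection can be sketched as follows. This is a simplified illustration, not the paper's implementation: `route_through_block`, `toy_block`, and the router scores are hypothetical, the block is a stand-in for attention plus MLP, and any weighting of the block output by the router score is left out.

```python
from typing import Callable, List

Vector = List[float]


def route_through_block(
    tokens: List[Vector],
    router_scores: List[float],
    k: int,
    block: Callable[[List[Vector]], List[Vector]],
) -> List[Vector]:
    """Apply `block` to the k highest-scoring tokens; pass the rest unchanged."""
    if len(tokens) != len(router_scores):
        raise ValueError("need exactly one router score per token")
    if not 0 <= k <= len(tokens):
        raise ValueError("k must be between 0 and the sequence length")
    ranked = sorted(range(len(tokens)), key=lambda i: (-router_scores[i], i))
    selected = sorted(ranked[:k])
    # The block only sees the selected tokens, so any mixing inside it
    # (a stand-in for attention) can only involve those tokens.
    updates = block([tokens[i] for i in selected])
    output = [list(t) for t in tokens]  # residual path: every token copied as-is
    for i, update in zip(selected, updates):
        output[i] = [x + u for x, u in zip(tokens[i], update)]
    return output


def toy_block(selected: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for attention + MLP: the mean of the selected tokens."""
    if not selected:
        return []
    dim = len(selected[0])
    mean = [sum(t[d] for t in selected) / len(selected) for d in range(dim)]
    return [list(mean) for _ in selected]


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
scores = [0.2, 0.9, 0.1, 0.8]
routed = route_through_block(tokens, scores, k=2, block=toy_block)
print(routed)  # [[1.0, 0.0], [1.5, 2.0], [2.0, 2.0], [4.5, 2.0]]
```

Tokens 0 and 2 skip the block and come out unchanged, while tokens 1 and 3 are updated using only each other, which mirrors the point that routing also determines which tokens are available to attend to.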
The MoD technique also allows one to trade off performance against speed. On the one hand, one can train an MoD transformer that improves upon vanilla transformers by as much as 1.5% on the final log probability training objective for equivalent training FLOPs (isoFLOP), while taking an equivalent amount of wall-clock time to train. On the other hand, one can train an MoD transformer that achieves training loss parity with an isoFLOP vanilla transformer, but which uses a fraction of the FLOPs (upwards of 50%) per forward pass, and hence is faster to step. Together, these results imply that MoD transformers learn to route intelligently (i.e., skipping computations that are unnecessary), since they can achieve equal or better log probabilities per sequence despite a smaller FLOP footprint per forward pass.
This paper is available on arxiv under the CC BY 4.0 DEED license.