Embeddings for RAG - A Complete Overview


by Shrinivasan Sankar (@aibites) · 9 min read · 2024/11/30

Too Long; Didn't Read

Embedding is a crucial and fundamental step in building a Retrieval Augmented Generation (RAG) pipeline. BERT and SBERT are state-of-the-art embedding models. Sentence transformers is the Python library that works with both models. This article dives into both the theory and the hands-on.

This article starts with transformers and examines their shortcomings as an embedding model. It then gives an overview of BERT and a deep dive into Sentence BERT (SBERT), which is the state-of-the-art in sentence embeddings for LLMs and RAG pipelines.

Visual Explanation

If you are a visual person like me and would prefer a visual explanation, please check out this video:

Transformers

Transformers need no introduction. Though they were initially designed for language translation tasks, they are the workhorses behind all LLMs today.


At a high level, they are composed of two blocks - the encoder and the decoder. The encoder block takes in the input and outputs a matrix representation. The decoder block takes in the output of the last encoder and produces the output. The encoder and decoder blocks can be composed of several layers, though the original transformer has 6 layers in each block.


All the layers are composed of multi-headed self-attention. However, one difference between the encoder and the decoder is that the output of the encoder is fed to every decoder layer. In addition, the decoder's self-attention layers are masked. So, the output at any given position is influenced only by the outputs at the positions before it.
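
To make this concrete, below is a minimal sketch (my illustration, assuming PyTorch) of the causal mask used in masked decoder self-attention, where position i may only attend to positions up to and including i:

 import torch

 # Causal (lower-triangular) mask for a 5-token sequence:
 # row i has 1s only at columns <= i, so token i cannot "see" the future.
 seq_len = 5
 mask = torch.tril(torch.ones(seq_len, seq_len))
 print(mask)
 # tensor([[1., 0., 0., 0., 0.],
 #         [1., 1., 0., 0., 0.],
 #         [1., 1., 1., 0., 0.],
 #         [1., 1., 1., 1., 0.],
 #         [1., 1., 1., 1., 1.]])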


The encoder and decoder blocks also contain layer normalization and feed-forward neural network layers.


Unlike earlier architectures like RNNs or LSTMs that processed tokens independently, the power of transformers lies in their ability to capture the context of each token with respect to the entire sequence. This allows them to capture far more context than any previous architecture designed for language processing.

What's Wrong With Transformers?

Transformers are the most successful architectures driving the AI revolution today. So, I may be shown the door if I point out their limitations. However, as a matter of fact, to reduce computational overhead, their attention layers are designed to attend only to past tokens. This is fine for most tasks. But it may not be adequate for a task like question answering. Let's take the example below.


John came with Milo to the party. Milo had loads of fun at the party. He is a beautiful, white cat with fur.


Let's say we ask the question, “Did Milo drink at the party with John?” Based just on the first two sentences in the example above, it is quite likely that an LLM will answer, “Considering that Milo had loads of fun indicates that Milo drank at the party.”


However, a model trained with forward context would be aware of the third sentence, “He is a beautiful, white cat with fur.” And so, it would reply, “Milo is a cat, so it is unlikely he drank at the party.”


Though this is a hypothetical example, you get the idea. In a question-answering task, learning both forward and backward context becomes crucial. This is where the BERT model comes in.

BERT

BERT stands for Bidirectional Encoder Representations from Transformers. As the name indicates, it is built on Transformers, and it incorporates both forward and backward contexts. Though it was initially published for tasks like question answering and summarization, it has the potential to produce powerful embeddings thanks to its bidirectional nature.

BERT Model

BERT is nothing but transformer encoders stacked together in sequence. The only difference is that BERT uses bidirectional self-attention, while the vanilla transformer uses constrained self-attention where every token can only attend to the context to its left.


Note: sequence vs. sentence. Just a note on terminology to avoid confusion when dealing with the BERT model. A sentence is a series of words separated by a period. A sequence can be any number of sentences stacked together.


To understand BERT, let's take the example of question answering. As question answering involves a minimum of two sentences, BERT is designed to accept pairs of sentences in the format <question-answer>. This leads to special separator tokens like [CLS] passed at the beginning to indicate the start of the sequence. The [SEP] token is then used to separate the question and the answer.


So, a simple input now becomes [CLS]<question>[SEP]<answer>[SEP], as shown in the figure below.

The two sentences A and B are passed through the WordPiece embedding model after adding the [CLS] and [SEP] tokens. As we have two sentences, the model needs additional embeddings to differentiate them. These come in the form of segment and position embeddings.


The segment embedding, shown in green below, indicates whether the input tokens belong to sentence A or B. Then comes the position embedding, which indicates the position of each token in the sequence.

Figure taken from the BERT paper showing the input representation of the model.


All three embeddings are summed together and fed to the BERT model, which is bidirectional as shown in the first figure. It captures not only the forward context but also the backward context before giving us the outputs for each token.
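
To see the special tokens and segment ids in practice, here is a minimal sketch using the Hugging Face transformers tokenizer (an assumption on my part; the article itself does not use this library):

 from transformers import AutoTokenizer

 tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
 enc = tokenizer("Did Milo drink at the party?", "Milo is a cat.")

 # The [CLS] and [SEP] tokens are inserted automatically:
 # ['[CLS]', ...question..., '[SEP]', ...answer..., '[SEP]']
 # (the exact word pieces depend on the vocabulary)
 print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))

 # Segment ids: 0 for sentence A (question), 1 for sentence B (answer).
 print(enc["token_type_ids"])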

Pre-Training BERT

The BERT model is pre-trained using two unsupervised tasks:

  • Masked language model (MLM). Here we mask some percentage of the tokens in the sequence and let the model predict the masked tokens. It is also known as the cloze task. In practice, 15% of the tokens are masked for this task.

  • Next Sentence Prediction (NSP). Here, we make the model predict the sentence that follows in the sequence. Whenever the sentence is the actual next one, we use the label IsNext, and when it is not, we use the label NotNext.

    Pre-training of the BERT model with the NSP and MLM tokens at the output.


As seen in the figure above from the paper, the first output token is used for the NSP task, and the masked tokens in the middle are used for the MLM task.


As we are training at the token level, each input token produces an output token. As with any classification task, cross-entropy loss is used to train the model.
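
As a toy illustration of the masking step, here is a simplified sketch (my own; real BERT pre-training also leaves some selected tokens unchanged or swaps in random ones):

 import random

 def mask_tokens(tokens, mask_prob=0.15):
     # Replace ~15% of tokens with [MASK]; the model is trained to
     # predict the original token at each masked position.
     masked, labels = [], []
     for tok in tokens:
         if random.random() < mask_prob:
             masked.append("[MASK]")
             labels.append(tok)    # target the model must recover
         else:
             masked.append(tok)
             labels.append(None)   # position ignored by the loss
     return masked, labels

 print(mask_tokens("the cat sat on the mat".split()))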

What's Wrong With BERT?

While BERT may be good at capturing both forward and backward contexts, it may not be best suited to finding similarities between thousands of sentences. Consider the task of finding the most similar pair of sentences in a large collection of 10,000 sentences. In other words, we would like to “retrieve” the sentence most similar to sentence A from the 10,000 sentences.


To do this, we need to pair up every possible combination of two sentences from the 10,000. That would be n * (n - 1) / 2 = 49,995,000 pairs! Damn, that's quadratic complexity. It would take the BERT model 65 hours to create the embeddings and solve this comparison.
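
The arithmetic is easy to verify:

 n = 10_000
 pairs = n * (n - 1) // 2
 print(pairs)  # 49995000 -- one BERT forward pass per pair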


Simply put, the BERT model is not the best for similarity search. But retrieval and similarity search are at the heart of any RAG pipeline. The solution lies with SBERT.

SBERT - Sentence BERT

The limitation of BERT largely stems from its cross-encoder design, where we feed two sentences together in a sequence with a [SEP] token in between. If only each sentence were treated separately, we could pre-compute the embeddings and use them directly to compute similarity as and when needed. This is exactly the proposition of Sentence BERT, or SBERT in short.


SBERT introduces the Siamese network into the BERT architecture. The word means twin or closely related.

The meaning of Siamese, taken from dictionary.com


So, in SBERT, we have the same BERT network connected as “twins.” The model embeds the first sentence followed by the second instead of processing them sequentially.

Note: It is quite common practice to draw 2 networks side by side to visualize Siamese networks. But in practice, it is a single network taking two different inputs.

SBERT Architecture

Below is a diagram that gives an overview of the SBERT architecture.

The Siamese network architecture with the classification objective. The outputs U and V from the two branches are concatenated along with their difference |U — V|.

First, we can see that SBERT introduces a pooling layer right after BERT. This reduces the dimension of BERT's output to cut down computation. BERT typically produces outputs of dimension 512 x 768. The pooling layer reduces this to 1 x 768. The default pooling strategy is mean pooling, though max and [CLS] pooling also work.
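
Here is a minimal sketch of mean pooling (assuming NumPy and a random stand-in for the real 512 x 768 BERT output):

 import numpy as np

 token_embeddings = np.random.rand(512, 768)  # stand-in for BERT's output
 attention_mask = np.ones(512)                # 1 = real token, 0 = padding

 # Average only over the real tokens: (512, 768) -> (768,)
 summed = (token_embeddings * attention_mask[:, None]).sum(axis=0)
 sentence_embedding = summed / attention_mask.sum()
 print(sentence_embedding.shape)  # (768,)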


Next, let's look at how SBERT's training approach differs from BERT's.

Pre-Training

SBERT proposes three ways of training the model. Let's look at each of them.


Natural Language Inference (NLI) - Classification Objective

SBERT is fine-tuned on the Stanford Natural Language Inference (SNLI) and Multi-Genre NLI (MNLI) datasets for this. SNLI has 570K sentence pairs and MNLI has 430K. The pairs have a premise (P) and a hypothesis (H) that lead to one of three labels:


  • Entailment - the premise suggests the hypothesis
  • Neutral - the premise and hypothesis could both be true, but they are not necessarily related
  • Contradiction - the premise and hypothesis contradict each other


Given the two sentences P and H, the SBERT model produces two outputs U and V. These are then concatenated as (U, V, |U — V|).


The concatenated output is used to train SBERT with the Classification Objective. It is fed to a feed-forward neural network with 3 class outputs (Entailment, Neutral, and Contradiction). Softmax cross-entropy is used for training, just as we train for any other classification task.
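
A minimal PyTorch sketch of this objective (the batch size and variable names are my own illustrative assumptions):

 import torch
 import torch.nn as nn

 u = torch.randn(8, 768)  # pooled embeddings for a batch of premises
 v = torch.randn(8, 768)  # pooled embeddings for the hypotheses

 features = torch.cat([u, v, (u - v).abs()], dim=-1)  # (8, 3 * 768)
 classifier = nn.Linear(3 * 768, 3)  # entailment / neutral / contradiction
 labels = torch.randint(0, 3, (8,))  # stand-in gold labels
 loss = nn.CrossEntropyLoss()(classifier(features), labels)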


Sentence Similarity - Regression Objective

Instead of concatenating U and V, we directly compute the cosine similarity between the two embeddings. Similar to any standard regression problem, we use a mean-squared error loss to train for regression. During inference, the same network can be used directly to compare any two sentences. SBERT gives a score for how similar the two sentences are.
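
A minimal sketch of the regression objective, again with illustrative stand-in tensors:

 import torch
 import torch.nn.functional as F

 u = torch.randn(8, 768)  # pooled embeddings, sentence A
 v = torch.randn(8, 768)  # pooled embeddings, sentence B
 gold = torch.rand(8)     # stand-in human-annotated similarity scores

 cos_sim = F.cosine_similarity(u, v, dim=-1)
 loss = F.mse_loss(cos_sim, gold)  # mean-squared error on the scores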


Triplet Similarity - Triplet Objective

The triplet similarity objective was first introduced in face recognition and has slowly been adapted to other areas of AI such as text and robotics.


Here, 3 inputs are fed to SBERT instead of 2 - an anchor, a positive, and a negative. The dataset used for this should be chosen accordingly. To create one, we could pick any text data, choose two consecutive sentences as a positive pair, and then pick a random sentence from a different paragraph as the negative sample.


The triplet loss is then calculated by comparing how close the positive is to the anchor against how close the negative is to it.
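
A minimal sketch of the triplet loss (assuming Euclidean distance and a margin of 1, which is what the SBERT paper uses):

 import torch

 anchor = torch.randn(8, 768)    # embeddings of the anchor sentences
 positive = torch.randn(8, 768)  # consecutive sentences (positives)
 negative = torch.randn(8, 768)  # random other sentences (negatives)

 d_pos = (anchor - positive).norm(dim=-1)  # anchor-positive distance
 d_neg = (anchor - negative).norm(dim=-1)  # anchor-negative distance
 margin = 1.0                              # epsilon in the SBERT paper
 loss = torch.clamp(d_pos - d_neg + margin, min=0).mean()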

With that introduction to BERT and SBERT, let's do a quick hands-on to understand how we can obtain embeddings for any given sentences using these models.

Hands-on SBERT

Ever since its publication, the official library for SBERT, sentence-transformers, has gained popularity and matured. It is good enough to be used in production RAG use cases. So let's use it out of the box.


To begin, let's install it in a fresh Python environment.

 !pip install sentence-transformers


There are several variants of the SBERT model that we can load from the library. Let's load one for illustration.

 from sentence_transformers import SentenceTransformer

 model = SentenceTransformer('bert-base-nli-mean-tokens')


We can simply create a list of sentences and call the model's encode function to create the embeddings. It's that simple!

 sentences = [
     "The weather is lovely today.",
     "It's so sunny outside!",
     "He drove to the stadium.",
 ]

 embeddings = model.encode(sentences)
 print(embeddings.shape)


And we can get the similarity scores between the embeddings using the single line below:

 similarities = model.similarity(embeddings, embeddings)
 print(similarities)


Note that the similarity of a sentence with itself is 1, as expected:

 tensor([[1.0000, 0.6660, 0.1046],
         [0.6660, 1.0000, 0.1411],
         [0.1046, 0.1411, 1.0000]])

Conclusion

Embedding is a crucial and fundamental step in getting a RAG pipeline to work at its best. I hope that was useful and opened your eyes to what happens under the hood whenever we use sentence transformers out of the box.


Stay tuned for upcoming articles on RAG and its inner workings, combined with hands-on tutorials too.