Building a Flexible Framework for Multimodal Data Input in Large Language Models

by ritabratamaiti · 5m · 2024/11/19

Too Long; Didn't Read

AnyModal is an open-source framework designed to make training multimodal LLMs easier by reducing boilerplate and simplifying the integration of diverse data types such as text, images, and audio. It provides modular components for tokenization, feature encoding, and projection, so developers can focus on building applications without wrestling with the complexity of multimodal integration. Demos include training VLMs for image captioning, LaTeX OCR, and radiology captioning.

My Open Source Project: A Flexible Multimodal Language Model Framework for PyTorch


The promise of multimodal AI is everywhere, from advanced healthcare diagnostics to richer, more dynamic customer experiences. But for those of us in the trenches, building multimodal systems that can handle text, images, audio, and beyond often feels like an endless tangle of custom integrations, boilerplate code, and compatibility issues. This was my frustration, and it ultimately led to the creation of AnyModal.


Why Multimodal AI?

Let's face it: human interaction with the world is not limited to one type of data. We interpret words, visuals, sounds, and physical sensations at the same time. Multimodal AI stems from this idea. By bringing multiple types of data into the same processing pipeline, it enables models to tackle tasks that were previously too complex for single-modality systems. Imagine healthcare applications that analyze X-rays and medical notes together, or customer service systems that consider both text and audio cues to gauge customer sentiment accurately.


But here is the challenge: while single-modality models for text (like GPT) or images (like ViT) are well established, combining them so that they interact fluidly is not straightforward. The technical complexity has kept many researchers and developers from seriously exploring multimodal AI. Enter AnyModal.


The Problem with Existing Multimodal Solutions

In my own work with machine learning, I noticed that while tools like GPT, ViT, and audio processors are powerful in isolation, creating multimodal systems by combining them often means stitching them together with clunky, project-specific code. This approach does not scale. Current solutions for integrating modalities are either highly specialized, designed only for specific tasks (like image captioning or visual question answering), or they require a frustrating amount of boilerplate code just to get the data types working together.


Existing frameworks focus narrowly on specific combinations of modalities, which makes it difficult to expand into new data types or to adapt the same setup to different tasks. This "siloed" structure of AI models meant I was constantly reinventing the wheel. That is when I decided to build AnyModal: a flexible, modular framework that brings all types of data together without the hassle.


What Is AnyModal?

AnyModal is a framework designed to simplify and streamline multimodal AI development. It reduces the complexity of combining diverse input types by handling the tokenization, encoding, and generation for non-text inputs, making it easier to add new data types to large language models (LLMs).


The concept revolves around a modular approach to the input pipeline. With AnyModal, you can swap out feature encoders (like a Vision Transformer for images or a spectrogram processor for audio) and connect them seamlessly to an LLM. The framework abstracts away much of the complexity, so you do not have to spend weeks writing code to make these systems compatible with each other.
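
To make the "swap out the encoder" idea concrete, here is a minimal, framework-free sketch. It is not AnyModal's actual API: `FeatureEncoder`, `ToyImageEncoder`, `ToyAudioEncoder`, and `extract_features` are hypothetical names, and the toy encoders only stand in for real models such as a ViT or a spectrogram processor. The point is that downstream code depends on a shared interface rather than on a particular modality.

```python
from typing import Protocol, Sequence

Vector = list[float]


class FeatureEncoder(Protocol):
    """Hypothetical interface: anything that turns raw input into feature vectors."""

    def encode(self, raw: Sequence[float]) -> list[Vector]: ...


class ToyImageEncoder:
    """Stand-in for a vision encoder: splits pixel values into 4-value 'patches'."""

    def encode(self, raw: Sequence[float]) -> list[Vector]:
        usable = len(raw) - len(raw) % 4  # drop any incomplete trailing patch
        return [list(raw[i:i + 4]) for i in range(0, usable, 4)]


class ToyAudioEncoder:
    """Stand-in for an audio encoder: summarizes 2-sample frames as (mean, peak)."""

    def encode(self, raw: Sequence[float]) -> list[Vector]:
        usable = len(raw) - len(raw) % 2  # drop any incomplete trailing frame
        frames = [raw[i:i + 2] for i in range(0, usable, 2)]
        return [[sum(frame) / len(frame), max(frame)] for frame in frames]


def extract_features(encoder: FeatureEncoder, raw: Sequence[float]) -> list[Vector]:
    """Downstream code sees only the interface, never the modality."""
    return encoder.encode(raw)


image_features = extract_features(ToyImageEncoder(), [0.1] * 8)
audio_features = extract_features(ToyAudioEncoder(), [0.0, 1.0, 0.5, 0.5])
print(len(image_features), len(image_features[0]))
print(audio_features)
```

Swapping modalities here means passing a different encoder object; nothing else in the pipeline changes.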

The Fundamentals of AnyModal: Input Tokenization

A crucial component of AnyModal is the input tokenizer, which bridges the gap between non-text data and the LLM's text-based input processing. Here is how it works:

  • Feature Encoding: For each modality (such as images or audio), a specialized encoder extracts the essential features. For instance, when working with images, AnyModal can use a Vision Transformer (ViT) that processes the image and outputs a sequence of feature vectors. These vectors capture key aspects such as objects, spatial relations, and textures, which matter for applications like image captioning or visual question answering.
  • Projection Layer: After encoding, the feature vectors often do not match the LLM's token space. To ensure smooth integration, AnyModal uses a projection layer that transforms these vectors to align with the LLM's input tokens. For example, the encoded vectors from ViT are mapped into the LLM's embedding space, allowing a coherent flow of multimodal data within the LLM's architecture.
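
The projection step can be sketched as a plain linear map. The example below is a toy written in pure Python so the shapes are easy to follow; the class name `LinearProjector` and the sizes (6 and 4) are assumptions for illustration, and the internals of AnyModal's own projector may differ. In a real model this would be a trained layer such as `torch.nn.Linear`, with weights learned during training rather than drawn at random.

```python
import random

Vector = list[float]


class LinearProjector:
    """Toy linear layer: maps in_features-wide vectors to out_features-wide ones."""

    def __init__(self, in_features: int, out_features: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.in_features = in_features
        # weight[j][i] connects input dimension i to output dimension j.
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_features)]
            for _ in range(out_features)
        ]
        self.bias = [0.0] * out_features

    def __call__(self, features: list[Vector]) -> list[Vector]:
        projected = []
        for vec in features:
            if len(vec) != self.in_features:
                raise ValueError(f"expected {self.in_features} values, got {len(vec)}")
            # One dot product per output dimension, plus its bias term.
            projected.append([
                sum(w * x for w, x in zip(row, vec)) + b
                for row, b in zip(self.weight, self.bias)
            ])
        return projected


# Assumed sizes: 3 encoder vectors of width 6, LLM embedding width 4.
encoder_output = [[float(i + j) for j in range(6)] for i in range(3)]
projector = LinearProjector(in_features=6, out_features=4)
llm_ready = projector(encoder_output)
print(len(llm_ready), len(llm_ready[0]))
```

The number of vectors is unchanged; only their width changes, from the encoder's feature size to the LLM's embedding size.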

This dual-layer approach lets the model treat multimodal data as a single sequence, so it can generate responses that account for all input types. Essentially, AnyModal transforms disparate data sources into a unified format that LLMs can understand.
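
As a rough illustration of what "a single sequence" means, the sketch below concatenates prompt embeddings, a start marker, the projected non-text embeddings, and an end marker into one list of equal-width vectors. The function name `build_input_sequence`, the ordering of the parts, and the tiny embedding width are all assumptions made for this example; the exact layout AnyModal uses internally is not specified here.

```python
Vector = list[float]


def build_input_sequence(
    prompt_embeddings: list[Vector],
    start_embedding: Vector,
    modality_embeddings: list[Vector],
    end_embedding: Vector,
) -> list[Vector]:
    """Concatenate all parts into one sequence of same-width embeddings."""
    sequence = [*prompt_embeddings, start_embedding, *modality_embeddings, end_embedding]
    widths = {len(vec) for vec in sequence}
    if len(widths) != 1:
        raise ValueError(f"all embeddings must share one width, got {sorted(widths)}")
    return sequence


EMBED_DIM = 4  # assumed LLM embedding width, kept tiny for readability
prompt = [[0.1] * EMBED_DIM for _ in range(3)]        # e.g. 3 text-token embeddings
image_tokens = [[0.5] * EMBED_DIM for _ in range(2)]  # e.g. 2 projected image patches
start = [1.0] * EMBED_DIM                             # embedding of a start marker
end = [-1.0] * EMBED_DIM                              # embedding of an end marker

sequence = build_input_sequence(prompt, start, image_tokens, end)
print(len(sequence))
```

The width check is the reason the projection layer exists: concatenation only makes sense once every vector lives in the LLM's embedding space.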


How It Works: An Example with Image Inputs

To give you a sense of how AnyModal operates, let's look at an example of using image data with LLMs.

    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        ViTForImageClassification,
        ViTImageProcessor,
    )
    from anymodal import MultiModalModel
    from vision import VisionEncoder, Projector

    # Step 1: Initialize Vision Components
    processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
    vision_model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
    vision_encoder = VisionEncoder(vision_model)

    # Step 2: Define Projection Layer for Compatibility
    vision_tokenizer = Projector(in_features=vision_model.config.hidden_size, out_features=768)

    # Step 3: Initialize LLM and Tokenizer
    llm_tokenizer = AutoTokenizer.from_pretrained("gpt2")
    llm_model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Step 4: Build the AnyModal Multimodal Model
    multimodal_model = MultiModalModel(
        input_processor=None,
        input_encoder=vision_encoder,
        input_tokenizer=vision_tokenizer,
        language_tokenizer=llm_tokenizer,
        language_model=llm_model,
        input_start_token='<|imstart|>',
        input_end_token='<|imend|>',
        prompt_text="Describe this image: "
    )

This modular setup lets developers plug and play with different encoders and LLMs, adapting the model to various multimodal tasks, from image captioning to question answering.


Current Applications of AnyModal

AnyModal has already been applied to several use cases, with exciting results:

  • LaTeX OCR: Translating complex mathematical equations into readable text.
  • Chest X-Ray Captioning: Generating medical descriptions for diagnostic support in healthcare.
  • Image Captioning: Automatically generating captions for visual content, which helps accessibility and media applications.

By abstracting away the complexity of handling different data types, AnyModal lets developers quickly build prototypes or refine advanced systems without the bottlenecks that typically come with multimodal integration.


Why Use AnyModal?

If you are trying to build a multimodal system, you have probably run into these challenges:

  • High complexity in aligning different data types with LLMs.
  • Redundant and tedious boilerplate code for each modality.
  • Limited scalability when adding new data types.

AnyModal addresses these pain points by reducing boilerplate, offering flexible modules, and allowing quick customization. Instead of fighting compatibility issues, developers can focus on building smart systems faster and more efficiently.


What's Next for AnyModal?

The journey of AnyModal is just beginning. I am currently working on adding support for additional modalities such as audio captioning, and on expanding the framework to make it more adaptable to niche use cases. Community feedback and contributions are crucial to its development. If you are interested in multimodal AI, I would love to hear your ideas or collaborate.


Where to Find AnyModal




If you are excited about multimodal AI or looking to streamline your development process, give AnyModal a try. Let's work together to unlock the next frontier of AI innovation.