The promise of multimodal AI is everywhere, from advanced healthcare diagnostics to richer, more dynamic customer experiences. But for those of us in the trenches, building multimodal systems (ones that can process text, images, audio, and more) often feels like an endless tangle of custom integrations, boilerplate code, and compatibility issues. That was my frustration, and it eventually led to the creation of AnyModal.
Let's face it: human interaction with the world is not limited to one type of data. We interpret words, visuals, sounds, and physical sensations at the same time. Multimodal AI starts from that idea. By bringing several types of data into a single processing pipeline, it lets models handle tasks that were too complex for single-modality systems. Imagine healthcare applications that analyze X-rays and medical notes together, or customer service systems that weigh both text and audio cues to gauge customer sentiment accurately.
But here is the challenge: while single-modality models for text (like GPT) or images (like ViT) are well established, getting them to work together smoothly is not straightforward. That technical complexity has kept many researchers and developers from properly exploring multimodal AI. Enter AnyModal.
In my own work with machine learning, I noticed that while tools like GPT, ViT, and audio processors are powerful on their own, building multimodal systems out of them usually means stitching them together with clunky, project-specific code. That approach does not scale. Current solutions for combining modalities are either highly specialized and designed only for specific tasks (such as image captioning or visual question answering), or they require a frustrating amount of boilerplate code just to get the data types working together.
Existing frameworks focus narrowly on specific combinations of modalities, which makes it hard to extend them to new data types or to adapt one setup to different tasks. This "siloed" structure of AI models meant I was constantly reinventing the wheel. That is when I decided to build AnyModal, a flexible, modular framework that brings all types of data together without the hassle.
AnyModal is a framework designed to simplify and streamline multimodal AI development. It reduces the complexity of combining different input types by handling the tokenization, encoding, and generation for non-text inputs, which makes it easier to add new data types to large language models (LLMs).
The core idea is a modular approach to the input pipeline. With AnyModal, you can swap feature encoders (such as a Vision Transformer for images or a spectrogram processor for audio) and connect them seamlessly to an LLM. The framework abstracts away much of the complexity, so you do not have to spend weeks writing code to make these systems compatible.
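To make the "swappable encoder" idea concrete, here is a minimal sketch in plain Python. It is not AnyModal's actual interface: the `Encoder` protocol, the toy encoder classes, and `ModalityPipeline` are all hypothetical names invented for illustration. The point is only that the pipeline depends on a small shared interface rather than on any particular modality.

```python
from typing import List, Protocol, Sequence


class Encoder(Protocol):
    """Hypothetical interface: anything that turns raw input into feature vectors."""

    output_dim: int

    def encode(self, raw: Sequence[float]) -> List[List[float]]: ...


def chunk(values: Sequence[float], size: int) -> List[List[float]]:
    """Split a flat sequence into consecutive vectors of `size`, dropping any remainder."""
    usable = len(values) - len(values) % size
    return [list(values[i:i + size]) for i in range(0, usable, size)]


class ToyImageEncoder:
    """Stand-in for a vision encoder: groups values into 4-dimensional 'patches'."""

    output_dim = 4

    def encode(self, raw: Sequence[float]) -> List[List[float]]:
        return chunk(raw, self.output_dim)


class ToyAudioEncoder:
    """Stand-in for an audio encoder: groups values into 2-dimensional 'frames'."""

    output_dim = 2

    def encode(self, raw: Sequence[float]) -> List[List[float]]:
        return chunk(raw, self.output_dim)


class ModalityPipeline:
    """Hypothetical pipeline that only relies on the Encoder interface."""

    def __init__(self, encoder: Encoder) -> None:
        self.encoder = encoder

    def run(self, raw: Sequence[float]) -> dict:
        features = self.encoder.encode(raw)
        return {"num_tokens": len(features), "feature_dim": self.encoder.output_dim}


# The same pipeline code works with either encoder.
signal = [float(i) for i in range(8)]
image_summary = ModalityPipeline(ToyImageEncoder()).run(signal)
audio_summary = ModalityPipeline(ToyAudioEncoder()).run(signal)
```

Swapping modalities here means passing a different encoder object; nothing else in the pipeline changes, which is the property the modular design is after.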
A key component of AnyModal is the input tokenizer, which bridges the gap between non-text data and the LLM's text-based input processing. Here is how it works:
This dual-layer approach lets the model treat multimodal data as a single sequence, so it can generate responses that account for all input types. In essence, AnyModal converts disparate data sources into a unified format that LLMs can understand.
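The "single sequence" idea can be sketched without any deep learning library. The code below is an illustration under assumptions, not AnyModal's implementation: it assumes the input tokenizer is a linear projection from the encoder's feature width to the LLM's embedding width, and that the projected features are wrapped in start and end marker embeddings and followed by the text embeddings. All function names and the toy sizes are hypothetical.

```python
import random


def make_projection(in_features, out_features, seed=0):
    """Return a random (out_features x in_features) weight matrix and a zero bias."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_features)]
               for _ in range(out_features)]
    bias = [0.0] * out_features
    return weights, bias


def project(features, weights, bias):
    """Apply y = W x + b to every feature vector in the sequence."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]
        for vec in features
    ]


def build_sequence(projected, start_emb, end_emb, text_embs):
    """Wrap projected features in start/end markers, then append text embeddings."""
    return [start_emb] + projected + [end_emb] + text_embs


# Toy sizes (assumed for illustration): 4 image patches, 6 features each,
# projected into an LLM embedding width of 3.
encoder_dim, llm_dim = 6, 3
image_features = [[float(i + j) for j in range(encoder_dim)] for i in range(4)]

weights, bias = make_projection(encoder_dim, llm_dim)
image_tokens = project(image_features, weights, bias)

start_emb = [1.0] * llm_dim                      # stand-in for a start-marker embedding
end_emb = [-1.0] * llm_dim                       # stand-in for an end-marker embedding
text_embs = [[0.5] * llm_dim for _ in range(2)]  # stand-in for prompt token embeddings

sequence = build_sequence(image_tokens, start_emb, end_emb, text_embs)
print(len(sequence), len(sequence[0]))
```

Every element of `sequence` has the LLM's embedding width, so a language model could consume image and text positions the same way. In a real system the projection weights would be learned rather than random.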
To give you a sense of how AnyModal works, let's look at an example that uses image data with an LLM.
from transformers import ViTImageProcessor, ViTForImageClassification
from anymodal import MultiModalModel
from vision import VisionEncoder, Projector

# Step 1: Initialize Vision Components
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
vision_model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
vision_encoder = VisionEncoder(vision_model)

# Step 2: Define Projection Layer for Compatibility
vision_tokenizer = Projector(in_features=vision_model.config.hidden_size, out_features=768)

# Step 3: Initialize LLM and Tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
llm_tokenizer = AutoTokenizer.from_pretrained("gpt2")
llm_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Step 4: Build the AnyModal Multimodal Model
multimodal_model = MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision_tokenizer,
    language_tokenizer=llm_tokenizer,
    language_model=llm_model,
    input_start_token='<|imstart|>',
    input_end_token='<|imend|>',
    prompt_text="Describe this image: "
)
This modular setup lets developers plug and play with different encoders and LLMs, adapting the model to a range of multimodal tasks, from image captioning to question answering.
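One practical detail when swapping LLMs: the projector's output width has to match the embedding width of the language model it feeds. The example above hard-codes `out_features=768`, which is the embedding width of the base `gpt2` checkpoint. The sketch below shows one way to derive that number instead of hard-coding it. `ProjectorSpec` and `projector_for` are hypothetical helpers, not part of AnyModal, and the lookup table covers only the public GPT-2 checkpoints.

```python
from dataclasses import dataclass


@dataclass
class ProjectorSpec:
    """Hypothetical container for the two widths a projection layer needs."""

    in_features: int
    out_features: int


# Embedding widths of the public GPT-2 checkpoints.
GPT2_HIDDEN_SIZES = {
    "gpt2": 768,
    "gpt2-medium": 1024,
    "gpt2-large": 1280,
    "gpt2-xl": 1600,
}


def projector_for(encoder_hidden_size: int, llm_name: str) -> ProjectorSpec:
    """Pick projector widths so encoder features map onto the LLM's embedding width."""
    if llm_name not in GPT2_HIDDEN_SIZES:
        raise KeyError(f"unknown LLM checkpoint: {llm_name}")
    return ProjectorSpec(encoder_hidden_size, GPT2_HIDDEN_SIZES[llm_name])


vit_base_hidden = 768  # hidden size of google/vit-base-patch16-224
spec_small = projector_for(vit_base_hidden, "gpt2")
spec_medium = projector_for(vit_base_hidden, "gpt2-medium")
```

With Hugging Face models loaded, the same value can usually be read from the model's config rather than from a table, in the same way the example reads `vision_model.config.hidden_size` for the encoder side.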
AnyModal has already been applied to several use cases, with promising results:
By abstracting away the complexity of handling different data types, AnyModal lets developers quickly build prototypes or refine advanced systems without the bottlenecks that usually come with multimodal integration.
If you have tried to build a multimodal system, you have probably run into these challenges:
AnyModal addresses these pain points by reducing boilerplate, providing flexible modules, and allowing quick customization. Instead of wrestling with compatibility issues, developers can focus on building smart systems faster and more efficiently.
The journey of AnyModal is just beginning. I am working on adding support for more modalities, such as audio captioning, and on extending the framework so it adapts better to niche use cases. Community feedback and contributions are essential to its development. If you are interested in multimodal AI, I would love to hear your ideas or collaborate.
If you are excited about multimodal AI or looking to streamline your development process, give AnyModal a try. Let's work together to unlock the next frontier of AI innovation.