AnyModal is a framework designed to combine multiple "modalities" (such as images, text, or other data) into a single, coherent workflow. Instead of stitching together disparate libraries or writing custom glue code to connect vision and language models, AnyModal provides a structured pipeline in which each component (image encoder, tokenizer, language model) can be plugged in without heavy customization. Because the framework handles the connections between these pieces, you can focus on the high-level task: feeding in an image, for example, and getting text back out.
In practice, AnyModal can help with tasks such as image captioning, classification, or, in the case shown here, LaTeX OCR. Because the framework is modular, it is easy to swap one model for another (e.g., a different vision backbone or a newer language model), which makes it adaptable to experimental or specialized use cases.
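As a minimal sketch of that modularity, swapping the vision backbone can be a one-line change. This assumes the repository's vision.get_image_encoder helper (shown later in this tutorial) accepts any compatible Hugging Face image-encoder checkpoint; the SigLIP checkpoint name below is an illustrative choice, not taken from the original code:

import vision  # helper module from the AnyModal repository

# Default ViT backbone used later in this tutorial
image_processor, vision_model, hidden_size = vision.get_image_encoder('google/vit-base-patch16-224')

# Hypothetical swap to a SigLIP backbone; the rest of the pipeline
# (projector, language model) stays unchanged.
image_processor, vision_model, hidden_size = vision.get_image_encoder('google/siglip-base-patch16-224')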
Converting an image of a mathematical expression into a valid LaTeX string requires combining computer vision with natural language processing. The image encoder's job is to extract features or patterns from the picture of the equation, such as recognizing "plus", "minus", and other symbols. The language component then uses these features to predict the appropriate LaTeX tokens in sequence.
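Conceptually, the encoder produces a sequence of feature vectors, and a small projection layer maps them into the language model's embedding space so they can serve as a prefix for autoregressive generation. Here is a minimal sketch of that hand-off using the vision.Projector helper from the AnyModal repository; the shapes and hidden sizes are illustrative assumptions (196 ViT patches of size 768, and 2048 for Llama 3.2 1B's hidden size):

import torch
import vision  # projector helper from the AnyModal repository

# Illustrative assumption: a ViT-style encoder yields 196 patch features
# of size 768 for a single image.
patch_features = torch.randn(1, 196, 768)

# Map 768-dim vision features into the language model's 2048-dim
# embedding space; the LM then decodes LaTeX tokens from this prefix.
projector = vision.Projector(768, 2048, num_hidden=1)
soft_prompt = projector(patch_features)  # expected shape: (1, 196, 2048)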
LaTeX OCR with AnyModal is really a demonstration of how quickly you can pair a vision encoder with a language model. Although this example focuses on equations, the same approach can be extended to other image-to-text scenarios, including more advanced or specialized mathematical notation.
By the end of this tutorial, you will be able to use AnyModal, together with Llama 3.2 1B and Google's SigLIP, to build a small VLM for LaTeX OCR tasks:
Note that the weights released at AnyModal/LaTeX-OCR-Llama-3.2-1B were obtained by training on only 20% of the dataset.
You will likely get a better model by training on the full dataset, and for a larger number of epochs.
For those mainly interested in generating LaTeX from existing images, here is a demonstration using the pretrained weights. This avoids the need to train anything from scratch, offering a quick way to see AnyModal in action. Below is a condensed overview of setting up your environment, downloading the required models, and running inference.
Clone the AnyModal repository:
git clone https://github.com/ritabratamaiti/AnyModal.git
Install the required libraries:
pip install torch torchvision huggingface_hub pillow
Then, download the pretrained weights hosted on the Hugging Face Hub:
from huggingface_hub import snapshot_download

snapshot_download("AnyModal/LaTeX-OCR-Llama-3.2-1B", local_dir="latex_ocr_model")
These specific weights are available here: LaTeX-OCR-Llama-3.2-1B on Hugging Face
Next, load the vision encoder and the language model:
import llm
import anymodal
import vision
from PIL import Image

# Load language model and tokenizer
tokenizer, model = llm.get_llm("meta-llama/Llama-3.2-1B")

# Load vision-related components
image_processor, vision_model, vision_hidden_size = vision.get_image_encoder('google/vit-base-patch16-224')
vision_encoder = vision.VisionEncoder(vision_model)

# Configure the multimodal pipeline
multimodal_model = anymodal.MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision.Projector(vision_hidden_size, llm.get_hidden_size(tokenizer, model), num_hidden=1),
    language_tokenizer=tokenizer,
    language_model=model,
    prompt_text="The LaTeX expression of the equation in the image is:"
)

# Load the pretrained model weights
multimodal_model._load_model("latex_ocr_model")
multimodal_model.eval()
Finally, provide an image and retrieve the LaTeX output:
# Replace with the path to your equation image
image_path = "path_to_equation_image.png"
image = Image.open(image_path).convert("RGB")

processed_image = image_processor(image, return_tensors="pt")
processed_image = {k: v.squeeze(0) for k, v in processed_image.items()}

latex_output = multimodal_model.generate(processed_image, max_new_tokens=120)
print("Generated LaTeX:", latex_output)
This short sequence of steps exercises the entire pipeline: processing the image, projecting it into the language model's embedding space, and generating the corresponding LaTeX.
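As an optional sanity check (not part of the original walkthrough, and assuming generate returns a plain string), you can render the result with matplotlib's mathtext, which supports a subset of LaTeX:

import matplotlib.pyplot as plt

# Render the generated LaTeX for visual inspection. Mathtext covers
# only a subset of LaTeX, so complex expressions may fail to render.
plt.figure(figsize=(6, 1))
plt.text(0.5, 0.5, f"${latex_output}$", fontsize=18, ha='center', va='center')
plt.axis('off')
plt.show()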
For those who want more control, such as adapting the model to new data or exploring the mechanics of a vision-language pipeline, the training process offers deeper insight. The sections below show how the data is prepared, how the model components are assembled, and how they are optimized together.
Rather than relying on the pretrained components alone, you can obtain a training dataset of images paired with LaTeX labels. One example is the unsloth/LaTeX_OCR dataset, which contains images of equations along with their LaTeX strings. After installing the dependencies and setting up your dataset, the training steps involve creating a data pipeline, initializing the model, and iterating over epochs.
Here is an outline for preparing and loading the dataset:
from torch.utils.data import Subset
import vision

# Load training and validation sets
train_dataset = vision.ImageDataset("unsloth/LaTeX_OCR", image_processor, split='train')
val_dataset = vision.ImageDataset("unsloth/LaTeX_OCR", image_processor, split='test')

# Optionally use a smaller subset for faster iteration
subset_ratio = 0.2
train_dataset = Subset(train_dataset, range(int(subset_ratio * len(train_dataset))))
val_dataset = Subset(val_dataset, range(int(subset_ratio * len(val_dataset))))
At this point, you can build or reuse the AnyModal pipeline defined earlier. Instead of loading the pretrained weights, you initialize the model so it can learn from scratch or from partially trained checkpoints.
multimodal_model = anymodal.MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision.Projector(vision_hidden_size, llm.get_hidden_size(tokenizer, model), num_hidden=1),
    language_tokenizer=tokenizer,
    language_model=model,
    prompt_text="The LaTeX expression of the equation in the image is:"
)
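If you want to resume from a partially trained checkpoint rather than start from scratch, the same _load_model helper used in the inference example should work; the directory name below is a placeholder:

# Optional: resume from a partially trained checkpoint instead of
# training from scratch ("partial_checkpoint_dir" is a placeholder path).
multimodal_model._load_model("partial_checkpoint_dir")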
You can then create a training loop to optimize the model parameters. A typical approach uses PyTorch's AdamW optimizer and optionally employs mixed-precision training for efficiency:
from tqdm import tqdm
import torch

optimizer = torch.optim.AdamW(multimodal_model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)

num_epochs = 5
for epoch in range(num_epochs):
    for batch_idx, batch in tqdm(enumerate(train_loader), desc=f"Epoch {epoch+1} Training"):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            logits, loss = multimodal_model(batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
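Note that the loop above never persists weights. As a generic PyTorch fallback (the AnyModal repository may provide its own saving helper, which is not shown here, and the filename is a placeholder), you can save the trained state dict once training finishes:

# Persist the trained weights with plain PyTorch; the filename is a
# placeholder, and AnyModal may offer its own saving helper instead.
torch.save(multimodal_model.state_dict(), "latex_ocr_model_final.pt")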
After each epoch, or at least once training completes, evaluating the model on the validation set helps confirm that it generalizes to new data:
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=16, shuffle=False)

for batch_idx, batch in enumerate(val_loader):
    predictions = multimodal_model.generate(batch['input'], max_new_tokens=120)
    for idx, prediction in enumerate(predictions):
        print(f"Actual LaTeX: {batch['text'][idx]}")
        print(f"Generated LaTeX: {prediction}")
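One concrete way to turn that validation pass into a single number (a hypothetical addition, reusing the val_loader and multimodal_model defined above) is exact-match accuracy between generated and reference strings:

# Hypothetical metric: fraction of predictions that exactly match the
# reference LaTeX after trimming surrounding whitespace.
correct = 0
total = 0
for batch in val_loader:
    predictions = multimodal_model.generate(batch['input'], max_new_tokens=120)
    for idx, prediction in enumerate(predictions):
        total += 1
        if prediction.strip() == batch['text'][idx].strip():
            correct += 1
print(f"Exact-match accuracy: {correct / total:.2%}")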
Beyond confirming performance, this validation step can guide improvements such as tuning hyperparameters, switching to a different base model, or adjusting your data preprocessing. By following these training steps, you gain a better understanding of the interplay between the vision encoder and the language model, and you can extend the workflow to additional tasks or specialized domains.