
The AI Framework That Has You Covered in Image-to-Text Workflows

by ritabratamaiti | 5 min read | 2024/12/31

Too Long; Didn't Read

What: Convert images of mathematical equations into LaTeX using AnyModal's modular vision-language pipeline. How: Use pretrained weights for quick inference, or train a custom model on your own dataset. Where: Find full examples, code, and model weights on GitHub and Hugging Face. Why: Easily combine multiple AI components (vision + text) without writing extensive bridging code.

About AnyModal

AnyModal is a framework designed to combine multiple "modalities" (such as images, text, or other data) into a single, coherent workflow. Instead of juggling separate libraries or writing custom glue code to connect vision and language models, AnyModal provides a structured pipeline in which each component, whether an image encoder, a tokenizer, or a language model, can be plugged in without heavy customization. By handling the connections between these pieces, AnyModal lets you focus on the high-level task: feeding in an image, for example, and getting a text output.


In practice, AnyModal can help with tasks like image captioning, classification, or, in the case shown here, LaTeX OCR. Because the framework is modular, it is easy to swap one model for another (for example, a different vision backbone or a newer language model), which makes it adaptable to experiments or specialized use cases.


The LaTeX OCR Use Case

Converting an image of a mathematical expression into a valid LaTeX string requires combining computer vision with natural language processing. The image encoder's job is to extract features or patterns from the picture of the equation, such as recognizing "plus," "minus," and other symbols. The language component then uses those features to predict the appropriate LaTeX tokens in sequence.
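To make that flow concrete, here is a rough, illustrative sketch (not the actual AnyModal implementation) of how vision features can become LaTeX tokens. The names vision_encoder, projector, and language_model are placeholders; in AnyModal, the Projector shown later plays the projector role:

import torch

def latex_ocr_forward(pixel_values, vision_encoder, projector, language_model, tokenizer, prompt):
    # 1. The vision encoder turns the equation image into a sequence of feature vectors.
    vision_features = vision_encoder(pixel_values)            # (batch, patches, vision_dim)
    # 2. A small projector maps those features into the language model's embedding space.
    soft_tokens = projector(vision_features)                  # (batch, patches, lm_hidden)
    # 3. The projected "image tokens" are prepended to the prompt embeddings, and the
    #    language model autoregressively predicts the LaTeX tokens.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_embeds = language_model.get_input_embeddings()(prompt_ids)
    inputs_embeds = torch.cat([soft_tokens, prompt_embeds], dim=1)
    return language_model.generate(inputs_embeds=inputs_embeds, max_new_tokens=120)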


LaTeX OCR with AnyModal is essentially a demonstration of how quickly you can pair a vision encoder with a language model. Although this example focuses on equations, the overall approach can be extended to other image-to-text scenarios, including more advanced or specialized mathematical notation.


By the end of this tutorial, you will be able to use AnyModal, together with Llama 3.2 1B and Google's SigLIP, to build a small VLM for LaTeX OCR tasks:

Actual label: f ( u ) = u + \sum _ { n = o d d } \alpha _ { n } \left[ \frac { ( u - \pi ) } { \pi } \right] ^ { n } ,

Generated caption using AnyModal/LaTeX-OCR-Llama-3.2-1B: f ( u ) = u + \sum _ { n = o o d } \alpha _ { n } \left[ \frac { ( u - \pi ) ^ { n } } { \pi } \right] ,


Note that the weights released at AnyModal/LaTeX-OCR-Llama-3.2-1B were obtained by training on only 20% of the unsloth/LaTeX_OCR dataset.


You will likely get a better model by training on the full dataset and for a larger number of epochs.


Quick Inference Example

For those mainly interested in generating LaTeX from existing images, here is a demonstration using the pretrained weights. This avoids the need to train anything from scratch and offers a quick way to see AnyModal in action. Below is a condensed overview of setting up your environment, downloading the required models, and running inference.


Clone the AnyModal repository:

 git clone https://github.com/ritabratamaiti/AnyModal.git


Install the required libraries:

 pip install torch torchvision huggingface_hub pillow


Then, download the pretrained weights hosted on the Hugging Face Hub:

 from huggingface_hub import snapshot_download

 snapshot_download("AnyModal/LaTeX-OCR-Llama-3.2-1B", local_dir="latex_ocr_model")


These particular weights can be found here: LaTeX-OCR-Llama-3.2-1B on Hugging Face


Next, load the vision encoder and the language model:

 import llm
 import anymodal
 import vision
 from PIL import Image

 # Load language model and tokenizer
 tokenizer, model = llm.get_llm("meta-llama/Llama-3.2-1B")

 # Load vision-related components
 image_processor, vision_model, vision_hidden_size = vision.get_image_encoder('google/vit-base-patch16-224')
 vision_encoder = vision.VisionEncoder(vision_model)

 # Configure the multimodal pipeline
 multimodal_model = anymodal.MultiModalModel(
     input_processor=None,
     input_encoder=vision_encoder,
     input_tokenizer=vision.Projector(vision_hidden_size, llm.get_hidden_size(tokenizer, model), num_hidden=1),
     language_tokenizer=tokenizer,
     language_model=model,
     prompt_text="The LaTeX expression of the equation in the image is:"
 )

 # Load the pretrained model weights
 multimodal_model._load_model("latex_ocr_model")
 multimodal_model.eval()


Finally, supply an image and get the LaTeX output:

 # Replace with the path to your equation image
 image_path = "path_to_equation_image.png"

 image = Image.open(image_path).convert("RGB")
 processed_image = image_processor(image, return_tensors="pt")
 processed_image = {k: v.squeeze(0) for k, v in processed_image.items()}

 latex_output = multimodal_model.generate(processed_image, max_new_tokens=120)
 print("Generated LaTeX:", latex_output)


This short sequence of steps runs the entire pipeline: it analyzes the image, projects it into the language model's embedding space, and generates the corresponding LaTeX.
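If you want to run the same pipeline over many images, a small helper like the hypothetical one below keeps things tidy; it simply repeats the preprocessing and generate calls from the snippet above:

 import os
 from PIL import Image

 def ocr_folder(folder, model, image_processor, max_new_tokens=120):
     # Run LaTeX OCR over every image in a folder and collect the outputs by file name.
     results = {}
     for name in sorted(os.listdir(folder)):
         if not name.lower().endswith((".png", ".jpg", ".jpeg")):
             continue
         image = Image.open(os.path.join(folder, name)).convert("RGB")
         processed = image_processor(image, return_tensors="pt")
         processed = {k: v.squeeze(0) for k, v in processed.items()}
         results[name] = model.generate(processed, max_new_tokens=max_new_tokens)
     return results

 # Example usage with the objects defined above:
 # outputs = ocr_folder("equation_images", multimodal_model, image_processor)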


Training Tutorial

For those who want more control, such as adapting the model to new data or exploring the mechanics of a vision-language pipeline, the training process offers deeper insight. The sections below show how the data is prepared, how the model components are combined, and how they are optimized together.


Rather than relying on pretrained components alone, you can obtain a training dataset of images paired with LaTeX labels. One example is the unsloth/LaTeX_OCR dataset, which contains equation images together with their LaTeX strings. After installing the dependencies and setting up your dataset, the training steps involve creating a data pipeline, initializing the model, and iterating over epochs.
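The training walkthrough also pulls the dataset from the Hugging Face Hub and reports progress with tqdm, so you will likely need a couple of packages beyond the inference setup (the exact requirements depend on the AnyModal training scripts; this list is an assumption):

 pip install datasets tqdm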


Here is an outline for preparing and loading the dataset:

 from torch.utils.data import Subset
 import vision

 # Load training and validation sets
 train_dataset = vision.ImageDataset("unsloth/LaTeX_OCR", image_processor, split='train')
 val_dataset = vision.ImageDataset("unsloth/LaTeX_OCR", image_processor, split='test')

 # Optionally use a smaller subset for faster iteration
 subset_ratio = 0.2
 train_dataset = Subset(train_dataset, range(int(subset_ratio * len(train_dataset))))
 val_dataset = Subset(val_dataset, range(int(subset_ratio * len(val_dataset))))


At this point, you can build or reuse the AnyModal pipeline defined earlier. Instead of loading the pretrained weights, you initialize the model so it can learn from scratch or from partially trained checkpoints.

 multimodal_model = anymodal.MultiModalModel(
     input_processor=None,
     input_encoder=vision_encoder,
     input_tokenizer=vision.Projector(vision_hidden_size, llm.get_hidden_size(tokenizer, model), num_hidden=1),
     language_tokenizer=tokenizer,
     language_model=model,
     prompt_text="The LaTeX expression of the equation in the image is:"
 )


You can then create a training loop to optimize the model's parameters. A typical approach uses PyTorch's AdamW optimizer and optionally employs mixed-precision training for efficiency:

 from tqdm import tqdm
 import torch

 optimizer = torch.optim.AdamW(multimodal_model.parameters(), lr=1e-4)
 scaler = torch.cuda.amp.GradScaler()

 train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)

 num_epochs = 5
 for epoch in range(num_epochs):
     for batch_idx, batch in tqdm(enumerate(train_loader), desc=f"Epoch {epoch+1} Training"):
         optimizer.zero_grad()
         with torch.cuda.amp.autocast():
             logits, loss = multimodal_model(batch)
         scaler.scale(loss).backward()
         scaler.step(optimizer)
         scaler.update()


After each epoch, or at least when training finishes, evaluating the model on the validation set helps confirm that it generalizes to new data:

 val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=16, shuffle=False)

 for batch_idx, batch in enumerate(val_loader):
     predictions = multimodal_model.generate(batch['input'], max_new_tokens=120)
     for idx, prediction in enumerate(predictions):
         print(f"Actual LaTeX: {batch['text'][idx]}")
         print(f"Generated LaTeX: {prediction}")
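Once you are happy with the validation results, you will probably want to persist the fine-tuned weights. Assuming MultiModalModel behaves like a standard torch.nn.Module (its parameters() call above suggests it does), plain PyTorch serialization works; the file name here is just an example:

 import torch

 # Save the trained parameters.
 torch.save(multimodal_model.state_dict(), "latex_ocr_finetuned.pt")

 # Later, restore them for inference.
 multimodal_model.load_state_dict(torch.load("latex_ocr_finetuned.pt"))
 multimodal_model.eval()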


Beyond confirming performance, this validation step can guide improvements such as tuning hyperparameters, switching to a different base model, or adjusting your data preprocessing. By following these training steps, you gain a better understanding of the interplay between the vision encoder and the language model, and you can extend the workflow to additional tasks or specialized domains.
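As a sketch of two such follow-ups, the snippet below swaps in a different vision backbone and adds a learning-rate schedule. The SigLIP checkpoint name and the assumption that vision.get_image_encoder accepts any Hugging Face image-model ID are illustrative, not something the AnyModal code guarantees:

 import torch
 import vision

 # Swap the vision backbone, e.g. to Google's SigLIP (mentioned earlier in this post).
 # After swapping, rebuild the MultiModalModel as shown above so the Projector
 # matches the new vision_hidden_size.
 image_processor, vision_model, vision_hidden_size = vision.get_image_encoder(
     "google/siglip-base-patch16-224"
 )
 vision_encoder = vision.VisionEncoder(vision_model)

 # Add a cosine learning-rate schedule on top of the AdamW optimizer;
 # call scheduler.step() after each optimizer step in the training loop.
 optimizer = torch.optim.AdamW(multimodal_model.parameters(), lr=1e-4)
 scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
     optimizer, T_max=num_epochs * len(train_loader)
 )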


Additional Resources