Combining LLMs with voice capabilities has created new opportunities for personalized customer interactions.
This guide walks you through setting up a local LLM server that supports two-way voice interaction using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.
Before we begin, you will need to install the following:
FFmpeg can be installed with apt install ffmpeg on Linux or brew install ffmpeg on macOS.
You can install the Python dependencies using pip:

pip install torch transformers accelerate pydub fastapi uvicorn bark python-multipart scipy
First, let's set up our Python environment and select our PyTorch device:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
This code checks whether a CUDA-compatible (Nvidia) GPU is available and sets the device accordingly.
If no such GPU is available, PyTorch will fall back to the CPU, which is considerably slower.
On newer Apple Silicon devices, the device can also be set to mps to run PyTorch on Metal, although PyTorch's Metal implementation is not complete.
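A minimal device-selection sketch that also checks for Metal support might look like this (the mps branch is optional and depends on your PyTorch build):

import torch

if torch.cuda.is_available():
    device = 'cuda'   # Nvidia GPU via CUDA
elif torch.backends.mps.is_available():
    device = 'mps'    # Apple Silicon GPU via Metal (still incomplete)
else:
    device = 'cpu'    # fallback, much slower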
Most open-source LLMs only support text input and text output. However, since we want to build a system you can speak with, this would require two additional models to (1) transcribe speech into text before it is fed into our LLM and (2) convert the LLM's output back into speech.
By using a multimodal LLM such as Qwen Audio, we can get away with a single model that processes speech input into a text response, and then only need a second model to convert the LLM's output back into speech.
This multimodal approach is not only more efficient in terms of inference time and (V)RAM usage, it also generally produces better results, since the input audio is passed directly to the LLM without friction.
If you are using a cloud GPU host such as Runpod or Vast, you will want to point the HuggingFace home and Bark cache directories to your volume storage by running
export HF_HOME=/workspace/hf
export XDG_CACHE_HOME=/workspace/bark
before downloading the models.
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)
We chose the smaller 7B variant of the Qwen Audio model series here to keep our compute requirements modest. However, Qwen may have released larger and more capable audio models by the time you read this article. You can browse all Qwen models on HuggingFace to double-check that you are using their latest model.
For a production environment, you may want to use a fast inference engine such as vLLM to get much higher throughput.
Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages as well as sound effects.
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()
Besides Bark, you can also use other open-source or proprietary text-to-speech models. Keep in mind that while proprietary ones may perform better, they come at a considerably higher cost. The TTS Arena maintains an up-to-date comparison.
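If you want a consistent speaker, Bark also accepts a voice preset through its history_prompt argument; the preset name below is one of the bundled English voices and is only meant as an illustration:

# Generate speech with a specific built-in voice preset.
audio_array = generate_audio(
    "Hello! This is a quick voice test.",
    history_prompt="v2/en_speaker_6",  # one of Bark's bundled speaker presets
)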
With both Qwen Audio 7B and Bark loaded into memory, combined (V)RAM usage is around 24GB, so make sure your hardware supports this. If not, you can use a quantized version of the Qwen model to save memory.
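As a rough sketch, loading the Qwen model in 4-bit with bitsandbytes (an extra pip install bitsandbytes dependency not listed above) could look like this:

from transformers import BitsAndBytesConfig, Qwen2AudioForConditionalGeneration

# 4-bit quantization roughly quarters the memory needed for the weights,
# at the cost of some response quality.
quant_config = BitsAndBytesConfig(load_in_4bit=True)

model = Qwen2AudioForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quant_config,
)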
We will create a FastAPI server with two routes that handle incoming audio or text input and return audio responses.
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn

app = FastAPI()

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    # TODO
    return

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    # TODO
    return

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
This server accepts audio files and text via POST requests to the /voice and /text endpoints respectively.
We will use ffmpeg to process the incoming audio and prepare it for the Qwen model.
from pydub import AudioSegment
from io import BytesIO
import numpy as np

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array
With the audio processed, we can generate a text response using the Qwen model. This needs to handle both text and audio inputs.
The processor will convert our inputs into the model's conversation template (ChatML in Qwen's case).
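For intuition, here is a minimal sketch of what a conversation looks like before the processor renders it into a ChatML-style prompt (the audio bytes and wording are placeholders):

# A single user turn mixing audio and text content.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": b"<raw audio bytes>"},
            {"type": "text", "text": "Please answer the question in the recording."},
        ],
    }
]

# Renders the conversation into a ChatML prompt string with an empty
# assistant turn appended, ready to be tokenized.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
print(prompt)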
def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)

    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

    generate_ids = model.generate(**inputs, max_length=256)
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]

    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]

    return response
Feel free to play around with the generation parameters, such as temperature, in the model.generate call.
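For example, switching to sampling with a slightly higher temperature (the values here are purely illustrative) would look like:

generate_ids = model.generate(
    **inputs,
    max_length=256,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # higher values give more varied responses
    top_p=0.9,         # nucleus sampling cutoff
)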
Finally, we will convert the generated text response back into speech.
from scipy.io.wavfile import write as write_wav

def text_to_speech(text):
    audio_array = generate_audio(text)

    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)

    return output_buffer
Update the endpoints to process the audio or text input, generate a response, and return the synthesized speech as a WAV file.
@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()

    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]

    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)

    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]

    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)

    return StreamingResponse(audio_output, media_type="audio/wav")
You can also choose to add a system message to the conversations for more control over the assistant's responses.
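As a sketch, a system turn could be prepended to the conversation built in the /text endpoint; the prompt wording is just an example:

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly voice assistant. Keep answers short and conversational."}],
    },
    {"role": "user", "content": [{"type": "text", "text": text}]},
]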
We can use curl to query our server as follows:
# Audio input
curl -X POST http://localhost:8000/voice --output output.wav -F "file=@input.wav"

# Text input
curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"
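Alternatively, a small Python client using the requests library (an extra dependency not installed above) can call the same endpoints:

import requests

# Voice endpoint: upload a local recording and save the spoken reply.
with open("input.wav", "rb") as f:
    resp = requests.post("http://localhost:8000/voice", files={"file": f})
with open("reply.wav", "wb") as out:
    out.write(resp.content)

# Text endpoint: send a text prompt and save the spoken reply.
resp = requests.post("http://localhost:8000/text", data={"text": "Hey"})
with open("reply_text.wav", "wb") as out:
    out.write(resp.content)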
By following these steps, you have set up a simple local server capable of two-way voice interaction using state-of-the-art models. This setup can serve as a foundation for building more powerful voice-enabled applications.
If you are exploring ways to monetize AI-powered language models, consider the potential applications of a setup like this. For reference, the complete server code is shown below:
import torch
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from pydub import AudioSegment
from io import BytesIO
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)

preload_models()

app = FastAPI()

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array

def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)

    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

    generate_ids = model.generate(**inputs, max_length=256)
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]

    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]

    return response

def text_to_speech(text):
    audio_array = generate_audio(text)

    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)

    return output_buffer

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()

    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]

    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)

    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]

    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)

    return StreamingResponse(audio_output, media_type="audio/wav")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)