"I said I wanted a B-Movie, dammit!"
Tired of endlessly scrolling through Netflix, unsure of what to watch next? What if you could build your own custom, AI-powered recommendation system that accurately predicts the next movie you'll love?
In this tutorial, we'll guide you through the process of creating a movie recommendation system using vector databases (VectorDBs). You'll learn how modern AI recommendation engines work and get hands-on experience building your own system with Superlinked.
(Want to jump straight to the code? Check out our repo on GitHub here. Ready to try recommender systems for your own use case? Get a demo here.)
We'll be following this notebook throughout the article. You can also run the code straight from your browser using Colab.
Netflix's recommendation algorithm does a very good job of suggesting relevant content, given the sheer volume of options (~16k movies and TV shows in 2023) and how quickly it has to suggest shows to users. How does Netflix do it? In short: semantic search.
Semantic search understands the meaning and context (both attributes and usage patterns) behind user queries and movie/TV show descriptions, and can therefore personalize its queries and recommendations better than traditional keyword-based methods.
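To make the contrast with keyword matching concrete, here is a minimal sketch. The three-dimensional "embeddings" are hand-made toy values (an assumption for illustration, not the output of any real model); a real system would get its vectors from an embedding model.

```python
import math

# Toy 3-dimensional "embeddings", hand-made for illustration only.
# Assumed axes: [romance, humor, action].
TOY_EMBEDDINGS = {
    "funny love story": [0.9, 0.8, 0.0],
    "romantic comedy about two rival chefs": [0.8, 0.9, 0.1],
    "explosive heist thriller": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def keyword_overlap(query, text):
    """Number of words the query and the text share."""
    return len(set(query.lower().split()) & set(text.lower().split()))


query = "funny love story"
candidates = [text for text in TOY_EMBEDDINGS if text != query]

# Keyword matching finds nothing: no candidate shares a word with the query.
keyword_scores = {text: keyword_overlap(query, text) for text in candidates}

# Vector similarity still ranks the romantic comedy far above the thriller.
semantic_scores = {
    text: cosine(TOY_EMBEDDINGS[query], TOY_EMBEDDINGS[text]) for text in candidates
}
best_semantic = max(semantic_scores, key=semantic_scores.get)
```

The query shares no words with either candidate, so a keyword approach cannot rank them, while the vector comparison separates them clearly.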
But semantic search poses certain challenges, foremost among them: 1) ensuring accurate search results, 2) interpretability, and 3) scalability. Any successful content recommendation strategy will have to address these challenges. Using the Superlinked library, you can overcome these difficulties.
In this article, we'll show you how to use the Superlinked library to set up your own semantic search and generate a list of relevant movies based on your preferences.
Semantic search adds a lot of value to vector search, but it confronts developers with the three vector embedding challenges named above: result accuracy, interpretability, and scalability.
The Superlinked library lets you address these challenges. Below, we'll build a content recommender (specifically for movies): we start with the information we have about a given movie, embed that information as a multimodal vector, build a searchable vector index for all our movies, and then use query weights to tweak our results and arrive at good movie recommendations. Let's get into it.
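Below, you'll perform semantic search on the Netflix movie dataset using the following elements of the Superlinked library: the TextSimilaritySpace for textual fields (description, title, and genres), the RecencySpace for release dates, and query-time weights for tuning results.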
Recommending movies successfully is hard, mostly because there are so many options (> 9000 titles in 2023) and users want recommendations on demand, fast. Let's take a data-driven approach to finding something we want to watch. In our movie dataset, we know each movie's description, title, genres, and release date.
We can embed these inputs and assemble a vector index on top of our embeddings, creating a space we can search semantically.
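As a rough illustration of what "embedding the inputs and indexing them" means, the sketch below concatenates unit-length per-field vectors into one vector per movie and searches by dot product. All vectors, names (`combine`, `search`, `toy_index`), and the brute-force search are assumptions made for this example; they are not Superlinked's implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else list(vec)


def combine(field_vectors):
    """Concatenate unit-length per-field vectors into a single item vector."""
    combined = []
    for vec in field_vectors:
        combined.extend(normalize(vec))
    return combined


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical per-field vectors: (description, genre) for each movie.
toy_index = {
    "movie_a": combine([[1.0, 0.0], [0.9, 0.1]]),
    "movie_b": combine([[0.0, 1.0], [0.1, 0.9]]),
}


def search(query_fields, index, top_n=1):
    """Brute-force search: rank items by dot product with the query vector."""
    query = combine(query_fields)
    ranked = sorted(index, key=lambda item_id: dot(query, index[item_id]), reverse=True)
    return ranked[:top_n]


top_hit = search([[0.8, 0.2], [1.0, 0.0]], toy_index)
```

Because every field vector is unit length, the dot product of two concatenated vectors is the sum of the per-field cosine similarities, which is what lets each field be weighted separately later on.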
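Once we have our indexed vector space, we will browse movies using a single query idea (a heartfelt romantic comedy), tweak the results by reweighting individual input fields, search description, title, and genre with separate query texts, and finally search around a specific movie we already like.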
Your first step is to install the library and import the required classes.
(Note: Below, change alt.renderers.enable("mimetype") to alt.renderers.enable("colab") if you are running this in Google Colab. Keep "mimetype" if you are running it on GitHub.)
    %pip install superlinked==5.3.0

    from datetime import timedelta, datetime

    import altair as alt
    import os
    import pandas as pd

    from superlinked.evaluation.charts.recency_plotter import RecencyPlotter
    from superlinked.framework.common.dag.context import CONTEXT_COMMON, CONTEXT_COMMON_NOW
    from superlinked.framework.common.dag.period_time import PeriodTime
    from superlinked.framework.common.schema.schema import schema
    from superlinked.framework.common.schema.schema_object import String, Timestamp
    from superlinked.framework.common.schema.id_schema_object import IdField
    from superlinked.framework.common.parser.dataframe_parser import DataFrameParser
    from superlinked.framework.dsl.executor.in_memory.in_memory_executor import (
        InMemoryExecutor,
        InMemoryApp,
    )
    from superlinked.framework.dsl.index.index import Index
    from superlinked.framework.dsl.query.param import Param
    from superlinked.framework.dsl.query.query import Query
    from superlinked.framework.dsl.query.result import Result
    from superlinked.framework.dsl.source.in_memory_source import InMemorySource
    from superlinked.framework.dsl.space.text_similarity_space import TextSimilaritySpace
    from superlinked.framework.dsl.space.recency_space import RecencySpace

    alt.renderers.enable("mimetype")  # NOTE: to render altair plots in colab, change 'mimetype' to 'colab'
    alt.data_transformers.disable_max_rows()
    pd.set_option("display.max_colwidth", 190)
We also need to prep the dataset: define time constants, set the URL location of the data, create a data store dictionary, read the CSV into a pandas DataFrame, clean the dataframe and data so it can be ingested properly, and do a quick verification and overview. (See cells 3 and 4 for details.)
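The actual prep lives in the notebook cells referenced above and uses pandas. As a rough, standard-library-only sketch of the same kind of steps, the example below reads a tiny made-up CSV, drops rows without a description, turns a release year into a Unix timestamp, and flattens a list-like genre string. The sample rows, the column names `release_year` and `genres`, the helper `prepare_rows`, and the values chosen for `YEAR_IN_DAYS` and `TOP_N` are all assumptions for illustration.

```python
import csv
import io
from datetime import datetime, timezone

YEAR_IN_DAYS = 365  # assumed value for the time constant used by the recency space
TOP_N = 10  # assumed number of results to show per query

# Made-up sample data standing in for the real dataset file.
RAW_CSV = """id,title,description,release_year,genres
tm1,Toy Romance,A heartfelt love story.,2019,"['romance', 'comedy']"
tm2,Missing Blurb,,2005,"['drama']"
tm3,Old Classic,A stage show saves an inn.,1954,"['comedy', 'music']"
"""


def prepare_rows(csv_text):
    """Clean raw rows so they match the fields the schema expects."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["description"].strip():
            continue  # nothing to embed without a description
        year = int(row["release_year"])
        # represent the release year as a Unix timestamp (January 1st, UTC)
        row["release_timestamp"] = int(
            datetime(year, 1, 1, tzinfo=timezone.utc).timestamp()
        )
        # flatten "['romance', 'comedy']" into the plain text "romance comedy"
        parts = row["genres"].strip("[]").split(",")
        row["genres"] = " ".join(part.strip(" '\"") for part in parts)
        rows.append(row)
    return rows


rows = prepare_rows(RAW_CSV)
```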
Now that the dataset is prepared, you can optimize your retrieval using the Superlinked library.
The Superlinked library contains a set of core building blocks that we use to construct the index and manage retrieval. You can read about these building blocks in more detail here.
First, you need to define your Schema to tell the system about your data.
    # accommodate our inputs in a typed schema
    @schema
    class MovieSchema:
        description: String
        title: String
        release_timestamp: Timestamp
        genres: String
        id: IdField


    movie = MovieSchema()
Next, you use Spaces to say how you want to treat each part of the data when embedding. Which Spaces you use depends on your data type. Each Space is optimized to embed its data so that retrieval returns the highest-quality results possible.
In the Space definitions, we describe how the inputs should be embedded so that they reflect the semantic relationships in our data.
    # textual fields are embedded using a sentence-transformers model
    description_space = TextSimilaritySpace(
        text=movie.description, model="sentence-transformers/paraphrase-MiniLM-L3-v2"
    )
    title_space = TextSimilaritySpace(
        text=movie.title, model="sentence-transformers/paraphrase-MiniLM-L3-v2"
    )
    genre_space = TextSimilaritySpace(
        text=movie.genres, model="sentence-transformers/paraphrase-MiniLM-L3-v2"
    )
    # release dates are encoded using our recency space
    # period times aim to reflect notable breaks in our scores
    recency_space = RecencySpace(
        timestamp=movie.release_timestamp,
        period_time_list=[
            PeriodTime(timedelta(days=4 * YEAR_IN_DAYS)),
            PeriodTime(timedelta(days=10 * YEAR_IN_DAYS)),
            PeriodTime(timedelta(days=40 * YEAR_IN_DAYS)),
        ],
        negative_filter=-0.25,
    )
    movie_index = Index(spaces=[description_space, title_space, genre_space, recency_space])
Once you've set up your spaces and created your index, you use the source and executor parts of the library to set up your queries. See cells 10-13 in the notebook.
Now that the queries are prepared, let's move on to running queries and optimizing retrieval by adjusting weights.
The recency space lets you alter the results of your query by preferentially pulling older or newer releases from your dataset. We use 4, 10, and 40 years as our period times so that we can give more focus to the years that have more titles (see cell 5).
Note the score breaks at 4, 10, and 40 years. Titles older than 40 years get the negative_filter score.
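To give a feel for why the score has breaks at the period times, here is a deliberately simplified toy scoring function. It is an assumption made for illustration and not Superlinked's actual recency encoding: each period contributes a value that decays from 1 to 0 across its own span, the contributions are averaged, and anything older than the longest period receives the negative filter value. `YEAR_IN_DAYS = 365` is likewise an assumed value.

```python
import math

YEAR_IN_DAYS = 365  # assumed value
PERIODS_IN_DAYS = [4 * YEAR_IN_DAYS, 10 * YEAR_IN_DAYS, 40 * YEAR_IN_DAYS]
NEGATIVE_FILTER = -0.25


def toy_recency_score(age_in_days, periods=PERIODS_IN_DAYS, negative_filter=NEGATIVE_FILTER):
    """Simplified recency score; NOT Superlinked's actual encoding."""
    if age_in_days > max(periods):
        # older than every period: fall back to the negative filter value
        return negative_filter
    # each period contributes a value decaying from 1 (brand new) to 0 (period end)
    contributions = [
        math.cos(math.pi / 2 * age_in_days / period) if age_in_days <= period else 0.0
        for period in periods
    ]
    return sum(contributions) / len(periods)


# scores for titles that are 1, 5, 20, and 50 years old
scores = [toy_recency_score(years * YEAR_IN_DAYS) for years in (1, 5, 20, 50)]
```

In this toy version, one contribution drops out each time a title's age crosses a period boundary, so the slope of the score changes at 4 and 10 years, and beyond 40 years the score jumps down to the negative filter value.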
Let's define a quick util function to present our results in the notebook.
    def present_result(
        result: Result,
        cols_to_keep: list[str] = ["description", "title", "genres", "release_year", "id"],
    ) -> pd.DataFrame:
        # parse result to dataframe
        df: pd.DataFrame = result.to_pandas()
        # transform timestamp back to release year
        df["release_year"] = [
            datetime.fromtimestamp(timestamp).year for timestamp in df["release_timestamp"]
        ]
        return df[cols_to_keep]
The Superlinked library lets you run different kinds of queries; here we define two. Both of our query types (simple and advanced) let me weight individual spaces (description, title, genre, and recency) according to my preferences. The difference between them is that with a simple query, I set one query text and then surface similar results in the description, title, and genre spaces.
With an advanced query, I have more fine-grained control. If I want, I can enter a different query text for each of the description, title, and genre spaces. Here's the query code:
    query_text_param = Param("query_text")

    simple_query = (
        Query(
            movie_index,
            weights={
                description_space: Param("description_weight"),
                title_space: Param("title_weight"),
                genre_space: Param("genre_weight"),
                recency_space: Param("recency_weight"),
            },
        )
        .find(movie)
        .similar(description_space.text, query_text_param)
        .similar(title_space.text, query_text_param)
        .similar(genre_space.text, query_text_param)
        .limit(Param("limit"))
    )

    advanced_query = (
        Query(
            movie_index,
            weights={
                description_space: Param("description_weight"),
                title_space: Param("title_weight"),
                genre_space: Param("genre_weight"),
                recency_space: Param("recency_weight"),
            },
        )
        .find(movie)
        .similar(description_space.text, Param("description_query_text"))
        .similar(title_space.text, Param("title_query_text"))
        .similar(genre_space.text, Param("genre_query_text"))
        .limit(Param("limit"))
    )
In the simple query, I set my query text and apply different weights to the spaces according to how important each one is to me.
    result: Result = app.query(
        simple_query,
        query_text="Heartfelt romantic comedy",
        description_weight=1,
        title_weight=1,
        genre_weight=1,
        recency_weight=0,
        limit=TOP_N,
    )

    present_result(result)
Our results contain some titles I've already seen. I can deal with this by weighting recency to bias my results toward recent titles. Weights are normalized to have unit sum (that is, all weights are rescaled so that they always add up to 1), so you don't have to worry about the scale you set them on.
    result: Result = app.query(
        simple_query,
        query_text="Heartfelt romantic comedy",
        description_weight=1,
        title_weight=1,
        genre_weight=1,
        recency_weight=3,
        limit=TOP_N,
    )

    present_result(result)
My results are now all from after 2021.
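The effect of unit-sum weights can be sketched in a few lines. The per-space similarity values and the helper names below are hypothetical, and how Superlinked normalizes and applies weights internally may differ; the point is only that rescaled weights produce the same ranking, while shifting weight toward recency changes it.

```python
def normalize_weights(weights):
    """Rescale weights so that their absolute values sum to 1."""
    total = sum(abs(w) for w in weights.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {space: w / total for space, w in weights.items()}


def weighted_score(similarities, weights):
    """Combine per-space similarities using unit-sum weights."""
    normalized = normalize_weights(weights)
    return sum(normalized[space] * similarities[space] for space in similarities)


# Hypothetical per-space similarities for two movies.
old_favorite = {"description": 0.9, "title": 0.8, "genre": 0.9, "recency": 0.1}
new_release = {"description": 0.7, "title": 0.6, "genre": 0.8, "recency": 0.95}

no_recency = {"description": 1, "title": 1, "genre": 1, "recency": 0}
with_recency = {"description": 1, "title": 1, "genre": 1, "recency": 3}

# Without recency the older title scores higher; with recency the new one does.
scores_without = (weighted_score(old_favorite, no_recency), weighted_score(new_release, no_recency))
scores_with = (weighted_score(old_favorite, with_recency), weighted_score(new_release, with_recency))
```

Passing weights of 2, 2, 2, 0 instead of 1, 1, 1, 0 gives exactly the same normalized weights, which is why only the ratios between weights matter.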
Using the simple query, I can weight any specific space (description, title, genre, or recency) to make it count more when returning results. Let's experiment with this. Below, we'll give more weight to genre and downweight title, since my query text is basically just a genre with some additional context. I keep a positive (though lower) recency weight because I'd still like my results to be biased toward recent movies.
    result = app.query(
        simple_query,
        query_text="Heartfelt romantic comedy",
        description_weight=1,
        title_weight=0.1,
        genre_weight=2,
        recency_weight=1,
        limit=TOP_N,
    )

    present_result(result)
This query pushes the release year back a little to give me more genre-weighted results.
The advanced query gives me even finer control. I keep control over recency, but I can also specify separate search texts for description, title, and genre, and assign each one a specific weight according to my preferences, as below (and in cells 19-21).
    result = app.query(
        advanced_query,
        description_query_text="Heartfelt lovely romantic comedy for a cold autumn evening.",
        title_query_text="love",
        genre_query_text="drama comedy romantic",
        description_weight=0.2,
        title_weight=3,
        genre_weight=1,
        recency_weight=5,
        limit=TOP_N,
    )

    present_result(result)
Say that in my final movie results I found a movie I've already seen, and I'd like to see something similar. Let's assume I like White Christmas, a 1954 romantic comedy (id = tm16479) about song-and-dance performers who team up for a stage show to attract guests to a struggling Vermont inn. By adding an extra with_vector clause (with a movie_id parameter) to advanced_query, with_movie_query lets me search using this movie (or any movie I like), while keeping all the fine-grained control over sub-query search texts and weights.
First, we add our movie_id parameter:
    with_movie_query = advanced_query.with_vector(movie, Param("movie_id"))
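Conceptually, searching "around" a movie means using that movie's stored vector as part of the query, optionally nudged by a text query. The sketch below illustrates the idea with hand-made three-dimensional vectors; the vectors, the helper `search_around`, and the way the two parts are blended are assumptions for illustration, not how with_vector is implemented.

```python
import math


def normalize(vec):
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else list(vec)


def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


# Hypothetical stored vectors; the second axis stands for "family-oriented".
toy_index = {
    "reference_movie": [0.9, 0.1, 0.3],
    "similar_family_movie": [0.8, 0.5, 0.3],
    "similar_stage_movie": [0.9, 0.0, 0.4],
    "unrelated_movie": [0.0, 0.2, 0.9],
}


def search_around(item_id, text_vector, index, text_weight=1.0, top_n=2):
    """Rank other items against the reference item's vector plus a text nudge."""
    reference = normalize(index[item_id])
    text = normalize(text_vector)
    query = [r + text_weight * t for r, t in zip(reference, text)]
    others = [other for other in index if other != item_id]
    ranked = sorted(others, key=lambda other: cosine(query, index[other]), reverse=True)
    return ranked[:top_n]


family_direction = [0.0, 1.0, 0.0]
plain_results = search_around("reference_movie", family_direction, toy_index, text_weight=0.0)
family_results = search_around("reference_movie", family_direction, toy_index, text_weight=1.0)
```

With the text weight at zero, the closest neighbor of the reference movie comes first; adding the "family" nudge moves the family-oriented title to the top while still keeping results near the reference movie.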
I can then set my other sub-query search texts either to empty or to whatever is most relevant, along with whatever weights make sense. Say my first query returns results that reflect the stage-performance/troupe aspect of White Christmas (see cell 24), but I want to watch a more family-oriented movie. I can enter a description_query_text to skew my results in the direction I want.
    result = app.query(
        with_movie_query,
        description_query_text="family",
        title_query_text="",
        genre_query_text="",
        description_weight=1,
        title_weight=0,
        genre_weight=0,
        recency_weight=0,
        description_query_weight=1,
        movie_id="tm16479",
        limit=TOP_N,
    )

    present_result(result)
But now that I see my results, I realize I'm actually more in the mood for something light and funny. Let's adjust my query accordingly:
    # assign to `result` (lowercase) so the imported Result type is not overwritten
    result = app.query(
        with_movie_query,
        description_query_text="",
        title_query_text="",
        genre_query_text="comedy",
        description_weight=1,
        title_weight=0,
        genre_weight=2,
        recency_weight=0,
        description_query_weight=1,
        movie_id="tm16479",
        limit=TOP_N,
    )

    present_result(result)
OK, those results are better. I'll pick one of these. Grab the popcorn!
Superlinked makes it easy to test, iterate, and improve your retrieval quality. Above, we've walked you through how to use the Superlinked library to do semantic search on a vector space, the way Netflix does, and return accurate, relevant movie results. We've also seen how to fine-tune our results, tweaking weights and search terms until we arrive at just the right outcome.
Now, try out the notebook yourself, and see what you can achieve!
Recommendation engines are shaping the way we discover content. Whether it's movies, music, or products, vector search is the future, and now you have the tools to build your own.
Author: Mór Kapronczay