Vanyori:
(1) Andrew Draganov, Aarhus University uye Vese vanyori vakapa zvakaenzana kutsvagurudzo iyi;
(2) David Saulpic, Université Paris Cité & CNRS;
(3) Chris Schwiegelshohn, Aarhus University.
2 Kutanga uye Basa Rinoenderana
2.2 Mamwe maCoreset Strategies
2.3 Coresets eDatabase Applications
4 Kuderedza Kukanganisa Kwekupararira
4.1 Kuverengera crude upper-bound
4.2 Kubva paApproximate Solution kusvika Yakaderedzwa Kupararira
5 Kukurumidza Kudzvanya Mukuita
5.1 Chinangwa uye Chikamu cheEmpirical Analysis
5.3 Kuongorora Sampling Strategies
5.4 Yekutenderera Setting uye 5.5 Takeaways
8 Humbowo, Pseudo-Code, uye Mawedzero uye 8.1 Humbowo hweCorollary 3.2
8.2 Kudzikiswa kwek-nzira kuenda k-median
8.3 Kufungidzira Kwemutengo Wakakwana Pamuti
8.4 Kuwedzera kune Algorithm 1
Isu tinodzidza iyo theoretical uye inoshanda yekumhanyisa nguva miganho ye k-zvinoreva uye k-median kuungana pamaseti makuru. Sezvo zvinobudirira nzira dzese dzekubatanidza dzinononoka kupfuura iyo nguva inotora kuverenga dhata, iyo inokurumidza nzira ndeye kukurumidza kudzvanya data uye kuita kubatanidza pane yakamanikidzwa inomiririra. Nehurombo, hapana yakanakisa sarudzo yekudzvanya huwandu hwemapoinzi - nepo kusarudzika sampling ichimhanya munguva yepasi uye macoresets achipa vimbiso yetioretical, yekutanga haitevedze chokwadi nepo iyo yekupedzisira ichinonoka sezvo huwandu hwemapoinzi uye masumbu anokura. Chokwadi, zvave zvichifungidzirwa kuti chero senitivity-yakavakirwa coreset kuvaka inoda yepamusoro-mutsara nguva muhukuru hwedhata.
Isu tinoongorora hukama uhwu nekutanga kuratidza kuti kune algorithm inowana macoresets kuburikidza nekunzwa sampling mune zvinobudirira mutsara nguva - mukati me log-zvinhu zvenguva inotora kuverenga iyo data. Chero nzira inonatsiridza zvakanyanya pane izvi inofanira kubva yaenda kune inoshanda heuristics, ichititungamira kuti tifunge nezve spectrum yesampling mazano kune ese echokwadi uye ekugadzira dhataseti mune inomira uye yekutepfenyura marongero. Kuburikidza neizvi, isu tinoratidza mamiriro ayo macoresets anodiwa kuchengetedza cluster kutendeseka pamwe neseseti umo nekukurumidza, cruder sampling mazano akakwana. Nekuda kweizvozvo, tinopa yakazara theoretical uye inoshanda purani yekubatanidza kunoshanda zvisinei nehukuru hwedata. Kodhi yedu inowanikwa pachena kune kwakabva uye ine zvinyorwa zvekugadzira zvakare zviedzo.
Muongorori wemazuva ano wedata haana kushomeka kwekubatanidza maalgorithms ekusarudza kubva asi, zvichipihwa saizi inoramba ichiwedzera yemaseti akakodzera, mazhinji anowanzo kunonoka kuti abatsire. Izvi zvinonyanya kukosha kune hombe-data mapaipi, uko clustering algorithms inowanzoshandiswa kumanikidza. Chinangwa ndechekutsiva dhatabheti rakakura kwazvo nediki, rinogoneka kumabasa epasi, netarisiro inomiririra iyo yepakutanga yekuisa zvakanaka. Lloyd's algorithm [49] yakaunzwa nekuda kwechikonzero ichi uye inoderedza chikanganiso chehuwandu - huwandu hwechinhambwe cheskweya kubva pane imwe neimwe poindi yekuisa kumumiriri wayo mune yakadzvanywa dataset. Sezvineiwo ndiyo inonyanya kufarirwa kusanganisa algorithm, yaLloyd inomhanyisa kudzokorodza kwakawanda kudzamara kusanganiswa nechero iteration inoda O(ndk) nguva, apo n ndiyo nhamba yemapoinzi, d ndiyo nhamba yezvimiro uye k nhamba yemasumbu - kana saizi yekumanikidza kwakanangwa. Pamashandisirwo akadai, huwandu hwemapoinzi hunogona kuve nyore mazana emamiriyoni uye, sezvo kunaka kwekumanikidza kunowedzera ne k, zvibodzwa zvakajairika zvinogona kuve ne k muzviuru [41, 4]. Mune marongero akadai, chero O(ndk) algorithm inononoka.
Mienzaniso yakaita seiyi yakakurudzira kusimuka kwemaalgorithms makuru edata anopa zvese dzidziso uye inoshanda yekumhanyisa nguva yekuvandudza. Maonero ekunzwika kwedzidziso uye kushanda nesimba, zvisinei, kazhinji anopokana. Kune rimwe divi, vimbiso dzedzidziso dzinopa vimbiso yekuti algorithm ichashanda zvisinei nechero rombo rakaipa raanogamuchira. Kune rimwe divi, zvingave zvakaoma kuzvisimbisa kuti uite iyo theoretically optimal algorithm kana paine cruder nzira dzinokurumidza kumhanya uye kuita nemazvo mukuita.
Sezvo datasets inogona kuva yakakura muhuwandu hwemapoinzi n uye/kana nhamba yezvimiro d, hombe-data nzira dzinofanirwa kudzikisira mhedzisiro yezvose. Nezvenzvimbo yenzvimbo, mubvunzo unovharwa zvine mutsindo sezvo mafungidziro asina kurongeka ari kukurumidza (kumhanya nenguva yakatsetseka), inoshanda kuita [50], uye inopa vimbiso dzakasimba pakukura nemhando yekudzvanya. Maonero acho haanyatso kujeka kana uchidzikisa huwandu hwemapoinzi n, uye kune maviri akapatsanurwa paradigms iyo imwe neimwe inopa akasiyana mabhenefiti. Kune rimwe divi, isu tine sampling yakafanana, iyo inomhanya munguva yepasi-pasi asi inogona kupotsa yakakosha subsets yedata uye saka inogona kungovimbisa chokwadi pasi pemamwe fungidziro yakasimba pane data [45]. Nekune rimwe divi, nzira dzakanyanya dzesampling dzinopa vimbiso yakasimba yecoreset, umo mutengo wechero mhinduro pane yakamanikidzwa data iri mukati me-e-chinhu chemutengo wemhinduro pane yekutanga dataset [25].
Mipiro yedu Tinodzidza ese maparadigms (yunifomu sampling uye yakasimba coressets) zvine chekuita nedambudziko rekare - kudzvanya kwe k-nzira uye k-median zvinangwa. Ipo sampling yeyunifomu inopa kukurumidza kukurumidza asi pasina yakaipisisa-yechokwadi vimbiso, zvese zviripo coreset zvivakwa zvine nguva yekumhanya yeinenge Ω˜(nd + nk) kana ichiburitsa miganho yakasimba pahuwandu hushoma hwemasampuli anodiwa kudzvanya chaiko.
Zviri nyore kuratidza kuti chero algorithm inowana vimbiso yekumanikidza inofanira kuverenga yese dataset[1]. Saka mubvunzo wakajeka wakavhurika ndeupi vimbiso dzinowanikwa mumutsara kana mutsara-nguva. Chokwadi, parizvino iripo inokurumidza sampling algorithms yekubatanidza [6, 5] haigone kuwana yakasimba coreset garandi. Munguva pfupi yapfuura, [31] akakurudzira nzira yemacoreset akasimba anomhanya nenguva O˜(nd + nk) uye akafungidzira kuti izvi ndizvo zvakakwana zve k-median uye k-nzira.
Kunyange zvazvo izvi zvakasungwa zvakanyatsonaka kune zviduku zvishoma zve k, kune zvakawanda zvinoshandiswa zvakadai sekuona kwekombiyuta [34] kana algorithmic fairness [18] apo nhamba yezvikwata inogona kuva yakakura kudarika nhamba yezvimiro nemirairo yakawanda yehukuru. Muzvirongwa zvakadaro, mubvunzo wenguva-yakakwana coresset inoramba yakazaruka. Sezvo nyaya yekusarudza coreset yehukuru hwakakwana ichangobva kuvharwa [25, 28, 44], iyi ndiyo ine nharo huru yakavhurika dambudziko mukutsvagisa coreset yepakati-based clustering. Isu tinogadzirisa izvi nekuratidza kuti kune iri nyore-kuita-algorithm inogadzira macoreset muO˜(nd) nguva - chete logarithmic zvinhu kure nenguva yainotora kuverenga mudhatabheti.
Zvakangodaro, izvi hazvivhenekere zvizere nzvimbo pakati pemasampling algorithms ekubatanidza mukuita. Kunyangwe algorithm yedu ichiwana zvese zvakaringana nguva yekumhanya uye kudzvanya kwakaringana, zvinogoneka kuti dzimwe, cruder nzira dzingangoshanda kune zvese zvinoshanda. Isu tinotaura izvi zviri pamutemo mumubvunzo unotevera: Ndeipi nguva yakakwana k-nzira uye k-median coressets inodiwa uye chii chinoshanda tradeoff pakati pekumhanya kwecoreset uye chokwadi?
o pindura izvi, tinoita kuenzanisa kwakaringana mukati menguva yakazara yesampling algorithms inokurumidza kupfuura nzira yedu yatinoda. Kuburikidza neizvi tinoona kuti, nepo nzira dzinokurumidza idzi dzakakwana zvakakwana pamaseti mazhinji epasirese, kune kugoverwa kwedata kunokonzera kutadza kukuru kune yega yega. Zvechokwadi, nyaya idzi dzinogona kudzivirirwa chete neine yakasimba-coreset nzira. Nekudaro, nepo akawanda anoshanda marongero asingade yakazara coreset garandi, munhu haagone kucheka makona kana munhu achida kuva nechivimbo mukumanikidza kwavo. Isu tinoona kuti izvi zvinosvika kune yekutepfenyura paradigm uye inoshanda kune yakadzika nzira yekuunganidza.
Muchidimbu, mipiro yedu ndeiyi inotevera:
• Isu tinoratidza kuti munhu anogona kuwana akasimba macoreset e k-zvinoreva uye k-median muO˜(nd) nguva. Izvi zvinogadzirisa fungidziro pane inodiwa nguva yekumhanya ye k-zvinoreva coresset [31] uye ine theoretically yakakwana kusvika kune log-zvinhu.
• Kuburikidza nekuongorora kwakadzama padatasets, mabasa, uye kutepfenyura/kusina-kutepfenyura paradigms, tinoona kuti pane inodiwa tradeoff pakati pekumhanya nekururama pakati pemitsetse- uye sublinear-nguva-sampling nzira. Izvi zvinopa murapi wekubatanidza bhurandi rekuti riini rekushandisa imwe neimwe yekumanikidza algorithm kune inoshanda mhedzisiro yekuunganidza munguva inokurumidza inobvira.
Iri bepa rinowanikwa pa arxiv pasi peCC BY 4.0 DEED rezinesi.