Leveraging Natural Supervision for Language Representation Learning and Generation: Introduction

Written by escholar | Published 2024/06/01
Tech Story Tags: llm-natural-supervision | llm-self-supervision | llm-language-pretraining | llm-word-prediction | ai-language-modeling | ai-vector-representations | ai-neural-models | ai-sentence-representations

TL;DR: In this study, researchers describe three lines of work that seek to improve the training and evaluation of neural models using naturally-occurring supervision.

Author:

(1) Mingda Chen.

Table of Links

CHAPTER 1 - INTRODUCTION

Written language is ubiquitous. Humans use writing to communicate ideas and store knowledge in their daily lives. These activities naturally produce traces of human intelligence, resulting in abundant, freely available textual data: e.g., Wikipedia,[1] Reddit,[2] and Fandom,[3] among others. These data often contain sophisticated knowledge expressed with complex language structures. For example, encyclopedias usually have dedicated structures that connect pieces of information scattered across different places for the convenience of readers (e.g., hyperlinks that point mentions of the same person or event in different documents to a single page for disambiguation). Aside from such explicit structures, corpora also have rich implicit structures. For example, the two sides of a sentence pair in bilingual text share the same meaning but differ in syntactic form; this implicit contrast allows us to disentangle the semantic and syntactic information carried by the data.
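To make the hyperlink example concrete, the sketch below (our illustration, not code from the thesis) shows how the `[[Target page|anchor text]]` links in raw wikitext already pair surface mentions with canonical page titles, yielding disambiguation signals without any manual labeling. The function name and regular expression are ours; only the standard wikitext link syntax is assumed.

```python
import re

# Wikitext links look like [[Barack Obama|Obama]] or [[Honolulu]]:
# group 1 is the canonical page title, group 2 (optional) is the surface mention.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def extract_links(wikitext):
    """Return (canonical_title, surface_mention) pairs found in raw wikitext."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        mention = (match.group(2) or title).strip()
        pairs.append((title, mention))
    return pairs

print(extract_links("[[Barack Obama|Obama]] was born in [[Honolulu]]."))
# [('Barack Obama', 'Obama'), ('Honolulu', 'Honolulu')]
```

Because every mention of the same entity links to the same page title, such pairs can serve as free supervision for tasks like entity disambiguation, which is exactly the kind of naturally-occurring structure this line of work exploits.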

Despite these rich structures, recent advances in NLP have been driven by deep neural models trained on massive amounts of plain text, which often strips away the knowledge and structure in the input. This thesis develops approaches to better derive supervision from various naturally-occurring textual resources. In particular, we (1) improve ways of transforming plain text into training signals; (2) propose approaches that exploit the rich structures in Wikipedia and paraphrases; and (3) create evaluation benchmarks from fan-contributed websites to reflect real-world challenges. Below we briefly introduce these three areas and summarize our contributions.
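As a concrete illustration of direction (1), transforming plain text into training signals, the minimal sketch below (our example, not the thesis's exact recipe) turns an ordinary sentence into masked-word-prediction examples: each word in turn is hidden, and the visible context becomes the input from which a model must recover it, so no human labels are needed.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Yield (masked sentence, target word) pairs, one per token position."""
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        yield " ".join(masked), target

for masked, target in make_masked_examples("Humans use writing to store knowledge"):
    print(masked, "->", target)
# [MASK] use writing to store knowledge -> Humans
# Humans [MASK] writing to store knowledge -> use
# ... (one example per word in the sentence)
```

Real pretraining pipelines tokenize into subwords and mask only a fraction of positions, but the underlying idea is the same: the text supplies both the input and the target.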

This paper is available on arXiv under a CC 4.0 license.


[1] https://www.wikipedia.org/, an online collaborative encyclopedia.

[2] https://www.reddit.com/, an online forum for discussion and web content rating.

[3] https://www.fandom.com/, a fan-contributed encyclopedia of movies and other media.

