The Why and How of Dataset Creation

Written by textmodels | Published 2024/06/10

TL;DR: This section prompts dataset creators to clarify the purpose, team, funding sources, and any additional comments regarding the creation of the dataset, promoting transparency and understanding of its origins.

Authors:

(1) TIMNIT GEBRU, Black in AI;

(2) JAMIE MORGENSTERN, University of Washington;

(3) BRIANA VECCHIONE, Cornell University;

(4) JENNIFER WORTMAN VAUGHAN, Microsoft Research;

(5) HANNA WALLACH, Microsoft Research;

(6) HAL DAUMÉ III, Microsoft Research; University of Maryland;

(7) KATE CRAWFORD, Microsoft Research.

Table of Links

1 Introduction

1.1 Objectives

2 Development Process

3 Questions and Workflow

3.1 Motivation

3.2 Composition

3.3 Collection Process

3.4 Preprocessing/cleaning/labeling

3.5 Uses

3.6 Distribution

3.7 Maintenance

4 Impact and Challenges

Acknowledgments and References

Appendix

3.1 Motivation

The questions in this section are primarily intended to encourage dataset creators to clearly articulate their reasons for creating the dataset and to promote transparency about funding interests. The latter may be particularly relevant for datasets created for research purposes.

• For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

• Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

• Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.

• Any other comments?
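
As a rough illustration, the four questions above could be captured as a small record so that missing answers are easy to spot. This is a minimal sketch in Python, not part of the datasheet proposal itself: the names MotivationAnswers, QUESTIONS, unanswered, and to_text are hypothetical, the question text is shortened, and treating "Any other comments?" as optional is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Question text shortened from the Motivation section; the keys are illustrative names.
QUESTIONS = {
    "purpose": "For what purpose was the dataset created?",
    "creators": "Who created the dataset, and on behalf of which entity?",
    "funding": "Who funded the creation of the dataset?",
    "other_comments": "Any other comments?",
}


@dataclass
class MotivationAnswers:
    """Hypothetical container for a dataset creator's Motivation answers."""

    purpose: str = ""
    creators: str = ""
    funding: str = ""
    other_comments: str = ""  # assumed optional, so it may stay empty

    def unanswered(self) -> List[str]:
        """Return the names of the assumed-required questions left blank."""
        required = ("purpose", "creators", "funding")
        return [name for name in required if not getattr(self, name).strip()]

    def to_text(self) -> str:
        """Render each question with its answer as plain text."""
        lines = []
        for name, question in QUESTIONS.items():
            answer = getattr(self, name).strip() or "(not answered)"
            lines.append(f"Q: {question}")
            lines.append(f"A: {answer}")
        return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder answers, invented purely for illustration.
    draft = MotivationAnswers(purpose="Placeholder purpose", creators="Placeholder team")
    print(draft.unanswered())
    print(draft.to_text())
```

Keeping the answers as free text mirrors the open-ended wording of the questions; the check only flags blanks and does not judge the quality of an answer.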

This paper is available on arXiv under a CC 4.0 license.


Published by HackerNoon on 2024/06/10