Working on a data science project almost always means an impressive amount of clutter in the working directory. A data scientist will most likely have the following materials dumped in the project working directory:
Python/R scripts
Data sets
Reference materials (journal articles, slides, other documents)
Notebooks
Notes
Scala sources (if using Spark)
Cloned repositories of other projects relevant to the current work (usually a source of inspiration, methodology or case studies)
Other scripts for data transfer, data clean-up, or even a runner.sh to submit jobs on a cluster. I always have a runner.sh that contains the YARN settings for spark-submit.
Given a project, a data scientist typically follows these steps to tackle it:
Requirements gathering
ETL data from sources using Python, R or Scala
Data calibration
- perform descriptive statistics on the data to validate whether it reflects business facts. This takes some time, even in a collaborative environment where the business and the data scientists work closely together, and the calibration may need to be repeated to further verify those facts.
Data science and insights generation
- with the data validated and calibrated, the data scientist can start generating insights: producing notebooks, scripts or Scala jars. Notes, journal articles and other references add to the clutter in the working directory.
Visualization and report creation
- reports for the business are consolidated into a presentation from the outputs of various visualization tools (PNG files, Tableau workbooks).
PySpark or Spark job sources for operationalization
- if the study is to be operationalized, prototypes are built as a guide for the Data Engineers.
This is the directory hierarchy I have for every data science project:
ansible-playbooks: Ansible playbooks created to automate repetitive tasks
data: all data sets (toy, final, intermediate aggregates, etc.). I usually have two subdirectories: (1) data sets generated in the cluster (we're running on a Spark environment) and (2) data sets generated locally
Notebooks: with subdirectories for notebooks running on the cluster and locally
References: PDFs, journal articles, references
repo: for all Python, Scala and R scripts, organized as repo/src/python/main/R, repo/src/python/lib (for various utilities), repo/src/main (for Scala code). repo is organized like this to allow easy compilation of the Scala code using a Maven build.
Reports: all reports go here
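To avoid recreating this layout by hand for every project, it can be scripted. The following is only a minimal sketch: the function name `scaffold` is hypothetical, and the `cluster` and `local` subdirectory names are assumptions, since the hierarchy above only says that data and Notebooks are split by where things run.

```python
from pathlib import Path

# Directory names follow the hierarchy listed above. The nested
# "cluster" and "local" names are assumptions made for this sketch.
LAYOUT = [
    "ansible-playbooks",
    "data/cluster",
    "data/local",
    "Notebooks/cluster",
    "Notebooks/local",
    "References",
    "repo/src/python/main/R",
    "repo/src/python/lib",
    "repo/src/main",
    "Reports",
]


def scaffold(project_root):
    """Ensure every directory in LAYOUT exists under project_root.

    Returns the list of directory paths, whether they were newly
    created or already present.
    """
    root = Path(project_root)
    ensured = []
    for relative in LAYOUT:
        target = root / relative
        # exist_ok=True makes the function safe to re-run on an existing project
        target.mkdir(parents=True, exist_ok=True)
        ensured.append(target)
    return ensured
```

Calling `scaffold("my-project")` creates the skeleton under `my-project`. Running it again on the same directory leaves existing files untouched.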
I use git to manage versions and changes. A .gitignore file that ignores everything except the main directories above prevents files not intended for commit from being accidentally pushed to the remote repo.
Here’s my .gitignore file:
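# ignore everything at the top level by default
/*
# ignore OS, notebook and log debris anywhere in the tree
**/.DS_Store
**/.ipynb_checkpoints
**/*.log
# ignore the Python utilities directory
repo/src/python/lib/
# re-include the entries that should be tracked
!/resources
!/notebooks
!/repo
!/ansible
!/data
!/.gitignore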
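Because this file works as an allowlist, a top-level directory without a matching `!/name` line is silently left out of every commit. The sketch below lists such entries. It is a simplified reading of the file, not a reimplementation of git's matching rules: it assumes the `/*` rule is present, that re-included names contain no wildcards, and that names are compared case-sensitively. The function names are hypothetical. For an authoritative answer on a single path, `git check-ignore -v <path>` reports which rule applies.

```python
from pathlib import Path


def allowlisted_names(gitignore_text):
    """Return the top-level names re-included by lines of the form '!/name'."""
    names = set()
    for line in gitignore_text.splitlines():
        line = line.strip()
        if line.startswith("!/"):
            # drop the leading "!/" and any trailing slash
            names.add(line[2:].rstrip("/"))
    return names


def untracked_top_level(project_root):
    """List top-level entries that the allowlist does not re-include."""
    root = Path(project_root)
    allowed = allowlisted_names((root / ".gitignore").read_text())
    return sorted(
        entry.name
        for entry in root.iterdir()
        # .git is git's own metadata directory, never a candidate for tracking
        if entry.name not in allowed and entry.name != ".git"
    )
```

Running `untracked_top_level(".")` from the project root returns the names that would need their own `!/name` line before git picks them up.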
How are you de-cluttering your working directory? Get the workspace template here. Feel free to comment and improve.