Data Analysis with Docker

Data analysis is not only about reports and visualization. Correctness and reproducibility are just as important for scientific research, and a consistent environment is critical for reproducibility. There are several ways to achieve that, but I find that with Docker I can repeat an experiment in the same environment at any time. It is also easy to scale up and to scale horizontally.

My Docker image is based on Ubuntu. It includes common data science tools such as Jupyter Notebook with Python 3 and R kernels. With the help of R Magic, I can run both Python and R in the same .ipynb file. To learn more about R Magic, you can click here. I also installed Nbextensions for Jupyter Notebook. For more information, you can click here.

You can find my Dockerfile in my GitHub Repository.

HOW TO USE MY DOCKERFILE

Install Docker

The Docker community has a detailed tutorial on how to install Docker. Please check here.

Build

In the terminal, navigate to the folder that contains the Dockerfile and run the following command:

docker build -t data-analyst-notebook .

Don’t forget the “.” at the end; it tells Docker to use the current folder as the build context. data-analyst-notebook is the name of the image, and you can change it to whatever you prefer.

Start server

I use the following command to start the server:

docker run --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v ~/:/home/jovyan/work data-analyst-notebook

More detailed instructions are available in the User Guide on ReadTheDocs.
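To make the flags in the command above easier to read, here is a minimal sketch that assembles the same command one piece at a time. The function name start_notebook and the variables HOST_DIR, HOST_PORT, and DRY_RUN are hypothetical names I made up for illustration; they are not part of Docker or of the image. The note on JUPYTER_ENABLE_LAB assumes the image follows the Jupyter Docker Stacks conventions.

```shell
# Hypothetical knobs for illustration: which host folder and host port to use.
HOST_DIR="${HOST_DIR:-$HOME}"
HOST_PORT="${HOST_PORT:-8888}"

start_notebook() {
  # Build the argument list step by step so every flag gets its own comment.
  set -- docker run
  set -- "$@" --rm                                 # delete the container once it stops
  set -- "$@" -p "${HOST_PORT}:8888"               # host port -> port 8888 inside the container
  set -- "$@" -e JUPYTER_ENABLE_LAB=yes            # env var; assumed to select JupyterLab
  set -- "$@" -v "${HOST_DIR}:/home/jovyan/work"   # host folder mounted at /home/jovyan/work
  set -- "$@" data-analyst-notebook                # image name chosen in the build step

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"   # only print the assembled command
  else
    "$@"        # actually start the container
  fi
}

# Print the command without starting anything.
DRY_RUN=1 start_notebook
```

Calling start_notebook without DRY_RUN=1 runs the container. Pointing HOST_DIR at a single project folder instead of the whole home directory keeps the rest of your files out of the container.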

If you feel that the command is too long to type every time, you can add an alias to your .bashrc file like this:

alias dslab='docker run -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v ~/:/home/jovyan/work data-analyst-notebook'

Now you can type dslab in a new terminal instead of the long command.
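If you would rather not edit .bashrc by hand, the alias can also be appended from the terminal. The sketch below is one possible way to do it; add_dslab_alias is a hypothetical helper name, and it assumes your shell reads ~/.bashrc at startup. It only adds the line when it is not already present, so running it twice does not create duplicates.

```shell
# The alias line from above, stored as a string so it can be written to a file.
ALIAS_LINE="alias dslab='docker run -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v ~/:/home/jovyan/work data-analyst-notebook'"

# Hypothetical helper: append the alias to a startup file unless it is already there.
# The first argument is the file to edit; it defaults to ~/.bashrc.
add_dslab_alias() {
  rc_file="${1:-$HOME/.bashrc}"
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
  if ! grep -qxF "$ALIAS_LINE" "$rc_file" 2>/dev/null; then
    printf '%s\n' "$ALIAS_LINE" >> "$rc_file"
  fi
}
```

Run add_dslab_alias once, then run source ~/.bashrc (or open a new terminal) so the current shell picks up the alias.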