Docker is hot in the developer world, and although data scientists aren’t strictly software developers, Docker has some very useful features for everything from data exploration and modeling to deployment. And since major services like AWS support Docker containers, it’s even easier to implement Continuous Integration/Continuous Delivery (CI/CD) with Docker. In this post, I’ll show you how to use Docker as a data scientist.
What is Docker?
It’s a software container platform that provides isolated containers in which we can package everything our experiments need to run. Essentially, it’s a lightweight VM that’s built from a script, and that script can be version controlled; so we can now version control our data science environment! Developers use Docker when collaborating on code with coworkers, and they also use it to build agile software delivery pipelines to ship new features faster. Any of this sound familiar?
I have a mathematics background, so I can’t avoid definitions.
- Containers: lightweight, isolated user-space environments in which you build, install, and run your code
- Images: read-only snapshots of a container; containers are launched from images
- Dockerfile: a plain-text file of build instructions that’s used to build your image; this is what we can version control
- Docker Hub: GitHub for your Docker images; you can set up Docker Hub to automatically build an image any time you update your Dockerfile in GitHub
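To make these concrete, here’s a minimal, hypothetical Dockerfile (the base image and package are just stand-ins I picked for illustration). Building it produces an image; running that image gives you a container:

```dockerfile
# Start from an official base image (hypothetical choice)
FROM python:3.10-slim

# Bake a dependency into the image so every container has it
RUN pip install --no-cache-dir numpy

# Default command each container runs on startup
CMD ["python", "-c", "import numpy; print(numpy.__version__)"]
```

The whole file can live in GitHub next to your code, which is what makes the environment version-controllable.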
Why Docker is So Awesome for Data Science
Ever heard these comments from your coworkers?
- “Not sure why it’s not working on your computer, it’s working on mine.”
- “It’s a pain to install everything from scratch for Linux, Windows, and MacOS, and trying to build the same environment for each OS.”
- “Can’t install the package that you used, can you help me out?”
- “I need more compute power. I could use AWS but it’ll take so long just to install all those packages and configure settings just like I have it on my machine.”
For the most part, these concerns are easily resolved by Docker. The exception at the time of writing is GPU support, which is only available for Docker containers running on Linux machines. Other than that, you’re golden.
Docker for Python and Jupyter Notebook
Take a look at this Dockerfile.
```dockerfile
# reference: https://hub.docker.com/_/ubuntu/
FROM ubuntu:16.04

# Adds metadata to the image as a key value pair example LABEL version="1.0"
LABEL maintainer="Bobby Lindsey <email@example.com>"

# Set environment variables
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8

# Create empty directory to attach volume
RUN mkdir ~/GitProjects

# Install Ubuntu packages
RUN apt-get update && apt-get install -y \
    wget \
    bzip2 \
    ca-certificates \
    build-essential \
    curl \
    git-core \
    htop \
    pkg-config \
    unzip \
    unrar \
    tree \
    freetds-dev

# Clean up
RUN apt-get clean && rm -rf /var/lib/apt/lists/*

# Install Jupyter config
RUN mkdir ~/.ssh && touch ~/.ssh/known_hosts
RUN ssh-keygen -F github.com || ssh-keyscan github.com >> ~/.ssh/known_hosts
RUN git clone https://github.com/bobbywlindsey/dotfiles.git
RUN mkdir ~/.jupyter
RUN cp /dotfiles/jupyter_configs/jupyter_notebook_config.py ~/.jupyter/
RUN rm -rf /dotfiles

# Install Anaconda
RUN echo 'export PATH=/opt/conda/bin:$PATH' > /etc/profile.d/conda.sh
RUN wget --quiet https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh -O ~/anaconda.sh
RUN /bin/bash ~/anaconda.sh -b -p /opt/conda
RUN rm ~/anaconda.sh

# Set path to conda
ENV PATH /opt/conda/bin:$PATH

# Update Anaconda
RUN conda update conda && conda update anaconda && conda update --all

# Install Jupyter theme
RUN pip install msgpack jupyterthemes
RUN jt -t grade3

# Install other Python packages
RUN conda install pymssql
RUN pip install SQLAlchemy \
    missingno \
    json_tricks \
    bcolz \
    gensim \
    elasticsearch \
    psycopg2-binary

# Configure access to Jupyter
WORKDIR /root/GitProjects
EXPOSE 8888
CMD jupyter lab --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token='data-science'
```
If you’ve ever installed packages in Ubuntu, this should look very familiar. In short, this Dockerfile is a script that automatically builds and sets up a lightweight version of Ubuntu with all the Ubuntu packages and Python libraries needed to do [my] data science exploration with Jupyter Notebooks. The best part is that it runs the same way whether I’m on MacOS, Linux, or Windows: no need to write separate install scripts and wrangle third-party tools to get the same environment on each operating system.
To build a Docker image from this Dockerfile, all you need to do is execute
```shell
docker build -t bobbywlindsey/docker-data-science .
```
in the command line and Bob’s your uncle. To run the image, you have two options: you can either run it interactively (which means you’ll see the output of your Jupyter Notebook server in real time) or in detached mode (where you can drop into the image’s terminal and play around).
To run the image interactively on Linux, execute

```shell
docker run -it -v ~/GitProjects:/root/GitProjects --network=host bobbywlindsey/docker-data-science
```

or for MacOS and Windows:

```shell
docker run -it -v ~/GitProjects:/root/GitProjects -p 8888:8888 bobbywlindsey/docker-data-science
```
To run the image in detached mode on Linux:

```shell
docker run -d --name data-science -v ~/GitProjects:/root/GitProjects --network=host bobbywlindsey/docker-data-science
docker exec -it data-science bash
```
or for MacOS and Windows:
```shell
docker run -d --name data-science -v ~/GitProjects:/root/GitProjects -p 8888:8888 bobbywlindsey/docker-data-science
docker exec -it data-science bash
```
Not too bad! I realize those run commands might be a bit much to type, so I see a couple of options: you can either alias those commands or use a docker-compose file instead.
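The aliasing route might look something like this sketch in your `~/.bashrc` or `~/.zshrc` (the alias names `ds-run` and `ds-shell` are my own choices, not anything Docker-specific):

```shell
# Hypothetical shorthand for the long run/exec commands above;
# ds-run starts the Jupyter container, ds-shell drops into a
# detached container named data-science
alias ds-run='docker run -it -v ~/GitProjects:/root/GitProjects -p 8888:8888 bobbywlindsey/docker-data-science'
alias ds-shell='docker exec -it data-science bash'
```

After reloading your shell config, `ds-run` launches the notebook server with all the volume and port flags filled in.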
Using Multiple Containers
For example, here’s a docker-compose file I use to run my Jekyll site:

```yaml
version: '3'
services:
  site:
    environment:
      - JEKYLL_ENV=docker
    image: bobbywlindsey/docker-jekyll
    volumes:
      - ~/Dropbox/me/career/website-and-blog/bobbywlindsey:/root/bobbywlindsey
    ports:
      - 4000:4000
      - 35729:35729
```
With that file, your run command then becomes:
```shell
docker-compose run --service-ports site
```
But Docker Compose is capable of much more than substituting for aliased run commands. A docker-compose file can configure multiple images, and with a single command you can create and start all your services at once. For example, say you build one Docker image to preprocess your data, another to model the data, and another to deploy your model as an API. You can use docker-compose to manage each image’s configuration and run them all with a single command.
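A sketch of what that preprocess/model/API pipeline might look like — the service names, image names, ports, and paths here are hypothetical stand-ins for your own:

```yaml
version: '3'
services:
  preprocess:
    image: yourname/preprocess   # hypothetical image that cleans raw data
    volumes:
      - ./data:/data
  model:
    image: yourname/model        # hypothetical image that trains on /data
    volumes:
      - ./data:/data
    depends_on:
      - preprocess
  api:
    image: yourname/model-api    # hypothetical image serving predictions
    ports:
      - 5000:5000
    depends_on:
      - model
```

With a file like this, `docker-compose up` creates and starts all three services at once, and `depends_on` controls their startup order.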
Even though Docker might involve a learning curve for some data scientists, I believe it’s well worth the effort, and it doesn’t hurt to brush up on those DevOps skills. Have you used Docker for your data science efforts?