Docker is hot in the developer world, and although data scientists aren’t strictly software developers, Docker has some very useful features for everything from data exploration and modeling to deployment. And since major services like AWS support Docker containers, it’s even easier to implement Continuous Integration/Continuous Delivery (CI/CD) with Docker. In this post, I’ll show you how to use Docker as a data scientist.
What is Docker?
It’s a software container platform that provides isolated containers in which we can package everything our experiments need to run. Essentially, it’s a lightweight VM that’s built from a script, and that script can be version controlled; so we can now version control our data science environment! Developers use Docker when collaborating on code with coworkers, and they also use it to build agile software delivery pipelines to ship new features faster. Any of this sound familiar?
I have a mathematics background, so I can’t avoid definitions.
- Containers: lightweight, isolated user-space environments in which you build, install, and run your code
- Images: read-only snapshots of a container; containers are launched from images
- Dockerfile: a plain-text file of build instructions that’s used to build your image; this is what we can version control
- Docker Hub: GitHub for your Docker images; you can set up Docker Hub to automatically build an image any time you update your Dockerfile in GitHub
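To make these concrete, here’s a minimal, hypothetical Dockerfile (the base image and package are just stand-ins I picked for illustration). Building it produces an image; running that image gives you a container:

```dockerfile
# Start from an official base image (hypothetical choice)
FROM python:3.10-slim

# Bake a dependency into the image so every container has it
RUN pip install --no-cache-dir numpy

# Default command each container runs on startup
CMD ["python", "-c", "import numpy; print(numpy.__version__)"]
```

The whole file can live in GitHub next to your code, which is what makes the environment version-controllable.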
Why Docker is So Awesome for Data Science
Ever heard these comments from your coworkers?
- “Not sure why it’s not working on your computer, it’s working on mine.”
- “It’s a pain to install everything from scratch for Linux, Windows, and MacOS, and trying to build the same environment for each OS.”
- “Can’t install the package that you used, can you help me out?”
- “I need more compute power. I could use AWS but it’ll take so long just to install all those packages and configure settings just like I have it on my machine.”
For the most part, these concerns are easily resolved by Docker. The exception at the time of writing is GPU support, which is only available for Docker containers running on Linux machines. Other than that, you’re golden.
Docker for Python and Jupyter Notebook
Take a look at this Dockerfile.
```dockerfile
# reference: https://hub.docker.com/_/ubuntu/
FROM ubuntu:16.04

# Adds metadata to the image as a key value pair example LABEL version="1.0"
LABEL maintainer="Bobby Lindsey <email@example.com>"

# Set environment variables
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8

# Create empty directory to attach volume
RUN mkdir ~/GitProjects

# Install Ubuntu packages
RUN apt-get update && apt-get install -y \
    wget \
    bzip2 \
    ca-certificates \
    build-essential \
    curl \
    git-core \
    htop \
    pkg-config \
    unzip \
    unrar \
    tree \
    freetds-dev

# Clean up
RUN apt-get clean && rm -rf /var/lib/apt/lists/*

# Install Jupyter config
RUN mkdir ~/.ssh && touch ~/.ssh/known_hosts
RUN ssh-keygen -F github.com || ssh-keyscan github.com >> ~/.ssh/known_hosts
RUN git clone https://github.com/bobbywlindsey/dotfiles.git
RUN mkdir ~/.jupyter
RUN cp /dotfiles/jupyter_configs/jupyter_notebook_config.py ~/.jupyter/
RUN rm -rf /dotfiles

# Install Anaconda
RUN echo 'export PATH=/opt/conda/bin:$PATH' > /etc/profile.d/conda.sh
RUN wget --quiet https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh -O ~/anaconda.sh
RUN /bin/bash ~/anaconda.sh -b -p /opt/conda
RUN rm ~/anaconda.sh

# Set path to conda
ENV PATH /opt/conda/bin:$PATH

# Update Anaconda
RUN conda update conda && conda update anaconda && conda update --all

# Install Jupyter theme
RUN pip install msgpack jupyterthemes
RUN jt -t grade3

# Install other Python packages
RUN conda install pymssql
RUN pip install SQLAlchemy \
    missingno \
    json_tricks \
    bcolz \
    gensim \
    elasticsearch \
    psycopg2-binary

# Configure access to Jupyter
WORKDIR /root/GitProjects
EXPOSE 8888
CMD jupyter lab --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token='data-science'
```
If you’ve ever installed packages in Ubuntu, this should look very familiar. In short, this Dockerfile is a script that automatically builds and sets up a lightweight version of Ubuntu with all the Ubuntu packages and Python libraries needed to do [my] data science exploration with Jupyter Notebooks. The best part is that it runs the same way whether I’m on MacOS, Linux, or Windows: no need to write separate install scripts and wrangle third-party tools to get the same environment on each operating system.
To build a Docker image from this Dockerfile, all you need to do is execute
```shell
docker build -t bobbywlindsey/docker-data-science .
```
in the command line and Bob’s your uncle. To run the image, you have two options: you can either run it interactively (which means you’ll see the output of your Jupyter Notebook server in real time) or in detached mode (where you can drop into the image’s terminal and play around).
To run the image interactively on Linux, execute

```shell
docker run -it -v ~/GitProjects:/root/GitProjects --network=host bobbywlindsey/docker-data-science
```

or for MacOS and Windows:

```shell
docker run -it -v ~/GitProjects:/root/GitProjects -p 8888:8888 bobbywlindsey/docker-data-science
```
To run the image in detached mode on Linux:

```shell
docker run -d --name data-science -v ~/GitProjects:/root/GitProjects --network=host bobbywlindsey/docker-data-science
docker exec -it data-science bash
```
or for MacOS and Windows:
```shell
docker run -d --name data-science -v ~/GitProjects:/root/GitProjects -p 8888:8888 bobbywlindsey/docker-data-science
docker exec -it data-science bash
```
Not too bad! I realize those run commands might be a bit much to type, so I see a couple of options: you can either alias those commands or use a docker-compose file instead.
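The aliasing route might look something like this sketch in your `~/.bashrc` or `~/.zshrc` (the alias names `ds-run` and `ds-shell` are my own choices, not anything Docker-specific):

```shell
# Hypothetical shorthand for the long run/exec commands above;
# ds-run starts the Jupyter container, ds-shell drops into a
# detached container named data-science
alias ds-run='docker run -it -v ~/GitProjects:/root/GitProjects -p 8888:8888 bobbywlindsey/docker-data-science'
alias ds-shell='docker exec -it data-science bash'
```

After reloading your shell config, `ds-run` launches the notebook server with all the volume and port flags filled in.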
Using Multiple Containers
For example, here’s a docker-compose file I use to run my Jekyll site:

```yaml
version: '3'
services:
  site:
    environment:
      - JEKYLL_ENV=docker
    image: bobbywlindsey/docker-jekyll
    volumes:
      - ~/Dropbox/me/career/website-and-blog/bobbywlindsey:/root/bobbywlindsey
    ports:
      - 4000:4000
      - 35729:35729
```
With that file, your run command then becomes:
```shell
docker-compose run --service-ports site
```
But Docker Compose is capable of much more than substituting for aliased run commands. A docker-compose file can configure multiple images, and with a single command you can create and start all your services at once. For example, say you build one Docker image to preprocess your data, another to model the data, and another to deploy your model as an API. You can use docker-compose to manage each image’s configuration and run them all with a single command.
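A sketch of what that preprocess/model/API pipeline might look like — the service names, image names, ports, and paths here are hypothetical stand-ins for your own:

```yaml
version: '3'
services:
  preprocess:
    image: yourname/preprocess   # hypothetical image that cleans raw data
    volumes:
      - ./data:/data
  model:
    image: yourname/model        # hypothetical image that trains on /data
    volumes:
      - ./data:/data
    depends_on:
      - preprocess
  api:
    image: yourname/model-api    # hypothetical image serving predictions
    ports:
      - 5000:5000
    depends_on:
      - model
```

With a file like this, `docker-compose up` creates and starts all three services at once, and `depends_on` controls their startup order.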
Even though Docker might involve a learning curve for some data scientists, I believe it’s well worth the effort, and it doesn’t hurt to brush up on those DevOps skills. Have you used Docker for your data science efforts?