Docker for reproducible pipelines

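Engineers from the US Department of Energy explain what Docker is and how to use it to build a set of application components that is easy to reproduce. The slides include diagrams and complete code, which makes them a good hands-on reference for getting started with Docker.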
1. Docker for reproducible pipelines
Daniel Udwary, Tony Wildish
NERSC Data Science Engagement Group

2. This tutorial
- What is Docker? Why use it?
- Docker concepts, components
- How to build a simple docker container
- Running docker containers
- How to get data into/out of a docker container
- Shifter: docker on Cori, Edison, and (eventually) Genepool
- How to find containers built by other people
- Dockerizing a pipeline
- Best practices
https://tinyurl.com/jgicontain

3. What is Docker?
- Docker is a ‘container technology’
  - Linux-specific: can’t run Mac OSX, Windows in docker containers
  - But can run docker containers on Mac OSX & Windows
  - Shrink-wrap your software, run it on any Linux platform
- Not a virtual machine
  - Similar to virtual machines, but more lightweight
  - Smaller, faster to start, easier to maintain and manage
  - Lighter on system resources => vastly more scalable
  - VM-thinking will lead to poor results, avoid it!

4. Why use it?
- Portability:
  - No need to rebuild your application for a new platform!
  - Build a container once, run it anywhere: Cori/Edison/Genepool/…, AWS/GCP/…
  - Stable s/w versions across all platforms, no runtime glitches
- Think of it as ‘modules-to-go’
  - Instead of ‘module load PQR’ you ‘docker pull PQR’
  - No waiting for modules to be built/deployed for you!
- Reproducibility:
  - Because your s/w is stable, your pipeline is reproducible
  - Run the exact same binaries again 10 years from now

5. What can you do with it?
- Computational workloads
  - Use applications without having to install them
  - Run your applications anywhere: clouds, NERSC, other HPC centres
  - Reproducible pipelines: today’s focus
- Services
  - Web portals/gateways (R/Shiny, Apache, Jupyter …)
  - Persistent workflow manager interfaces (Fireworks …)
  - Continuous build systems (Gitlab …)
  - For prototyping or for production running (databases etc)
  - All those things you run in the background on the login nodes today!
- NERSC internal cloud project, ‘SPIN’
  - Host docker-based services for any/all NERSC users
  - Now available for beta-testing, looking for interested users
  - Please contact us, tell us about your use-case

6. Docker components
- The ‘docker’ command-line tool
  - A bit of a kitchen-sink, your one-stop shop for everything docker
- The docker-daemon
  - Works behind the scenes to carry out actions
  - Manages container images, processes
  - Builds containers when requested
  - Runs as root, not a user-space daemon
- Docker.com
  - All things docker: installation, documentation, tutorials
- Dockerhub.com
  - Repository of docker containers. Many other repositories exist

7. Docker concepts
- Image
  - A shrink-wrapped chunk of s/w + its execution environment
- Image tags
  - Identify different versions of an image
  - A namespace for separating your images from other people’s
- Image registry
  - A place for sharing images with a wider community
  - Dockerhub.com, plus some domain-specific registries
- Container
  - A process instantiated from an image
- Dockerfile
  - A recipe for building an image: download, compile, configure …
  - Can share either the Dockerfile, or the image, or both

8. Docker images: layers & caching
- Images use the ‘overlay filesystem’ concept
  - Image is built by adding layers to a base
  - Each command in the Dockerfile adds a new layer
  - Each layer is cached independently
  - Layers can be shared between multiple images
  - Change in one layer invalidates all following layers
    - Forces rebuild (similar to ‘make’ dependencies …)
- Performance considerations
  - Too many layers can impede performance
  - Too few can cause excessive rebuilding
  - Building production-quality images takes care, practice
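The rule that a change in one layer invalidates every layer after it can be modelled without running docker at all. The sketch below is only an approximation of that rule: `first_rebuilt_step` is a hypothetical helper (not a docker command) that compares two versions of a Dockerfile line by line and reports the first instruction that differs. It treats each non-comment line as one instruction, so it does not join `\` continuations, and it ignores the fact that ADD/COPY layers are also invalidated when the copied files change. The pinned bottle version is likewise just an illustration.

```shell
# Hypothetical helper: print "<step number>: <instruction>" for the first
# instruction that differs between an old and a new Dockerfile.
# Under the caching rule, that step and all later steps would be rebuilt.
first_rebuilt_step() {
    awk '
        /^[ \t]*(#|$)/      { next }                  # skip comments and blank lines
        FILENAME == ARGV[1] { old[++n] = $0; next }   # remember the old instructions
                            { m++                     # walk the new instructions
                              if ($0 != old[m]) { print m ": " $0; exit } }
    ' "$1" "$2"
}

# Demo: two versions of a small Dockerfile that differ only in the third step.
workdir=$(mktemp -d)
cat > "$workdir/Dockerfile.old" <<'EOF'
FROM alpine:3.5
RUN apk add --no-cache python py2-pip
RUN pip install bottle
ADD hello.py /tmp/
EOF
cat > "$workdir/Dockerfile.new" <<'EOF'
FROM alpine:3.5
RUN apk add --no-cache python py2-pip
RUN pip install bottle==0.12.13
ADD hello.py /tmp/
EOF

first_rebuilt_step "$workdir/Dockerfile.old" "$workdir/Dockerfile.new"
# prints: 3: RUN pip install bottle==0.12.13
```

In this model steps 1 and 2 would come from the cache, while steps 3 and 4 would be rebuilt, which is why stable commands belong near the top of a Dockerfile.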

9. Building a container: the Dockerfile
- A recipe for building a container
- Start with a base image, add software layer by layer
  - Choosing the base image has a big effect on how large your container will be: go small (‘alpine’ or ‘busybox’)!
- Add metadata describing the container
  - Always a good idea
- Set the command to run when starting the container, map network ports, set environment variables
  - Not strictly needed for batch applications, useful for services (web apps, databases …)

10.
    FROM debian:jessie
    # LABEL lets you specify metadata, visible with docker inspect
    LABEL Maintainer="Tony Wildish, wildish@lbl.gov" Version=1.0
    # I can set environment variables
    ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    # Commands to prepare the container
    ENV DEBIAN_FRONTEND=noninteractive
    RUN apt-get update -y
    RUN apt-get install --assume-yes apt-utils
    RUN apt-get install -y python
    RUN apt-get install -y python-pip
    RUN apt-get clean all
    RUN pip install bottle
    # Add local files
    ADD hello.py /tmp/
    # open a port
    EXPOSE 5000
    # specify the default command to run
    CMD ["python", "/tmp/hello.py"]

11.
    FROM debian:jessie
    # LABEL lets you specify metadata, visible with docker inspect
    LABEL Maintainer="Tony Wildish, wildish@lbl.gov" Version=1.0
    # I can set environment variables
    ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    # Commands to prepare the container
    ENV DEBIAN_FRONTEND=noninteractive
    RUN apt-get update -y
    RUN apt-get upgrade -y
    RUN apt-get install --assume-yes apt-utils
    RUN apt-get install -y python
    RUN apt-get install -y python-pip
    RUN apt-get clean all
    RUN pip install bottle
    # Add local files
    ADD hello.py /tmp/
    # open a port
    EXPOSE 5000
    # specify the default command to run
    CMD ["python", "/tmp/hello.py"]
Slide annotations:
- Name+version
- Contact info
- Heavy lifting, install base tools before our code

12.
    FROM debian:jessie
    # LABEL lets you specify metadata, visible with docker inspect
    LABEL Maintainer="Tony Wildish, wildish@lbl.gov" Version=1.0
    # I can set environment variables
    ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    # Commands to prepare the container
    ENV DEBIAN_FRONTEND=noninteractive
    RUN apt-get update -y
    RUN apt-get upgrade -y
    RUN apt-get install --assume-yes apt-utils
    RUN apt-get install -y python
    RUN apt-get install -y python-pip
    RUN apt-get clean all
    RUN pip install bottle
    # Add local files
    ADD hello.py /tmp/
    # open a port
    EXPOSE 5000
    # specify the default command to run
    CMD ["python", "/tmp/hello.py"]
Slide annotations:
- Name+version
- Contact info
- Heavy lifting, install base tools before our code
- ‘heavy’ base image: 123 MB
- Lots of RUN commands means lots of layers, not ideal for the cache
- Blind update – to what???
- Container != VM
- Final image size: 360 MB

13.
    FROM alpine:3.5
    LABEL Maintainer="Tony Wildish, wildish@lbl.gov" Version=1.0
    ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    RUN apk add --no-cache --update-cache --update python && \
        apk add --no-cache --update py2-pip && \
        pip install bottle
    ADD hello.py /tmp/
    EXPOSE 5000
    CMD ["python", "/tmp/hello.py"]
Slide annotations:
- Final image size: 53.2 MB
- Base image only 5 MB
- Command chaining with &&, reduces #layers
- Install only what we want

14. Building containers
- Build your container with ‘docker build’
      docker build -t user/package:version -f Dockerfile $dir
  - Tag (-t) not obligatory, but very good idea
- Build ‘context’
  - Everything in $dir is sent to the build as the ‘context’
  - Use ‘.dockerignore’ file to exclude files/directories
  - Can greatly speed up build times
- Upload your container to Dockerhub
      docker push user/package:version
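As an illustration of trimming the build context, the sketch below writes a small ‘.dockerignore’ into a scratch directory. The excluded names (`data`, `results`, `*.fastq.gz`) are assumptions about what might sit next to a Dockerfile in a pipeline repository, not something taken from the slides; adapt them to your own layout.

```shell
# Create a scratch build context; in practice this is your project directory.
context=$(mktemp -d)

# Everything listed here is left out of the context sent to the docker daemon.
cat > "$context/.dockerignore" <<'EOF'
# Version-control metadata is not needed inside the image
.git
# Hypothetical input and output directories that live next to the code
data
results
# Hypothetical large sequence files, at any depth
**/*.fastq.gz
EOF

# The build itself is then the command from the slide, for example:
#   docker build -t user/package:version -f "$context/Dockerfile" "$context"
```

Keeping large, frequently changing files out of the context also avoids sending them to the daemon on every build.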

15. Running containers
- Run a container with a default command
      docker run -i -t ubuntu
  - Gives you a shell prompt, ‘exit’ or CTRL-D to quit
  - -i -t -> use for interactive containers
- Run a container, specify the command explicitly
      docker run alpine:3.5 /bin/ls -l
- Set an environment variable
      docker run -e PATH=/bin:/usr/bin alpine:3.5 ls
- Open a second terminal in a running container
      docker run -i -t --name blah ubuntu
      docker exec -i -t blah /bin/bash
  - Especially useful for debugging services …

16. Getting data into/out of containers
- Map external directories into a container
      docker run --volume /external/path:/internal/path
- E.g.: list your current directory, the docker way!
      docker run --volume `pwd`:/mnt alpine:3.5 /bin/ls -l /mnt
- Can map multiple volumes
  - Don’t nest them!
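Mapping several volumes at once might look like the sketch below, which mounts one input and one output directory side by side rather than nested. `run_with_data` and the `/input` and `/output` mount points are hypothetical names chosen for this example. The `DOCKER` variable is also an assumption: it defaults to `docker`, and setting it to `echo` prints the command instead of running it.

```shell
# Override with DOCKER=echo for a dry run that only prints the command.
DOCKER="${DOCKER:-docker}"

# Hypothetical helper: run_with_data IMAGE INPUT_DIR OUTPUT_DIR COMMAND [ARGS...]
# Host paths should be absolute; the input is mounted read-only (:ro) so the
# container cannot modify it, and the two mount points do not nest.
run_with_data() {
    image="$1"; input_dir="$2"; output_dir="$3"
    shift 3
    "$DOCKER" run --rm \
        --volume "$input_dir:/input:ro" \
        --volume "$output_dir:/output" \
        "$image" "$@"
}

# Example (needs a running docker daemon):
#   run_with_data alpine:3.5 "$PWD" /tmp/out /bin/ls -l /input
```

The `--rm` flag removes the stopped container afterwards; results survive because they were written to the mapped output directory.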

17. Docker at NERSC: Shifter
- Shifter: docker functionality with extra protection/limitations specific to NERSC
- Essentially drop-in replacement for docker

    > ssh denovo.nersc.gov
    > CHOS=sles12 chos /bin/bash -l
    > srun --pty /bin/bash
    $ shifterimg pull alpine:3.5
    $ shifter --image=alpine:3.5 --volume=/global/projectb:/mnt /bin/sh
    /chos/global $ ls /mnt
    README.txt     group.quota  reports  scratch  shared      test
    fileset.quota  iotest       sandbox  scripts  statistics  user.quota

18. Finding pre-built containers
- Q: What’s the best way to build a container?
  - A: Don’t! Find one that’s been built already!
- Q: How do you know which one to pick?
  - A: trial and error
  - Look for official builds, #stars.
  - depends on the details of how the container was built

    > docker search spades
    NAME                      DESCRIPTION                                     STARS  OFFICIAL  AUTOMATED
    nucleotides/spades                                                        3                [OK]
    achubaty/r-spades-devel   Provides a testing environment for buildin ...  0                [OK]
    biodckrdev/spades         Tools (written in C using htslib) for mani ...  0                [OK]
    ycogne/spades             spades tools                                    0                [OK]
    bioboxes/spades           St. Petersburg genome assembler                 0                [OK]
    unlhcc/spades                                                             0
    [...]

19. Finding pre-built containers
- Alternative sources
  - Dockerhub.com
    - Same as ‘docker search’, but can get more information about the build, instructions for use etc
  - Google: “dockerfile NCBI blast”
  - Ask the authors of your favorite package if they have a container already
- But check it before using, they may not be experts!
      docker images | grep <image>         # check size
      docker history --no-trunc <image>    # see how it was built
  - Check their github repository!

20. Dockerizing a pipeline: What goes into a container, what doesn’t?
- No hard and fast rules, here are some guidelines
- Include …
  - Anything ‘compiled’, i.e. anything with system dependencies
  - Anything that needs ‘installing’ to run
  - Anything that has portability issues
    - i.e. if you can’t install it on another machine without effort, put it in a container
- Exclude …
  - Simple bash/Perl/Python scripts => install from git etc
    - Need Python/Perl modules? Include them in the container
  - Anything static: big reference DBs etc
  - Anything you could install by just copying to the filesystem

21. Dockerizing a pipeline
- Q: how many containers for a pipeline with 25 steps?
  - A: That depends on what your pipeline does
- Not good: putting the whole pipeline in a single container
  - Maintenance overhead, can’t optimize workflow
  - Remember, a container is not a VM!
- Better: one container per (related set of) executable(s)
  - Your pipeline then invokes one container after another
  - You can re-use containers built by other people
- E.g.
  - One container for blast, including blastp, blastn, blastx, tblastn, tblastx is reasonable
  - But do you use all of them? Or only one? Pick what you need!
  - Do you need the other binaries or files that come with it?
  - Blast+ 2.6.0: 26 MB/binary for those, but 980 MB total installation.
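A pipeline that invokes one container after another can be as simple as a shell script. The sketch below assumes three hypothetical images (`user/trim:1.0`, `user/assemble:2.1`, `user/annotate:0.9`) with equally hypothetical commands and file names; none of them come from the slides. Each step reads and writes files in a shared data directory mapped to `/data`, and `DOCKER=echo` gives a dry run.

```shell
# Override with DOCKER=echo for a dry run that only prints the commands.
DOCKER="${DOCKER:-docker}"
# Host directory shared by all steps (absolute path); the default is an assumption.
DATA_DIR="${DATA_DIR:-$PWD/data}"

# Run one pipeline step in its own container, with the shared directory at /data.
run_step() {
    image="$1"; shift
    "$DOCKER" run --rm --volume "$DATA_DIR:/data" "$image" "$@"
}

# Hypothetical three-step pipeline: each step uses a separate, versioned image.
# The && chaining stops the pipeline as soon as one step fails.
run_pipeline() {
    run_step user/trim:1.0     trim     /data/reads.fastq   /data/trimmed.fastq &&
    run_step user/assemble:2.1 assemble /data/trimmed.fastq /data/contigs.fasta &&
    run_step user/annotate:0.9 annotate /data/contigs.fasta /data/genes.gff
}

# Dry run, printing the three docker commands without executing them:
#   DOCKER=echo run_pipeline
```

Because every step names an explicit image tag, any single container can be replaced or upgraded without rebuilding the others.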

22. Best practices
- Document your containers
  - Use LABEL to add metadata
  - Tag your images, don’t use ‘latest’ by default
- Keep your containers small
  - Start from small image (alpine!)
  - Add only what you need
  - Use one container for one function/functionality
  - Avoid VM-think!
- Optimize your builds
  - Put stable build-commands at the top of your Dockerfile
  - Combine layers where possible (‘&&’ chaining)
  - Check for bloat: (size of your code)/(size of image)
- Share your containers
  - Put image in dockerhub, Dockerfiles in git, tell us, tell your colleagues... (we’re looking into a JGI repository in dockerhub)
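The bloat check is a simple ratio, sketched below as a small helper. `code_fraction` is a hypothetical name, and the sizes in the example are made up for illustration; both arguments are sizes in bytes.

```shell
# Hypothetical helper: code_fraction CODE_BYTES IMAGE_BYTES
# Prints the share of the image taken up by your own code, as a percentage.
code_fraction() {
    awk -v code="$1" -v image="$2" \
        'BEGIN { printf "%.1f%%\n", 100 * code / image }'
}

# Example with made-up sizes: 1 MB of code in a 53.2 MB image.
code_fraction 1000000 53200000
# prints: 1.9%

# With docker available, the image size in bytes can be read with:
#   docker image inspect --format '{{.Size}}' user/package:version
```

A very small percentage is not necessarily wrong, since interpreters and libraries take space, but it is a prompt to check whether the base image or leftover build tools can be trimmed.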

23. Summary
- Docker containers allow great portability
  - Because there’s nothing to port anymore!
- Building good docker containers requires care
  - Not difficult, well worth taking the effort
  - We hope to have some tools to help with this soon
- We can help!
  - File tickets, come to office hours, send email …
  - Tell us what you want to achieve