DREAM infrastructure challenge

Arvados submissions for the GA4GH-DREAM Workflow Execution Challenge


md5sum challenge

hello world
GATK Haplotype Caller Project

This tutorial demonstrates how to run the GATK Haplotype Caller pipeline using GenomeAnalysisTK-3.2-2 from the Broad Institute. These pipelines currently support GATK version 3.2-2 (md5 3163cbeef8fd50d8cb85096758b801a3; Keep content hash 2e98fdc8e90f4c48a0714b711767c9ce+76). You must obtain your own GATK jar file to run this pipeline; it is available from the Broad Institute’s GATK licensing site. Further instructions on how to upload your file can be found on the Arvados documentation page or in the tutorial below. If you run into any problems, please contact
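Since you supply your own GATK jar, it can help to confirm that the file matches the MD5 digest quoted above before uploading it. The following is a minimal sketch using only the Python standard library; the function names and the jar file name in the usage note are hypothetical, not part of the pipeline.

```python
import hashlib

# MD5 digest quoted in the project description for GenomeAnalysisTK-3.2-2.
EXPECTED_MD5 = "3163cbeef8fd50d8cb85096758b801a3"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large jars are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jar_matches(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `jar_matches("GenomeAnalysisTK.jar")` (an assumed local file name) returns `True` only if the jar has the digest listed above. This checks the jar's MD5 only; it does not compute the Keep content hash.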


This is a complete GATK workflow written in CWL v1.0.
The tools in the workflow run in Docker.

BOSC 2016
FoG Boston 2016 Work Project
Old Ancestry Mapper Runs
Other scripts
bcbio test runs

bcbio CWL test runs:

test parent project
bcbio CWL
GATK bcbio style
Mason Lab - Methylkit

MethylKit is an R package for DNA methylation analysis from high-throughput bisulfite sequencing. Its features include coverage and methylation statistics, differential methylation analysis, feature annotation, and reading methylation calls.

Public Bioinformatics tools

Binaries of some Bioinformatics tools

lobSTR v.3 (Public)

lobSTR is a tool for profiling Short Tandem Repeats (STRs) from high throughput sequencing data.

UMC Public Pipeline (BOSC 2015)

A BWA-GATK pipeline by the UMC Utrecht community. Used in the poster "Developing an Arvados BWA-GATK pipeline" at BOSC 2015.

GATK2 Unified Genotyper (Public)

Run GATK2 on paired-end reads and call variants using Unified Genotyper. To run this pipeline, click "Run a pipeline" and select “Demo GATK2 Pipeline”. Feel free to use the "PGP HU34D5B9 “FASTQ” exome" data set as input; it contains two sets of paired-end FASTQ files.

PGP hu826751

Complete Genomics whole genome sequencing raw data for Harvard Personal Genome Project participant hu826751 (2014-10-17).

Outputs of PATHOMAP_P00553.vcf

Mason Lab – Pathomap Output data for PATHOMAP_P00553.vcf

Docker Images

Mason Lab – Pathomap Docker Images

Output Demo Data

Mason Lab – Pathomap Output Data

GATK3 Haplotype Caller (Public)

Run the GATK3 Best Practices pipeline on paired-end reads and call variants using both Haplotype Caller and Unified Genotyper. To run this pipeline, click "Run a pipeline" and select “Demo GATK3 Haplotype Caller Pipeline” from the GATK3 Haplotype Caller Project. Feel free to use the "PGP HU34D5B9 “FASTQ” exome" data set as input; it contains two sets of paired-end FASTQ files.

Bcbio-nextgen (Public)

The bcbio-nextgen project was created by Brad Chapman from the Harvard School of Public Health.

Mason Lab - Pathomap / Ancestry Mapper (Public)

Part of the Pathomap Project developed by the Mason Lab at Weill Cornell Medical College.

Public Datasets / Collections
Platypus (Public)

Input FASTQ files and call variants using Platypus!
To run this pipeline, create a free account on the Curoverse home page, then follow the instructions in the tutorial to use the test data or your own data.

Sample Public Pipelines

A list of all public pipelines currently running on Arvados.
Want your pipeline here? Email for help to get started!

RNA-seq/Tuxedo (Public)

An RNA-Seq pipeline consisting of Bowtie 2, TopHat 2, and Cufflinks.

Public GA4GH Collection


PCA of 174 whole genomes from the Personal Genome Project

Principal component analysis of 174 whole genome sequences (chromosomes 13 and 17) from the Personal Genome Project. From this project, you can explore the inputs (numpy files, path lengths, and human population data), the environment (Docker image), the code (under pipeline templates), the tests that were run (under pipelines), and their output. To rerun any analysis or alter inputs, create an account, sign in, and create a copy of this project.
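For readers new to the technique, principal component analysis on a samples-by-features numpy array can be sketched as below. This is a generic illustration using a singular value decomposition, not the code from this project's pipeline templates; the function name `pca` and the assumed input layout (one row per genome, one column per feature) are hypothetical.

```python
import numpy as np


def pca(data, n_components=2):
    """Project the rows of `data` onto the top principal components.

    `data` is assumed to be a 2-D array with one row per sample
    and one column per feature.
    """
    # Center each feature so the components describe variance around the mean.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    # Fraction of total variance captured by each retained component.
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]
```

Calling `pca(matrix, 2)` on an array with 174 rows would return a 174-by-2 array of coordinates suitable for a scatter plot, along with the fraction of variance each of the two components explains.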

Arvados Tutorial

Running a pipeline tutorial