Arvados Technology
Open source infrastructure for managing, processing, and
sharing genomic and other biomedical data.
Arvados is deployed with a multilayer integrated stack of technologies built with proven open source software. All the layers work together as a complete solution based on modern distributed computing patterns.
Arvados is designed to run on an elastic computing foundation, which can be provided by a cloud or off-the-shelf hardware running virtualization and related services.
The System Services layer provides the core Arvados services: the Keep data manager and the Crunch job manager.
All the services in the system are accessed through a RESTful API, and there are SDKs for Python, Perl, Ruby, Java, and Go.
At the Interface layer, Arvados provides a number of different ways for users and admins to access the capabilities of the system.
Security is woven throughout the system at every layer.
The Arvados data management system, Keep, is a content-addressable storage system. It can manage data on commodity drives or use a wide range of other underlying file systems including object/blob stores. Keep does for data what Git does for code.
Quickly add files to a dataset of any size without moving or copying them, using dataset management tools instead of folders.
Ensure reliable and durable data retrieval with content addressing that automatically verifies a hash of every file.
Interact through the API or load datasets as network drives so you can interact with file collections using traditional file paths.
Eliminate duplicate data storage by automatically checking for duplication on write using content addresses.
Track the origin of datasets and how they are used across the system by recording each pipeline run as metadata.
Move computations to data and optimize disk access with a single reader and writer for each spindle.
Manage data across different tiers of storage, from production to archive, on premises or in the cloud.
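The content-addressing ideas above (an address derived from the data, verification on read, and deduplication on write) can be illustrated with a minimal sketch. This is a toy in-memory store, not Keep's actual API or on-disk format; the class name ContentStore, its methods, and the choice of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (hypothetical; not Keep's API)."""

    def __init__(self):
        self._blocks = {}  # maps hex digest -> stored bytes

    def put(self, data: bytes) -> str:
        # The address is computed from the content itself, so writing the
        # same bytes twice yields the same address and stores one copy.
        address = hashlib.sha256(data).hexdigest()
        self._blocks.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._blocks[address]
        # Re-hash on every read so corrupted data is detected, not returned.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data

    def __len__(self) -> int:
        return len(self._blocks)
```

Because the address doubles as a checksum, a reader needs no separate integrity metadata: any block that no longer hashes to its own address is rejected.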
The Arvados job manager, Crunch, is a containerized workflow engine that provides a flexible way to define and run computational pipelines that can be reliably reproduced. It builds on Git, Docker, and other technologies.
Define pipelines in an easy-to-use JSON document or script (soon Common Workflow Language).
Use Docker images to define run-time environments for individual jobs.
Let Crunch manage provisioning compute nodes and installing containers and software.
Reliably reproduce every job and pipeline you run.
Automatically recover from disk and node failures.
Move computations to data and optimize disk access.
Easily move computations between Arvados instances.
Run jobs without assistance in cluster management.
Access job status reports during and after job execution.
Save time and money by skipping jobs that don’t need to be re-run.
Crunch can launch web applications or stand up databases as part of a pipeline.
Easily scale jobs to run in parallel on multiple nodes.
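One way to picture how a job can be skipped when it does not need to be re-run is to fingerprint everything that determines its result and reuse a stored output when the fingerprint matches. The sketch below is an illustration under assumptions, not Crunch's actual mechanism: the functions job_fingerprint and run_job, the in-memory cache, and the fields used in the fingerprint (code version, Docker image, parameters) are hypothetical.

```python
import hashlib
import json

_results = {}  # hypothetical cache: job fingerprint -> stored output


def job_fingerprint(script_version, docker_image, params):
    # Serialize the job description canonically (sorted keys) so that
    # equivalent descriptions always produce the same fingerprint.
    spec = {
        "script_version": script_version,
        "docker_image": docker_image,
        "params": params,
    }
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_job(script_version, docker_image, params, func):
    """Return (output, reused). func(params) runs only on a cache miss."""
    key = job_fingerprint(script_version, docker_image, params)
    if key in _results:
        return _results[key], True  # identical job already ran; skip it
    output = func(params)
    _results[key] = output
    return output, False
```

This only works when jobs are deterministic given their code, environment, and inputs, which is one reason pinning the run-time environment with a Docker image matters for reproducibility.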
Arvados is designed to provide a highly flexible environment for getting your work done, so it has a variety of different interfaces.
The entire system can be accessed through REST APIs from any programming language.
If you like working on the command line, that’s always an option.
SDKs are available for Python, Perl, Java, Go, and Ruby. (R is coming soon.)
Workbench is a web application that makes it easy to use Arvados from your browser.
Arvados lets you organize work into projects to make it easier to keep track of the datasets and pipelines you’re using. Everything in the system can be easily tagged with metadata.
In a typical configuration, your Arvados cluster will have virtual machines set up for each user so they have their own environment to test and develop work.
Arvados is designed to empower people to securely collaborate, share data, and publish their work.
Add multiple users to a project so individuals can share and collaborate on the work.
If you want to share your work publicly or provide a URL for methods in a paper, you can make a project public so that anyone can see it.
A single command reliably copies every aspect of a project from one Arvados cluster to another.
Arvados currently uses OAuth2 for authentication and can be integrated with LDAP and Active Directory.
If you want to collaborate across clusters, you can move pipelines between environments, which enables secure data sharing without moving data around.
The data manager makes it possible to apply access control permissions at the dataset level, which is much more flexible than traditional file-level and directory-level permissions.
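As a rough illustration of dataset-level access control, the sketch below attaches grants to a dataset as a whole rather than to each file or directory inside it. It is a simplified model and does not represent the Arvados permission system: the Dataset class, its methods, and the level names in LEVELS are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical permission levels; a higher number implies the lower ones.
LEVELS = {"can_read": 1, "can_write": 2, "can_manage": 3}


@dataclass
class Dataset:
    name: str
    files: list
    grants: dict = field(default_factory=dict)  # user -> level name

    def grant(self, user: str, level: str) -> None:
        if level not in LEVELS:
            raise ValueError(f"unknown permission level: {level}")
        # One grant covers every file in the dataset.
        self.grants[user] = level

    def allows(self, user: str, level: str) -> bool:
        held = self.grants.get(user)
        return held is not None and LEVELS[held] >= LEVELS[level]
```

With this shape, sharing a dataset of thousands of files is a single grant, and adding a file to the dataset does not require setting its permissions separately.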