Our scientific world is expanding. With each passing day, new discoveries are made, old discoveries are verified, and more data are generated. Large data sets are now the norm for omics experiments such as genomics, transcriptomics, proteomics, and metabolomics. Modern techniques enable omics research to determine the identities and quantities of hundreds or thousands of different genes, transcripts, proteins, and small molecules within a single experiment. This alone can generate data files so large that they are difficult to manipulate with traditional hardware and desktop computers. As described in a recent article, the demands within genomics alone for data acquisition, storage, distribution, and analysis already equal or surpass those of other ‘big data’ generators such as astronomy, YouTube, and Twitter (1). Considering the possibility that a significant fraction of the world’s human population will have their genomes sequenced, the authors estimate that between 100 million and 2 billion human genomes could be sequenced by 2025, representing up to five orders of magnitude of growth in ten years and far exceeding the growth of the other big data domains. Add to that the complexity of handling replicates, comparing different conditions (e.g., time-course studies or different disease states), and correlating and integrating across multiple omics disciplines (e.g., proteomics and transcriptomics), and the problem quickly escalates to intractable proportions.
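The scale of that projection can be checked with quick arithmetic. The cited analysis (1) observes that the number of sequenced human genomes has historically doubled roughly every seven months; the sketch below simply extrapolates that rate over ten years (the seven-month doubling time is an assumption taken from that historical trend, not a measurement of future capacity).

```python
import math

# Extrapolate historical sequencing growth (genomes doubling roughly every
# 7 months, per the trend discussed in reference 1) over a ten-year span.
months = 10 * 12
doubling_time_months = 7  # assumed historical doubling time
doublings = months / doubling_time_months
growth_factor = 2 ** doublings
orders_of_magnitude = math.log10(growth_factor)

print(f"{doublings:.1f} doublings -> {growth_factor:,.0f}x growth "
      f"({orders_of_magnitude:.1f} orders of magnitude)")
```

At that pace the ten-year growth factor lands at roughly 10^5, consistent with the "up to five orders of magnitude" figure above.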
Through broader adoption, the amount and complexity of omics data generated will only increase as we continue to push our technologies to smaller sample sizes and lower costs. Additionally, now more than ever we are a global scientific community, and scientists across the world are finding new partners for collaboration. While sharing a static picture of results can be quite easy, sharing actual data can be problematic. If a remote collaborator wants to probe the data interactively or integrate their own research findings, they need access to the original data files or results, which can necessitate the installation of a duplicate computing environment in their own lab. Generating the data is no longer the pain point. Storing, analyzing, sharing, and integrating data are now the bottlenecks to progress in life science research.
Cloud-based computing offers a solution. Recently, within the field of genomics, researchers from Canada, Europe, and the US have called for major funding agencies to establish a cloud-based global data commons (2). As stated by Dr. Peter Campbell, Head of Cancer Genomics at the Wellcome Trust Sanger Institute: “We have now reached a stage where these data sets are too large to move around – cloud computing offers us the flexibility to hold the data in one virtual location and unleash the world’s researchers on it all together.” (3).
As defined by Google, cloud computing is “the practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer” (4). The benefits of cloud computing are enormous and include:
- Increased global access to data for collaborators
- Increased access to data while traveling
- Accelerated tool development for handling and processing the data (due to expanded access)
- More streamlined integration and aggregation of multiple data sources
- Increased scalability with virtually unlimited computational and storage power
- Reduced cost of maintaining local servers (no long-term investment in local computer infrastructure necessary)
- Secure environment (access is granted to authorized personnel only)
The last bullet has been somewhat contentious, with many researchers feeling a general sense of unease about the security of cloud-based solutions. However, contrary to what some may think, the cloud can actually be more secure than a local storage environment (5). Data are encrypted and kept private, back-ups are performed routinely, and user activity is tracked and monitored. Cloud providers are held to stricter standards: they must build secure data centers that are independently audited and adhere to specific service organization controls. Hundreds to thousands of clients depend on them daily, and providing security is an inherent part of their business.
Examples of cloud-based applications include Dropbox, Google Docs, LinkedIn, and email on your mobile phone. Users can store, view, manage, and interact with documents and files as if they were native applications on their computer and the files were stored locally. Another example is The OneOmics™ Project.
The OneOmics Project is a partnership between SCIEX, Illumina, and life science researchers to enable the storage, processing, and integration of multi-omics data sets in the cloud. OneOmics takes advantage of the BaseSpace® environment hosted by Illumina. BaseSpace is built on the cloud platform provider Amazon Web Services (AWS), which provides cloud-based services to a diverse set of clients such as NASA, Pfizer, and Comcast (6). Within BaseSpace, next-generation proteomics (NGP) data produced from SWATH® acquisition-based experiments on a SCIEX TripleTOF® instrument can be stored and processed, and then integrated with next-generation sequencing (NGS) data from an Illumina instrument, because both data types reside in the same environment.
Just like Google Play or the Apple App Store, users can browse and use a variety of applications on BaseSpace for data processing and analysis. Apps can be built by SCIEX, Illumina, or other commercial, academic, or open source providers. For example, the Protein Expression Workflow application developed by SCIEX enables data quality review using Analytics, while the Browser enables users to visualize results in their biological context. The SWATHAtlas Ion Library Generator from ISB enables fast and simplified access to human, yeast, and Mycobacterium tuberculosis reference libraries for use in NGP SWATH experiments. The RNASeq Translator application from Yale University translates the output from an Illumina RNASeq experiment into a protein database to better enable the identification and quantitation of splice variants in future SWATH NGP projects. The iPathwayGuide from Advaita Bioinformatics enables biochemical pathway analysis, gene ontology analysis, miRNA prediction, and drug and disease analyses of SWATH NGP and Illumina RNASeq data, integrating proteomics and transcriptomics data sets. Applications for analyzing and integrating metabolomics and lipidomics data are planned for the future.
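To make the transcript-to-protein idea concrete, here is a minimal sketch of the general technique: translating transcript sequences in all three forward reading frames and emitting the peptides as a FASTA-style protein database. This is an illustration only, not the actual RNASeq Translator implementation; the function names, the three-frame (rather than six-frame) scope, and the stop-at-first-stop-codon rule are all simplifying assumptions.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: aa
    for (b1, b2, b3), aa in zip(
        [(x, y, z) for x in BASES for y in BASES for z in BASES], AMINO_ACIDS
    )
}

def translate(seq: str, frame: int) -> str:
    """Translate one reading frame (0, 1, or 2), stopping at the first stop codon."""
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def transcripts_to_fasta(transcripts: dict) -> str:
    """Emit a FASTA-format protein database covering all three forward frames."""
    entries = []
    for name, seq in transcripts.items():
        dna = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
        for frame in range(3):
            peptide = translate(dna, frame)
            if peptide:
                entries.append(f">{name}_frame{frame}\n{peptide}")
    return "\n".join(entries)

# Example with a short made-up transcript sequence
print(transcripts_to_fasta({"tx1": "ATGGCTTAA"}))
```

A search engine can then match SWATH-derived peptide spectra against the resulting FASTA entries, which is what makes sample-specific splice variants findable.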
With cloud computing, the ability to store, aggregate, and combine data and then use the results to obtain deep biological understanding has become more accessible than ever. OneOmics cloud computing delivers those advantages to the life science researcher and provides a means to climb and conquer the mountain of data.
1. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE. Big Data: Astronomical or Genomical? PLoS Biol 13(7): e1002195 (7 July 2015). doi:10.1371/journal.pbio.1002195. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
2. Stein LD, Knoppers BM, Campbell P, Getz G, Korbel JO. Data Analysis: Create a Cloud Commons. Nature 523: 149–151 (9 July 2015). doi:10.1038/523149a. http://www.nature.com/news/data-analysis-create-a-cloud-commons-1.17916
3. Researchers Call for Support for Data in the Cloud to Facilitate Genomics Research. EurekAlert! (9 July 2015). http://www.eurekalert.org/pub_releases/2015-07/oifc-rcf070915.php
4. Google search definition of “cloud computing.”
5. Rossi B. The Great IT Myth: Is Cloud Really Less Secure Than On-Premise? Information Age (9 March 2015). http://www.information-age.com/technology/security/123459135/great-it-myth-cloud-really-less-secure-premise
6. Amazon Web Services, Customer Stories: https://aws.amazon.com/solutions/case-studies/all/