Sunday, March 8, 2009

Bloginar: Next Gen Laboratory Systems for Core Facilities

Geospiza kicked off February by attending the AGBT and ABRF conferences. As part of our participation at ABRF, we presented a scenario, in our poster, where a core lab provides Next Generation Sequencing (NGS) transcriptome analysis services. This story shows how GeneSifter Lab and Analysis Edition’s capabilities overcome the challenges of implementing NGS in a core lab environment.

Like the last post, which covered our AGBT poster, the following poster map will guide the discussion.


Because this poster overlaps the previous one in providing information about RNA assays and data analysis, our main points below focus on how GeneSifter Lab Edition solves the laboratory and business process challenges of setting up a new lab for NGS or bringing NGS into an existing microarray or Sanger sequencing lab.

Section 1 contains the abstract, an introduction to the core laboratory, and background information on different kinds of transcription profiling experiments.

The general challenge for a core lab lies in the need to run a business that offers a wide variety of scientific services in which samples (physical materials) are converted to data and information that have biological meaning. Different services often require different lab processes to produce different kinds of data. To facilitate and direct lab work, each service requires specialized information and instructions for the samples that will be processed. Before work is started, the lab must review the samples and verify that the information has been correctly delivered. Samples are then routed through different procedures to prepare them for data collection. In the last steps, data are collected and reviewed, and the results are delivered back to clients. At the end of each billing period (typically monthly), orders are reviewed and invoices are prepared either directly or by updating accounting systems.

In the case of NGS, we are learning that the entire data collection and delivery process gets more complicated. Compared to Sanger sequencing, genotyping, or other assays that are run in 96-well formats, sample preparation is more complex. NGS requires that DNA libraries be prepared, and different steps of the process need to be measured and tracked in detail. Also, complicated bioinformatics workflows are needed to understand the data in terms of both quality control and biological meaning. Moreover, NGS requires a substantial investment in information technology.

Section 2 walks through the ways in which GeneSifter Lab Edition helps to simplify the NGS laboratory operation.

Order Forms

In the first step, an order is placed. Screenshots show how GeneSifter can be configured for different services. Labs can define specialized form fields using a variety of user interface elements like check boxes, radio buttons, pull down menus, and text entry fields. Fields can be required or optional, and special rules, such as ranges for values, can be applied to individual fields within specific forms. Orders can also be configured to take files as attachments to track data, like gel images, about samples. To handle that special “for lab use only” information, fields in forms can be specified as laboratory use only. Such fields are hidden from the customer’s view and are filled in later by lab personnel when the orders are processed. The advantage of GeneSifter’s order system is that the pertinent information is captured electronically in the same system that will be used to track sample processing and organize data. Indecipherable paper forms are eliminated along with the problem of finding information scattered on multiple computers.

Web forms do create a special kind of data entry challenge. Specifically, when there is a lot of information to enter for a lot of samples, filling in numerous form fields on a web page can be a serious pain. GeneSifter solves this problem in two ways:

First, all forms can have “Easy Fill” controls that provide column highlighting (for fast tab-and-type data entry), auto fill downs, and auto fill downs with number increments, so one can easily “copy” common items into all cells of a column, or increment an ending number across all values in a column. When these controls are combined with the “Range Selector,” a powerful web-based user interface makes it easy to enter large numbers of values quickly in flexible ways.
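To make the fill-down idea concrete, here is a minimal Python sketch of the two behaviors described above: copying one value down a column, and incrementing a trailing number down a column. The function names are hypothetical and this is not GeneSifter’s code; it only illustrates the logic.

```python
import re


def fill_down(value, count):
    """Copy the same value into `count` cells of a column."""
    return [value] * count


def fill_down_increment(value, count):
    """Fill a column, incrementing the number at the end of `value`.

    Zero padding is preserved. If `value` has no trailing number,
    this falls back to a plain fill down.
    """
    match = re.search(r"(\d+)$", value)
    if match is None:
        return fill_down(value, count)
    prefix = value[:match.start()]
    digits = match.group(1)
    start = int(digits)
    width = len(digits)
    return [f"{prefix}{start + i:0{width}d}" for i in range(count)]


# Hypothetical sample names, for illustration only.
print(fill_down("mouse", 3))
print(fill_down_increment("RNA-008", 4))
```

The second call carries the numbering past a digit boundary (008, 009, 010, 011) while keeping the padding the user typed.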

Second, sometimes the data to be entered are already in an Excel spreadsheet. To solve this problem, each form contains a specialized Excel spreadsheet validator. The form can be downloaded as an Excel template, and the rules previously assigned to fields when the form was created are used to check data when they are uploaded. This process spots problems with data items and reports them at upload time, when they are easy to fix, rather than later, when information is harder to find. This feature eliminates endless cycles of contacting clients to get the correct information.
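The upload check can be pictured with a short Python sketch. It assumes the spreadsheet rows have already been read into dictionaries, and the rule class, field names, and ranges are all made up for illustration; GeneSifter’s actual validator is not shown in the poster.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldRule:
    """A hypothetical form-field rule: required flag plus optional range."""
    name: str
    required: bool = True
    minimum: Optional[float] = None
    maximum: Optional[float] = None


def validate_rows(rows, rules):
    """Return a list of (row_number, field, message) problems."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for rule in rules:
            value = str(row.get(rule.name, "")).strip()
            if not value:
                if rule.required:
                    problems.append((row_number, rule.name, "missing required value"))
                continue
            if rule.minimum is None and rule.maximum is None:
                continue  # no range rule, so any text is accepted
            try:
                number = float(value)
            except ValueError:
                problems.append((row_number, rule.name, "not a number"))
                continue
            if rule.minimum is not None and number < rule.minimum:
                problems.append((row_number, rule.name, f"below minimum {rule.minimum}"))
            if rule.maximum is not None and number > rule.maximum:
                problems.append((row_number, rule.name, f"above maximum {rule.maximum}"))
    return problems


# Illustrative rules and rows; none of these values come from a real form.
RULES = [
    FieldRule("sample_name"),
    FieldRule("concentration_ng_ul", minimum=10, maximum=1000),
    FieldRule("notes", required=False),
]
ROWS = [
    {"sample_name": "S-1", "concentration_ng_ul": "250"},
    {"sample_name": "", "concentration_ng_ul": "5"},
    {"sample_name": "S-3", "concentration_ng_ul": "high"},
]

PROBLEMS = validate_rows(ROWS, RULES)
for problem in PROBLEMS:
    print(problem)
```

Reporting every problem with its row number at upload time is what lets the client fix the sheet in one pass instead of through a string of emails.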

Laboratory Processing

Once order data are entered, the next step is to process orders. The middle of section 2 describes this process using an RNA-Seq assay as an example. Like other NGS assays, the RNA-Seq protocol has many steps involving RNA purification, fragmentation, random primed conversion into cDNA, and DNA library preparation of the resulting cDNA for sequencing. During the process, the lab needs to collect data on RNA and DNA concentration as well as determine the integrity of the molecules throughout the process. A lab that runs different kinds of assays will have to manage multiple procedures that may differ in the ordering of steps and in the laboratory data that need to be collected.

By now it is probably not a surprise to learn that GeneSifter Lab Edition has a way to meet this challenge too. To start, workflows (lab procedures) can be created for any kind of process with any number of steps. The lab defines the number of steps and their order and which steps are required (like the order forms). Having the ability to mix required and optional steps in a workflow gives a lab the ultimate flexibility to support those “we always do it this way, except the times we don’t” situations. For each step the lab can also define whether or not any additional data needs to be collected along the way. Numbers, text, and attachments are all supported so you can have your Nanodrop and Bioanalyzer too.

Next, an important feature of GeneSifter workflows is that a sample can move from one workflow to another. This modular approach means that separate workflows can be created for RNA preparation, cDNA conversion, and sequencing library preparation. If a lab has multiple NGS platforms, or a combination of NGS and microarrays, they might find that a common RNA preparation procedure is used, but the processes diverge when the RNA is converted into forms for collecting data. For example, aliquots of the same RNA preparation may be assayed and compared on multiple platforms. In this case a common RNA preparation protocol is followed, but sub-samples are taken through different procedures, like a microarray and NGS assay, and their relationship to the “parent” sample must be tracked. This kind of scenario is easy to set up and execute in GeneSifter Lab Edition.
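As a rough picture of the two ideas above (workflows that mix required and optional steps, and sub-samples that keep a link to their “parent”), here is a small Python sketch. The class names, step names, and sample names are all hypothetical; this is a data-model illustration, not how GeneSifter Lab Edition is implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    name: str
    required: bool = True


@dataclass
class Sample:
    name: str
    parent: Optional["Sample"] = None
    completed: List[str] = field(default_factory=list)

    def aliquot(self, name: str) -> "Sample":
        """Create a sub-sample that remembers its parent."""
        return Sample(name=name, parent=self)

    def lineage(self) -> List[str]:
        """Names from this sample back to the original parent."""
        names = [self.name]
        node = self.parent
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


def missing_required(sample, workflow):
    """Required steps of `workflow` that the sample has not completed."""
    return [s.name for s in workflow if s.required and s.name not in sample.completed]


# Two modular workflows; the Bioanalyzer check is optional in this example.
RNA_PREP = [Step("purify RNA"), Step("Bioanalyzer check", required=False),
            Step("measure concentration")]
LIBRARY_PREP = [Step("fragment RNA"), Step("convert to cDNA"), Step("prepare library")]

parent = Sample("RNA-001")
parent.completed += ["purify RNA", "measure concentration"]

# The same RNA preparation is split for two downstream procedures.
ngs_aliquot = parent.aliquot("RNA-001-NGS")
array_aliquot = parent.aliquot("RNA-001-array")
ngs_aliquot.completed.append("fragment RNA")

print(missing_required(parent, RNA_PREP))
print(missing_required(ngs_aliquot, LIBRARY_PREP))
print(ngs_aliquot.lineage())
```

Here the parent sample has finished RNA preparation even though the optional step was skipped, and each aliquot can move into its own workflow while its lineage still points back to RNA-001.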

Finally, one of GeneSifter’s greatest advantages is that a customized system with all of the forms, fields, Excel import features, and modular workflows can be set up by lab operators without any programming. Achieving similar levels of customization with traditional LIMS products takes months or years, with initial and recurring costs of six or more figures.

Collecting Data

The last step of the process is collecting the data, reviewing it, and making sequences and results available to clients. Multiple screenshots illustrate how this works in GeneSifter Lab Edition. For each kind of data collection platform, a “run” object is created. The run holds the information about reactions (the samples ready to run) and where they will be placed in the container that will be loaded into the data collection instrument. In this context, the container is used to describe 96 or 384-well plates, glass slides with divided areas called lanes, regions, chambers, or microarray chips. All of these formats are supported and in some cases specialized files (sample sheets, plate records) are created and loaded into instrument collection software to inform the instrument about sample placement and run conditions for individual samples.

During the run, samples are converted to data. This process, different for each kind of data collection platform, produces variable numbers and kinds of files that are organized in completely different ways. Using tools that work with GeneSifter, raw data and tracking information are entered into the database to simplify access to the data at a later time. The database also associates sample names and other information with data files, eliminating the need to rename files with complex tracking schemes. The last steps of the process involve reviewing quality information and deciding whether to release data to clients or repeat certain steps of the process. When data are released, each client receives an email directing them to their data.

The lab updates the orders and optionally creates invoices for services. GeneSifter Lab Edition can be used to manage those business functions as well. We’ll cover GeneSifter’s pricing and invoicing tools at some other time, but be assured they are as complete as the other parts of the system.

NGS requires more than simple data delivery

Section 3 covers issues related to the computational infrastructure needed to support NGS and the data analysis aspects of the NGS workflow. In this scenario, our core lab also provides data analysis services to convert those multi-million read files into something that can be used to study biology. Much of this was covered in the previous post, so it will not be repeated here.

I will summarize by making the final points that Geospiza’s GeneSifter products cover all aspects of setting up a lab for NGS. From sample preparation, to collecting data, to storing and distributing results, to running complex bioinformatics workflows and presenting information in ways to get scientifically meaningful results, a comprehensive solution is offered. GeneSifter products can be delivered as hosted solutions to lower costs. Our hosted, Software as a Service, solutions allow groups to start inexpensively and manage costs as the needs scale. More importantly, unlike in-house IT systems, which require significant planning and implementation time to remodel (or build) server rooms and install computers, GeneSifter products get you started as soon as you decide to sign up.

Wednesday, March 4, 2009

Bloginar: The Next Generation Dilemma: Large Scale Data Analysis

Previous posts shared some of the things we learned at the AGBT and ABRF meetings in early February. Now it is time to share the work we presented, starting with the AGBT poster, “The Next Generation Dilemma: Large Scale Data Analysis.”

The goal of the poster was to provide a general introduction to the power of Next Generation Sequencing (NGS) and a framework for data analysis. Hence, the abstract described the general NGS data analysis process, its issues, and what we are doing for one kind of transcription profiling, RNA-Seq. Between then and now we learned a few things... And the project grew.

The map below guides my “bloginar” poster presentation. In keeping with the general theme of the abstract we focused on transcription analysis, but instead of focusing exclusively on RNA-Seq, the project expanded to compare three kinds of transcription profiling: RNA-Seq, Tag Profiling, and Small RNA Analysis. A link to the poster is provided at the end.

Section 1 provides a general introduction to NGS by discussing the ways NGS is being used to study different aspects of molecular biology. It also covers how the data are analyzed in three phases (primary, secondary, tertiary) to convert raw data into biologically meaningful information. The three phase model has emerged as a common framework to describe the process of converting image data into primary sequence data (reads) and then turning the reads into information that can be used in comparative analyses. Secondary analysis is the phase where reads are aligned to reference sequences to get gene names, position, and (or) frequency information that can be used to measure changes, like gene expression, between samples.

The remaining sections of the poster use examples from transcription analysis to illustrate and address the multiple challenges (listed below) that must be overcome to efficiently use NGS.
  • High end infrastructures are needed to manage and work with extremely large data sets
  • Complex, multistep analysis procedures are required to produce meaningful information
  • Multiple reference data are needed to annotate and verify data and sample quality
  • Datasets must be visualized in multiple ways
  • Numerous Internet resources must be used to fill in additional details
  • Multiple datasets must be comparatively analyzed to gain knowledge
Section 2 describes the three different kinds of transcription profiling experiments. This section provides additional background on the methods and what they measure. For example, RNA-Seq and Tag Profiling are commonly used to measure gene expression. In RNA-Seq, DNA libraries are prepared by randomly amplifying short regions of DNA from cDNA. The sequences that are produced will generally cover the entire region of the transcripts that were originally isolated. Hence, it is possible to get information about alternative splicing and biased allelic expression. In contrast, Tag Profiling focuses on creating DNA libraries from discrete points within the RNA molecules. With Tag Profiling, one can quickly measure relative gene expression, but cannot get information about alternative splicing and allelic expression. The table in section 2 discusses these and other issues one must consider when running the different assays.

Sections 3, 4, and 5 outline three transcriptome scenarios (RNA-Seq, Tag Profiling, and Small RNA, respectively) using real data examples (references provided in the poster). Each scenario follows a common workflow involving the preparation of DNA libraries from RNA samples, followed by secondary analysis, followed by tertiary analysis of the data in GeneSifter Analysis Edition.

For RNA-Seq, two datasets corresponding to mouse embryonic stem (ES) cells and embryoid bodies (EB) were investigated. DNA libraries were produced from each cell line. Sequences were collected from each library and compared to the RefSeq (NCBI) database according to the pipeline shown. The screen captures (middle of the panel) show how the individual reads map to each transcript along with the total numbers of hits summarized by chromosome. The process is run twice, once for each cell line, and the two sets of alignments are converted to Gene Lists for comparative analysis in GeneSifter Analysis Edition to observe differential expression (bottom of the panel).
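The step from alignments to a comparable gene list can be sketched in a few lines of Python. This assumes each alignment has already been reduced to a (read, gene) pair, normalizes counts to reads per million, and adds a pseudocount before taking a log ratio. Those choices, the function names, and the toy data are assumptions for illustration, not the method GeneSifter uses.

```python
import math
from collections import Counter


def gene_counts(alignments):
    """Count aligned reads per gene from (read_id, gene_name) pairs."""
    return Counter(gene for _read_id, gene in alignments)


def log2_ratios(counts_a, counts_b, pseudocount=1.0):
    """log2(B / A) per gene, using reads-per-million plus a pseudocount."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    ratios = {}
    for gene in sorted(set(counts_a) | set(counts_b)):
        rpm_a = counts_a[gene] / total_a * 1_000_000
        rpm_b = counts_b[gene] / total_b * 1_000_000
        ratios[gene] = math.log2((rpm_b + pseudocount) / (rpm_a + pseudocount))
    return ratios


# Toy alignments for two samples; gene names are placeholders.
SAMPLE_A = [(f"a{i}", "geneA") for i in range(8)] + [(f"a{i}", "geneB") for i in range(8, 10)]
SAMPLE_B = [(f"b{i}", "geneA") for i in range(2)] + [(f"b{i}", "geneB") for i in range(2, 10)]

RATIOS = log2_ratios(gene_counts(SAMPLE_A), gene_counts(SAMPLE_B))
for gene, ratio in RATIOS.items():
    print(gene, round(ratio, 2))
```

In this toy case geneA drops about fourfold and geneB rises about fourfold between the two samples, which is the kind of difference the comparative view is meant to surface.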

The Tag Profiling panel examines data from a recently published experiment (a reference is provided in the poster) in which gene expression was studied in transgenic mice. I’ll leave out the details of the paper, and only point out how this example shows the differences between Tag Profiling and RNA-Seq data. Because Tag Profiling collects data from specific 3’ sites in RNA, the aligned data (middle of the panel) show alignments as single “spikes” toward the 3’ end of transcripts. Occasionally multiple peaks are observed. The question is whether the additional peaks are the result of isoforms (alternative polyA sites) or incomplete restriction enzyme digests. How might this be sorted out? As with RNA-Seq, the bottom of the panel shows the comparative analysis of replicate samples from the wild type (WT) and transgenic (TG) mice.

Data from a small RNA analysis experiment are analyzed in the third panel. Unlike RNA-Seq and Tag Profiling, this secondary analysis has more comparisons of the reads to different sets of reference sequences. The purpose is to identify and filter out common artifacts observed in small RNA preparations. The pipeline we used, and the data produced, are shown in the middle of the panel. Histogram plots of read length distribution, determined from alignments against different reference sources, are created because an important feature of small RNAs is that they are small. Distributions clustered around 22 nt indicate a good library. Finally, data are linked to additional reports and databases, like miRBase (Sanger Center), to explore results further. In the example shown, the first hit was to a small RNA that has been observed in opossums; now we have a human counterpart. In total, four samples were studied. Like RNA-Seq and Tag Profiling, we can observe the relative expression of each small RNA by analyzing the datasets together (hierarchical clustering, bottom).
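The read length check is simple enough to sketch in Python. The 20 to 24 nt window below is an assumption chosen to bracket the 22 nt peak mentioned above, and the reads are synthetic; a real pipeline would take lengths from the aligned reads for each reference source.

```python
from collections import Counter


def length_histogram(reads):
    """Map read length (nt) to the number of reads with that length."""
    return Counter(len(read) for read in reads)


def fraction_in_window(reads, low=20, high=24):
    """Fraction of reads whose length falls in [low, high] nt (assumed window)."""
    if not reads:
        return 0.0
    inside = sum(1 for read in reads if low <= len(read) <= high)
    return inside / len(reads)


# Synthetic reads: four miRNA-sized sequences and one longer fragment.
READS = ["A" * 22, "C" * 22, "G" * 21, "T" * 23, "A" * 35]

HISTOGRAM = length_histogram(READS)
for length in sorted(HISTOGRAM):
    print(f"{length:>3} nt {'#' * HISTOGRAM[length]}")
print(fraction_in_window(READS))
```

A library whose histogram piles up near 22 nt, with most reads inside the window, matches the “good library” pattern described above; a broad or shifted distribution suggests degradation products or other artifacts.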

Section 6 presents some of the scale-related challenges that accompany NGS, and how we are addressing them with HDF5 technology. This will be a topic of many more posts in the future.

We close the poster by addressing the challenges listed above with the final points:
  • High performance data management systems are being developed through the BioHDF project and GeneSifter system architectures.
  • The examples show how each application and sequencing platform requires a different data analysis workflow (pipeline). GeneSifter provides a platform to develop and make bioinformatics pipelines and data readily available to communities of biologists.
  • The transcriptome is complex; different libraries of sequence data can be used to filter known sequences (e.g. rRNA) and discover new elements (miRNAs) and isoforms of expressed genes.
  • Within a dataset, read maps, tables, and histogram plots are needed to summarize and understand the kinds of sequences present and how they relate to an experiment.
  • Links to Entrez Gene, the UCSC Genome Browser, and miRBase show how additional information can be integrated into the application framework and used.
  • Next Gen transcriptomics assays are similar to microarray assays in many ways, hence software systems like Geospiza’s GeneSifter are useful for comparative analysis.
You can also get the file, AGBT_2009.pdf

Monday, February 23, 2009

Three Themes from AGBT and ABRF Part III: The IT Problem

The power of Next Generation DNA Sequencing (NGS) technology comes from the fact that a massive amount of data, sampling millions of individual molecules, is collected in a massively parallel format. This power also limits the potential widespread adoption of the technology because of the IT (Information Technology) challenges that result from the massive amount of data created with each sequencer run.

IT challenges form the third technical theme from the AGBT and ABRF conferences. The previous two posts underscored the need for good laboratory practices and rich bioinformatics support to make NGS experiments successful. This post discusses the experiences communicated by the early adopters of NGS technology with respect to the computing infrastructure.

Surprises

Throughout the literature and NGS presentations, the data management issues created by NGS play a central role. Recent editorials in Nature Methods [1] and Nature Biotechnology [2] speak to the problem and express researchers' frustrations in dealing with the lack of IT infrastructures. At the ABRF workshop, we had two presentations specifically focused on the IT challenges, describing two different experiences.

In the first case, the group implementing NGS had a number of surprises after the NGS system was installed and running. They learned that these systems not only require a lot of storage and computing support, they also use up a lot of bandwidth when data are transferred. The bandwidth problem led to the need for a revised network architecture to isolate the NGS data flow from other network activity.

This talk brought similar surprises to mind. In other labs, NGS “surprises” have led to groups needing to upgrade server rooms by installing backup power, air conditioning, and other equipment. Of course these surprises are manageable if you have an IT group and a server room in the first place. In some cases, groups start with even less and find that the IT costs make the NGS endeavor very expensive. Even with support and space, the IT costs for bringing in NGS can quickly grow into six figures (above $100,000) for infrastructure alone.

The second presentation was given by a group that was well prepared for NGS. Their university had made a previous commitment to building an IT infrastructure to support data intensive genomics research, so adding NGS was a step up in their view. Their experience allowed them to develop a strong implementation plan that called for a number of systems upgrades, including new network hardware. While total costs were less than the six figure surprises others experienced, they did spend many tens of thousands of dollars on new file servers, CPUs, network switches, and server room upgrades.

The conclusion from both presentations was that if you are going to set up an NGS infrastructure, three things are important: planning, planning, planning. Also, institutional support is critically important since renovations and new construction may be needed too. Personnel with network, systems administration, and Unix experience are also essential. Finally, as the second speaker put it, you need to encourage researchers to invest in the infrastructure. If they are not involved in the process and contributing time and money, the endeavor can quickly fail.

These talks bring me to my favorite marketing slogan, in which one of Illumina’s customers put an NGS instrument in their mail room. Whenever I hear that, or see the ad, it makes me think, “yes, you can turn a mail room into a genome center, but where will you put the data center?”

There is a solution


For those thinking about NGS technology, or running an NGS experiment where the samples are submitted to a lab and the data returned, even contemplating the IT requirements can be discouraging. But it does not have to be this way. Over the past ten years, an immense infrastructure of data centers has emerged. Today, there are many options and price points available for storage, computing, and backup systems. Groups can save significant time and money using on-line services because costs scale with need. Moreover, on-line services eliminate the need for dedicated systems and data administrators, putting more money in the budget for experiments. You have a choice. Jump in and do some interesting science or work hard to have your campus facilities remodeled.

Geospiza is taking advantage of the Internet’s infrastructure to offer our clients cost effective ways to get NGS running in their lab. GeneSifter Laboratory Edition can be delivered through a SaaS (Software as a Service) model to get labs up and running quickly. Just sign up, get access, and you are ready to go. GeneSifter Analysis Edition solves the IT problem for research groups who get their sequencing done through core labs or other service providers. In these cases, you upload your data and, with a few clicks, process your data and analyze the results. Because the infrastructure is built, overall costs for IT and bioinformatics are much lower, and you do not have to experience a remodeling project.

References
1. 2008. Byte-ing off more than you can chew. Nat Methods 5, 577.
2. 2008. Prepare for the deluge. Nat Biotechnol 26, 1099.

Thursday, February 19, 2009

Three Themes from AGBT and ABRF Part II: The Bioinformatics Bottleneck

In my last post, I summarized the presentations and conversations regarding Next Gen Sequencing (NGS) challenges in terms of three themes: You have to pay attention to details in the laboratory, bioinformatics is a bottleneck, and the IT burden is significant. In that post, I discussed the issues related to the laboratory and how GeneSifter Lab Edition overcomes those challenges.

This post tackles the second theme: the bioinformatics bottleneck.

In the Sanger days, bioinformatics was really a challenge for only the highest throughput facilities like genome centers. In these labs, streamlined workflows (pipelines) were developed for the different kinds of sequencing (genomes, ESTs [expressed sequence tags, 1], SAGE [serial analysis of gene expression, 2]). Because Sanger sequencing was high cost and low throughput compared to NGS, the cost of developing the bioinformatics pipelines was low relative to the cost of collecting the data. Thus, large-scale projects such as whole genome shotgun sequencing, ESTs, or resequencing studies could be supported by a handful of pipelines that changed infrequently. In addition, small-scale Sanger projects could be handled well by desktop software and services like NCBI BLAST.

NGS breaks the Sanger paradigm. A single NGS instrument has the same throughput as an entire warehouse of Sanger instruments. To illustrate this point in a striking way, we can look at dbEST - NCBI’s database of ESTs. Today, there are approximately 59 million reads in this database, representing the total accumulation of sequencing projects over a 10 year period. Considering that one run of an Illumina GA or SOLiD can produce between 80 and 180 million reads in a week or two, a single run can now produce up to about three times as many reads as the ESTs deposited over the past 10 years. These numbers also dwarf the amount of data collected from other gene expression analysis systems like microarrays and sequencing techniques like SAGE.
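The comparison is just a ratio, worked through here in Python using only the figures quoted above (59 million dbEST reads; 80 to 180 million reads per NGS run).

```python
# Figures quoted in the post.
DBEST_READS = 59_000_000
RUN_READS_LOW = 80_000_000
RUN_READS_HIGH = 180_000_000

# How many "dbESTs" one run represents at each end of the range.
LOW_RATIO = RUN_READS_LOW / DBEST_READS
HIGH_RATIO = RUN_READS_HIGH / DBEST_READS

print(f"One run is {LOW_RATIO:.1f}x to {HIGH_RATIO:.1f}x the size of dbEST")
```

Even the low end of the range exceeds a decade of accumulated EST sequencing.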


The emergence of the bioinformatics bottleneck

The bioinformatics bottleneck is related to the fact that NGS platforms are general purpose; they collect sequence data. That’s it. Because they collect a lot of data very quickly, we can use sequences as data points for many different kinds of measurements. When we think this way, an extremely wide range of experiments can be conceived.

From sequencing complete genomes, to sampling genes in the environment, to measuring mutations in cancer, to understanding epigenomics, to measuring gene expression and the transcriptome, NGS applications are proliferating at a rapid pace. However, each experiment requires a specialized bioinformatics pipeline, and the algorithms used within a bioinformatics pipeline must be tuned for the data produced by the different sequencing platforms and for the questions being asked. When these considerations are combined with other issues, like what reference data to use for sequence comparisons, the number of bioinformatics pipelines can grow in a combinatorial fashion.
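A quick Python sketch shows how fast the count grows. The three lists are illustrative examples drawn from applications, platforms, and reference sources mentioned on this blog, not a complete inventory, and not every combination is scientifically meaningful, so the product is an upper bound.

```python
from itertools import product

# Illustrative choices only; real labs will have longer lists.
APPLICATIONS = ["RNA-Seq", "Tag Profiling", "Small RNA", "ChIP-Seq"]
PLATFORMS = ["Illumina GA", "SOLiD"]
REFERENCES = ["RefSeq", "genome", "miRBase"]

# Every (application, platform, reference) combination is a candidate pipeline.
PIPELINES = list(product(APPLICATIONS, PLATFORMS, REFERENCES))

print(len(PIPELINES))
print(PIPELINES[0])
```

Adding one more platform or one more reference source multiplies the total rather than adding to it, which is why a handful of fixed pipelines no longer covers the need.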

The early recommendation is that each lab wanting to do NGS work needs to have a dedicated bioinformatics professional. In more than one talk, presenters even quantified bioinformatics support in terms of FTEs (full time equivalents) per instrument. Bioinformatics is needed in both the sequencing laboratory, to develop and maintain quality control pipelines, and in the research environment, to process (align) the data, mine the output for interesting features, and perform comparative analyses between datasets.

But this won’t work

It is clear that bioinformatics is critical to understanding the data being produced. However, the current recommendation that any group planning NGS experiments should also have a dedicated bioinformatician is impractical for several reasons.

First, the model of a bioinformatician for every lab is simply not scalable. Fundamentally, there are not enough people who understand the science, programming, statistics, and other resources, such as different forms of reference data, algorithms, and data types, needed to make sense of NGS data. We see plenty of evidence, in the literature and presentations, that there are many outstanding people doing this work and contributing to the community. The problem is that they already have jobs!

Even if we consider that the above model is workable, hiring people takes significant time, is expensive, and ongoing costs are going to be high. These time and cost investments only become reasonable when a significant number of experiments are planned. One or two instruments will produce between 25 and 50 runs’ worth of data per year. If you calculate instrument costs, reagents, salary, and overhead costs, you are quickly into many thousands of dollars per sample. Indeed, a theme expressed in the bioinformatics bottleneck is that bioinformatics is becoming the single largest ongoing cost of NGS. Add in the IT computer support (next post) and you had better have a plan for running a lot more than 50 runs per year. Remember the first issue - good bioinformaticians with NGS analysis experience have jobs.

If you have access to bioinformatics support, or can hire an individual, that person will quickly become overwhelmed with work. The biggest reason is that the software infrastructures needed to quickly develop new pipelines, automate them, and deliver data in ways that can be consumed by non-programming scientists are typically lacking. The result is that scientific programming efforts generally turn into lengthy software development projects because, without an infrastructure, the numbers and kinds of experiments quickly grow beyond the capacity of a single individual.

So, what can be done?

Geospiza solves the bioinformatics challenge in multiple ways. GeneSifter Lab and Analysis editions provide a platform that delivers the complete infrastructure needed to deploy NGS data processing pipelines and deliver results through web-based interfaces. These systems include pipelines for many of the common NGS applications such as transcription analysis, small RNA detection, ChIP-Seq and other assays. The system architecture and accompanying API create a framework to quickly add new pipelines and make the results available to biologists running the experiments.

For those with access to bioinformatics help, GeneSifter will make your team more productive because developers will be freed of the burden of having to create the automation and delivery infrastructure, enabling them to focus on new scientific programming problems. For those without access to such resources, we have many pipelines ready to go. Moreover, because we have a platform and the infrastructure already built, as well as deep bioinformatics experience, we can create and deliver new analysis pipelines quickly. Finally, our product development roadmap is well-aligned with the most common NGS assays, which means you can probably do your bioinformatics analysis today!

References: 

1. Adams M.D., Soares M.B., Kerlavage A.R., Fields C., Venter J.C., 1993. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4, 373-380.

2. Velculescu V.E., Zhang L., Vogelstein B., Kinzler K.W., 1995. Serial analysis of gene expression. Science 270, 484-487.

Monday, February 2, 2009

Next Gen Laboratory Software Systems for Core Facilities

Do you have a core lab? Considering adding Next Generation DNA sequencing capacity to your lab? Then you will be interested in visiting our booth and checking out our poster at the annual Association for Biomolecular Research Facilities (ABRF) meeting next week in Memphis, TN. We'll be at booth 408, and presenting poster number V27-S1.

Poster Abstract

Throughout the past year, as next generation sequencing (NGS) technologies have emerged in the marketplace, their promise of what can be done with massive amounts of sequence data has been tempered with the reality that performing experiments and working with the data is extremely challenging. As core labs contemplate acquiring NGS technologies, they must consider how the new technologies will affect their current and future operations. The old model of collecting and delivering data is likely to change to one where the core lab becomes an active participant in advising and helping clients set up experiments and analyze the data. However, while many labs want to utilize NGS, few have the Information Technology (IT) infrastructures and procedures in place to successfully make use of these systems.

In the case of gene expression, NGS technologies are being evaluated as complementary or replacement technologies for microarrays. Assays like RNA-Seq and tag profiling, which focus on measuring relative gene expression, require researchers and core labs to puzzle through a diverse collection of early version algorithms that are combined into complicated workflows with many steps producing complicated file formats. Command line tools such as MAQ, SOAP, MapReads, and BWA have specialized requirements for formatted input and output and leave researchers with large data files that still require additional processing and formatting for tertiary analyses. Moreover, once reads are aligned, datasets need to be visualized and further refined for additional comparative analysis. We present solutions to these challenges by showing results from a complete workflow system that includes data collection, processing, and analysis for RNA-Seq suited for the core laboratory.

In the poster we'll walk through the laboratory and data analysis issues one needs to think about to perform a two cell expression comparison with RNA-Seq. Below is a snippet from the poster. I'll post the full presentation when I return.


Wednesday, January 21, 2009

The Experts Agree

It depends what you are trying to do. That is the take home message in Genome Technology’s (GT) trouble-shooting guide on picking assembly and alignment algorithms for Next-Gen sequence data.

In the guide, the GT team asked nine Next-Gen sequencing and bioinformatics experts to answer six questions:
  1. How do you choose which alignment algorithm to use?
  2. How do you optimize your alignment algorithm for both high speed and low error rate?
  3. What approach do you use to handle mismatches or alignment gaps?
  4. How do you choose which assembly algorithm to use?
  5. Do you use mate-paired reads for de novo assembly? How?
  6. What impact does the quality of raw read data have on alignment or assembly? How do your algorithms enhance this?
Even a quick look at the questions shows us that many factors need to be considered in setting up a Next-Gen sequencing lab. Questions 1 and 4 point out that aligning sequences is different from assembling them. Other questions address issues related to the size of the data sets being compared, the quality of the data being analyzed, the kinds of information that can be obtained, and the computational approaches being used for different problems.

What the experts said

First, they all agree that different problems require different approaches and have different requirements. In the first question about which aligner to use, the most common response was “for what application and which instrument?” Fundamentally, SOLiD data are different from Illumina GA data, which are different from 454 data. While the end results may all be sequences of A's, G's, C's, and T's, the data are derived in different ways because of the platform-specific twists in collecting the data (recall “Color Space, Flow Space, Sequence Space, or Outer Space”). Not only are there platform-specific methods for interpreting raw data, multiple programs have been developed for each instrument with their own strengths and weaknesses in terms of speed, sensitivity, the kinds of data they use (color, base, or flow spaces, quality values, and paired end data), and the information that is finally produced. Hence, in addition to choosing a sequencing platform you also have to think about the sequencing application, or the kind of experiment, that will be performed. In gene expression studies, for example, an RNA-Seq experiment has different requirements in terms of aligning the data and interpreting the output than an experiment with Tag Profiling.

Overall the trouble-shooting guide discussed 17 total algorithms, eight for alignment, and nine for assembly (two of which were for Sanger methods). Even this selection wasn't a comprehensive list. When other sites [1, 2] and articles [3] are included and proprietary methods are factored in, over 20 algorithms are available. So what to do? Which is best?

That depends

Yes, the choice of algorithm ultimately depends on what you are trying to do. While we can agree that there is no best solution, we also know that is not a helpful response. What is needed is a way to test the suitability of different algorithms for different kinds of experiments and to represent data in standard ways so that the features of specific algorithms can be evaluated. Also, as this is a new field, standard requirements for how data should be aligned, defining a correct alignment, and what kinds of information are the most informative in describing alignments are still emerging. Some of the early programs are helping to define these requirements.

One program we've used at Geospiza for identifying requirements is MAQ, a program for sequence alignment. As noted in previous blogs [MAQ attack], MAQ is a great general purpose tool. It provides comprehensive information about the data being aligned and details about alignments. MAQ works well for many applications including RNA-Seq, Tag Profiling, ChIP-Seq, and resequencing assays focused on SNP discovery. In performance tests, MAQ is slower than some of the newer programs, one of which is being developed by MAQ’s author, but MAQ is a good model for getting the right kinds of information, formatted in a decent way. Indeed, MAQ was the most cited program in the GT guide.

Let’s return to the bigger issue. That is, how can we easily compare between algorithms? For that we need a system where one can easily define a standardized dataset and reference sequence, and have a platform where a new algorithm can be added and run from a common interface. Standard reports that present features of the alignments could then be used to compare programs and parameters.
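One way to picture such a platform is a small harness that runs every registered aligner on the same reads and reference, then reports the same summary fields for each. The sketch below is an illustration only: the two "aligners" are toy functions standing in for real programs, and the names `ALIGNERS`, `compare_aligners`, and the report fields are hypothetical rather than part of GeneSifter or any real tool.

```python
from typing import Callable, Dict, List, Optional

# An "aligner" here is any function mapping (read, reference) to a 0-based
# position, or None when the read cannot be placed. Real programs would be
# wrapped by control scripts; these toy functions are stand-ins.
Aligner = Callable[[str, str], Optional[int]]


def exact_match(read: str, reference: str) -> Optional[int]:
    """Place a read only where it matches the reference exactly."""
    pos = reference.find(read)
    return pos if pos >= 0 else None


def one_mismatch(read: str, reference: str) -> Optional[int]:
    """Place a read at the first position with at most one mismatch."""
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= 1:
            return start
    return None


# Adding a new algorithm means adding one entry to this registry.
ALIGNERS: Dict[str, Aligner] = {
    "exact": exact_match,
    "one_mismatch": one_mismatch,
}


def compare_aligners(reads: List[str], reference: str) -> Dict[str, Dict[str, float]]:
    """Run every registered aligner on the same data and build one standard report."""
    report = {}
    for name, aligner in ALIGNERS.items():
        placed = sum(aligner(read, reference) is not None for read in reads)
        report[name] = {
            "reads": len(reads),
            "aligned": placed,
            "fraction_aligned": placed / len(reads) if reads else 0.0,
        }
    return report


# A tiny standardized dataset: two exact reads, one with a mismatch, one unplaceable.
reference = "ACGTACGTTAGCCGATAGGCTA"
reads = ["ACGTACGT", "TAGCCGAT", "TAGCAGAT", "GGGGGGGG"]
report = compare_aligners(reads, reference)
for name, row in report.items():
    print(name, row)
```

Because every program is scored on identical input and described with identical fields, differences in the report reflect the algorithms and their parameters rather than the test data.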

The laboratory edition of GeneSifter supports these kinds of comparisons. The distributed system architecture allows one to quickly develop control scripts to run programs and format their output in figures and tables that make comparisons possible. With this kind of system in place, the challenges move from which program to run and how to run it, to how to get the right kinds of information and best display the data. To address these issues, Geospiza’s research and development team is working on projects focused on using technologies like HDF5 to create scalable standardized data models for storing information from alignment and assembly programs. Ultimately this work will make it easy to optimize Next-Gen sequencing applications and assays and compare between assorted programs.

References
1. http://en.wikipedia.org/wiki/Sequence_alignment_software
2. http://www.massgenomics.org/2009/01/short-read-aligners-update-at-agbt.html
3. Shendure J., Ji H., 2008. Next-generation DNA sequencing. Nat Biotechnol 26, 1135-1145.


Friday, December 12, 2008

Papers, Papers, and more Papers

Next Gen Sequencing is hot, hot, hot! You can tell by the number of papers being published and the frequency with which they appear.

A few posts ago, I wrote about a couple of grant proposals that we were preparing on methods to detect rare variants in cancer and improve the tools and methods to validate datasets from quantitative assays that utilize Next Gen data, like RNA-Seq, ChIP-Seq, or Other-Seq experiments. Besides the normal challenges of getting two proposals written and uploaded to the NIH, there was an additional challenge. Nearly every day, we opened the tables-of-contents in our e-mail and found new papers highlighting Next Gen Sequencing techniques, applications, or biological discoveries made through Next Gen techniques. To date, over 200 Next Gen publications have been produced. During the last two months alone more than 30 papers have been published. Some of these (listed in the figure below) were relevant to the proposals we were drafting.

The papers highlighted many of the themes we've touched on here, including the advantages of Next Gen sequencing and the challenges of dealing with the data. As we are learning, these technologies allow us to explore the genome and genomics of systems biology at significantly higher resolutions than previously imagined. In one of the higher profile efforts, teams at the Washington University School of Medicine and Genome Center compared a leukemia genome to a normal genome using cells from the same patient. This first intra-person whole genome analysis identified acquired mutations in ten genes, eight of which were new. Interestingly, the eight genes have unknown functions and might be important some day for new therapies.

Next Gen technologies are also confirming that molecular biology is more complicated than we thought. For example, the four most recent papers in Science show us that not only is 90% of the genome actively transcribed, but many genes have both sense and anti-sense RNA expressed. It is speculated that the anti-sense transcripts have a role in regulating gene expression. Also, we are seeing that nearly every gene produces alternatively spliced transcripts. The most recent papers indicate that between 92% and 97% of transcripts are alternatively spliced. My guess is that the only genes not alternatively spliced are those lacking introns, like olfactory receptors. However, when alternative transcription starts and alternative polyadenylation sites are considered, we may see that all genes are processed in multiple ways. It will be interesting to see how the products of alternative splicing and anti-sense transcription might interact.

This work has a number of take home messages.
  1. Like astronomy, when we can see deeper we see more. Next Gen technologies are giving us the means to interrogate large collections of individual RNA or DNA molecules and speculate more on functional consequences.
  2. Our limits are our imaginations. The reported experiments have used a variety of creative approaches to study genomic variation, sample expressed molecules from different strands of DNA, and measure protein DNA/RNA interaction.
  3. Good hands do good science. As pointed out in the paper from the Sanger Center on their implementation of Next Gen sequencing, the processes are complex and technically demanding. You need to have good laboratory practices with strong informatics support for all phases (laboratory, data management, and data analysis) of the Next Gen sequencing processes.
The final point is very important and Geospiza’s lab management and data analysis products will simplify your efforts in getting Next Gen systems running to make your major investment pay off and quickly publish results.

To see how, join us for a webinar next Wednesday, Dec. 17 at 10 am PDT, for RNA Expression Analysis with Geospiza.




Sunday, November 9, 2008

Next Gen-Omics

Advances in Next Gen technologies have led to a number of significant papers in recent months, highlighting their potential to advance our understanding of cancer and human genetics (1-3). These and hundreds of other papers demonstrate the value of Next Gen sequencing. The work completed thus far has been significant, but much more needs to be done to make these new technologies useful for a broad range of applications. Experiments will get harder.

While much of the discussion in the press focuses on rapidly sequencing human genomes for low cost as part of the grail of personalized genomics (4), a vast amount of research must be performed at the systems level to fully understand the relationship between biochemical processes in a cell and how the instructions for the processes are encoded in the genome. Systems biology and a plethora of "omics" have emerged to measure multiple aspects of cell biology as DNA is transcribed into RNA and RNA translated into protein and proteins interact with molecules to carry out biochemistry.

As noted in the last post, we are developing proposals to further advance the state-of-the-art in working with Next Gen data sets. In one of those proposals, Geospiza will develop novel approaches to work with data from applications of Next Gen sequencing technologies that are being developed to study the omics of DNA transcription and gene expression.

Toward furthering our understanding of gene expression, Next Gen DNA sequencing is being used to perform quantitative assays where DNA sequences are used as highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data to obtain quantitative information by tabulating the frequency, positional information, and variation from the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or in populations, are compared in different ways to make discoveries and draw experimental conclusions. Recall the three phases of data analysis.
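The tabulation step can be sketched in a few lines. The example assumes alignments have already been reduced to simple (read id, gene, position) records; that record layout, the gene names, and the sample names are hypothetical and chosen only to illustrate the idea of turning aligned reads into a comparable data table.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical, simplified alignment record: (read id, gene hit, start position).
Alignment = Tuple[str, str, int]


def count_reads_per_gene(alignments: Iterable[Alignment]) -> Counter:
    """Tabulate how many aligned reads fall on each gene."""
    return Counter(gene for _read_id, gene, _pos in alignments)


def build_count_table(samples: Dict[str, List[Alignment]]) -> Dict[str, Dict[str, int]]:
    """Combine per-sample counts into one gene-by-sample table."""
    counts = {name: count_reads_per_gene(alns) for name, alns in samples.items()}
    genes = sorted(set().union(*counts.values()))
    # A Counter returns 0 for genes a sample never hit, so the table has no gaps.
    return {gene: {name: counts[name][gene] for name in samples} for gene in genes}


samples = {
    "control": [("r1", "GENE_A", 10), ("r2", "GENE_A", 55), ("r3", "GENE_B", 7)],
    "treated": [("r4", "GENE_A", 12), ("r5", "GENE_C", 3), ("r6", "GENE_C", 40)],
}
table = build_count_table(samples)
for gene, row in table.items():
    print(gene, row)
```

Each row of the resulting table is one gene and each column one sample, which is the shape needed for the comparisons between treatments described above.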

However, to be useful these data sets need to come from experiments that measure what we think they should measure. The data must be high quality and free of artifacts. In order to compare quantitative information between samples, the data sets must be refined and normalized so that biases introduced through sample processing are accounted for. Thus, a fundamental challenge to performing these kinds of experiments is working with the data sets that are produced. In this regard numerous challenges exist.
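As a small illustration of normalization, the sketch below scales raw counts by each sample's total number of aligned reads (counts per million). This is one simple and widely used option, shown under the assumption that sequencing depth is the only bias; it is not necessarily the method used in any Geospiza pipeline, and the gene names and counts are made up.

```python
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw read counts so samples sequenced to different depths are comparable."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


# The treated sample was sequenced three times deeper but has the same proportions.
control = {"GENE_A": 200, "GENE_B": 800}
treated = {"GENE_A": 600, "GENE_B": 2400}

print(counts_per_million(control))
print(counts_per_million(treated))
```

Without the scaling step, GENE_A would appear three times more abundant in the treated sample even though its share of the reads is unchanged.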

The obvious ones relating to data storage and bioinformatics are being identified in both the press and scientific literature (5,6). Other, less published, issues include a lack of:
  • standard methods and controls to verify datasets in the context of their experiments,
  • standardized ways to describe experimental information and
  • standardized quality metrics to compare measurements between experiments.
Moreover, data visualization tools and other user interfaces, if available, are primitive and significantly slow the pace at which a researcher can work with the data. Finally, information technology (IT) infrastructures that can integrate the system parts dealing with sample tracking, experimental data entry, data management, data processing and result presentation are incomplete.

We will tackle the above challenges by working with the community to develop new data analysis methods that can run independently and within Geospiza's FinchLab. FinchLab handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible order interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. Geospiza's hosted (Software as a Service [SaaS]) delivery models remove additional IT barriers.

FinchLab's data management and analysis server make the system scalable through a distributed architecture. The current implementation of the analysis server creates a complete platform to rapidly prototype new data analysis workflows and will allow us to quickly devise and execute feasibility tests, experiment with new data representations, and iteratively develop the needed data models to integrate results with experimental details.

References

1. Ley, T. J., Mardis, E. R., Ding, L., Fulton, B., et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66-72 (2008).

2. Wang, J., Wang, W., Li, R., Li, Y., et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65 (2008).

3. Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-59 (2008).

4. My genome. So what? Nature 456, 1 (2008).

5. Prepare for the deluge. Nature Biotechnology 26, 1099 (2008).

6. Byte-ing off more than you can chew. Nature Methods 5, 577 (2008).


Wednesday, October 22, 2008

Journal Club: Focus on Next Gen Sequencing

Yesterday I received my issue of Nature Biotechnology. This month it features Next-Generation (Next Gen) Sequencing. One editorial, one profile, three news features, a commentary, two perspectives, and two reviews discuss the origins, trials, tribulations and what’s coming next in Next Gen. For now, I'll focus on the editorial.

Bioinformatics is a big big issue

“If the data problem is not addressed, ABI’s SOLiD, 454’s GS FLX, Illumina’s GAII or any of the other deep sequencing platforms will be destined to sit in their air-conditioned rooms like a Stradivarius without a bow” was the closing statement in the lead editorial “Prepare for the deluge.”

Reminds me of something I said a few months back.

In the editorial, Nature Biotechnology (NBT) makes a number of important points, starting with how the Roche/454 pyrosequencer, launched in 2005, could generate as much data as more than 50 ABI capillary sequencers. Since that launch, we have seen new instruments emerge that increase the amount of data produced by orders of magnitude. Or as NBT put it, “The overwhelming amounts of data being produced are the equivalent of taking a drink from a fire hose.”

It's like they read our web site (we ran the image below at the beginning of the year).


The volumes of data and new ways in which it must be worked with are creating many challenges. To begin, there is the conundrum of what to keep; do you keep raw images and processed reads? Or do you just keep the reads? If you keep raw images, the costs are significant. The cost of storing all that information must be considered in the context of the likelihood of whether you will ever need to go back to these data. We call this the data life cycle.

From raw images, the next challenge is the computational infrastructure needed to process reads and obtain meaningful information. This is a complex process that involves many steps and high performance computers. NBT made the accurate and important point that the instrument manufacturers only provide the software to analyze what comes off of the machine for common applications. A great deal of bioinformatics support is needed for downstream analysis once the initial data alignments or assemblies are completed. Also, standards for comparing data between instrument platforms are lacking. This makes it difficult to compare results from different instruments.

While more is needed in terms of bioinformatics support, being able to get tools for alignment and assembly is a good starting point and NBT lauded ABI’s SOLiD community program as a step in the right direction. This kind of approach is also needed by the other instrument vendors. Presently Illumina and Roche include their tools with an instrument purchase. This is fine for the laboratory, but it makes a hard problem harder for any researchers who might be getting data sets from different labs. This could lead to threads of frustration.

As the article continued, the "overwhelmed" scale increased to dire.

NBT stated:
“What all of this means is that for the foreseeable future, next-generation sequencing platforms may remain out of the hands of labs lacking the deep pockets needed for bioinformatics support.”
They also added,
“Thus, if the next-generation platforms are to truly democratize sequencing—bringing genomics out of the main sequencing centers and into the laboratories of single investigators or small academic consortia—much more effort needs to be expended in developing cost-effective software and data management solutions.”

NBT offered some solutions, including getting the instrument vendors to develop community based solutions, and encouraging the grant funding organizations to fund bioinformatics as much as they fund sequencing.

Is Next Gen for everyone?


The NBT editors made a lot of great points, but we do not see the world in as dire terms as they do. Yes, a great challenge to Next Gen and getting up and running with this equipment includes preparing for the informatics challenges that await. Next Gen is not Sanger. You cannot look at every read to figure out what your data mean and you will need a serious computational infrastructure to store, organize and work with the data. Also, not mentioned in the article, but incredibly important, you will need a laboratory information management system to organize your experimental information and track the many steps needed to prepare good DNA libraries for sequencing.

And, there are solutions.

Geospiza’s FinchLab combined with our Software as a Service (SaaS) delivery, provides immediate access to the necessary software and hardware infrastructure to run these new instruments.

FinchLab delivers the software infrastructure to support laboratory workflows for all the platforms, links the resulting data to samples, and - through a growing list of data analysis pipelines and visualization interfaces - provides the necessary bioinformatics for a wide range of sequencing applications. Further, our bioinformatics approach is community-based. We are working with the best tools as they emerge and are collaborating with multiple groups to advance additional research and development.

SaaS delivers the computing infrastructure on demand. With our SaaS model, the computer infrastructure is always available and grows with your needs. You do not have to set up a large computer system, or build a new building, or risk over or under investing to deal with the data.

With FinchLab, the vision of next-generation platforms truly democratizing sequencing can be realized.



Wednesday, October 8, 2008

Road Trip: AB SOLiD Users Meeting

Wow! That's the best way to summarize my impressions from the Applied Biosystems (AB) SOLiD users conference last week, when AB launched their V3 SOLiD platform. AB claims that this system will be capable of delivering a human genome's worth of data for about $10,000 US.

Last spring, the race to the $1000 genome leaped forward when AB announced that they sequenced a human genome at 12-fold coverage for $60,000. When the new system ships in early 2009, that same project can be completed for $10,000. Also, this week others have claimed progress towards a $5000 human genome.

That's all great, but what can you do with this technology besides human genomes?

That was the focus of the SOLiD users conference. For a day and a half, we were treated to presentations from scientists and product managers from AB as well as SOLiD customers who have been developing interesting applications. Highlights are described below.

Technology Improvements:

Increasing Data Throughput - Practically everyone is facing the challenge of dealing with large volumes of data, and now we've learned the new version of the SOLiD system will produce even more. A single instrument run will produce between 125 million and 400 million reads depending on the application. This scale up is achieved by increasing the bead density on a slide, dropping the overall cost per individual read. Read lengths are also increasing, making it possible to get between 30 and 40 gigabases of data from a run. And, the amount of time required for each run is shrinking; not only can you get all of these data, you can do it again more quickly.

Increasing Sample Scale - Many people like to say, yes, the data is a problem, but at least the sample numbers are low, so sample tracking is not that hard.

Maybe they spoke too soon.

AB and the other companies with Next Gen technologies are working to deliver "molecular barcodes" that allow researchers to combine multiple samples on a single slide. This is called "multiplexing." In multiplexing, the samples are distinguished by tagging each one with a unique sequence, the barcode. After the run, the software uses the sequence tags to sort the data into their respective data sets. The bottom line is that we will go from a system that generates a lot of data from a few samples, to a system that generates even more data from a lot of samples.
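The sorting step can be sketched as follows. For simplicity, the example assumes the barcode is the first four bases of each read and must match exactly; real kits define their own barcode sequences, lengths, and read layout, so the barcodes, sample names, and `demultiplex` function here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical barcodes; real kits define their own sequences and positions.
BARCODES = {"ACGT": "sample_1", "TGCA": "sample_2"}
BARCODE_LENGTH = 4


def demultiplex(reads: Iterable[str]) -> Dict[str, List[str]]:
    """Sort reads into per-sample bins using the barcode at the start of each read."""
    bins: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        tag, insert = read[:BARCODE_LENGTH], read[BARCODE_LENGTH:]
        sample = BARCODES.get(tag, "unassigned")
        # Unassigned reads keep their full sequence so the unknown tag can be inspected.
        bins[sample].append(insert if sample != "unassigned" else read)
    return dict(bins)


reads = ["ACGTTTGACC", "TGCAGGATCC", "ACGTCCCAAA", "NNNNACGTAC"]
bins = demultiplex(reads)
for sample, sequences in bins.items():
    print(sample, sequences)
```

Keeping an "unassigned" bin matters in practice: its size is a quick check on barcode quality, and every sample added to a slide is one more data set the tracking system has to follow.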

Science:

What you can do with hundreds of millions of reads: On the science side, there were many good presentations that focused on RNA-Seq and variant detection using the SOLiD system. Of particular interest was Dr. Gail Payne's presentation on the work, recently published in Genome Research, entitled "Whole Genome Mutational Profiling Using Next Generation Sequencing Technology." In the paper, the 454, Illumina, and SOLiD sequencing platforms were compared for their abilities to accurately detect mutations in a common system. This is one of the first head to head to head comparisons to date. Like the presidential debates, I'm sure each platform will be claimed to be the best by its vendor.

From the presentation and paper, the SOLiD platform does offer a clear advantage in its total throughput capacity. 454 showed the long read advantage in that approximately 1.5% more of the yeast genome studied was covered by 454 data than with shorter read technology. And, the SOLiD system, with its dibase (color space) encoding, seemed to provide higher sequence accuracy. When the reads were normalized to the same levels of coverage, a small advantage for SOLiD can be seen.
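For readers who have not met color space, the dibase idea can be illustrated briefly. Each color describes a pair of adjacent bases, so a true single-base change alters two adjacent colors, while an isolated measurement error alters only one. The sketch uses the usual two-bit representation (A=0, C=1, G=2, T=3, with each color being the XOR of neighboring codes). It leaves out the known primer base that starts a real SOLiD read, and the function name and sequences are ours.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_color_space(sequence: str) -> list:
    """Encode a base sequence as dibase colors (one color per adjacent base pair)."""
    codes = [BASE_CODE[base] for base in sequence]
    return [a ^ b for a, b in zip(codes, codes[1:])]


reference = "ATGGCA"
variant = "ATGACA"  # single base change at index 3 (G -> A)

ref_colors = to_color_space(reference)
var_colors = to_color_space(variant)

# Indices of colors that differ between the two encodings.
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]
print(ref_colors)
print(var_colors)
print(changed)
```

The single substitution changes the two colors that overlap the altered base, which is the signature alignment software can use to separate real variants from single-color errors.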

When false positive rates of mutation detection were compared, SOLiD had zero for all levels of coverage (6x, 8x, 10x, 20x, 30x, 175x [full run of two slides]), Illumina had two false positives at 6x and 13x, and zero false positives for 19x and 44x (full run of one slide) coverage, and 454 had 17, six, and one false positive for 6x, 8x, and 11x (full run) coverage, respectively.

In terms of false negative (missed) mutations, all platforms did a good job. At coverages above 10x, none of the platforms missed any mutations. The 454 platform missed a single mutation at 6x and 8x coverage and Illumina missed two mutations at 6x coverage. SOLiD, on the other hand, missed four and five at 8x and 6x coverage, respectively.

What was not clear from the paper and data, was the reproducibility of these results. From what I can tell, single DNA libraries were prepared and sequenced; but replicates were lacking. Would the results change if each library preparation and sequencing process was repeated?

Finally, the work demonstrates that it is very challenging to perform a clean "apples to apples" comparison. The 454 and Illumina data were aligned with Mosaik and the SOLiD data were aligned with MapReads. Since each system produces different error profiles and the different software programs each make different assumptions about how to use the error profiles to align data and assess variation, the results should not be overinterpreted. I do, however, agree with the authors that these systems are well-suited for rapidly detecting mutations in a high throughput manner.

ChIP-Seq / RNA-Seq: On the second day, Dr. Jessie Gray presented work on combining ChIP-Seq and RNA-Seq to study gene expression. This is important work because it illustrates the power of Next Gen technology and creative ways in which experiments can be designed.

Dr. Gray's experiment was designed to look at this question: When we see that a transcription factor is bound to DNA, how do we know if that transcription factor is really involved in turning on gene expression?

ChIP-Seq allows us to determine where different transcription factors are bound to DNA at a given time, but it does not tell us whether that binding event turned on transcription. RNA-Seq tells us if transcription is turned on, after a given treatment or point in time, but it doesn't tell us which transcription factors were involved. Thus, if we can combine ChIP-Seq and RNA-Seq measurements, we can elucidate a cause and effect model and find where a transcription factor is binding and which genes it potentially controls.
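In its simplest form, combining the two measurements is an intersection: keep the genes that have a binding peak near their start site and that also show increased expression. The sketch below is a toy version of that idea and not Dr. Gray's analysis; the coordinates, gene names, 1,000-base window, and two-fold threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# All names, coordinates, and thresholds below are made up for illustration.
Peak = Tuple[str, int]        # (chromosome, peak position) from ChIP-Seq
Gene = Tuple[str, str, int]   # (gene name, chromosome, transcription start site)


def genes_near_peaks(peaks: List[Peak], genes: List[Gene], window: int = 1000) -> Set[str]:
    """Return genes whose start site lies within `window` bases of a binding peak."""
    bound = set()
    for name, chrom, tss in genes:
        if any(c == chrom and abs(pos - tss) <= window for c, pos in peaks):
            bound.add(name)
    return bound


def candidate_targets(peaks: List[Peak], genes: List[Gene],
                      fold_change: Dict[str, float], min_fold: float = 2.0) -> List[str]:
    """Genes that are both bound (ChIP-Seq) and induced (RNA-Seq)."""
    bound = genes_near_peaks(peaks, genes)
    return sorted(g for g in bound if fold_change.get(g, 0.0) >= min_fold)


peaks = [("chr1", 5200), ("chr2", 90000)]
genes = [("GENE_A", "chr1", 5000), ("GENE_B", "chr1", 50000), ("GENE_C", "chr2", 90500)]
fold_change = {"GENE_A": 4.0, "GENE_B": 3.5, "GENE_C": 1.1}  # treated vs. untreated

targets = candidate_targets(peaks, genes, fold_change)
print(targets)
```

In this toy data, GENE_B is induced but not bound and GENE_C is bound but not induced, so only GENE_A survives as a candidate target. Neither assay alone would have singled it out.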

This might be harder than it sounds:

As I listened to this work, I was struck by two challenges. On the computational side, one has to not only think about how to organize and process the sequence data into alignments and reduce those aligned datasets into organized tables that can be compared, but also how to create the right kind of interfaces for combining and interactively exploring the data sets.

On the biochemistry side, the challenges presented with ChIP-Seq reminded me of the old adage of trying to purify disappearase - "the more you purify the less there is." ChIP-Seq and other assays that involve multiple steps of chemical treatments and purification produce vanishingly small amounts of material for sampling. The latter challenge complicates the first challenge, because in systems where one works with "invisible" amounts of DNA, a lot of creative PCR, like "in gel PCR," is required to generate sufficient quantities of sample for measurement.

PCR is good for many things, including generating artifacts. So, the computation problem expands. A software system that generates alignments, reduces them to data sets that can be combined in different ways, and provides interactive user interfaces for data exploration, must also be able to understand common artifacts so that results can be quality controlled. Data visualizations must also be provided so that researchers can distinguish biological observations from experimental error.

These are exactly the kinds of problems that Geospiza solves.


Thursday, September 18, 2008

Road Trip: 454 Users Conference

Quiz: What can sequence small genomes in a single run? What can more than double or triple the EST database for any organism?
Answer: The Roche (454) Genome Sequencer FLX™ System.

Last week I had the pleasure of attending the Roche 454 users conference where the new release (Titanium) of the 454 sequencer was highlighted. This upgrade produces more, longer reads so that more than 600 million bases can be generated in each run. When compared to previous versions, the FLX Titanium produces about five times more data. The conference was well attended and outstanding with informative presentations on science, technology, and practical experiences.

In the morning of the first full day, Bill Farmerie, from the University of Florida, presented on how he got into DNA sequencing as a service and how he sees Next Gen sequencing changing the core lab environment. Back in 1998 he set out to establish a genomics service and talked to many groups about what to do. They told him two important things:
  1. "Don't sweat the sequencing part - this is what we are trained for."
  2. "Worry about information management - this we are not trained for."
From here, he discussed how Next Gen got started in his lab and related his experiences over the past three years and made these points:
  • The first two messages are still true. Sequencing gets solved, the problem is informatics.
  • DNA sequencing is expanding, more data are being produced faster at lower costs.
  • This is democratizing genomics - many groups now have access to high throughput technology that provides "genome center" capabilities.
  • The next bioinformatics challenge is enabling the research community, the groups with the sequencing projects, to make use of their data and information. This is not like Sanger; core labs need to deliver results with data.
  • The way to approach new problems and increase scale is to relieve bioinformatics staff of the burden of doing routine things so they can focus on developing novel applications.
  • To accomplish the above point, buy what you can and build what you have to.
Other speakers made similar points. The informatics challenge begins in the lab, but quickly becomes a major problem for the end researcher.

Bill has been following his points successfully for many years now. We started working with him on his first genomics service and continue to support his lab with Next Gen. Our relationship with Bill and his group has been a great experience.

Other highlights from the meeting included:

A talk on continuous process improvements in DNA sequencing at the Broad Institute. Danielle Perrin presented work on how the Broad tackles process optimization issues during production to increase throughput, decrease errors, or save costs. From my perspective, this presentation really stressed the importance of coupling laboratory management with data analysis.

Multiple talks on microbial genomics. A strength of the 454 platform is how it generates long reads making this a platform of choice for sequencing smaller genomes and performing metagenomic surveys. We were also introduced to the RAST (Rapid Annotation using Subsystem Technology) server, an ideal tool for working with your completed genome or metagenome data set.

Many examples of how having millions of reads makes new gene expression and variation analysis discoveries possible when compared to other platforms like microarrays. In these talks speakers were occasionally asked which is better, long 454 reads or short reads from Illumina or SOLiD? The speakers typically said you need both; they complement each other.

The Woolly Mammoth. Stephan Schuster from Penn State presented his and his colleagues' work on sequencing mammoth DNA and its relatedness over thousands of years. Next Gen is giving us a new "omics," Museomics.

And, of course, our poster demonstrating how FinchLab provides an end-to-end workflow solution for 454 DNA sequencing. In the poster (you have to click the image to get the BIG picture), we highlighted some new features coming out at the end of the month. These include the ability to collect custom data during lab processing, coupling Excel to FinchLab forms, and work on 454 data analysis. Now you will be able to enter the bead counts, agarose images, or whatever else you need to track lab details to make those continuous process improvements. Excel coupling makes data entry through FinchLab forms even easier. The 454 data analysis complements our work with Sanger, SOLiD, and Illumina data to make the FinchLab platform complete for any genomics lab.


Thursday, September 4, 2008

The Ends Justify the DNA

In Next Gen experiments, libraries of DNA fragments are created in different ways, from different samples, and sequenced in a massively parallel format. The preparation of libraries is a key step in these experiments. Understanding and validating the results requires knowing how the libraries were created and where the samples came from.

Background

In the last post, I introduced the concept that nearly all Next Gen sequencing applications are fundamentally quantitative assays that utilize DNA sequences as data points.

In Sanger sequencing, the new DNA molecules are synthesized, beginning at a single starting point determined by the primer. If the sequencing primer binds to heterogeneous molecules that contain the same binding site, for example, two slightly different viruses in a mixed population, a single read from Sanger sequencing could represent a mixture of many different molecules in the population, with multiple bases at certain positions. Next Gen sequencing, on the other hand, produces single reads from single individual molecules. This difference between the two methods allows one to simultaneously collect millions of sequence reads in a massively parallel format from single samples.

An additional benefit of massively parallel sequencing is that it eliminates the need to clone DNA, or create numerous PCR products. Although this change reduces the complexity of tracking samples, it increases the need to track experiments with greater detail and think about how we work with the data, how we analyze the data, and how we validate our observations to generate hypotheses, make discoveries, and identify new kinds of systematic artifacts.

Making Libraries

To better understand the significance of what a Next Gen experiment measures, we need to understand what DNA libraries are and how they are prepared. For this discussion we'll define a DNA library as a random collection of DNA molecules (or fragments) that can be separated and identified.

Before we do any kind of Next Gen experiment, we want to know something about the kinds of results we’d expect to see from our library. To begin, let’s consider what we would see from a genomic library consisting of EcoRI restriction fragments. If the digestion is complete, EcoRI will cut DNA between a G and an A every time it encounters the sequence: 5'-GAATTC-3'. Every fragment in this library would have the sequence 5'-AATT-3' at every 5’ end. If the four bases occur with equal frequency, a six-base site appears once every 4^6 bases on average, so the average length of the fragments will be 4096 bases (~4 kbp). However, the distribution of fragment lengths follows Poisson statistics [1], so the actual library will have a few very large fragments (>> 4 kbp) and numerous small fragments.

You may ask “why is this useful?”

Our EcoRI library example helps us to think about our expectations for Next Gen experimental results. That is, if we collect 10 million reads from a sample, what should we expect to see when we compare our data to reference data? We need to know what kinds of results to expect in order to determine if our data represent discoveries, or artifacts. Artifacts can be introduced during sample preparation, sample tracking, library preparation, or from the data collection instruments. If we can’t distinguish between artifacts and discoveries, the artifacts will slow us down and lead to risky publications.

In the case of our EcoRI digest, we can use our predictions to validate our results. A genome of about 3 billion bases cut once every 4096 bases on average would yield an estimated 732,000 fragments. If we collected sequences from those fragments and aligned the resulting data back to a reference genome, we would expect to see blocks of aligned reads at every one of the 732,000 restriction sites. Further, for each site there should be two blocks, one showing matches to the "forward" strand and one showing matches to the "reverse" strand.
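
The arithmetic behind these expectations can be sketched in a few lines of Python. This is only an illustration under simple assumptions that are not stated in the post: all four bases are equally likely at each position, and the genome is about 3 billion bases long (a size that happens to be consistent with the 732,000 figure). The function names are made up for this example.

```python
# Expected EcoRI digest statistics under a simple random-sequence model.
# Assumptions (for illustration only): equal base frequencies and a
# hypothetical genome size of 3 billion bases.

SITE = "GAATTC"
GENOME_SIZE = 3_000_000_000  # hypothetical genome size in bases


def expected_fragment_length(site: str) -> int:
    """Mean spacing between sites when each base has probability 1/4."""
    return 4 ** len(site)


def expected_fragment_count(genome_size: int, site: str) -> int:
    """Approximate number of fragments from a complete digest."""
    return genome_size // expected_fragment_length(site)


def find_sites(sequence: str, site: str = SITE) -> list:
    """Return the 0-based start position of every occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


print(expected_fragment_length(SITE))              # 4096
print(expected_fragment_count(GENOME_SIZE, SITE))  # 732421
print(find_sites("AAGAATTCGGGAATTCTT"))            # [2, 10]
```

The same find_sites helper could be run over a reference sequence to list the positions where blocks of aligned reads are expected.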

We could also validate our data set by identifying the positions of EcoRI restriction sites in our reference data. What we'd likely see is that most things work perfectly. In some cases, however, we would also see alignments, but no evidence of a restriction site. In other cases, we would see a restriction site in the reference genome, but no alignments. These deviations would identify differences between the reference sequence and the sequence of the genome we used for the experiment. Those differences could either result from errors in the sequence of the reference data or a true biological difference. In the latter case, we would examine the bases and confirm the presence of a restriction fragment length polymorphism (RFLP). From this example, we can see how we can define the expected results, and use that prediction to validate our data and determine whether our results correspond to interesting biology or experimental error.

Digital Gene Expression

Of course what we expect to see in the data is a function of the kind of experiment we are trying to do. To illustrate this point I'll compare two different kinds of Next Gen experiments that are both used to measure gene expression: Tag Profiling and RNA-Seq.

In Tag Profiling, mRNA is attached to a bead, converted to cDNA, and digested with restriction enzymes. The single fragments that remain attached to the beads are isolated and ligated to adaptor molecules, each one containing a type II restriction site. The fragments are further digested with the type II restriction enzyme and ligated to a sequencing adaptor to create a library of cDNA ends with 17 unique bases, or tags. Sequencing such a library will, in theory, yield a collection of reads that represents the population of RNA molecules in the starting material. Highly expressed genes will be represented by a larger number of tagged sequences than genes expressed at lower levels.

Both Tag profiling and RNA-Seq begin with an mRNA purification step, but after that point the procedures differ. Rather than synthesize a single full-length cDNA for every transcript, RNA-Seq uses random six-base primers to initiate cDNA synthesis at many different positions in each RNA molecule. Because these primers represent every combination of six base sequences, priming with these sequences produces a collection of overlapping cDNA molecules. Starting points for DNA synthesis will be randomly distributed, giving high sequence coverage for each mRNA in the starting material. Like Tag Profiling, genes expressed at high levels will have more sequences present in the data than genes expressed at low levels. Unlike Tag Profiling, any single transcript will produce several cDNAs aligning at different locations.

When the sequence data sets for Tag Profiling and RNA-seq are compared, we can see how the different methods for preparing the DNA libraries contrast with one another. In this example, Tag Profiling [2] and RNA-seq [3] data sets were aligned to human mRNA reference sequences (RefSeq, NCBI). The data were processed with Maq [4] and results displayed in FinchLab. In both cases, relative gene expression can be estimated by the number of sequences that align. If we know the origins of the libraries, the kinds of genes and their expression can give us confidence that the results fit the expression profile we expect. For example, the RNA-seq data set is from mouse brain and we see genes at the top of the list that we expect to be expressed in this kind of tissue (last figure below).

The Tag Profiling and RNA-seq data sets also show striking differences that reflect how the libraries are prepared. In each report, the second column gives information about the distribution of alignments in the reference sequence. For Tag Profiling this is reported as "Tags." The number of Tags corresponds to the number of positions along the reference sequence where the tagged sequences align. In an ideal system, we would expect one tag per molecule of RNA. Next Gen experiments, however, are very sensitive, so we can also see tags for incomplete digests. Additionally, sequencing errors and high mismatch tolerance in the alignments can sometimes place reads incorrectly and give unusually high numbers of tags. When the data are more closely examined, we do see that the distribution of alignments follows our expectations more closely. That is, we generally see a high number of reads at one site, with the other tag sites showing a low number of aligned reads.
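
To make the "Tags" idea concrete, here is a minimal sketch of how tag sites could be summarized from alignment results. It is not how FinchLab computes its report; the input layout (one reference id and start position per aligned read) and the reference names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def summarize_tags(alignments):
    """Summarize tag sites per reference sequence.

    alignments: iterable of (reference_id, start_position) pairs,
    one pair per aligned read (a hypothetical input layout).
    """
    per_ref = defaultdict(Counter)
    for ref, pos in alignments:
        per_ref[ref][pos] += 1

    summary = {}
    for ref, sites in per_ref.items():
        total = sum(sites.values())
        top_pos, top_count = sites.most_common(1)[0]
        summary[ref] = {
            "reads": total,                     # all reads aligned to this reference
            "tags": len(sites),                 # distinct alignment positions
            "top_site": top_pos,                # position with the most reads
            "top_fraction": top_count / total,  # share of reads at that position
        }
    return summary


# Hypothetical data: one dominant tag site plus two minor sites.
alignments = (
    [("NM_A", 1520)] * 8
    + [("NM_A", 310), ("NM_A", 975)]
    + [("NM_B", 42)] * 3
)
report = summarize_tags(alignments)
print(report["NM_A"])
```

A reference with many tags but a high top_fraction fits the expectation described above: most reads fall at one site, with a few reads scattered across the others.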


For RNA-seq, on the other hand, we display the second column (Read Map) as an alignment graph. For RNA-seq data, we expect that the number of alignment start points will be very high and randomly distributed throughout the sequence. We can see that this expectation matches our results by examining the thumbnail plots. In the Read Map graphs, the x-axis represents the gene length and the y-axis is the base density. Presently, all graphs have their data plotted on a normalized x-axis, so the length of an mRNA sequence corresponds to the density of data points in the graph. Longer genes have points that are closer together. You can also see gaps in the plots; some are internal and many are at the 3'-end of the genes. When the alignments are examined more closely, and we incorporate our knowledge of the exon structure or polyA addition sites, we can see that many of these gaps either show potential sites for alternative splicing or data annotation issues.
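
One way to picture a read map on a normalized x-axis is to bin alignment start positions by their relative position along the transcript. The sketch below is a simplification: it counts start points rather than per-base density, and the bin count and coordinates are invented for illustration.

```python
def read_map(start_positions, gene_length, bins=10):
    """Bin 0-based alignment start positions onto a normalized x-axis.

    Each bin covers an equal fraction of the transcript, so transcripts
    of different lengths produce plots of the same width.
    """
    density = [0] * bins
    for pos in start_positions:
        index = min(pos * bins // gene_length, bins - 1)
        density[index] += 1
    return density


def empty_bins(density):
    """Return the indexes of bins with no alignment starts (candidate gaps)."""
    return [i for i, count in enumerate(density) if count == 0]


# Hypothetical start positions on a 1000-base transcript.
starts = [0, 50, 120, 130, 480, 510, 900, 950]
density = read_map(starts, gene_length=1000)
print(density)              # [2, 2, 0, 0, 1, 1, 0, 0, 0, 2]
print(empty_bins(density))  # [2, 3, 6, 7, 8]
```

Empty bins only flag regions worth a closer look; deciding whether a gap reflects splicing, annotation, or low coverage still requires examining the alignments.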


In summary, Next Gen experiments use DNA sequencing to identify and count molecules, from libraries, in a massively parallel format. The preparation of the libraries allows us to define expected outcomes for the experiment and choose methods for validating the resulting data. FinchLab makes use of this information to display data in ways that make it easy to quickly observe results from millions of sequence data points. With these high-level views and links to drill down reports and external resources, FinchLab provides researchers with the tools needed to determine whether their experiments are on track to creating new insights, or if new approaches are needed to avoid artifacts.

References

[1] The distribution of restriction enzyme sites in Escherichia coli. G A Churchill, D L Daniels, and M S Waterman. Nucleic Acids Res. 1990 February 11; 18(3): 589–597.

[2] Tag Profile dataset was obtained from Illumina.

[3] Mapping and quantifying mammalian transcriptomes by RNA-Seq. A Mortazavi, BA Williams, K McCue, L Schaeffer, B Wold. Nat Methods. 2008 Jul;5(7):621-8. Epub 2008 May 30.
Data available at: http://woldlab.caltech.edu/rnaseq/

[4] Mapping short DNA sequencing reads and calling variants using mapping quality scores. H Li, J Ruan, R Durbin. Genome Res. 2008 Aug 19. [Epub ahead of print]


Tuesday, August 26, 2008

Maq in the Literature

Kudos to Heng Li and team at the Sanger Center. Today Genome Research published their paper on Maq. Maq (Mapping and Assembly with Quality) is an algorithm, developed at the Sanger Center, for assembling Next Gen reads to a reference sequence. MassGenomics sums up why they like Maq and we could not agree more. I also agree that Maq is a better name than mapASS.

One of the things we like best is how versatile the program is for Next Gen applications. Whether you are performing Tag Profiling, ChIP-Seq, RNA-Seq (transcriptome analysis), resequencing, or other applications, its output contains a wide variety of useful information as we will show in coming posts. If you want to know right now, give us a call and we'll show you why Geospiza, Sanger, Washington University and many others think Maq is a great place to start working with Next Gen data.


Wednesday, August 20, 2008

Next Gen DNA Sequencing Is Not Sequencing DNA

In the old days, we used DNA sequencing primarily to learn about the sequence and structure of a cloned gene. As the technology and throughput improved, DNA sequencing became a tool for investigating entire genomes. Today, with the exception of de novo sequencing, Next Gen sequencing has changed the way we use DNA sequences. We're no longer looking for new DNA sequences. We're using Next Gen technologies to perform quantitative assays with DNA sequences as the data points. This is a different way of thinking about the data and it impacts how we think about our experiments, data analysis, and IT systems.

In de novo sequencing, the DNA sequence of a new genome, or genes from the environment is elucidated. De novo sequencing ventures into the unknown. Each new genome brings new challenges with respect to interspersed repeats, large segmented gene duplications, polyploidy and interchromosomal variation. The high redundancy samples obtained from Next Gen technology lower the cost and speed this process because less time is required for getting additional data to fill in gaps and finish the work.

The other ultra high throughput DNA sequencing applications, on the other hand, focus on collecting sequences from DNA or RNA molecules for which we already have genomic data. Generally called "resequencing," these applications involve collecting and aligning sequence reads to genomic reference data. Experimental information is obtained by tabulating the frequency, positional information, and variation of the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or in populations, are compared in different ways to make discoveries and draw conclusions.
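
As a rough illustration of tabulating read frequencies and comparing samples, the sketch below counts aligned reads per reference and scales the counts to reads per million so that samples with different read totals can be compared. The normalization choice, the function names, and the sample data are assumptions for this example, not a description of any particular pipeline.

```python
from collections import Counter


def reads_per_million(aligned_refs):
    """aligned_refs: iterable of reference ids, one entry per aligned read."""
    counts = Counter(aligned_refs)
    total = sum(counts.values())
    return {ref: n * 1_000_000 / total for ref, n in counts.items()}


def compare(sample_a, sample_b):
    """Return {ref: (rpm_a, rpm_b)} for every reference seen in either sample."""
    rpm_a = reads_per_million(sample_a)
    rpm_b = reads_per_million(sample_b)
    refs = sorted(set(rpm_a) | set(rpm_b))
    return {ref: (rpm_a.get(ref, 0.0), rpm_b.get(ref, 0.0)) for ref in refs}


# Hypothetical alignments for two samples.
control = ["geneA"] * 6 + ["geneB"] * 4
treated = ["geneA"] * 2 + ["geneB"] * 6 + ["geneC"] * 2
table = compare(control, treated)
for ref, (a, b) in table.items():
    print(ref, a, b)
```

Real comparisons would also need statistics suited to count data; this only shows the tabulation step.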

DNA sequences are information rich data points

EST (expressed sequence tag) sequencing was one of the first applications to use sequence data in a quantitative way. In EST applications, mRNA from cells was isolated, converted to cDNA, cloned, and sequenced. The data from an EST library provided both new and quantitative information. Because each read came from a single molecule of mRNA, a set of ESTs could be assembled and counted to learn about gene expression. The composition and number of distinct mRNAs from different kinds of tissues could be compared and used to identify genes that were expressed at different time points during development, in different tissues, and in different disease states, such as cancer. The term "tag" was invented to indicate that ESTs could also be used to identify the genomic location of mRNA molecules. Although the information from EST libraries has been informative, lower cost methods such as microarray hybridization and real-time PCR assays replaced EST sequencing over time, as more genomic information became available.

Another quantitative use of sequencing has been to assess allele frequency and identify new variants. These assays are commonly known as "resequencing" since they involve sequencing a known region of genomic DNA in a large number of individuals. Since the regions of DNA under investigation are often related to health or disease, the NIH has proposed that these assays be called "Medical Sequencing." The suggested change also serves to avoid giving the public the impression that resequencing is being carried out to correct mistakes.

Unlike many assay systems (hybridization, enzyme activity, protein binding ...) where an event or complex interaction is measured and described by a single data value, a quantitative assay based on DNA sequences yields a greater variety of information. In a technique analogous to using an EST library, an RNA library can be sequenced, and the expression of many genes can be measured at once, by counting the number of reads that align to a given position or reference. If the library is prepared from DNA, a count of the aligned reads could measure the copy number of a gene. The composition of the read data itself can be informative. Mismatches in aligned reads can help discern alleles of a gene, or members of a gene family. In a variation assay, reads can both assess the frequency of a SNP and discover new variation. DNA sequences could be used in quantitative assays to some extent with Sanger sequencing, but the cost and labor requirements prevented widespread adoption.
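
The idea that mismatches in aligned reads carry information can be sketched with a simple per-position tally. This is a bare-bones illustration that assumes the bases covering one reference position have already been extracted from the alignments; it ignores base quality and sequencing error, which any real variant caller has to model.

```python
from collections import Counter


def allele_frequencies(observed_bases, reference_base):
    """Tally the read bases covering one reference position.

    observed_bases: string of bases from the reads at that position
    (a hypothetical, already-extracted input).
    Returns (depth, frequencies, non-reference frequencies).
    """
    counts = Counter(observed_bases.upper())
    depth = sum(counts.values())
    freqs = {base: n / depth for base, n in counts.items()}
    variants = {b: f for b, f in freqs.items() if b != reference_base.upper()}
    return depth, freqs, variants


depth, freqs, variants = allele_frequencies("AAAAAAGGGG", "A")
print(depth, freqs, variants)  # 10 {'A': 0.6, 'G': 0.4} {'G': 0.4}
```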

Next Gen adds a global perspective and new challenges

The power of Next Gen experiments comes from sequencing DNA libraries in a massively parallel fashion. Traditionally, a DNA library was used to clone genes. The library was prepared by isolating and fragmenting genomic DNA, ligating the pieces to a plasmid vector, transforming bacteria with the ligation products, and growing colonies of bacteria on plates with antibiotics. The plasmid vector would allow a transformed bacterial cell to grow in the presence of an antibiotic so that transformed cells could be separated from other cells. The transformed cells would then be screened for the presence of a DNA insert or gene of interest through additional selection, colorimetric assay (e.g. blue / white), or blotting. Over time, these basic procedures were refined and scaled up in factory style production to enable high throughput shotgun sequencing and EST sequencing. A significant effort and cost in Sanger sequencing came from the work needed to prepare and track large numbers of clones, or PCR-products, for data linking and later retrieval to close gaps or confirm results.

In Next Gen sequencing, DNA libraries are prepared, but the DNA is not cloned. Instead other techniques are used to "separate," amplify, and sequence individual molecules. The molecules are then sequenced all at once, in parallel, to yield large global data sets in which each read represents a sequence from an individual molecule. The frequency of occurrence of a read in the population of reads can now be used to measure the concentration of individual DNA molecules. Sequencing DNA libraries in this fashion significantly lowers costs, and makes previously cost prohibitive experiments possible. It also changes how we need to think about and perform our experiments.

The first change is that preparing the DNA library is the experiment. Tag profiling, RNA-seq, small RNA, ChIP-seq, DNase hypersensitivity, methylation, and other assays all have specific ways in which DNA libraries are prepared. Starting materials and fragmentation methods define the experiment and how the resulting datasets will be analyzed and interpreted. The second change is that large numbers of clones no longer need to be prepared, tracked, and stored. This reduces the number of people needed to process samples, and reduces the need for robotics, large numbers of thermocyclers, and other laboratory equipment. Work that used to require a factory setting can now be done in a single laboratory, or mailroom if you believe the ads.

Attention to details counts

Even though Next Gen sequencing gives us the technical capabilities to ask detailed and quantitative questions about gene structure and expression, successful experiments demand that we pay close attention to the details. Obtaining data that are free of confounding artifacts and accurately represent the molecules in a sample demands good technique and a focus on detail. DNA libraries no longer involve cloning, but their preparation does require multiple steps performed over multiple days. During this process, different kinds of data, ranging from gel images to discrete data values, may be collected and used later for troubleshooting. Tracking the experimental details requires that a system be in place that can be configured to collect information from any number and kind of process. The system also needs to be able to link data to the samples, and convert the information from millions of sequence data points to tables, graphics and other representations that match the context of the experiment and give a global view of how things are working. FinchLab is that kind of system.


Friday, August 8, 2008

ChIP-ing Away at Analysis

ChIP-Seq is becoming a popular way to study the interactions between proteins and DNA. This new technology is made possible by Next Gen sequencing techniques and sophisticated tools for data management and analysis. Next Gen DNA sequencing provides the power to collect the large amounts of data required. FinchLab is the software system that is needed to track the lab steps, initiate analysis, and see your results.

In recent posts, we stressed the point that unlike Sanger sequencing, Next Gen sequencing demands that data collection and analysis be tightly coupled, and presented our initial approach of analyzing Next Gen data with the Maq program. We also discussed how the different steps (basecalling, alignment, statistical analysis) provide a framework for analyzing Next Gen data and described how these steps belong to three phases: primary, secondary, and tertiary data analysis. Last, we gave an example of how FinchLab can be used to characterize data sets for Tag Profiling experiments. This post expands the discussion to include characterization of data sets for ChIP-Seq.

ChIP-Seq

ChIP (Chromatin Immunoprecipitation) is a technique where DNA binding proteins, like transcription factors, can be localized to regions of a DNA molecule. We can use this method to identify which DNA sequences control expression and regulation for diverse genes. In the ChIP procedure, cells are treated with a reversible cross-linking agent to "fix" proteins to other proteins that are nearby, as well as the chromosomal DNA where they're bound. The DNA is then purified and broken into smaller chunks by digestion or shearing, and antibodies are used to precipitate any protein-DNA complexes that contain their target antigen. After the immunoprecipitation step, unbound DNA fragments are washed away, the bound DNA fragments are released, and their sequences are analyzed to determine the DNA sequences that the proteins were bound to. Only a few years ago, this procedure was much more complicated than it is today; for example, the fragments had to be cloned before they could be sequenced. When microarrays became available, a microarray-based technique called ChIP-on-chip made this assay more efficient by allowing a large number of precipitated DNA fragments to be tested in fewer steps.

Now, Next Gen sequencing takes ChIP assays to a new level [1]. In ChIP-seq the same cross-linking, isolation, immunoprecipitation, and DNA purification steps are carried out. However, instead of hybridizing the resulting DNA fragments to a DNA array, the last step involves adding adaptors and sequencing the individual DNA fragments in parallel. When compared to microarrays, ChIP-seq experiments are less expensive, require fewer hands-on steps, and benefit from the lack of hybridization artifacts that plague microarrays. Further, because ChIP-seq experiments produce sequence data, they allow researchers to interrogate the entire chromosome. The experimental results are no longer limited to the probes on the microarray. ChIP-Seq data are better at distinguishing similar sites and collecting information about point mutations that may give insights into gene expression. No wonder ChIP-Seq is growing in popularity.

FinchLab

To perform a ChIP-seq experiment, you need to have a Next Gen sequencing instrument. You will also need to have the ability to run an alignment program and work with the resulting data to get your results. This is easier said than done. Once the alignment program runs, you might have to also run additional programs and scripts to translate raw output files to meaningful information. The FinchLab ChIP-seq pipeline, for example, runs Maq to generate the initial output, then runs Maq pileup to convert the data to a pileup file. The pileup file is then read by a script to create the HTML report, thumbnail images to see what is happening and "wig" files that can be viewed in the UCSC Genome Browser. If you do this yourself, you have to learn the nuances of the alignment program, how to run it different ways to create the data sets, and write the scripts to create the HTML reports, graphs, and wig files.
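
To give a sense of what one of those do-it-yourself scripts involves, here is a minimal sketch of the pileup-to-wig step. It is not the FinchLab pipeline script. It assumes each pileup line is tab-separated and begins with the reference name, a 1-based position, the reference base, and the read depth (our reading of the Maq pileup layout; check it against your Maq version), and it writes the variableStep wiggle format accepted by the UCSC Genome Browser. The function name and track name are made up.

```python
def pileup_to_wig(pileup_lines, track_name="coverage"):
    """Convert pileup lines to variableStep wiggle text.

    Assumed input columns (tab-separated): reference name, 1-based
    position, reference base, read depth. Extra columns are ignored.
    """
    out = [f'track type=wiggle_0 name="{track_name}"']
    current_ref = None
    for line in pileup_lines:
        line = line.strip()
        if not line:
            continue
        ref, pos, _base, depth = line.split("\t")[:4]
        if int(depth) == 0:
            continue  # skip uncovered positions to keep the file small
        if ref != current_ref:
            # Start a new block whenever the reference sequence changes.
            out.append(f"variableStep chrom={ref}")
            current_ref = ref
        out.append(f"{int(pos)} {int(depth)}")
    return "\n".join(out) + "\n"


# Hypothetical pileup lines for illustration.
example = [
    "chr1\t100\tA\t3\t@,,.",
    "chr1\t101\tC\t0\t@",
    "chr1\t102\tG\t5\t@.....",
    "chr2\t7\tT\t1\t@,",
]
print(pileup_to_wig(example), end="")
```

Even this small piece leaves out the HTML report, the thumbnail images, and the bookkeeping that ties each output file back to a sample.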

With FinchLab, you can skip those steps. You get the same results by clicking a few links to sort the data, and a few more to select the files, run the pipeline, and view the summarized results. You can also click a single link to send the data to the UCSC genome browser for further exploration.


Reference

[1] ChIP-seq: welcome to the new frontier. Nature Methods 4, 613-614 (2007).


Thursday, July 31, 2008

Questions from our mailbag: How do I cite FinchTV?

One of the questions that appears in our mailbox from time to time concerns citing FinchTV or other Geospiza products. A quick search with Google Scholar for "FinchTV" finds 42 examples where FinchTV was cited in research publications. Most of the citations seem to follow the same conventions.

We recommend citing FinchTV as you would any other experimental software tool, instrument, or reagent. The citation should include the version of the program, the company, the location, and the web site. Other Geospiza products (FinchLab, Finch Suite, and iFinch) may be cited in a similar manner.

In our case, a citation would most likely read:

FinchTV 1.4.0 (Geospiza, Inc.; Seattle, WA, USA; http://www.geospiza.com)

If you're not sure which version of FinchTV you're using, open the About menu. The version number will appear on the page.

It would also be a good idea to check with the journal where you plan to submit the article. Most journals have a set of instructions for authors where they provide example citations.


Friday, June 13, 2008

Finch 3, Linking Samples and Data

One of the big challenges with Next Gen sequencing is linking sample information with data. People tell us: "It's a real problem." "We use Excel, but it is hard." "We're losing track."

Do you find it hard to connect sample information with all the different types of data files? If so you should look at FinchLab.

A review:

About a month ago, I started talking about our third version of the Finch platform and introduced the software requirements for running a modern lab. To review, labs today need software systems that allow them to:

1. Set up different interfaces to collect experimental information
2. Assign specific workflows to experiments
3. Track the workflow steps in the laboratory
4. Prepare samples for data collection runs
5. Link data from the runs back to the original samples
6. Process data according to the needs of the experiment

In FinchLab, order forms are used to first enter sample information into the system. They can be created for specific experiments and the samples entered will, most importantly, be linked to the data that are produced. The process is straightforward. Someone working with the lab, a customer or collaborator, selects the appropriate form and fills out the requested information. Later, an individual in the lab reviews the order and, if everything is okay, chooses the "processing" state from a menu. This action "moves" the samples into the lab where the work will be done. When the samples are ready for data collection they are added to an "Instrument run." The instrument run is Finch's way of tracking which samples go in what well of a plate or lane/chamber on a slide. The samples are added to the instrument and data are collected.

The data

Now comes the fun part. If you have a Next Gen system you'll ultimately end up with thousands of files scattered in multiple directories. The primary organization for the data will be in Unix-style directories, which are like Mac or Windows folders. Within the directories you will find a mix of sequence files, quality files, files that contain information about run metrics and possibly images. You'll have to make decisions about what to save for long-term use and what to archive, or delete.

As noted, the instrument software organizes the data by the instrument run. However, a run can have multiple samples, and the samples can be from different experiments. A single sample can be spread over multiple lanes and chambers of a slide. If you are running a core lab, the samples will come from different customers and your customers often belong to different lab groups. And there is the analysis. The programs that operate on the data require specific formats for input files and produce many kinds of output files. Your challenge is to organize the data so that it is easy to find and access in a logical way. So what do you do?

Organizing data the hard way

If you do not have a data management system, you'll need to write down which samples go with which person, group or experiment. That's pretty simple. You can tape a piece of paper on the instrument and write this down, or you can diligently open a file, commonly an Excel spreadsheet, and record the info there. Not too bad, after all there are only a handful of partitions on a slide (2, 8, 16) and you only run the instrument once or twice a week. If you never upgrade your instrument, or never try and push too many samples through, then you're fine. Of course the less you run your instrument the more your data cost and the goal is to get really good at running your instrument, as frequently as possible. Otherwise you look bad at audit time.

Let's look at a scenario where the instrument is being run at maximal throughput. Over the course of a year, data from between 200 and 1000 slide lanes (chambers) may be collected. These data may be associated with hundreds or thousands of samples and belong to a few or many users in one or many lab groups. The relevant sequence files range from a few hundred megabytes to gigabytes in size; they exist in directories with run quality metrics and possibly analysis results. To sort this out you could have committee meetings to determine whether data should be organized by sample, experiment, user, or group, or you could just pick an organization. Once you've decided on your organization you have to set up access. Does everyone get a Unix account? Do you set up Samba services? Do you put the data on other systems like Macs and PCs? What if people want to share? The decisions and IT details are endless. Regardless, you'll need a battery of scripts to automate moving data around to meet your organizational scheme. Or you could do something easier.

Organizing data the Finch way

One of FinchLab's many strengths is how it organizes Next Gen data. Because the system tracks samples and users, and has group and permissions models, issues related to data access and sharing are simplified. After a run is complete, the system knows which data files go with which samples. It also knows which samples were submitted by each user. Thus data can be maintained in the run directories that were created by the instrument software to simplify file-based organization. When a run is complete in FinchLab a data link is made to the run directory. The data link informs the system which files go with a run. Data processing routines in the system sort the data into sequences, quality metric files, and other data. At this stage data are associated with samples. Once this is done, the lab has easy access to the data via web pages. The lab can also make decisions about access to data and how to analyze the data. These last two features make FinchLab a powerful system for core labs and research groups. With only a few clicks your data are organized by run, user, group, and experiment - and you didn't have to think about it.
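To make the idea concrete, here is a minimal sketch of how lane-level records can be indexed by sample, user, and group while the files stay in the instrument's run directories. It is not FinchLab's implementation; the `LaneRecord` fields, the `lane_N` path layout, and the example names are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneRecord:
    """One lane (chamber) of a run and who it belongs to (hypothetical fields)."""
    run: str
    lane: int
    sample: str
    user: str
    group: str


def build_index(records):
    """Index lanes by sample, user, and group without moving any files."""
    index = {
        "sample": defaultdict(list),
        "user": defaultdict(list),
        "group": defaultdict(list),
    }
    for rec in records:
        # Assumed layout: data stay where the instrument software wrote them.
        path = f"{rec.run}/lane_{rec.lane}"
        index["sample"][rec.sample].append(path)
        index["user"][rec.user].append(path)
        index["group"][rec.group].append(path)
    return index


records = [
    LaneRecord("run_001", 1, "sampleA", "alice", "lab1"),
    LaneRecord("run_001", 2, "sampleA", "alice", "lab1"),
    LaneRecord("run_001", 3, "sampleB", "bob", "lab2"),
    LaneRecord("run_002", 1, "sampleB", "bob", "lab2"),
]
index = build_index(records)
print(index["sample"]["sampleA"])
print(index["user"]["bob"])
```

The same records answer "which lanes belong to this sample?" and "which lanes belong to this user?" without a second copy of the data or a script that shuffles directories.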




Monday, May 26, 2008

Finch 3: Managing Workflows

Genetic analysis workflows begin with RNA or DNA samples and end with results. In between, multiple lab procedures and steps are used to transform materials, move samples between containers, and collect the data. Each kind of data collected and each data collection platform requires that different laboratory procedures are followed. When we analyze the procedures, we can identify common elements. A large number of unique workflows can be created by assembling these elements in different ways.

In the last post, we learned about the FinchLab order form builder and some of its features for developing different kinds of interfaces for entering sample information. Three factors contribute to the power of Finch orders. First, labs can create unique entry forms by selecting items like pull-down menus, check boxes, radio buttons, and text entry fields for numbers or text, from a web page. No programming is needed. Second, for core labs with business needs, the form fields can be linked to diverse price lists. Third, and the subject of this post, the forms are also linked to different kinds of workflows.

What are Workflows?

A workflow is a series of steps that must be performed to complete a task. In genetic analysis, there are two kinds of workflows: those that involve laboratory work, and those that involve data processing and analysis. The laboratory workflows prepare sample materials so that data can be collected. For example, in gene expression studies, RNA is extracted from a source material (cells, tissue, bacteria), and converted to cDNA for sequencing. The workflow steps may involve purification, quality analysis on agarose gels, concentration measurements, and reactions where materials are further prepared for additional steps.

The data workflows encompass all the steps involved in tracking, processing, managing, and analyzing data. Sequence data are processed by programs to create assemblies and alignments that are edited or interrogated to create genomic sequences, discover variation, understand gene expression, or perform other activities. Other kinds of data workflows, such as microarray analysis or genotyping, involve developing and comparing data sets to gain insights. Data workflows involve file manipulations, program control, and databases. The challenge for the scientist today, and the focus of Geospiza's software development, is to bring the laboratory and data workflows together.

Workflow Systems

Workflows can be managed or unmanaged. Whether you work at the bench or work with files and software, you use a workflow any time you carry out a procedure with more than one step. Perhaps you write the steps in your notebook, check them off as you go, and tape in additional data like spectrophotometer readings or photos. Perhaps you write papers in Word and format the bibliography with EndNote or resize photos with Photoshop before adding them to a blog post. In all these cases you performed unmanaged workflows.

Managing and tracking workflows becomes important as the number of activities and number of individuals performing them increase in scale. Imagine your lab bench procedures performed multiple times a day with different individuals operating particular steps. This scenario occurs in core labs that perform the same set of processes over and over again. You can still track steps on paper, but it's not long before the system becomes difficult to manage. It takes too much time to write and compile all of the notes, and it's hard to know which materials have reached which step. Once a system goes beyond the work of a single person, paper notes quit providing the right kinds of overviews. You now need to manage your workflows and track them with a software system.

A good workflow system allows you to define the steps in your protocols. It will provide interfaces to move samples through the steps and also provide ways to add information to the system as steps are completed. If the system is well-designed, it will not allow you to do things at inappropriate times or require too much "thinking" as the system is operated. A well-designed system will also reduce complexity and allow you to build workflows through software interfaces. Good systems give scientists the ability to manage their work; they do not require their users to learn arcane programming tools or resort to custom programming. Finally, the system will be flexible enough to let you create as many workflows as you need for different kinds of experiments and link those workflows to data entry forms so that the right kind of information is available to the right process.

FinchLab Workflows

The Geospiza FinchLab workflow system meets the above requirements. The system has a high-level workflow that understands that some processes require little tracking (a quick test) and others require more significant tracking ("I want to store and reuse DNA samples"). More detailed processes are assigned workflows that consist of three parts: a name, a "State," and a "Status." The "State" controls the software interfaces and determines which information is presented and accessed at different parts of a process. A sequencing or genotyping reaction, for example, cannot be added to a data collection instrument until it is "ready." The Statuses specify the steps of the process; they are defined by the lab and added to a workflow using the web interfaces. When a workflow is created, it is given a name, as many steps as needed, and it is assigned a State. The workflows are then assigned to different kinds of items so that the system always knows what to do next with the samples that enter.
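The name / State / Status structure can be sketched in a few lines. This is only an illustration under stated assumptions: the class names, the "in process" placeholder, the example step names, and the rule that an item takes on the workflow's State once its last Status is reached are ours, not a description of how FinchLab stores workflows.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    state: str      # State reached when every Status is complete (assumed meaning)
    statuses: list  # lab-defined steps, in order


@dataclass
class TrackedItem:
    label: str
    workflow: Workflow
    step: int = 0

    @property
    def status(self):
        """The current lab-defined step."""
        return self.workflow.statuses[self.step]

    @property
    def state(self):
        """'in process' (hypothetical placeholder) until the last Status is reached."""
        last = len(self.workflow.statuses) - 1
        return self.workflow.state if self.step == last else "in process"

    def advance(self):
        """Move to the next Status; refuse to step past the end of the workflow."""
        if self.step >= len(self.workflow.statuses) - 1:
            raise ValueError(f"{self.label} has already completed {self.workflow.name}")
        self.step += 1


library_prep = Workflow(
    name="Library prep",
    state="ready",
    statuses=["Received", "Fragmented", "Adaptors ligated", "Quantified"],
)
item = TrackedItem("sample-001", library_prep)
while item.state != "ready":
    item.advance()
print(item.status, item.state)
```

Because the steps are plain data, a lab could define a second workflow with different Statuses without writing any new logic, which is the property the post is describing.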

A workflow management system like FinchLab makes it just as easy to track the steps of Sanger DNA sequencing as it is to track the steps of a Solexa, SOLiD, or 454 sequencing process. You can also, in the same system, run genotyping assays and other kinds of genetic analysis like microarrays and bead assays.


Next time, we'll talk about what happens in the lab.


Tuesday, May 20, 2008

Finch 3: Defining the Experimental Information

In today's genetic analysis laboratory, multiple instruments are used to collect a variety of data ranging from DNA sequences to individual values that measure DNA (or RNA) hybridization, nucleotide incorporations, or other binding events. Next Gen sequencing adds to this complexity and offers additional challenges with the amount of data that can be produced for a given experiment.

In the last post, I defined basic requirements for a complete laboratory and data management system in the context of setting up a Next Gen sequencing lab. To review, I stated that laboratory workflow systems need to perform the following basic functions:
  1. Allow you to set up different interfaces to collect experimental information
  2. Assign specific workflows to experiments
  3. Track the workflow steps in the laboratory
  4. Prepare samples for data collection runs
  5. Link data from the runs back to the original samples
  6. Process data according to the needs of the experiment
I also added that if you operate a core lab, you'll want to bill for your services and get paid.

In this post I'm going to focus on the first step, collecting experimental information. For this exercise let's say we work in a lab that has:
  • One Illumina Solexa Genome Analyzer
  • One Applied Biosystems SOLiD System
  • One Illumina Bead Array station
  • Two Applied Biosystems 3730 Genetic Analyzers, used for both sequencing and fragment analysis

This image shows our laboratory home page. We run our lab as a service lab. For each data collection platform we need to collect different kinds of sample information. One kind of information is the sample container. Our customers' samples will be sent to the lab in many different kinds of containers depending on the kind of experiment. Next Gen sequencing platforms like SOLiD, Solexa, and 454 are low throughput with respect to sample preparation, so samples will be sent to us in tubes. Instruments like the Bead Array and the 3730 DNA sequencing instrument usually involve sets of samples in 96 or 384 well plates. In some cases, samples start in tubes and end up in plates, so you'll need to determine which procedures use tubes and which use plates and how the samples will enter the lab.

Once the samples have reached the lab and been checked, you are also going to do different things to the samples in order to prepare them for the different data collection platforms. You'll want to know which samples should go to which platforms and have the workflows for different processes defined so that they are easy to follow and track. You might even want to track and reuse certain custom reagents like DNA primers, probes and reagent kits. In some cases you'll want to know physical information, like sample type (DNA or RNA) or concentration, upfront. In other cases you'll determine that information later.

Finally, let's say you work at an institution that focuses on a specific area of research, like cancer, or mouse genetics, or plant research. In these settings you might want to also track information about sample source. Such information could include species, strain, tissue, treatment or many other kinds of things. If you want to explore this information later you'll probably want to define a vocabulary that can be "read" by computer programs. To ensure that the vocabulary can be followed, interfaces will be needed to enter this information without typing or else you'll have a problem like pseudomonas, psuedomonas, or psudomonas.
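A small sketch shows why a defined vocabulary helps. The `SPECIES` list and the `check_term` helper are hypothetical; a real entry form would offer these terms in a pull-down menu so nothing is typed at all. Here the standard library's `difflib` is used only to show that the misspellings above can be caught and mapped back to a single accepted term.

```python
import difflib

# Hypothetical controlled vocabulary for a "species" field.
SPECIES = ["Pseudomonas", "Escherichia", "Populus"]


def check_term(value, vocabulary):
    """Return (accepted_term, suggestions) for one typed value.

    accepted_term is None when the value is not in the vocabulary;
    suggestions then lists close matches the user probably meant.
    """
    lookup = {term.lower(): term for term in vocabulary}
    key = value.strip().lower()
    if key in lookup:
        return lookup[key], []
    close = difflib.get_close_matches(key, list(lookup), n=3, cutoff=0.8)
    return None, [lookup[match] for match in close]


for typed in ["pseudomonas", "psuedomonas", "psudomonas"]:
    print(typed, check_term(typed, SPECIES))
```

Only the first spelling is accepted; the other two are rejected with "Pseudomonas" offered as the likely intent, so a later query by species does not silently miss samples.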

Information systems that support the above scenarios have to deal with a lot of "sometimes this" and "sometimes that" kinds of information. If one path is taken, say Sanger sequencing on a 3730, the sample information and physical configurations needed differ from those needed for Next Gen sequencing. Next Gen platforms have different sample requirements too. SOLiD and 454 require emulsion PCR to prepare sequencing samples, whereas Solexa amplifies DNA molecules on slides in clusters. Additionally, the information entry system also has to deal with "I care" and "I don't care" kinds of data like information about sample sources, or experimental conditions. These kinds of information are needed later to understand the data in the context of the experiment, but do not have much impact on the data collection processes.

How would you create a system to support these diverse and changing requirements?

One way to do this would be to build a form with many fields and rules for filling it out. You know those kinds of forms. They say things like "ignore this section if you've filled out this other section." That would be a bad way to do this, because no one would really get things right, and the people tasked with doing the work would spend a lot of time either asking questions about what they are supposed to be doing with the samples or answering questions about how to fill out the form.

Another way would be to tell people that their work is too complex and they need custom solutions for everything they do. That's expensive.

A better way to do this would be to build a system for creating forms. In this system, different forms are created by the people who develop the different services. The forms are linked to workflows (lab procedures) that can understand sample configurations (plates, tubes, premixed ingredients, and required information). If the system is really good, you can easily create new forms and add fields to them to collect physical information (sample type, concentration) or experimental information (tissue, species, strain, treatment, your mother's maiden name, favorite vacation spot, ...) without having to develop requirements with programmers and have them build forms. If your system is exceptionally good, smart, and clever, it will let you create different kinds of forms and fields and prevent you from doing things that are in direct conflict with one another. If your system is modern, it will be 100% web-based and have cool web 2.0 features like automated fill downs, column highlighting, and multi-selection devices so that entering data is easy, intuitive, and even a bit fun.
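One way to picture "a system for creating forms" is to treat each form as data rather than code. The sketch below is an assumption-laden illustration, not FinchLab's design: the `FORMS` dictionary, its field kinds (`menu`, `number`, `text`), the form and workflow names, and the `validate` helper are all made up for the example.

```python
# Hypothetical form definitions: each service gets its own form,
# linked to a workflow and a container type, with its own fields.
FORMS = {
    "Illumina tag profiling": {
        "workflow": "mRNA tag profiling",
        "container": "tube",
        "fields": {
            "sample_type": {"kind": "menu", "choices": ["DNA", "RNA"], "required": True},
            "concentration_ng_per_ul": {"kind": "number", "required": True},
            "tissue": {"kind": "text", "required": False},
        },
    },
}


def validate(form_name, entry):
    """Return a list of problems with one submitted entry (empty list means OK)."""
    problems = []
    fields = FORMS[form_name]["fields"]
    for name, spec in fields.items():
        if name not in entry or entry[name] in ("", None):
            if spec["required"]:
                problems.append(f"{name}: required")
            continue
        value = entry[name]
        if spec["kind"] == "menu" and value not in spec["choices"]:
            problems.append(f"{name}: not one of {spec['choices']}")
        elif spec["kind"] == "number" and (
            isinstance(value, bool) or not isinstance(value, (int, float))
        ):
            problems.append(f"{name}: must be a number")
    for name in entry:
        if name not in fields:
            problems.append(f"{name}: unknown field")
    return problems


print(validate("Illumina tag profiling", {"sample_type": "protein"}))
```

Adding a new service then means adding another entry to `FORMS`, not asking a programmer to build another form.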

FinchLab, built on the Finch 3 platform, is such a system.


Friday, April 25, 2008

Managing Digital Gene Expression Workflows with FinchLab

Last Wed (4/23) Illumina hosted a Geospiza presentation featuring how FinchLab supports mRNA tag profiling experiments. We had a great turnout and the presentation is posted on the Illumina web site.

In the webinar we talked about:
  • Next Gen sequencing applications
  • How the Illumina Genome Analyzer makes mRNA Tag Profiling more sensitive by looking at some features of mRNA Tag Profiling data sets with FinchLab
  • Setting up and tracking laboratory workflows with FinchLab
  • Why it is important to link the laboratory work and data analysis work
  • Setting up data analysis and reviewing results with FinchLab
  • Using hosted solutions to overcome the significant data management challenges that accompany Next Gen technologies
Over the coming weeks and months we'll explore the above points through multiple posts. In the meantime, get the presentation and enjoy.

From Sample to Results: Managing Illumina Data Workflow with FinchLab


Monday, April 21, 2008

Sneak Peek: Managing Next Gen Digital Gene Expression Workflows

This Wednesday, April 23rd, Illumina will host a webinar featuring the Geospiza FinchLab.

If you are interested in:
  • Learning about Next Gen sequencing applications
  • Seeing how the Illumina Genome Analyzer makes mRNA Tag Profiling more sensitive
  • Understanding the flow of data and information as samples are converted into results
  • Overcoming the significant data management challenges that accompany Next Gen technologies
  • Setting up Next Gen sequencing in your core lab
  • Creating a new lab with Next Gen technologies
This webinar is for you!

In the webinar, we will talk about the general applications of Next Gen sequencing and focus on using the Illumina Genome Analyzer to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling. Throughout the talk we will give specific examples about collecting and analyzing tag profiling data and show how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.


Wednesday, April 16, 2008

Expectations Set the Rules

Genetic analysis workflows are complex. Biology is non-deterministic, so we continually experience new problems. Lab processes and our data have natural uncertainty. These factors conspire against us to make our world rich in variability and processes less than perfect.

That keeps things interesting.

In a previous post, I was able to show how sequence quality values could be used to summarize the data for a large resequencing assay. Presenting "per read" quality values in a grid format allowed us to visualize samples that had failed as well as observe that some amplicons contained repeats that led to sequencing artifacts. We also were able to identify potential sample tracking issues and left off with an assignment to think about how we might further test sample tracking in the assay.

When an assay is developed there are often certain results that can be expected. Some results are defined explicitly with positive and negative controls. We can also use assay results to test that the assay is producing the right kinds of information. Do the data make sense? Expectations can be derived from the literature, an understanding of statistical outcomes, or internal measures.

Genetic assays have common parts

A typical genetic resequencing assay is developed from known information. The goal is to collect sequences from a defined region of DNA for a population of individuals (samples) and use the resulting data to observe the frequency of known differences (variants) and identify new patterns of variation. Each assay has three common parts:

Gene Model - Resequencing and genotyping projects involve comparative analysis of new data (sequences, genotypes) to reference data. The Gene Model can be a chromosomal region or specific gene. A well-developed model will include all known genotypes, protein variations, and phenotypes. The Gene Model represents both community (global) and laboratory (local) knowledge.

Assay Design - The Assay Design defines the regions in the Gene Model that will be analyzed. These regions, typically prepared by PCR, are bounded by unique DNA primer sequences. The PCR primers have two parts: one part is complementary to the reference sequence (black in the figure), the other part is "universal" and is complementary to a sequencing primer (red in the figure). The study includes information about patient samples such as their ethnicity, collection origin, and phenotypes associated with the gene(s) under study.

Experiments / Data Collection / Analysis - Once the study is designed and materials arrive, samples are prepared for analysis. PCR is used to amplify specific regions for sequencing or genotyping. After a scientist is confident that materials will yield results, data collection begins. Data can be collected in the lab or the lab can outsource their sequencing to core labs or service companies. When data are obtained, they are processed, validated, and compared to reference data.

Setting expectations

A major challenge for scientists doing resequencing and genotyping projects arises when trying to evaluate data quality and determine the “next steps.” Rarely does everything work. We've already talked about read quality, but there are also the questions of whether the data are mapping to their expected locations, and whether the frequencies of observed variation are expected. The Assay Design can be used to verify experimental data.

The Assay Design tells us where the data should align and how much variation can be expected. For example, if the average SNP frequency is 1/1300 bases, and an average amplicon length is 580 bases, we should expect to observe one SNP for every two amplicons. Furthermore, in reads where a SNP may be observed, we will see the difference in a subset of the data because some, or most, of the reads will have the same allele as the reference sequence.
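The arithmetic behind that expectation is short enough to write out. The two input numbers come from the example above; treating SNPs as independent, rare events (a Poisson model) to estimate how many amplicons carry no SNP at all is our added assumption, not something the assay design states.

```python
import math

snp_rate = 1 / 1300      # average SNPs per base (from the example above)
amplicon_length = 580    # average amplicon length in bases

# Expected number of SNPs in one amplicon.
expected_snps = snp_rate * amplicon_length

# How many amplicons we need to look at, on average, to see one SNP.
amplicons_per_snp = 1 / expected_snps

# Assumption: SNPs occur independently, so the count per amplicon is
# roughly Poisson and P(no SNP) = exp(-expected_snps).
p_no_snp = math.exp(-expected_snps)

print(f"expected SNPs per amplicon: {expected_snps:.3f}")
print(f"amplicons per SNP:          {amplicons_per_snp:.2f}")
print(f"P(amplicon has no SNP):     {p_no_snp:.2f}")
```

The result, a little under half a SNP per amplicon, is where "one SNP for every two amplicons" comes from, and it also tells us that a majority of amplicons should show no variation at all.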

To test our expectations for the assay, the 7488 read data set is summarized in a way that counts the frequency of disagreements between read data and their reference sequence. The graph below shows a composite of read discrepancies (blue bar graph) and average Q20/rL, Q30/rL, Q40/rL values (colored line graphs). Reads are grouped according to the number of discrepancies observed (x-axis). For each group, the count of reads (bar height) and average Q20/rL (green triangles), Q30/rL (yellow squares), and Q40/rL (purple circles) are displayed.


In the 7488 read data set, 95% (6914) of the reads gave alignments. Of the aligned data, 82% of the reads had between 0 and 4 discrepancies. If we were to pick which traces to review and which samples to redo, we would likely focus our review on the data in this group and queue the rest (18%) for redos to see if we could improve the data quality.

Per our previous prediction, most of the data (5692 reads) do not have any discrepancies. We also observe that the number of discrepancies increases as the overall data quality decreases. This is expected because the quality values are reflecting the uncertainty (error) in the data.

Spotting tracking issues

We can also use our expectations to identify sample tracking issues. Once an assay is defined, the positions of all of the PCR primers are known, hence we should expect that our sequence data will align to the reference sequence in known positions. In our data set, this is mostly true. Similar to the previous quality plots of samples and amplicons, an alignment "quality" can be defined and displayed in a table where the rows are samples and columns are amplicons. Each sample has two rows (one forward and one reverse sequence). If the cells are colored according to alignment start positions (green for expected, red for unexpected, white for no alignment) we can easily spot which reads have an "unexpected" alignment. The question then becomes, where/when did the mixup occur?
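The coloring rule for such a table can be sketched as follows. The amplicon names, coordinates, and the `tolerance` of 10 bases are hypothetical, and the sketch ignores the detail that a reverse read would be expected to start at the other end of the amplicon.

```python
def alignment_color(expected_start, observed_start, tolerance=10):
    """Color one cell: white = no alignment, green = expected, red = unexpected."""
    if observed_start is None:
        return "white"
    if abs(observed_start - expected_start) <= tolerance:
        return "green"
    return "red"


# Hypothetical expected start positions from the Assay Design.
expected = {"amp01": 1200, "amp02": 3450}

# Hypothetical observed alignment starts keyed by (sample, direction, amplicon).
observed = {
    ("sample1", "F", "amp01"): 1202,   # aligns where the design says it should
    ("sample1", "F", "amp02"): 1198,   # labeled amp02 but aligns at amp01
    ("sample1", "R", "amp01"): None,   # read did not align
}

colors = {
    key: alignment_color(expected[key[2]], start)
    for key, start in observed.items()
}
for key, color in colors.items():
    print(key, color)
```

A red cell like the second one, where a read labeled as one amplicon lands on another amplicon's coordinates, is exactly the kind of signal that points to a mixup somewhere between the plate and the data file.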

From these kinds of analyses we can get a feel for whether a project is on track and whether there are major issues that will make our lives harder. In future posts I will comment on other kinds of measures that can be made and show you how this work can be automated in FinchLab.


Monday, April 14, 2008

Digital Gene Expression with Next Gen Sequencing

Next Gen Sequencing is changing how we approach problems ranging from whole genome shotgun sequencing, to variation analysis, to gene expression, to structural genomics. Next week, April 23rd, Geospiza will present a webinar on managing Digital Gene Expression experiments and data with FinchLab. The webinar is hosted by Illumina as part of their ongoing webinar series on Next Gen sequencing.

Abstract

Next Gen sequencers enable researchers to perform new and exciting experiments like digital gene expression. Next Gen sequencers, however, also expose researchers to unprecedented experimental data volume and the need for new tools to support these projects. A single run of the Illumina Genome Analyzer, for example, can generate terabytes of data and 100s of thousands of files. To manage these projects effectively, researchers will need new software systems to quickly track samples, access and analyze the key results files produced by these runs and focus on the science, rather than IT.

In this webinar, Geospiza will demonstrate how the FinchLab Next Gen Edition workflow software can be used to track samples, quality review data, and characterize the biological significance of an Illumina dataset while streamlining the entire process from sample to result for a Digital Gene Expression experiment.

Hope to see you there.


Tuesday, April 8, 2008

Exceptions are the Rule

Genetic analysis workflows are complex. You can expect that things will go wrong in the laboratory. Biology also manages to interfere and make things harder than you think they should be. Your workflow management system needs to show the relevant data, allow you to observe trends, and have flexible points where procedures can be repeated.

In the last few posts, I introduced genetic analysis workflows, concepts about lab and data workflows, and discussed why it is important to link the lab and data workflows. In this post I expand on the theme and show how a workflow system like the Geospiza FinchLab can be used to troubleshoot laboratory processes.

First, I'll review our figure from last week. Recall that it summarized 4608 paired forward / reverse sequence reads. Samples are represented by rows, and amplicons by column, so that each cell represents a single read from a sample and one of its amplicons. Color is used to indicate quality, with different colors showing the number of Q20 bases divided by the read length (Q20/rL). Green is used for values between 0.60 and 1.00, blue for values between 0.30 and 0.59, and red for values less than 0.29. The summary showed patterns that indicated lab failures and biological issues. You were asked to figure them out. Eric from seqanswers (a cool site for Next Gen info) took a stab at this, and got part of the puzzle solved.

Sample issues

Rows 1,2 and 7,8 show failed samples. We can spot this because of the red color across all the amplicons. Either the DNA preps failed to produce DNA, or something interfered with the PCR. Of course there are those pesky working reactions for both forward and reverse sequence in sample 1 column 8. My first impression is that there is a tracking issue. The sixth column also has one reaction that worked. Could this result indicate a more serious problem in sample tracking?


Amplicon issues

In addition to the red rows, some columns show lots of blue spots; these columns correspond to amplicons 7, 24 and 27. Remember that blue is an intermediate quality. An intermediate quality could be obtained if part of the sequence is good and part of the sequence is bad. Because the columns represent amplicons, when we see a pattern in a column it likely indicates a systematic issue for that amplicon. For example, in column 7, all of the data are intermediate quality. Columns 24 and 27 are more interesting because the striping pattern indicates that one sequencing reaction results in data with intermediate quality while the other looks good. Wouldn't it be great if we could drill down from this pattern and see a table of quality plots and also get to the sequence traces?
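Column patterns like these can also be found programmatically. The sketch below assumes the heat map is available as a grid of color names with rows alternating forward and reverse reads for each sample; the `min_fraction` cutoff of 0.4 and the "striped" / "uniform" labels are our own choices for illustration.

```python
def flag_amplicons(grid, min_fraction=0.4):
    """Flag amplicon columns with many intermediate-quality (blue) reads.

    grid[r][c] is a color name; rows alternate forward, reverse for each sample.
    Returns {amplicon_number: "striped" | "uniform"} for flagged columns.
    """
    flagged = {}
    n_rows = len(grid)
    for col in range(len(grid[0])):
        column = [row[col] for row in grid]
        if column.count("blue") / n_rows < min_fraction:
            continue
        forward_blue = column[0::2].count("blue")
        reverse_blue = column[1::2].count("blue")
        # Striped: only one read direction is affected.
        striped = min(forward_blue, reverse_blue) == 0
        flagged[col + 1] = "striped" if striped else "uniform"
    return flagged


# Toy grid: 2 samples (4 rows) by 3 amplicons.
grid = [
    ["green", "blue", "blue"],   # sample 1 forward
    ["green", "blue", "green"],  # sample 1 reverse
    ["green", "blue", "blue"],   # sample 2 forward
    ["green", "blue", "green"],  # sample 2 reverse
]
print(flag_amplicons(grid))
```

In the toy grid the second column behaves like amplicon 7 (intermediate quality in both directions) and the third like amplicons 24 and 27 (one direction affected).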


Getting to the bottom

In FinchLab we can drill down and view the underlying data. The figure below summarizes the data for amplicon 24. The panel on the left is the expanded heat map for the data set. The panel on the right is a folder report summarizing the data from 192 reads for amplicon 24. It contains three parts: an information table that provides an overview of the details for the reads, a histogram plot that counts how many reads have a certain range of Q20 values, and a data table that summarizes each read in a row containing its name, the number of edit revisions, its Q20 and Q20/rLen values, and a thumbnail quality plot showing the quality values for each base in the read. In the histogram, you can see that two distinct peaks are observed. About half the data have low Q20 values and half have high Q20 values, producing the striping pattern in the heat map. The data table shows two reads; one is the forward sequence and the other is its "reverse" pair. These data were brought together using the table's search function, in the "finder" bar. Note how the reads could fit together if one picture was reversed.

Could something in the sequence be interfering with the sequencing reaction?

To explore the data further, we need to look at the sequences themselves. We can do this by clicking the name and viewing the trace data online in our web browser, or we could click the FinchTV icon and view the sequence in FinchTV (bottom panel of the figure above). When we do this for the top read (left-most trace) we see that, sure enough, there is a polyT tract that we are not getting through. During PCR such regions can cause "drop outs" and result in mixtures of molecules that differ in size by one or two bases. A hallmark of such a problem is a sudden drop in data quality at the end of the polynucleotide tract because the mixture of molecules creates a mess of mixed bases. This explanation is confirmed by the other read. When we view it in FinchTV (right-most trace) we see poor data at the end of the read. Remember these data are reversed relative to the first read, so when we reverse complement the trace (middle trace), we see that it "fits" together with the first read. A problem for such amplicons is that we now have only single-stranded coverage. Since this problem occurred at the end of the read, half of the data are good and the other half are poor quality. If the problem occurred in the middle of the read, all of the data would show an intermediate quality like amplicon 7.

In genetic analysis, data quality values are an important tool for assessing many lab and sample parameters. In this example, we were able to see systematic sample failures and sequence characteristics that can lead to intermediate quality data. We can use this information to learn about biological issues that interfere with analysis. But what about our potential tracking issue?

How might we determine if our samples are being properly tracked?


Friday, April 4, 2008

Lab work without data analysis and management is doo doo

As we begin to contemplate next generation sequence data management, we can use Sanger sequencing to teach us important lessons. One of these is the value of linking laboratory and data workflows to be able to view information in the context of our assays and experiments.

I have been fortunate to hear J. Michael Bishop speak on a couple of occasions. He ended these talks by quoting one of his biochemistry mentors, "genetics without biochemistry is doo doo." In a similar vein, lab work without data analysis and management is doo doo. That is, when you separate the lab from the data analysis, you have to work through a lot of doo to figure things out. Without a systematic way to view summaries of large data sets, the doo is overwhelming.

To illustrate, I am going to share some details about a resequencing project we collaborated on. We came to this project late, so much of the data had been collected, and there were problems, lots of doo. Using Finch however, we could quickly organize and analyze the data, and present information in summaries with drill downs to the details to help troubleshoot and explain observations that were seen in the lab.

10,686 sequence reads: forward / reverse sequences from 39 amplicons from 137 individuals

The question being asked in this project was: are there new variants in a gene that are related to phenotypes observed in a specialized population? This is the kind of question medical researchers ask frequently. Typically they have a unique collection of samples that come from a well understood population of individuals. Resequencing is used to interrogate the samples for rare variants, or genotypes.

In this process, we purify DNA from sample material (blood), and use PCR with exon-specific primers to amplify small regions of DNA within the gene. The PCR primers have regions called universal adaptors. Our sequencing primers will bind to those regions. Each PCR product, called an amplicon, is sequenced twice, once from each strand, to give double coverage of the bases.

When we do the math, we will have to track the DNA for 137 samples and 5343 amplicons. Each amplicon is sequenced, at a minimum, twice, to give us 10,686 reads. From a physical materials point of view that means 137 tubes with sample; 56 plates (96-well) for PCR; and 112 plates (96-well) for sequencing. In a 384-well format we could have used 14 plates for PCR and 28 plates for sequencing. For a genome center, this level of work is trivial, but for a small lab this is significant work and things can happen. Indeed, as not all the work is done in a single lab, the process can be more complex. And you need to think about how you would lay this out - 96 does not divide by 39 very well.
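The numbers above can be reproduced with a few lines of arithmetic. The only assumption in this sketch is that plates are filled completely, with no wells reserved for controls, which is why the counts are simple round-ups.

```python
import math

samples = 137
amplicons_per_sample = 39
reads_per_amplicon = 2  # one forward and one reverse read


def plates_needed(reactions, wells_per_plate):
    """Plates required if every well is used (no control wells assumed)."""
    return math.ceil(reactions / wells_per_plate)


pcr_reactions = samples * amplicons_per_sample
sequencing_reactions = pcr_reactions * reads_per_amplicon

print("PCR reactions:        ", pcr_reactions)
print("Sequencing reactions: ", sequencing_reactions)
for wells in (96, 384):
    print(
        f"{wells}-well plates:",
        plates_needed(pcr_reactions, wells), "for PCR,",
        plates_needed(sequencing_reactions, wells), "for sequencing",
    )

# The layout problem: 39 amplicons do not tile a 96-well plate evenly.
print("Whole samples per 96-well plate:", 96 // amplicons_per_sample,
      "with", 96 % amplicons_per_sample, "wells left over")
```

The last line makes the layout problem explicit: only two complete samples fit on a 96-well plate, with 18 wells left over, so either wells go unused or a sample's amplicons are split across plates.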

From a data perspective, we can use sequence quality values to identify potential laboratory and biological issues. The figure below summarizes 4608 reads. Each pair of rows is one sample (forward / reverse sequence pairs, alternating gray and white - 48 total). Each column is an amplicon. Each cell in the table represents a single read from an amplicon and sample. Color is used to indicate quality. In this analysis, quality is defined as the ratio of Q20 to read length (Q20/rL), which works very well for PCR amplicons. The better the data, the closer this ratio is to one. In the table below, green indicates Q20/rL values between 0.60 and 1.00, blue indicates values between 0.30 and 0.59, and red indicates Q20/rL values less than 0.29. The summary shows patterns that, as we will learn next week, show lab failures and biological issues. See if you can figure them out.
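For readers who want to compute the same measure on their own reads, here is a minimal sketch. It assumes each read is available as a list of per-base quality values; the function names are ours, and the handling of values that fall exactly on a boundary (0.30 and 0.60 counted in the higher band) is an assumption, since the color ranges above leave small gaps.

```python
def q20_ratio(quality_values):
    """Fraction of bases in a read with a quality value of 20 or higher (Q20/rL)."""
    if not quality_values:
        return 0.0
    q20_bases = sum(1 for q in quality_values if q >= 20)
    return q20_bases / len(quality_values)


def quality_color(ratio):
    """Map a Q20/rL ratio to the heat map colors used in the table."""
    if ratio >= 0.60:
        return "green"
    if ratio >= 0.30:
        return "blue"
    return "red"


# Toy reads: per-base quality values (made up for illustration).
reads = {
    "good_read": [40] * 80 + [10] * 20,    # 80% of bases at Q20 or better
    "mixed_read": [40] * 45 + [10] * 55,   # 45%
    "failed_read": [8] * 100,              # 0%
}
for name, quals in reads.items():
    ratio = q20_ratio(quals)
    print(name, round(ratio, 2), quality_color(ratio))
```

Dividing by read length is what makes the measure comparable across amplicons of different sizes, which is why it works well for PCR products.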


Wednesday, April 2, 2008

Working with Workflows

Genetic analysis workflows involve both complex laboratory and data analysis and manipulation procedures. A good workflow management system not only tracks processes, but simplifies the work.

In my last post, I introduced the concept of workflows in describing the issues one needs to think about when preparing a lab for Next Gen sequencing. To better understand these challenges, we can learn from previous experience with Sanger sequencing in particular and genetic assays in general.

As we know, DNA sequencing serves many purposes. New genomes and genes in the environment are characterized and identified by de novo sequencing. Gene expression can be assessed by measuring Expressed Sequence Tags (ESTs), and DNA variation and structure can be investigated by resequencing regions of known genomes. We also know that gene expression and genetic variation can be studied with multiple technologies such as hybridization, fragment analysis, and direct genotyping, and that it is desirable to use multiple methods to confirm results. Within each of these general applications and technology platforms, specific laboratory and bioinformatics workflows are used to prepare samples, determine data quality, study biology, and predict biological outcomes.

The process begins in the laboratory.

Recently I came across a Wikipedia article on DNA sequencing that had a simple diagram showing the flow of materials from samples to data. I liked this diagram, so I reproduced it, with modifications. We begin with the sample. A sample is a general term that describes a biological material. Sometimes, like when you are at the doctor, these are called specimens. Since biology is all around and in us, samples come from anything that we can extract DNA or RNA from. Blood, organ tissue, hair, leaves, bananas, oysters, cultured cells, feces, you-can-imagine-what-else, can all be samples for genetic analysis. I know a guy who uses a .22 to collect the apical meristems from trees to study poplar genetics. Samples come from anywhere.

With our samples in hand, we can perform genetic analyses. What we do next depends on what we want to learn. If we want to sequence a genome we're going to prepare a DNA library by randomly shearing the genomic DNA and cloning the fragments into sequencing vectors. The purified cloned DNA templates are sequenced and the data we obtain are assembled into larger sequences (contigs) until, hopefully, we have a complete genome. In resequencing and other genetic assays, DNA templates are prepared from sample DNA by amplifying specific regions of a genome with PCR. The PCR products, amplicons, are sequenced and the resulting data are compared to a reference sequence to identify differences. Gene expression (EST and hybridization) analysis follows similar patterns except that RNA is purified from samples and then converted to cDNA using RT-PCR (Reverse Transcriptase PCR, not Real Time PCR - that's a genetic assay).

From a workflow point of view, we can see how the physical materials change throughout the process. Sample material is converted to DNA or RNA (nucleic acids), and the nucleic acids are further manipulated to create templates that are used for the analytical reaction (DNA sequencing, fragment analysis, RealTime-PCR, ...). As the materials flow through the lab, they're manipulated in a variety of containers. A process may begin with a sample in a tube, use a petri plate to isolate bacterial colonies, 96-well plates to purify DNA and perform reactions, and 384-well plates to collect sequence data. The movement of the materials must be tracked, along with their hierarchical relationships. A sample may have many templates that are analyzed, or a template may have multiple analyses. When we do this a lot we need a way to see where our samples are in their particular processes. We need a workflow management system, like FinchLab.
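To make the idea of tracking materials, containers, and their hierarchical relationships concrete, here is a minimal sketch. It is not how FinchLab stores its data; the class names, fields, and example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    kind: str        # e.g. a sequencing read
    container: str   # plate the reaction ran in
    well: str


@dataclass
class Template:
    name: str
    container: str
    well: str
    analyses: list = field(default_factory=list)


@dataclass
class Sample:
    name: str
    container: str
    templates: list = field(default_factory=list)


def locate(sample):
    """List (material, container, well) for a sample and everything made from it."""
    rows = [(sample.name, sample.container, "")]
    for template in sample.templates:
        rows.append((template.name, template.container, template.well))
        for analysis in template.analyses:
            rows.append((f"{template.name}/{analysis.kind}",
                         analysis.container, analysis.well))
    return rows


# One sample, one amplicon template, two reads (forward and reverse).
sample = Sample("blood-001", "tube 17")
amplicon = Template("exon3-amplicon", "PCR plate 2", "B07")
amplicon.analyses.append(Analysis("forward read", "sequencing plate 5", "C01"))
amplicon.analyses.append(Analysis("reverse read", "sequencing plate 5", "C02"))
sample.templates.append(amplicon)

for row in locate(sample):
    print(row)
```

Even this toy version shows the point: once the parent and child relationships are recorded, finding where a sample's material currently sits is a simple walk down the tree.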


Monday, March 31, 2008

Next Gen, Next Step

Congratulations! You just got approval to purchase your next generation sequencer! What are you going to do next?

Today, there is a lot being written about the data deluge accompanying Next Gen sequencers. It's true, they produce a lot of data. But even more important are the questions about how you plan to set up the lab and data workflows to turn those precious samples into meaningful information. The IT problems, while significant, are only the tip of the iceberg. If you operate a single lab, you will need to think about your experiments, how to track your samples, how to prepare DNA for analysis, how to move the data around for analysis, and how to do your analyses to get meaningful information out of the data. If you operate a core lab, you have all the same problems, but you're providing that service for a whole community of scientists. You'll need to keep their samples and data separated and secure. You also have to figure out how to get the data to your customers and how you might help them with their analyses.

Never mind that you need multiple terabytes of storage and a computer cluster. Without a plan and strategy for running your lab, organizing the data, running multistep analysis procedures, and sifting through hundreds of thousands of alignments, you'll just end up with a piece of lab art: a Next Gen sequencer, a big storage system and a computer cluster. (By the way, have you found a place for this yet?) It may look nice, but that's probably not what you had in mind.

To get the most out of your investment, you'll need to think about workflows, and how to manage those workflows.


The cool thing about Next Gen technology is the kinds of questions that can be asked with the data. Answering them requires both novel ways to work with DNA and RNA and novel ways to work with the data. We call those procedures "workflows." Simply put, a workflow describes a multistep procedure and its decision points. In each step, we work with materials and the materials may be "transformed" in the step. You can also describe a workflow as a series of steps that have inputs and outputs. Workflows are run both in the lab and on the computer.

In a protocol for isolating DNA, we can take tissue (the input), lyse the cells with detergent, bind the DNA to a resin, wash away junk, and elute purified DNA (the output). The purified DNA may then become an input to a next step, like PCR, to create an output, like a collection of amplicons. Similar processes can be used with RNA. In a Next Gen lab workflow, you fragment the DNA, ligate adaptors, and use the adaptors to attach DNA to beads or regions of a slide. From a few basic lab workflows, we can prepare genetic material for whole genome analysis, expression analysis, variation analysis, gene regulation, and other experiments in both discovery and diagnostic assays.

In a software workflow, data are the material. Input data, typically packaged in files, are processed by programs to create output data. These data or information can also be packaged in files or even stored in databases. Software programs execute the steps, and scripts often automate series of steps. Digital photography, multimedia production, and business processes all have workflows. So does bioinformatics. The difference is that bioinformatics workflows lack standards, so many people work harder than needed and spend a lot of time debugging things.
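The notion of a workflow as a series of steps with inputs and outputs can be sketched in a few lines. This is a toy illustration, not a real bioinformatics pipeline: the step names, the minimum read length of 5, and the input text are all invented for the example.

```python
from collections import Counter


def run_workflow(steps, material):
    """Run named steps in order; each step's output is the next step's input."""
    history = [("input", material)]
    for name, step in steps:
        material = step(material)
        history.append((name, material))
    return material, history


def drop_short_reads(reads, min_length=5):
    """Keep only reads that are at least min_length bases long."""
    return [read for read in reads if len(read) >= min_length]


steps = [
    ("split into reads", str.split),   # text in, list of reads out
    ("drop short reads", drop_short_reads),
    ("count identical reads", Counter),
]

raw_text = "ACGTACGT\nACG\nACGTACGT\nTTTTGGGG\n"
result, history = run_workflow(steps, raw_text)

for name, material in history:
    print(f"{name}: {material!r}")
```

Keeping the history of every intermediate output is one simple way to retain the link between raw data and final results.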

As the scale increases, the lab and analysis workflows must be managed together.


A common laboratory practice has been to collect the data, and then analyze the data in separate independent steps. Lab work is often tracked on paper, in Excel spreadsheets, or in a LIMS (Laboratory Information Management System). The linkage between lab processes, raw data, and final results is typically poor. In small projects, this is manageable. File naming conventions can track details and computer directories (folders) can be used to organize data files. But as the scale grows, the file names get longer and longer, people spend considerable time moving and renaming data, the data start to get mixed up and become harder to find, and for some reason files start to replicate themselves. Now, the lab investigates tracking problems and lost data, instead of doing experiments.

Why? Because the lab and data analysis systems are disconnected.

The good news is that Geospiza Finch products can link your lab procedures and the data handling procedures to create complete workflows for genetic analysis.


Monday, March 24, 2008

Color Space, Flow Space, Sequence Space or Outer Space: Part II, Uncertainty in Next Gen Data

Next generation sequencing will transform sequencing assays and experiments. Understanding how the data are generated is important for interpreting results.

In my last post, I discussed how all measurement systems have uncertainty (error) and how error probability is determined in Sanger sequencing. In this post, I discuss the current common next generation (Next Gen) technologies and their measurement uncertainty.

The Levels of Data - To talk about Next Gen data and errors, it is useful to have a framework for describing these data. In 2005, the late Jim Gray and colleagues published a report addressing scientific data management in the coming decade. The report defines data in three levels (Level 0, 1, and 2). Raw data (Level 0), the immediate product of analytical instruments, must be calibrated and transformed into Level 1 data. Level 1 data are then combined with other data to facilitate analysis (Level 2 datasets). Interesting science happens when we make different combinations of Level 2 data and have references to Level 1 data for verification.

How does our data level framework apply to the DNA sequencing world?

In Sanger sequencing, Level 0 data are the raw analog signal data that are collected from the laser and converted to digital information. Digital signals, displayed as waveforms, are mobility corrected and basecalled to produce Level 1 data files that contain information about the collection event, the DNA sequence (read), quality values, and averaged intensity signals. Read sequences are then combined together, or with reference data, to produce Level 2 data in the form of aligned sequence datasets. These Level 2 data have many context specific meanings. They could be a list of annotations, a density plot in a genome browser view, an assembly layout, or a panel of discrepancies that use quality values (from Level 1) to distinguish true genetic variation from errors (uncertainty) in the data.

Next Gen data are different.

When the "levels of data" framework is used to explore Next Gen sequencing technology, we see fundamental differences between Level 0 and Level 1 data. In Sanger sequencing, we perform a reaction to create a mixture of fluorescently tagged molecules that differ in length by a single base. The mixture is then resolved by electrophoresis with continual detection of these size separated tagged DNA molecules to ultimately create a DNA sequence. Uncertainties in the basecalls are related to electrophoresis conditions, compressions due to DNA secondary structure, and the fact that some positions have mixed bases because the sequencing reactions contain a collection of molecules.

In Next Gen sequencing, molecules are no longer separated by size. Sequencing is done "in place" on DNA molecules that were amplified from single molecules. These amplified DNA molecules are either anchored to beads (that are later randomly bound to slides or picoliter wells) or anchored to random locations on slides. Next Gen reactions then involve multiple cycles of chemical reaction (flow) followed by detection. The sample preparation methods and reaction/detection cycles are where the current Next Gen technologies greatly differ and have their unique error profiles.

Next Gen Level 0 data change from electropherograms to huge collections of images produced through numerous cycles of base (or DNA) extensions and image analysis. Unlike Sanger sequencing, where both raw and processed data can be stored relatively easily in single files, Next Gen Level 0 data sets can be quite large and multiple terabytes can be created per run. Presently there is active debate on the virtues of storing Level 0 data (the Data Life Cycle). The Level 0 image collections are used to create Level 1 data. Level 1 data are a continuum of information that is ultimately used to produce reads and quality values. Next Gen quality values reflect the basic underlying sequencing chemistry and primary error modes, and thus calculations need to be optimized for each platform. For those of you who are familiar with phred, these are analogous to the settings in the phredpar.dat file. The other feature of Level 1 data is that they can be expressed differently.


The Spaces

Flow Space - The Roche 454 technology is based on pyrosequencing [1] and the measured signals are a function of how many bases are incorporated in a base addition cycle. In pyrosequencing, we measure the pyrophosphate (PPi) that is released when a nucleotide is added to a growing DNA chain. If multiple bases are added in a cycle, two or more A's for example, proportionally more PPi is released and the light detected is more intense. As more bases are added in a row, the relative increase in light decreases; an 11/10 change ratio, for example, is much lower than 2/1. Consequently, when there are longer sequences with the same base (e.g. AAAAA), it becomes harder to count the number of bases accurately, and the error rate increases. Flow space describes a sequence in terms of base incorporations. That is, the data are represented as an ordered string of bases plus the number of bases at each base site. The 454 software performs alignment calculations in flow space.
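As a simplified sketch of the flow space idea, a sequence can be written as bases plus run lengths, and the shrinking signal step can be computed directly. This is an illustration only: a real flowgram records a measured signal for every nucleotide flow, including flows where nothing is incorporated, and the function names here are hypothetical.

```python
from itertools import groupby


def to_flow_space(sequence):
    """Collapse a sequence into (base, run length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(sequence)]


def from_flow_space(flows):
    """Expand (base, run length) pairs back into a sequence."""
    return "".join(base * count for base, count in flows)


def relative_signal_step(run_length):
    """Signal ratio between run_length + 1 and run_length bases,
    assuming light is proportional to the number of bases incorporated."""
    return (run_length + 1) / run_length


flows = to_flow_space("TTAAAAAG")
print(flows)
print(from_flow_space(flows))

# Telling 1 base from 2 is a 2/1 change; telling 10 from 11 is only 11/10.
print(relative_signal_step(1), relative_signal_step(10))
```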

Color Space - The Applied Biosystems SOLiD technology uses DNA ligation [2] with collections of synthetic DNA molecules (oligos) that contain two nested known bases with a fluorescent dye. In each cycle these two bases are read at intervals of five bases. Getting full sequence coverage (25, 35, or more bases) requires multiple ligation and priming cycles such that each base is sequenced twice. The SOLiD technology uses this redundancy to decrease the error probability. The Level 1 sequence is reported in a “color” space, where each of the numbers 0, 1, 2, and 3 represents one of the fluorescent colors. Color space must be decoded into a DNA sequence at the final stage of data processing. As with flow space, it is best to perform alignments using data in color space. Since each color represents two bases, decoding color space requires that the first base be known. This base comes from the adapter.
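A short sketch shows why decoding needs a known first base, and why it is better to align in color space. It assumes the commonly described two-base encoding in which A, C, G, and T are numbered 0 to 3 and a color is the XOR of two adjacent base numbers; the post does not spell out the mapping, so treat that detail and the function names as assumptions.

```python
BASES = "ACGT"  # assumed numbering: A=0, C=1, G=2, T=3


def encode_colors(sequence):
    """Return the first base and one color digit per adjacent base pair."""
    colors = "".join(
        str(BASES.index(a) ^ BASES.index(b))
        for a, b in zip(sequence, sequence[1:])
    )
    return sequence[0], colors


def decode_colors(first_base, colors):
    """Rebuild the bases one at a time, starting from the known first base."""
    sequence = [first_base]
    for color in colors:
        previous = BASES.index(sequence[-1])
        sequence.append(BASES[previous ^ int(color)])
    return "".join(sequence)


first, colors = encode_colors("ATGGC")
print(first, colors)
print(decode_colors(first, colors))

# One miscalled color changes every base after it once decoded.
print(decode_colors("A", "3203"))
```

Because a single color error corrupts all downstream bases after decoding, comparing reads to a reference in color space keeps that error confined to one position.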

Sequence Space - Illumina’s Solexa technology uses single base extensions of fluorescently labeled nucleotides with protected 3'-OH groups. After base addition and detection, the 3'-OH is deprotected and the cycle is repeated. The error probabilities are calculated in different ways, first by analyzing the raw intensity files (remember Firecrest and Bustard) and then by alignment with the ELAND program to compute an empirical error probability. With Solexa data, errors occur more frequently at the ends of reads.

So, what does this all mean?

Next Gen technologies are all different in the way data are produced, they all produce different kinds of Level 1 data, and the Level 1 data are best analyzed within their own unique "space." This can have data integration implications depending on how you might want to combine data sets (Level 1 vs Level 2). Clearly, it is important to have easy access to quality values and error rates to help troubleshoot issues with runs and samples. The challenge is sifting out the important data from a morass of sequence files, quality files, data encodings, and other vagaries of the instruments and their software. Geospiza is good at this and we can help.

In terms of science, many assays and experiments, like Tag and Count, or whole genome assembly, can use redundancy to overcome random data errors. Data redundancy can also be used to develop rules to validate variations in DNA sequences. But this is the tip of the iceberg. With Next Gen's extremely high sampling rates and molecular resolution, we can begin to think of finding very rare differences in complex mixtures of DNA molecules and use sequencing as a quantitative assay. In these cases, understanding how the data are created and their corresponding uncertainties is a necessary step toward making full use of the data. The good news is that there is some good work being done in this area.

Remember, the differences we measure must be greater than the combined uncertainty of our measurements.

Further reading
1. Ronaghi, M., M. Uhlen, and P. Nyren, "A sequencing method based on real-time pyrophosphate." Science, 1998. 281(5375): p. 363, 365.

2. Shendure, J., G.J. Porreca, N.B. Reppas, et al., "Accurate multiplex polony sequencing of an evolved bacterial genome." Science, 2005. 309(5741): p. 1728-32.

3. "Quality scores and SNP detection in sequencing-by-synthesis systems." http://www.genome.org/cgi/content/abstract/gr.070227.107v2

4. "Scientific data management in the coming decade." http://research.microsoft.com/research/pubs/view.aspx?tr_id=860

5. Solexa quality values. http://rcdev.umassmed.edu/pipeline/Alignment%20Scoring%20Guide%20and%20FAQ.html
http://rcdev.umassmed.edu/pipeline/What%20do%20the%20different%20files%20in%20an%20analysis%20directory%20mean.html


Monday, March 17, 2008

Color Space, Flow Space, Sequence Space, or Outer Space: Part I. Uncertainty in DNA Sequencing

Next generation DNA sequencing introduces new concepts like color space, flow space, and sequence space. You might ask, what's a space? How do I deal with these spaces? Why are they important?

In this two part blog, I will first talk about error analysis in DNA sequencing. Next I will talk about how we might think about error analysis in next generation sequencing.

Last week I came across a story about an MIT physics professor, Walter Lewin, who captivates his student audiences with his lectures and creative demonstrations. MIT and iTunes have 100 of his lectures online. I checked out the first one - your basic first college physics lecture that focuses on measurement and dimensional analysis - and agree, Lewin is captivating. I watched the entire lecture, and it made me think about DNA sequencing.

In the lecture, Lewin proves "physics works!" and shows that his grandmother was right when she said that you are an inch taller when lying down than when standing up. He used a student subject and measured his length lying down and standing up. Sure enough, the student was an inch longer lying down. But that was not the point. The point was - Lewin proved his grandmother was right because the change in the student's length was greater than the uncertainty of his measuring device (the ruler). Every measurement we make has uncertainty, or error, and for a comparison to be valid the difference in measures has to be greater than their combined uncertainties.
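The rule can be written down in a couple of lines. This is a sketch under stated assumptions: the two uncertainties are simply added (a worst-case combination), and the measurements are made-up numbers, not values from the lecture.

```python
def difference_is_meaningful(value_a, value_b, uncertainty_a, uncertainty_b):
    """True when two measurements differ by more than their combined uncertainty.

    The uncertainties are added together, which is a worst-case assumption.
    """
    combined_uncertainty = uncertainty_a + uncertainty_b
    return abs(value_a - value_b) > combined_uncertainty


# Hypothetical lengths in centimeters, each measured to +/- 0.1 cm.
print(difference_is_meaningful(183.0, 185.5, 0.1, 0.1))  # 2.5 cm apart
print(difference_is_meaningful(183.0, 183.1, 0.1, 0.1))  # 0.1 cm apart
```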

What does this have to do with DNA sequencing?

Each time we collect DNA sequence data we are making many measurements. That is, we are determining the bases of a DNA sample template in an in vitro replication process that allows us to "read" each base of the sequence. The measurements we collect, the string of DNA bases, therefore have uncertainty. We call this uncertainty in base measurement the error probability. In Sanger sequencing, Phil Green and Brent Ewing developed the Phred basecalling algorithm to measure per base error probabilities.

Error probabilities are small numbers (1/100, 1/10,000, 1/1,000,000). Rather than work with small fractions and decimal values with many leading zeros, we express error probabilities as positive whole integers, called quality values (QVs), by applying a transformation:

QV = -10*log10(P), where P is the error probability.

With this transformation our 1/100, 1/10,000, and 1/1,000,000 error probabilities become QVs of 20, 40, and 60, respectively.
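The transformation QV = -10 × log10(P), and its inverse, can be tried out directly. This is a small sketch with hypothetical function names; results are rounded for display because floating point logarithms are not always exact.

```python
import math


def error_probability_to_qv(probability):
    """Convert an error probability (0 < P <= 1) to a quality value."""
    if not 0 < probability <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(probability)


def qv_to_error_probability(quality_value):
    """Convert a quality value back to an error probability."""
    return 10 ** (-quality_value / 10)


for probability in (1 / 100, 1 / 10_000, 1 / 1_000_000):
    print(probability, round(error_probability_to_qv(probability)))

print(qv_to_error_probability(30))  # one error in a thousand bases
```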

The Phred basecalling algorithm has had a significant impact on DNA sequencing because it demonstrated that we could systematically measure the uncertainty of each base determination in a DNA sequence. Over the past 10 years, Phred quality values have been calibrated through many resequencing projects and are thus statistically powerful. An issue with Phred, and any basecaller, however, is that it must be calibrated for different electrophoresis instruments (measurement devices), which is why different errors and error rates can be observed with different combinations of basecallers and instruments.

Sequencing redundancy also reduces error probabilities

The gold standard in DNA sequencing is to sequence both strands of a DNA molecule. This is for good reason. Each strand represents an independent measurement. If our measurements agree, they can be better trusted, and if they disagree one needs to look more closely at the underlying data, or remeasure. This concept was also incorporated into Green's assembly program Phrap (unpublished).
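To see why agreement between strands is worth so much, consider a rough calculation. It assumes the two reads are fully independent, so the chance that both are wrong at a position is the product of their error probabilities, which is the same as adding their quality values. This is a simplification for illustration, not a description of how Phrap or any other program combines reads.

```python
import math


def qv_to_probability(quality_value):
    """Error probability for a quality value."""
    return 10 ** (-quality_value / 10)


def combined_qv(forward_qv, reverse_qv):
    """Quality value for 'both reads are wrong', assuming independent errors."""
    both_wrong = qv_to_probability(forward_qv) * qv_to_probability(reverse_qv)
    return -10 * math.log10(both_wrong)


# Two agreeing Q20 reads (1 error in 100 each) behave like one Q40 call.
print(round(combined_qv(20, 20)))
print(round(combined_qv(20, 30)))
```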

Within the high throughput genomics community it is well understood that increasing the redundancy of data collection reduces error. In theory, one can automate the interpretation of DNA sequencing experiments, or assays, by collecting data at sufficient redundancy. The converse is also true, and I see people work the hardest at manually reviewing data when they do not collect enough. This is most common with variant detection resequencing assays.

Why isn't high redundancy data collection routine?

The challenges with high redundancy data collection in Sanger sequencing involve the high relative costs of collecting data and higher costs of collecting data from single molecules. Next generation (Next Gen) sequencing changes this landscape.

The higher throughput rates and lower costs of Next Gen sequencing hold great promise for revolutionizing genomics research and molecular diagnostics. In a single instrument run, an Expressed Sequence Tag (EST) experiment can yield millions of sequences and detect rare transcripts that cannot be found any other way [1-3]. In cancer research, high sampling rates will allow for the detection of rare sequence variants in populations of tumor cells that could be prognostic indicators or provide insights for new therapeutics [1, 4, 5]. In viral assays, it will be possible to determine the sequence of individual viral genomes and detect drug resistant strains as they appear [6, 7]. Next Gen sequencing has considerable appeal because the large numbers of sequences that can be obtained make statistical calculations more valid.

Making statistical calculations valid, however, requires that we understand the inherent uncertainty of our measuring devices, in this case the different Next Gen genetic analyzers. That's where color space, flow space, and other spaces come into play.

Further Reading
1. Meyer, M., U. Stenzel, S. Myles, K. Prufer, and M. Hofreiter, Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res, 2007. 35(15): p. e97.
2. Korbel, J.O., A.E. Urban, J.P. Affourtit, et al., Paired-end mapping reveals extensive structural variation in the human genome. Science, 2007. 318(5849): p. 420-6.
3. Wicker, T., E. Schlagenhauf, A. Graner, T.J. Close, B. Keller, and N. Stein, 454 sequencing put to the test using the complex genome of barley. BMC Genomics, 2006. 7: p. 275.
4. Taylor, K.H., R.S. Kramer, J.W. Davis, J. Guo, D.J. Duff, D. Xu, C.W. Caldwell, and H. Shi, Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing. Cancer Res, 2007. 67(18): p. 8511-8.
5. Highlander, S.K., K.G. Hulten, X. Qin, et al., Subtle genetic changes enhance virulence of methicillin resistant and sensitive Staphylococcus aureus. BMC Microbiol, 2007. 7(1): p. 99.
6. Wang, G.P., A. Ciuffi, J. Leipzig, C.C. Berry, and F.D. Bushman, HIV integration site selection: analysis by massively parallel pyrosequencing reveals association with epigenetic modifications. Genome Res, 2007. 17(8): p. 1186-94.
7. Hoffmann, C., N. Minkah, J. Leipzig, G. Wang, M.Q. Arens, P. Tebas, and F.D. Bushman, DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res, 2007. 35(13): p. e91.


Thursday, March 13, 2008

What's a Bustard?

For that matter, what's a Firecrest? or a Gerald? Many with an Illumina Genome Analyzer are now learning these are the directories that have the data they may be interested in.

What's in those directories?

In this post, we explore some of the data in the directories, talk about what data might be important, and use FinchLab Next Gen Edition (FinchLab NG) to look at some of the files. In the Next Gen world we are also going to be learning about the data life cycle. When you are thinking about how to store three or four or ten terabytes (TB) of data for each run, and considering that you might run your instrument 40 or 50 times or more in the next year, you might stop and ask the question, "how much of that data is really important and for how long?" That's the data life cycle. It's going to be important.

To begin our understanding, let's look at the data being created in a run. When an Illumina Genome Analyzer (aka Solexa) collects data, many things happen. First, images are collected for each cycle in a run and tile in a lane on a slide. They're pretty small, but there are a lot, maybe 360,000 or so, and they add up to the terabytes we talk about. These images are analyzed to create tens of thousands (about 100 gigabytes [GB] worth) of "raw intensity files" that go in the Firecrest directory. Next, a base-calling algorithm reads the raw intensity files to create sequence, quality and other files (about 80 GB worth) that go in the Bustard directory. The last step is the Eland program pipeline. It reads the Bustard files, aligns their data to reference sequences, makes more quality calculations, and creates more files. These data go in the Gerald directory to give about 25 or 30 GB of sequence and quality data.

So, what's the best data to work with? That depends on what problem you are trying to solve. Specialists developing new basecalling tools or alignment tools might focus on the data in Firecrest and Bustard. Most researchers, however, are going to work with data in the Gerald directory. That reduces our TB problem down to a tens of GB problem. That's a big difference!
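A quick back-of-the-envelope calculation shows how much the choice matters over a year. The per-run sizes and the 40 to 50 runs come from the estimates in this post; the conversion of 1 TB = 1000 GB and all names in the code are assumptions made for the sketch.

```python
GB_PER_TB = 1000  # assumed decimal conversion

# (low, high) estimates of data per run, in GB.
PER_RUN_GB = {
    "images": (3 * GB_PER_TB, 10 * GB_PER_TB),
    "Firecrest": (100, 100),
    "Bustard": (80, 80),
    "Gerald": (25, 30),
}
RUNS_PER_YEAR = (40, 50)


def yearly_range_tb(data_kind):
    """Low and high estimate of yearly storage for one kind of data, in TB."""
    low_gb, high_gb = PER_RUN_GB[data_kind]
    low_runs, high_runs = RUNS_PER_YEAR
    return low_gb * low_runs / GB_PER_TB, high_gb * high_runs / GB_PER_TB


for kind in PER_RUN_GB:
    low, high = yearly_range_tb(kind)
    print(f"{kind}: {low:g} to {high:g} TB per year")
```

Under these assumptions, keeping only the Gerald data means planning for a terabyte or two per year rather than hundreds.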

FinchLab NG can help.

FinchLab NG gives you the LIMS capabilities to run your Next Gen laboratory workflows and track which samples go on which slides and where on the slide the samples go. We call this part the run. When a run is complete you can link the data on your filesystem to FinchLab NG and use the web interfaces to explore the data. You can also link specific data files to samples. So, if you are sharing data or operating a core lab your researchers can easily access their data through their Finch account.

The screen shot below gives an example of how the HTML quality files can be explored. It shows two windows; the one on the left is FinchLab NG with data for a Solexa run. You can see that the directory has 3606 files and a number are htm summary files. You can find these 12 files in that directory of 3606 files by entering "htm" in the Finch Finder. The window on the right was obtained by clicking the "Intensity Plots" link that is directly below the info table and just above the data list. In this example the intensity plots are shown for each tile of the 8 lanes on the slide. To see this better, click the image and zoom in with your browser.


Sunday, March 9, 2008

Using the Finder

FinchLab, iFinch, Finch Suite, and FinchLab Next Gen Edition all use a tool that we call the Finder to help you locate data by selected criteria.

This video shows some quick tips on using the Finder with iFinch as an example.


Using the finder from Sandra Porter on Vimeo.


Sunday, February 24, 2008

Next Gen Sequencing Software

In my last post, I indicated that the next generation (Next Gen) of DNA sequencers was creating a lot of excitement in DNA sequencing. In the next couple of posts I want to share some of our plans for supporting Next Gen by discussing the poster that we presented at the AGBT and ABRF conferences.

The general goal of the poster was to share our thoughts on how Next Gen data will have to be dealt with in order to develop more scalable and interoperable data processing software. It presented our work with HDF (hierarchical data format) technology and how that fits into Geospiza's plans for meeting Next Gen data management challenges. The first phase, now complete, provides our customers with a solution that links samples to their data and puts in place the foundation needed for the second phase, which focuses on developing and integrating the scientific data processing applications that will make sense of the data.

Once a lab is up and running with Next Gen technology, it quickly faces the data management problem. Basic file system technology and file servers allow groups to store their data in nested directory structures. After a few runs, however, people realize that it gets really hard to know what data go with which run or with which sample - the Excel file storing that information gets lost, or the README file didn't get written. The situation becomes even worse when Next Gen instruments are run in the context of a core lab. Now the problem is exacerbated because you need to make the data available to your customers. Do you set up an FTP site? Or, do you make Unix accounts on the file server for your end users? Or, do you deliver data on firewire drives or multigigabyte flash drives? Or do you just do the work for your client and hope that they do not want to reanalyze their data?

Geospiza has solved the first part of the problem. Our new product FinchLab Next Gen Edition allows labs to track sample preparation (DNA library construction, reagent kit tracking, and workflow organization) and link data to runs and samples. FinchLab Next Gen Edition also provides interfaces so that core labs can create a variety of order forms for any kind of service and link data to runs, samples, and orders, making data accessible to outside parties (customers) through the FinchLab web browser user interface. And all of this can be done without any custom programming services. Over the next few weeks, I'll fill in the details on how we can do that. For now, I'll focus on the poster, but make a final important point that FinchLab Next Gen Edition not only interoperates with all of the current Next Gen instruments, it also allows labs to integrate these data with current Sanger sequencing technologies in a common web interface.

With sample tracking and data management under control, the next challenge becomes what to do with the data that are collected. The scientific community is just at the beginning of this journey. In the past, bioinformatics efforts have emphasized algorithm (tools) development over the implementation details associated with data management, organization, and intuitive user interfaces. The result is software systems, built from point solutions, that do not adequately address problems outside of “expert” organizations. If a scientist wants to work with sequence data to understand a biological research problem, they must overcome the challenges of moving data between complex software programs, reformatting unstructured text files, traversing web sites, and writing programs and scripts to mine new output files before they can even begin to gain insights into their problem of interest. While formats and standards have always been discussed and debated, many people working with Next Gen data understand that the "single point" solution approaches of the past will not scale to today's problems.

That's where HDF fits in. It is clear to the community that new software systems with scalable data models, APIs, software tools, and viewers are needed to support the very large datasets being created by new technology. Geospiza is working with The HDF Group (THG, www.hdfgroup.org) to deliver these capabilities by building on a recognized technology that has proven its ability to meet similar scalability demands in other areas of science [1]. We call the extensible domain-specific data technologies that will be built "BioHDF." BioHDF will provide the DNA sequencing community with a standardized infrastructure to support high-throughput data collection and analysis and engage an informatics group (THG) that is highly experienced in large-scale data management issues. The technology will make it possible to overcome current computational barriers that hinder analyses. Computer scientists will not have to “reinvent” file formats when developing new computational tools or interfaces to meet scalability demands, and biologists will have new programs with the improved performance needed to work with larger datasets.

In the next post, I'll present the case for HDF.

Reference: [1] HDF-EOS "HDF-EOS Tools and Information Center," http://hdfeos.org

Get the poster: NextGenHDF.pdf
