Friday, August 8, 2008

ChIP-ing Away at Analysis

ChIP-Seq is becoming a popular way to study the interactions between proteins and DNA. This new technology is made possible by Next Gen sequencing techniques and sophisticated tools for data management and analysis. Next Gen DNA sequencing provides the power to collect the large amounts of data required. FinchLab is the software system needed to track the lab steps, initiate analysis, and see your results.

In recent posts, we stressed the point that unlike Sanger sequencing, Next Gen sequencing demands that data collection and analysis be tightly coupled, and presented our initial approach of analyzing Next Gen data with the Maq program. We also discussed how the different steps (basecalling, alignment, statistical analysis) provide a framework for analyzing Next Gen data and described how these steps belong to three phases: primary, secondary, and tertiary data analysis. Last, we gave an example of how FinchLab can be used to characterize data sets for Tag Profiling experiments. This post expands the discussion to include characterization of data sets for ChIP-Seq.

ChIP-Seq

ChIP (Chromatin Immunoprecipitation) is a technique where DNA binding proteins, like transcription factors, can be localized to regions of a DNA molecule. We can use this method to identify which DNA sequences control expression and regulation for diverse genes. In the ChIP procedure, cells are treated with a reversible cross-linking agent to "fix" proteins to other proteins that are nearby, as well as the chromosomal DNA where they're bound. The DNA is then purified and broken into smaller chunks by digestion or shearing, and antibodies are used to precipitate any protein-DNA complexes that contain their target antigen. After the immunoprecipitation step, unbound DNA fragments are washed away, the bound DNA fragments are released, and their sequences are analyzed to determine the DNA sequences that the proteins were bound to. Only a few years ago, this procedure was much more complicated than it is today; for example, the fragments had to be cloned before they could be sequenced. When microarrays became available, a microarray-based technique called ChIP-on-chip made this assay more efficient by allowing a large number of precipitated DNA fragments to be tested in fewer steps.

Now, Next Gen sequencing takes ChIP assays to a new level [1]. In ChIP-seq the same cross-linking, isolation, immunoprecipitation, and DNA purification steps are carried out. However, instead of hybridizing the resulting DNA fragments to a DNA array, the last step involves adding adaptors and sequencing the individual DNA fragments in parallel. When compared to microarrays, ChIP-seq experiments are less expensive, require fewer hands-on steps, and benefit from the lack of hybridization artifacts that plague microarrays. Further, because ChIP-seq experiments produce sequence data, they allow researchers to interrogate the entire chromosome. The experimental results are no longer limited to the probes on the microarray. ChIP-Seq data are better at distinguishing similar sites and collecting information about point mutations that may give insights into gene expression. No wonder ChIP-Seq is growing in popularity.

FinchLab

To perform a ChIP-seq experiment, you need to have a Next Gen sequencing instrument. You will also need to have the ability to run an alignment program and work with the resulting data to get your results. This is easier said than done. Once the alignment program runs, you might also have to run additional programs and scripts to translate raw output files into meaningful information. The FinchLab ChIP-seq pipeline, for example, runs Maq to generate the initial output, then runs Maq pileup to convert the data to a pileup file. The pileup file is then read by a script to create the HTML report, thumbnail images to see what is happening, and "wig" files that can be viewed in the UCSC Genome Browser. If you do this yourself, you have to learn the nuances of the alignment program, learn how to run it in different ways to create the data sets, and write the scripts to create the HTML reports, graphs, and wig files.
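To give a feel for the kind of script involved, here is a minimal sketch in Python of the pileup-to-wig step. It is not the FinchLab pipeline script. It assumes each pileup line is whitespace-separated with chromosome, 1-based position, reference base, and read depth in the first four columns, and it writes the variableStep flavor of the wiggle format. The function name and track name are made up for illustration.

```python
def pileup_to_wig(pileup_lines, track_name="ChIP-seq depth"):
    """Convert pileup-style lines to variableStep wiggle text.

    Assumption: the first four whitespace-separated columns of each line are
    chromosome, 1-based position, reference base, and read depth.
    """
    depth_by_chrom = {}
    for line in pileup_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, pos, depth = fields[0], int(fields[1]), int(fields[3])
        if depth > 0:  # positions with no coverage are simply left out
            depth_by_chrom.setdefault(chrom, []).append((pos, depth))

    out = [f'track type=wiggle_0 name="{track_name}"']
    for chrom, points in depth_by_chrom.items():
        out.append(f"variableStep chrom={chrom}")
        for pos, depth in sorted(points):
            out.append(f"{pos} {depth}")
    return "\n".join(out) + "\n"
```

Even this small step carries decisions (which columns to trust, whether to drop zero-depth positions) that you would otherwise have to make and maintain yourself.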

With FinchLab, you can skip those steps. You get the same results by clicking a few links to sort the data, and a few more to select the files, run the pipeline, and view the summarized results. You can also click a single link to send the data to the UCSC genome browser for further exploration.


Reference

[1] ChIP-seq: welcome to the new frontier. Nature Methods 4, 613-614 (2007)


Thursday, July 31, 2008

Questions from our mailbag: How do I cite FinchTV?

One of the questions that appears in our mailbox from time to time concerns citing FinchTV or other Geospiza products. A quick search with Google Scholar for "FinchTV" finds 42 examples where FinchTV was cited in research publications. Most of the citations seem to follow the same conventions.

We recommend citing FinchTV as you would any other experimental software tool, instrument, or reagent. The citation should include the version of the program, the company, the location, and the web site. Other Geospiza products (FinchLab, Finch Suite, and iFinch) may be cited in a similar manner.

In our case, a citation would most likely read:

FinchTV 1.4.0 (Geospiza, Inc.; Seattle, WA, USA; http://www.geospiza.com)

If you're not sure which version of FinchTV you're using, open the About menu. The version number will appear on the page.

It would also be a good idea to check with the journal where you plan to submit the article. Most journals have a set of instructions for authors where they provide example citations.


Friday, June 13, 2008

Finch 3, Linking Samples and Data

One of the big challenges with Next Gen sequencing is linking sample information with data. People tell us: "It's a real problem." "We use Excel, but it is hard." "We're losing track."

Do you find it hard to connect sample information with all the different types of data files? If so, you should look at FinchLab.

A review:

About a month ago, I started talking about our third version of the Finch platform and introduced the software requirements for running a modern lab. To review, labs today need software systems that allow them to:

1. Set up different interfaces to collect experimental information
2. Assign specific workflows to experiments
3. Track the workflow steps in the laboratory
4. Prepare samples for data collection runs
5. Link data from the runs back to the original samples
6. Process data according to the needs of the experiment

In FinchLab, order forms are used to first enter sample information into the system. They can be created for specific experiments and the samples entered will, most importantly, be linked to the data that are produced. The process is straightforward. Someone working with the lab, a customer or collaborator, selects the appropriate form and fills out the requested information. Later, an individual in the lab reviews the order and, if everything is okay, chooses the "processing" state from a menu. This action "moves" the samples into the lab where the work will be done. When the samples are ready for data collection they are added to an "Instrument run." The instrument run is Finch's way of tracking which samples go in what well of a plate or lane/chamber on a slide. The samples are added to the instrument and data are collected.

The data

Now comes the fun part. If you have a Next Gen system, you'll ultimately end up with thousands of files scattered in multiple directories. The primary organization for the data will be in unix-style directories, which are like Mac or Windows folders. Within the directories you will find a mix of sequence files, quality files, files that contain information about run metrics, and possibly images. You'll have to make decisions about what to save for long-term use and what to archive or delete.

As noted, the instrument software organizes the data by the instrument run. However, a run can have multiple samples, and the samples can be from different experiments. A single sample can be spread over multiple lanes and chambers of a slide. If you are running a core lab, the samples will come from different customers and your customers often belong to different lab groups. And there is the analysis. The programs that operate on the data require specific formats for input files and produce many kinds of output files. Your challenge is to organize the data so that it is easy to find and access in a logical way. So what do you do?

Organizing data the hard way

If you do not have a data management system, you'll need to write down which samples go with which person, group or experiment. That's pretty simple. You can tape a piece of paper on the instrument and write this down, or you can diligently open a file, commonly an Excel spreadsheet, and record the info there. Not too bad, after all there are only a handful of partitions on a slide (2, 8, 16) and you only run the instrument once or twice a week. If you never upgrade your instrument, or never try and push too many samples through, then you're fine. Of course the less you run your instrument the more your data cost and the goal is to get really good at running your instrument, as frequently as possible. Otherwise you look bad at audit time.

Let's look at a scenario where the instrument is being run at maximal throughput. Over the course of a year, data from between 200 and 1000 slide lanes (chambers) may be collected. These data may be associated with hundreds or thousands of samples and belong to a few or many users in one or many lab groups. The relevant sequence files range from a few hundred megabytes to gigabytes in size; they exist in directories with run quality metrics and possibly analysis results. To sort this out you could have committee meetings to determine whether data should be organized by sample, experiment, user, or group, or you could just pick an organization. Once you've decided on your organization you have to set up access. Does everyone get a unix account? Do you set up SAMBA services? Do you put the data on other systems like Macs and PCs? What if people want to share? The decisions and IT details are endless. Regardless, you'll need a battery of scripts to automate moving data around to meet your organizational scheme. Or you could do something easier.

Organizing data the Finch way

One of FinchLab's many strengths is how it organizes Next Gen data. Because the system tracks samples and users, and has group and permissions models, issues related to data access and sharing are simplified. After a run is complete, the system knows which data files go with which samples. It also knows which samples were submitted by each user. Thus data can be maintained in the run directories that were created by the instrument software to simplify file-based organization. When a run is complete in FinchLab, a data link is made to the run directory. The data link informs the system which files go with a run. Data processing routines in the system sort the data into sequences, quality metric files, and other data. At this stage data are associated with samples. Once this is done, the lab has easy access to the data via web pages. The lab can also make decisions about access to data and how to analyze the data. These last two features make FinchLab a powerful system for core labs and research groups. With only a few clicks your data are organized by run, user, group, and experiment - and you didn't have to think about it.
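The core idea, leaving files where the instrument wrote them and recording which lane of which run held which sample, can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not FinchLab's data model; the class, field, and function names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneAssignment:
    """One row of a hypothetical instrument-run sheet."""
    run_id: str
    lane: int
    sample: str
    user: str


def files_by_sample(assignments, run_files):
    """Group file paths by sample, using (run_id, lane) as the link.

    run_files is an iterable of (run_id, lane, path) tuples; files from
    lanes with no recorded assignment are ignored.
    """
    lane_to_sample = {(a.run_id, a.lane): a.sample for a in assignments}
    grouped = defaultdict(list)
    for run_id, lane, path in run_files:
        sample = lane_to_sample.get((run_id, lane))
        if sample is not None:
            grouped[sample].append(path)
    return dict(grouped)
```

Because the link is a lookup rather than a directory layout, a sample spread over several lanes, or several runs, still comes back as one list of files.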




Monday, May 26, 2008

Finch 3: Managing Workflows

Genetic analysis workflows begin with RNA or DNA samples and end with results. In between, multiple lab procedures and steps are used to transform materials, move samples between containers, and collect the data. Each kind of data collected and each data collection platform requires that different laboratory procedures are followed. When we analyze the procedures, we can identify common elements. A large number of unique workflows can be created by assembling these elements in different ways.

In the last post, we learned about the FinchLab order form builder and some of its features for developing different kinds of interfaces for entering sample information. Three factors contribute to the power of Finch orders. First, labs can create unique entry forms by selecting items like pull-down menus, check boxes, radio buttons, and text entry fields for numbers or text, from a web page. No programming is needed. Second, for core labs with business needs, the form fields can be linked to diverse price lists. Third, and the subject of this post, the forms are also linked to different kinds of workflows.

What are Workflows?

A workflow is a series of steps that must be performed to complete a task. In genetic analysis, there are two kinds of workflows: those that involve laboratory work, and those that involve data processing and analysis. The laboratory workflows prepare sample materials so that data can be collected. For example, in gene expression studies, RNA is extracted from a source material (cells, tissue, bacteria), and converted to cDNA for sequencing. The workflow steps may involve purification, quality analysis on agarose gels, concentration measurements, and reactions where materials are further prepared for additional steps.

The data workflows encompass all the steps involved in tracking, processing, managing, and analyzing data. Sequence data are processed by programs to create assemblies and alignments that are edited or interrogated to create genomic sequences, discover variation, understand gene expression, or perform other activities. Other kinds of data workflows, such as microarray analysis or genotyping, involve developing and comparing data sets to gain insights. Data workflows involve file manipulations, program control, and databases. The challenge for the scientist today, and the focus of Geospiza's software development, is to bring the laboratory and data workflows together.

Workflow Systems

Workflows can be managed or unmanaged. Whether you work at the bench or work with files and software, you use a workflow any time you carry out a procedure with more than one step. Perhaps you write the steps in your notebook, check them off as you go, and tape in additional data like spectrophotometer readings or photos. Perhaps you write papers in Word and format the bibliography with Endnote or resize photos with Photoshop before adding them to a blog post. In all these cases you performed unmanaged workflows.

Managing and tracking workflows becomes important as the number of activities and number of individuals performing them increase in scale. Imagine your lab bench procedures performed multiple times a day with different individuals operating particular steps. This scenario occurs in core labs that perform the same set of processes over and over again. You can still track steps on paper, but it's not long before the system becomes difficult to manage. It takes too much time to write and compile all of the notes, and it's hard to know which materials have reached which step. Once a system goes beyond the work of a single person, paper notes quit providing the right kinds of overviews. You now need to manage your workflows and track them with a software system.

A good workflow system allows you to define the steps in your protocols. It will provide interfaces to move samples through the steps and also provide ways to add information to the system as steps are completed. If the system is well-designed, it will not allow you to do things at inappropriate times or require too much "thinking" as the system is operated. A well-designed system will also reduce complexity and allow you to build workflows through software interfaces. Good systems give scientists the ability to manage their work; they do not require their users to learn arcane programming tools or resort to custom programming. Finally, the system will be flexible enough to let you create as many workflows as you need for different kinds of experiments and link those workflows to data entry forms so that the right kind of information is available to the right process.

FinchLab Workflows

The Geospiza FinchLab workflow system meets the above requirements. The system has a high-level workflow that understands that some processes require little tracking (a quick test) and others require more significant tracking ("I want to store and reuse DNA samples"). More detailed processes are assigned workflows that consist of three parts: a name, a "State," and a "Status." The "State" controls the software interfaces and determines which information is presented and accessed at different parts of a process. A sequencing or genotyping reaction, for example, cannot be added to a data collection instrument until it is "ready." The "Status" part specifies the steps of the process. These steps (Statuses) are defined by the lab and added to a workflow using the web interfaces. When a workflow is created, it is given a name, as many steps as needed, and it is assigned a State. The workflows are then assigned to different kinds of items so that the system always knows what to do next with the samples that enter.
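One way to picture the name / State / Status idea is the small Python sketch below. It is a guess at the shape of such a model, not FinchLab's implementation: here the workflow's State is assumed to be granted only after every Status has been completed, and all class, method, and step names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    state: str      # State granted once every Status is complete (assumption)
    statuses: list  # ordered, lab-defined steps


@dataclass
class TrackedSample:
    label: str
    workflow: Workflow
    completed: int = 0  # number of Statuses finished so far

    def complete_step(self):
        """Mark the next Status as done and return its name."""
        if self.completed >= len(self.workflow.statuses):
            raise ValueError("all steps are already complete")
        status = self.workflow.statuses[self.completed]
        self.completed += 1
        return status

    @property
    def state(self):
        done = self.completed == len(self.workflow.statuses)
        return self.workflow.state if done else "in process"


def add_to_instrument_run(run, sample):
    """Refuse samples that have not reached the 'ready' State."""
    if sample.state != "ready":
        raise ValueError(f"{sample.label} is not ready for data collection")
    run.append(sample.label)
```

The point of the State check is the one made above: the system, not the operator's memory, decides when a sample may move on.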

A workflow management system like FinchLab makes it just as easy to track the steps of Sanger DNA sequencing as it is to track the steps of a Solexa, SOLiD, or 454 sequencing process. You can also, in the same system, run genotyping assays and other kinds of genetic analysis like microarrays and bead assays.


Next time, we'll talk about what happens in the lab.


Tuesday, May 20, 2008

Finch 3: Defining the Experimental Information

In today's genetic analysis laboratory, multiple instruments are used to collect a variety of data ranging from DNA sequences to individual values that measure DNA (or RNA) hybridization, nucleotide incorporations, or other binding events. Next Gen sequencing adds to this complexity and offers additional challenges with the amount of data that can be produced for a given experiment.

In the last post, I defined basic requirements for a complete laboratory and data management system in the context of setting up a Next Gen sequencing lab. To review, I stated that laboratory workflow systems need to perform the following basic functions:
  1. Allow you to set up different interfaces to collect experimental information
  2. Assign specific workflows to experiments
  3. Track the workflow steps in the laboratory
  4. Prepare samples for data collection runs
  5. Link data from the runs back to the original samples
  6. Process data according to the needs of the experiment
I also added that if you operate a core lab, you'll want to bill for your services and get paid.

In this post I'm going to focus on the first step, collecting experimental information. For this exercise let's say we work in a lab that has:
  • One Illumina Solexa Genome Analyzer
  • One Applied Biosystems SOLiD System
  • One Illumina Bead Array station
  • Two Applied Biosystems 3730 Genetic Analyzers, used for both sequencing and fragment analysis

This image shows our laboratory home page. We run our lab as a service lab. For each data collection platform we need to collect different kinds of sample information. One kind of information is the sample container. Our customers' samples will be sent to the lab in many different kinds of containers depending on the kind of experiment. Next Gen sequencing platforms like SOLiD, Solexa, and 454 are low throughput with respect to sample preparation, so samples will be sent to us in tubes. Instruments like the Bead Array and 3730 DNA sequencing instrument usually involve sets of samples in 96 or 384 well plates. In some cases, samples start in tubes and end up in plates, so you'll need to determine which procedures use tubes and which use plates and how the samples will enter the lab.

Once the samples have reached the lab and been checked, you are also going to do different things to the samples in order to prepare them for the different data collection platforms. You'll want to know which samples should go to which platforms and have the workflows for different processes defined so that they are easy to follow and track. You might even want to track and reuse certain custom reagents like DNA primers, probes, and reagent kits. In some cases you'll want to know physical information, like sample type (DNA or RNA) or concentration, up front. In other cases you'll determine information later.

Finally, let's say you work at an institution that focuses on a specific area of research, like cancer, or mouse genetics, or plant research. In these settings you might want to also track information about sample source. Such information could include species, strain, tissue, treatment or many other kinds of things. If you want to explore this information later you'll probably want to define a vocabulary that can be "read" by computer programs. To ensure that the vocabulary can be followed, interfaces will be needed to enter this information without typing or else you'll have a problem like pseudomonas, psuedomonas, or psudomonas.
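As a rough illustration of why a fixed vocabulary beats free typing, the Python sketch below accepts only terms from a list and suggests the closest match for anything else. The vocabulary contents, function name, and similarity cutoff are assumptions made for the example; a real entry form would more likely offer a menu so the misspelling never gets typed at all.

```python
import difflib

# Hypothetical controlled vocabulary for a "genus" field.
GENUS_VOCABULARY = ["Pseudomonas", "Escherichia", "Arabidopsis"]


def normalize_term(entry, vocabulary=GENUS_VOCABULARY):
    """Return the canonical vocabulary term, or raise with a suggestion."""
    by_lower = {term.lower(): term for term in vocabulary}
    key = entry.strip().lower()
    if key in by_lower:
        return by_lower[key]
    # Look for a near miss so the error message can be helpful.
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=0.8)
    hint = f" Did you mean {by_lower[close[0]]!r}?" if close else ""
    raise ValueError(f"{entry!r} is not in the vocabulary.{hint}")
```

With this in place, "pseudomonas" is stored as "Pseudomonas", while "psuedomonas" is rejected before it can split one organism into two database entries.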

Information systems that support the above scenarios have to deal with a lot of "sometimes this" and "sometimes that" kinds of information. If one path is taken, Sanger sequencing on a 3730, different sample information and physical configurations are needed than we need with Next Gen sequencing. Next Gen platforms have different sample requirements too. SOLiD and 454 require emulsion PCR to prepare sequencing samples, whereas Solexa amplifies DNA molecules on slides in clusters. Additionally, the information entry system also has to deal with "I care" and "I don't care" kinds of data like information about sample sources, or experimental conditions. These kinds of information are needed later to understand the data in the context of the experiment, but do not have much impact on the data collection processes.

How would you create a system to support these diverse and changing requirements?

One way to do this would be to build a form with many fields and rules for filling it out. You know those kinds of forms. They say things like "ignore this section if you've filled out this other section." That would be a bad way to do this, because no one would really get things right, and the people tasked with doing the work would spend a lot of time either asking questions about what they are supposed to be doing with the samples or answering questions about how to fill out the form.

Another way would be to tell people that their work is too complex and they need custom solutions for everything they do. That's expensive.

A better way to do this would be to build a system for creating forms. In this system, different forms are created by the people who develop the different services. The forms are linked to workflows (lab procedures) that can understand sample configurations (plates, tubes, premixed ingredients, and required information). If the system is really good, you can easily create new forms and add fields to them to collect physical information (sample type, concentration) or experimental information (tissue, species, strain, treatment, your mother's maiden name, favorite vacation spot, ...) without having to develop requirements with programmers and have them build forms. If your system is exceptionally good, smart, and clever, it will let you create different kinds of forms and fields and prevent you from doing things that are in direct conflict with one another. If your system is modern, it will be 100% web-based and have cool web 2.0 features like automated fill downs, column highlighting, and multi-selection devices so that entering data is easy, intuitive, and even a bit fun.

FinchLab, built on the Finch 3 platform, is such a system.


Friday, April 25, 2008

Managing Digital Gene Expression Workflows with FinchLab

Last Wed (4/23) Illumina hosted a Geospiza presentation featuring how FinchLab supports mRNA tag profiling experiments. We had a great turnout and the presentation is posted on the Illumina web site.

In the webinar we talked about:
  • Next Gen sequencing applications
  • How the Illumina Genome Analyzer makes mRNA Tag Profiling more sensitive by looking at some features of mRNA Tag Profiling data sets with FinchLab
  • Setting up and tracking laboratory workflows with FinchLab
  • Why it is important to link the laboratory work and data analysis work
  • Setting up data analysis and reviewing results with FinchLab
  • Using hosted solutions to overcome the significant data management challenges that accompany Next Gen technologies
Over the coming weeks and months we'll explore the above points through multiple posts. In the meantime, get the presentation and enjoy.

From Sample to Results: Managing Illumina Data Workflow with FinchLab


Monday, April 21, 2008

Sneak Peek: Managing Next Gen Digital Gene Expression Workflows

This Wednesday, April 23rd, Illumina will host a webinar featuring the Geospiza FinchLab.

If you are interested in:
  • Learning about Next Gen sequencing applications
  • Seeing how the Illumina Genome Analyzer makes mRNA Tag Profiling more sensitive
  • Understanding the flow of data and information as samples are converted into results
  • Overcoming the significant data management challenges that accompany Next Gen technologies
  • Setting up Next Gen sequencing in your core lab
  • Creating a new lab with Next Gen technologies
This webinar is for you!

In the webinar, we will talk about the general applications of Next Gen sequencing and focus on using the Illumina Genome Analyzer to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling. Throughout the talk we will give specific examples about collecting and analyzing tag profiling data and show how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.


Wednesday, April 16, 2008

Expectations Set the Rules

Genetic analysis workflows are complex. Biology is non-deterministic, so we continually experience new problems. Lab processes and our data have natural uncertainty. These factors conspire against us to make our world rich in variability and processes less than perfect.

That keeps things interesting.

In a previous post, I was able to show how sequence quality values could be used to summarize the data for a large resequencing assay. Presenting "per read" quality values in a grid format allowed us to visualize samples that had failed as well as observe that some amplicons contained repeats that led to sequencing artifacts. We also were able to identify potential sample tracking issues and left off with an assignment to think about how we might further test sample tracking in the assay.

When an assay is developed there are often certain results that can be expected. Some results are defined explicitly with positive and negative controls. We can also use assay results to test that the assay is producing the right kinds of information. Do the data make sense? Expectations can be derived from the literature, an understanding of statistical outcomes, or internal measures.

Genetic assays have common parts

A typical genetic resequencing assay is developed from known information. The goal is to collect sequences from a defined region of DNA for a population of individuals (samples) and use the resulting data to observe the frequency of known differences (variants) and identify new patterns of variation. Each assay has three common parts:

Gene Model - Resequencing and genotyping projects involve comparative analysis of new data (sequences, genotypes) to reference data. The Gene Model can be a chromosomal region or specific gene. A well-developed model will include all known genotypes, protein variations, and phenotypes. The Gene Model represents both community (global) and laboratory (local) knowledge.

Assay Design - The Assay Design defines the regions in the Gene Model that will be analyzed. These regions, typically prepared by PCR, are bounded by unique DNA primer sequences. The PCR primers have two parts: one part is complementary to the reference sequence (black in the figure), the other part is "universal" and is complementary to a sequencing primer (red in the figure). The study includes information about patient samples such as their ethnicity, collection origin, and phenotypes associated with the gene(s) under study.

Experiments / Data Collection / Analysis - Once the study is designed and materials arrive, samples are prepared for analysis. PCR is used to amplify specific regions for sequencing or genotyping. After a scientist is confident that materials will yield results, data collection begins. Data can be collected in the lab or the lab can outsource their sequencing to core labs or service companies. When data are obtained, they are processed, validated, and compared to reference data.

Setting expectations

A major challenge for scientists doing resequencing and genotyping projects arises when trying to evaluate data quality and determine the “next steps.” Rarely does everything work. We've already talked about read quality, but there are also the questions of whether the data are mapping to their expected locations, and whether the frequencies of observed variation are expected. The Assay Design can be used to verify experimental data.

The Assay Design tells us where the data should align and how much variation can be expected. For example, if the average SNP frequency is 1/1300 bases, and an average amplicon length is 580 bases, we should expect to observe roughly one SNP for every two amplicons (580/1300 is about 0.45 SNPs per amplicon). Furthermore, in reads where a SNP may be observed, we will see the difference in a subset of the data because some, or most, of the reads will have the same allele as the reference sequence.
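The arithmetic behind that expectation is short enough to write down. The Python sketch below uses only the two numbers from the example above; the function name is made up, and treating SNPs as evenly spread along the sequence is a simplifying assumption.

```python
def expected_snps_per_amplicon(snp_rate_per_base, amplicon_length):
    """Expected SNP count in one amplicon, assuming a uniform SNP rate."""
    return snp_rate_per_base * amplicon_length


snp_rate = 1 / 1300      # one SNP per 1300 bases, from the example
amplicon_length = 580    # average amplicon length in bases, from the example

snps_per_amplicon = expected_snps_per_amplicon(snp_rate, amplicon_length)
amplicons_per_snp = 1 / snps_per_amplicon

print(f"{snps_per_amplicon:.2f} SNPs per amplicon")    # about 0.45
print(f"one SNP every {amplicons_per_snp:.1f} amplicons")  # about 2.2
```

If an assay shows far more or far fewer differences than this, that is a cue to look at data quality or sample tracking before interpreting the biology.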

To test our expectations for the assay, the 7488 read data set is summarized in a way that counts the frequency of disagreements between read data and their reference sequence. The graph below shows a composite of read discrepancies (blue bar graph) and average Q20/rL, Q30/rL, Q40/rL values (colored line graphs). Reads are grouped according to the number of discrepancies observed (x-axis). For each group, the count of reads (bar height) and average Q20/rL (green triangles), Q30/rL (yellow squares), and Q40/rL (purple circles) are displayed.


In the 7488 read data set, 95% (6914) of the reads gave alignments. Of the aligned data, 82% of the reads had between 0 and 4 discrepancies. If we were to pick which traces to review and which samples to redo, we would likely focus our review on the data in this group and queue the rest (18%) for redos to see if we could improve the data quality.

Per our previous prediction, most of the data (5692 reads) do not have any discrepancies. We also observe that the number of discrepancies increases as the overall data quality decreases. This is expected because the quality values are reflecting the uncertainty (error) in the data.

Spotting tracking issues

We can also use our expectations to identify sample tracking issues. Once an assay is defined, the positions of all of the PCR primers are known, hence we should expect that our sequence data will align to the reference sequence in known positions. In our data set, this is mostly true. Similar to the previous quality plots of samples and amplicons, an alignment "quality" can be defined and displayed in a table where the rows are samples and columns are amplicons. Each sample has two rows (one forward and one reverse sequence). If the cells are colored according to alignment start positions (green for expected, red for unexpected, white for no alignment) we can easily spot which reads have an "unexpected" alignment. The question then becomes, where/when did the mixup occur?
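The coloring rule for such a table is simple to express in code. The Python sketch below is one possible version, not the FinchLab report: the input shapes, the function names, and the idea of allowing a small tolerance around the expected start are all assumptions for illustration.

```python
def classify_alignment(observed_start, expected_start, tolerance=0):
    """Return the cell color for one read."""
    if observed_start is None:
        return "white"  # read gave no alignment
    if abs(observed_start - expected_start) <= tolerance:
        return "green"  # aligned where the assay design says it should
    return "red"        # aligned somewhere unexpected


def alignment_grid(amplicons, expected_starts, observed_starts, tolerance=0):
    """Build rows of colors: one row per (sample, direction), one cell per amplicon.

    expected_starts maps (amplicon, direction) to a reference position.
    observed_starts maps (sample, direction) to {amplicon: start or None}.
    """
    grid = {}
    for (sample, direction), starts in observed_starts.items():
        grid[(sample, direction)] = [
            classify_alignment(
                starts.get(amp), expected_starts[(amp, direction)], tolerance
            )
            for amp in amplicons
        ]
    return grid
```

A run of red cells down one column points at an amplicon problem, while red cells across one row point at a sample, which helps narrow down where a mixup happened.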

From these kinds of analyses we can get a feel for whether a project is on track and whether there are major issues that will make our lives harder. In future posts I will comment on other kinds of measures that can be made and show you how this work can be automated in FinchLab.


Monday, April 14, 2008

Digital Gene Expression with Next Gen Sequencing

Next Gen Sequencing is changing how we approach problems ranging from whole genome shotgun sequencing, to variation analysis, to gene expression, to structural genomics. Next week, April 23rd, Geospiza will present a webinar on managing Digital Gene Expression experiments and data with FinchLab. The webinar is hosted by Illumina as part of their ongoing webinar series on Next Gen sequencing.

Abstract

Next Gen sequencers enable researchers to perform new and exciting experiments like digital gene expression. Next Gen sequencers, however, also expose researchers to unprecedented experimental data volume and the need for new tools to support these projects. A single run of the Illumina Genome Analyzer, for example, can generate terabytes of data and 100s of thousands of files. To manage these projects effectively, researchers will need new software systems to quickly track samples, access and analyze the key results files produced by these runs and focus on the science, rather than IT.

In this webinar, Geospiza will demonstrate how the FinchLab Next Gen Edition workflow software can be used to track samples, quality review data, and characterize the biological significance of an Illumina dataset while streamlining the entire process from sample to result for a Digital Gene Expression experiment.

Hope to see you there.


Tuesday, April 8, 2008

Exceptions are the Rule

Genetic analysis workflows are complex. You can expect that things will go wrong in the laboratory. Biology also manages to interfere and make things harder than you think they should be. Your workflow management system needs to show the relevant data, allow you to observe trends, and have flexible points where procedures can be repeated.

In the last few posts, I introduced genetic analysis workflows, concepts about lab and data workflows, and discussed why it is important to link the lab and data workflows. In this post I expand on the theme and show how a workflow system like the Geospiza FinchLab can be used to troubleshoot laboratory processes.

First, I'll review our figure from last week. Recall that it summarized 4608 paired forward / reverse sequence reads. Samples are represented by rows, and amplicons by columns, so that each cell represents a single read from a sample and one of its amplicons. Color is used to indicate quality, with different colors showing the number of Q20 bases divided by the read length (Q20/rL). Green is used for values between 0.60 and 1.00, blue for values between 0.30 and 0.59, and red for values less than 0.29. The summary showed patterns that indicated lab failures and biological issues. You were asked to figure them out. Eric from seqanswers (a cool site for Next Gen info) took a stab at this, and got part of the puzzle solved.

Sample issues

Rows 1,2 and 7,8 show failed samples. We can spot this because of the red color across all the amplicons. Either the DNA preps failed to produce DNA, or something interfered with the PCR. Of course there are those pesky working reactions for both forward and reverse sequence in sample 1 column 8. My first impression is that there is a tracking issue. The sixth column also has one reaction that worked. Could this result indicate a more serious problem in sample tracking?


Amplicon issues

In addition to the red rows, some columns show lots of blue spots; these columns correspond to amplicons 7, 24 and 27. Remember that blue is an intermediate quality. An intermediate quality could be obtained if part of the sequence is good and part of the sequence is bad. Because the columns represent amplicons, when we see a pattern in a column it likely indicates a systematic issue for that amplicon. For example, in column 7, all of the data are intermediate quality. Columns 24 and 27 are more interesting because the striping pattern indicates that one sequencing reaction results in data with intermediate quality while the other looks good. Wouldn't it be great if we could drill down from this pattern and see a table of quality plots and also get to the sequence traces?


Getting to the bottom

In FinchLab we can drill down and view the underlying data. The figure below summarizes the data for amplicon 24. The panel on the left is the expanded heat map for the data set. The panel on the right is a folder report summarizing the data from 192 reads for amplicon 24. It contains three parts: an information table that provides an overview of the details for the reads; a histogram plot that counts how many reads have a certain range of Q20 values; and a data table that summarizes each read in a row containing its name, the number of edit revisions, its Q20 and Q20/rLen values, and a thumbnail quality plot showing the quality values for each base in the read. In the histogram, you can see that two distinct peaks are observed. About half the data have low Q20 values and half have high Q20 values, producing the striping pattern in the heat map. The data table shows two reads; one is the forward sequence and the other is its "reverse" pair. These data were brought together using the table's search function, in the "finder" bar. Note how the reads could fit together if one picture was reversed.

Could something in the sequence be interfering with the sequencing reaction?

To explore the data further, we need to look at the sequences themselves. We can do this by clicking the name and viewing the trace data online in our web browser, or we could click the FinchTV icon and view the sequence in FinchTV (bottom panel of the figure above). When we do this for the top read (left most trace) we see that, sure enough, there is a polyT tract that we are not getting through. During PCR such regions can cause "drop outs" and result in mixtures of molecules that differ in size by one or two bases. A hallmark of such a problem is a sudden drop in data quality at the end of the polynucleotide tract because the mixture of molecules creates a mess of mixed bases. This explanation is confirmed by the other read. When we view it in FinchTV (right most trace) we see poor data at the end of the read. Remember these data are reversed relative to the first read, so when we reverse complement the trace (middle trace), we see that it "fits" together with the first read. A problem for such amplicons is that we now have only single stranded coverage. Since this problem occurred at the end of the read, half of the data are good and the other half are poor quality. If the problem occurred in the middle of the read, all of the data would show an intermediate quality like amplicon 7.

In genetic analysis, data quality values are an important tool for assessing many lab and sample parameters. In this example, we were able to see systematic sample failures and sequence characteristics that can lead to intermediate quality data. We can use this information to learn about biological issues that interfere with analysis. But what about our potential tracking issue?

How might we determine if our samples are being properly tracked?


Friday, April 4, 2008

Lab work without data analysis and management is doo doo

As we begin to contemplate next generation sequence data management, we can use Sanger sequencing to teach us important lessons. One of these is the value of linking laboratory and data workflows to be able to view information in the context of our assays and experiments.

I have been fortunate to hear J. Michael Bishop speak on a couple of occasions. He ended these talks by quoting one of his biochemistry mentors, "genetics without biochemistry is doo doo." In a similar vein, lab work without data analysis and management is doo doo. That is, when you separate the lab from the data analysis, you have to work through a lot of doo to figure things out. Without a systematic way to view summaries of large data sets, the doo is overwhelming.

To illustrate, I am going to share some details about a resequencing project we collaborated on. We came to this project late, so much of the data had been collected, and there were problems, lots of doo. Using Finch however, we could quickly organize and analyze the data, and present information in summaries with drill downs to the details to help troubleshoot and explain observations that were seen in the lab.

10,686 sequence reads: forward / reverse sequences from 39 amplicons from 137 individuals

The question being asked in this project was: are there new variants in a gene that are related to phenotypes observed in a specialized population? This is the kind of question medical researchers ask frequently. Typically they have a unique collection of samples that come from a well understood population of individuals. Resequencing is used to interrogate the samples for rare variants, or genotypes.

In this process, we purify DNA from sample material (blood), and use PCR with exon specific probes to amplify small regions of DNA within the gene. The PCR primers have regions called universal adaptors. Our sequencing primers will bind to those regions. Each PCR product, called an amplicon, is sequenced twice, once from each strand to give double coverage of the bases.

When we do the math, we will have to track the DNA for 137 samples and 5343 amplicons. Each amplicon is sequenced, at a minimum, twice, to give us 10,686 reads. From a physical materials point of view that means 137 tubes with sample; 56 96-well plates for PCR; and 112 96-well plates for sequencing. In a 384-well format we could have used 14 plates for PCR and 28 plates for sequencing. For a genome center, this level of work is trivial, but for a small lab this is significant work and things can happen. Indeed, because not all the work is done in a single lab, the process can be more complex. And you need to think about how you would lay this out - 96 does not divide by 39 very well.
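The plate counts above follow from simple arithmetic. Here is a minimal sketch that assumes every well holds one reaction, with no wells reserved for controls or left empty for layout reasons; the helper name `plates_needed` is illustrative.

```python
import math

samples = 137
amplicons_per_sample = 39
reads_per_amplicon = 2        # one forward and one reverse read

pcr_reactions = samples * amplicons_per_sample
sequencing_reactions = pcr_reactions * reads_per_amplicon


def plates_needed(reactions, wells_per_plate):
    """Smallest number of plates that holds all reactions (assumes every well is used)."""
    return math.ceil(reactions / wells_per_plate)


print("PCR reactions:", pcr_reactions)
print("Sequencing reactions:", sequencing_reactions)
for wells in (96, 384):
    print(f"{wells}-well format:",
          plates_needed(pcr_reactions, wells), "PCR plates,",
          plates_needed(sequencing_reactions, wells), "sequencing plates")
```

In practice the count would be higher, because 39 amplicons do not tile a 96-well plate evenly and most labs reserve wells for controls.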

From a data perspective, we can use sequence quality values to identify potential laboratory and biological issues. The figure below summarizes 4608 reads. Each pair of rows is one sample (forward / reverse sequence pairs, alternating gray and white - 48 total). Each column is an amplicon. Each cell in the table represents a single read from an amplicon and sample. Color is used to indicate quality. In this analysis, quality is defined as the ratio of Q20 to read length (Q20/rL), which works very well for PCR amplicons. The better the data, the closer this ratio is to one. In the table below, green indicates Q20/rL values between 0.60 and 1.00, blue indicates values between 0.30 and 0.59, and red indicates Q20/rL values less than 0.29. The summary shows patterns that, as we will learn next week, show lab failures and biological issues. See if you can figure them out.
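The Q20/rL measure and its color bins could be computed along these lines. This is a sketch, not FinchLab's implementation: the function names are made up, and because the bins as stated leave small gaps (for example between 0.29 and 0.30), the cutoffs are assumed to sit at 0.30 and 0.60.

```python
def q20_ratio(quality_values):
    """Fraction of bases in a read with a quality value of 20 or more (Q20/rL)."""
    if not quality_values:
        return 0.0
    q20_bases = sum(1 for q in quality_values if q >= 20)
    return q20_bases / len(quality_values)


def quality_color(ratio):
    """Map a Q20/rL ratio to a heat map color (assumed cutoffs: 0.30 and 0.60)."""
    if ratio >= 0.60:
        return "green"
    if ratio >= 0.30:
        return "blue"
    return "red"


# Example: a toy read where half of the bases reach Q20.
read_qualities = [35, 40, 22, 20, 12, 8, 15, 9]
ratio = q20_ratio(read_qualities)
print(ratio, quality_color(ratio))
```

Applying `quality_color` to every read, with samples as rows and amplicons as columns, gives the kind of table described above.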


Wednesday, April 2, 2008

Working with Workflows

Genetic analysis workflows involve both complex laboratory and data analysis and manipulation procedures. A good workflow management system not only tracks processes, but simplifies the work.

In my last post, I introduced the concept of workflows in describing the issues one needs to think about as they prepare their lab for Next Gen sequencing. To better understand these challenges, we can learn from previous experience with Sanger sequencing in particular and genetic assays in general.

As we know, DNA sequencing serves many purposes. New genomes and genes in the environment are characterized and identified by De Novo sequencing. Gene expression can be assessed by measuring Expressed Sequence Tags (ESTs), and DNA variation and structure can be investigated by resequencing regions of known genomes. We also know that gene expression and genetic variation can be studied with multiple technologies such as hybridization, fragment analysis, and direct genotyping, and it is desirable to use multiple methods to confirm results. Within each of these general applications and technology platforms, specific laboratory and bioinformatics workflows are used to prepare samples, determine data quality, study biology, and predict biological outcomes.

The process begins in the laboratory.

Recently I came across a Wikipedia article on DNA sequencing that had a simple diagram showing the flow of materials from samples to data. I liked this diagram, so I reproduced it, with modifications. We begin with the sample. A sample is a general term that describes a biological material. Sometimes, like when you are at the doctor, these are called specimens. Since biology is all around and in us, samples come from anything that we can extract DNA or RNA from. Blood, organ tissue, hair, leaves, bananas, oysters, cultured cells, feces, you-can-imagine-what-else, can all be samples for genetic analysis. I know a guy who uses a 22 to collect the apical meristems from trees to study poplar genetics. Samples come from anywhere.

With our samples in hand, we can perform genetic analyses. What we do next depends on what we want to learn. If we want to sequence a genome we're going to prepare a DNA library by randomly shearing the genomic DNA and cloning the fragments into sequencing vectors. The purified cloned DNA templates are sequenced and the data we obtain are assembled into larger sequences (contigs) until, hopefully, we have a complete genome. In resequencing and other genetic assays, DNA templates are prepared from sample DNA by amplifying specific regions of a genome with PCR. The PCR products, amplicons, are sequenced and the resulting data are compared to a reference sequence to identify differences. Gene expression (EST and hybridization) analysis follows similar patterns except that RNA is purified from samples and then converted to cDNA using RT-PCR (Reverse Transcriptase PCR, not Real Time PCR - that's a genetic assay).

From a workflow point of view, we can see how the physical materials change throughout the process. Sample material is converted to DNA or RNA (nucleic acids), and the nucleic acids are further manipulated to create templates that are used for the analytical reaction (DNA sequencing, fragment analysis, RealTime-PCR, ...). As the materials flow through the lab, they're manipulated in a variety of containers. A process may begin with a sample in a tube, use a petri plate to isolate bacterial colonies, 96-well plates to purify DNA and perform reactions, and 384-well plates to collect sequence data. The movement of the materials must be tracked, along with their hierarchical relationships. A sample may have many templates that are analyzed, or a template may have multiple analyses. When we do this a lot we need a way to see where our samples are in their particular processes. We need a workflow management system, like FinchLab.


Monday, March 31, 2008

Next Gen, Next Step

Congratulations! You just got approval to purchase your next generation sequencer! What are you going to do next?

Today, there is a lot being written about the data deluge accompanying Next Gen sequencers. It's true, they produce a lot of data. But even more important are the questions about how you plan to set up the lab and data workflows to turn those precious samples into meaningful information. The IT problems, while significant, are only the tip of the iceberg. If you operate a single lab, you will need to think about your experiments, how to track your samples, how to prepare DNA for analysis, how to move the data around for analysis, and how to do your analyses to get meaningful information out of the data. If you operate a core lab, you have all the same problems, but you're providing that service for a whole community of scientists. You'll need to keep their samples and data separated and secure. You also have to figure out how to get the data to your customers and how you might help them with their analyses.

Never mind that you need multi terabytes of storage and a computer cluster. Without a plan and strategy for running your lab, organizing the data, running multistep analysis procedures, and sifting through 100's of thousands of alignments, you'll just end up with a piece of lab art: a Next Gen sequencer, a big storage system and a computer cluster. (By the way, have you found a place for this yet?) It may look nice, but that's probably not what you had in mind.

To get the most out of your investment, you'll need to think about workflows, and how to manage those workflows.


The cool thing about Next Gen technology is the kinds of questions that can be asked with the data. This requires both novel ways to work with DNA and RNA and novel ways to work with the data. We call those procedures "workflows." Simply put, a workflow describes a multistep procedure and its decision points. In each step, we work with materials and the materials may be "transformed" in the step. You can also describe a workflow as a series of steps that have inputs and outputs. Workflows are run both in the lab and on the computer.

In a protocol for isolating DNA, we can take tissue (the input), lyse the cells with detergent, bind the DNA to a resin, wash away junk, and elute purified DNA (the output). The purified DNA may then become an input to a next step, like PCR, to create an output, like a collection of amplicons. Similar processes can be used with RNA. In a Next Gen lab workflow, you fragment the DNA, ligate adaptors, and use the adaptors to attach DNA to beads or regions of a slide. From a few basic lab workflows, we can prepare genetic material for whole genome analysis, expression analysis, variation analysis, gene regulation, and other experiments in both discovery and diagnostic assays.

In a software workflow, data are the material. Input data, typically packaged in files, are processed by programs to create output data. These data or information can also be packaged in files or even stored in databases. Software programs execute the steps and scripts often automate series of steps. Digital photography, multimedia production, and business processes all have workflows. So does bioinformatics. The difference is that bioinformatics workflows lack standards so many people work harder than needed and spend a lot of time debugging things.
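To make the "steps with inputs and outputs" idea concrete, here is a toy software workflow. Everything in it is hypothetical: the step names, the minimum read length, and the example reads are made up for illustration and do not describe any particular pipeline.

```python
def normalize(reads):
    """Step 1: strip whitespace and upper-case each read."""
    return [read.strip().upper() for read in reads]


def drop_short(reads, min_length=4):
    """Step 2: discard reads shorter than an assumed minimum length."""
    return [read for read in reads if len(read) >= min_length]


def count_tags(reads):
    """Step 3: count how many times each distinct read occurs."""
    counts = {}
    for read in reads:
        counts[read] = counts.get(read, 0) + 1
    return counts


def run_workflow(steps, data):
    """Run each step in order; the output of one step is the input of the next."""
    for step in steps:
        data = step(data)
    return data


raw_reads = ["acgt", "ACGT ", "ac", "ttga"]
result = run_workflow([normalize, drop_short, count_tags], raw_reads)
print(result)
```

Because each step only sees the previous step's output, steps can be swapped, repeated, or logged individually, which is what a workflow management system keeps track of at scale.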

As the scale increases, the lab and analysis workflows must be managed together.


A common laboratory practice has been to collect the data, and then analyze the data in separate independent steps. Lab work is often tracked on paper, in Excel spreadsheets, or in a LIMS (Laboratory Information Management System). The linkage between lab processes, raw data, and final results, is typically poor. In small projects, this is manageable. File naming conventions can track details and computer directories (folders) can be used to organize data files. But as the scale grows, the file names get longer and longer, people spend considerable time moving and renaming data, the data start to get mixed up, become harder to find, and for some reason files start to replicate themselves. Now, the lab investigates tracking problems and lost data, instead of doing experiments.

Why? Because the lab and data analysis systems are disconnected.

The good news is that Geospiza Finch products can link your lab procedures and the data handling procedures to create complete workflows for genetic analysis.


Monday, March 24, 2008

Color Space, Flow Space, Sequence Space or Outer Space: Part II, Uncertainty in Next Gen Data

Next generation sequencing will transform sequencing assays and experiments. Understanding how the data are generated is important for interpreting results.

In my last post, I discussed how all measurement systems have uncertainty (error) and how error probability is determined in Sanger sequencing. In this post, I discuss the current common next generation (Next Gen) technologies and their measurement uncertainty.

The Levels of Data - To talk about Next Gen data and errors, it is useful to have a framework for describing these data. In 2005, the late Jim Gray and colleagues published a report addressing scientific data management in the coming decade. The report defines data in three levels (Level 0, 1, and 2). Raw data (Level 0), the immediate product of analytical instruments, must be calibrated and transformed into Level 1 data. Level 1 data are then combined with other data to facilitate analysis (Level 2 datasets). Interesting science happens when we make different combinations of Level 2 data and have references to Level 1 data for verification.

How does our data level framework apply to the DNA sequencing world?

In Sanger sequencing, Level 0 data are the raw analog signal data that are collected from the laser and converted to digital information. Digital signals, displayed as waveforms, are mobility corrected and basecalled to produce Level 1 data files that contain information about the collection event, the DNA sequence (read), quality values, and averaged intensity signals. Read sequences are then combined together, or with reference data, to produce Level 2 data in the form of aligned sequence datasets. These Level 2 data have many context specific meanings. They could be a list of annotations, a density plot in a genome browser view, an assembly layout, or a panel of discrepancies that use quality values (from Level 1) to distinguish true genetic variation from errors (uncertainty) in the data.

Next Gen data are different.

When the "levels of data" framework is used to explore Next Gen sequencing technology, we see fundamental differences between Level 0 and Level 1 data. In Sanger sequencing, we perform a reaction to create a mixture of fluorescently tagged molecules that differ in length by a single base. The mixture is then resolved by electrophoresis with continual detection of these size separated tagged DNA molecules to ultimately create a DNA sequence. Uncertainties in the basecalls are related to electrophoresis conditions, compressions due to DNA secondary structure, and the fact that some positions have mixed bases because the sequencing reactions contain a collection of molecules.

In Next Gen sequencing, molecules are no longer separated by size. Sequencing is done "in place" on DNA molecules that were amplified from single molecules. These amplified DNA molecules are either anchored to beads (that are later randomly bound to slides or picoliter wells) or anchored to random locations on slides. Next Gen reactions then involve multiple cycles of chemical reaction (flow) followed by detection. The sample preparation methods and reaction/detection cycles are where the current Next Gen technologies greatly differ and have their unique error profiles.

Next Gen Level 0 data change from electropherograms to huge collections of images produced through numerous cycles of base (or DNA) extensions and image analysis. Unlike Sanger sequencing, where both raw and processed data can be stored relatively easily in single files, Next Gen Level 0 data sets can be quite large and multiple terabytes can be created per run. Presently there is active debate on the virtues of storing Level 0 data (the Data Life Cycle). The Level 0 image collections are used to create Level 1 data. Level 1 data are a continuum of information that is ultimately used to produce reads and quality values. Next Gen quality values reflect the basic underlying sequencing chemistry and primary error modes, and thus calculations need to be optimized for each platform. For those of you who are familiar with phred, these are analogous to the settings in the phredpar.dat file. The other feature of Level 1 data is that they can be expressed differently.


The Spaces

Flow Space - The Roche 454 technology is based on pyrosequencing [1] and the measured signals are a function of how many bases are incorporated in a base addition cycle. In pyrosequencing, we measure the pyrophosphate (PPi) that is released when a nucleotide is added to a growing DNA chain. If multiple bases are added in a cycle, two or more A's for example, proportionally more PPi is released and the light detected is more intense. As more bases are added in a row, the relative increase in light decreases; an 11/10 change ratio, for example, is much lower than 2/1. Consequently, when there are longer sequences with the same base (e.g. AAAAA), it becomes harder to count the number of bases accurately, and the error rate increases. Flow space describes a sequence in terms of base incorporations. That is, the data are represented as an ordered string of bases plus the number of bases at each base site. The 454 software performs alignment calculations in flow space.
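The flow space representation can be illustrated with a short sketch. It assumes a repeating nucleotide flow order of T, A, C, G and an idealized signal that is exactly proportional to the number of bases incorporated; both are simplifications, and the function names are made up for illustration.

```python
def to_flow_space(sequence, flow_order="TACG"):
    """Describe a sequence as (base flowed, bases incorporated) pairs.

    flow_order is an assumed repeating order of nucleotide flows.
    """
    unknown = set(sequence) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows = []
    position = 0
    flow_index = 0
    while position < len(sequence):
        base = flow_order[flow_index % len(flow_order)]
        count = 0
        # Every consecutive copy of the flowed base is incorporated in one flow.
        while position < len(sequence) and sequence[position] == base:
            count += 1
            position += 1
        flows.append((base, count))
        flow_index += 1
    return flows


def relative_signal_increase(n):
    """Signal for n + 1 identical bases relative to n, assuming light is proportional to n."""
    return (n + 1) / n


print(to_flow_space("TTACG"))
print(to_flow_space("AAG"))
print(relative_signal_increase(1), relative_signal_increase(10))
```

The last line shows why long homopolymers are hard to count: going from one base to two doubles the signal, while going from ten to eleven changes it by only ten percent.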

Color Space - The Applied Biosystems SOLiD technology uses DNA ligation [2] with collections of synthetic DNA molecules (oligos) that contain two nested known bases with a fluorescent dye. In each cycle these two bases are read at intervals of five bases. Getting full sequence coverage (25, 35, or more bases), requires multiple ligation and priming cycles such that each base is sequenced twice. The SOLiD technology uses this redundancy to decrease the error probability. The Level 1 sequence is reported in a “color” space, where the numbers 0,1,2,3 are used to represent one of the fluorescent colors. Color space must be decoded into a DNA sequence at the final stage of data processing. Like flow space, it is best to perform alignments using data in color space. Since each color represents two bases, decoding color space requires that the first base be known. This base came from the adapter.
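A short sketch shows why decoding needs a known first base. It assumes the commonly described two-base encoding, in which each color is the XOR of 2-bit codes for adjacent bases (A=0, C=1, G=2, T=3); that mapping and the function names are not taken from this post.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(sequence):
    """Return the first base plus one color (0-3) per pair of adjacent bases."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b]
              for a, b in zip(sequence, sequence[1:])]
    return sequence[0], colors


def decode_colors(first_base, colors):
    """Rebuild the bases from a known first base and the list of colors."""
    bases = [first_base]
    for color in colors:
        previous = BASE_TO_BITS[bases[-1]]
        bases.append(BITS_TO_BASE[previous ^ color])
    return "".join(bases)


first, colors = encode_colors("ATGGC")
print(first, colors)
print(decode_colors(first, colors))   # recovers the original sequence
print(decode_colors("C", colors))     # wrong first base gives a different sequence
```

The same color string decodes to a different sequence for each possible starting base, which is why the known adapter base matters. A single miscalled color also changes every base decoded after it, which is one reason alignment is best done in color space.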

Sequence Space - Illumina’s Solexa technology uses single base extensions of fluorescent-labeled nucleotides with protected 3'-OH groups. After base addition and detection, the 3'-OH is deprotected and the cycle repeated. The error probabilities are calculated in different ways by first analyzing the raw intensity files (remember firecrest and bustard) and then by alignment with the ELAND program to compute an empirical probability error. With Solexa data, errors occur more frequently at the ends of reads.

So, what does this all mean?

Next Gen technologies are all different in the way data are produced, they all produce different kinds of Level 1 data, and the Level 1 data are best analyzed within their own unique "space." This can have data integration implications depending on how you might want to combine data sets (Level 1 vs Level 2). Clearly, it is important to have easy access to quality values and error rates to help troubleshoot issues with runs and samples. The challenge is sifting out the important data from a morass of sequence files, quality files, data encodings, and other vagaries of the instruments and their software. Geospiza is good at this and we can help.

In terms of science, many assays and experiments, like Tag and Count, or whole genome assembly, can use redundancy to overcome random data errors. Data redundancy can also be used to develop rules to validate variations in DNA sequences. But this is the tip of the iceberg. With Next Gen's extremely high sampling rates and molecular resolution, we can begin to think of finding very rare differences in complex mixtures of DNA molecules and use sequencing as a quantitative assay. In these cases, understanding how the data are created, and their corresponding uncertainties, is a necessary step toward making full use of the data. The good news is that there is some good work being done in this area.

Remember, the differences we measure must be greater than the combined uncertainty of our measurements.

Further reading
1. Ronaghi, M., M. Uhlen, and P. Nyren, "A sequencing method based on real-time pyrophosphate." Science, 1998. 281(5375): p. 363, 365.

2. Shendure, J., G.J. Porreca, N.B. Reppas, et al., "Accurate multiplex polony sequencing of an evolved bacterial genome." Science, 2005. 309(5741): p. 1728-32.

3. "Quality scores and SNP detection in sequencing-by-synthesis systems." http://www.genome.org/cgi/content/abstract/gr.070227.107v2

4. "Scientific data management in the coming decade." http://research.microsoft.com/research/pubs/view.aspx?tr_id=860

5. Solexa quality values. http://rcdev.umassmed.edu/pipeline/Alignment%20Scoring%20Guide%20and%20FAQ.html
http://rcdev.umassmed.edu/pipeline/What%20do%20the%20different%20files%20in%20an%20analysis%20directory%20mean.html


Monday, March 17, 2008

Color Space, Flow Space, Sequence Space, or Outer Space: Part I. Uncertainty in DNA Sequencing

Next generation DNA sequencing introduces new concepts like color space, flow space, and sequence space. You might ask, what's a space? How do I deal with these spaces? Why are they important?

In this two part blog, I will first talk about error analysis in DNA sequencing. Next I will talk about how we might think about error analysis in next generation sequencing.

Last week I came across a story about an MIT physics professor, Walter Lewin, who captivates his student audiences with his lectures and creative demonstrations. MIT and iTunes have 100 of his lectures on line. I checked out the first one - your basic first college physics lecture that focuses on measurement and dimensional analysis - and agree, Lewin is captivating. I watched the entire lecture, and it made me think about DNA sequencing.

In the lecture, Lewin proves "physics works!" and shows that his grandmother was right when she said that you are an inch taller when lying down than when standing up. He used a student subject and measured his length lying down and standing up. Sure enough, the student was an inch longer lying down. But that was not the point. The point was - Lewin proved his grandmother was right because the change in the student's length was greater than the uncertainty of his measuring device (the ruler). Every measurement we make has uncertainty, or error, and for a comparison to be valid the difference in measures has to be greater than their combined uncertainties.

What does this have to do with DNA sequencing?

Each time we collect DNA sequence data we are making many measurements. That is, we are determining the bases of a DNA sample template in an in vitro replication process that allows us to "read" each base of the sequence. The measurements we collect, the string of DNA bases, therefore have uncertainty. We call this uncertainty in base measurement the error probability. In Sanger sequencing, Phil Green and Brent Ewing developed the Phred basecalling algorithm to measure per base error probabilities.

Error probabilities are small numbers (1/100, 1/10,000, 1/1,000,000). Rather than work with small fractions and decimal values with many leading zeros, we express error probabilities as positive integers, called quality values (QVs), by applying a transformation:

QV = -10 * log10(P), where P is the error probability.

With this transformation our 1/100, 1/10,000, and 1/1,000,000 error probabilities become QVs of 20, 40, and 60, respectively.
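The transformation and its inverse take only a few lines. This sketch uses the base-10 logarithm, which is what maps 1/100 to a QV of 20; the function names are illustrative and are not part of Phred.

```python
import math


def error_probability_to_qv(p):
    """Convert an error probability (0 < p <= 1) to a quality value."""
    return -10 * math.log10(p)


def qv_to_error_probability(qv):
    """Convert a quality value back to an error probability."""
    return 10 ** (-qv / 10)


for p in (1 / 100, 1 / 10_000, 1 / 1_000_000):
    print(f"P = {p:g}  ->  QV = {error_probability_to_qv(p):.0f}")

print(qv_to_error_probability(30))   # Q30 corresponds to about 1 error in 1000 bases
```

Because the scale is logarithmic, every 10-point increase in QV means a tenfold drop in error probability.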

The Phred basecalling algorithm has had a significant impact on DNA sequencing because it demonstrated that we could systematically measure the uncertainty of each base determination in a DNA sequence. Over the past 10 years, Phred quality values have been calibrated through many resequencing projects and are thus statistically powerful. An issue with Phred, and any basecaller, however, is that it must be calibrated for different electrophoresis instruments (measurement devices), and that is why different errors and error rates can be observed with different combinations of basecallers and instruments.

Sequencing redundancy also reduces error probabilities

The gold standard in DNA sequencing is to sequence both strands of a DNA molecule. This is for good reason. Each strand represents an independent measurement. If our measurements agree, they can be better trusted, and if they disagree one needs to look more closely at the underlying data, or remeasure. This concept was also incorporated into Green's assembly program Phrap (unpublished).

Within the high throughput genomics community it is well understood that increasing the redundancy of data collection reduces error. In theory, one can automate the interpretation of DNA sequencing experiments, or assays, by collecting data at sufficient redundancy. The converse is also true, and I see people working hardest at manually reviewing data when they have not collected enough. This is most common in variant detection resequencing assays.
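
To see why redundancy helps, consider a deliberately simplified model: if several reads are truly independent measurements of the same base, the chance that every one of them is wrong is the product of their individual error probabilities. The Python sketch below works through that arithmetic. It is an illustration under that independence assumption only, with made-up function names, and it is not how Phrap or any particular assembler computes consensus quality.

```python
import math


def combined_error_prob(error_probs):
    """Probability that all independent reads are wrong at one position.

    Assumption: the reads are independent measurements, so their
    error probabilities multiply.
    """
    result = 1.0
    for p in error_probs:
        result *= p
    return result


def combined_qv(qvs):
    """Quality value implied by the combined error probability."""
    probs = [10 ** (-q / 10) for q in qvs]
    return -10 * math.log10(combined_error_prob(probs))


# Two agreeing reads, each with QV 20 (a 1/100 error probability).
print(round(combined_qv([20, 20])))
```

Under this model the quality values simply add, so two agreeing QV 20 reads behave like a single QV 40 measurement. Real data rarely meets the independence assumption perfectly, which is one reason calibration still matters.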

Why isn't high redundancy data collection routine?

The challenges with high redundancy data collection in Sanger sequencing are the relatively high cost of collecting data and the even higher cost of collecting data from single molecules. Next generation (Next Gen) sequencing changes this landscape.

The higher throughput rates and lower costs of Next Gen sequencing hold great promise for revolutionizing genomics research and molecular diagnostics. In a single instrument run, an Expressed Sequence Tag (EST) experiment can yield millions of sequences and detect rare transcripts that cannot be found any other way [1-3]. In cancer research, high sampling rates will allow for the detection of rare sequence variants in populations of tumor cells that could be prognostic indicators or provide insights for new therapeutics [1, 4, 5]. In viral assays, it will be possible to determine the sequence of individual viral genomes and detect drug resistant strains as they appear [6, 7]. Next Gen sequencing has considerable appeal because the large numbers of sequences that can be obtained make statistical calculations more valid.

Making statistical calculations valid, however, requires that we understand the inherent uncertainty of our measuring device, in this case the different Next Gen genetic analyzers. That's where color space, flow space, and other spaces come into play.

Further Reading
1. Meyer, M., U. Stenzel, S. Myles, K. Prufer, and M. Hofreiter, Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res, 2007. 35(15): p. e97.
2. Korbel, J.O., A.E. Urban, J.P. Affourtit, et al., Paired-end mapping reveals extensive structural variation in the human genome. Science, 2007. 318(5849): p. 420-6.
3. Wicker, T., E. Schlagenhauf, A. Graner, T.J. Close, B. Keller, and N. Stein, 454 sequencing put to the test using the complex genome of barley. BMC Genomics, 2006. 7: p. 275.
4. Taylor, K.H., R.S. Kramer, J.W. Davis, J. Guo, D.J. Duff, D. Xu, C.W. Caldwell, and H. Shi, Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing. Cancer Res, 2007. 67(18): p. 8511-8.
5. Highlander, S.K., K.G. Hulten, X. Qin, et al., Subtle genetic changes enhance virulence of methicillin resistant and sensitive Staphylococcus aureus. BMC Microbiol, 2007. 7(1): p. 99.
6. Wang, G.P., A. Ciuffi, J. Leipzig, C.C. Berry, and F.D. Bushman, HIV integration site selection: analysis by massively parallel pyrosequencing reveals association with epigenetic modifications. Genome Res, 2007. 17(8): p. 1186-94.
7. Hoffmann, C., N. Minkah, J. Leipzig, G. Wang, M.Q. Arens, P. Tebas, and F.D. Bushman, DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res, 2007. 35(13): p. e91.
