Friday, August 8, 2008

ChIP-ing Away at Analysis

ChIP-Seq is becoming a popular way to study the interactions between proteins and DNA. This new technology is made possible by the Next Gen sequencing techniques and sophisticated tools for data management and analysis. Next Gen DNA sequencing provides the power to collect the large amounts of data required. FinchLab is the software system that is needed to track the lab steps, initiate analysis, and see your results.

In recent posts, we stressed the point that unlike Sanger sequencing, Next Gen sequencing demands that data collection and analysis be tightly coupled, and presented our initial approach of analyzing Next Gen data with the Maq program. We also discussed how the different steps (basecalling, alignment, statistical analysis) provide a framework for analyzing Next Gen data and described how these steps belong to three phases: primary, secondary, and tertiary data analysis. Last, we gave an example of how FinchLab can be used to characterize data sets for Tag Profiling experiments. This post expands the discussion to include characterization of data sets for ChIP-Seq.

ChIP-Seq

ChIP (Chromatin Immunoprecipitation) is a technique where DNA binding proteins, like transcription factors, can be localized to regions of a DNA molecule. We can use this method to identify which DNA sequences control expression and regulation for diverse genes. In the ChIP procedure, cells are treated with a reversible cross-linking agent to "fix" proteins to other proteins that are nearby, as well as the chromosomal DNA where they're bound. The DNA is then purified and broken into smaller chunks by digestion or shearing, and antibodies are used to precipitate any protein-DNA complexes that contain their target antigen. After the immunoprecipitation step, unbound DNA fragments are washed away, the bound DNA fragments are released, and their sequences are analyzed to determine the DNA sequences that the proteins were bound to. Only a few years ago, this procedure was much more complicated than it is today; for example, the fragments had to be cloned before they could be sequenced. When microarrays became available, a microarray-based technique called ChIP-on-chip made this assay more efficient by allowing a large number of precipitated DNA fragments to be tested in fewer steps.

Now, Next Gen sequencing takes ChIP assays to a new level [1]. In ChIP-seq the same cross-linking, isolation, immunoprecipitation, and DNA purification steps are carried out. However, instead of hybridizing the resulting DNA fragments to a DNA array, the last step involves adding adaptors and sequencing the individual DNA fragments in parallel. When compared to microarrays, ChIP-seq experiments are less expensive, require fewer hands-on steps, and benefit from the lack of hybridization artifacts that plague microarrays. Further, because ChIP-seq experiments produce sequence data, they allow researchers to interrogate the entire chromosome. The experimental results are no longer limited to the probes on the microarray. ChIP-Seq data are better at distinguishing similar sites and collecting information about point mutations that may give insights into gene expression. No wonder ChIP-Seq is growing in popularity.

FinchLab

To perform a ChIP-seq experiment, you need to have a Next Gen sequencing instrument. You will also need to have the ability to run an alignment program and work with the resulting data to get your results. This is easier said than done. Once the alignment program runs, you might also have to run additional programs and scripts to translate raw output files to meaningful information. The FinchLab ChIP-seq pipeline, for example, runs Maq to generate the initial output, then runs Maq pileup to convert the data to a pileup file. The pileup file is then read by a script to create the HTML report, thumbnail images to see what is happening, and "wig" files that can be viewed in the UCSC Genome Browser. If you do this yourself, you have to learn the nuances of the alignment program, learn how to run it in different ways to create the data sets, and write the scripts to create the HTML reports, graphs, and wig files.
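
For readers curious about what the last conversion step involves, here is a minimal Python sketch of turning pileup-style depth data into a variableStep wig track. It assumes each pileup line is whitespace-separated with the reference name, 1-based position, reference base, and read depth in the first four columns. The function name `pileup_to_wig`, the track name, and the `min_depth` cutoff are hypothetical, and this is not the script FinchLab uses.

```python
from collections import OrderedDict


def pileup_to_wig(pileup_lines, track_name="chip_seq_depth", min_depth=1):
    """Convert pileup-style lines to variableStep wig text.

    Assumes columns: reference name, 1-based position, reference base, depth.
    Positions with depth below min_depth are skipped to keep the track small.
    """
    # Group (position, depth) pairs by reference, preserving input order.
    by_ref = OrderedDict()
    for line in pileup_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        ref, pos, depth = fields[0], int(fields[1]), int(fields[3])
        if depth >= min_depth:
            by_ref.setdefault(ref, []).append((pos, depth))

    out = ['track type=wiggle_0 name="%s"' % track_name]
    for ref, points in by_ref.items():
        out.append("variableStep chrom=%s" % ref)
        for pos, depth in sorted(points):
            out.append("%d %d" % (pos, depth))
    return "\n".join(out) + "\n"
```

Writing the returned text to a file gives a custom track that can be uploaded to a genome browser; a production script would also need to handle very large pileup files without holding everything in memory.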

With FinchLab, you can skip those steps. You get the same results by clicking a few links to sort the data, and a few more to select the files, run the pipeline, and view the summarized results. You can also click a single link to send the data to the UCSC genome browser for further exploration.


Reference

1. ChIP-seq: welcome to the new frontier. Nature Methods 4, 613-614 (2007).


Thursday, July 31, 2008

Questions from our mailbag: How do I cite FinchTV?

One of the questions that appears in our mailbox from time to time concerns citing FinchTV or other Geospiza products. A quick search with Google Scholar for "FinchTV" finds 42 examples where FinchTV was cited in research publications. Most of the citations seem to follow the same conventions.

We recommend citing FinchTV as you would any other experimental software tool, instrument, or reagent. The citation should include the version of the program, the company, the location, and the web site. Other Geospiza products (FinchLab, Finch Suite, and iFinch) may be cited in a similar manner.

In our case, a citation would most likely read:

FinchTV 1.4.0 (Geospiza, Inc.; Seattle, WA, USA; http://www.geospiza.com)

If you're not sure which version of FinchTV you're using, open the About menu. The version number will appear on the page.

It would also be a good idea to check with the journal where you plan to submit the article. Most journals have a set of instructions for authors where they provide example citations.


Wednesday, July 30, 2008

BioHDF at BOSC

The scale of Next Gen sequencing is only going to increase, hence we need to fundamentally change the way we work with Next Gen data. New software systems with scalable data models, APIs, software tools, and viewers are needed to support the very large datasets used by the applications that analyze Next Gen DNA sequence data.

That was the theme of a talk I presented at the BOSC (Bioinformatics Open Source Conference) meeting that preceded ISMB (Intelligent Systems for Molecular Biology) in Toronto, Canada, July 19th. You can get the slides from the BOSC site. At the same time, we posted a blog on Genographia, a next-generation genomics community web site devoted to Next Gen sequencing discussions and idea sharing. The key points are summarized below.

Motivation

The BioHDF project is motivated by the fact that the next and future generations of data collection technologies, like DNA sequencing, are creating ever increasing amounts of data. Getting meaningful information from these data requires that multiple programs be used in complex processes. Current practices for working with these data create many kinds of challenges, ranging from managing large numbers of files and formats to having the computation power and bandwidth to make calculations and move data around. These practices have a high cost in terms of storage, CPU, and bandwidth efficiency. In addition, they require significant human effort in understanding idiosyncratic program behavior and output formats.

Is there a better way?

Many would agree that if we could reduce the number of file formats, avoid data duplication, and improve how we access and process data, we could develop better performing and more interoperable applications. Doing so requires improved ways of storing data and making it accessible to programs. For a number of years we have thought about how these goals might be accomplished and looked to other data-intensive fields to see how others have solved these problems. Our search ended when we found HDF (Hierarchical Data Format), a standard file format and library used in the physical and earth sciences.

BioHDF

HDF5 can be used in many kinds of bioinformatics applications. For specialized areas, like DNA sequencing, domain-specific extensions will be needed. BioHDF is about developing those extensions, through community support, to create a file format and accompanying library of software functions that are needed to build the scalable software applications of the future. More will follow. If you are interested, contact me: todd at geospiza.com.


Monday, July 14, 2008

Maq Attack

Maq (Mapping and Assembly with Quality) is an algorithm, developed at the Wellcome Trust Sanger Institute, for assembling Next Gen reads onto a reference sequence. Since Maq is widely used for working with Next Generation DNA sequence data, we chose to include support for Maq in our upcoming release of FinchLab. In this post, we will discuss integrating secondary analysis algorithms like Maq with the primary analysis and workflows in FinchLab.

Improving laboratory processes through immediate feedback

The cost to run Next Generation DNA sequencing instruments and the volume of data produced make it important for labs to be able to monitor their processes in real time. In the last post, I discussed how labs can get performance data and accomplish scientific goals during the three stages of data analysis. To quickly review: Primary data analysis involves converting image data to sequence data. Secondary data analysis involves aligning the sequences from the primary data analysis to reference data to create data sets that are used to develop scientific information. An example of a secondary analysis step would be assembling reads into contigs when new genomes are sequenced. Unlike the first two stages, where much of the data is used to detect errors and measure laboratory performance, the last stage is focused on the science. In tertiary data analysis, genomes are annotated and data sets are compared. Thus the tertiary analyses are often the most important in terms of gaining new insights. The data used in this phase must be vetted first. It must be high quality and free from systemic errors.

The companies producing Next Gen systems recognize the need to automate primary and secondary analysis. Consequently, they provide some basic algorithms along with the Next Gen instruments. Although these tools can help a lab get started, many labs have found that significant software development is needed on top of the starting tools if they are to fully automate their operation, translate output files into meaningful summaries, and give users easy access to the data. The starter kits from the instrument vendors can also be difficult to adapt when performing other kinds of experiments. Working with Next Gen systems typically means that you will have to deal with a lot of disconnected software, a lack of user interfaces, and diverse new choices for algorithms when it comes to getting your work done.

FinchLab and Maq in an integrated system

The Geospiza FinchLab integrates analytical algorithms such as Maq into a complete system that encompasses all the steps in genetic analysis. Our Samples to Results platform provides flexible data entry interfaces to track sample metadata. The laboratory information management system is user configurable so that any kind of genetic analysis procedure can be run and tracked and, most importantly, it provides tight linkage between samples, lab work, and their resulting data. This system makes it easy to transition high quality primary results to secondary data analysis.

One of the challenges with Next Gen sequencing has been choosing an algorithm for secondary analysis. Secondary data analysis needs to be adaptable to different technology platforms and algorithms for specialized sequencing applications. FinchLab meets this need because it can accommodate multiple algorithms when it comes to secondary and tertiary analysis. One of these algorithms is Maq. Maq is attractive because it can be used in diverse applications where reads are aligned to a reference sequence. Among these are Transcriptomics (Tag Profiling, EST analysis, small RNA discovery), Promoter Mapping (ChIP-Seq, DNase hypersensitivity), Methylation analysis, and Variation Analyses (SNP, CNV). Maq offers a rich set of output files so it can be used to quickly provide an overview of your data and help you verify that your experiment is on track before you invest serious time in tertiary work. Finally, Maq is being actively developed and improved and is open source, so it is easy to access and use regardless of affiliation.

Maq and other algorithms are integrated into FinchLab through the FinchLab Remote Analysis Server (RAS). RAS is a lightweight job tracking system that can be configured to run any kind of program in different computing environments. RAS communicates with FinchLab to get the data and return the results. Data analyses are run in FinchLab by selecting the sequence file(s), clicking a link to go to a page where you select the analysis method(s) and reference data sets, and then clicking a button to start the work. RAS tracks the details of data processing and sends information back to FinchLab so that you can always see what is happening through the web interface.

A basic FinchLab system includes the RAS and pipelines for running Maq in two ways. The first is Tag Profiling and Expression Analysis. In this operation, Maq output files are converted to gene lists with links to drill down into the data and NCBI references. The second option is to use Maq in a general analysis procedure where all the output files are made available. In the next months, new tools will convert more of these files into output that can be added to genome browsers and other tertiary analysis systems.
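
As a rough illustration of the Tag Profiling conversion, the sketch below counts aligned tags per reference sequence and sorts them into a simple gene list. It assumes tab-separated alignment records resembling Maq's mapview text output, with the reference name in column 2 and the mapping quality in column 7. The function names and the quality cutoff are assumptions made for illustration, not FinchLab's implementation.

```python
from collections import Counter


def count_tags(mapview_lines, min_mapq=10):
    """Count aligned reads per reference sequence.

    Assumes tab-separated records with the reference name in column 2
    and the mapping quality in column 7 (both 1-based column numbers).
    """
    counts = Counter()
    for line in mapview_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip malformed records
        ref_name, mapq = fields[1], int(fields[6])
        if mapq >= min_mapq:
            counts[ref_name] += 1
    return counts


def gene_list(counts):
    """Return (reference, tag count) pairs, most abundant first."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Filtering on mapping quality before counting is one simple way to keep ambiguously placed tags from inflating a gene's count; the right cutoff depends on the experiment.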

A final strength of RAS is that it produces different kinds of log files to track potential errors. These kinds of files are extremely valuable in troubleshooting and fixing problems. Since Next Gen technology is new and still in constant flux, you can be certain that unexpected issues will arise. Keeping the research on track is easier when informative RAS logging and reports help to diagnose and resolve issues quickly. Not only can FinchLab help with Next Gen assays and help solve those unexpected Next Gen problems, but multiple Next Gen algorithms can also be integrated into FinchLab to complete the story.


Wednesday, June 25, 2008

Finch 3: Getting Information Out of Your Data

Geospiza's tag line "From Sample to Results" represents the importance of capturing information from all steps in the laboratory process. Data volumes are important, and lots of time is being spent discussing the overwhelming volumes of data produced by new data collection technologies like Next Gen sequencers. However, the real issue is not how you are going to store the data; rather, it is what you are going to do with them. What do your data mean in the context of your experiment?

The Geospiza FinchLab software system supports the entire laboratory and data analysis workflow to convert sample information into results. What this means is that the system provides a complete set of web-based interfaces and an underlying database to enter information about samples and experiments, track sample preparation steps in the laboratory, link the resulting data back to samples, and process the data to get biological information. Previous posts have focused on information entry, laboratory workflows, and data linking. This post will focus on how data are processed to get biological information.

The ultra-high data output of Next Gen sequencers allows us to use DNA sequencing to ask many new kinds of questions about structural and nucleotide variation and measure several indicators of expression and transcription control on a genome-wide scale. The data produced consist of images, signal intensity data, quality information, and DNA sequences and quality values. For each data collection run, the total collection of data and files can be enormous and can require significant computing resources. While all of the data have to be dealt with in some fashion, some of the data have long-term value while other data are only needed in the short term. The final scientific results will often be produced by comparing data sets created from the DNA sequences and their comparison to reference data.

Next Gen data are processed in three phases.

Next Gen data workflows involve three distinct phases of work: 1. Data are collected from control and experimental samples. 2. Sequence data obtained from each sample are aligned to reference sequence data, or data sets, to produce aligned data sets. 3. Summaries of the alignment information from the aligned data sets are compared to produce scientific understanding. Each phase has a discrete analytical process and we, and others, call these phases primary data analysis, secondary data analysis, and tertiary data analysis.

Primary data analysis involves converting image data to sequence data. The sequence data can be in familiar "ACTG" sequence space or less familiar color space (SOLiD) or flow space (454). Primary data analysis is commonly performed by software provided by the data collection instrument vendor and it is the first place where quality assessment about a sequencing run takes place.
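
To make color space a little less abstract, the following sketch encodes a base sequence using the two-base scheme commonly described for SOLiD, in which each color (0 to 3) represents the transition between two adjacent bases. The function names are hypothetical, and the leading primer base of a real read is modeled here simply by keeping the first base of the sequence.

```python
# Two-bit codes chosen so that XOR of adjacent bases gives the color.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {v: k for k, v in BASE_CODE.items()}


def to_color_space(seq):
    """Encode bases as the first base followed by one color per dinucleotide."""
    seq = seq.upper()
    colors = [str(BASE_CODE[a] ^ BASE_CODE[b]) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(colors)


def from_color_space(encoded):
    """Decode by walking from the known first base through each color."""
    bases = [encoded[0]]
    for color in encoded[1:]:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(color)])
    return "".join(bases)
```

Because each decoded base depends on the one before it, a single miscalled color changes every downstream base when translated this way, which is one reason color-space reads are usually aligned in color space rather than translated first.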

Secondary data analysis creates the data sets that will be further used to develop scientific information. This step involves aligning the sequences from the primary data analyses to reference data. Reference data can be complete genomes, subsets of genomic data like expressed genes, or individual chromosomes. Reference data are chosen in an application specific manner and sometimes multiple reference data sets will be used in an iterative fashion.

Secondary data analysis has two objectives. The first is to determine the quality of the DNA library that was sequenced, from a biological and sample perspective. The primary data analysis supplies quality measurements that can be used to determine if the instrument ran properly, or whether the density of beads or clusters was at its optimum to deliver the highest number of high quality reads. However, those data do not tell you about the quality of the samples. They cannot answer questions about sample quality, such as: Did the DNA library contain systematic artifacts such as sequence bias? Were there high numbers of ligated adaptors or incomplete restriction enzyme digests, or any other factors that would interfere with interpreting the data? These kinds of questions are addressed in the secondary data analysis by aligning your reads to the reference data and seeing that your data make sense.
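
One of these sample-level checks can be sketched in a few lines. The example below estimates the fraction of reads that contain an adaptor sequence. The adaptor string passed in and the function name are placeholders, and a real check would typically allow mismatches and partial matches at the ends of reads rather than exact substrings.

```python
def adaptor_fraction(reads, adaptor):
    """Return the fraction of reads containing the adaptor as an exact substring."""
    reads = [r.upper() for r in reads]
    if not reads:
        return 0.0
    hits = sum(1 for r in reads if adaptor.upper() in r)
    return hits / len(reads)
```

A high fraction would suggest that many library molecules were adaptor ligation products rather than the fragments of interest.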

The second objective of secondary data analysis is to prepare the data sets for tertiary analysis, where they will be compared in an experimental fashion. This step involves further manipulation of alignments, typically expressed in very large, hard-to-read, algorithm-specific tables, to produce data tables that can be consumed by additional software. Speaking of algorithms, there is a large and growing list to choose from. Some are general purpose and others are specific to particular applications; we'll comment more on that later.

Tertiary data analysis represents the third phase of the Next Gen workflow. This phase may involve a simple activity like viewing a data set in a tool like a genome browser so that the frequency of tags can be used to identify promoter sites, patterns of variation, or structural differences. In other experiments, like digital gene expression, tertiary analysis can involve comparing different data sets in a similar fashion to microarray experiments. These kinds of analyses are the most complex; expression measurements need to be normalized between data sets and statistical comparisons need to be made to assess differences.
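
A minimal sketch of the normalization step, assuming per-gene tag counts are already available for two data sets: counts are scaled to tags per million so that sequencing depth does not dominate the comparison, and a log2 ratio is computed with a pseudocount to avoid division by zero. The function names and the pseudocount value are illustrative choices, and a real analysis would add a statistical test on top of the ratios.

```python
import math


def tags_per_million(counts):
    """Scale raw tag counts so each data set sums to one million."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: n * 1e6 / total for gene, n in counts.items()}


def log2_ratios(control, treated, pseudocount=1.0):
    """Compare two data sets after normalizing for sequencing depth."""
    c_tpm = tags_per_million(control)
    t_tpm = tags_per_million(treated)
    genes = sorted(set(c_tpm) | set(t_tpm))
    return {
        g: math.log2((t_tpm.get(g, 0.0) + pseudocount) /
                     (c_tpm.get(g, 0.0) + pseudocount))
        for g in genes
    }
```

With this scaling, a data set that was simply sequenced four times deeper than its control yields ratios near zero, while a genuine shift in a gene's share of the tags shows up as a positive or negative value.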

To summarize, the goal of primary and secondary analysis is to produce well-characterized data sets that can be further compared to obtain scientific results. Well-characterized means that the quality is good for both the run and the samples and that any biologically relevant artifacts are identified, limited, and understood. The workflows for these analyses involve many steps, multiple scientific algorithms, and numerous file formats. The choices of algorithms, data files, data file formats, and overall number of steps depend on the kinds of experiments and assays being performed. Despite this complexity, there are standard ways to work with Next Gen systems to understand what you have before progressing through each phase.

The Geospiza FinchLab system focuses on helping you with both primary and secondary data analysis.


Friday, June 13, 2008

Finch 3, Linking Samples and Data

One of the big challenges with Next Gen sequencing is linking sample information with data. People tell us: "It's a real problem." "We use Excel, but it is hard." "We're losing track."

Do you find it hard to connect sample information with all the different types of data files? If so, you should look at FinchLab.

A review:

About a month ago, I started talking about our third version of the Finch platform and introduced the software requirements for running a modern lab. To review, labs today need software systems that allow them to:

1. Set up different interfaces to collect experimental information
2. Assign specific workflows to experiments
3. Track the workflow steps in the laboratory
4. Prepare samples for data collection runs
5. Link data from the runs back to the original samples
6. Process data according to the needs of the experiment

In FinchLab, order forms are used to first enter sample information into the system. They can be created for specific experiments and the samples entered will, most importantly, be linked to the data that are produced. The process is straightforward. Someone working with the lab, a customer or collaborator, selects the appropriate form and fills out the requested information. Later, an individual in the lab reviews the order and, if everything is okay, chooses the "processing" state from a menu. This action "moves" the samples into the lab where the work will be done. When the samples are ready for data collection they are added to an "Instrument run." The instrument run is Finch's way of tracking which samples go in what well of a plate or lane/chamber on a slide. The samples are added to the instrument and data are collected.

The data

Now comes the fun part. If you have a Next Gen system you'll ultimately end up with 1000's of files scattered in multiple directories. The primary organization for the data will be in unix-style directories, which are like Mac or Windows folders. Within the directories you will find a mix of sequence files, quality files, files that contain information about run metrics and possibly images. You'll have to make decisions about what to save for long-term use and what to archive, or delete.

As noted, the instrument software organizes the data by the instrument run. However, a run can have multiple samples, and the samples can be from different experiments. A single sample can be spread over multiple lanes and chambers of a slide. If you are running a core lab, the samples will come from different customers and your customers often belong to different lab groups. And there is the analysis. The programs that operate on the data require specific formats for input files and produce many kinds of output files. Your challenge is to organize the data so that it is easy to find and access in a logical way. So what do you do?

Organizing data the hard way

If you do not have a data management system, you'll need to write down which samples go with which person, group or experiment. That's pretty simple. You can tape a piece of paper on the instrument and write this down, or you can diligently open a file, commonly an Excel spreadsheet, and record the info there. Not too bad, after all there are only a handful of partitions on a slide (2, 8, 16) and you only run the instrument once or twice a week. If you never upgrade your instrument, or never try and push too many samples through, then you're fine. Of course the less you run your instrument the more your data cost and the goal is to get really good at running your instrument, as frequently as possible. Otherwise you look bad at audit time.

Let's look at a scenario where the instrument is being run at maximal throughput. Over the course of a year, data from between 200 and 1000 slide lanes (chambers) may be collected. These data may be associated with 100's or 1000's of samples and belong to a few or many users in one or many lab groups. The relevant sequence files range from a few hundred megabytes to gigabytes in size; they exist in directories with run quality metrics and possibly analysis results. To sort this out you could have committee meetings to determine whether data should be organized by sample, experiment, user, or group, or you could just pick an organization. Once you've decided on your organization you have to set up access. Does everyone get a unix account? Do you set up Samba services? Do you put the data on other systems like Macs and PCs? What if people want to share? The decisions and IT details are endless. Regardless, you'll need a battery of scripts to automate moving data around to meet your organizational scheme. Or you could do something easier.

Organizing data the Finch way

One of FinchLab's many strengths is how it organizes Next Gen data. Because the system tracks samples and users, and has group and permissions models, issues related to data access and sharing are simplified. After a run is complete, the system knows which data files go with which samples. It also knows which samples were submitted by each user. Thus data can be maintained in the run directories that were created by the instrument software to simplify file-based organization. When a run is complete in FinchLab a data link is made to the run directory. The data link informs the system which files go with a run. Data processing routines in the system sort the data into sequences, quality metric files, and other data. At this stage data are associated with samples. Once this is done, the lab has easy access to the data via web pages. The lab can also make decisions about access to data and how to analyze the data. These last two features make FinchLab a powerful system for core labs and research groups. With only a few clicks your data are organized by run, user, group, and experiment - and you didn't have to think about it.
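
The bookkeeping described above can be pictured with a small data model. This is only a sketch: the class and field names (`Sample`, `InstrumentRun`, `files_by_sample`) and the idea that files are identified by lane are hypothetical, and they are not FinchLab's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    name: str
    user: str
    group: str


@dataclass
class InstrumentRun:
    run_dir: str
    lanes: dict = field(default_factory=dict)       # lane number -> Sample
    lane_files: dict = field(default_factory=dict)  # lane number -> file names

    def load(self, lane, sample, files):
        """Record which sample went into a lane and which files it produced."""
        self.lanes[lane] = sample
        self.lane_files[lane] = list(files)


def files_by_sample(runs):
    """Collect file paths per sample across runs, leaving files in place."""
    index = defaultdict(list)
    for run in runs:
        for lane, sample in sorted(run.lanes.items()):
            for name in run.lane_files[lane]:
                index[sample].append("%s/%s" % (run.run_dir, name))
    return dict(index)
```

The point of the sketch is that the files never move: they stay in the run directories the instrument created, and the sample, user, and group views are just different ways of looking up the same paths.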




Thursday, June 5, 2008

Finishing in the Future

"The data sets are astronomical," "the data that needs to be attached to sequences is unbelievable," and "browsing [data] is incomprehensible." These are just three of the many quotes I heard about the challenges associated with DNA sequencing last week at the "Finishing in the Future Meeting" sponsored by the Joint Genome Institute (JGI) and Los Alamos National Laboratory (LANL).

Metagenomics

The two-and-a-half-day conference, focused on finishing genomic sequences, kicked off with a session on metagenomics. Metagenomics is about isolating DNA from environments and sequencing random molecules to "see what's out there." Excitement for metagenomics is being driven by Next Gen sequencing throughput, because so many sequences can be collected relatively inexpensively. A benefit of being able to collect such large data sets is that we can interrogate organisms that cannot be cultured. The first talk, "Defining the Human Microbiome: Friends or Family," was presented by Bruce Birren from the Broad Institute of MIT & Harvard. In this talk, we learned about the HMP (Human Microbiome Project), a project dedicated to characterizing the microbes that live on our bodies. It is estimated that microbial cells outnumber our cells by ten to one. It has long been speculated that our microbiomes are involved in our health and sickness, and recent studies are confirming these ideas.

Sequencing technologies continue to increase data throughput

The afternoon session opened with presentations from Roche (454), Illumina, and Applied Biosystems on their respective Next Gen sequencing platforms. Each company presented the strengths of their platform and new discoveries that are being made by virtue of having a lot of data. Each company also presented data on improvements designed to produce even more data and road maps for future improvement to produce even more data. As Haley Fiske from Illumina put it, "we're in the middle of an arms race!" Finally, all the companies are working on molecular barcodes, so that multiple samples can be analyzed within an experiment. So, we started with a lot of data from a sample and are going to a lot of data from a lot of samples. That should add some very nice complexity to sample and data tracking.

A unique perspective

Sydney Brenner opened the second day with a talk on "The Unfinished Genome." The thing I like most about a Sydney Brenner talk is how he puts ideas together. In this talk he presented how one could look at existing data and literature to figure things out or make new discoveries. In one example, he speculated on when the genes for eye development may have first appeared. From the physiology of the eye you can use the biochemistry of vision to identify the genes that encode the various proteins involved in the process. These proteins are often involved in other processes, but differ slightly. They arise from gene duplication and modification. So, you can look at gene duplications and measure the age of a duplication by looking at neighboring genes. If a duplication event is old, neighboring genes will be unequal distances apart. You can use this information, along with phylogenetic data, to estimate when the events occurred. Of course this kind of study benefits from more sequence data. Sydney encouraged everyone to keep sequencing.

Sydney closed his talk by making a fun analogy where genomics is like astronomy and thus should have been called "genomy." He supported his analogy by noting that astronomy has astrophysics and genomics has genetics. Both are quantitative and measure history and evolution. Astronomy also has astrology, the prediction of an individual's future from the stars. Similarly, folks would like to predict an individual's future from their genes, and he suggested we call this work "Genology," since it has the same kind of scientific foundation as astrology.

Challenges and solutions

The rest of the conference and posters focused on finishing projects. Today the genome centers are making use of all the platforms to generate large data sets and finish projects. A challenge for genomics is lowering finishing costs. The problem is that generating "draft" data has become so inexpensive and fast that finishing has become a significant bottleneck. Finishing is needed to produce the high quality reference sequences that will inform our genomic science, so investigating ways to lower finishing costs is a worthwhile endeavour. Genome centers are approaching this problem by looking at ways to mix data from different technologies such as 454 and Illumina or SOLiD. They are also developing new and mixed software approaches such as combining multiple assembly algorithms to improve alignments. These efforts are being conducted in conjunction with experiments where mixtures of single pass and paired read data sets are tested to determine optimal approaches for closing gaps.

The take home from this meeting is that, over the coming years, a multitude of new approaches and software programs will emerge to enable genome scale science. The current technology providers are aggressively working to increase data throughput, data quality and read length to make their platforms as flexible as possible. New technology providers are making progress on even higher throughput platforms. Computer scientists are working hard on new algorithms and data visualizations to handle the data. Molecular barcodes will allow for greater numbers of samples per data collection event and increase sample tracking complexity.

The bottom line

Individual research groups will continue to have increasing access to "genome center scale" technology. However, the challenges with sample tracking, data management, and data analysis will be daunting. Research groups with interesting problems will be cut off from these technologies unless they have access to cost-effective, robust informatics infrastructures. They will need help setting up their labs, organizing the data, and making use of new and emerging software technologies.

That's where Geospiza can help.
