Friday, October 17, 2008

Uploading your data to iFinch

iFinch is a scaled-down version of our V2 Finch system for genetic analysis.

Unlike our larger, industrial-strength systems, iFinch is designed for individual researchers, small labs, or teachers who want a trouble-free system for managing and working with genetic data. Currently, students and teachers are using iFinch as part of the Bio-Rad Explorer Cloning and Sequencing kit.

I call iFinch "bioinformatics in a box." I've used iFinch in two bioinformatics courses and it's been pretty helpful. iFinch and FinchTV play nicely together and the combination works well for students.

You don't even have to get a computer for storing data or learn how to manage a database. We do all that for you and you use the system through the web. It's nice and painless.

**********NOTE***********
If you received an iFinch account from Bio-Rad, you will need to turn on your data processor before you begin uploading data.

Checking and starting your Finch data processor
1.  Log into your iFinch account.
2.  Find and select the Data Processor link in the System menu.

3. Look at the Data processor status.


4. If the Data processor has stopped, restart it by selecting the Restart button. If you are a student, you will need to have an instructor log in and do this.

Once your data processor has been started, you can go ahead and upload your data as shown in the movie below.


Uploading your data
The first thing we do with iFinch is to put our data into the iFinch database. In the movie, you can see how we upload chromatograms through the web interface.

video

iFinch can store any kind of file, but it really shines when it comes to working with chromatograms or genotyping data.

If you have lots of files (more than a few 96 well plates), we do have other systems for uploading data. But, that's another post and another movie.


Monday, October 6, 2008

Sneak Peek: Genetic Analysis From Capillary Electrophoresis to SOLiD

On October 7, 2008 Geospiza hosted a webinar featuring the FinchLab, the only software product to track the entire genetic analysis process, from sample preparation, through processing to analyzed results.

If you are as disappointed about missing it as we are about you missing it, no worries. You can get the presentation here.

If you are interested in:
  • Learning about Next Gen sequencing applications
  • Seeing what makes the Applied Biosystems SOLiD system powerful for transcriptome analysis, ChIP-Seq, resequencing experiments, and other applications
  • Understanding the flow of data and information as samples are converted into results
  • Overcoming the significant data management challenges that accompany Next Gen technologies
  • Setting up Next Gen sequencing in your core lab
  • Creating a new lab with Next Gen technologies
This webinar is for you!

In the webinar, we talked about the general applications of Next Gen sequencing and focused on using SOLiD to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling and whole transcriptome analysis. Throughout the talk we gave specific examples about collecting and analyzing SOLiD data and showed how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.


Thursday, September 18, 2008

Road Trip: 454 Users Conference

Quiz: What can sequence small genomes in a single run? What can more than double or triple the EST database for any organism?
Answer: The Roche (454) Genome Sequencer FLX™ System.

Last week I had the pleasure of attending the Roche 454 users conference, where the new release (Titanium) of the 454 sequencer was highlighted. This upgrade produces more and longer reads, so that more than 600 million bases can be generated in each run. When compared to previous versions, the FLX Titanium produces about five times more data. The conference was well attended and outstanding, with informative presentations on science, technology, and practical experiences.

On the morning of the first full day, Bill Farmerie, from the University of Florida, presented on how he got into DNA sequencing as a service and how he sees Next Gen sequencing changing the core lab environment. Back in 1998 he set out to establish a genomics service and talked to many groups about what to do. They told him two important things:
  1. "Don't sweat the sequencing part - this is what we are trained for."
  2. "Worry about information management - this we are not trained for."
From there, he described how Next Gen got started in his lab, related his experiences over the past three years, and made these points:
  • The first two messages are still true. Sequencing gets solved; the problem is informatics.
  • DNA sequencing is expanding: more data are being produced faster and at lower cost.
  • This is democratizing genomics - many groups now have access to high throughput technology that provides "genome center" capabilities.
  • The next bioinformatics challenge is enabling the research community, the groups with the sequencing projects, to make use of their data and information. This is not like Sanger; core labs need to deliver results with data.
  • The way to approach new problems and increase scale is to relieve bioinformatics staff of the burden of doing routine things so they can focus on developing novel applications.
  • To accomplish the above point, buy what you can and build what you have to.
Other speakers made similar points. The informatics challenge begins in the lab, but quickly becomes a major problem for the end researcher.

Bill has been following his points successfully for many years now. We started working with him on his first genomics service and continue to support his lab with Next Gen. Our relationship with Bill and his group has been a great experience.

Other highlights from the meeting included:

A talk on continuous process improvements in DNA sequencing at the Broad Institute. Danielle Perrin presented work on how the Broad tackles process optimization issues during production to increase throughput, decrease errors, or save costs. From my perspective, this presentation really stressed the importance of coupling laboratory management with data analysis.

Multiple talks on microbial genomics. A strength of the 454 platform is that it generates long reads, making it a platform of choice for sequencing smaller genomes and performing metagenomic surveys. We were also introduced to the RAST (Rapid Annotation using Subsystem Technology) server, an ideal tool for working with your completed genome or metagenome data set.

Many examples of how having millions of reads makes new gene expression and variation analysis discoveries possible when compared to other platforms like microarrays. In these talks speakers were occasionally asked which is better, long 454 reads or short reads from Illumina or SOLiD? The speakers typically said you need both; they complement each other.

The Woolly Mammoth. Stephan Schuster from Penn State presented his and colleagues' work on sequencing mammoth DNA and its relatedness over thousands of years. Next Gen is giving us a new "omics," Museomics.

And, of course, our poster demonstrating how FinchLab provides an end-to-end workflow solution for 454 DNA sequencing. In the poster (you have to click the image to get the BIG picture), we highlighted some new features coming out at the end of the month. These include the ability to collect custom data during lab processing, coupling Excel to FinchLab forms, and work on 454 data analysis. Now you will be able to enter the bead counts, agarose images, or whatever else you need to track lab details to make those continuous process improvements. Excel coupling makes data entry through FinchLab forms even easier. The 454 data analysis complements our work with Sanger, SOLiD, and Illumina data to make the FinchLab platform complete for any genomics lab.


Thursday, June 5, 2008

Finishing in the Future

"The data sets are astronomical," "the data that needs to be attached to sequences is unbelievable," and "browsing [data] is incomprehensible." These are just three of the many quotes I heard about the challenges associated with DNA sequencing last week at the "Finishing in the Future Meeting" sponsored by the Joint Genome Institute (JGI) and Los Alamos National Laboratory (LANL).

Metagenomics

The two-and-a-half-day conference, focused on finishing genomic sequences, kicked off with a session on metagenomics. Metagenomics is about isolating DNA from environments and sequencing random molecules to "see what's out there." Excitement for metagenomics is being driven by Next Gen sequencing throughput, because so many sequences can be collected relatively inexpensively. A benefit of being able to collect such large data sets is that we can interrogate organisms that cannot be cultured. The first talk, "Defining the Human Microbiome: Friends or Family," was presented by Bruce Birren from the Broad Institute of MIT & Harvard. In this talk, we learned about the HMP (Human Microbiome Project), a project dedicated to characterizing the microbes that live on our bodies. It is estimated that microbial cells outnumber our cells by ten to one. It has long been speculated that our microbiomes are involved in our health and sickness, and recent studies are confirming these ideas.

Sequencing technologies continue to increase data throughput

The afternoon session opened with presentations from Roche (454), Illumina, and Applied Biosystems on their respective Next Gen sequencing platforms. Each company presented the strengths of its platform and the new discoveries being made by virtue of having a lot of data. Each also presented improvements and road maps designed to produce even more data. As Haley Fiske from Illumina put it, "we're in the middle of an arms race!" Finally, all the companies are working on molecular barcodes, so that multiple samples can be analyzed within an experiment. So, we started with a lot of data from a sample and are going to a lot of data from a lot of samples. That should add some very nice complexity to sample and data tracking.

A unique perspective

Sydney Brenner opened the second day with a talk on "The Unfinished Genome." The thing I like most about a Sydney Brenner talk is how he puts ideas together. In this talk he presented how one could look at existing data and literature to figure things out or make new discoveries. In one example, he speculated on when the genes for eye development may have first appeared. From the physiology of the eye you can use the biochemistry of vision to identify the genes that encode the various proteins involved in the process. These proteins are often involved in other processes, but differ slightly. They arise from gene duplication and modification. So, you can look at gene duplications and measure the age of a duplication by looking at neighboring genes. If a duplication event is old, neighboring genes will be unequal distances apart. You can use this information, along with phylogenetic data, to estimate when the events occurred. Of course this kind of study benefits from more sequence data. Sydney encouraged everyone to keep sequencing.

Sydney closed his talk with a fun analogy in which genomics is like astronomy and thus should have been called "genomy." He supported the analogy by noting that astronomy has astrophysics and genomics has genetics. Both are quantitative and measure history and evolution. Astronomy also has astrology, the prediction of an individual's future from the stars. Similarly, folks would like to predict an individual's future from their genes, and Sydney suggested we call this work "Genology," since it has the same kind of scientific foundation as astrology.

Challenges and solutions

The rest of the conference and posters focused on finishing projects. Today the genome centers are making use of all the platforms to generate large data sets and finish projects. A challenge for genomics is lowering finishing costs. The problem is that generating "draft" data has become so inexpensive and fast that finishing has become a significant bottleneck. Finishing is needed to produce the high quality reference sequences that will inform our genomic science, so investigating ways to lower finishing costs is a worthwhile endeavour. Genome centers are approaching this problem by looking at ways to mix data from different technologies such as 454 and Illumina or SOLiD. They are also developing new and mixed software approaches such as combining multiple assembly algorithms to improve alignments. These efforts are being conducted in conjunction with experiments where mixtures of single pass and paired read data sets are tested to determine optimal approaches for closing gaps.

The take home from this meeting is that, over the coming years, a multitude of new approaches and software programs will emerge to enable genome scale science. The current technology providers are aggressively working to increase data throughput, data quality and read length to make their platforms as flexible as possible. New technology providers are making progress on even higher throughput platforms. Computer scientists are working hard on new algorithms and data visualizations to handle the data. Molecular barcodes will allow for greater numbers of samples per data collection event and increase sample tracking complexity.

The bottom line

Individual research groups will continue to have increasing access to "genome center scale" technology. However, the challenges with sample tracking, data management, and data analysis will be daunting. Research groups with interesting problems will be cut off from these technologies unless they have access to cost-effective, robust informatics infrastructures. They will need help setting up their labs, organizing the data, and making use of new and emerging software technologies.

That's where Geospiza can help.


Saturday, February 16, 2008

Entering information in iFinch via FinchTV, part I

Teaching is a hard habit to break so I teach short courses now and then.

This year, I've been having my students use FinchTV to enter their BLAST results into iFinch. This works with FinchLab and other Finch systems, too.

This has been pretty helpful. The data get stored for each chromatogram and we can all view the results (I'll address this part in a later post.)

How does this work?

1. Log in to your Finch account. Open a chromatogram in FinchTV either by clicking the FinchTV icon or the link from the Chromat Read page that says Open in FinchTV.

2. When you're ready to enter information, click the Commit button (outlined below).



3. You'll see a message appear asking if you're sure. Say "yes."

4. Enter the information that you want to store. Since we were using FinchTV to connect to NCBI BLAST and identify our bacteria, I'm entering the conclusion from my BLAST results.

5. Then I click the "OK" button.



6. If I refresh my web browser page, I can see that the version number for my read is now at "2", and I can see that my information has been stored in the database.



In a later post, I'll show how we get that information out.

Stay tuned...


Monday, February 11, 2008

iFinch in education: metagenomics with JHU, part I.

iFinch is the perfect bioinformatics tool to accompany a class. I used it Fall quarter in a class that I teach at Shoreline Community College (Washington) and I'm using it right now in an on-line class that I teach at Austin Community College (Texas).

We cover several different topics in the class, but I have a fondness for long projects where we can use multiple techniques and tie everything to a common theme.

This semester we're working with bacterial sequences that were obtained from students at Johns Hopkins University. I've been collaborating with an instructor there for several years and now we have four years of data to sink our teeth into.

This video describes the first part of the project that we're working on.



JHU bacterial metagenomics project from Sandra Porter on Vimeo.


Using the Finch Q > 20 plots to evaluate your data


All of the Finch systems (Solutions Finch, FinchLab, and iFinch) have a folder report with visual snapshots that summarize the quality of data in that folder. The Q20 histogram plot is one of those tools, and in these next two posts I'll describe what we can learn from these plots.


First, we'll talk about the values on the x axis. When we use the term "Q > 20 bases," we're referring to the number of bases in a read that have a quality value greater than 20. If a base has a quality value of 20, there is a 1 in 100 chance that the base has been misidentified. We use the Q20 value to mark a threshold point where a base has an acceptable quality value.
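To make the "1 in 100" figure concrete, here is a minimal sketch of the arithmetic. It assumes quality values follow the usual Phred scale, where the error probability is 10^(-Q/10); that assumption matches the Q20 example above but is not spelled out in the Finch reports themselves. The function names are made up for illustration and are not part of any Finch software.

```python
def error_probability(quality_value):
    """Chance that a base call is wrong, assuming a Phred-scaled quality value."""
    return 10 ** (-quality_value / 10)


def count_good_bases(quality_values, threshold=20):
    """Count the bases in one read whose quality value is greater than the threshold."""
    return sum(1 for q in quality_values if q > threshold)


if __name__ == "__main__":
    # A quality value of 20 corresponds to roughly a 1 in 100 chance of error.
    print(error_probability(20))
    # A made-up read with six bases; only the values above 20 are counted.
    print(count_good_bases([8, 20, 21, 35, 40, 12]))
```

Note that the count uses "greater than 20," following the wording above, so a base with a quality value of exactly 20 is not counted as good.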

Histogram plots work by consolidating data that fit into a certain range. In the graph above, you can see that on the x axis, we show groups of reads. The first group contains reads that have less than 50 good (Q > 20) bases. The next group contains reads that have between 50 and 99 good bases, next 100 to 149, and so on.

On the y axis, we show the number of reads that fall into each group. In this graph, we have almost 30 reads that have over 950 good quality bases.

Uhmm, uhmm, uhhmmm, good sequence data, just the stuff I like to see.
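If you'd like to see how the grouping described above could work, here is a rough sketch. It assumes you already have the number of good (Q > 20) bases for each read and uses groups of 50 like the plot; the function name and the sample numbers are invented for illustration, and this is not how the Finch systems are actually implemented.

```python
from collections import Counter


def q20_histogram(good_base_counts, bin_width=50):
    """Group reads by their number of good bases.

    Returns a dict that maps the start of each group (0, 50, 100, ...)
    to the number of reads that fall into that group.
    """
    groups = Counter((n // bin_width) * bin_width for n in good_base_counts)
    return dict(sorted(groups.items()))


if __name__ == "__main__":
    # Made-up counts of good bases for six reads.
    reads = [10, 49, 50, 99, 100, 960]
    for start, number_of_reads in q20_histogram(reads).items():
        print(f"{start}-{start + 49} good bases: {number_of_reads} reads")
```

Each key is the low end of a group, so a read with 960 good bases lands in the group that starts at 950.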
