Earlham Institute releases open source software to help identify gene families
Researchers at Earlham Institute (EI) have released ‘GeneSeqToFamily’, an open-source Galaxy workflow, based on the ‘EnsemblCompara GeneTrees’ pipeline, that helps scientists to find gene families.
Published in GigaScience, the open-source Galaxy workflow aims to make researchers' job of finding gene families much easier.
Co-author Wilfried Haerty, Group Leader of Evolutionary Genomics at EI, explained why this tool is so useful to biologists: ‘The software developed at the Earlham Institute enables scientists to investigate species of interest using a flexible and reproducible pipeline. The performance of our workflow was assessed on vertebrate genome assemblies of various qualities (platypus, pig, horse, dog, mouse and human). The species were selected to assess the impact of genome quality on gene families identification. The mouse, dog and human genomes are of high quality whereas the three others are at different stages of analysis completion.’
Based on and expanding Ensembl’s existing EnsemblCompara GeneTrees pipeline, the GeneSeqToFamily workflow removes many of the complex prerequisites of the process, such as having to use the command line to install a large number of separate tools, by converting the whole process into Galaxy, a much simpler platform to use.
Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research. The workflow is highly customisable, allowing users to choose parameters, change tools and run the software on their own genes, without having to use the Ensembl database.
GeneSeqToFamily contains a number of new, standalone Galaxy tools, including TreeBeST, hcluster_sg, T-Coffee and ETE. Developed at EI by Anil Thanki and Nicola Soranzo of the Data Infrastructure Group, the software makes the process of finding and generating phylogenetic trees easier, using a range of open platforms and databases.
Anil Thanki, scientific programmer at EI, commented: ‘We are excited to put our work in the open domain, where it allows biologists and bioinformaticians to use the Ensembl Compara GeneTrees Pipeline in a simple, graphical user interface and modify it if needed.’
The team hopes that the new workflow will help users unfamiliar with the complexities of using Compara to analyse phylogenetic datasets more easily, while collating a number of useful gene family tools in one Galaxy workflow. Users can either select existing Ensembl databases to use as the reference sets for their analysis, or provide their own data in the same format, with tools provided to help them do so.