Bioinformatics Lab.

Phylogenetic Estimation of Metagenomic Sequences on the Basis of Batch-Learning Self-Organizing Map (BLSOM)

Takashi Abe(1), Kennosuke Wada(2), Shigehiko Kanaya(3), Toshimichi Ikemura(2)

  1. Graduate School of Science and Technology, Niigata University
  2. Nagahama Institute of Bio-science and Technology
  3. Nara Institute of Science and Technology

One of the most important tasks in the life sciences is to extract fundamental new knowledge from the huge quantity of genomic sequences accumulated in the International DNA Databanks. We developed a novel bioinformatics tool for large-scale, comprehensive studies of phylotype-specific characteristics that covers almost all available sequences from prokaryotic, eukaryotic and viral genomes. The self-organizing map (SOM), an unsupervised neural network algorithm, is an effective tool for clustering and visualizing high-dimensional complex data on a single map. We adapted the SOM to genome analysis by developing a batch-learning SOM (BLSOM).
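
To make the batch-learning idea concrete, the following is a minimal sketch of a generic batch SOM update, not the authors' BLSOM implementation. The function names, the initialisation along the diagonal of the data's bounding box, the Gaussian neighbourhood and the linearly shrinking radius are all assumptions chosen for illustration. What it shows is that every lattice point is updated once per pass from the whole data set, so the result does not depend on the order of the inputs (apart from floating-point rounding).

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, vector):
    """Return the lattice point whose weight vector is closest to vector."""
    return min(weights, key=lambda node: squared_distance(weights[node], vector))


def train_batch_som(data, rows, cols, epochs=20):
    """Train a small batch SOM; returns {(row, col): weight vector}."""
    dim = len(data[0])
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: deterministic initialisation along the diagonal of the
    # data's bounding box (a stand-in for any order-independent scheme).
    lo = [min(x[i] for x in data) for i in range(dim)]
    hi = [max(x[i] for x in data) for i in range(dim)]
    span = max(1, rows + cols - 2)
    weights = {
        (r, c): [lo[i] + (hi[i] - lo[i]) * (r + c) / span for i in range(dim)]
        for r, c in nodes
    }
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        # Assumption: neighbourhood radius shrinks linearly down to 0.5.
        sigma = max(0.5, sigma0 * (1.0 - epoch / epochs))
        # Step 1: assign every input to its best-matching lattice point.
        bmus = [best_matching_unit(weights, x) for x in data]
        # Step 2: recompute all lattice points at once from the whole batch.
        updated = {}
        for node in nodes:
            num = [0.0] * dim
            den = 0.0
            for x, bmu in zip(data, bmus):
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                den += h
                for i in range(dim):
                    num[i] += h * x[i]
            updated[node] = [v / den for v in num] if den > 0 else weights[node]
        weights = updated
    return weights


# Toy usage: two well-separated groups of 2-D points on a 2 x 2 lattice.
toy = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
toy_map = train_batch_som(toy, rows=2, cols=2)
```

In a real analysis the input vectors would be oligonucleotide frequency vectors rather than 2-D toy points, and the lattice would be far larger.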


We initially used BLSOM to analyze short oligonucleotide frequencies (di- to pentanucleotide frequencies) in a wide range of prokaryotic and eukaryotic genomes. When only fragmentary sequences (e.g., 10-kb sequences) from a mixture of genomes derived from multiple organisms are available, it appears impossible to identify how many and what types of genomes are present in the collected sequences. However, we found that BLSOM could classify the sequence fragments according to species without any information other than oligonucleotide frequencies. For most sequence fragments, BLSOM recognized species-specific characteristics of oligonucleotide frequencies, which permitted phylotype-specific clustering (self-organization) of the sequences and revealed the diagnostic oligonucleotides responsible for that clustering (Abe et al. 2003; 2005).
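
As an illustration of the only input BLSOM needs, the sketch below computes an oligonucleotide frequency vector for one sequence fragment. The function name `oligo_frequencies`, the alphabetical ordering of the oligonucleotides and the decision to skip windows containing ambiguous bases are assumptions made for this example; the published method may treat such details differently (for instance, how the two DNA strands are handled).

```python
from itertools import product


def oligo_frequencies(seq, k=4):
    """Return the k-mer frequency vector of seq in alphabetical k-mer order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Slide a window of width k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: skip windows with N or other codes
            counts[word] += 1
    total = sum(counts.values())
    # Dividing by the total turns counts into relative frequencies.
    return [counts[w] / total if total else 0.0 for w in kmers]


# Tetranucleotides give a 4**4 = 256-dimensional vector per fragment.
vector = oligo_frequencies("ACGTTGCAACGTAGCTAGCTAGGATCCA", k=4)
```

Dinucleotide to pentanucleotide vectors correspond to `k=2` through `k=5`, with 16 to 1,024 dimensions.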


Metagenomic studies of uncultivable microorganisms in environmental and clinical samples should allow extensive surveys of genes useful in medical and industrial applications. Traditional methods of phylogenetic assignment are based on sequence homology searches and therefore inevitably focus on well-characterized genes, for which the orthologous sequences required for constructing a reliable phylogenetic tree are available. However, most well-characterized genes are not industrially attractive. Because the alignment-free clustering method BLSOM does not depend on such orthologous sequence sets, it is well suited to this purpose. For phylogenetic classification of sequences of unknown origin obtained from environmental and clinical samples, BLSOMs have to be constructed in advance with all available sequences from known species of prokaryotes and eukaryotes, as well as from viruses and organelles. Using high-performance supercomputers, we clustered (self-organized) these sequences on BLSOMs according to phylotype with high accuracy.


To estimate phylotypes of the metagenomic sequences, three types of large-scale BLSOMs, namely Kingdom-, Prokaryote- and Genus group-BLSOM, were constructed in advance, using sequences deposited in DDBJ/ENA/GenBank as previously described. Kingdom-BLSOM was constructed with tetranucleotide frequencies for 5-kb sequences from the whole-genome sequences of 111 eukaryotes, 2,813 prokaryotes, 1,728 mitochondria, 110 chloroplasts and 31,486 viruses. To obtain more detailed phylotype information for prokaryotic sequences, Prokaryote- and Genus group-BLSOM were constructed with a total of 3,500,000 5-kb sequences from 3,157 species, for which at least 10 kb of sequence was available from DDBJ/ENA/GenBank.
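
The reference maps are built from fixed-length 5-kb sequences, so each genome first has to be cut into fragments. The following is a minimal sketch of that step. The function name `split_into_fragments`, the use of non-overlapping windows by default and the discarding of a trailing piece shorter than the window are assumptions, since the text does not state how fragments were generated.

```python
def split_into_fragments(genome, size=5000, step=None):
    """Cut a genome string into fragments of exactly `size` bases.

    Assumptions: windows do not overlap unless `step` is smaller than
    `size`, and a trailing piece shorter than `size` is discarded.
    """
    step = size if step is None else step
    return [genome[i:i + size] for i in range(0, len(genome) - size + 1, step)]


# Toy usage: a 12-kb sequence yields two full 5-kb fragments.
fragments = split_into_fragments("ACGT" * 3000)
```

Each fragment would then be converted to a tetranucleotide frequency vector before map construction.
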
Metagenomic sequences longer than 300 bp were mapped onto Kingdom-BLSOM, after normalization for sequence length, by finding the lattice point with the minimum Euclidean distance in the multidimensional space. To resolve the phylogeny of the metagenomic sequences mapped to the prokaryotic territories of Kingdom-BLSOM in more detail, these sequences were subsequently mapped onto Prokaryote-BLSOM. Similar stepwise mappings onto BLSOMs constructed with sequences from more detailed phylogenetic categories (e.g., phylum and genus) were then conducted to obtain further phylogenetic detail.
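
The stepwise mapping described above can be sketched as follows. This is an illustration under stated assumptions rather than the software's actual code: the maps are plain dictionaries from lattice coordinates to weight vectors, length normalization is taken to mean dividing oligonucleotide counts by their total, and the 2-D toy vectors, labels and function names are hypothetical.

```python
import math


def normalise(counts):
    """Assumption: length normalization = counts divided by their total."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def nearest_lattice_point(som, vector):
    """Return the lattice point with the minimum Euclidean distance."""
    return min(som, key=lambda node: math.dist(som[node], vector))


def map_stepwise(counts, kingdom_map, kingdom_labels, prok_map, prok_labels):
    """Map onto the kingdom-level map, then refine prokaryotic hits only."""
    vector = normalise(counts)
    kingdom = kingdom_labels[nearest_lattice_point(kingdom_map, vector)]
    if kingdom != "Prokaryote":
        return kingdom, None
    detail = prok_labels[nearest_lattice_point(prok_map, vector)]
    return kingdom, detail


# Hypothetical toy maps with 2-D weight vectors (real maps use 256-D
# tetranucleotide vectors and many more lattice points).
kingdom_map = {(0, 0): [0.9, 0.1], (0, 1): [0.1, 0.9]}
kingdom_labels = {(0, 0): "Prokaryote", (0, 1): "Virus"}
prok_map = {(0, 0): [0.85, 0.15], (0, 1): [0.6, 0.4]}
prok_labels = {(0, 0): "Phylum A", (0, 1): "Phylum B"}

result = map_stepwise([90, 10], kingdom_map, kingdom_labels, prok_map, prok_labels)
```

Because the counts are normalized first, a short read with counts `[90, 10]` and a longer one with `[900, 100]` land on the same lattice point. Further levels (for example a genus-level map) would repeat the same nearest-point lookup on the next map.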


Because BLSOM does not require orthologous sequence sets, this alignment-free method could provide a new systematic strategy for revealing microbial diversity and the relative abundance of different phylotypes of uncultured microorganisms, including viruses, in environmental and clinical samples. This software will be made freely available.



User Guide

  • Frequently used output files
    • Myresult_Top.txt: Phylogenetic estimation results (Kingdom/Phylum/Genus) for each sequence.
    • Myresult_Hist.txt: The number of sequences assigned to each category at each of the Kingdom/Phylum/Genus levels.


  • Recommended PC configuration
    • CPU: Intel Core 2 Duo 2.8 GHz or better
    • Memory: 2 GB or more
    • HDD: Free space of 4 GB or more (excluding swap space)
    • Video: Resolution of 1280 x 1024 pixels or more, 16-bit or higher color
  • OS
    • Microsoft Windows 7 or later
    • Microsoft .NET Framework 4.0 or later runtime environment


  1. Takashi Abe, Shigehiko Kanaya, Makoto Kinouchi, Yuta Ichiba, Tokio Kozuki and Toshimichi Ikemura. Informatics for unveiling hidden genome signatures. Genome Research, 13, 693-702, 2003.
  2. Takashi Abe, Hideaki Sugawara, Makoto Kinouchi, Shigehiko Kanaya and Toshimichi Ikemura. Novel Phylogenetic Studies of Genomic Sequence Fragments Derived from Uncultured Microbe Mixtures in Environmental and Clinical Samples. DNA Research, 12, 281-290, 2005.
  3. Takashi Abe, Shigehiko Kanaya, Hiroshi Uehara and Toshimichi Ikemura. A novel bioinformatics strategy for function prediction of poorly-characterized protein genes obtained from metagenome analyses. DNA Research, 16, 287-298, 2009.
  4. Hiroshi Uehara, Yuki Iwasaki, Chieko Wada, Kennosuke Wada, Toshimichi Ikemura and Takashi Abe. A novel bioinformatics strategy for searching industrially useful genome resources from metagenomic sequence libraries. Genes & Genetic Systems, 86, 53-66, 2011.
  5. Ryo Nakao, Takashi Abe, Ard M. Nijhof, Seigo Yamamoto, Frans Jongejan, Toshimichi Ikemura, Chihiro Sugimoto. A novel approach, based on BLSOMs (Batch Learning Self-Organizing Maps), to the microbiome analysis of ticks. ISME Journal, 7, 1003-1015, 2013.