Header logo is

Unsupervised Sequence Segmentation by a Mixture of Switching Variable Memory Markov Sources

2001

Conference Paper

ei


We present a novel information theoretic algorithm for unsupervised segmentation of sequences into alternating Variable Memory Markov sources. The algorithm is based on competitive learning between Markov models, when implemented as Prediction Suffix Trees (Ron et al., 1996) using the MDL principle. By applying a model clustering procedure, based on rate distortion theory combined with deterministic annealing, we obtain a hierarchical segmentation of sequences between alternating Markov sources. The algorithm seems to be self regulated and automatically avoids over segmentation. The method is applied successfully to unsupervised segmentation of multilingual texts into languages where it is able to infer correctly both the number of languages and the language switching points. When applied to protein sequence families, we demonstrate the method‘s ability to identify biologically meaningful sub-sequences within the proteins, which correspond to important functional sub-units called domains.

Author(s): Seldin, Y. and Bejerano, G. and Tishby, N.
Journal: In the proceeding of the 18th International Conference on Machine Learning (ICML 2001)
Pages: 513-520
Year: 2001
Day: 0

Department(s): Empirical Inference
Bibtex Type: Conference Paper (inproceedings)

Event Name: 18th International Conference on Machine Learning (ICML 2001)

Digital: 0
Language: en
Organization: Max-Planck-Gesellschaft
School: Biologische Kybernetik

Links: PDF

BibTex

@inproceedings{6588,
  title = {Unsupervised Sequence Segmentation by a 
  Mixture of Switching Variable Memory Markov Sources},
  author = {Seldin, Y. and Bejerano, G. and Tishby, N.},
  journal = {In the proceeding of the 18th International Conference on Machine Learning (ICML 2001)},
  pages = {513-520},
  organization = {Max-Planck-Gesellschaft},
  school = {Biologische Kybernetik},
  year = {2001}
}