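(College of Information Engineering, Central South University of Technology, Changsha 410083)
Abstract: A new genomic data model and pattern discovery algorithm are proposed. The model consists of six components: artificial genome, artificial protein, evolutionary operation, evolutionary control, pattern matching, and termination judgement. The abstract algebraic structure is described dynamically by lattice set configurations and the corresponding finite state machine operations; candidate symbol sequences are generated by an evolutionary algorithm guided by symbolic dynamics; the degree of evolution is controlled by a meta-evolution mechanism characterized by rough sets; pattern matching is carried out by a syntactic pattern recognizer and a grammatical inference process; and termination judgement depends on the constraints of the specific problem being solved. The corresponding algorithm is a cyclic population search with implicit parallelism; its data structures mainly support coarse-grained symbolic processing and are combined with semantics-oriented modular programming. In applying this artificial life technique, candidate symbol sequences were generated automatically by computer, and “real” amino acid sequences were obtained from among them. The experimental results show that the proposed and implemented computational method contributes to establishing a unified computational theory of bioinformatics at the genomics level and to developing application systems.
Keywords: genomics, bioinformatics, evolutionary computation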
(College of Information Engineering, Central South University of Technology, Changsha 410083, P. R. China)
Abstract: A novel model of genomic data mining and a corresponding algorithm for pattern discovery are proposed. The model consists of six units: artificial genome, artificial proteome, evolutionary operation, evolutionary control, pattern matching, and termination judgement. The abstract algebraic structure is described dynamically by lattice set configurations and finite state automata. Candidate symbol sequences are generated by an evolutionary algorithm guided by symbolic dynamics. The degree of evolution is controlled by a meta-evolution mechanism expressed in terms of rough sets. Pattern matching is implemented by a syntactic pattern recognizer and grammatical inference. Termination judgement depends on the constraints of the concrete problem being solved. The algorithm is a cyclic population search with implicit parallelism. The data structures focus on coarse-grained symbolic information processing and are combined with semantics-oriented modular programming. In applying these artificial life techniques, candidate symbol sequences were produced automatically by computer, and “real” amino acid sequences were obtained from among them. The experimental results show that the computational method proposed and implemented here helps to build a unified computational theory of bioinformatics at the genomics level and to develop application systems.
Key words: genomics; bioinformatics; evolutionary computation
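The cycle summarized in the abstract (generate candidate sequences, apply evolutionary operations, match patterns, test for termination) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything named here is an assumption made for illustration: the motif `template` (with `.` as a wildcard), the position-counting `fitness` function, the regular expression standing in for the syntactic pattern recognizer, and all parameter values. The lattice structures, symbolic dynamics, and rough-set control of the evolution degree are not modeled.

```python
import random
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def random_sequence(rng, length):
    """Create one random candidate sequence (a stand-in 'artificial protein')."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mutate(rng, seq, rate):
    """Evolutionary operation: resample each residue with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else ch for ch in seq
    )


def crossover(rng, a, b):
    """Evolutionary operation: one-point crossover (sequences need length >= 2)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def fitness(seq, template):
    """Hypothetical score: number of positions compatible with the template."""
    return sum(t == "." or t == ch for ch, t in zip(seq, template))


def evolve(template, pop_size=60, max_generations=2000, rate=0.05, seed=0):
    """Cyclic population search; returns (matching sequence or None, generation)."""
    rng = random.Random(seed)
    recognizer = re.compile(template)  # regex stand-in for the pattern recognizer
    population = [random_sequence(rng, len(template)) for _ in range(pop_size)]
    for generation in range(max_generations):
        # Pattern matching and termination judgement.
        for seq in population:
            if recognizer.fullmatch(seq):
                return seq, generation
        # Selection: keep the better half unchanged, refill with offspring.
        population.sort(key=lambda s: fitness(s, template), reverse=True)
        parents = population[: pop_size // 2]
        children = [
            mutate(rng, crossover(rng, rng.choice(parents), rng.choice(parents)), rate)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return None, max_generations


if __name__ == "__main__":
    # Hypothetical motif: fixed C and H residues, wildcards elsewhere.
    print(evolve("C..C....H..H"))
```

Evaluating the whole population in each cycle reflects, in a very loose sense, the population-based search described above; a faithful version would replace the regular expression with an inferred grammar and the fixed mutation rate with the meta-evolution control.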