DNA N6-methyladenine (6mA) is an important DNA methylation modification that takes part in many biological regulatory processes. This article uses a publicly available mouse dataset to study it. First, the mouse gene sequences (A, T, C, G) are encoded numerically using mathematical descriptors. Next, a chi-square test is applied to the encoded information to select the features related to 6mA sites for further study. Finally, seven machine learning algorithms are used to build classification models, and the predictions are validated with five-fold cross-validation (5-Fold Cross-Validation). The results show that, with the sliding-window encoding and the top 20 features selected as training-sample features, the random forest model reaches a prediction accuracy of 1 for mouse 6mA sites.
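The encoding, chi-square ranking, and five-fold split described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the sliding-window encoding, so k-mer counting with a window of width `k` is assumed here, and the function names (`sliding_window_counts`, `chi2_scores`, `top_k_features`, `k_fold_indices`) are hypothetical. The chi-square statistic compares the per-class sum of each count feature with the sum expected if the feature were independent of the class label.

```python
from collections import Counter
from itertools import product


def sliding_window_counts(seq, k=2):
    """Encode a DNA sequence as counts of every k-mer over A, C, G, T.

    Assumption: the paper's "sliding window" encoding is modelled here
    as plain k-mer counting; the original encoding may differ.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of width k along the sequence, one base at a time.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[m] for m in kmers]


def chi2_scores(X, y):
    """Chi-square statistic of each non-negative feature against the labels."""
    n_samples, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    class_size = {c: sum(1 for label in y if label == c) for c in classes}
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = feature_total * class_size[c] / n_samples
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def top_k_features(scores, k=20):
    """Indices of the k highest-scoring features (ties keep original order)."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def k_fold_indices(n_samples, n_folds=5):
    """Yield (train, test) index lists for a simple interleaved k-fold split."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

In use, each sequence would be passed through `sliding_window_counts`, the resulting matrix scored with `chi2_scores`, and the columns returned by `top_k_features(scores, 20)` fed to a classifier such as a random forest inside each fold from `k_fold_indices`. Running the feature selection inside each training fold, rather than once on the full dataset, keeps the held-out fold out of the selection step.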
FENG Xin, LI Yingrui, WANG Ping, DONG Zheyuan, XIN Ruihao. Research on Prediction Method of Mouse Gene Loci Based on Machine Learning [J]. Journal of Jilin Institute of Chemical Technology, 2022, 39(11): 14-19.