Feature selection plays an important role in sample classification based on gene expression profiles. A large number of genes increases the dimensionality and computational complexity of the analysis, as well as the future clinical cost. In this letter, the expression data set from Sharma et al. [1] (1368 genes in 62 normal and 40 tumor samples, including samples duplicated across batches) was re-analyzed using the Tclass system [2], which was developed at our center for tumor classification based on gene expression profiles. The results indicate that fewer than 10 genes are needed for early detection of breast cancer. By contrast, Sharma et al. [1] identified a set of 37 genes for early detection of breast cancer, with a classification accuracy of about 82%. The following figure displays the relationship between the number of genes and the classification accuracy obtained with the Tclass system.

Relationship between the number of genes and classification accuracy for (A) different partition ratios and (B) leave-one-out cross-validation.
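
The kind of curve summarized in panel (B) can be illustrated with a minimal sketch. It is not the Tclass algorithm: it assumes a simple t-like gene ranking and a nearest-centroid classifier as stand-ins, and it runs on synthetic data rather than the data set of [1]. All function names and parameters (`make_synthetic_data`, `rank_genes`, `nearest_centroid_predict`, `loocv_accuracy_curve`) are hypothetical. Genes are re-ranked inside each leave-one-out fold so that the held-out sample does not influence the selection.

```python
import random
import statistics


def make_synthetic_data(n_normal=20, n_tumor=20, n_genes=50, n_informative=3, seed=0):
    """Toy data (an assumption, not the data of [1]): the first
    n_informative genes are shifted upward in tumor samples."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label, count in ((0, n_normal), (1, n_tumor)):
        for _ in range(count):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                for g in range(n_informative):
                    row[g] += 2.0
            samples.append(row)
            labels.append(label)
    return samples, labels


def rank_genes(samples, labels):
    """Return gene indices sorted by an absolute t-like score, largest first."""
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == 0]
        b = [s[g] for s, y in zip(samples, labels) if y == 1]
        se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
        scores.append(abs(statistics.mean(a) - statistics.mean(b)) / (se + 1e-12))
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)


def nearest_centroid_predict(train, train_labels, test_row, genes):
    """Predict the class whose centroid over the selected genes is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        rows = [s for s, y in zip(train, train_labels) if y == label]
        centroid = [statistics.mean(r[g] for r in rows) for g in genes]
        dist = sum((test_row[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy_curve(samples, labels, gene_counts):
    """Leave-one-out accuracy for each number of top-ranked genes."""
    correct = {k: 0 for k in gene_counts}
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        ranked = rank_genes(train, train_labels)  # selection inside the fold
        for k in gene_counts:
            pred = nearest_centroid_predict(train, train_labels, samples[i], ranked[:k])
            if pred == labels[i]:
                correct[k] += 1
    return {k: correct[k] / len(samples) for k in gene_counts}


if __name__ == "__main__":
    data, classes = make_synthetic_data()
    for k, acc in loocv_accuracy_curve(data, classes, (1, 3, 10)).items():
        print(f"{k} genes: accuracy {acc:.2f}")
```

Plotting the returned accuracies against the number of genes gives a curve of the same form as the figure; where it flattens suggests how many genes are needed under these assumptions.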

References

  1. Sharma P, Sahni NS, Tibshirani R, Skaane P, Urdal P, Berghagen H, Jensen M, Kristiansen L, Moen C, Sharma P, Zaka A, Arnes J, Sauer T, Akslen LA, Schlichting E, Børresen-Dale AL, Lönneborg A: Early detection of breast cancer based on gene-expression patterns in peripheral blood cells. Breast Cancer Research 2005, 7:R634-R644

  2. Li WJ, Xiong MM: Tclass: tumor classification system based on gene expression profile. Bioinformatics 2002, 18:325-326