Dr. Chen is a Professor of Computer Science and the founding director of the Data Sciences and Analytics Lab (DSAL) in the Department of Computer Science at Wayne State University. He served as the Department Chair between 2012 and 2014. Dr. Chen received his PhD in 2001 from Carnegie Mellon University. His main research interests include Data Sciences and Advanced Analytics, Data Mining, Machine Learning, Bioinformatics, and Healthcare Informatics. Dr. Chen has published over 100 peer-reviewed papers in these fields in top journals and conferences such as KDD, ICML, Bioinformatics, and IEEE TKDE. His research is funded by federal agencies such as the National Science Foundation and the National Institutes of Health, as well as by local industry. He serves as an Editorial Board Member for several international journals such as BMC Systems Biology and IEEE Access. He also served as a Conference Chair or Program Chair for several international conferences such as the Thirteenth International Conference on Machine Learning and Applications (ICMLA) in 2014, the 21st ACM Conference on Information and Knowledge Management (CIKM) in 2012, and the IEEE International Conference on Bioinformatics and Biomedicine (BIBM) in 2009. He has also served as a Program Committee Member for numerous international conferences.
Big Data, Data Sciences, Machine Learning, Data Mining, Bioinformatics, Healthcare Informatics, Multimedia Data Analytics
Selected Recent Publications
Keywords: Data Sciences, Big Data, Deep Learning
o X. Chen and X. Lin: Big Data Deep Learning: Challenges and Perspectives. IEEE Access, vol. 2, 514-525, 2014; DOI: 10.1109/ACCESS.2014.2325029.
Deep learning is currently an extremely active research area in the machine learning and pattern recognition communities. It has achieved great success in a broad range of applications such as speech recognition, computer vision, and natural language processing. With the sheer size of data available today, big data brings big opportunities and transformative potential for various sectors; on the other hand, it also presents unprecedented challenges to harnessing data and information. As the data keeps getting bigger, deep learning is coming to play a key role in providing big data predictive analytics solutions. In this paper, we provide a brief overview of deep learning, and highlight current research efforts and the challenges to big data, as well as the future trends.
o K. Zhang and X. Chen: Large-scale Deep Belief Nets with MapReduce. IEEE Access, vol. 2, 395-403, 2014; DOI: 10.1109/ACCESS.2014.2319813.
Deep belief nets (DBNs) with restricted Boltzmann machines (RBMs) as the building block have recently attracted wide attention due to their great performance in various applications. The learning of a DBN starts with pretraining a series of RBMs, followed by fine-tuning the whole net using backpropagation. Generally, the sequential implementation of both the RBMs and the backpropagation algorithm takes a significant amount of computational time to process massive data sets. The emerging big data learning requires distributed computing for the DBNs. In this paper, we present a distributed learning paradigm for the RBMs and the backpropagation algorithm using MapReduce, a popular parallel programming model. The experimental results demonstrate that the distributed RBMs and DBNs are amenable to large-scale data with good performance in terms of accuracy and efficiency.
o M. Aslan, X. Chen, and H. Cheng: Learning sparse and scale-free networks. 2014 International Conference on Data Science and Advanced Analytics (DSAA)
Gaussian networks study undirected interactions between random variables through the estimation of precision matrices. Recently, it has been demonstrated that some important networks display features similar to scale-free graphs. There have been few works on learning sparse Gaussian graphical models that aim to preserve the properties of networks believed to be scale-free or to have dominating hubs. We propose a new log-likelihood formulation, which promotes the sparseness of the precision matrix and features of scale-free graphical topology. We used the alternating direction method of multipliers (ADMM), a method for convex optimization, to solve the general L1 regularized loss optimization. Our proposed method exhibits better estimation performance on various data sets and numbers of samples, N. Also, the proposed method and some state-of-the-art methods are tested under various penalty constants to validate the robustness.
o Y. Chen, H. Sampathkumar, B. Luo, and X. Chen: iLike: Bridging the semantic gap in vertical image search by integrating text and visual features. IEEE Transactions on Knowledge and Data Engineering, vol. 25(10), pp. 2257-2270, 2013.
With the development of Internet and Web 2.0, large-volume multimedia contents have been made available online. It is highly desired to provide easy accessibility to such contents, i.e., efficient and precise retrieval of images that satisfies users' needs. Toward this goal, content-based image retrieval (CBIR) has been intensively studied in the research community, while text-based search is better adopted in the industry. Both approaches have inherent disadvantages and limitations. In this paper, we present iLike, a vertical image search engine that integrates both textual and visual features to improve retrieval performance. We bridge the semantic gap by capturing the meaning of each text term in the visual feature space, and reweight visual features according to their significance to the query terms. We also bridge the user intention gap because we are able to infer the "visual meanings" behind the textual queries. Last but not least, we provide a visual thesaurus, which is generated from the statistical similarity between the visual space representation of textual terms. Experimental results show that our approach improves both precision and recall, compared with content-based or text-based image retrieval techniques. More importantly, search results from iLike are more consistent with users' perception of the query terms.
o H. Cheng, Z. Liu, L. Yang, and X. Chen: Sparse representation and learning in visual recognition: theory and applications. Signal Processing, 93(6): 1408-1425, 2012.
Sparse representation and learning has been widely used in computational intelligence, machine learning, computer vision, pattern recognition, etc. Mathematically, solving sparse representation and learning involves seeking the sparsest linear combination of basis functions from an overcomplete dictionary. A rationale behind this is the sparse connectivity between nodes in the human brain. This paper presents a survey of some recent work on sparse representation, learning and modeling with emphasis on visual recognition. It covers both the theory and application aspects. We first review the sparse representation and learning theory including general sparse representation, structured sparse representation, high-dimensional nonlinear learning, Bayesian compressed sensing, sparse subspace learning, non-negative sparse representation, robust sparse representation, and efficient sparse representation. We then introduce the applications of sparse theory to various visual recognition tasks, including feature representation and selection, dictionary learning, Sparsity Induced Similarity (SIS) measures, sparse coding based classification frameworks, and sparsity-related topics.
o M. Wasikowski and X. Chen: Combating the Small Sample Class Imbalance Problem Using Feature Selection. IEEE Transactions on Knowledge and Data Engineering, vol. 22(10): 1388-1400, 2010.
The class imbalance problem is encountered in real-world applications of machine learning and results in a classifier's suboptimal performance. Researchers have rigorously studied the resampling, algorithms, and feature selection approaches to this problem. No systematic studies have been conducted to understand how well these methods combat the class imbalance problem and which of these methods best manage the different challenges posed by imbalanced data sets. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have looked at the additional problem of learning from small samples. This paper presents a first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using area under the receiver operating characteristic (AUC) and area under the precision-recall curve (PRC). We compared each metric on the average performance across all problems and on the likelihood of a metric yielding the best performance on a specific problem. Our results showed that signal-to-noise correlation coefficient (S2N) and Feature Assessment by Sliding Thresholds (FAST) are great candidates for feature selection in most applications, especially when selecting very small numbers of features.
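As a rough illustration of the signal-to-noise (S2N) metric named in the abstract above, the sketch below uses the commonly cited definition: the difference of the two class means divided by the sum of the two class standard deviations, computed per feature. It is a minimal sketch and not the paper's implementation. The function names, the use of sample standard deviations, the handling of zero spread, and the toy data are all assumptions made for illustration.

```python
from statistics import mean, stdev

def s2n_scores(rows, labels):
    """Return one signal-to-noise score per feature column.

    rows   : list of equal-length feature vectors
    labels : list of 0/1 class labels, one per row
    Each class needs at least two samples so that stdev is defined.
    """
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    scores = []
    for j in range(len(rows[0])):
        p = [r[j] for r in pos]
        n = [r[j] for r in neg]
        spread = stdev(p) + stdev(n)  # sample standard deviations (assumption)
        # With zero spread the ratio is undefined; this sketch returns 0.0 (assumption).
        scores.append((mean(p) - mean(n)) / spread if spread > 0 else 0.0)
    return scores

def top_k_features(rows, labels, k):
    """Indices of the k features with the largest absolute S2N score."""
    scores = s2n_scores(rows, labels)
    return sorted(range(len(scores)), key=lambda j: -abs(scores[j]))[:k]

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
toy_rows = [[5.0, 1.0], [6.0, 2.0], [5.5, 1.5],
            [1.0, 1.2], [2.0, 1.8], [1.5, 1.6]]
toy_labels = [1, 1, 1, 0, 0, 0]
print(top_k_features(toy_rows, toy_labels, 1))
```

Ranking by the absolute score treats features that are elevated in either class the same way, which is one reasonable choice when only a very small number of features is kept.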
Keywords: Bioinformatics & Healthcare Informatics
o H. Sampathkumar, X. Chen, and B. Luo: Mining Adverse Drug Reactions from Online Healthcare Forums Using Hidden Markov Model. BMC Medical Informatics and Decision Making, 14:91, 2014
Adverse Drug Reactions (ADRs) are one of the leading causes of injury or death among patients undergoing medical treatments. Not all ADRs are identified before a drug is made available in the market. Current post-marketing drug surveillance methods, which are based purely on voluntary spontaneous reports, are unable to provide the early indications necessary to prevent the occurrence of such injuries or fatalities. The objective of this research is to extract reports of adverse drug side-effects from messages in online healthcare forums and use them as early indicators to assist in post-marketing drug surveillance. We treat the task of extracting adverse side-effects of drugs from healthcare forum messages as a sequence labeling problem and present a Hidden Markov Model (HMM) based Text Mining system that can be used to classify a message as containing drug side-effect information and then extract the adverse side-effect mentions from it. The results from the HMM based Text Miner are encouraging and motivate further enhancements to this approach. The mined novel side-effects can act as early indicators for health authorities to help focus their efforts in post-marketing drug surveillance.
o Z. Zhang, Z. Hailat, M. J. Falk, and X. Chen: Integrative analysis of independent transcriptome data for rare diseases. METHODS, 69(3): 315-325, 2014
High-throughput technologies have been generating a great amount of publicly available gene expression data. For rare diseases that lack clinical samples and research funding, there is a practical benefit to jointly analyzing existing data sets commonly related to a specific rare disease. In this study, we collected a number of independently generated transcriptome data sets from four species: human, fly, mouse and worm. All data sets included samples with both normal and abnormal mitochondrial function. We reprocessed each data set to standardize format, scale and gene annotation and used the HomoloGene database to map genes between species. A standardized procedure was also applied to compare gene expression profiles of normal and abnormal mitochondrial function to identify differentially expressed genes. We further used meta-analysis and other integrative analyses to recognize patterns across data sets and species. Novel insights related to mitochondrial dysfunction were revealed via these analyses, such as a group of genes consistently dysregulated by impaired mitochondrial function in multiple species. This study created a template for the study of rare diseases using genomic technologies and advanced statistical methods. All data and results generated by this study are freely available and stored at http://goo.gl/nOGWC2, to support further data mining.
o J. Jeong and X. Chen: A new semantic functional similarity over gene ontology. IEEE Trans. on Computational Biology and Bioinformatics, 12(2), 322-334, 2014.
Identifying functionally similar or closely related genes and gene products has significant impacts on biological and clinical studies as well as drug discovery. In this paper, we propose an effective and practically useful method measuring both gene and gene product similarity by integrating the topology of gene ontology, known functional domains and their functional annotations. The proposed method is comprehensively evaluated through statistical analysis of the similarities derived from sequence, structure and phylogenetic profiles, and clustering analysis of disease genes clusters. Our results show that the proposed method clearly outperforms other conventional methods. Furthermore, literature analysis also reveals that the proposed method is both statistically and biologically promising for identifying functionally similar genes or gene products. In particular, we demonstrate that the proposed functional similarity metric is capable of discovering new disease related genes or gene products.
o X. Chen, J. Jeong, and P. Dermyer: KUPS: Constructing datasets of interacting and non-interacting protein pairs with associated attributes. Nucleic Acids Research, 2011, Jan; 39:D750-4.
KUPS (The University of Kansas Proteomics Service) provides high-quality protein–protein interaction (PPI) data for researchers developing and evaluating computational models for predicting PPIs by allowing users to construct ready-to-use data sets of interacting protein pairs (IPPs), non-interacting protein pairs (NIPs) and associated features. Multiple filters and options allow the user to control the make-up of the IPPs and NIPs as well as the quality of the resultant data sets. Each data set is built from the overall database, which includes 185 446 IPPs and ∼1.5 billion NIPs from five primary databases: IntAct, HPRD, MINT, UniProt and the Gene Ontology. The IPP set can be set to specific model organisms, interaction types and experimental evidence. The NIP set can be generated using four different strategies, which can alleviate biased estimation problems. Lastly, multiple features can be provided for all of the IPP and NIP pairs. Additionally, KUPS provides two benchmark data sets to help researchers compare their algorithms to existing approaches. KUPS is freely available at http://www.ittc.ku.edu/chenlab.
o A. Senf and X. Chen: Identification of Genes Involved in the Same Pathway Using a Hidden Markov Model-based Approach. Bioinformatics, 25(22): 2945-2954, 2009.
The sequencing of whole genomes from various species has provided us with a wealth of genetic information. To make use of the vast amounts of data available today it is necessary to devise computer-based analysis techniques. We propose a Hidden Markov Model (HMM) based algorithm to detect groups of genes functionally similar to a set of input genes from microarray expression data. A subset of experiments from a microarray is selected based on a set of related input genes. HMMs are trained from the input genes and a group of random gene input sets to provide significance estimates. Every gene in the microarray is scored using all HMMs and significant matches with the input genes are retained. We ran this algorithm on the life cycle of Drosophila microarray data set with KEGG pathways for cell cycle and translation factors as input data sets. Results show high functional similarity in resulting gene sets, increasing our biological insight into gene pathways and KEGG annotations. The algorithm performed very well compared to the Signature Algorithm and a purely correlation-based approach.
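The abstract above describes scoring every gene with trained HMMs. One standard way to score a sequence under a discrete HMM is the forward algorithm, sketched below as a log-likelihood computation. This is a generic illustration, not the paper's method: the two-state model, the low/high discretization of expression values, and every parameter value here are hypothetical.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Log P(obs | HMM) computed with the scaled forward algorithm.

    obs   : list of observation symbol indices
    start : start[i] is the initial probability of state i
    trans : trans[i][j] is the probability of moving from state i to state j
    emit  : emit[i][s] is the probability that state i emits symbol s
    """
    n = len(start)
    # Initialise with the first observation.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        # The log of each scaling factor accumulates into the log-likelihood.
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik

# Hypothetical model: two "sticky" states over symbols 0 (low) and 1 (high).
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]

steady = forward_log_likelihood([0, 0, 0, 0], start, trans, emit)
flipping = forward_log_likelihood([0, 1, 0, 1], start, trans, emit)
print(steady > flipping)  # the steady profile fits this toy model better
```

Comparing such scores against those obtained from models trained on random input sets is one way to attach a significance estimate, in the spirit of what the abstract outlines.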
Software and Services (by our group)
o KUPS – Proteomics Service (for protein-protein interactions)
o DDINet – A network of interacting protein domains
o CSIDOP – Protein function assignment
o Microarray – HMM for Microarray analysis
o KU GOAL – Gene Ontology Analysis layer
PhD Students, Post-doc, Visiting Scholars
Current Graduate Students, Post-doc, and Visiting Scholars
o Melih Aslan (Postdoc)
o Kunlei Zhang (Postdoc)
o Weiwei Zong (Postdoc)
o Mingyu You, PhD, Visiting Scholar, 2014 – 2015 (Tongji University, China)
o Zeyad Hailat (PhD student, 2012 - )
o Tarik Khalid Alafif (PhD student, 2012 - )
o Artem Komaruchev (PhD student, 2013 - )
o Jing Yu (PhD student, 2013 - )
o Elaheh Rashedi (PhD student, 2014 - )
o Iatuma Itauma (MS student)
o Yingbo Jiang (MS student)
o Ruoyun Pang (MS student)
o Faria Mahnaz (MS student)
Former Group Members
o Changlin Ma, PhD, Visiting Scholar, 2013 – 2014 (currently Huazhong Normal University, China)
o Jiangsheng Yu, Postdoc, currently Senior Data Scientist, Tokyo Electron America, California, USA
o Huilin Xiong, Postdoc, currently Professor at Shanghai Jiaotong University, Shanghai, China
o Mei Liu, PhD, currently Assistant Professor at University of Kansas Medical Center
o Alex Senf, PhD, currently with European Bioinformatics Institute (EBI), UK
o Jong Cheol Jeong, PhD, currently Postdoc, Harvard University, USA
o Meeyong Park, PhD, currently Postdoc, University of Michigan, USA
o Bing Han, PhD, Senior Machine Learning Data Scientist, Intertrust Technologies Corporation, CA
Fall 2015: CSC5991 Foundations of Data Science
Big Data are omnipresent in contemporary scientific, engineering, government, social and business applications. They contain a huge amount of information that cannot be analyzed by traditional data analytic tools. Consequently, data science has emerged as a new and exciting discipline that explores novel techniques and theories, rooted in many fields such as mathematics, statistics, computer science, and information theory, for extracting knowledge from Big Data. This course will cover foundational aspects of data science such as optimization, informatics theory, and machine learning. It is designed to equip students with the theoretical foundations and practical skills needed to analyze data.