CSC 7991 Introduction to Data Mining


Class information

Course #: CSC 7991

Prerequisite: graduate status in a biological science or computer science, or approval of the instructor.

Day: T-Th

Room: 306 State Hall

Hours: 3:00 p.m. - 4:20 p.m.

Instructor information

Instructor: Sorin Draghici

Office: 408 State Hall

Office hours: Tue: 6:00 p.m. - 7:00 p.m. or by appointment.

Telephone: 577-5484


Web page:

On this web page you can find the syllabus and any announcements regarding the course.


Required: Lecture notes

Recommended: Introduction to data mining - Ian H. Witten, Eibe Frank

Course objectives:

The course focuses on the analysis of microarray data. Its goal is to present the main data mining techniques in a way that is useful to the biological scientist. The primary audience is the researcher or practitioner with a background in the biological sciences who needs computational tools to analyze data. The course is also intended for computer scientists who would like to apply their background to problems at the border with biology and medicine. It explains the specific challenges that such problems pose, as well as the adaptations that classical algorithms must undergo to give good results in this field.

Important dates:

Final exam: 25 April

Revision for final exam: 18 April

Midterm exam: 7 March


Elementary calculus and basic algebra.

Course contents



Elements of statistics

Measures of central tendency, measures of variability, the normal distribution, some statistical tests (t-test, Mann-Whitney, etc).

Data preparation

Pre-processing, flip-dye experiments, background correction, etc.


Divide/subtract mean, replicates, thresholding, ratios, log transform.

Data analysis and Data mining

Why, what, how.

Basic tools

Histograms, scatterplots, time series

Selection of differentially regulated genes

Fold change, unusual ratio, maximum likelihood, confidence analysis, SAM.

Exploratory analysis

PCA, similarity measures




k-means, hierarchical, top down, bottom up, SOFM, how and when to do what.

Advanced unsupervised tools

Cluster confidence and significance analysis

Supervised learning

Issues in supervised learning: training, validation, curse of dimensionality.

Supervised techniques

Neural networks, gene shaving

Other techniques

Bayesian techniques, etc.

Other techniques & Revision



Class policies

Attendance: Attending all lectures is essential; the assignments, exams, quizzes, etc. will be based primarily (though not exclusively) on the material presented in these lectures. Also, assignment due dates, as well as explanations and clarifications of assignments, will be presented during lecture and lab sessions. If you miss a lecture or lab session, it is your responsibility to obtain the information covered in the session.

Health Safety: Please report to the instructor any health condition which may create a classroom emergency (e.g. seizure disorders, diabetes, heart conditions, etc.).

Computer lab: To enhance your learning and for your homework, the computer lab, equipped with PCs, is available to you during the hours posted on the lab's door.

If you have a PC and appropriate software at home, you are encouraged to work at home. However, it is your responsibility to make sure that your homework is fully compatible with the equipment in the undergraduate lab and to transfer your homework to the equipment in the lab so that it is available for assessment on the due date.

Grading procedures

Assignments, quizzes, examinations and final project: There will be a number of assignments, each due at the beginning of the class session on its due date. Late submissions (but not later than one week) will carry a deduction of 10% of the marks for each day they are late. If you must, you can turn in late homework to the secretary in the Department of Computer Science main office (431 State Hall, open weekdays from 9am to 5pm). No assignment will be accepted more than 9 calendar days past its due date. Since each assignment is an integral part of the course, the instructor reserves the right to give a failing grade to anyone who turns in 50% or less of the homework.

There will be a number of unannounced quizzes during the regular lecture hours. The examinations will be closed book, closed notes and closed neighbors.

Since the two exams cover different parts of the course material, in order to pass the course, you must pass both exams. If you suspect that you will be unable to attend an exam because of a valid and verifiable excuse, you must give me prior notice, at least one full day before the exam. There will be NO make-up examinations.

Be aware that this course, like any other course, requires a certain amount of work. Specific to this course, some of the work has to be done on a computer. Simply attending the lectures is not sufficient to obtain a passing grade.

Final grade: Each homework/exam/quiz/lab/term project is worth 100 points.

The final grade will be calculated as follows:

Average of homework: 10%

Quizzes: 10%

Project: 30%

Midterm exam: 25%

Final: 25%
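As an illustration of the weighting above, here is a minimal Python sketch of how the component scores might be combined. The function name, the dictionary keys and the sample scores are hypothetical and not part of the course materials; the sketch also assumes that each component (including quizzes) has already been averaged to a single score on the 0-100 scale.

```python
# Weights copied from the breakdown above; together they account for 100%.
WEIGHTS = {
    "homework": 0.10,  # average of all homework scores
    "quizzes": 0.10,   # assumed here to be an average as well
    "project": 0.30,
    "midterm": 0.25,
    "final": 0.25,
}


def final_percentage(scores):
    """Combine component scores (each on a 0-100 scale) into one percentage."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student, for illustration only.
example = {"homework": 90, "quizzes": 80, "project": 85, "midterm": 70, "final": 88}
print(round(final_percentage(example), 2))
```

Note that this sketch covers only the weighted average; it does not model the separate requirement to pass both exams.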

The homework might involve collecting and reading research papers related to the topic, writing short essays, providing feedback on the lecture notes, etc.

The project will involve analyzing a real-world dataset. You are encouraged to use your own data. If you do not currently work with microarrays, you will be able to choose a data set made available by the instructor. The report for the project will be written in the form of a research paper. Submitting the report for publication is strongly encouraged but not compulsory. At the end of the semester, you will give a 15-20 minute presentation of the project work.

The final letter grade will be determined approximately as follows:

A: 95-100%

A-: 90-94.99%

B+: 85-89.99%

B: 80-84.99%

B-: 75-79.99%

C+: 70-74.99%

C: 66-69.99%

C-: 62-65.99%

D+: 58-61.99%

D: 54-57.99%

D-: 50-53.99%

E: less than 50%
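The table above can be read as a lookup on the lower bound of each band. The following Python sketch shows one way to express that lookup; the names are hypothetical, and since the letter grade is determined only approximately from these ranges, the result should be taken as a guide rather than a guarantee. The sketch assumes that a percentage falling between two listed bands (for example 94.995) is assigned to the lower band.

```python
# Lower bound of each band from the table above, highest first.
GRADE_BANDS = [
    (95, "A"), (90, "A-"),
    (85, "B+"), (80, "B"), (75, "B-"),
    (70, "C+"), (66, "C"), (62, "C-"),
    (58, "D+"), (54, "D"), (50, "D-"),
]


def letter_grade(percentage):
    """Return the letter for a final percentage between 0 and 100."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    # The first band whose lower bound is reached determines the letter.
    for lower_bound, letter in GRADE_BANDS:
        if percentage >= lower_bound:
            return letter
    return "E"  # anything below 50%


for value in (97, 82, 66, 49.9):
    print(value, letter_grade(value))
```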

A grade of Incomplete (I) will not be given except in very exceptional circumstances.


Student Responsibilities:

Student Responsibilities and Academic Honesty: As a college student committed to seeking a higher education, you are expected to be a very responsible person. At the least, please:

        Do your best to understand the material covered in the class and ask questions when you do not understand.

        Be aware of the homework assignments, deadlines and late assignment policy.

        Turn in your assignments in neat, readable and easily accessible form.

        Obtain notes and handouts from your classmates if you miss a class for unavoidable circumstances.

Also, we expect all of you to maintain the highest level of academic honesty. We expect each of you to do your work (assignments, lab exercises, quizzes, exams) yourself and strongly encourage you to discuss with the instructor any problems you might have with the course work. Remember, you are here to gather more knowledge and become a more educated person, not to collect grades.

In fairness to all, if we find two or more assignments which appear to be copied from each other, we will split the points evenly among all those involved (no matter who copied from whom). Repeated incidents will be dealt with through severe disciplinary action, including expulsion from the CS program.

Please behave decently in the classroom. If you have any questions or problems regarding the topic being discussed, feel free to ask your instructor at any time. Don't be shy: no question is too simple and many others might share your puzzlement. Please refrain from discussing other issues among yourselves during the class. You might be disturbing your colleagues who have the right to attend the lecture in a noise-free environment.



Instructions: Please complete the following and return to the instructor.

Name: __________________________________________________

      (Last)            (First)            (Middle)

Student ID number: _____________________

Telephone: ___________________ (Home) Can I leave a message? ________

           (Office) Can I leave a message? ________


What level are you? __________________






Have you ever used microarrays:



Do you plan to use microarrays in the next year:



Have you ever taken any course on statistics:




Why are you interested in this course:


















Please circle the number that best represents your response to each item using the following scale:

1 Strongly disagree

2 Disagree

3 Disagree somewhat

4 Neutral

5 Agree somewhat

6 Agree

7 Strongly agree

1. At the beginning of the course the overall class plan was clearly presented.

1     2     3     4     5     6     7


2. At the beginning of the course, my responsibilities as a student were made clear.

1     2     3     4     5     6     7


3. The grading procedures were clearly explained at the start of the course.

1     2     3     4     5     6     7