Data Analysis Tools for DNA Microarrays

Sorin Draghici

Publisher: Chapman & Hall/CRC Press

ISBN: 1584883154



 


 

Table of Contents
       

0.0.1

Audience and prerequisites

viii

0.0.2

Aims and contents

viii

 

 

 

 

1

Introduction

3

1.1

Bioinformatics - an emerging discipline

3

1.2

The building blocks of genomic information

5

1.3

Expression of genetic information

9

1.4

The need for microarrays

13

1.5

Summary

14

 

 

 

 

2

Microarrays

15

2.1

Microarrays - tools for gene expression analysis

15

2.2

Fabrication of microarrays

16

2.2.1

cDNA microarrays

17

2.2.2

In situ synthesis

17

2.2.3

A brief comparison of cDNA and oligonucleotide technologies

22

2.3

Applications of microarrays

22

2.4

Challenges in using microarrays in gene expression studies

23

2.5

Sources of variability

28

2.6

Summary

32

 

 

 

 

3 Image processing 33

3.1

Introduction 33

3.2

Basic elements of digital imaging 33

3.3

Microarray image processing 38

3.4

Image processing of cDNA microarrays 42

3.4.1

Spot finding 42

3.4.2

Image segmentation 43

3.4.3

Quantification 50

3.4.4

Spot quality assessment 53

3.5

Image processing of Affymetrix arrays 55

3.6

Summary 58

 

     
4 Elements of statistics 61

4.1

Introduction 61

4.2

Some basic terms 62

4.3

Elementary statistics 64

4.3.1

Measures of central tendency: mean, mode and median  64

4.3.2

Measures of variability 68

4.3.3

Some interesting data manipulations 70

4.3.4

Covariance and correlation 71

4.4

Probabilities 77

4.4.1

Computing with probabilities 80

4.5

Bayes' theorem 84

4.6

Probability distributions 86

4.6.1

Discrete random variables 87

4.6.2

Binomial distribution 89

4.6.3

Continuous random variables 94

4.6.4

The normal distribution 96

4.6.5

Using a distribution 99

4.7

Central limit theorem 102

4.8

Are replicates useful? 104

4.9

Summary 106

4.10

Solved problems 106

4.11

Exercises 107
       
5 Statistical hypothesis testing 109

 5.1

Introduction 109

5.2

The framework 109

5.3

Hypothesis testing and significance 112

5.3.1

One-tail testing 113

5.3.2

Two-tail testing 118

5.4

I do not believe God does not exist 120

5.5

An algorithm for hypothesis testing 121

5.6

Errors in hypothesis testing 122

5.7

Summary 126

5.8

Solved problems 126
       

  6

Classical approaches to data analysis 129

6.1

Introduction 129

6.2

Tests involving a single sample 130

6.2.1

Tests involving the mean. The t distribution 130

6.2.2

Choosing the number of replicates 134

6.2.3

Tests involving the variance (σ²). The chi-square distribution 136

6.2.4

Confidence intervals for standard deviation 139

6.3

Tests involving two samples 140

6.3.1

Comparing variances. The F distribution 140

6.3.2

Comparing means 144

6.3.3

Confidence intervals for the difference of means μ1 - μ2 149

6.4

Summary 150

6.5

Exercises 153
       
7 Analysis of Variance - ANOVA  155

7.1

 Introduction 155

7.1.1

  Problem definition and model assumptions 155

7.1.2

  The "dot" notation 158

7.2

  One-way ANOVA 159

7.2.1

  One-way Model I ANOVA 159

7.2.2

  One-way Model II ANOVA 166

7.3

  Two-way ANOVA 169

7.3.1

  Randomized complete block design ANOVA 170

7.3.2

 

Comparison between one-way ANOVA and randomized block design ANOVA

172

7.3.3

  Some examples 174

7.3.4

  Factorial design two-way ANOVA 178

7.3.5

   Data analysis plan for factorial design ANOVA 182

7.3.6

   Reference formulae for factorial design ANOVA 183

7.4

  Quality control 183

7.5

  Summary 186

7.6

  Exercises 187
       
8   Experiment design 189

8.1

  The concept of experiment design 189

8.2

  Comparing varieties 190

8.3

  Improving the production process 192

8.4

  Principles of experimental design 193

8.4.1

  Replication 194

8.4.2

  Randomization 196

8.4.3

  Blocking 197

8.5

  Guidelines for experimental design 198

8.6

  A short synthesis of statistical experiment designs 200

8.6.1

  The fixed effect design 200

8.6.2

  Randomized block design 201

8.6.3

  Balanced incomplete block design 201

8.6.4

  Latin square design 202

8.6.5

  Factorial design 203

8.6.6

  Confounding in the factorial design 204

8.7

  Some microarray specific experiment designs 205

8.7.1

  The Jackson Lab approach 206

8.7.2

  Ratios and flip-dye experiments 208

8.7.3

  Reference design vs. loop design 210

8.8

  Summary 213
       
9   Multiple comparisons 215

9.1

  Introduction 215

9.2

  The problem of multiple comparisons 215

9.3

  A more precise argument 220

9.4

  Corrections for multiple comparisons 222

9.4.1

  The Sidak correction 222

9.4.2

  The Bonferroni correction 223

9.4.3

  Holm's step-wise correction 224

9.4.4

  The false discovery rate (FDR) 225

9.4.5

  Permutation correction 225

9.4.6

  Significance analysis of microarrays (SAM) 227

9.4.7

  On permutations based methods 228

9.5

  Summary 229
       
10   Analysis and visualization tools 231

10.1

  Introduction 231

10.2

  Box plots 231

10.3

  Gene pies 232

10.4

  Scatter plots 233

10.4.1

  Scatter plot limitations 237

10.4.2

  Scatter plot summary 238

10.5

  Histograms 239

10.5.1

  Histograms summary 244

10.6

  Time series 245

10.7

  Principal component analysis (PCA) 246

10.7.1

  PCA limitations 257

10.7.2

  PCA summary 257

10.8

  Independent component analysis (ICA) 259

10.9

  Summary 260
       
11   Cluster analysis 263

11.1

  Introduction 263

11.2

  Distance metric 264

11.2.1

  Euclidean distance 265

11.2.2

  Manhattan distance 266

11.2.3

  Chebychev distance 268

11.2.4

  Angle between vectors 268

11.2.5

  Correlation distance 269

11.2.6

  Squared Euclidean distance 270

11.2.7

  Standardized Euclidean distance 270

11.2.8

  Mahalanobis distance 272

11.2.9

  Minkowski distance 273

11.2.10

  When to use what distance 273

11.2.11

  A comparison of various distances 275

11.3

  Clustering algorithms 276

11.3.1

  k-means clustering 281

11.3.2

  Hierarchical clustering 288

11.3.3

  Kohonen maps or self-organizing feature maps (SOFM) 297

11.4

  Summary 305
       
12   Data pre-processing and normalization 309

12.1

  Introduction 309

12.2

  General pre-processing techniques 309

12.2.1

  The log transform 309

12.2.2

  Combining replicates and eliminating outliers 311

12.2.3

  Array normalization 313

12.3

  Normalization issues specific to cDNA data 318

12.3.1

  Background correction 318

12.3.2

  Other spot level pre-processing 320

12.3.3

  Color normalization 320

12.4

  Normalization issues specific to Affymetrix data 329

12.4.1

  Background correction 329

12.4.2

  Signal calculation 330

12.4.3

  Detection calls 334

12.4.4

  Relative expression values 335

12.5

  Other approaches to the normalization of Affymetrix data 336

12.6

  Useful pre-processing and normalization sequences 336

12.7

  Summary 338

12.8

  Appendix 339

12.8.1

  A short primer on logarithms 339
       
13   Methods for selecting differentially regulated genes 341

13.1

  Introduction 341

13.2

  Criteria 342

13.3

  Fold change 343

13.3.1

  Description 343

13.3.2

  Characteristics 345

13.4

  Unusual ratio 347

13.4.1

  Description 347

13.4.2

  Characteristics 348

13.5

  Hypothesis testing, corrections for multiple comparisons and resampling 349

13.5.1

  Description 349

13.5.2

  Characteristics 350

13.6

  ANOVA 351

13.6.1

  Description 351

13.6.2

  Characteristics 351

13.7

  Noise sampling 352

13.7.1

  Description 352

13.7.2

  Characteristics 353

13.8

  Model based maximum likelihood estimation methods 354

13.8.1

  Description 354

13.8.2

  Characteristics 357

13.9

  Affymetrix comparison calls 358

13.10

  Other methods 359

13.11

  Summary 360

13.12

  Appendix 361

13.12.1

 

A comparison of the noise sampling method with the full blown ANOVA approach

361
       
14   Functional analysis and biological interpretation of microarray data 363

14.1

  Introduction 363

14.2

  The Gene Ontology 364

14.2.1

  The need for an ontology 364

14.2.2

  What is the Gene Ontology (GO)? 364

14.2.3

  What does GO contain? 365

14.2.4

  Access to GO 366

14.3

  Other related resources 367

14.4

  Translating lists of differentially regulated genes into biological knowledge 367

14.4.1

  Statistical approaches 369

14.5

  Onto-Express 372

14.5.1

  Implementation 372

14.5.2

  Graphical input interface description 373

14.5.3

  Some real data analyses 376

14.5.4

  Interpretation of the functional analysis results 381

14.6

  Summary 382
       
15   Focused microarrays - comparison and selection 383

15.1

  Introduction 383

15.2

  Criteria for array selection 385

15.3

  Onto-Compare 385

15.4

  Some comparisons 387

15.5

  Summary 391
       
16   Commercial applications 393

16.1

  Introduction 393

16.2

  Significance testing among groups using GeneSight 395

16.2.1

  Problem description 395

16.2.2

  Experiment design 396

16.2.3

  Data analysis 396

16.2.4

  Conclusion 407

16.3

 

Statistical analysis of microarray data using S-PLUS and Insightful ArrayAnalyzer

409

16.3.1

  Experiment design 410

16.3.2

  Data preparation and exploratory data analysis 410

16.3.3

  Differential expression analysis 410

16.3.4

  Clustering and prediction 411

16.3.5

  Analysis summaries, visualization and annotation of results 411

16.3.6

  S+ArrayAnalyzer example: Swirl Zebrafish experiment 412

16.3.7

  Summary 415

16.4

  SAS software for genomics 416

16.4.1

  SAS research data management 416

16.4.2

  SAS microarray solution 418

16.5

  Spotfire's DecisionSite 421

16.5.1

   Introduction 421

16.5.2

   Experiment description 421

16.5.3

  Microarray data access 422

16.5.4

  Data transformation 423

16.5.5

  Filtering and visualizing gene expression data 424

16.5.6

  Finding gene expression patterns 427

16.5.7

 

Using clustering and data reduction techniques to isolate group of genes

428

16.5.8

  Comparing sample groups 431

16.5.9

  Using Portfolio Lists to isolate significant genes 432

16.5.10

  Summary 434

16.6

   Summary 436
       
17   The road ahead 437

17.1

What next? 437

17.2

  Molecular diagnosis 437

17.3

Gene regulatory networks 439

17.4

  Conclusions 441
       
References     443
       

 

Companion CD Contents (available with the book)
1 GeneSight (BioDiscovery) - software suite for microarray data analysis
2 ImaGene (BioDiscovery) - software for image processing of microarray slides
3 S-PLUS (Insightful) - software suite for statistical analysis
4 ArrayAnalyzer (Insightful) - software suite for microarray data analysis

 

Dr. Sorin Draghici
Intelligent Systems and Bioinformatics Laboratory Home Page

 


copyright 2003 Intelligent Systems and Bioinformatics Laboratory, Computer Science Department, Wayne State University