PHARMACEUTICAL
DATA MINING


Wiley Series On Technologies for the Pharmaceutical Industry

Sean Ekins, Series Editor
Editorial Advisory Board
Dr. Renee Arnold (ACT LLC, USA); Dr. David D. Christ (SNC Partners LLC,
USA); Dr. Michael J. Curtis (Rayne Institute, St Thomas’ Hospital, UK);
Dr. James H. Harwood (Pfizer, USA); Dr. Dale Johnson (Emiliem, USA);
Dr. Mark Murcko, (Vertex, USA); Dr. Peter W. Swaan (University of Maryland,
USA); Dr. David Wild (Indiana University, USA); Prof. William Welsh (Robert
Wood Johnson Medical School University of Medicine & Dentistry of New Jersey,
USA); Prof. Tsuguchika Kaminuma (Tokyo Medical and Dental University, Japan);
Dr. Maggie A.Z. Hupcey (PA Consulting, USA); Dr. Ana Szarfman
(FDA, USA)

Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental
Chemicals
Edited by Sean Ekins
Pharmaceutical Applications of Raman Spectroscopy
Edited by Slobodan Šašić
Pathway Analysis for Drug Discovery: Computational Infrastructure and
Applications
Edited by Anton Yuryev
Drug Efficacy, Safety, and Biologics Discovery: Emerging Technologies and Tools
Edited by Sean Ekins and Jinghai J. Xu
The Engines of Hippocrates: From the Dawn of Medicine to Medical and
Pharmaceutical Informatics
Barry Robson and O.K. Baek
Pharmaceutical Data Mining: Approaches and Applications for Drug Discovery
Edited by Konstantin V. Balakin


PHARMACEUTICAL
DATA MINING
Approaches and Applications for
Drug Discovery
Edited by

KONSTANTIN V. BALAKIN
Institute of Physiologically Active Compounds
Russian Academy of Sciences

A JOHN WILEY & SONS, INC., PUBLICATION


Copyright © 2010 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means, electronic, mechanical, photocopying, recording, scanning, or
otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright
Act, without either the prior written permission of the Publisher, or authorization through
payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222
Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at
www.copyright.com. Requests to the Publisher for permission should be addressed to the
Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030,
(201) 748-6011, fax (201) 748-6008, or online.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best
efforts in preparing this book, they make no representations or warranties with respect to the
accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created
or extended by sales representatives or written sales materials. The advice and strategies
contained herein may not be suitable for your situation. You should consult with a professional
where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any
other commercial damages, including but not limited to special, incidental, consequential, or
other damages.
For general information on our other products and services or for technical support, please
contact our Customer Care Department within the United States at (800) 762-2974, outside the
United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic formats. For more information about Wiley products,
visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Pharmaceutical data mining : approaches and applications for drug discovery / [edited by]
Konstantin V. Balakin.
p. ; cm.
Includes bibliographical references and index.
ISBN 978-0-470-19608-3 (cloth)
1. Pharmacology. 2. Data mining. 3. Computational biology. I. Balakin, Konstantin V.
[DNLM: 1. Drug Discovery–methods. 2. Computational Biology. 3. Data
Interpretation, Statistical. QV 744 P5344 2010]
RM300.P475 2010
615′.1–dc22
2009026523
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1



CONTENTS

PREFACE
ACKNOWLEDGMENTS
CONTRIBUTORS

PART I  DATA MINING IN THE PHARMACEUTICAL INDUSTRY: A GENERAL OVERVIEW

1  A History of the Development of Data Mining in Pharmaceutical Research
   David J. Livingstone and John Bradshaw
2  Drug Gold and Data Dragons: Myths and Realities of Data Mining in the Pharmaceutical Industry
   Barry Robson and Andy Vaithiligam
3  Application of Data Mining Algorithms in Pharmaceutical Research and Development
   Konstantin V. Balakin and Nikolay P. Savchuk

PART II  CHEMOINFORMATICS-BASED APPLICATIONS

4  Data Mining Approaches for Compound Selection and Iterative Screening
   Martin Vogt and Jürgen Bajorath
5  Prediction of Toxic Effects of Pharmaceutical Agents
   Andreas Maunz and Christoph Helma
6  Chemogenomics-Based Design of GPCR-Targeted Libraries Using Data Mining Techniques
   Konstantin V. Balakin and Elena V. Bovina
7  Mining High-Throughput Screening Data by Novel Knowledge-Based Optimization Analysis
   S. Frank Yan, Frederick J. King, Sumit K. Chanda, Jeremy S. Caldwell,
   Elizabeth A. Winzeler, and Yingyao Zhou

PART III  BIOINFORMATICS-BASED APPLICATIONS

8  Mining DNA Microarray Gene Expression Data
   Paolo Magni
9  Bioinformatics Approaches for Analysis of Protein–Ligand Interactions
   Munazah Andrabi, Chioko Nagao, Kenji Mizuguchi, and Shandar Ahmad
10 Analysis of Toxicogenomic Databases
   Lyle D. Burgoon
11 Bridging the Pharmaceutical Shortfall: Informatics Approaches to the Discovery of Vaccines, Antigens, Epitopes, and Adjuvants
   Matthew N. Davies and Darren R. Flower

PART IV  DATA MINING METHODS IN CLINICAL DEVELOPMENT

12 Data Mining in Pharmacovigilance
   Manfred Hauben and Andrew Bate
13 Data Mining Methods as Tools for Predicting Individual Drug Response
   Audrey Sabbagh and Pierre Darlu
14 Data Mining Methods in Pharmaceutical Formulation
   Raymond C. Rowe and Elizabeth A Colbourn

PART V  DATA MINING ALGORITHMS AND TECHNOLOGIES

15 Dimensionality Reduction Techniques for Pharmaceutical Data Mining
   Igor V. Pletnev, Yan A. Ivanenkov, and Alexey V. Tarasov
16 Advanced Artificial Intelligence Methods Used in the Design of Pharmaceutical Agents
   Yan A. Ivanenkov and Ludmila M. Khandarova
17 Databases for Chemical and Biological Information
   Tudor I. Oprea, Liliana Ostopovici-Halip, and Ramona Rad-Curpan
18 Mining Chemical Structural Information from the Literature
   Debra L. Banville

INDEX



PREFACE

Pharmaceutical drug discovery and development have historically followed a
sequential process in which relatively small numbers of individual compounds
were synthesized and tested for bioactivity. The information obtained from
such experiments was then used for optimization of lead compounds and their
further progression to drugs. For many years, an expert equipped with the
simple statistical techniques of data analysis was a central figure in the analysis
of pharmacological information. With the advent of advanced genome and
proteome technologies, as well as high-throughput synthesis and combinatorial screening, such operations have been largely replaced by a massive parallel mode of processing, in which large-scale arrays of multivariate data are
analyzed. The principal challenges are the multidimensionality of such data
and the effect of “combinatorial explosion.” Many interacting chemical,
genomic, proteomic, clinical, and other factors cannot be further considered
on the basis of simple statistical techniques. As a result, the effective analysis
of this information-rich space has become an emerging problem. Hence, there
is much current interest in novel computational data mining approaches that
may be applied to the management and utilization of the knowledge obtained
from such information-rich data sets. It can be simply stated that, in the era
of post-genomic drug development, extracting knowledge from chemical, biological, and clinical data is one of the biggest problems. Over the past few
years, various computational concepts and methods have been introduced to
extract relevant information from the accumulated knowledge of chemists,
biologists, and clinicians and to create a robust basis for rational design of
novel pharmaceutical agents.
Reflecting these needs, the present volume brings together contributions
from academic and industrial scientists to address both the implementation of
new data mining technologies in the pharmaceutical industry and the challenges they currently face in their application. The key question to be answered
by these experts is how sophisticated computational data mining techniques can impact contemporary drug discovery and development.
In reviewing specialized books and other literature sources that address
areas relevant to data mining in pharmaceutical research, it is evident that
highly specialized tools are now available, but it has not become easier for
scientists to select the appropriate method for a particular task. Therefore,
our primary goal is to provide, in a single volume, an accessible, concentrated,
and comprehensive collection of individual chapters that discuss the most
important issues related to pharmaceutical data mining, their role, and their possibilities in contemporary drug discovery and development. The book
should be accessible to nonspecialized readers with emphasis on practical
application rather than on in-depth theoretical issues.
The book covers some important theoretical and practical aspects of pharmaceutical data mining within five main sections:

• a general overview of the discipline, from its foundations to contemporary
  industrial applications and impact on the current and future drug
  discovery;
• chemoinformatics-based applications, including selection of chemical
  libraries for synthesis and screening, early evaluation of ADME/Tox and
  physicochemical properties, mining high-throughput screening data, and
  employment of chemogenomics-based approaches;
• bioinformatics-based applications, including mining the gene expression
  data, analysis of protein–ligand interactions, analysis of toxicogenomic
  databases, and vaccine development;
• data mining methods in clinical development, including data mining in
  pharmacovigilance, predicting individual drug response, and data mining
  methods in pharmaceutical formulation;
• data mining algorithms, technologies, and software tools, with emphasis
  on advanced data mining algorithms and software tools that are currently
  used in the industry or represent promising approaches for future drug
  discovery and development, and analysis of resources available in special
  databases, on the Internet and in scientific literature.

It is my sincere hope that this volume will be helpful and interesting not
only to specialists in data mining but also to all scientists working in the field
of drug discovery and development and associated industries.
Konstantin V. Balakin


ACKNOWLEDGMENTS
I am extremely grateful to Prof. Sean Ekins for his invitation to write the book
on pharmaceutical data mining and for his invaluable friendly help during the
last years and in all stages of this work. I also express my sincere gratitude to
Jonathan Rose at John Wiley & Sons for his patience, editorial assistance, and
timely pressure to prepare this book on time. I want to acknowledge all the
contributors whose talent, enthusiasm, and insights made this book possible.
My interest in data mining approaches for drug design and development
was encouraged nearly a decade ago while at ChemDiv, Inc. by Dr. Sergey E.
Tkachenko, Prof. Alexandre V. Ivashchenko, Dr. Andrey A. Ivashchenko,
and Dr. Nikolay P. Savchuk. Collaborations with colleagues in both industry
and academia since are also acknowledged. My anonymous proposal reviewers are thanked for their valuable suggestions, which helped expand the scope
of the book beyond my initial outline. I would also like to acknowledge Elena
V. Bovina for technical help.
I dedicate this book to my family and to my wife.




CONTRIBUTORS

Shandar Ahmad, National Institute of Biomedical Innovation, 7-6-8, Saitoasagi, Ibaraki-shi, Osaka 5670085, Japan; Email:
Munazah Andrabi, National Institute of Biomedical Innovation, Ibaraki-shi,
Osaka, Japan; Email:
Jürgen Bajorath, Department of Life Science Informatics, B-IT, LIMES
Program Unit Chemical Biology & Medicinal Chemistry, Rheinische
Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany;
Email:
Konstantin V. Balakin, Institute of Physiologically Active Compounds of
Russian Academy of Sciences, Severny proezd, 1, 142432 Chernogolovka,
Moscow region, Russia; Nonprofit partnership «Orchemed», 12/1,
Krasnoprudnaya ul., 107140 Moscow, Russia; Email: ,

Debra L. Banville, AstraZeneca Pharmaceuticals, Discovery Information,
1800 Concord Pike, Wilmington, Delaware 19850; Email: debra.banville@astrazeneca.com
Andrew Bate, Risk Management Strategy, Pfizer Inc., New York, New York
10017, USA; Department of Medicine, New York University School of
Medicine, New York, NY, USA; Departments of Pharmacology and
Community and Preventive Medicine, New York Medical College, Valhalla,
NY, USA; Email:

Elena V. Bovina, Institute of Physiologically Active Compounds of Russian
Academy of Sciences, Severny proezd, 1, 142432 Chernogolovka, Moscow
region, Russia; Email:
John Bradshaw, Formerly with Daylight CIS Inc, Sheraton House, Cambridge
UK CB3 0AX, UK.
Lyle D. Burgoon, Toxicogenomic Informatics and Solutions, LLC, Lansing,
MI USA, P.O. Box 27482, Lansing, MI 48909; Email:
Jeremy S. Caldwell, Genomics Institute of the Novartis Research Foundation,
10675 John Jay Hopkins Drive, San Diego, CA 92121, USA.
Sumit K. Chanda, Infectious and Inflammatory Disease Center, Burnham
Institute for Medical Research, La Jolla, CA 92037, USA; Email: schanda@burnham.org
Elizabeth A Colbourn, Intelligensys Ltd., Springboard Business Centre,
Stokesley Business Park, Stokesley, North Yorkshire, UK; Email:

Ramona Rad-Curpan, Division of Biocomputing, MSC11 6145, University of
New Mexico School of Medicine, University of New Mexico, Albuquerque
NM 87131-0001, USA.
Pierre Darlu, INSERM U535, Génétique épidémiologique et structure des
populations humaines, Hôpital Paul Brousse, B.P. 1000, 94817 Villejuif
Cedex, France; Univ Paris-Sud, UMR-S535, Villejuif, F-94817, France;
Email:
Matthew N. Davies, The Jenner Institute, University of Oxford, High Street,
Compton, Berkshire, RG20 7NN, UK; Email:
Darren R. Flower, The Jenner Institute, University of Oxford, High Street,
Compton, Berkshire, RG20 7NN, UK.
Manfred Hauben, Risk Management Strategy, Pfizer Inc., New York, New
York 10017, USA; Department of Medicine, New York University School
of Medicine, New York, NY, USA; Departments of Pharmacology and
Community and Preventive Medicine, New York Medical College, Valhalla,
NY, USA; Email: manfred.hauben@Pfizer.com
Christoph Helma, Freiburg Center for Data Analysis and Modelling (FDM),
Hermann-Herder-Str. 3a, 79104 Freiburg, Germany; In silico toxicology,
Talstr. 20, 79102 Freiburg, Germany; Email:



Yan A. Ivanenkov, Chemical Diversity Research Institute (IIHR), 141401,
Rabochaya Str. 2-a, Khimki, Moscow region, Russia; Institute of
Physiologically Active Compounds of Russian Academy of Sciences,
Severny proezd, 1, 142432 Chernogolovka, Moscow region, Russia; Email:

Ludmila M. Khandarova, InformaGenesis Ltd., 12/1, Krasnoprudnaya ul.,
107140 Moscow, Russia; Email:
Frederick J. King, Genomics Institute of the Novartis Research Foundation,
10675 John Jay Hopkins Drive, San Diego, CA 92121, USA; Novartis
Institutes for BioMedical Research, Cambridge, MA 02139, USA.
David J. Livingstone, ChemQuest, Isle of Wight, UK; Centre for Molecular
Design, University of Portsmouth, Portsmouth, UK; Email: davel@chemquestuk.com
Paolo Magni, Dipartimento di Informatica e Sistemistica, Università degli
Studi di Pavia, Via Ferrata 1, I-27100 Pavia, Italy; Email: paolo.magni@unipv.it
Andreas Maunz, Freiburg Center for Data Analysis and Modelling (FDM),
Hermann-Herder-Str. 3a, 79104 Freiburg, Germany; Email: andreas@maunz.de
Kenji Mizuguchi, National Institute of Biomedical Innovation, 7-6-8, Saitoasagi, Ibaraki-shi, Osaka 5670085, Japan; Email:
ne.jp
Chioko Nagao, National Institute of Biomedical Innovation, 7-6-8, Saitoasagi, Ibaraki-shi, Osaka 5670085, Japan.
Tudor I. Oprea, Division of Biocomputing, MSC11 6145, University of
New Mexico School of Medicine, University of New Mexico, Albuquerque
NM 87131-0001, USA; Sunset Molecular Discovery LLC, 1704 B Llano
Street, S-te 140, Santa Fe NM 87505-5140, USA; Email:
edu
Liliana Ostopovici-Halip, Division of Biocomputing, MSC11 6145, University of New Mexico School of Medicine, University of New Mexico,
Albuquerque NM 87131-0001, USA.
Igor V. Pletnev, Department of Chemistry, M.V.Lomonosov Moscow State
University, Leninskie Gory 1, 119992 GSP-3 Moscow, Russia; Email:




Barry Robson, Global Pharmaceutical and Life Sciences 294 Route 100,
Somers, NY 10589; The Dirac Foundation, Everyman Legal, No. 1G,
Network Point, Range Road, Witney, Oxfordshire, OX29 0YN; Email:

Raymond C. Rowe, Intelligensys Ltd., Springboard Business Centre,
Stokesley Business Park, Stokesley, North Yorkshire, UK; Email: rowe@intelligensys.co.uk
Audrey Sabbagh, INSERM UMR745, Université Paris Descartes, Faculté des
Sciences Pharmaceutiques et Biologiques, 4 Avenue de l’Observatoire,
75270 Paris Cedex 06, France; Biochemistry and Molecular Genetics
Department, Beaujon Hospital, 100 Boulevard Général Leclerc, 92110
CLICHY Cedex, France; Email:
Alexey V. Tarasov, InformaGenesis Ltd., 12/1, Krasnoprudnaya ul., 107140
Moscow, Russia; Email:
Andy Vaithiligam, St. Matthews University School of Medicine, Safehaven,
Leeward Three, Grand Cayman Island.
Martin Vogt, Department of Life Science Informatics, B-IT, LIMES Program
Unit Chemical Biology & Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany; Email:

Elizabeth A. Winzeler, Genomics Institute of the Novartis Research
Foundation, San Diego, California and The Department of Cell Biology,
The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla,
California 92037, USA; Email:
S. Frank Yan,
Yingyao Zhou, Genomics Institute of the Novartis Research Foundation,
10675 John Jay Hopkins Drive, San Diego, California 92121, USA; Email:




PART I
DATA MINING IN THE
PHARMACEUTICAL INDUSTRY:
A GENERAL OVERVIEW



1
A HISTORY OF THE
DEVELOPMENT OF DATA MINING
IN PHARMACEUTICAL RESEARCH
David J. Livingstone and John Bradshaw
Table of Contents
1.1 Introduction
1.2 Technology
1.3 Computers
    1.3.1 Mainframes
    1.3.2 General-Purpose Computers
    1.3.3 Graphic Workstations
    1.3.4 PCs
1.4 Data Storage and Manipulation
1.5 Molecular Modeling
1.6 Characterizing Molecules and QSAR
1.7 Drawing and Storing Chemical Structures
    1.7.1 Line Notations
1.8 Databases
1.9 Libraries and Information
1.10 Summary
References

Pharmaceutical Data Mining: Approaches and Applications for Drug Discovery,
Edited by Konstantin V. Balakin
Copyright © 2010 John Wiley & Sons, Inc.

1.1 INTRODUCTION

From the earliest times, chemistry has been a classification science. For
example, even in the days when it was emerging from alchemy, substances
were put into classes such as “metals.” This “metal” class contained things
such as iron, copper, silver, and gold but also mercury, which, even though it
was liquid, still had enough properties in common with the other members of
its class to be included. In other words, scientists were grouping together
things that were related or similar but were not necessarily identical, all important elements of the subject of this book: data mining. In today’s terminology,
there was an underlying data model that allowed data about the substances
to be recorded, stored, and analyzed, and conclusions to be drawn. What is remarkable
in chemistry is not only that the data have survived more than two centuries
in a usable way but also that the data have continued to leverage contemporary
technologies for their storage and analysis.
In the early 19th century, Berzelius was successful in persuading chemists
to use alphabetic symbols for the elements: “The chemical signs ought to be
letters, for the greater facility of writing, and not to disfigure a printed book”
[1]. This Berzelian system [2] was appropriate for the contemporary storage
and communication medium, i.e., paper, and the related recording technology,
i.e., manuscript or print.
One other thing that sets chemical data apart from other data is the need
to store and to search the compound structure. These structural formulas are
much more than just pictures; they have such power that “the structural
formula of, say, p-rosaniline represents the same substance to Robert B.
Woodward say, in 1979 as it did to Emil Fischer in 1879” [3]. As with the
element symbols, the methods and conventions for drawing chemical structures were agreed at an international level. This meant that chemists could
record the nature of their work and communicate it accurately to each other.
As technologies moved on and volumes of data grew, chemists would need
to borrow methodology from other disciplines. Initially, systematic naming of
compounds allowed indexing methods, which had been developed for text
handling and were appropriate for punch card sorting, to deal with the explosion of known structures. Later, graph theory was used to handle
structures directly in computers. Without these basic methodologies to store
the data, data mining would be impossible.
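
To make the graph-theoretic idea concrete, the following is a minimal sketch, not the method of any system described in this chapter. It treats atoms as nodes and bonds as edges, then answers a simple structural query. The class name `Molecule`, the helper `bonded_pairs`, and the heavy-atom-only encoding of ethanol are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    """A molecule as an undirected graph: atoms are nodes, bonds are edges."""
    atoms: list                                  # element symbol per atom index
    bonds: dict = field(default_factory=dict)    # atom index -> set of neighbor indices

    def add_bond(self, i, j):
        # Store the bond in both directions because the graph is undirected.
        self.bonds.setdefault(i, set()).add(j)
        self.bonds.setdefault(j, set()).add(i)


def bonded_pairs(mol, elem_a, elem_b):
    """Return sorted (i, j) index pairs where an elem_a atom is bonded to an elem_b atom."""
    return sorted(
        (i, j)
        for i, neighbors in mol.bonds.items()
        for j in neighbors
        if mol.atoms[i] == elem_a and mol.atoms[j] == elem_b
    )


# Ethanol, heavy atoms only (hydrogens omitted): C0 - C1 - O2
ethanol = Molecule(atoms=["C", "C", "O"])
ethanol.add_bond(0, 1)
ethanol.add_bond(1, 2)

carbon_oxygen = bonded_pairs(ethanol, "C", "O")
print(carbon_oxygen)
```

Once a structure is held in this form, searching for a structural feature becomes a graph query rather than a text match on a compound name, which is the shift the paragraph above describes.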
The rest of this chapter represents the authors’ personal experiences in the
development of chemistry data mining technologies since the early 1970s.
1.2 TECHNOLOGY
When we began our careers in pharmaceutical research, there were no computers in the laboratories. Indeed, there was only one computer in the company
and that was dedicated to calculating the payroll! Well, this is perhaps a slight
exaggeration. A Digital Equipment Corporation (DEC) PDP-8 running in-house
regression software was available to one of us and the corporate mainframes were accessible via teleprinter terminals, although there was little
useful scientific software running on them.
This was a very different world to the situation we have today. Documents
were typed by a secretary using a typewriter, perhaps one of the new electric
golf ball typewriters. There was no e-mail; communication was delivered by
post, and there was certainly no World Wide Web. Data were stored on sheets
of paper or, perhaps, punched cards (see later), and molecular models were
constructed by hand from kits of plastic balls. Compounds were characterized
for quantitative structure–activity relationship (QSAR) studies by using
lookup tables of substituent constants, and if an entry was missing, it could
only be replaced by measurement. Mathematical modeling consisted almost
entirely of multiple linear regression (MLR) analysis, often using self-written
software as already mentioned.
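
As an illustration of the workflow just described (look up substituent constants, then fit a multiple linear regression), here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the function name `fit_mlr`, the choice of two descriptors (a hydrophobic constant pi and an electronic constant sigma), the table values, which are illustrative placeholders rather than data from this chapter, and the activities, which are synthetic and generated from a known linear model so that the fit can be checked.

```python
def fit_mlr(rows, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    rows: list of descriptor tuples, one per compound
    y:    list of observed activities
    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    X = [[1.0] + list(r) for r in rows]          # prepend the intercept column
    n = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)] for i in range(n)]
    b = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(n)]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * c for a, c in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]


# Hypothetical lookup table: substituent -> (pi, sigma). Placeholder values.
SUBSTITUENTS = {
    "H": (0.00, 0.00),
    "CH3": (0.56, -0.17),
    "Cl": (0.71, 0.23),
    "NO2": (-0.28, 0.78),
    "OCH3": (-0.02, -0.27),
}

descriptors = list(SUBSTITUENTS.values())
# Synthetic activities from a known model: activity = 1.0 + 0.5*pi - 2.0*sigma
activities = [1.0 + 0.5 * pi - 2.0 * sigma for pi, sigma in descriptors]

coeffs = fit_mlr(descriptors, activities)
print([round(c, 6) for c in coeffs])
```

Because the activities are generated without noise, the fit should recover the intercept and both coefficients of the generating model. With measured data the same procedure gives least-squares estimates, and a missing table entry would indeed block the calculation until the constant was measured.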
So, how did we get to where we are today? Some of the necessary elements were already in existence but were simply employed in a different
environment; statistical software such as BMDP, for example, was widely
used by academics. Other functionalities, however, had to be created. This
chapter traces the development of some of the more important components
of the systems that are necessary in order for data mining to be carried out
at all.

1.3 COMPUTERS

The major piece of technology underlying data mining is, of course, the computer. Other items of technology, both hardware and software, are of course
important and are covered in their appropriate sections, but the huge advances
in our ability to mine data have gone hand in hand with the development of
computers. These machines can be split into four main types: mainframes,
general-purpose computers, graphic workstations, and personal computers
(PCs).
1.3.1 Mainframes

These machines are characterized by a computer room or a suite of rooms
with a staff of specialists who serve the needs of the machine. Mainframe
computers were expensive, involving considerable investment in resource,
and there was thus a requirement for a computing department or even division within the organizational structure of the company. As computing
became available within the laboratories, a conflict of interest was perceived
between the computing specialists and the research departments with competition for budgets, human resources, space, and so on. As is inevitable
in such situations, there were sometimes “political” difficulties involved in
the acquisition of both hardware and software by the research functions.
Mainframe computers served some useful functions in the early days of data
mining. At that time, computing power was limited compared with the
requirements of programs such as ab initio and even semi-empirical quantum
chemistry packages, and thus the company mainframe was often employed
for these calculations, which could often run for weeks. As corporate
databases began to be built, the mainframe was an ideal home for them since
this machine was accessible company-wide, a useful feature when the organization had multiple sites, and was professionally maintained with scheduled
backups, and so on.
1.3.2 General-Purpose Computers

DEC produced the first retail computers in the 1960s. The PDP-1 (PDP stood
for Programmed Data Processor) sold for $120,000 when other computers
cost over a million. The PDP-8 was the least expensive general-purpose computer on the market [4] in the mid-1960s, and this was at a time when all the
other computer manufacturers leased their machines. The PDP-8 was also a
desktop machine so it did not require a dedicated computing facility with
support staff and so on. Thus, it was the ideal laboratory computer. The PDP
range was superseded by DEC’s VAX machines and these were also very
important, but the next major step was the development of PCs.
1.3.3 Graphic Workstations

The early molecular modeling programs required some form of graphic display
for their output. An example of this is the DEC GT40, which was a monochrome display incorporating some local processing power, actually a PDP-11
minicomputer. A GT40 could only display static images and was usually connected to a more powerful computer, or at least one with more memory, on
which the modeling programs ran. An alternative lower-cost approach was the
development of “dumb” graphic displays such as the Tektronix range of
devices. These were initially also monochrome displays, but color terminals
such as the Tek 4015 were soon developed and with their relatively low cost
allowed much wider access to molecular modeling systems. Where molecular
modeling was made generally available within a company, usually using inhouse software, this was most often achieved with such terminals.
These devices were unsuitable, however, for displaying complicated systems
such as portions of proteins or for animations. Dedicated graphic workstations, such as the Evans and Sutherland (E&S) picture systems, were the first
workstations used to display the results of modeling macromolecules. These
were expensive devices and thus were limited to the slowly evolving computational chemistry groups within the companies. E&S workstations soon faced
competition from other companies such as Sun and, in particular, Silicon
Graphics International Corporation (SGI). As prices came down and computing performance went up, following Moore’s law, the SGI workstation became
the industry standard for molecular modeling and found its way into the
chemistry departments where medicinal chemists could then do their own
molecular modeling. These days, of course, modeling is increasingly being
carried out using PCs.
1.3.4 PCs

IBM PCs or Apple Macintoshes gradually began to replace dumb terminals
in the laboratories. These would usually run some terminal emulation software
so that they could still be used to communicate with the large corporate computers but would also have some local processing capability and, perhaps, an
attached printer. At first, the local processing would be very limited, but this
soon changed with both the increasing sophistication of “office” suites and the
usual increasing performance/decreasing price evolution of computers in
general. Word processing on a PC was a particularly desirable feature because
the alternative, a word processing program running on a DEC VAX (MASS-11),
was nearly a WYSIWYG (what you see is what you get) word processor, but
not quite! These days, the PC allows almost any kind of computing job to be
carried out.
This has necessarily been a very incomplete and sketchy description of the
application of computers in pharmaceutical research. For a detailed discussion, see the chapter by Boyd and Marsh [5].

1.4 DATA STORAGE AND MANIPULATION

Information on compounds such as structure, salt, melting point, molecular
weight, and so on, was filed on paper sheets. These were labeled numerically
and were often sorted by year of first synthesis and would be stored as a complete collection in a number of locations. The data sheets were also microfilmed as a backup, and this provided a relatively fast way of searching the
corporate compound collection for molecules with specific structural features
or for analogues of compounds of interest. Another piece of information
entered on the data sheets was an alphanumeric code called the Wiswesser
line notation (WLN), which provided a means of encoding the structure of
the compound in a short and simple string, which later, of course, could be
used to represent the compound in a computer record. WLN is discussed
further in a later section.

Experimental data, such as the results of compound screening, were stored
in laboratory notebooks and then were collated into data tables and eventually
reports. Individual projects sometimes used a system of edge-notched cards
to store both compound and experimental information. Figure 1.1 shows one
of these edge-notched cards.
Edge-notched cards were sets of printed cards with usually handwritten
information. Along the edge were a series of holes, which could be clipped to

