Published by Woodhead Publishing Limited, 2012
Open source software in
life science research
Woodhead Publishing Series in Biomedicine
1 Practical leadership for biopharmaceutical executives
J. Y. Chin
2 Outsourcing biopharma R&D to India
P. R. Chowdhury
3 Matlab® in bioscience and biotechnology
L. Burstein
4 Allergens and respiratory pollutants
Edited by M. A. Williams
5 Concepts and techniques in genomics and proteomics
N. Saraswathy and P. Ramalingam
6 An introduction to pharmaceutical sciences
J. Roy
7 Patently innovative: How pharmaceutical firms use emerging patent law to
extend monopolies on blockbuster drugs
R. A. Bouchard
8 Therapeutic protein drug products: Practical approaches to formulation in
the laboratory, manufacturing and the clinic
Edited by B. K. Meyer
9 A biotech manager’s handbook: A practical guide
Edited by M. O’Neill and M. H. Hopkins
10 Clinical research in Asia: Opportunities and challenges
U. Sahoo
11 Therapeutic antibody engineering: Current and future advances driving the
strongest growth area in the pharma industry
W. R. Strohl and L. M. Strohl
12 Commercialising the stem cell sciences
O. Harvey
13
14 Human papillomavirus infections: From the laboratory to clinical practice
F. Cobo
15 Annotating new genes: From in silico to validations by experiments
S. Uchida
16 Open-source software in life science research: Practical solutions in the
pharmaceutical industry and beyond
Edited by L. Harland and M. Forster
17 Nanoparticulate drug delivery: A perspective on the transition from
laboratory to market
V. Patravale, P. Dandekar and R. Jain
18 Bacterial cellular metabolic systems: Metabolic regulation of a cell system
with 13C-metabolic flux analysis
K. Shimizu
19 Contract research and manufacturing services (CRAMS) in India: The
business, legal, regulatory and tax environment
M. Antani and G. Gokhale
20 Bioinformatics for biomedical science and clinical applications
K-H. Liang
21 Deterministic versus stochastic modelling in biochemistry and systems
biology
P. Lecca, I. Laurenzi and F. Jordan
22 Protein folding in silico: Protein folding versus protein structure prediction
I. Roterman
23 Computer-aided vaccine design
T. J. Chuan and S. Ranganathan
24 An introduction to biotechnology
W. T. Godbey
25 RNA interference: Therapeutic developments
T. Novobrantseva, P. Ge and G. Hinkle
26 Patent litigation in the pharmaceutical and biotechnology industries
G. Morgan
27 Clinical research in paediatric psychopharmacology: A practical guide
P. Auby
28 The application of SPC in the pharmaceutical and biotechnology industries
T. Cochrane
29 Ultrafiltration for bioprocessing
H. Lutz
30 Therapeutic risk management of medicines
A. K. Banerjee and S. Mayall
31 21st century quality management and good management practices: Value
added compliance for the pharmaceutical and biotechnology industry
S. Williams
32
33 CAPA in the pharmaceutical and biotech industries
J. Rodriguez
34 Process validation for the production of biopharmaceuticals: Principles and
best practice
A. R. Newcombe and P. Thillaivinayagalingam
35 Clinical trial management: An overview
U. Sahoo and D. Sawant
36 Impact of regulation on drug development
H. Guenter Hennings
37 Lean biomanufacturing
N. J. Smart
38 Marine enzymes for biocatalysis
Edited by A. Trincone
39 Ocular transporters and receptors in the eye: Their role in drug delivery
A. K. Mitra
40 Stem cell bioprocessing: For cellular therapy, diagnostics and drug
development
T. G. Fernandes, M. M. Diogo and J. M. S. Cabral
41
42 Fed-batch fermentation: A practical guide to scalable recombinant protein
production in Escherichia coli
G. G. Moulton and T. Vedvick
43 The funding of biopharmaceutical research and development
D. R. Williams
44 Formulation tools for pharmaceutical development
Edited by J. E. A. Diaz
45 Drug-biomembrane interaction studies: The application of calorimetric
techniques
R. Pignatello
46 Orphan drugs: Understanding the rare drugs market
E. Hernberg-Ståhl
47 Nanoparticle-based approaches to targeting drugs for severe diseases
J. L. A. Mediano
48 Successful biopharmaceutical operations
C. Driscoll
49 Electroporation-based therapies for cancer
Edited by R. Sundarajan
50 Transporters in drug discovery and development
Y. Lai
51 The life-cycle of pharmaceuticals in the environment
R. Braund and B. Peake
52 Computer-aided applications in pharmaceutical technology
Edited by J. Petrovi
53 From plant genomics to plant biotechnology
Edited by P. Poltronieri, N. Burbulis and C. Fogher
54 Bioprocess engineering: An introductory engineering and life science
approach
K. G. Clarke
55 Quality assurance problem solving and training strategies for success in the
pharmaceutical and life science industries
G. Welty
56 Nanomedicine: Prognostic and curative approaches to cancer
K. Scarberry
57 Gene therapy: Potential applications of nanotechnology
S. Nimesh
58 Controlled drug delivery: The role of self-assembling multi-task excipients
M. Mateescu
59 In silico protein design
C. M. Frenz
60 Bioinformatics for computer science: Foundations in modern biology
K. Revett
61 Gene expression analysis in the RNA world
J. Q. Clement
62 Computational methods for finding inferential bases in molecular genetics
Q-N. Tran
63 NMR metabolomics in cancer research
M. Čuperlović-Culf
64 Virtual worlds for medical education, training and care delivery
K. Kahol
Woodhead Publishing Series in Biomedicine: Number 16
Open source software in
life science research
Practical solutions in the
pharmaceutical industry
and beyond
Edited by
Lee Harland and Mark Forster
Woodhead Publishing Limited, 80 High Street, Sawston, Cambridge, CB22 3HJ, UK
www.woodheadpublishing.com
www.woodheadpublishingonline.com
Woodhead Publishing, 1518 Walnut Street, Suite 1100, Philadelphia, PA 19102–3406, USA
Woodhead Publishing India Private Limited, G-2, Vardaan House, 7/28 Ansari Road,
Daryaganj, New Delhi – 110002, India
www.woodheadpublishingindia.com
First published in 2012 by Woodhead Publishing Limited
ISBN: 978-1-907568-97-8 (print); ISBN: 978-1-908818-24-9 (online)
Woodhead Publishing Series in Biomedicine ISSN: 2050-0289 (print); ISSN: 2050-0297 (online)
© The editor, contributors and the Publishers, 2012
The right of Lee Harland and Mark Forster to be identified as authors of the editorial material in this Work
has been asserted by them in accordance with sections 77 and 78 of the Copyright, Designs and Patents
Act 1988.
British Library Cataloguing-in-Publication Data: A catalogue record for this book is available from the British
Library.
Library of Congress Control Number: 2012944355
All rights reserved. No part of this publication may be reproduced, stored in or introduced into a retrieval
system, or transmitted, in any form, or by any means (electronic, mechanical, photocopying, recording or
otherwise) without the prior written permission of the Publishers. This publication may not be lent, resold,
hired out or otherwise disposed of by way of trade in any form of binding or cover other than that in which it
is published without the prior consent of the Publishers. Any person who does any unauthorised act in relation
to this publication may be liable to criminal prosecution and civil claims for damages.
Permissions may be sought from the Publishers at the above address.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights. The Publishers are not associated with any product or vendor mentioned in this publication.
The Publishers, editors and contributors have attempted to trace the copyright holders of all material
reproduced in this publication and apologise to any copyright holders if permission to publish in this form has
not been obtained. If any copyright material has not been acknowledged, please write and let us know so we
may rectify in any future reprint. Any screenshots in this publication are the copyright of the website owner(s),
unless indicated otherwise.
Limit of Liability/Disclaimer of Warranty
The Publishers, editors and contributors make no representations or warranties with respect to the accuracy
or completeness of the contents of this publication and specifically disclaim all warranties, including without
limitation warranties of fitness of a particular purpose. No warranty may be created or extended by sales of
promotional materials. The advice and strategies contained herein may not be suitable for every situation.
This publication is sold with the understanding that the Publishers are not rendering legal, accounting or other
professional services. If professional assistance is required, the services of a competent professional person
should be sought. No responsibility is assumed by the Publishers, editor(s) or contributors for any loss of
profit or any other commercial damages, injury and/or damage to persons or property as a matter of products
liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas
contained in the material herein. The fact that an organisation or website is referred to in this publication as
a citation and/or potential source of further information does not mean that the Publishers nor the editor(s)
and contributors endorse the information the organisation or website may provide or recommendations it
may make. Further, readers should be aware that internet websites listed in this work may have changed or
disappeared between when this publication was written and when it is read. Because of rapid advances in
medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
Typeset by RefineCatch Limited, Bungay, Suffolk
Printed in the UK and USA
For Anna, for making everything possible
Lee Harland
Thanks to my wife, children and other family members, for their
support and understanding during this project
Mark Forster
Contents
List of figures and tables xvii
Foreword xxvii
About the editors xxxi
About the contributors xxxiii
Introduction 1
1 Building research data handling systems with open source tools 9
Claus Stie Kallesøe
1.1 Introduction 9
1.2 Legacy 11
1.3 Ambition 12
1.4 Path chosen 14
1.5 The ’ilities 15
1.6 Overall vision 21
1.7 Lessons learned 21
1.8 Implementation 23
1.9 Who uses LSP today? 24
1.10 Organisation 27
1.11 Future aspirations 29
1.12 References 32
2 Interactive predictive toxicology with Bioclipse and OpenTox 35
Egon Willighagen, Roman Affentranger, Roland C. Grafström,
Barry Hardy, Nina Jeliazkova and Ola Spjuth
2.1 Introduction 35
2.2 Basic Bioclipse–OpenTox interaction examples 39
2.3 Use Case 1: Removing toxicity without interfering with
pharmacology 45
2.4 Use Case 2: Toxicity prediction on compound collections 52
2.5 Discussion 58
2.6 Availability 59
2.7 References 59
3 Utilizing open source software to facilitate communication
of chemistry at RSC 63
Aileen Day, Antony Williams, Colin Batchelor, Richard Kidd and
Valery Tkachenko
3.1 Introduction 63
3.2 Project Prospect and open ontologies 64
3.3 ChemSpider 68
3.4 ChemDraw Digester 72
3.5 LearnChemistry 78
3.6 Conclusion 83
3.7 Acknowledgments 84
3.8 References 84
4 Open source software for mass spectrometry and metabolomics 89
Mark Earll
4.1 Introduction 89
4.2 A short mass spectrometry primer 90
4.3 Metabolomics and metabonomics 93
4.4 Data types 94
4.5 Metabolomics data processing 104
4.6 Metabolomics data processing using the open source
workflow engine, KNIME 112
4.7 Open source software for multivariate analysis 115
4.8 Performing PCA on metabolomics data in R/KNIME 117
4.9 Other open source packages 121
4.10 Perspective 124
4.11 Acknowledgements 126
4.12 References 126
5 Open source software for image processing and analysis:
picture this with ImageJ 131
Rob Lind
5.1 Introduction 131
5.2 ImageJ 133
5.3 ImageJ macros: an overview 140
5.4 Graphical user interface 144
5.5 Industrial applications of image analysis 146
5.6 Summary 148
5.7 References 149
6 Integrated data analysis with KNIME 151
Thorsten Meinl, Bernd Jagla and Michael R. Berthold
6.1 The KNIME platform 151
6.2 The KNIME success story 156
6.3 Benefits of ‘professional open source’ 157
6.4 Application examples 158
6.5 Conclusion and outlook 170
6.6 Acknowledgments 170
6.7 References 171
7 Investigation-Study-Assay, a toolkit for standardizing
data capture and sharing 173
Philippe Rocca-Serra, Eamonn Maguire, Chris Taylor, Dawn Field,
Timo Wittenberger, Annapaola Santarsiero and
Susanna-Assunta Sansone
7.1 The growing need for content curation in industry 174
7.2 The BioSharing initiative: cooperating standards needed 175
7.3 The ISA framework – principles for progress 176
7.4 Lessons learned 185
7.5 Acknowledgments 186
7.6 References 187
8 GenomicTools: an open source platform for developing
high-throughput analytics in genomics 189
Aristotelis Tsirigos, Niina Haiminen, Erhan Bilal and Filippo Utro
8.1 Introduction 190
8.2 Data types 194
8.3 Tools overview 195
8.4 C++ API for developers 202
8.5 Case study: a simple ChIP-seq pipeline 207
8.6 Performance 215
8.7 Conclusion 217
8.8 Resources 218
8.9 References 218
9 Creating an in-house ’omics data portal using EBI Atlas software 221
Ketan Patel, Misha Kapushesky and David P. Dean
9.1 Introduction 221
9.2 Leveraging ’omics data for drug discovery 222
9.3 The EBI Atlas software 226
9.4 Deploying Atlas in the enterprise 231
9.5 Conclusion and learnings 234
9.6 Acknowledgments 237
9.7 References 237
10 Setting up an ’omics platform in a small biotech 239
Jolyon Holdstock
10.1 Introduction 239
10.2 General changes over time 240
10.3 The hardware solution 241
10.4 Maintenance of the system 244
10.5 Backups 245
10.6 Keeping up-to-date 246
10.7 Disaster recovery 248
10.8 Personnel skill sets 257
10.9 Conclusion 258
10.10 Acknowledgements 259
10.11 References 259
11 Squeezing big data into a small organisation 263
Michael A. Burrell and Daniel MacLean
11.1 Introduction 263
11.2 Our service and its goals 265
11.3 Manage the data: relieving the burden of data-handling 267
11.4 Organising the data 267
11.5 Standardising to your requirements 271
11.6 Analysing the data: helping users work with their
own data 273
11.7 Helping biologists to stick to the rules 276
11.8 Running programs 276
11.9 Helping the user to understand the details 279
11.10 Summary 280
11.11 References 282
12 Design Tracker: an easy to use and flexible hypothesis tracking
system to aid project team working 285
Craig Bruce and Martin Harrison
12.1 Overview 285
12.2 Methods 289
12.3 Technical overview 290
12.4 Infrastructure 294
12.5 Review 296
12.6 Acknowledgements 297
12.7 References 297
13 Free and open source software for web-based collaboration 299
Ben Gardner and Simon Revell
13.1 Introduction 299
13.2 Application of the FLOSS assessment framework 303
13.3 Conclusion 321
13.4 Acknowledgements 322
13.5 References 322
14 Developing scientific business applications using open
source search and visualisation technologies 325
Nick Brown and Ed Holbrook
14.1 A changing attitude 325
14.2 The need to make sense of large amounts of data 326
14.3 Open source search technologies 327
14.4 Creating the foundation layer 328
14.5 Visualisation technologies 334
14.6 Prefuse visualisation toolkit 334
14.7 Business applications 335
14.8 Other applications 346
14.9 Challenges and future developments 347
14.10 Reflections 348
14.11 Thanks and acknowledgements 349
14.12 References 349
15 Utopia Documents: transforming how industrial scientists
interact with the scientific literature 351
Steve Pettifer, Terri Attwood, James Marsh and Dave Thorne
15.1 Utopia Documents in industry 355
15.2 Enabling collaboration 360
15.3 Sharing, while playing by the rules 361
15.4 History and future of Utopia Documents 363
15.5 References 364
16 Semantic MediaWiki in applied life science and industry:
building an Enterprise Encyclopaedia 367
Laurent Alquier
16.1 Introduction 367
16.2 Wiki-based Enterprise Encyclopaedia 368
16.3 Semantic MediaWiki 369
16.4 Conclusion and future directions 387
16.5 Acknowledgements 388
16.6 References 388
17 Building disease and target knowledge with Semantic
MediaWiki 391
Lee Harland, Catherine Marshall, Ben Gardner, Meiping Chang,
Rich Head and Philip Verdemato
17.1 The Targetpedia 391
17.2 The Disease Knowledge Workbench (DKWB) 408
17.3 Conclusion 418
17.4 Acknowledgements 419
17.5 References 419
18 Chem2Bio2RDF: a semantic resource for systems chemical
biology and drug discovery 421
David Wild
18.1 The need for integrated, semantic resources in drug discovery 421
18.2 The Semantic Web in drug discovery 423
18.3 Implementation challenges 424
18.4 Chem2Bio2RDF architecture 427
18.5 Tools and methodologies that use Chem2Bio2RDF 427
18.6 Conclusions 432
18.7 References 432
19 TripleMap: a web-based semantic knowledge discovery
and collaboration application for biomedical research 435
Ola Bildtsen, Mike Hugo, Frans Lawaetz, Erik Bakke,
James Hardwick, Nguyen Nguyen, Ted Naleid and
Christopher Bouton
19.1 The challenge of Big Data 436
19.2 Semantic technologies 437
19.3 Semantic technologies overview 439
19.4 The design and features of TripleMap 442
19.5 TripleMap Generated Entity Master (‘GEM’) semantic
data core 444
19.6 TripleMap semantic search interface 446
19.7 TripleMap collaborative, dynamic knowledge maps 448
19.8 Comparison and integration with third-party systems 450
19.9 Conclusions 451
19.10 References 451
20 Extreme scale clinical analytics with open source software 453
Kirk Elder and Brian Ellenberger
20.1 Introduction 453
20.2 Interoperability 455
20.3 Mirth 458
20.4 Mule ESB 463
20.5 Unified Medical Language System (UMLS) 463
20.6 Open source databases 465
20.7 Analytics 473
20.8 Final architectural overview 478
20.9 References 479
20.10 Bibliography 480
21 Validation and regulatory compliance of free/open
source software 481
David Stokes
21.1 Introduction 481
21.2 The need to validate open source applications 482
21.3 Who should validate open source software? 484
21.4 Validation planning 485
21.5 Risk management and open source software 491
21.6 Key validation activities 493
21.7 Ongoing validation and compliance 500
21.8 Conclusions 503
21.9 References 504
22 The economics of free/open source software in industry 505
Simon Thornber
22.1 Introduction 505
22.2 Background 506
22.3 Open source innovation 508
22.4 Open source software in the pharmaceutical industry 510
22.5 Open source as a catalyst for pre-competitive collaboration
in the pharmaceutical industry 510
22.6 The Pistoia Alliance Sequence Services Project 512
22.7 Conclusion 520
22.8 References 521
Index 523
List of figures and tables
Figures
1.1 Technology stack of the current version of LSP
running internally at Lundbeck 16
1.2 LSP curvefit, showing plate list, plate detail as well
as object detail and curve 20
1.3 LSP MedChem Designer, showing on the fly calculated
properties and Google visualisations 25
1.4 LSP4Externals front page with access to the different
functionalities published to the external collaborators 26
1.5 LSP SAR grid with single row details form 28
1.6 IMI OpenPhacts GUI based on the LSP4All frame 31
2.1 Integration of online OpenTox descriptor calculation
services in the Bioclipse QSAR environment 40
2.2 The Bioclipse Graphical User Interface for
uploading data to OpenTox 41
2.3 OpenTox web page showing uploaded data 42
2.4 CPDB Signature Alert for Carcinogenicity for
TCMDC-135308 48
2.5 Identification of the structural alert in the
ToxTree Benigni/Bossa model for carcinogenicity
and mutagenicity, available via OpenTox 49
2.6 Crystal structure of human TGF-β1 with the
inhibitor quinazoline 3d bound (PDB-entry 3HMM) 50
2.7 Replacing the dimethylamino group of
TCMDC-135308 with a methoxy group resolves
the CPDB signature alert as well as the ToxTree
Benigni/Bossa Structure Alerts for carcinogenicity
and mutagenicity as provided by OpenTox 51
2.8 Annotated kinase inhibitors of the TCAMS,
imported into Bioclipse as SDF together with data
on the association with human adverse events 52
2.9 Applying toxicity models to sets of compounds from
within the Bioclipse Molecule TableEditor 53
2.10 Adding Decision Support columns to the
molecule table 54
2.11 Opening a single compound from a table in the
Decision Support perspective 55
2.12 The highlighted compound – TCMDC-135174
(row 27) – is an interesting candidate as it is highly
active against both strains of P. falciparum while
being inactive against human HepG2 cells 56
2.13 Molecule Table view shows TCMDC-134695 in
row 19 57
2.14 The compound TCMDC-133807 is predicted to
be strongly associated with human adverse events,
and yields signature alerts with Bioclipse’s CPDB
and Ames models 57
3.1 A ‘prospected’ article from RSC 65
3.2 The header of the chemical record for domoic
acid in ChemSpider 69
3.3 Example of figure in article defining compounds 73
3.4 A review page of digested information 75
3.5 Examples of ChemDraw molecules which are
not converted correctly to MOL files by OpenBabel 77
3.6 The Learn Chemistry Wiki 79
4.1 Isotope pattern for cystine 97
4.2 Ion chromatogram produced in R (xcms) 100
4.3 A mass spectrum produced from R (xcms) 101
4.4 3D Image of a LC-MS scan using the plot surf
command from the RGL R-package
LC-MS spectrum using mzMine 102
4.5 A total ion chromatogram (TIC) plot from
mzMine 103
4.6 Configuring peak detection 103
4.7 Deconvoluted peak list 105
4.8 3D view of an LC-MS scan 106
4.9 Example of a Batch mode workflow 106
4.10 Configuring mzMine for metabolomics processing 108
4.11 mzMine results 110
4.12 Data analysis in mzMine 111
4.13 A metabolomics componentisation workflow
in KNIME 113
4.14 Workflow to normalise to internal standard or
total signal 114
4.15 PCA analysis using KNIME and R 117
4.16 The plots from the R PCA nodes 122
5.1 The many faces of ImageJ 134
5.2 ImageJ can be customised by defining the contents
of the various menus 135
5.3 Smartroot displays a graphical user interface that
only Javascript can deliver within ImageJ 139
5.4 A KNIME workflow that integrates ImageJ
functions in nodes as well as custom macros 140
5.5 Example of a QR code that can be read by a
plug-in based on ZXing 141
5.6 An example of a GUI that can be generated within
the ImageJ macro language to capture user inputs 145
5.7 Imaging of seeds using a flat bed scanner 146
5.8 Plant phenotyping to non-subjectively quantify the
areas of different colour classifications 147
6.1 Simple KNIME workflow building a decision
tree for predicting molecular activity 152
6.2 Hiliting a frequent fragment also hilites the
molecules in which the fragment occurs 154
6.3 Feature elimination is available as a loop inside
a meta-node 155
6.4 Outline of a workflow for comparing two SD files 159
6.5 Reading and combining two SD files 159
6.6 Preparation of the molecules 160
6.7 Filtering duplicates 161
6.8 Writing out the results 161
6.9 Amide enumeration workflow 162
6.10 Contents of the meta-node 163
6.11 KNIME Enterprise Server web portal 165
6.12 Outline of a workflow for image processing 166
6.13 Black-and-white images in a KNIME data table 166
6.14 Image after binary thresholding has been applied 167
6.15 Meta-node that computes various features on
the cell images 167
6.16 A workflow for large-scale analysis of sequencing data 168
6.17 Identification of regions of interest 170
7.1 Overview of the ISA-Tab format 178
7.2 An overview of the depth and breadth of the PredTox
experimental design 182
7.3 The ontology widget illustrates here how CHEBI
and other ontologies can be browsed and searched
for term selection 184
8.1 Diverse types of genomic data 191
8.2 Flow-chart describing the various functionalities
of the GenomicTools suite 196
8.3 Example entry from the user’s manual for the
‘shuffle’ operation of the genomic_regions tool 199
8.4 Example entry (partial) from the C++ API
documentation produced using Doxygen and
available online with the source code distribution 203
8.5 Example of TSS read profile for genes of high
and low expression 209
8.6 Example of TSS read heatmap for select genes 210
8.7 Example of window-based read densities in
wiggle format 212
8.8 Example of window-based peaks in bed format 213
8.9 Time evaluation of the overlap operation between
a set of sequenced reads of variable size (1 through
64 million reads in logarithmic scale) and a
reference set comprising annotated exons and
repeat elements (~6.4 million entries) 216
8.10 Memory evaluation of the overlap operation
presented in Figure 8.9 217
9.1 Applications of ’omics data throughout the drug
discovery and development process 223
9.2 User communities for ’omics data 225
9.3 Atlas query results screen 227
9.4 Atlas architecture 228
9.5 Atlas administration interface 230
9.6 Federated query model for Atlas installations 236
10.1 Overview of the IT system showing the Beowulf
compute cluster comprising a master server that
passes out jobs to the four nodes 242
10.2 The current IT system following a modular
approach with dedicated servers 244
10.3 NAS box implementation showing the primary NAS
at site 1 mirrored to the secondary NAS at different site 247
10.4 A screenshot of our ChIP-on-chip microarray
data viewing application 256
11.1 Changes in bases of sequence stored in GenBank
and the cost of sequencing over the last decade 264
11.2 Our filesharing setup 268
11.3 Connectivity between web browsers, web service
genome browsers and web services hosting genomic data 274
12.1 A sample DesignSet 286
12.2 The progress chart for the DDD1 project 288
12.3 Using the smiles tag within our internal wiki 290
12.4 Adoption of Design Tracker by users and projects 294
13.1 The FLOSS assessment framework 302
13.2 A screenshot of Pfollow showing the
‘tweets’ within the public timeline 306
13.3 A screenshot of tags.pfizer.com 310
13.4 A screenshot showing Pfizerpedia’s home page 316
13.5 A screenshot showing an example profile page
for the Therapeutic Area Scientific Information
Services (TA SIS) group 316
13.6 A screenshot of the tags.pfizer.com social
bookmarking service page from the R&D
Application catalogue 317
14.1 Schematic overview of the system from
accessing document sources 329
14.2 Node/edge networks for disease-mechanism linkage 335
14.3 KOL Miner in action 337
14.4 Further KOL views 338
14.5 Drug repurposing matrix 339
14.6 Early snapshot of our drug-repositioning system 340
14.7 Article-level information 341
14.8 An example visual biological process map describing
how our drugs work at the level of the cell and tissue 343
14.9 A screenshot of the Atlas Of Science system 345
14.10 Typical representation of three layout approaches
that are built into Prefuse 346
15.1 A static PDF table 356
15.2 In (a), Utopia Documents shows meta-data relating
to the article currently being read; in (b), details of a
specific term are displayed, harvested in real time
from online data sources 358
15.3 A text-mining algorithm has identified chemical
entities in the article being read, details of which are
displayed in the sidebar and top ‘flow browser’ 359
15.4 Comments added to an article can be shared with
other users, without the need to share a specific copy
of the PDF 361
15.5 Utopia Library provides a mechanism for
managing collections of articles 362
16.1 An example of semantic annotations 370
16.2 Semantic Form in action 371
16.3 Page template corresponding to the form in Figure 16.2 372
16.4 The KnowIt landing page 374
16.5 The layout of KnowIt pages is focused on content 375
16.6 Advanced functions are moved to the bottom
of pages to avoid clutter 375
16.7 Semantic MediaWiki and Linked Data Triple Store
working in parallel 384
16.8 Wiki-based contextual menus 386
17.1 Data loading architecture 397
17.2 Properties of PDE5 stored semantically in the wiki 399
17.3 A protein page in Targetpedia 400
17.4 The competitor intelligence section 402
17.5 A protein family view 403
17.6 Social networking around targets and projects
in Targetpedia 405
17.7 Dividing sepsis into physiological subcomponents 411
17.8 The Semantic Form for creating a new assertion 414
17.9 (a) An assertion page as seen after editing.
(b) A semantic tag and automatic identification
of related assertions 415
17.10 The sepsis project page 416
18.1 Chem2Bio2RDF organization, showing data sets
and the links between them 428
18.2 Tools and algorithms that employ Chem2Bio2RDF 430
19.1 The TripleMap architecture 445
19.2 Entities and their associations comprise the GEM
data network 447
19.3 TripleMap web application with knowledge maps 449
20.1 Architecture for analytical processing 454
20.2 HL7 V2.x message sample 455
20.3 HL7 V3.x CDA sample 457
20.4 Mirth Connect showing the channels from the
data sources to the databases 460
20.5 Screenshot of Mirth loading template 461
20.6 The resulting Mirth message tree 462
20.7 CDA data model 467
20.8 MapReduce 469
20.9 Riak RESTful API query 473
20.10 Complete analytic architecture 478
21.1 Assess the open source software package 486
21.2 Assess the open source community 489
21.3 Risk management process 492
21.4 Typical validation activities 494
21.5 Software development, change control and testing 498
21.6 Development environments and release cycles 502
22.1 Deploying open source software and data inside
the data centres of corporations 516
22.2 Vision for a new cloud-based shared architecture 517
Tables
2.1 Bioclipse–OpenTox functionality from the Graphical
User Interface is also available from the scripting
environment 37
2.2 Description of the local endpoints provided by the default
Bioclipse Decision Support extension 44
2.3 Various data types are used by the various predictive
models described in Table 2.2 to provide detailed
information about what aspects of the molecules
contributed to the decision on the toxicity 45
2.4 Structures created from SMILES representations
with the Bioclipse New from SMILES wizard for
various structures discussed in the use cases 46
8.1 Summary of operations of the genomic_regions tool 198
8.2 Summary of usage and operations of the
genomic_overlaps tool 200
8.3 Summary of usage and operations of the
genomic_scans tool 201
8.4 Supported statistics for the permutation tests 201
13.1 Comparison of the differences between Web 2.0 and
Enterprise 2.0 environmental drivers 300