
SPRINGER BRIEFS IN PHARMACEUTICAL
SCIENCE & DRUG DEVELOPMENT

Pouria Amirian
Trudie Lang
Francois van Loggerenberg Editors

Big Data in
Healthcare
Extracting
Knowledge from
Point-of-Care
Machines


SpringerBriefs in Pharmaceutical
Science & Drug Development


More information about this series is available on the Springer website.

Pouria Amirian Trudie Lang
Francois van Loggerenberg


Editors

Big Data in Healthcare
Extracting Knowledge from Point-of-Care
Machines





Editors
Pouria Amirian
Centre for Tropical Medicine and Global
Health
University of Oxford
Oxford
UK

Francois van Loggerenberg
Centre for Tropical Medicine and Global
Health
University of Oxford
Oxford
UK

Trudie Lang
Centre for Tropical Medicine and Global
Health
University of Oxford
Oxford
UK

ISSN 1864-8118
ISSN 1864-8126 (electronic)
SpringerBriefs in Pharmaceutical Science & Drug Development
ISBN 978-3-319-62988-9

ISBN 978-3-319-62990-2 (eBook)
DOI 10.1007/978-3-319-62990-2
Library of Congress Control Number: 2017946047
© The Editors and Authors 2017
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland


Contents

1 Introduction—Improving Healthcare with Big Data . . . . . . . . . . . . .   1
Francois van Loggerenberg, Tatiana Vorovchenko and Pouria Amirian

2 Data Science and Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  15
Pouria Amirian, Francois van Loggerenberg and Trudie Lang

3 Big Data and Big Data Technologies . . . . . . . . . . . . . . . . . . . . . . . . .  39
Pouria Amirian, Francois van Loggerenberg and Trudie Lang

4 Big Data Analytics for Extracting Disease Surveillance
Information: An Untapped Opportunity . . . . . . . . . . . . . . . . . . . . . .  59
Pouria Amirian, Trudie Lang, Francois van Loggerenberg,
Arthur Thomas and Rosanna Peeling

5 #Ebola and Twitter. What Insights Can Global Health Draw
from Social Media? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  85
Tatiana Vorovchenko, Proochista Ariana, Francois van Loggerenberg
and Pouria Amirian

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  99



About the Editors

Pouria Amirian has a Ph.D. in Geospatial Information Science (GIS) and is a
Principal Research Scientist in Data Science and Big Data at the Ordnance
Survey GB and a Data Science Research Associate with the Global Health
Network. He managed and led a joint project (Oxford and Stanford) on “Using Big
Data Analysis Tools to Extract Disease Surveillance Information from
Point-of-Care Diagnostic Machines”. Pouria has done research and development
projects and lectured about Big Data, Data Science, Machine Learning, Spatial
Databases, GIS and Spatial Analytics since 2008.
Trudie Lang is Professor of Global Health Research, Head of the Global Health
Network, Senior Research Scientist in Tropical Medicine at the Nuffield
Department of Medicine and Research Fellow at Green Templeton College at the
University of Oxford. She has a Ph.D. from the London School of Hygiene and
Tropical Medicine and has worked within industry, the World Health Organisation
(WHO), NGOs and academia conducting clinical research studies in low-resource
settings. Dr. Lang is a clinical trial research methodologist with specific expertise in
capacity development and trial operations in low-resource settings. She currently
leads the Global Health Network (GHN), a network that helps clinical researchers
with trial design, methods, interpretation of regulations and general operations.
Francois van Loggerenberg is Scientific Lead of the Global Health Network,
based out of the Centre for Tropical Medicine and Global Health, Nuffield
Department of Medicine. Originally trained as a research psychologist, from 2002
to 2012, Francois was employed at the Nelson R. Mandela School of Medicine in
Durban, South Africa, where he worked initially as the study coordinator on a large
HIV pathogenesis study at the Centre for the AIDS Programme of Research in
South Africa (CAPRISA). In 2005, he was awarded a Doris Duke Foundation
Operations Research For AIDS Care and Treatment In Africa grant that funded his
Ph.D. work on enhancing adherence to antiretroviral therapy (2011, London School
of Hygiene and Tropical Medicine).




Chapter 1
Introduction—Improving Healthcare with Big Data

Francois van Loggerenberg, Tatiana Vorovchenko
and Pouria Amirian

1.1 Introduction

With the advancement of computing systems and availability of new types of
sensors, there has been a huge increase in the amount, type and variety of data that
are collected and stored [1]. By some estimates in 2013, over 90% of the world’s
data had been created in the previous two years [2]. In terms of health data, this has
been impacted on by the increased use of Electronic Health Records (EHR), personalized medicine, and administrative data. Although it is difficult to comprehensively and simply characterise what constitutes Big Data, in terms of data itself,
several key characteristics have been identified, which create particular opportunities and challenges [3, 4]. These characteristics include the large size (volume) of
these datasets, the speed with which these data are generated and collected
(velocity), and the diversity of the data generated (variety). Some sources add a
fourth ‘V’, veracity, to highlight the fact that the quality of data collected this way
needs to be carefully considered [1]. However, as we discuss later in this book,
veracity is not a characteristic of the data themselves and, more importantly, Big
Data is not just about data [5]. As often used, Big Data also refers to datasets that
have been collected for a specific purpose, but used in new secondary analyses, the
linking of datasets collected for different purposes, or for datasets that are generated
from routine activity, and often collected and stored autonomously and automatically. These characteristics create huge and rapidly expanding datasets that are ripe
for linking, and for algorithmic analysis to detect and characterise relationships and


F. van Loggerenberg · T. Vorovchenko · P. Amirian
University of Oxford, Oxford, UK
© The Editors and Authors 2017
P. Amirian et al. (eds.), Big Data in Healthcare, SpringerBriefs in Pharmaceutical
Science & Drug Development, DOI 10.1007/978-3-319-62990-2_1


patterns that would be very difficult to detect in smaller and individual
purpose-collected datasets.

1.2 Big Data and Health

The use of Big Data in biomedical and health sciences has received a lot of attention
in recent years. These data present a significant opportunity for the improvement of
the diagnosis, treatment and prevention of various diseases, and to interventions to
improve health outcomes [1, 6]. However, this is tied to the obvious risks to privacy
and trust of this sensitive information and the exposure of the vulnerability of
people requiring interventions or treatments. The Big Data revolution has impacted
on the biomedical sciences largely due to the technological advances in genome
sequencing, improvements and digitalisation of imaging, the development and
growth of vast patient data repositories, the rapid growth in biomedical knowledge,
as well as the central role patients are taking in the management of their own health
data, including collection of personal activity and health data [3].
Some of the key sources of data for biomedicine and health that have contributed
to the volume, velocity, variety and veracity of health related data are [3]:
• Medical Records—Increased digitalisation of electronic health records (EHR);
these data are collected for patient care and follow-up, but are key data sources
for secondary analysis and combination with other large data sets of longitudinal
free text, laboratory and other parameters, imaging, medication records, and a
vast array of other key data. When combined with data like genomic data, these
represent potential sources of making genotype-phenotype associations at the
population level.
• Administrative Data—These data are usually generated for billing or insurance claims, and are not generally available as immediately as EHR data.
However, they do have the benefit of usually being coded in a standardised way,
and verified with errors corrected, and so represent, usually, higher quality,
comparable data.
• Web Search Logs, click streams and interaction-based—The internet has
become an increasingly important source of information for people about their
health complaints, especially prior to seeking professional help, and the systematic collection and analysis of these data have yielded insights into syndromic surveillance and potential public health interventions based on concerns.
These data have been used to identify epidemic outbreaks [7], and have been
useful at highlighting potential issues with pharmaceutical side effects, for
example.
• Social Media—As social media continues to evolve, its definition is constantly
changing to capture all its features and reflect the role it plays in the modern
world. Social media has been described as “the platforms that enable the
interactive web by engaging users to participate in, comment on and create



1 Introduction—Improving Healthcare with Big Data

3

content as means of communicating with their social graph, other users and the
public” [8]. Social media continues developing and integrating deeply into
human lives, and may serve a variety of purposes such as social interaction,
information seeking, time passing, entertainment, relaxation, communicatory
utility, expression of opinions, convenience utility, information sharing, and
surveillance and watching others [9]. For example, LinkedIn allows its users to
build professional connections, Facebook is widely used to connect with friends,
Twitter allows public broadcasting of short messages, Instagram is used to share
favourite pictures, and YouTube allows the sharing of videos. This area of data
collection and analysis has grown rapidly over recent years, as populations have
greater access to, and generate more and more, social data. This area also
entails blogs, Q and A sites (like Quora), networking sites, and the data have
been used to find things like unreported side effects, for monitoring
disease-related beliefs, and to identify or track disasters or disease outbreaks. As
one of the projects outlined in this book deals with social media, a bit more will
be said about this specific data type.
The number of active social media users has been growing rapidly. As of 2015, it is
estimated that nearly 2 billion people globally use social networks (Fig. 1.1). Social
media platforms have differing levels of popularity and numbers of active users.
As of June 2016, Facebook is the most popular platform, with 1,590 million users
(Fig. 1.2).
Big Data Analytics is also being used for health and human welfare. One example
of this is Google Flu Trends. Millions of users around the world search for health
information online. Google estimates how much flu is circulating in different

Fig. 1.1 Number of social network users worldwide from 2010 to 2014, with projections to 2019
[10]. Users in billions: 2010: 0.97; 2011: 1.22; 2012: 1.40; 2013: 1.59; 2014: 1.87; 2015: 2.04;
2016: 2.22; 2017: 2.39; 2018: 2.55; 2019: 2.72



Fig. 1.2 Leading social networks worldwide as of June 2016, ranked by number of active users
(in millions) [11]: Facebook 1,590; WhatsApp 1,000; Facebook Messenger 900; QQ 853; WeChat
697; QZone 640; Tumblr 555; Instagram 400; Twitter 320; Baidu Tieba 300; Skype 300; Viber
249; Sina Weibo 222; LINE 215; Snapchat 200; YY 122; VKontakte 100; Pinterest 100; BBM
100; LinkedIn 100; Telegram 100

Fig. 1.3 Correlation between Google Flu Trends and National Institute of Infectious Diseases for
Japan (2004–2009)


countries around the world using the data of particular search queries on its search
engine and complex algorithms [7]. These data correlate with the data from traditional flu surveillance systems [12] (Fig. 1.3). The reporting lag of these predictions
is around one day, whereas traditional surveillance systems might take weeks to
collect and report the data. Although Google are no longer publishing these data
publicly in real time, historical datasets remain available, and newer data are
available to academic research groups on request.
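The core of validating a digital proxy like Google Flu Trends against a traditional surveillance system is the correlation between the two time series. A minimal sketch of that computation follows; the weekly figures below are hypothetical, and the real system fitted regression models over millions of candidate search queries rather than correlating a single series.

```python
import statistics

# Hypothetical weekly series: normalised volume of flu-related search
# queries, and lab-confirmed influenza cases from a surveillance system.
search_volume = [12, 18, 25, 40, 55, 48, 30, 16]
confirmed_cases = [150, 210, 340, 520, 700, 610, 380, 190]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson_r(search_volume, confirmed_cases)
print(f"correlation between search volume and cases: r = {r:.3f}")
```

A high correlation on historical data is only the first step: the appeal of the search-based signal is its one-day reporting lag, but, as the Google Flu Trends experience showed, such models must be continuously re-validated against the surveillance systems they approximate.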



Twitter is an online social networking and micro-blogging platform that enables
users to send and read short 140-character messages called “tweets”.
Micro-blogging allows users to exchange small elements of content: short sentences, individual images, or video links [13]. Twitter is currently primarily an
online service accessible from computers, tablets and mobile phones. Since its
launch in 2006, the population of Twitter users has been constantly growing: as of
June 2016, Twitter had 320 million active users (Fig. 1.2) contributing up to 500
million tweets per day [13]. This is very appealing to Big Data analysts, as the data
show in real time the concerns of people from all around the world, and are equally
useful for analysing historical events and patterns, suggesting potential research
areas and public health intervention opportunities in health and human development. This
will be explored further in Chap. 5. A recent review identified three key areas in
which Twitter has been used in health research: Health issues and problems (cancer,
dementia, acne, cardiac arrest, and tobacco use), health promotion (like diet, cancer
screening, vaccination, diabetes etc.), and professional communication (evaluative
feedback to students in clinical settings, and promoting journal articles and other
scientific publications) [13].
The ubiquity of smartphones has gone hand-in-hand with the increase in social
media posting, especially of geo-located data, for example in tweets. This has also
led to an increase in the number and types of personal monitoring activities which
have been exploited by health and other personal monitoring applications [14]. This
has led to vast amounts of monitoring data about personal behaviours, positioning,
diet logging, medication adherence, blood sugar levels, coffee consumption, sleep
quality, psychological or mental states, and health and physical activity indicators
being made available from self-monitoring, GPS tracking, and technology like
accelerometers, which has been referred to as the quantified self [2]. These
applications have been used to create health improvement applications, like
smoking cessation or weight loss promotion and support, but in the process are also
generating vast and varied datasets of these indicators which could be mined to find
potentially useful health data. It is possible that these may be used to identify risk
factors, which might be linked back to EHR to identify those requiring intervention
or support to prevent the development of illness. At a population level, public
health interventions could be targeted at specific geographic groups where issues
like obesity, for example, may be identified by these means.
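The population-level targeting described above amounts to aggregating self-monitoring indicators by geography and flagging regions that fall below (or above) some threshold. A minimal sketch, with entirely hypothetical records and an illustrative cut-off:

```python
from collections import defaultdict

# Hypothetical self-monitoring records: (region, daily_step_count) pairs
# as might be exported by a fitness-tracking application.
records = [
    ("north", 3200), ("north", 4100), ("north", 2900),
    ("south", 9800), ("south", 11200), ("south", 8700),
]

def mean_by_region(rows):
    """Average the indicator within each geographic group."""
    groups = defaultdict(list)
    for region, steps in rows:
        groups[region].append(steps)
    return {region: sum(v) / len(v) for region, v in groups.items()}

THRESHOLD = 5000  # illustrative cut-off for low physical activity
low_activity = {r for r, m in mean_by_region(records).items() if m < THRESHOLD}
print("regions that might be prioritised:", low_activity)
```

In a real application the indicator, the regional grouping and the threshold would all come from epidemiological judgement rather than a fixed constant, but the shape of the analysis — group, summarise, flag — is the same.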

1.3 Big Data and Health in Low- and Middle-Income Countries

The significant and rapid advances in the use of Big Data technologies and Cloud
Computing in developed countries have not been matched in Low- and
Middle-income countries (LMICs), where adoption has been slower, despite the
potential for these approaches to improve healthcare delivery and population health.




A review of articles looking specifically at the use of Big Data in healthcare in
LMICs summarises some of the key potential benefits as well as the challenges that
need to be overcome [15]. In these settings, healthcare is most often delivered in
vertical programmes (for HIV, TB, Malaria etc.), all of which have stringent data
requirements which have to be addressed, usually by cadres of community
healthcare workers. New ways of collecting data (on smartphones, tablets, or
portable computers), and real-time data collection by connecting healthcare devices
to the Internet have made it possible to get around some of the more pressing
logistical and technical barriers to electronic data capture, storage and integration.
Technological advances in LMICs are often able to leap-frog some of the developmental steps observed in developed countries. For example, mobile phone
penetration in LMICs, especially Sub-Saharan Africa, is often very good and
positively associated with other good development indices, as fixed line installations were often lacking and mobile phone technology was able to be rolled out
more efficiently and more easily as there was no existing infrastructure or technology to compete with [16]. This means that there have been rapid and unexpected
advances in access to technologies that have sometimes taken longer to be adopted
in more developed countries.
The benefit to LMICs of good uses of Big Data analytics would be to ensure
good healthcare delivery, identification of risk factors for disease, and rapid identification of individuals who might benefit from early prevention or intervention
efforts. This is particularly true given that currently there may be poor service
delivery, poor governance, and poor data coordination, meaning that modest
improvements in these could reap significant benefits by ensuring that limited
resources are used constructively [15]. Currently health systems in these regions are
driven largely by focussing on individual diseases, and the integrative nature of Big
Data may help to move to a more integrated, horizontal, approach to the research
into and prevention and treatment of diseases and the causes of poor health.
Provision of essentials like clean water, food and good sanitation remain
pressing problems, but Big Data analytics could be as useful in supporting human
development as they could be in improving health, and the infrastructure and skills
put in place could be leveraged. Certainly, good health and good development are
mutually supportive and highly related.
For the potential benefits to be properly realised, it is important that the current
generally poor governance of global health be addressed to ensure that the properly
informed, considered and adequately resourced collection of data receives proper
oversight and stewardship [15]. In 2009, the Global Pulse initiative was established
by the United Nations in order “to accelerate discovery, development and scaled adoption of Big Data innovation for sustainable
development and humanitarian action” [17]. This project has also focussed on some
health-based applications. These include projects that had a strong health-based
focus, and many more that related to the overlapping concerns of development and
welfare, with some key examples here [18]:



• Monitoring of the implementation of mother-to-child prevention of HIV in
Uganda, using real-time indicator data from health centres across the country to
populate an online dashboard. This data collection and sharing allowed for the
identification of bottlenecks in the rollout of the Option B+ treatment (where
expectant mothers are offered HIV treatment irrespective of the CD4 t-cell
count), and to reveal correlations such as the relationship between stock outs and
drop outs from the programme.
• Data visualisation with interactive maps to support disease outbreak responses
in Uganda capturing free text location data for disease reports, and automatic
techniques to convert these to geo-referenced positions, in combination with
map overlays of existing geographical and other data to create interactive
visualisations in an online dashboard.
• Using social media to understand public perceptions of immunisation in
Indonesia, using a database of over 88,000 Bahasa Indonesian language tweets
from between January 2012 and December 2013. Content analysis and filters
were used to determine relevant tweets, and this project revealed how social
media was being used to share information relevant to immunisation, analysable
in real time. Especially useful was the identification of a core of influencers on
in real time. Especially useful was the identification of a core of influencers on
Twitter that could be leveraged to provide rapid response communication if
needed, and if provided with relevant and accurate messages to disseminate.
• Understanding awareness of immunisation and the sentiment towards this by
using social media and news content analysis in India, Kenya, Nigeria and
Pakistan, using data from Twitter and Facebook along with traditional media.
Spikes in content were linked to key events (like the attacks on polio workers
and campaigners in Pakistan, for example). Network analysis and demographic
data on users were used to identify key influencers in the networks. This work
led to a better understanding of the utility of social media monitoring to gain
deeper understanding of public sentiment regarding immunisation.
• Using social media to analyse attitudes towards contraception and teenage
pregnancy in Uganda, by extracting data from Facebook pages and UNICEF’s
U-report platform between 2009 and 2014. Facebook data were anonymised and
filtered to identify messages relating to contraception and family planning. An
interactive dashboard was developed and this is publicly accessible
(http://familyplanning.unglobalpulse.net/uganda/). This platform provided for the
real-time extraction of data on changing sentiments around family planning and
contraception, which would impact on any public health programme or intervention addressing these concerns.
• Analysing public perceptions towards sanitation by analysing social media
content, using filtered Twitter data and analysed on a social media data analytics
platform. Overall trends in data volume, influencers, and key hashtags were
reported. This study showed how by monitoring baseline indicators over time,
the changing social media discussion around sanitation could be tracked,
making it possible to evaluate the reach and effectiveness of educational campaigns, especially public engagement with these campaigns.



• Using data to analyse seasonal mobility patterns in Senegal where anonymised
mobile telephone data were used to indicate the position of people, and their
movement, in order to show differences in mobility patterns over the various
seasons. Movements were characterised both daily, as well as over the period of
a month. Understanding where people were, and their migration patterns over
the seasons, is potentially extremely useful for health surveillance and outbreak
assessments, as well as for resource and response planning.
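Several of the projects above rest on the same first step: filtering a large stream of social media posts down to the content relevant to a health topic. A minimal keyword-filter sketch follows; the tweets are invented, and the Global Pulse projects applied far richer content analysis (language detection, sentiment, network analysis) to datasets of tens of thousands of posts.

```python
import re

# Hypothetical tweets standing in for a much larger collected stream.
tweets = [
    "Took my daughter for her measles vaccination today",
    "Traffic is terrible this morning",
    "Worried about side effects of the polio vaccine",
    "New immunisation campaign starts next week",
]

# Simple keyword filter for immunisation-related content; matches
# "vaccine", "vaccination", "immunisation"/"immunization", etc.
PATTERN = re.compile(r"\b(vaccin\w*|immuni[sz]ation)\b", re.IGNORECASE)

relevant = [t for t in tweets if PATTERN.search(t)]
print(len(relevant), "of", len(tweets), "tweets are immunisation-related")
```

Once the relevant subset is isolated, the downstream analyses described above — trends over time, influencer identification, sentiment around key events — operate on this filtered stream.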
Another novel proposal is to use new online sources of data, like social media,
and combine these with epidemiological and environmental data to create real-time,
constantly updating disease distribution maps that are more relevant and nuanced
than the traditional, static maps [19]. This is considered key, as the accurate and
up-to-date understanding of disease distribution, especially in LMICs, is central to
effective, targeted and appropriate interventions to prevent, treat, and manage diseases and vectors, and to understanding the global burden of disease that currently
drives much of the investment in and deployment of public health initiatives.
Two projects listed in this volume (Chaps. 4 and 5) give further examples of
how Big Data analytics may be utilized and deployed (Chap. 4) in support of health
and human welfare concerns, and more detail can be found in those chapters.

1.3.1 Analytical Challenges

Traditionally, health data for research purposes have been collected in ways that serve
the statistical analysis approaches used [2]. This means that data were studied from
samples of defined populations, extrapolated to the whole population of interest
based on very clearly defined sampling strategies. In addition to this, very clearly
defined and operationalised measures were collected and the data were rigorously
and continuously monitored for quality and accuracy. These data were then carefully ‘cleaned’ prior to any analysis. In part due to all of these protections and
procedures, the cost and complexity of running big trials has grown rapidly,
especially in LMICs, and there have been calls for more pragmatic approaches [20].
Big Data may ameliorate some of the impact of the high expense and complexity
of these trials data quality procedures, as a clearly defined focus and data quality are
traded for quantity and variety. However, the volume, velocity and variety of these
data create some potential pitfalls to their analysis [3]:
• The development of algorithms for selecting patients whose EHR or
administrative data are to be used is very problematic. Often this requires the
analysis of data of several different types, analogous to clinical diagnosis by a
health professional, and errors in developing algorithms can lead to erroneous
conclusions. Complex, iterative approaches based on using clinical judgements
from practitioners have been suggested as potential solutions to this issue.



• Importantly, the data used in these analyses are observational and usually not
collected under controlled experimental or randomised conditions, and so there
exists a real worry that observations are susceptible to biases and confounding.
Usually the identification of potential confounders to control for in analyses is
done by researchers and experts which means that this is not likely to happen in
many of the automated algorithms used in Big Data applications. This makes
interpretation of results more difficult.
• As the datasets grow both in volume and variety, analysis techniques used to
find associations and patterns that are meaningful become complicated by the
increase in the likelihood of chance findings being significant. Unless this is
taken into account, the number of false positive associations is uncontrolled,
leading to spurious conclusions based on chance associations.
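The multiple-comparisons problem in the last point can be made concrete with a short simulation. The sketch below tests 1,000 hypotheses where no real effect exists: by construction, each p-value is uniform on [0, 1], so at a 0.05 threshold roughly 50 "significant" findings appear by chance alone, while a standard correction such as Bonferroni removes nearly all of them.

```python
import random

random.seed(1)

# Simulate 1,000 hypothesis tests under the null hypothesis: when there
# is no real effect, each p-value is uniformly distributed on [0, 1].
p_values = [random.random() for _ in range(1000)]

alpha = 0.05
uncorrected = sum(p < alpha for p in p_values)
# Bonferroni correction: divide the threshold by the number of tests.
bonferroni = sum(p < alpha / len(p_values) for p in p_values)

print("uncorrected 'significant' findings:", uncorrected)  # roughly 50
print("after Bonferroni correction:", bonferroni)          # close to 0
```

Bonferroni is deliberately conservative; in large-scale Big Data analyses, false-discovery-rate procedures are often preferred, but the lesson is the same: without some correction, the number of chance associations grows with the number of hypotheses tested.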


1.3.2 Ethical Challenges

Big Data approaches to biomedical and behavioural data are likely to yield significant insights and advances for global and public health. However, this comes
with some key ethical challenges that need to be identified and addressed. A recent
comprehensive review of ethical issues in Big Data, which reviewed information
from 68 studies, highlights some key challenges, and also suggests some additional
challenges that are not yet clearly outlined in the existing literature [6]. Although
the technology for producing, processing and sharing data mean that large datasets
are easily available, and that linking and sharing can be quite ubiquitous, this is not
without significant issues. The example is given of Facebook’s Beacon software,
released in 2007, and developed to automatically link external online purchases
with Facebook profiles. Intended to improve the personalisation of advertising, what
this service inadvertently did was to expose sensitive private characteristics such as,
for example, sexual orientation, or information about items that had been purchased
as gifts. The service was terminated after being the focus of litigation [21]. Using
readily available Big Data can create unanticipated consequences and ethical issues,
and associations with high profile media stories about the risks of data sharing and
access may run the risk of impacting on well-thought out, health-related uses of
similar data.
From the literature reviewed, five key areas of consideration around ethical
issues were identified, and these will be briefly outlined [6].

1.3.2.1 Informed Consent

Traditionally, informed consent for data to be used in research related to clear and
unambiguous consent for the collection of specific data for use in specific, or at
least clearly related, research studies. This is not suitable for Big Data applications



where vast amounts of novel and routine data are collected, often with the express
purpose of creatively identifying surprising or novel associations in the vast and
interconnected datasets. This means that the very concept of informed consent may
be difficult to apply to Big Data research. The certainty and singular approach to
traditional consent must be adapted to work in Big Data research. Clear tension
exists between being able to utilise data for Big Data analysis, and the inability in
most cases to get explicit informed consent for every possible future use of these
data. This is particularly salient as the data used for these analyses are often collected routinely, in huge amounts, and used in analyses that could not have been
envisaged when the data were collected. Although it is well beyond the scope of
this chapter to resolve this issue, it is a key consideration for the remainder of this
book. There are many ways this may be addressed, either pragmatically (by considering the data sources as having an altruistic interest in their data being used for
the public good, or the decision about the use of data being made by identified,
impartial third parties) or substantively (for example, by requiring participants to
opt out of data sharing). These issues are not simply resolved by ‘de-identifying’
data, as illustrated in relation to privacy concerns.

1.3.2.2 Privacy

More routine and personal data are being collected automatically and anonymously.
This is often done with little awareness, on the part of the people on whom these
data are being collected, of the extent and scope of information that is reasonably
easily available for scraping and using. This is a key characteristic of the Big Data
age [6]
and in contrast with research data historically, which tended to focus on discrete and
obvious measurements. Privacy issues in Big Data frequently concern confidentiality, given the ease with which linking data sets can reveal associations, and
potentially identity. This means that simple anonymization of data is very difficult
to attain and impossible to assure. Additionally, it is clear that harm can occur not
only at the individual, but also at a population level from data collected, through
stigmatisation or discrimination. To assume that anonymization and use of data at
the population level is an acceptable way to avoid requiring consent is problematic.
What is key for things like social media analysis, is the concern that just because
things occur in public does not mean that they should be viewed as freely accessible. Nor should it be assumed that the individuals on whom the data have been
collected are able to understand how easily and widely these data can be accessed
and used. The fact that data are now being stored for much longer is a related
concern. The length of time for which these personal data are being kept increases
the potential risk that data privacy may be violated. A real tension exists between
overly restrictive regulations and procedures that could prevent useful and helpful
research, and too open access and sharing of data that may too easily be used to
discriminate or cause other harms.
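The weakness of simple de-identification can be sketched in a few lines of Python. The records, the public register, and all field names below are entirely hypothetical; the point is only that a join on quasi-identifiers (postcode, birth year, sex) can defeat the removal of names.

```python
# Hypothetical "de-identified" medical records: names removed, but
# quasi-identifiers retained.
deidentified_records = [
    {"postcode": "OX1", "birth_year": 1975, "sex": "F", "diagnosis": "T2 diabetes"},
    {"postcode": "OX2", "birth_year": 1982, "sex": "M", "diagnosis": "asthma"},
]

# A hypothetical openly published list (e.g. a membership register)
# that carries names alongside the same quasi-identifiers.
public_register = [
    {"name": "A. Jones", "postcode": "OX1", "birth_year": 1975, "sex": "F"},
    {"name": "B. Smith", "postcode": "OX4", "birth_year": 1990, "sex": "M"},
]

def reidentify(medical, register):
    """Link records on the quasi-identifier triple (postcode, birth_year, sex)."""
    matches = []
    for m in medical:
        key = (m["postcode"], m["birth_year"], m["sex"])
        for r in register:
            if (r["postcode"], r["birth_year"], r["sex"]) == key:
                matches.append((r["name"], m["diagnosis"]))
    return matches

# The first "anonymous" record is re-identified by the join.
```

Real linkage attacks operate on far larger datasets, but the mechanism is exactly this join.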


1 Introduction—Improving Healthcare with Big Data

1.3.2.3 Ownership

Data ownership is already a complex issue, but when large and interrelated datasets are shared, and the collection, analysis and publication of results of data amalgamation happen in a shared space, the question of who controls the data becomes even more difficult. How data are redistributed, and who can make changes or conduct analysis, are complex issues in Big Data analysis that need to be resolved. It is very difficult for individuals to control what is done with their data, and it is widely accepted that there need to be some controls over, for example, third parties benefiting commercially from data that have been collected outside of the agency or knowledge of those providing the data. Open access to one's own data has also been widely discussed as important. However, this does carry risks: direct access to raw data could lead to misconceptions, misunderstandings or flawed analyses that create incorrect conclusions if the data are handled and analysed by individuals who lack the knowledge or skills to do this rigorously, or to interpret the results correctly.

1.3.2.4 Epistemology and Objectivity

Although expert knowledge and skills have always meant that understanding data and its outputs is difficult, as the vastness and complexity of datasets increase, the means of analysing these data have also changed, usually making it necessary to apply machine learning techniques [2]. This inverts the usual approach in science: from specific, hypothesis-driven statistical tests, we have moved into the arena of complex machine-learning algorithms which process vast datasets to create analyses and conclusions that may be well beyond the understanding of those who are processing the data. It is key that these findings are viewed as new hypotheses about empirical relationships rather than clear predictions about behaviour or outcomes [2].
A related issue is that of objectivity. Because the outputs of Big Data come from so many varied and vast datasets, there is a tendency to assume that they are ‘objective’. However, as with all analyses, the methods, the questions asked, and the analysis decisions made are all driven by positions and choices, meaning that there is a great deal of subjective influence over both the data and the outputs and analyses. Since much of these data come from routinely collected sources, often for other purposes, and the collection and analyses may become routine and automated, the lack of quality and consistency checks may lead to the dangerous position of not questioning the validity of conclusions drawn from varied and unchecked data. For example, a review of electronic health records estimated that, although analyses may highlight key issues in patient care, from 4.3 to 86% of data are missing, incomplete, or inaccurate [22].


F. van Loggerenberg et al.

1.3.2.5 Big Data ‘Divides’

The collection and analysis of Big Data place new and considerable technical and resource demands on organisations, meaning that the number of organisations equipped to deal with these challenges is limited. This is particularly salient when considering whether individuals will have the means or rights to access their own data, or have a say about how these are used by a few large data organisations. Those who simply choose to opt out of personal data collection by definition become invisible and unrepresented in the datasets, which can create another key Big Data divide. In addition, communities that are not able to implement electronic health records will also not benefit from any insights that could be generated through the analysis of the data collected on them.

1.4 Conclusion and Structure of the Book

As with many new and promising technologies or methods, the risk exists of viewing Big Data as overly beneficial and applicable to all areas of science and human behaviour. The risk that the size and variety of the data included in Big Data analyses leads to a sense that these analyses are all ‘objective’ and value free, or most likely to discover ‘truths’, needs to be taken into account. This very brief outline of some of the opportunities and challenges of Big Data, especially the ethical issues, sets out some of the key concerns that need to be addressed or investigated as this area of research develops. It is envisaged that, as common standards become established and the numerous technical, analytical and ethical challenges are addressed, Big Data in health should contribute significantly to a more personalised approach to medicine, and smarter, adaptive health strategies [1]. This would be the second wave of Big Data.
This short book is organized as follows. Chapter 2 describes the concept of data science and analytics; some good examples of using data science methods are also described briefly. Chapter 3 explains the elements of Big Data, illustrating its five components. Chapter 4 describes a real-world implementation of a Big Data analytics system, covering many real-world challenges and solutions in low- and middle-income countries (LMICs), and illustrates the benefits of the approach for patients, healthcare settings, healthcare authorities and companies that manufacture healthcare devices (especially point-of-care devices). Finally, Chap. 5 describes a case of social media data mining during the Ebola outbreak and presents the valuable insights that can be extracted from social media.


References
1. Koutkias, V., Thiessard, F.: Big data—smart health strategies. Findings from the yearbook
2014 special theme. Yearb. Med. Inform. 9, 48–51 (2014)

2. Hansen, M.M., Miron-Shatz, T., Lau, A.Y., et al.: Big data in science and healthcare: a review
of recent literature and perspectives. Contribution of the IMIA Social Media Working
Group. Yearb. Med. Inform. 9, 21–26 (2014)
3. Peek, N., Holmes, J.H., Sun, J.: Technical challenges for big data in biomedicine and health:
data sources, infrastructure, and analytics. Yearb. Med. Inform. 9, 42–47 (2014)
4. Raghupathi, W., Raghupathi, V.: Big data analytics in healthcare: promise and potential.
Health Inf. Sci. Syst. 2, 3 (2014)
5. Amirian, P., Lang, T., Van Loggerenberg, F.: Geospatial big data for finding useful insights
from machine data. GIS Research UK (2014)
6. Mittelstadt, B.D., Floridi, L.: The ethics of big data: current and foreseeable issues in
biomedical contexts. Sci. Eng. Ethics 22(2), 303–341 (2016)
7. Ginsberg, J., Mohebbi, M.H., Patel, R.S., et al.: Detecting influenza epidemics using search
engine query data. Nature 457(7232), 1012–1014 (2009)
8. Cohen, H.: Social media definitions. (2011)
9. Whiting, A., Williams, D.: Why people use social media: a uses and gratifications approach.
Qual. Market Res. Int. J. 16(4), 362–369 (2013)
10. Statista: Number of worldwide social network users 2010–2018. statista.com/statistics/278414/number-of-worldwide-social-network-users/ (2016)
11. Statista: Leading social networks worldwide as of April 2016, ranked by number of active
users. (2016)
12. Google: Google Flu Trends. (2016)
13. Finfgeld-Connett, D.: Twitter and health science research. West. J. Nurs. Res. 37(10), 1269–1283 (2015)
14. Ross, M.K., Wei, W., Ohno-Machado, L.: “Big data” and the electronic health record. Yearb.
Med. Inform. 9, 97–104 (2014)
15. Wyber, R., Vaillancourt, S., Perry, W., et al.: Big data in global health: improving health in
low- and middle-income countries. Bull. World Health Organ. 93(3), 203–208 (2015)
16. Asongu, S.A., Nwachukwu, J.C.: The role of governance in mobile phones for inclusive
human development in Sub-Saharan Africa. Technovation
17. UN: United Nations Global Pulse. United Nations (2016)
18. UN: United Nations Global Pulse Projects. United Nations (2016)
19. Hay, S.I., George, D.B., Moyes, C.L., et al.: Big data opportunities for global infectious
disease surveillance. PLoS Med. 10(4), e1001413 (2013)
20. Lang, T., Siribaddana, S.: Clinical trials have gone global: is this a good thing? PLoS Med.
9(6), e1001228 (2012)
21. Welsh, K., Cruz, L.: The danger of big data: social media as computational social science.
First Monday 17(7), 1 (2012)
22. Balas, E.A., Vernon, M., Magrabi, F., et al.: Big data clinical research: validity, ethics, and
regulation. Stud. Health Technol. Inform. 216, 448–452 (2015)


Chapter 2

Data Science and Analytics

Pouria Amirian, Francois van Loggerenberg and Trudie Lang

2.1 What Is Data Science?

Thanks to advancements in sensing, computation and communication technologies, data are generated and collected at unprecedented scale and speed. Virtually every aspect of many businesses is now open to data collection: operations, manufacturing, supply chain management, customer behavior, marketing, workflow procedures and so on. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data, and in data-driven decision making. Data Science is the science and art of using computational methods to identify and discover influential patterns in data. The goal of Data Science is to gain insight from data and often to affect decisions to make them more reliable [1]. Data is necessarily a measure of historic information so, by definition, Data Science examines historic data. However, the data used in Data Science may have been collected a few years or a few milliseconds ago, continuously or in a one-off process. A Data Science procedure can therefore be based on real-time or near real-time data collection.
The term Data Science arose in large part due to advancements in computational methods, especially new or improved methods in machine learning, artificial intelligence and pattern recognition. In addition, with the increase in computational capacity brought by cloud computing and distributed computational models, using data to extract useful information, even at large volume, is more
P. Amirian · F. van Loggerenberg · T. Lang
University of Oxford, Oxford, UK

© The Editors and Authors 2017
P. Amirian et al. (eds.), Big Data in Healthcare, SpringerBriefs in Pharmaceutical Science & Drug Development, DOI 10.1007/978-3-319-62990-2_2


P. Amirian et al.

affordable. Nevertheless, the ideas behind Data Science are not new at all but have
been represented by different terms throughout the decades, including data mining,
data analysis, pattern recognition, statistical learning, knowledge discovery and
cybernetics.
As a recent phenomenon, the rise of Data Science is pragmatic. Virtually every aspect of many organizations is now open to data collection and often even instrumented for data collection. At the same time, information is now widely available on external events such as trends, news, and movements. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data (Data Science) and in data-driven decision making [2]. With the availability of relevant data and technologies, decision-making procedures that were previously based on experience, guesswork or constrained models of reality can now be based on data and data products. In other words, as organizations collect more data and begin to summarize and analyze it, there is a natural progression toward using the data to scientifically improve approximations, estimates, forecasts, decisions, and ultimately, efficiency and productivity.

2.2 Methods in Data Science

Data Science is the process of discovering interesting and meaningful patterns in data using computational analytics methods. Analytical methods in Data Science are drawn from several related disciplines, some of which, including statistics, have been used to discover patterns and trends in data for more than 100 years. Figure 2.1 shows some of the disciplines related to Data Science.
The most important characteristic of methods in Data Science is that they are data driven: they try to find hidden and hopefully useful patterns which are not based on assumptions made by the data collection procedures or by the analysts. In other words, methods in Data Science mostly explore hidden patterns in data rather than confirm hypotheses set by data analysts. The data-driven algorithms induce models from the data. In modern Data Science methods, the induction process can include identification of the variables to be included in the model, the parameters that define the model, the weights or coefficients in the model, or the model complexity.
Despite the large number of specific Data Science methods developed over the years, there are only a handful of fundamentally different types of analytical task that these methods address. In general, these analytical tasks can be classified as supervised or unsupervised learning. Supervised learning involves building a model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy. With unsupervised learning, there are inputs but no supervising output; nevertheless, we can learn relationships and structure from such data [3]. The following sections first introduce the


2 Data Science and Analytics

Fig. 2.1 Methods in Data Science are drawn from many disciplines

concept of supervised and unsupervised learning in more depth, and then give a brief description of the major analytical tasks in Data Science.

2.2.1 Supervised and Unsupervised Learning

Algorithms and methods in Data Science try to learn from data. Most of the time, data need to be in a certain shape or structure in order to be used in a Data Science method. Mathematically speaking, data usually need to be in the form of a matrix. Rows (records) in the matrix represent data points or observations, and columns represent the values of various attributes of an observation. In many Data Science problems the number of rows is higher than the number of attributes; however, it is quite common to see a higher number of attributes in problems like gene sequencing and sentiment analysis. In some problems one attribute is called the target variable, since the Data Science method tries to find a function for estimating the target variable based on the other variables in the data. The target variable is also called the response, dependent variable, label, output or outcome. In this case the other attributes in the data are called independent variables, predictors, features or inputs [4].
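As a rough sketch, with entirely invented attribute values, shaping such records into the matrix form described above might look like this in Python: one row per observation, one column per attribute, with the target variable split out separately.

```python
# Hypothetical trial records (values invented for illustration).
records = [
    {"age": 54, "bmi": 27.1, "years_education": 12, "responded": 1},
    {"age": 61, "bmi": 31.4, "years_education": 16, "responded": 0},
    {"age": 47, "bmi": 24.8, "years_education": 10, "responded": 1},
]

features = ["age", "bmi", "years_education"]      # independent variables / inputs
X = [[r[f] for f in features] for r in records]   # matrix: one row per observation
y = [r["responded"] for r in records]             # target variable / label
```

In practice X and y would be large numeric arrays, but the row/column/target structure is the same.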
Algorithms for Data Science are often divided into two groups: supervised learning methods and unsupervised learning methods. Suppose a dataset is collected in a controlled trial. Data in this dataset consist of attributes like id, age,


sex, BMI, lifestyle, years of education, income, number of children, and response to a drug. Consider two similar questions one might ask about the health condition of a sample of patients. The first is: “Do the patients naturally fall into different groups?” Here no specific purpose or target has been specified for the grouping. When there is no such target, the data science problem is referred to as unsupervised learning. Contrast this with a slightly different question: “Can we find groups of patients who have particularly high likelihoods of positive response for a certain drug?” Here there is a specific target defined: will a newly admitted patient (who did not take part in the trial) respond to a certain drug? In this case, segmentation is being done for a specific reason: to take action based on the likelihood of response to the drug. In other words, response to the drug is the target variable in this problem, and a specific Data Science task tries to find the attributes which have an impact on the target variable and, more importantly, their importance in predicting the target value. This is called a supervised learning problem.
In supervised learning problems, the supervisor is the target variable, and the goal is to predict the target variable from the other attributes in the data. The target variable is chosen to represent the answer to a question an analyst or an organization would like to answer. In order to build a supervised learning model, the dataset needs to contain the target variable as well as the other attributes. After the model is created based on existing data, it can be used for predicting target values for a dataset without target variables; that is why supervised learning is also sometimes called predictive modeling. The primary predictive modeling algorithms are classification, for categorical target variables (like yes/no), and regression, for continuous target variables (numeric values). Examples of target variables include whether a patient responded to a certain drug (yes/no), the amount of a treatment (120 mg, 250 mg, etc.), whether a tumor increased in size within 6 months (yes/no) and the probability of an increase in tumor size (0–100%).
In unsupervised learning, the model has no target variable. The inputs are analyzed and grouped or clustered based on the proximity or similarity of input values to one another. Each group or cluster is given a label to indicate which group a record belongs to.
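The contrast between the two settings can be sketched on toy data. All values below are invented, and the 1-nearest-neighbour rule and hand-picked BMI split stand in for real supervised and unsupervised methods only to make the distinction concrete.

```python
# Hypothetical patients: "responds" is the target variable (supervised case).
patients = [
    {"age": 34, "bmi": 22.0, "responds": 1},
    {"age": 58, "bmi": 31.5, "responds": 0},
    {"age": 41, "bmi": 24.2, "responds": 1},
    {"age": 63, "bmi": 33.0, "responds": 0},
]

def predict_response(new, data):
    """Supervised: predict the target as the label of the closest
    patient in (age, BMI) space — a 1-nearest-neighbour rule."""
    nearest = min(data, key=lambda p: (p["age"] - new["age"]) ** 2
                                      + (p["bmi"] - new["bmi"]) ** 2)
    return nearest["responds"]

def group_by_bmi(data, threshold=28.0):
    """Unsupervised flavour: grouping uses only the inputs, no target.
    (A real clustering method would learn the split; this threshold
    is hand-picked purely for illustration.)"""
    return [0 if p["bmi"] < threshold else 1 for p in data]

new_patient = {"age": 38, "bmi": 23.5}
```

Here `predict_response(new_patient, patients)` answers the supervised question for a new individual, while `group_by_bmi(patients)` segments the existing patients without reference to any target.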

2.2.2 Data Science Analytical Tasks

In addition to the typical statistical analysis tasks (like causal modelling) in the
context of healthcare, there are several analytical tasks in healthcare from a Data
Science point of view. The analytical tasks can be categorized as regression,
classification, clustering, similarity matching (recommender systems), profiling,
simulation and content analysis.
Regression tries to estimate or predict a target value for numerical variables. An
example regression question would be: “How much will a given customer use the
health insurance service?” The target variable to be predicted here is health
insurance service usage, and a model could be generated by looking at other, similar




individuals in the population (from the point of view of health conditions and records). A regression procedure produces a model that, given a set of inputs, estimates the value of the particular variable specific to that individual.
While regression algorithms are used to predict target variables with numerical outcomes, classification algorithms are utilized for predicting target variables with finite categories (classes). Classification and class probability estimation attempt to predict, for each individual in a population, which of a set of classes the individual belongs to. Usually the classes are mutually exclusive. An example classification question would be: “Among all the participants in a particular trial, which are likely to respond to a given drug?” In this example the two classes could be called “will respond” (or positive) and “will not respond” (or negative). For a classification task, the Data Science procedure produces a model that, given a new individual, determines which class that individual belongs to. A closely related task is scoring, or class probability estimation. A scoring model applies to an individual and produces a score representing the probability that the individual belongs to each class. In the trial, a scoring model would be able to evaluate each individual participant and produce a score of how likely each is to respond to the drug. Both regression and classification algorithms are used for solving supervised learning problems, meaning that the data need to have target variables before the model building process begins. Regression is to some extent similar to classification, but the two are different: informally, classification predicts whether something will happen, whereas regression predicts how much of something will happen. Classification and regression form the core of predictive analytics. Much work is now focusing on predictive analytics, especially in clinical settings, attempting to optimize health and financial outcomes [5].
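Scoring can be sketched with simple empirical frequencies. The trial data and age bands below are hypothetical, and real scoring models estimate probabilities far more carefully; the sketch only shows the relationship between a score and a hard class label.

```python
# Hypothetical trial outcomes: (age band, responded-to-drug flag).
trial = [
    ("under_50", 1), ("under_50", 1), ("under_50", 0),
    ("50_plus", 0), ("50_plus", 0), ("50_plus", 1), ("50_plus", 0),
]

def response_score(band):
    """Score: estimated P(will respond | age band) from the trial data."""
    outcomes = [y for b, y in trial if b == band]
    return sum(outcomes) / len(outcomes)

def classify(band, threshold=0.5):
    """Hard classification derived from the score: 'will respond' vs not."""
    return "will respond" if response_score(band) >= threshold else "will not respond"
```

For these invented numbers, participants under 50 score 2/3 and are classified as likely responders, while the older band scores 1/4 and is not.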
Clustering uses unsupervised learning to group data into distinct clusters or segments. In other words, clustering tries to find natural groupings in the data. An example clustering question would be: “Do the patients form natural groups or segments?” Clustering is useful in preliminary domain exploration to see which natural groups exist, because these groups in turn may suggest other Data Science tasks or approaches. A major difference between clustering and classification problems is that the outcome of clustering is unknown beforehand and needs human interpretation and further processing. In contrast, the outcome of classification for an observation is a membership, or probability of membership, in a certain class.
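The core loop of a clustering method such as k-means can be sketched in one dimension. The BMI values and starting centroids are invented, and a real analysis would use a library implementation over many features; the sketch only shows how groups emerge with no target variable involved.

```python
def kmeans_1d(values, c0, c1, iterations=10):
    """The core k-means loop for k = 2 in one dimension:
    alternate assigning points to the nearest centroid and
    moving each centroid to the mean of its group."""
    for _ in range(iterations):
        g0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return sorted(g0), sorted(g1)

# Hypothetical BMI readings with two natural groups.
bmis = [21.0, 22.5, 23.0, 31.0, 32.5, 34.0]
low, high = kmeans_1d(bmis, c0=21.0, c1=34.0)
```

The algorithm recovers the low-BMI and high-BMI groups without ever being told they exist; deciding what those groups mean is the human-interpretation step described above.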
The fourth type of analytical task in Data Science is similarity matching. Similarity matching attempts to identify similar individuals based on available data, and can be used directly to find similar entities according to given criteria. For example, a health insurance company interested in finding similar individuals, in order to offer them the most efficient insurance policies, could use similarity matching based on data describing the health characteristics of the individuals. Similarity matching is the basis for one of the most popular methods for creating recommendation engines, or recommender systems. Recommendation engines have been used extensively by online retailers like Amazon.com to recommend products based on users’ preferences and historical behavior (browsing behavior and past purchases). The same concepts and techniques can be used for

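The similarity step at the heart of such systems can be sketched with cosine similarity over invented health-profile vectors; real recommender systems add normalisation, feature weighting and much more on top of this core computation.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity: 1.0 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical profiles — columns: age, BMI, visits per year.
profiles = {
    "p1": [34, 22.0, 2],
    "p2": [58, 31.5, 6],
    "p3": [36, 23.1, 3],
}

def most_similar(query, db):
    """Return the key of the profile most similar to the query vector."""
    return max(db, key=lambda k: cosine(query, db[k]))
```

Given a new individual's vector, `most_similar` finds the closest existing profile; a recommender then suggests whatever suited that neighbour.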

Tài liệu bạn tìm kiếm đã sẵn sàng tải về

Tải bản đầy đủ ngay
×