
LNAI 7867

Jiuyong Li Longbing Cao
Can Wang Kay Chen Tan Bo Liu
Jian Pei Vincent S. Tseng (Eds.)

Trends and Applications
in Knowledge Discovery
and Data Mining
PAKDD 2013 International Workshops:
DMApps, DANTH, QIMIE, BDM, CDA, CloudSD
Gold Coast, QLD, Australia, April 2013
Revised Selected Papers


Lecture Notes in Artificial Intelligence
Subseries of Lecture Notes in Computer Science
LNAI Series Editors
Randy Goebel
University of Alberta, Edmonton, Canada
Yuzuru Tanaka
Hokkaido University, Sapporo, Japan
Wolfgang Wahlster
DFKI and Saarland University, Saarbrücken, Germany

LNAI Founding Series Editor
Joerg Siekmann
DFKI and Saarland University, Saarbrücken, Germany

7867

Jiuyong Li Longbing Cao
Can Wang Kay Chen Tan Bo Liu
Jian Pei Vincent S. Tseng (Eds.)

Trends and Applications
in Knowledge Discovery
and Data Mining
PAKDD 2013 International Workshops:
DMApps, DANTH, QIMIE, BDM, CDA, CloudSD
Gold Coast, QLD, Australia, April 14-17, 2013
Revised Selected Papers


Volume Editors
Jiuyong Li
University of South Australia, Adelaide, SA, Australia
Longbing Cao
Can Wang
University of Technology, Sydney, NSW, Australia
Kay Chen Tan
National University of Singapore, Singapore
Bo Liu
Guangdong University of Technology, Guangzhou, China
Jian Pei
Simon Fraser University, Burnaby, BC, Canada
Vincent S. Tseng
National Cheng Kung University, Tainan, Taiwan
ISSN 0302-9743
e-ISSN 1611-3349
ISBN 978-3-642-40318-7
e-ISBN 978-3-642-40319-4
DOI 10.1007/978-3-642-40319-4
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2013944975
CR Subject Classification (1998): H.2.8, I.2, H.3, H.5, H.4, I.5
LNCS Sublibrary: SL 7 – Artificial Intelligence
© Springer-Verlag Berlin Heidelberg 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication
or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location,
in its current version, and permission for use must always be obtained from Springer. Permissions for use
may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution
under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

This volume contains papers presented at PAKDD Workshops 2013, affiliated
with the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD), held on April 14, 2013 on the Gold Coast, Australia. PAKDD has
established itself as the premier event for data mining researchers in the
Pacific-Asia region. The workshops affiliated with PAKDD 2013 were: Data Mining
Applications in Industry and Government (DMApps), Data Analytics for Targeted
Healthcare (DANTH), Quality Issues, Measures of Interestingness and Evaluation
of Data Mining Models (QIMIE), Biologically Inspired Techniques for Data
Mining (BDM), Constraint Discovery and Application (CDA), Cloud Service
Discovery (CloudSD), and Behavior Informatics (BI). This volume collects the
revised papers from the first six workshops. The papers of BI will appear in a
separate volume.
The first six workshops received 92 submissions. All papers were reviewed
by at least two reviewers. In all, 47 papers were accepted for presentation, and
their revised versions are collected in this volume. These papers mainly cover
the applications of data mining in industry, government, and health care. The
papers also cover some fundamental issues in data mining such as interestingness
measures and result evaluation, biologically inspired design, constraint and cloud
service discovery.
These workshops featured five invited speeches by distinguished researchers:
Geoffrey I. Webb (Monash University, Australia), Osmar R. Zaïane (University
of Alberta, Canada), Jian Pei (Simon Fraser University, Canada), Ning Zhong
(Maebashi Institute of Technology, Japan), and Longbing Cao (University of
Technology Sydney, Australia). Their talks covered current challenging issues and
advanced applications in data mining.
The workshops would not have been successful without the support of the authors,
reviewers, and organizers. We thank the many authors for submitting their
research papers to the PAKDD workshops. We thank the successful authors whose
papers are published in this volume for their collaboration in the paper revision
and final submission. We thank all PC members for their timely reviews,
completed to a tight schedule. We also thank the members of the Organizing
Committees for organizing the paper submission, reviews, discussion, feedback and the
final submission. We appreciate the professional service provided by the Springer
LNCS editorial teams, and Mr. Zhong She's assistance in formatting.
June 2013

Jiuyong Li
Longbing Cao
Can Wang
Kay Chen Tan
Bo Liu

Organization

PAKDD Conference Chairs
Hiroshi Motoda, Osaka University, Japan
Longbing Cao, University of Technology, Sydney, Australia

Workshop Chairs
Jiuyong Li, University of South Australia, Australia
Kay Chen Tan, National University of Singapore, Singapore
Bo Liu, Guangdong University of Technology, China

Workshop Proceedings Chair
Can Wang, University of Technology, Sydney, Australia

Organizing Chair
Xinhua Zhu, University of Technology, Sydney, Australia

DMApps Chairs
Warwick Graco, Australian Taxation Office, Australia
Yanchang Zhao, Department of Immigration and Citizenship, Australia
Inna Kolyshkina, Institute of Analytics Professionals of Australia
Clifton Phua, SAS Institute Pte Ltd, Singapore

DANTH Chairs
Yanchun Zhang, Victoria University, Australia
Michael Ng, Hong Kong Baptist University, Hong Kong
Xiaohui Tao, University of Southern Queensland, Australia
Guandong Xu, University of Technology, Sydney, Australia
Yidong Li, Beijing Jiaotong University, China
Hongmin Cai, South China University of Technology, China
Prasanna Desikan, Allina Health, USA
Harleen Kaur, United Nations University, International Institute for Global Health, Malaysia


QIMIE Chairs
Stéphane Lallich, ERIC, Université Lyon 2, France
Philippe Lenca, Lab-STICC, Telecom Bretagne, France

BDM Chairs
Mengjie Zhang, Victoria University of Wellington, New Zealand
Shafiq Alam Burki, University of Auckland, New Zealand
Gillian Dobbie, University of Auckland, New Zealand

CDA Chairs
Chengfei Liu, Swinburne University of Technology, Australia
Jixue Liu, University of South Australia, Australia

CloudSD Chairs
Michael R. Lyu, The Chinese University of Hong Kong, China
Jian Yang, Macquarie University, Australia
Jian Wu, Zhejiang University, China
Zibin Zheng, The Chinese University of Hong Kong, China

Combined Program Committee

Aiello Marco, University of Groningen, The Netherlands
Alípio Jorge, University of Porto, Portugal
Amadeo Napoli, Lorraine Research Laboratory in Computer Science and Its Applications, France
Arturas Mazeika, Max Planck Institute for Informatics, Germany
Asifullah Khan, PIEAS, Pakistan
Bagheri Ebrahim, Ryerson University, Canada
Blanca Vargas-Govea, Monterrey Institute of Technology and Higher Education, Mexico
Bo Yang, University of Electronic Science and Technology of China
Bouguettaya Athman, RMIT, Australia
Bruno Crémilleux, Université de Caen, France
Chaoyi Pang, CSIRO, Australia
David Taniar, Monash University, Australia
Dianhui Wang, La Trobe University, Australia
Emilio Corchado, University of Burgos, Spain
Eng-Yeow Cheu, Institute for Infocomm Research, Singapore


Evan Stubbs, SAS, Australia
Fabien Rico, Université Lyon 2, France
Fabrice Guillet, Université de Nantes, France
Fatos Xhafa, Universitat Politècnica de Catalunya, Barcelona, Spain
Fedja Hadzic, Curtin University, Australia
Feiyue Ye, Jiangsu Teachers University of Technology, China
Ganesh Kumar Venayagamoorthy, Missouri University of Science and Technology, USA
Gang Li, Deakin University, Australia
Gary Weiss, Fordham University, USA
Graham Williams, ATO, Australia
Guangfei Yang, Dalian University of Technology, China
Guoyin Wang, Chongqing University of Posts and Telecommunications, China
Hai Jin, Huazhong University of Science and Technology, China
Hangwei Qian, VMware Inc., USA
Hidenao Abe, Shimane University, Japan
Hong Cheu Liu, University of South Australia, Australia
Ismail Khalil, Johannes Kepler University, Austria
Izabela Szczech, Poznan University of Technology, Poland
Jan Rauch, University of Economics, Prague, Czech Republic
Jérôme Azé, Université Paris-Sud, France
Jean Diatta, Université de la Réunion, France
Jean-Charles Lamirel, LORIA, France
Jeff Tian, Southern Methodist University, USA
Jeffrey Soar, University of Southern Queensland, Australia
Jerzy Stefanowski, Poznan University of Technology, Poland
Ji Wang, National University of Defense Technology, China
Ji Zhang, University of Southern Queensland, Australia
Jianwen Su, UC Santa Barbara, USA
Jianxin Li, Swinburne University of Technology, Australia
Jie Wan, University College Dublin, Ireland
Jierui Xie, Oracle, USA
Jogesh K. Muppala, University of Science and Technology of Hong Kong, Hong Kong
Joo-Chuan Tong, SAP Research, Singapore
José L. Balcázar, Universitat Politècnica de Catalunya, Spain
Julia Belford, University of California, Berkeley, USA
Jun Ma, University of Wollongong, Australia
Junhu Wang, Griffith University, Australia
Kamran Shafi, University of New South Wales, Australia


Kazuyuki Imamura, Maebashi Institute of Technology, Japan
Khalid Saeed, AGH Krakow, Poland
Kitsana Waiyamai, Kasetsart University, Thailand
Kok-Leong Ong, Deakin University, Australia
Komate Amphawan, Burapha University, Thailand
Kouroush Neshatian, University of Canterbury, Christchurch, New Zealand
Kyong-Jin Shim, Singapore Management University
Liang Chen, Zhejiang University, China
Lifang Gu, Australian Taxation Office, Australia
Lin Liu, University of South Australia, Australia
Ling Chen, University of Technology, Sydney, Australia
Xumin Liu, Rochester Institute of Technology, USA
Luis Cavique, Universidade Aberta, Portugal
Martin Holeňa, Academy of Sciences of the Czech Republic
Md Sumon Shahriar, CSIRO ICT Centre, Australia
Michael Hahsler, Southern Methodist University, USA
Michael Sheng, The University of Adelaide, Australia
Mingjian Tang, Department of Human Services, Australia
Mirek Malek, University of Lugano, Switzerland
Mirian Halfeld Ferrari Alves, University of Orleans, France
Mohamed Gaber, University of Portsmouth, UK
Mohd Saberi Mohamad, Universiti Teknologi Malaysia, Malaysia
Mohyuddin Mohyuddin, King Abdullah International Medical Research Center, Saudi Arabia
Motahari-Nezhad Hamid Reza, HP, USA
Neil Yen, The University of Aizu, Japan
Patricia Riddle, University of Auckland, New Zealand
Paul Kwan, University of New England, Australia
Peter Christen, Australian National University, Australia
Peter Dolog, Aalborg University, Denmark
Peter O'Hanlon, Experian, Australia
Philippe Lenca, Telecom Bretagne, France
Qi Yu, Rochester Institute of Technology, USA
Radina Nikolic, British Columbia Institute of Technology, Canada
Redda Alhaj, University of Calgary, Canada
Ricard Gavaldà, Universitat Politècnica de Catalunya, Spain
Richi Nayek, Queensland University of Technology, Australia
Ritu Chauhan, Amity Institute of Biotechnology, India
Ritu Khare, National Institutes of Health, USA
Robert Hilderman, University of Regina, Canada


Robert Stahlbock, University of Hamburg, Germany
Rohan Baxter, Australian Taxation Office, Australia
Ross Gayler, La Trobe University, Australia
Rui Zhou, Swinburne University of Technology, Australia
Sami Bhiri, National University of Ireland, Ireland
Sanjay Chawla, University of Sydney, Australia
Shangguang Wang, Beijing University of Posts and Telecommunications, China
Shanmugasundaram Hariharan, Abdur Rahman University, India
Shusaku Tsumoto, Shimane University, Japan
Sorin Moga, Telecom Bretagne, France
Stéphane Lallich, Université Lyon 2, France
Stephen Chen, York University, Canada
Sy-Yen Kuo, National Taiwan University, Taiwan
Tadashi Dohi, Hiroshima University, Japan
Thanh-Nghi Do, Can Tho University, Vietnam
Ting Yu, University of Sydney, Australia
Tom Osborn, Brandscreen, Australia
Vladimir Estivill-Castro, Griffith University, Australia
Wei Luo, The University of Queensland, Australia
Weifeng Su, United International College, Hong Kong
Xiaobo Zhou, The Methodist Hospital, USA
Xiaoyin Xu, Brigham and Women's Hospital, USA
Xin Wang, University of Calgary, Canada
Xue Li, University of Queensland, Australia
Yan Li, University of Southern Queensland, Australia
Yanchang Zhao, Department of Immigration and Citizenship, Australia
Yanjun Yan, ARCON Corporation, USA
Yin Shan, Department of Human Services, Australia
Yue Xu, Queensland University of Technology, Australia
Yun Sing Koh, University of Auckland, New Zealand
Zbigniew Ras, University of North Carolina at Charlotte, USA
Zhenglu Yang, University of Tokyo, Japan
Zhiang Wu, Nanjing University of Finance and Economics, China
Zhiquan George Zhou, University of Wollongong, Australia
Zhiyong Lu, National Institutes of Health, USA
Zongda Wu, Wenzhou University, China

Table of Contents

Data Mining Applications in Industry and Government

Using Scan-Statistical Correlations for Network Change Analysis
Adriel Cheng and Peter Dickinson

Predicting High Impact Academic Papers Using Citation Network Features
Daniel McNamara, Paul Wong, Peter Christen, and Kee Siong Ng

An OLAP Server for Sensor Networks Using Augmented Statistics Trees
Neil Dunstan

Indirect Information Linkage for OSINT through Authorship Analysis of Aliases
Robert Layton, Charles Perez, Babiga Birregah, Paul Watters, and Marc Lemercier

Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution
Banda Ramadan, Peter Christen, Huizhi Liang, Ross W. Gayler, and David Hawking

Identifying Dominant Economic Sectors and Stock Markets: A Social Network Mining Approach
Ram Babu Roy and Uttam Kumar Sarkar

Ensemble Learning Model for Petroleum Reservoir Characterization: A Case of Feed-Forward Back-Propagation Neural Networks
Fatai Anifowose, Jane Labadin, and Abdulazeez Abdulraheem

Visual Data Mining Methods for Kernel Smoothed Estimates of Cox Processes
David Rohde, Ruth Huang, Jonathan Corcoran, and Gentry White

Real-Time Television ROI Tracking Using Mirrored Experimental Designs
Brendan Kitts, Dyng Au, and Brian Burdick

On the Evaluation of the Homogeneous Ensembles with CV-Passports
Vladimir Nikulin, Aneesha Bakharia, and Tian-Hsiang Huang


Parallel Sentiment Polarity Classification Method with Substring Feature Reduction
Yaowen Zhang, Xiaojun Xiang, Cunyan Yin, and Lin Shang

Identifying Authoritative and Reliable Contents in Community Question Answering with Domain Knowledge
Lifan Guo and Xiaohua Hu

Data Analytics for Targeted Healthcare

On the Application of Multi-class Classification in Physical Therapy Recommendation
Jing Zhang, Douglas Gross, and Osmar R. Zaïane

EEG-MINE: Mining and Understanding Epilepsy Data
SunHee Kim, Christos Faloutsos, and Hyung-Jeong Yang

A Constraint and Rule in an Enhancement of Binary Particle Swarm Optimization to Select Informative Genes for Cancer Classification
Mohd Saberi Mohamad, Sigeru Omatu, Safaai Deris, and Michifumi Yoshioka

Parameter Estimation Using Improved Differential Evolution (IDE) and Bacterial Foraging Algorithm to Model Tyrosine Production in Mus Musculus (Mouse)
Jia Xing Yeoh, Chuii Khim Chong, Yee Wen Choon, Lian En Chai, Safaai Deris, Rosli Md. Illias, and Mohd Saberi Mohamad

Threonine Biosynthesis Pathway Simulation Using IBMDE with Parameter Estimation
Chuii Khim Chong, Mohd Saberi Mohamad, Safaai Deris, Mohd Shahir Shamsir, Yee Wen Choon, and Lian En Chai

A Depression Detection Model Based on Sentiment Analysis in Micro-blog Social Network
Xinyu Wang, Chunhong Zhang, Yang Ji, Li Sun, Leijia Wu, and Zhana Bao

Modelling Gene Networks by a Dynamic Bayesian Network-Based Model with Time Lag Estimation
Lian En Chai, Mohd Saberi Mohamad, Safaai Deris, Chuii Khim Chong, and Yee Wen Choon

Identifying Gene Knockout Strategy Using Bees Hill Flux Balance Analysis (BHFBA) for Improving the Production of Succinic Acid and Glycerol in Saccharomyces cerevisiae
Yee Wen Choon, Mohd Saberi Mohamad, Safaai Deris, Rosli Md. Illias, Lian En Chai, and Chuii Khim Chong


Mining Clinical Process in Order Histories Using Sequential Pattern Mining Approach
Shusaku Tsumoto and Hidenao Abe

Multiclass Prediction for Cancer Microarray Data Using Various Variables Range Selection Based on Random Forest
Kohbalan Moorthy, Mohd Saberi Mohamad, and Safaai Deris

A Hybrid of SVM and SCAD with Group-Specific Tuning Parameters in Identification of Informative Genes and Biological Pathways
Muhammad Faiz Misman, Weng Howe Chan, Mohd Saberi Mohamad, and Safaai Deris

Structured Feature Extraction Using Association Rules
Nan Tian, Yue Xu, Yuefeng Li, and Gabriella Pasi

Quality Issues, Measures of Interestingness and Evaluation of Data Mining Models

Evaluation of Error-Sensitive Attributes
William Wu and Shichao Zhang

Mining Correlated Patterns with Multiple Minimum All-Confidence Thresholds
R. Uday Kiran and Masaru Kitsuregawa

A Novel Proposal for Outlier Detection in High Dimensional Space
Zhana Bao and Wataru Kameyama

CPPG: Efficient Mining of Coverage Patterns Using Projected Pattern Growth Technique
P. Gowtham Srinivas, P. Krishna Reddy, and A.V. Trinath

A Two-Stage Dual Space Reduction Framework for Multi-label Classification
Eakasit Pacharawongsakda and Thanaruk Theeramunkong

Effective Evaluation Measures for Subspace Clustering of Data Streams
Marwan Hassani, Yunsu Kim, Seungjin Choi, and Thomas Seidl

Objectively Evaluating Interestingness Measures for Frequent Itemset Mining
Albrecht Zimmermann


A New Feature Selection and Feature Contrasting Approach Based on Quality Metric: Application to Efficient Classification of Complex Textual Data
Jean-Charles Lamirel, Pascal Cuxac, Aneesh Sreevallabh Chivukula, and Kafil Hajlaoui

Evaluation of Position-Constrained Association-Rule-Based Classification for Tree-Structured Data
Dang Bach Bui, Fedja Hadzic, and Michael Hecker

Enhancing Textual Data Quality in Data Mining: Case Study and Experiences
Yi Feng and Chunhua Ju

Cost-Based Quality Measures in Subgroup Discovery
Rob M. Konijn, Wouter Duivesteijn, Marvin Meeng, and Arno Knobbe

Biological Inspired Techniques for Data Mining

Applying Migrating Birds Optimization to Credit Card Fraud Detection
Ekrem Duman and Ilker Elikucuk

Clustering in Conjunction with Quantum Genetic Algorithm for Relevant Genes Selection for Cancer Microarray Data
Manju Sardana, R.K. Agrawal, and Baljeet Kaur

On the Optimality of Subsets of Features Selected by Heuristic and Hyper-heuristic Approaches
Kourosh Neshatian and Lucianne Varn

A PSO-Based Cost-Sensitive Neural Network for Imbalanced Data Classification
Peng Cao, Dazhe Zhao, and Osmar R. Zaïane

Binary Classification Using Genetic Programming: Evolving Discriminant Functions with Dynamic Thresholds
Jill de Jong and Kourosh Neshatian

Constraint Discovery and Cloud Service Discovery

Incremental Constrained Clustering: A Decision Theoretic Approach
Swapna Raj Prabakara Raj and Balaraman Ravindran

Querying Compressed XML Data
Olfa Arfaoui and Minyar Sassi-Hidri


Mining Approximate Keys Based on Reasoning from XML Data
Liu Yijun, Ye Feiyue, and He Sheng

A Semantic-Based Dual Caching System for Nomadic Web Service
Panpan Han, Liang Chen, and Jian Wu

FTCRank: Ranking Components for Building Highly Reliable Cloud Applications
Hanze Xu, Yanan Xie, Dinglong Duan, Liang Chen, and Jian Wu

Research on SaaS Resource Management Method Oriented to Periodic User Behavior
Jun Guo, Hongle Wu, Hao Huang, Fang Liu, and Bin Zhang

Weight Based Live Migration of Virtual Machines
Baiyou Qiao, Kai Zhang, Yanpeng Guo, Yutong Li, Yuhai Zhao, and Guoren Wang

Author Index

Using Scan-Statistical Correlations
for Network Change Analysis
Adriel Cheng and Peter Dickinson
Command, Control, Communications and Intelligence Division
Defence Science and Technology Organisation, Department of Defence, Australia
{adriel.cheng,peter.dickinson}@dsto.defence.gov.au

Abstract. Network change detection is a common prerequisite for identifying
anomalous behaviours in computer, telecommunication, enterprise and social
networks. Data mining of such networks often focuses on the most significant
change only. However, inspecting large deviations in isolation can cause other
important and associated network behaviours to be overlooked. This paper
proposes that changes within the network graph be examined in conjunction
with one another, by employing correlation analysis to supplement
network-wide change information. Amongst other use-cases for mining network graph
data, the analysis examines whether multiple regions of the network graph exhibit
similar degrees of change, or whether it is anomalous for a local network
change to occur independently. Building upon Scan-Statistics network change
detection, we extend the change detection technique to correlate localised
network changes. Our correlation-inspired techniques have been deployed for
use on various networks internally. Using real-world datasets, we demonstrate
the benefits of our correlation change analysis.
Keywords: Mining graph data, statistical methods for data mining, anomaly
detection.

1 Introduction

Detecting changes in computer, telecommunication, enterprise or social networks is
often the first step towards identifying anomalous activity or suspicious participants
within such networks. In recent times, it has become increasingly common for a
changing network to be sampled at various intervals and represented naturally as a
time-series of graphs [1,2]. In order to uncover network anomalies within these
graphs, the challenge in network change detection lies not only with the type of
network graph changes to observe and how to measure such variations, but also the
subsequent change analysis that is to be conducted.
Scan-statistics [3] is a change detection technique that employs statistical methods
to measure variations in network graph vertices and surrounding vertex
neighbourhood regions. The technique uncovers large localised deviations in
behaviours exhibited by subgraph regions of the network. Traditionally, the
subsequent change analysis focuses solely on local regions with largest deviations.
J. Li et al. (Eds.): PAKDD 2013 Workshops, LNAI 7867, pp. 1–13, 2013.
© Commonwealth of Australia 2013


Although this is useful for tracing network change to potentially anomalous
subgraphs, simply distinguishing which subgraph vertices contributed most to the
network deviation is insufficient.
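As an illustration of the kind of computation described above, the following is a minimal sketch of a scan-statistic style change score. It is not the formulation used in this paper (that is given in Section 3); it assumes that the locality statistic of a vertex is the number of edges inside its k-hop neighbourhood, that each statistic is standardised against a short window of preceding graphs, and that the spread is floored at 1.0 to avoid division by zero. The function names and the `window` and `k` parameters are hypothetical.

```python
from statistics import mean, pstdev


def k_hop_neighbourhood(adj, v, k=1):
    """Return the set of vertices within k hops of v (including v)."""
    seen = {v}
    frontier = {v}
    for _ in range(k):
        frontier = {u for w in frontier for u in adj.get(w, ())} - seen
        seen |= frontier
    return seen


def locality_statistic(edges, v, k=1):
    """Count edges whose endpoints both lie in the k-hop neighbourhood of v."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    region = k_hop_neighbourhood(adj, v, k)
    return sum(1 for a, b in edges if a in region and b in region)


def scan_statistic(graphs, vertices, window=3, k=1):
    """Score the latest graph in a time-series of undirected edge lists.

    Each vertex's locality statistic is standardised against its values in
    the preceding `window` graphs. Returns (max score, vertex, all scores).
    """
    scores = {}
    for v in vertices:
        history = [locality_statistic(g, v, k) for g in graphs[-window - 1:-1]]
        current = locality_statistic(graphs[-1], v, k)
        spread = max(pstdev(history), 1.0)  # assumed floor, avoids dividing by zero
        scores[v] = (current - mean(history)) / spread
    best = max(scores, key=scores.get)
    return scores[best], best, scores
```

The returned dictionary of per-vertex scores is the starting point for the analysis argued for here: it retains every localised deviation, not only the largest one.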
In many instances, the cause of a significant network change is not restricted
to a single vertex or subgraph region; multiple vertices or subgraphs may
experience similar degrees of deviation. Such vertex deviations could be
interrelated, acting collectively with other vertices as the primary cause of the overall
network change. When analysis concentrates solely on the most significantly changed
vertex or subgraph, other localised change behaviours are hidden from examination. This
dominant-change-centric analysis may in fact hinder evaluation of the actual change
scenario experienced by the network.
To examine the network more conclusively, rather than inspect the most deviated
vertex or subgraph, scan-statistic change analysis should characterise the types of
localised changes and their relationships with one another across the entire network.
With this in mind, we extend scan-statistics with correlation based computations
and change analysis. Using our approach, correlations between the edge-connectivity
changes experienced by each pair of network graph vertices (or subgraphs) are
examined. Correlation measurements are also aggregated to describe the correlation
of each vertex (or subgraph) change with all other graph variations, and to assess the
overall correlation of changes experienced by the network as a whole.
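The three levels of correlation described above (pairwise, per vertex, and whole network) can be sketched as follows. This is an illustrative example only: it assumes each vertex is summarised by a time-series of edge counts, uses the Pearson coefficient for each pair, and aggregates by a plain mean, which is an assumption of this sketch and not necessarily the aggregation used in the paper. The names `pearson` and `correlation_summary` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / sqrt(vx * vy)


def correlation_summary(series):
    """series maps each vertex to its edge-count values over time.

    Returns (pairwise correlations, per-vertex mean correlation,
    network-wide mean correlation). Requires at least two vertices.
    """
    names = sorted(series)
    pair = {
        (u, v): pearson(series[u], series[v])
        for i, u in enumerate(names)
        for v in names[i + 1:]
    }
    vertex_level = {}
    for u in names:
        values = [r for (a, b), r in pair.items() if u in (a, b)]
        vertex_level[u] = sum(values) / len(values)
    graph_level = sum(pair.values()) / len(pair)
    return pair, vertex_level, graph_level
```

A vertex-level value near 1 suggests that the vertex changes in step with the rest of the network, while a value near 0 suggests that its changes are largely independent.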
The goal in supporting scan-statistical change detection with our
correlation-based analysis is to seek out and characterise any relational patterns in the localised
change behaviours exhibited by vertices and subgraphs. For instance, if a significant
network change is detected and attributed to a particular vertex, do any other vertices
in the network show similar deviations in behaviour? If so, how many vertices are
considered similar? Do the majority of the vertices experience similar changes, or are
these localised changes independent and not related to other regions of the network?
Accounting for correlations between localised vertex or subgraph variations
provides further context into the possible scenarios triggering such network
deviations. For example, if localised changes in the majority of vertices are highly
correlated with one another, this could imply a scenario whereby a network-wide
initialisation or re-configuration took place. In a social network, such highly
correlated increases in edge-linkages may correspond to a common holiday or
festive event, whereby individuals send and receive greetings across the network
within the same time period. Alternatively, if the communication links (and traffic)
of a monitored network of terrorist suspects intensify as a group, this could signal an
impending attack.
On the other hand, a localised vertex or subgraph change which is uncorrelated to
other members of the graph may indicate a command-control network scenario. In
this case, any excessive change in network edge-connectivity would be largely
localised to the single command vertex. Another example could involve the failure of
a domain name system (DNS) server or a server under a denial-of-service attack. In
this scenario, re-routing of traffic from the failed server to an alternative server would
take place. The activity changes at these two server vertices would be highly localised
and not correlated to the remainder of the network.
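The distinction between collective and isolated change scenarios can be made concrete with a small sketch. Given pairwise change correlations and a vertex flagged by change detection, it reports how many other vertices changed in a similar way. The thresholds (0.8 for "similar", 0.5 for "majority"), the three labels, and the function name are all assumptions made for illustration; the paper does not prescribe them.

```python
def change_context(pair_corr, flagged, vertices, threshold=0.8, majority=0.5):
    """Label a flagged vertex change by how widely it is shared.

    pair_corr maps unordered vertex pairs, stored as (u, v) tuples, to the
    correlation of their change series. Missing pairs count as 0.0.
    """
    others = [v for v in vertices if v != flagged]
    similar = [
        v for v in others
        if pair_corr.get((flagged, v), pair_corr.get((v, flagged), 0.0)) >= threshold
    ]
    share = len(similar) / len(others)
    if share > majority:
        label = "network-wide"  # e.g. a re-configuration or a common event
    elif similar:
        label = "group"         # a subset of vertices changing together
    else:
        label = "isolated"      # e.g. a single command vertex or failed server
    return label, similar
```

Under these assumed thresholds, a flagged vertex whose change correlates with most of the network is reported as "network-wide", whereas one with no similar vertices is reported as "isolated", matching the two families of scenarios discussed above.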


To the best of our knowledge, examining scan-statistical correlations of network
graphs in support of further change analysis has not been previously explored. Hence,
the contributions of this paper are two-fold: first, to extend scan-statistics network
change detection with correlation analysis at multiple levels of the network graph;
and second, to facilitate visualisation of vertex clusters and reveal interrelated groups
of vertices whose collective behaviour requires further investigation.
The remainder of this paper is organised as follows. Related work is discussed next. Section 3
gives a brief overview of scan-statistics. Sections 4 to 6 describe the correlation
extensions and correlation-inspired change analysis. This is followed by experiments
demonstrating the practicality of our methods before the paper concludes in Section 8.

2

Related Work

Our correlation-based change analysis bears the closest resemblance to the anomaly
event detection work of Akoglu and Faloutsos [4]. Both our technique and that of
Akoglu and Faloutsos employ Pearson's ρ correlation matrix manipulations. The
aggregation methods to compute vertex and graph level correlation values are also
similar. However, Akoglu and Faloutsos consider only correlations between vertex
in/out degrees, whereas our method can be adapted to examine other vertex-induced
k-hop subgraph correlations as well, e.g. diameter, number of triangles,
centrality, or other traffic distribution subgraph metrics [2,9]. The other key
difference between our methods lies in their intended applications.
Whilst their method exposes significant graph-wide deviations, employing
correlation solely for change detection suffers from some shortcomings. Besides
detecting change across the majority of network nodes, we are also interested in other
types of network changes, such as anomalous deviations in the behaviour of a few (or
a single) dominant vertices. In this sense, our approach is not to deploy correlations for
network change detection directly, but to aid existing change detection methods and
extend subsequent change analysis.
In another related paper from Akoglu and Dalvi [5], anomaly and change detection
using similar correlation methods from [4] is described. However, their technique is
formalised and designated for detecting ‘Eigenbehavior’ based changes only. In
comparison, our methods are general in nature and not restricted to any particular type
of network change or correlation outcome.
Another relevant paper is the Eigenspace-inspired anomaly detection work of Ide and
Kashima [6] for web-based computer systems. Both our approach and [6]
follow similar procedural steps. However, whilst our correlation method involves a graph
adjacency matrix populated and aggregated with simple correlation computations,
the technique in [6] employs graph dependency matrix values directly and consists of
complex Eigenvector manipulations.
Other research related to our work comes from Ahmed and Clark [7], and
Fukuda et al. [8]. These papers describe change detection and correlation techniques that share
a similar philosophy with our methods. However, their underlying change detection and
correlation methods, along with the type of network data, differ from our approach.

A. Cheng and P. Dickinson

The remaining schemes akin to our correlation methodology are captured by the
MetricForensics tool [9]. Our technique spans multiple levels of the network graph; in
contrast, MetricForensics applies correlation analysis exclusively at a global level.

3

Scan-Statistics

This section summarises the scan-statistics method. For a full treatment of
scan-statistics, we refer the reader to [3]. Scan-statistics is a change detection technique
that applies statistical analysis to sub-regions of a network graph. Statistical analysis
is performed on graph vertices and vertex-induced subgraphs in order to measure
local changes across a time-series of graphs. Whenever the network undergoes
significant global change, scan-statistics detects and identifies the network vertices (or
subgraphs) which exhibited the greatest deviation from prior network behaviours.
In scan-statistics, local graph elements are denoted by their vertex-induced k-hop
subgraph regions. For every k-hop subgraph region, a vertex-standardized locality
statistic is measured. In order to monitor changes experienced by these
subgraph regions, their locality statistics are measured for every graph throughout the
time-series of network graphs. The vertex-standardized statistic $\tilde{\Psi}$ is:

$$\tilde{\Psi}_{k,t}(v) = \frac{\Psi_{k,t}(v) - \hat{\mu}_{k,t,\tau}(v)}{\max\left(\hat{\sigma}_{k,t,\tau}(v),\, 1\right)} \tag{1}$$

where k is the number of hops (edges) from vertex v used to create the induced subgraph, v
is the vertex from which the subgraph is induced, t is the time index of the graph in the
time-series, τ is the number (window) of previous graphs in the time-series to
evaluate against the current graph at t, Ψ is the local statistic that provides some
measurement of behavioural change exhibited by v, and μ̂ and σ̂ are the mean and
standard deviation of Ψ over that window.
The vertex-standardized locality statistic equation (1) above is interpreted as
follows. For the network graph at time t, and for each k-hop subgraph induced by a
vertex v, equation (1) measures how far the local statistic Ψ lies from its recent mean,
in terms of the number of standard deviations of its prior variations.
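
As an illustration of equation (1), the following minimal sketch standardizes one vertex's current statistic against its recent window. The function name `standardized_statistic` and the use of the population standard deviation are assumptions made for this example; the text does not specify which estimator is used.

```python
from statistics import mean, pstdev

def standardized_statistic(history, current):
    """Sketch of equation (1) for a single vertex.

    history: raw statistic Psi for the tau previous graphs (t - tau .. t - 1)
    current: raw statistic Psi for the graph at time t
    """
    mu = mean(history)
    # Assumption: population standard deviation over the window.
    sigma = pstdev(history)
    # The max(., 1) floor keeps a near-constant history from inflating the score.
    return (current - mu) / max(sigma, 1.0)

# Hypothetical window of tau = 4 previous values, then a jump at time t.
score = standardized_statistic([4, 6, 4, 6], 9)
```

With a constant history the estimated deviation is zero, so the floor of 1 in the denominator reduces the score to the plain difference from the mean.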
With the aid of equation (1), scan-statistics detects any subgraph regions whose
chosen behavioural characteristic Ψ deviated significantly from their recent history. By
applying equation (1) iteratively to every vertex-induced subgraph, scan-statistics
uncovers local regions within the network that exhibit the greatest deviations from
their expected behaviours.
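
The raw locality statistic Ψ is left open in the description above. As one possible choice, the sketch below counts the edges in the subgraph induced by the k-hop neighbourhood of each vertex. The edge-count statistic, the undirected adjacency-map representation, and all function names are assumptions made for illustration only.

```python
from collections import deque

def k_hop_region(adj, v, k):
    """Return the set of vertices within k hops of v (v included), via BFS."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k hops
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def edge_count_statistic(adj, v, k):
    """Assumed choice of Psi: edges in the subgraph induced by v's k-hop region."""
    region = k_hop_region(adj, v, k)
    # adj is assumed symmetric with no self-loops, so every undirected edge
    # is seen once from each endpoint.
    twice_edges = sum(1 for u in region for w in adj.get(u, ()) if w in region)
    return twice_edges // 2

def scan_graph(adj, k):
    """Raw locality statistic Psi_k(v) for every vertex of one graph."""
    return {v: edge_count_statistic(adj, v, k) for v in adj}

# Hypothetical four-vertex graph: a triangle a-b-c with a pendant vertex d.
example = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
raw = scan_graph(example, 1)
```

Running `scan_graph` on each graph in the time-series gives the per-vertex values that equation (1) then standardizes.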
With scan-statistics change detection, the subsequent change analysis typically
focuses only on the individual vertices (or subgraphs) that exhibited the greatest deviation
from their expected prior behaviours. Our scan-statistical correlation method bridges
this gap by examining all regions of change within the network and their relationships
with one another using correlation analysis.

4

Scan-Statistical Correlations

In order to examine correlations between local network changes uncovered by
scan-statistics, we use Pearson's ρ correlation. We examine and quantify possible
relationships in local behavioural changes between every pair of vertex-induced k-hop
subgraphs in the network. For every pair of subgraphs induced by vertices v1 and v2, we
extend scan-statistics with correlation computations using Pearson's ρ equation:

$$\rho_{k,t,\tau}(v_1, v_2) = \frac{\sum_{t'=t-\tau}^{t-1} \left(\Psi_{k,t'}(v_1) - \hat{\mu}_{k,t,\tau}(v_1)\right)\left(\Psi_{k,t'}(v_2) - \hat{\mu}_{k,t,\tau}(v_2)\right)}{\sqrt{\sum_{t'=t-\tau}^{t-1} \left(\Psi_{k,t'}(v_1) - \hat{\mu}_{k,t,\tau}(v_1)\right)^2}\;\sqrt{\sum_{t'=t-\tau}^{t-1} \left(\Psi_{k,t'}(v_2) - \hat{\mu}_{k,t,\tau}(v_2)\right)^2}} \tag{2}$$

where k, v, t, τ, Ψ, and μ̂ are defined as in (1).
The scan-statistical correlation scheme is outlined in Fig. 1. For every network
graph in the time-series, correlations between local vertex (or subgraph) changes are
computed according to corresponding vertex behaviours from the recent historical
window τ of time-series graphs. The raw correlation data are then populated into an
n×n matrix, where n is the number of vertices in the network graph. This matrix provides a
simple assessment of positive, low, or possibly opposite (negative) correlations in change behaviours.

[Figure 1 (diagram not reproduced): a sliding window along the time-series of graphs; within each local window, correlations between vertex statistics are computed, giving a correlation matrix for every graph along the time-series.]

Fig. 1. Correlation is computed for each time-series graph and populated into a matrix
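
The correlation step can be sketched in a few lines. The example below assumes each vertex already has a list of its Ψ values over the last τ graphs. The names `pearson` and `correlation_matrix`, the nested-dictionary matrix layout, and the handling of constant windows are assumptions made for this sketch, not details taken from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's rho between two equal-length windows of locality statistics."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        # Assumption: a constant window has undefined rho; report it as 0.
        return 0.0
    return cov / (sx * sy)

def correlation_matrix(windows):
    """windows maps vertex -> list of Psi values over the last tau graphs.

    Returns an n x n matrix as nested dictionaries: matrix[u][v] = rho(u, v).
    """
    vertices = sorted(windows)
    return {u: {v: pearson(windows[u], windows[v]) for v in vertices}
            for u in vertices}

# Hypothetical windows with tau = 4 for three vertices.
matrix = correlation_matrix({
    "v1": [1, 2, 3, 4],
    "v2": [2, 4, 6, 8],   # moves with v1
    "v3": [4, 3, 2, 1],   # moves against v1
})
```

Recomputing this matrix for each graph as the window slides along the time-series yields one matrix per time step.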

5

Multi-level Correlations Analysis

5.1

Aggregation of Correlation Data

To facilitate analysis of correlations amongst behavioural changes at higher network
graph levels, the raw correlation data (i.e. correlation matrix in Fig. 1) is aggregated

into other representative results. A number of aggregation schemes were examined.
However, compared to basic aggregation methods, spectral, Perron-Frobenius,
Eigenvector, and other matrix-based methods did not present any additional benefits
and required longer processing times. Hence, for the remainder of this paper, we restrict
our discussions to aggregation schemes employing straightforward averaging.
The aggregation of correlation data is described in Fig. 2. In the first step, for each
vertex, the vertex’s correlation with every other vertex is aggregated together. The
outcome is to provide an overall correlation measure for every vertex against the

majority of other vertices throughout the network. From the perspective of an
individual vertex, this aggregated correlation value indicates if the behavioural change
experienced by that vertex is also exhibited by the majority or only a small number of
other vertices (discussed further in Section 5.3).
In the second step, using individually aggregated vertex correlation values from
Step 1, an overall correlation measure is acquired for the network graph. The network
graph correlation indicates if the change experienced by the network is part of a
broader graph-wide change, or if the network deviation is due to a few local regions.
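
The two aggregation steps can be sketched with straightforward averaging. The function names and the nested-dictionary matrix layout are assumptions for illustration. Excluding each vertex's correlation with itself follows the description of Step 1, which aggregates a vertex's correlation with every other vertex.

```python
def aggregate_vertices(matrix):
    """Step 1: average each vertex's correlation with every other vertex."""
    result = {}
    for u, row in matrix.items():
        others = [rho for v, rho in row.items() if v != u]
        # Assumption: a single-vertex graph gets an aggregate of 0.
        result[u] = sum(others) / len(others) if others else 0.0
    return result

def aggregate_graph(vertex_values):
    """Step 2: average the per-vertex values into one graph-level measure."""
    return sum(vertex_values.values()) / len(vertex_values)

# Hypothetical 3 x 3 correlation matrix: v1 and v2 move together, v3 opposes both.
corr = {
    "v1": {"v1": 1.0, "v2": 1.0, "v3": -1.0},
    "v2": {"v1": 1.0, "v2": 1.0, "v3": -1.0},
    "v3": {"v1": -1.0, "v2": -1.0, "v3": 1.0},
}
per_vertex = aggregate_vertices(corr)
graph_level = aggregate_graph(per_vertex)
```

A per-vertex value near 1 suggests the vertex changed in step with most of the network, while a low graph-level value points towards a change confined to a few local regions.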
