Mathematical Analysis for Machine Learning and Data Mining




Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Library of Congress Cataloging-in-Publication Data
Names: Simovici, Dan A., author.
Title: Mathematical analysis for machine learning and data mining / by Dan Simovici (University of Massachusetts, Boston, USA).
Description: [Hackensack?] New Jersey : World Scientific, [2018] | Includes bibliographical references and index.
Identifiers: LCCN 2018008584 | ISBN 9789813229686 (hc : alk. paper)
Subjects: LCSH: Machine learning--Mathematics. | Data mining--Mathematics.
Classification: LCC Q325.5 .S57 2018 | DDC 006.3/101515--dc23
LC record available at https://lccn.loc.gov/2018008584

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Copyright © 2018 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

For any available supplementary material, please visit the publisher's website.

Desk Editors: V. Vishnu Mohan/Steven Patt
Typeset by Stallion Press
Printed in Singapore



Making mathematics accessible to the educated layman, while
keeping high scientific standards, has always been considered
a treacherous navigation between the Scylla of professional
contempt and the Charybdis of public misunderstanding.

Gian-Carlo Rota





Preface

Mathematical analysis can be loosely described as the area of mathematics whose main object is the study of functions and of their behavior with respect to limits. The term "function" refers to a broad collection of generalizations of real functions of real arguments, to functionals, operators, measures, etc.
There are several well-developed areas in mathematical analysis that are of special interest for machine learning: topology (in various flavors: point-set topology, combinatorial and algebraic topology), functional analysis on normed and inner product spaces (including Banach and Hilbert spaces), convex analysis, optimization, etc. Moreover, disciplines like measure and integration theory, which play a vital role in statistics, the other pillar of machine learning, are absent from the education of computer scientists. We aim to contribute to closing this gap, which is a serious handicap for people interested in research.
The machine learning and data mining literature is vast and embraces a diversity of approaches, from informal to sophisticated mathematical presentations. However, the mathematical background needed for approaching research topics is usually presented in a terse and unmotivated manner, or is simply absent. This volume contains knowledge that complements the usual presentations in machine learning and provides motivation (through its application chapters, which discuss optimization, iterative algorithms, neural networks, regression, and support vector machines) for the study of mathematical aspects.
Each chapter ends with suggestions for further reading. Over 600 exercises and supplements are included; they form an integral part of the
material. Some of the exercises are in reality supplemental material. For
these, we include solutions. The mathematical background required for


making the best use of this volume consists of the typical sequence of calculus, linear algebra, and discrete mathematics, as taught to Computer Science students in US universities.
Special thanks are due to the librarians of the Joseph Healy Library
at the University of Massachusetts Boston whose diligence was essential
in completing this project. I also wish to acknowledge the helpfulness and
competent assistance of Steve Patt and D. Rajesh Babu of World Scientific.
Lastly, I wish to thank my wife, Doina, a steady source of strength and
loving support.

Dan A. Simovici
Boston and Brookline
January 2018



Contents

Preface

Part I. Set-Theoretical and Algebraic Preliminaries

1. Preliminaries
   1.1 Introduction
   1.2 Sets and Collections
   1.3 Relations and Functions
   1.4 Sequences and Collections of Sets
   1.5 Partially Ordered Sets
   1.6 Closure and Interior Systems
   1.7 Algebras and σ-Algebras of Sets
   1.8 Dissimilarity and Metrics
   1.9 Elementary Combinatorics
   Exercises and Supplements
   Bibliographical Comments

2. Linear Spaces
   2.1 Introduction
   2.2 Linear Spaces and Linear Independence
   2.3 Linear Operators and Functionals
   2.4 Linear Spaces with Inner Products
   2.5 Seminorms and Norms
   2.6 Linear Functionals in Inner Product Spaces
   2.7 Hyperplanes
   Exercises and Supplements
   Bibliographical Comments

3. Algebra of Convex Sets
   3.1 Introduction
   3.2 Convex Sets and Affine Subspaces
   3.3 Operations on Convex Sets
   3.4 Cones
   3.5 Extreme Points
   3.6 Balanced and Absorbing Sets
   3.7 Polytopes and Polyhedra
   Exercises and Supplements
   Bibliographical Comments

Part II. Topology

4. Topology
   4.1 Introduction
   4.2 Topologies
   4.3 Closure and Interior Operators in Topological Spaces
   4.4 Neighborhoods
   4.5 Bases
   4.6 Compactness
   4.7 Separation Hierarchy
   4.8 Locally Compact Spaces
   4.9 Limits of Functions
   4.10 Nets
   4.11 Continuous Functions
   4.12 Homeomorphisms
   4.13 Connected Topological Spaces
   4.14 Products of Topological Spaces
   4.15 Semicontinuous Functions
   4.16 The Epigraph and the Hypograph of a Function
   Exercises and Supplements
   Bibliographical Comments

5. Metric Space Topologies
   5.1 Introduction
   5.2 Sequences in Metric Spaces
   5.3 Limits of Functions on Metric Spaces
   5.4 Continuity of Functions between Metric Spaces
   5.5 Separation Properties of Metric Spaces
   5.6 Completeness of Metric Spaces
   5.7 Pointwise and Uniform Convergence
   5.8 The Stone-Weierstrass Theorem
   5.9 Totally Bounded Metric Spaces
   5.10 Contractions and Fixed Points
   5.11 The Hausdorff Metric Hyperspace of Compact Subsets
   5.12 The Topological Space (R, O)
   5.13 Series and Schauder Bases
   5.14 Equicontinuity
   Exercises and Supplements
   Bibliographical Comments

6. Topological Linear Spaces
   6.1 Introduction
   6.2 Topologies of Linear Spaces
   6.3 Topologies on Inner Product Spaces
   6.4 Locally Convex Linear Spaces
   6.5 Continuous Linear Operators
   6.6 Linear Operators on Normed Linear Spaces
   6.7 Topological Aspects of Convex Sets
   6.8 The Relative Interior
   6.9 Separation of Convex Sets
   6.10 Theorems of Alternatives
   6.11 The Contingent Cone
   6.12 Extreme Points and Krein-Milman Theorem
   Exercises and Supplements
   Bibliographical Comments

Part III. Measure and Integration

7. Measurable Spaces and Measures
   7.1 Introduction
   7.2 Measurable Spaces
   7.3 Borel Sets
   7.4 Measurable Functions
   7.5 Measures and Measure Spaces
   7.6 Outer Measures
   7.7 The Lebesgue Measure on Rⁿ
   7.8 Measures on Topological Spaces
   7.9 Measures in Metric Spaces
   7.10 Signed and Complex Measures
   7.11 Probability Spaces
   Exercises and Supplements
   Bibliographical Comments

8. Integration
   8.1 Introduction
   8.2 The Lebesgue Integral
       8.2.1 The Integral of Simple Measurable Functions
       8.2.2 The Integral of Non-negative Measurable Functions
       8.2.3 The Integral of Real-Valued Measurable Functions
       8.2.4 The Integral of Complex-Valued Measurable Functions
   8.3 The Dominated Convergence Theorem
   8.4 Functions of Bounded Variation
   8.5 Riemann Integral vs. Lebesgue Integral
   8.6 The Radon-Nikodym Theorem
   8.7 Integration on Products of Measure Spaces
   8.8 The Riesz-Markov-Kakutani Theorem
   8.9 Integration Relative to Signed Measures and Complex Measures
   8.10 Indefinite Integral of a Function
   8.11 Convergence in Measure
   8.12 Lp and ℒp Spaces
   8.13 Fourier Transforms of Measures
   8.14 Lebesgue-Stieltjes Measures and Integrals
   8.15 Distributions of Random Variables
   8.16 Random Vectors
   Exercises and Supplements
   Bibliographical Comments

Part IV. Functional Analysis and Convexity

9. Banach Spaces
   9.1 Introduction
   9.2 Banach Spaces — Examples
   9.3 Linear Operators on Banach Spaces
   9.4 Compact Operators
   9.5 Duals of Normed Linear Spaces
   9.6 Spectra of Linear Operators on Banach Spaces
   Exercises and Supplements
   Bibliographical Comments

10. Differentiability of Functions Defined on Normed Spaces
   10.1 Introduction
   10.2 The Fréchet and Gâteaux Differentiation
   10.3 Taylor's Formula
   10.4 The Inverse Function Theorem in Rⁿ
   10.5 Normal and Tangent Subspaces for Surfaces in Rⁿ
   Exercises and Supplements
   Bibliographical Comments

11. Hilbert Spaces
   11.1 Introduction
   11.2 Hilbert Spaces — Examples
   11.3 Classes of Linear Operators in Hilbert Spaces
       11.3.1 Self-Adjoint Operators
       11.3.2 Normal and Unitary Operators
       11.3.3 Projection Operators
   11.4 Orthonormal Sets in Hilbert Spaces
   11.5 The Dual Space of a Hilbert Space
   11.6 Weak Convergence
   11.7 Spectra of Linear Operators on Hilbert Spaces
   11.8 Functions of Positive and Negative Type
   11.9 Reproducing Kernel Hilbert Spaces
   11.10 Positive Operators in Hilbert Spaces
   Exercises and Supplements
   Bibliographical Comments

12. Convex Functions
   12.1 Introduction
   12.2 Convex Functions — Basics
   12.3 Constructing Convex Functions
   12.4 Extrema of Convex Functions
   12.5 Differentiability and Convexity
   12.6 Quasi-Convex and Pseudo-Convex Functions
   12.7 Convexity and Inequalities
   12.8 Subgradients
   Exercises and Supplements
   Bibliographical Comments

Part V. Applications

13. Optimization
   13.1 Introduction
   13.2 Local Extrema, Ascent and Descent Directions
   13.3 General Optimization Problems
   13.4 Optimization without Differentiability
   13.5 Optimization with Differentiability
   13.6 Duality
   13.7 Strong Duality
   Exercises and Supplements
   Bibliographical Comments

14. Iterative Algorithms
   14.1 Introduction
   14.2 Newton's Method
   14.3 The Secant Method
   14.4 Newton's Method in Banach Spaces
   14.5 Conjugate Gradient Method
   14.6 Gradient Descent Algorithm
   14.7 Stochastic Gradient Descent
   Exercises and Supplements
   Bibliographical Comments

15. Neural Networks
   15.1 Introduction
   15.2 Neurons
   15.3 Neural Networks
   15.4 Neural Networks as Universal Approximators
   15.5 Weight Adjustment by Back Propagation
   Exercises and Supplements
   Bibliographical Comments

16. Regression
   16.1 Introduction
   16.2 Linear Regression
   16.3 A Statistical Model of Linear Regression
   16.4 Logistic Regression
   16.5 Ridge Regression
   16.6 Lasso Regression and Regularization
   Exercises and Supplements
   Bibliographical Comments

17. Support Vector Machines
   17.1 Introduction
   17.2 Linearly Separable Data Sets
   17.3 Soft Support Vector Machines
   17.4 Non-linear Support Vector Machines
   17.5 Perceptrons
   Exercises and Supplements
   Bibliographical Comments

Bibliography

Index


PART I

Set-Theoretical and Algebraic
Preliminaries



Chapter 1

Preliminaries


1.1 Introduction

This introductory chapter contains a mix of preliminary results and notations that we use in further chapters, ranging from set theory and combinatorics to metric spaces.
The membership of x in a set S is denoted by x ∈ S; if x is not a member of the set S, we write x ∉ S.
Throughout this book, we use standardized notations for certain important sets of numbers:

C: the set of complex numbers
R: the set of real numbers
R≥0: the set of non-negative real numbers
R≤0: the set of non-positive real numbers
R>0: the set of positive real numbers
R<0: the set of negative real numbers
Q: the set of rational numbers
Z: the set of integers
I: the set of irrational numbers
N: the set of natural numbers
Ĉ: the set C ∪ {∞}
R̂: the set R ∪ {−∞, +∞}
R̂≥0: the set R≥0 ∪ {+∞}
R̂≤0: the set R≤0 ∪ {−∞}
R̂>0: the set R>0 ∪ {+∞}
ˆ by −∞ < x <
The usual order of real numbers is extended to the set R
+∞ for every x ∈ R. Addition and multiplication are extended by
x + ∞ = ∞ + x = +∞, and , x − ∞ = −∞ + x = −∞,
for every x ∈ R. Also, if x = 0 we assume that
+∞ if x > 0,

x·∞ =∞·x =

−∞ if x < 0,
3


May 2, 2018 11:28

Mathematical Analysis for Machine Learning


9in x 6in

b3234-main

page 4

Mathematical Analysis for Machine Learning and Data Mining

4

and
x · (−∞) = (−∞) · x =

−∞ if x > 0,


if x < 0.

Additionally, we assume that 0 · ∞ = ∞ · 0 = 0 and 0 · (−∞) = (−∞) · 0 = 0. Note that ∞ − ∞ and −∞ + ∞ are undefined.
Division is extended by x/∞ = x/(−∞) = 0 for every x ∈ R.
The set of complex numbers C is extended by adding a single “infinity”
element ∞. The sum ∞ + ∞ is not defined in the complex case.
If S is a finite set, we denote by |S| the number of elements of S.
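These conventions are easy to mechanize. The following Python sketch (illustrative only) implements the extended multiplication; note that the convention 0 · (±∞) = 0 must be special-cased, since IEEE floating-point arithmetic makes 0 * inf a NaN rather than 0.

```python
import math

def ext_mul(x: float, y: float) -> float:
    """Multiplication on the extended reals with the convention 0 * (+/-inf) = 0."""
    if x == 0 or y == 0:
        return 0.0       # the book's convention; IEEE 754 would yield NaN here
    return x * y         # IEEE arithmetic already agrees with the remaining cases

assert ext_mul(0.0, math.inf) == 0.0
assert ext_mul(2.0, math.inf) == math.inf
assert ext_mul(-3.0, math.inf) == -math.inf
assert math.isnan(math.inf - math.inf)   # inf - inf stays undefined (NaN)
```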
1.2 Sets and Collections

We assume that the reader is familiar with elementary set operations:
union, intersection, difference, etc., and with their properties. The empty

set is denoted by ∅.
We give, without proof, several properties of union and intersection of
sets:
(1) S ∪ (T ∪ U ) = (S ∪ T ) ∪ U (associativity of union),
(2) S ∪ T = T ∪ S (commutativity of union),
(3) S ∪ S = S (idempotency of union),
(4) S ∪ ∅ = S,
(5) S ∩ (T ∩ U ) = (S ∩ T ) ∩ U (associativity of intersection),
(6) S ∩ T = T ∩ S (commutativity of intersection),
(7) S ∩ S = S (idempotency of intersection),
(8) S ∩ ∅ = ∅,
for all sets S, T, U .
The associativity of union and intersection allows us to denote unambiguously the union of three sets S, T, U by S ∪ T ∪ U and the intersection
of three sets S, T, U by S ∩ T ∩ U .
Definition 1.1. The sets S and T are disjoint if S ∩ T = ∅.
Sets may contain other sets as elements. For example, the set
C = {∅, {0}, {0, 1}, {0, 2}, {1, 2, 3}}
contains the empty set ∅ and {0}, {0, 1},{0, 2},{1, 2, 3} as its elements. We
refer to such sets as collections of sets or simply collections. In general, we
use calligraphic letters C, D, . . . to denote collections of sets.



If C and D are two collections, we say that C is included in D, or that
C is a subcollection of D, if every member of C is a member of D. This is
denoted by C ⊆ D.
Two collections C and D are equal if we have both C ⊆ D and D ⊆ C.
This is denoted by C = D.
Definition 1.2. Let C be a collection of sets. The union of C, denoted by ⋃C, is the set defined by

⋃C = {x | x ∈ S for some S ∈ C}.

If C is a non-empty collection, its intersection is the set ⋂C given by

⋂C = {x | x ∈ S for every S ∈ C}.

If C = {S, T}, we have x ∈ ⋃C if and only if x ∈ S or x ∈ T, and x ∈ ⋂C if and only if x ∈ S and x ∈ T. The union and the intersection of this two-set collection are denoted by S ∪ T and S ∩ T and are referred to as the union and the intersection of S and T, respectively.
The difference of two sets S, T is denoted by S − T. When T is a subset of S we write T̄ for S − T, and we refer to the set T̄ as the complement of T with respect to S or simply the complement of T.
The relationship between set difference and set union and intersection is well-known: for every set S and non-empty collection C of sets, we have

S − ⋃C = ⋂{S − C | C ∈ C} and S − ⋂C = ⋃{S − C | C ∈ C}.

For any sets S, T, U, we have

S − (T ∪ U) = (S − T) ∩ (S − U) and S − (T ∩ U) = (S − T) ∪ (S − U).

With the notation previously introduced for the complement of a set, the above equalities state that the complement of T ∪ U is T̄ ∩ Ū, and the complement of T ∩ U is T̄ ∪ Ū.
For any sets T, U, V, we have

(U ∪ V) ∩ T = (U ∩ T) ∪ (V ∩ T) and (U ∩ V) ∪ T = (U ∪ T) ∩ (V ∪ T).
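These identities can be spot-checked mechanically on small finite sets. The sketch below (illustrative only) verifies the two De Morgan laws for a collection of subsets of a set S.

```python
S = set(range(10))
coll = [{1, 2, 3}, {2, 4, 6, 8}, {0, 5}]

union = set().union(*coll)
inter = set.intersection(*coll)

# S - union(C) equals the intersection of the complements, and dually:
assert S - union == set.intersection(*[S - C for C in coll])
assert S - inter == set().union(*[S - C for C in coll])
```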
Note that if C and D are two collections such that C ⊆ D, then ⋃C ⊆ ⋃D and ⋂D ⊆ ⋂C.

We initially excluded the empty collection from the definition of the intersection of a collection. However, within the framework of collections of subsets of a given set S, we extend the previous definition by taking ⋂∅ = S for the empty collection of subsets of S. This is consistent with the fact that ∅ ⊆ C implies ⋂C ⊆ ⋂∅ = S.
The symmetric difference of sets, denoted by ⊕, is defined by U ⊕ V = (U − V) ∪ (V − U) for all sets U, V.
We leave to the reader to verify that for all sets U, V, T we have
(i) U ⊕ U = ∅;
(ii) U ⊕ V = V ⊕ U;
(iii) (U ⊕ V) ⊕ T = U ⊕ (V ⊕ T).
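Python sets support the symmetric difference natively through the ^ operator, so the three properties are easy to spot-check on small examples (a sketch):

```python
U, V, T = {1, 2, 3}, {3, 4}, {4, 5, 1}

assert U ^ V == (U - V) | (V - U)     # the definition itself
assert U ^ U == set()                 # (i)
assert U ^ V == V ^ U                 # (ii)
assert (U ^ V) ^ T == U ^ (V ^ T)     # (iii)
```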
The next theorem allows us to introduce a type of set collection of
fundamental importance.
Theorem 1.1. Let {{x, y}, {x}} and {{u, v}, {u}} be two collections such that {{x, y}, {x}} = {{u, v}, {u}}. Then, we have x = u and y = v.
Proof. Suppose that {{x, y}, {x}} = {{u, v}, {u}}.
If x = y, the collection {{x, y}, {x}} consists of a single set, {x}, so the collection {{u, v}, {u}} also consists of a single set. This means that {u, v} = {u}, which implies u = v. Therefore, x = u, which gives the desired conclusion because we also have y = x = u = v.
If x ≠ y, then neither {x, y} nor {u, v} is a singleton. However, both collections contain exactly one singleton, namely {x} and {u}, respectively, so x = u. They also contain the sets {x, y} and {u, v}, which must therefore be equal. Since v ∈ {x, y} and v ≠ u = x, we conclude that v = y.
Definition 1.3. An ordered pair is a collection of sets of the form {{x, y}, {x}}.
Theorem 1.1 implies that for an ordered pair {{x, y}, {x}}, x and y are uniquely determined. This justifies the following definition.
Definition 1.4. Let p = {{x, y}, {x}} be an ordered pair. Then x is the first component of p and y is the second component of p.
From now on, an ordered pair {{x, y}, {x}} is denoted by (x, y). If both x, y ∈ S, we refer to (x, y) as an ordered pair on the set S.
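The frozenset type gives a direct, if pedantic, way to realize this encoding and to watch Theorem 1.1 at work (a sketch; actual programs would use native tuples):

```python
def kuratowski(x, y):
    """Encode the ordered pair (x, y) as the collection {{x, y}, {x}}."""
    return frozenset({frozenset({x, y}), frozenset({x})})

assert kuratowski(1, 2) == kuratowski(1, 2)               # equal pairs, equal encodings
assert kuratowski(1, 2) != kuratowski(2, 1)               # order matters
assert kuratowski(3, 3) == frozenset({frozenset({3})})    # the x = y case collapses
```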
Definition 1.5. Let X, Y be two sets. Their product is the set X × Y that
consists of all pairs of the form (x, y), where x ∈ X and y ∈ Y .
The set product is often referred to as the Cartesian product of sets.
Example 1.1. Let X = {a, b, c} and let Y = {1, 2}. The Cartesian product
X × Y is given by
X × Y = {(a, 1), (b, 1), (c, 1), (a, 2), (b, 2), (c, 2)}.
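Python's standard library offers the Cartesian product as itertools.product; the following reproduces Example 1.1, up to the order in which the pairs are listed:

```python
from itertools import product

X, Y = {'a', 'b', 'c'}, {1, 2}
pairs = set(product(X, Y))
assert pairs == {('a', 1), ('b', 1), ('c', 1), ('a', 2), ('b', 2), ('c', 2)}
assert len(pairs) == len(X) * len(Y)
```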



Definition 1.6. Let C and D be two collections of sets such that ⋃C = ⋃D. D is a refinement of C if, for every D ∈ D, there exists C ∈ C such that D ⊆ C.
This is denoted by C ⪯ D.
Example 1.2. Consider the collections C = {(a, ∞) | a ∈ R} and D = {(a, b) | a, b ∈ R, a < b}. It is clear that ⋃C = ⋃D = R.
Since we have (a, b) ⊆ (a, ∞) for every a, b ∈ R such that a < b, it follows that D is a refinement of C.
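For finite collections the refinement condition reduces to a double loop; a small sketch:

```python
def is_refinement(D, C):
    """True if every member of D is contained in some member of C (finite collections)."""
    return all(any(d <= c for c in C) for d in D)

C = [{1, 2, 3, 4}, {5, 6}]
D = [{1, 2}, {3}, {4}, {5, 6}]
assert is_refinement(D, C)
assert not is_refinement(C, D)
```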
Definition 1.7. A collection of sets C is hereditary if U ∈ C and W ⊆ U
implies W ∈ C.
Example 1.3. Let S be a set. The collection of subsets of S, denoted by
P(S), is a hereditary collection of sets since a subset of a subset T of S is
itself a subset of S.
The set of subsets of S that contain k elements is denoted by Pk(S). Clearly, for every set S, we have P0(S) = {∅} because there is only one subset of S that contains 0 elements, namely the empty set. The set of all finite subsets of a set S is denoted by Pfin(S). It is clear that Pfin(S) = ⋃k∈N Pk(S).
Example 1.4. If S = {a, b, c}, then P(S) consists of the following eight
sets: ∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}. For the empty set, we
have P(∅) = {∅}.
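For a finite set, P(S) can be enumerated with itertools.combinations, grouping the subsets exactly as the decomposition into the sets Pk(S) suggests (a sketch):

```python
from itertools import combinations

def powerset(S):
    """All subsets of S, listed by increasing size: P0(S), P1(S), ..."""
    return [set(c) for k in range(len(S) + 1) for c in combinations(S, k)]

P = powerset({'a', 'b', 'c'})
assert len(P) == 2 ** 3
assert set() in P and {'a', 'b', 'c'} in P
```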
Definition 1.8. Let C be a collection of sets and let U be a set. The trace
of the collection C on the set U is the collection CU = {U ∩ C | C ∈ C}.
We conclude this presentation of collections of sets with a few more operations on collections of sets.
Definition 1.9. Let C and D be two collections of sets. The collections
C ∨ D, C ∧ D, and C − D are given by
C ∨ D = {C ∪ D | C ∈ C and D ∈ D},
C ∧ D = {C ∩ D | C ∈ C and D ∈ D},

C − D = {C − D | C ∈ C and D ∈ D}.
Example 1.5. Let C and D be the collections of sets defined by
C = {{x}, {y, z}, {x, y}, {x, y, z}},
D = {{y}, {x, y}, {u, y, z}}.


May 2, 2018 11:28

Mathematical Analysis for Machine Learning

9in x 6in

b3234-main

page 8

Mathematical Analysis for Machine Learning and Data Mining

8

We have
C ∨ D = {{x, y}, {y, z}, {x, y, z}, {u, y, z}, {u, x, y, z}},
C ∧ D = {∅, {x}, {y}, {x, y}, {y, z}},
C − D = {∅, {x}, {z}, {x, z}},
D − C = {∅, {u}, {x}, {y}, {u, z}, {u, y, z}}.
Unlike "∪" and "∩", the operations "∨" and "∧" between collections of sets are not idempotent. Indeed, we have, for example,
D ∨ D = {{y}, {x, y}, {u, y, z}, {u, x, y, z}} ≠ D.
The trace CK of a collection C on K can be written as CK = C ∧ {K}.
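Over finite collections these operations are one-line set comprehensions; the sketch below recomputes part of Example 1.5 and the trace identity:

```python
def vee(C, D):   return {c | d for c in C for d in D}
def wedge(C, D): return {c & d for c in C for d in D}

C = {frozenset(s) for s in [{'x'}, {'y', 'z'}, {'x', 'y'}, {'x', 'y', 'z'}]}
D = {frozenset(s) for s in [{'y'}, {'x', 'y'}, {'u', 'y', 'z'}]}

assert vee(D, D) != D                       # "∨" is not idempotent
K = frozenset({'x', 'y'})
assert wedge(C, {K}) == {c & K for c in C}  # the trace of C on K is C ∧ {K}
```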
We conclude this section by introducing a special type of collection of

subsets of a set.
Definition 1.10. A partition of a non-empty set S is a collection π of
non-empty subsets of S that are pairwise disjoint and whose union equals
S.
The members of π are referred to as the blocks of the partition π.
The collection of partitions of a set S is denoted by PART(S). A partition is finite if it has a finite number of blocks. The set of finite partitions
of S is denoted by PARTfin (S).
If π ∈ PART(S) then a subset T of S is π-saturated if it is a union of
blocks of π.
Example 1.6. Let π = {{1, 3}, {4}, {2, 5, 6}} be a partition of S = {1, 2, 3, 4, 5, 6}. The set {1, 3, 4} is π-saturated because it is the union of the blocks {1, 3} and {4}.
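For finite sets both the partition property and π-saturation reduce to a few set comparisons; a sketch:

```python
from itertools import combinations

def is_partition(pi, S):
    """Non-empty, pairwise disjoint blocks whose union is S."""
    return (all(b for b in pi)
            and all(not (a & b) for a, b in combinations(pi, 2))
            and set().union(*pi) == S)

def is_saturated(T, pi):
    """T is pi-saturated if every block lies inside T or outside T."""
    return all(b <= T or not (b & T) for b in pi)

pi = [{1, 3}, {4}, {2, 5, 6}]
assert is_partition(pi, {1, 2, 3, 4, 5, 6})
assert is_saturated({1, 3, 4}, pi)
assert not is_saturated({1, 4, 5}, pi)
```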
1.3 Relations and Functions

Definition 1.11. Let X, Y be two sets. A relation on X, Y is a subset ρ of the set product X × Y.
If X = Y = S we refer to ρ as a relation on S.
The relation ρ on S is:
• reflexive if (x, x) ∈ ρ for every x ∈ S;
• irreflexive if (x, x) ∉ ρ for every x ∈ S;
• symmetric if (x, y) ∈ ρ implies (y, x) ∈ ρ for all x, y ∈ S;




• antisymmetric if (x, y) ∈ ρ and (y, x) ∈ ρ imply x = y for all
x, y ∈ S;
• transitive if (x, y) ∈ ρ and (y, z) ∈ ρ imply (x, z) ∈ ρ for all x, y, z ∈
S.
Denote by REFL(S), SYMM(S), ANTISYMM(S), and TRAN(S) the set of reflexive relations, the set of symmetric relations, the set of antisymmetric relations, and the set of transitive relations on S, respectively.
A partial order on S is a relation ρ that belongs to REFL(S) ∩ ANTISYMM(S) ∩ TRAN(S), that is, a relation that is reflexive, antisymmetric, and transitive.
Example 1.7. Let δ be the relation that consists of those pairs (p, q) of natural numbers such that q = pk for some natural number k. We have (p, q) ∈ δ if p evenly divides q. Since (p, p) ∈ δ for every p, it is clear that δ is reflexive.
Suppose that we have both (p, q) ∈ δ and (q, p) ∈ δ. Then q = pk and
p = qh. If either p or q is 0, then the other number is clearly 0. Assume

that neither p nor q is 0. Then 1 = hk, which implies h = k = 1, so p = q,
which proves that δ is antisymmetric.
Finally, if (p, q), (q, r) ∈ δ, we have q = pk and r = qh for some k, h ∈ N,
which implies r = p(hk), so (p, r) ∈ δ, which shows that δ is transitive.
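Restricted to a finite set of naturals, δ is a finite set of pairs, so its partial-order properties can be checked by brute force; a sketch:

```python
def is_reflexive(rho, S):
    return all((x, x) in rho for x in S)

def is_antisymmetric(rho):
    return all(x == y for (x, y) in rho if (y, x) in rho)

def is_transitive(rho):
    return all((x, w) in rho for (x, y) in rho for (z, w) in rho if y == z)

S = {1, 2, 3, 4, 6}
delta = {(p, q) for p in S for q in S if q % p == 0}
assert is_reflexive(delta, S) and is_antisymmetric(delta) and is_transitive(delta)
```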
Example 1.8. Define the relation λ on R as the set of all ordered pairs (x, y) such that y = x + t, where t is a non-negative number. We have (x, x) ∈ λ because x = x + 0 for every x ∈ R, so λ is reflexive. If (x, y) ∈ λ and (y, x) ∈ λ we have y = x + t and x = y + s for two non-negative numbers t, s, which implies 0 = t + s, so t = s = 0. This means that x = y, so λ is antisymmetric.
Finally, if (x, y), (y, z) ∈ λ, we have y = x + u and z = y + v for two non-negative numbers u, v, which implies z = x + u + v, so (x, z) ∈ λ, which shows that λ is transitive.
In current mathematical practice, we often write x ρ y instead of (x, y) ∈ ρ, where ρ is a relation on S and x, y ∈ S. Thus, we write p δ q and x λ y instead of (p, q) ∈ δ and (x, y) ∈ λ. Furthermore, we shall use the standard notations "|" and "⩽" for δ and λ, that is, we shall write p | q and x ⩽ y if p divides q and x is less than or equal to y. This alternative way to denote the fact that (x, y) belongs to ρ is known as the infix notation.
Example 1.9. Let P(S) be the set of subsets of S. It is easy to verify that
the inclusion between subsets “⊆” is a partial order relation on P(S). If
U, V ∈ P(S), we denote the inclusion of U in V by U ⊆ V using the infix
notation.



Functions are special relations that enjoy the property described in the next definition.
Definition 1.12. Let X, Y be two sets. A function (or a mapping) from X to Y is a relation f on X, Y such that (x, y), (x, y′) ∈ f implies y = y′.
In other words, the first component of a pair (x, y) ∈ f determines
uniquely the second component of the pair. We denote the second component of a pair (x, y) ∈ f by f (x) and say, occasionally, that f maps x to
y.
If f is a function from X to Y we write f : X −→ Y .
Definition 1.13. Let X, Y be two sets and let f : X −→ Y .
The domain of f is the set
Dom(f ) = {x ∈ X | y = f (x) for some y ∈ Y }.
The range of f is the set
Ran(f ) = {y ∈ Y | y = f (x) for some x ∈ X}.
Definition 1.14. Let S be a set and let L be a subset of S. The characteristic function of L is the function 1_L : S −→ {0, 1} defined by

1_L(x) = 1 if x ∈ L, and 1_L(x) = 0 otherwise,

for x ∈ S.
The indicator function of L is the function I_L : S −→ R̂ defined by

I_L(x) = 0 if x ∈ L, and I_L(x) = ∞ otherwise,

for x ∈ S.
It is easy to see that:

1_{P∩Q}(x) = 1_P(x) · 1_Q(x),
1_{P∪Q}(x) = 1_P(x) + 1_Q(x) − 1_P(x) · 1_Q(x),
1_{P̄}(x) = 1 − 1_P(x),

for every P, Q ⊆ S and x ∈ S.
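These identities translate directly into code; a sketch that verifies them pointwise over a small universe:

```python
S = set(range(8))
P, Q = {1, 2, 3}, {2, 4, 6}

def ind(L):
    """The characteristic function 1_L as a Python function."""
    return lambda x: 1 if x in L else 0

for x in S:
    assert ind(P & Q)(x) == ind(P)(x) * ind(Q)(x)
    assert ind(P | Q)(x) == ind(P)(x) + ind(Q)(x) - ind(P)(x) * ind(Q)(x)
    assert ind(S - P)(x) == 1 - ind(P)(x)
```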
Theorem 1.2. Let X, Y, Z be three sets and let f : X −→ Y and g : Y −→
Z be two functions. The relation gf : X −→ Z that consists of all pairs
(x, z) such that y = f (x) and g(y) = z for some y ∈ Y is a function.
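Viewing functions as sets of pairs makes the composition of Theorem 1.2 a comprehension over matching middle elements; a sketch:

```python
f = {(1, 'a'), (2, 'b'), (3, 'a')}    # f : X -> Y as a set of pairs
g = {('a', True), ('b', False)}       # g : Y -> Z

gf = {(x, z) for (x, y) in f for (y2, z) in g if y == y2}
assert gf == {(1, True), (2, False), (3, True)}
# gf is a function: each first component appears exactly once
assert len({x for (x, _) in gf}) == len(gf)
```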

