

Pentaho Data
Integration Cookbook
Second Edition

Over 100 recipes for building open source ETL solutions
with Pentaho Data Integration

Alex Meadows
Adrián Sergio Pulvirenti
María Carina Roldán

BIRMINGHAM - MUMBAI


Pentaho Data Integration Cookbook
Second Edition

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers
and distributors will be held liable for any damages caused or alleged to be caused directly
or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies
and products mentioned in this book by the appropriate use of capitals. However, Packt
Publishing cannot guarantee the accuracy of this information.



First published: June 2011
Second Edition: November 2013

Production Reference: 2221113

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-067-4
www.packtpub.com

Cover Image by Aniket Sawant


Credits

Authors
Alex Meadows
Adrián Sergio Pulvirenti
María Carina Roldán

Reviewers
Wesley Seidel Carvalho
Daniel Lemire
Coty Sutherland

Acquisition Editors
Usha Iyer
Meeta Rajani

Lead Technical Editor
Arvind Koul

Technical Editors
Dennis John
Adrian Raposo
Gaurav Thingalaya

Project Coordinator
Wendell Palmer

Proofreader
Kevin McGowan

Indexer
Monica Ajmera Mehta

Graphics
Ronak Dhruv

Production Coordinator
Nilesh R. Mohite

Cover Work
Nilesh R. Mohite


About the Author
Alex Meadows has worked with open source Business Intelligence solutions for nearly 10 years in various industries such as plastics manufacturing, social and e-mail marketing, and most recently software at Red Hat, Inc. He has been very active in Pentaho and other open source communities, learning, sharing, and helping newcomers with best practices in BI, analytics, and data management. He received his Bachelor's degree in Business Administration from Chowan University in Murfreesboro, North Carolina, and his Master's degree in Business Intelligence from St. Joseph's University in Philadelphia, Pennsylvania.

First and foremost, thank you, Christina, for being there for me before, during,
and after taking on the challenge of writing and revising a book. I know
it's not been easy, but thank you for allowing me the opportunity. To my
grandmother, thank you for teaching me at a young age to always go for goals
that may just be out of reach. Finally, this book would be nowhere without
the Pentaho community and the friends I've made over the years being a part
of it.

Adrián Sergio Pulvirenti was born in Buenos Aires, Argentina, in 1972. He earned his
Bachelor's degree in Computer Sciences at UBA, one of the most prestigious universities in
South America.
He has dedicated more than 15 years to developing desktop and web-based software
solutions. Over the last few years he has been leading integration projects and development
of BI solutions.
I'd like to thank my lovely kids, Camila and Nicolas, who understood that
I couldn't share with them the usual video game sessions during the
writing process. I'd also like to thank my wife, who introduced me to the
Pentaho world.


María Carina Roldán was born in Esquel, Argentina, in 1970. She earned her Bachelor's
degree in Computer Science at UNLP in La Plata; after that she did a postgraduate course in
Statistics at the University of Buenos Aires (UBA) in Buenos Aires city, where she has been
living since 1994.
She has worked as a BI consultant for more than 10 years. Over the last four years, she has
been dedicated full time to developing BI solutions using Pentaho Suite. Currently, she works
for Webdetails, one of the main Pentaho contributors. She is the author of Pentaho 3.2 Data
Integration: Beginner's Guide published by Packt Publishing in April 2010.
You can follow her on Twitter at @mariacroldan.
I'd like to thank those who have encouraged me to write this book: on one
hand, the Pentaho community, who gave me rewarding feedback after the
Beginner's book; on the other hand, my husband, who without hesitation
agreed to write the book with me. Without them I'm not sure I would have
embarked on a new book project.
I'd also like to thank the technical reviewers for the time and dedication that
they have put in reviewing the book. In particular, thanks to my colleagues at
Webdetails; it's a pleasure and a privilege to work with them every day.


About the Reviewers
Wesley Seidel Carvalho got his Master's degree in Computer Science from the Institute
of Mathematics and Statistics, University of São Paulo (IME-USP), Brazil, where his
dissertation research focused on Natural Language Processing (NLP) for the Portuguese
language. He is a Database Specialist from the Federal University of Pará (UFPa) and has
a degree in Mathematics from the State University of Pará (Uepa).
Since 2010, he has been working with Pentaho and researching Open Data government.
He is an active member of the Free Software, Open Data, and Pentaho communities and
mailing lists in Brazil, contributing to the "Grammar Checker for OpenOffice - CoGrOO"
software and the CoGrOO Community.
He has worked with technology, databases, and systems development since 1997, with
Business Intelligence since 2003, and has been involved with Pentaho and NLP since 2009.
He currently serves customers through his startups.
Daniel Lemire has a B.Sc. and a M.Sc. in Mathematics from the University of Toronto,
and a Ph.D. in Engineering Mathematics from the École Polytechnique and the Université de
Montréal. He is a Computer Science professor at TELUQ (Université du Québec), where he
teaches primarily online. He has also been a research officer at the National Research Council
of Canada and an entrepreneur. He has written over 45 peer-reviewed publications, including
more than 25 journal articles. He has held competitive research grants for the last 15 years.
He has served as a program committee member on leading computer science conferences
(for example, ACM CIKM, ACM WSDM, and ACM RecSys). His open source software has been
used by major corporations such as Google and Facebook. His research interests include
databases, information retrieval, and high performance programming. He blogs regularly
about computer science.

Coty Sutherland was first introduced to computing around the age of 10. At that time,
he was immersed in various aspects of computers and it became apparent that he had a
propensity for software manipulation. From then until now, he has stayed involved in learning
new things in the software space and adapting to the changing environment that is Software
Development. He graduated from Appalachian State University in 2009 with a Bachelor's
Degree in Computer Science. After graduation, he focused mainly on software application
development and support, but recently transitioned to the Business Intelligence field to
pursue new and exciting things with data. He is currently employed by the open source
company, Red Hat, as a Business Intelligence Engineer.


www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related to
your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com and as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us at

for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters and receive exclusive discounts and offers on Packt books
and eBooks.



Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can access, read and search across Packt's entire library of books. 

Why Subscribe?
- Fully searchable across every book published by Packt
- Copy and paste, print and bookmark content
- On demand and accessible via web browser

Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view nine entirely free books. Simply use your login credentials for
immediate access.



Table of Contents

Preface  1

Chapter 1: Working with Databases  7
Introduction  7
Connecting to a database  9
Getting data from a database  14
Getting data from a database by providing parameters  16
Getting data from a database by running a query built at runtime  21
Inserting or updating rows in a table  23
Inserting new rows where a simple primary key has to be generated  28
Inserting new rows where the primary key has to be generated based on stored values  32
Deleting data from a table  35
Creating or altering a database table from PDI (design time)  40
Creating or altering a database table from PDI (runtime)  43
Inserting, deleting, or updating a table depending on a field  45
Changing the database connection at runtime  51
Loading a parent-child table  53
Building SQL queries via database metadata  57
Performing repetitive database design tasks from PDI  62

Chapter 2: Reading and Writing Files  65
Introduction  66
Reading a simple file  66
Reading several files at the same time  70
Reading semi-structured files  72
Reading files having one field per row  79
Reading files with some fields occupying two or more rows  82
Writing a simple file  84
Writing a semi-structured file  87
Providing the name of a file (for reading or writing) dynamically  90
Using the name of a file (or part of it) as a field  93
Reading an Excel file  95
Getting the value of specific cells in an Excel file  97
Writing an Excel file with several sheets  101
Writing an Excel file with a dynamic number of sheets  105
Reading data from an AWS S3 Instance  107

Chapter 3: Working with Big Data and Cloud Sources  111
Introduction  111
Loading data into Salesforce.com  112
Getting data from Salesforce.com  114
Loading data into Hadoop  115
Getting data from Hadoop  119
Loading data into HBase  122
Getting data from HBase  127
Loading data into MongoDB  129
Getting data from MongoDB  130

Chapter 4: Manipulating XML Structures  133
Introduction  133
Reading simple XML files  134
Specifying fields by using the Path notation  137
Validating well-formed XML files  143
Validating an XML file against DTD definitions  146
Validating an XML file against an XSD schema  148
Generating a simple XML document  153
Generating complex XML structures  155
Generating an HTML page using XML and XSL transformations  162
Reading an RSS Feed  165
Generating an RSS Feed  167

Chapter 5: File Management  171
Introduction  171
Copying or moving one or more files  172
Deleting one or more files  175
Getting files from a remote server  178
Putting files on a remote server  181
Copying or moving a custom list of files  183
Deleting a custom list of files  185
Comparing files and folders  188
Working with ZIP files  191
Encrypting and decrypting files  195

Chapter 6: Looking for Data  199
Introduction  199
Looking for values in a database table  200
Looking for values in a database with complex conditions  204
Looking for values in a database with dynamic queries  207
Looking for values in a variety of sources  211
Looking for values by proximity  217
Looking for values by using a web service  222
Looking for values over intranet or the Internet  225
Validating data at runtime  227

Chapter 7: Understanding and Optimizing Data Flows  231
Introduction  232
Splitting a stream into two or more streams based on a condition  233
Merging rows of two streams with the same or different structures  240
Adding checksums to verify datasets  246
Comparing two streams and generating differences  249
Generating all possible pairs formed from two datasets  255
Joining two or more streams based on given conditions  258
Interspersing new rows between existent rows  261
Executing steps even when your stream is empty  265
Processing rows differently based on the row number  268
Processing data into shared transformations via filter criteria and subtransformations  272
Altering a data stream with Select values  274
Processing multiple jobs or transformations in parallel  275

Chapter 8: Executing and Re-using Jobs and Transformations  279
Introduction  280
Launching jobs and transformations  283
Executing a job or a transformation by setting static arguments and parameters  284
Executing a job or a transformation from a job by setting arguments and parameters dynamically  287
Executing a job or a transformation whose name is determined at runtime  290
Executing part of a job once for every row in a dataset  293
Executing part of a job several times until a condition is true  298
Moving part of a transformation to a subtransformation  309
Using Metadata Injection to re-use transformations  316

Chapter 9: Integrating Kettle and the Pentaho Suite  321
Introduction  321
Creating a Pentaho report with data coming from PDI  324
Creating a Pentaho report directly from PDI  329
Configuring the Pentaho BI Server for running PDI jobs and transformations  332
Executing a PDI transformation as part of a Pentaho process  334
Executing a PDI job from the Pentaho User Console  341
Populating a CDF dashboard with data coming from a PDI transformation  350

Chapter 10: Getting the Most Out of Kettle  357
Introduction  357
Sending e-mails with attached files  358
Generating a custom logfile  362
Running commands on another server  367
Programming custom functionality  369
Generating sample data for testing purposes  378
Working with JSON files  381
Getting information about transformations and jobs (file-based)  385
Getting information about transformations and jobs (repository-based)  390
Using Spoon's built-in optimization tools  395

Chapter 11: Utilizing Visualization Tools in Kettle  401
Introduction  401
Managing plugins with the Marketplace  402
Data profiling with DataCleaner  404
Visualizing data with AgileBI  409
Using Instaview to analyze and visualize data  413

Chapter 12: Data Analytics  417
Introduction  417
Reading data from a SAS datafile  417
Studying data via stream statistics  420
Building a random data sample for Weka  424

Appendix A: Data Structures  427
Books data structure  427
museums data structure  429
outdoor data structure  430
Steel Wheels data structure  431
Lahman Baseball Database  432

Appendix B: References  433
Books  433
Online  434

Index  435


Preface
Pentaho Data Integration (also known as Kettle) is one of the leading open source data
integration solutions. With Kettle, you can take data from a multitude of sources, transform
and conform the data to given requirements, and load the data into just as many target
systems. Not only is PDI capable of transforming and cleaning data, it also provides an
ever-growing number of plugins to augment what is already a very robust list of features.

Pentaho Data Integration Cookbook, Second Edition picks up where the first edition left off
by updating the recipes to the latest edition of PDI and diving into new topics such as working
with Big Data and cloud sources, data analytics, and more.
Pentaho Data Integration Cookbook, Second Edition shows you how to take advantage of all
the aspects of Kettle through a set of practical recipes organized to find quick solutions to
your needs. The book starts by showing you how to work with data sources such as files,
relational databases, Big Data, and cloud sources. Then we go into how to work with data
streams: merging data from different sources, taking advantage of the different tools to clean
up and transform data, and building nested jobs and transformations. More advanced topics
are also covered, such as data analytics, data visualization, plugins, and integration of Kettle
with other tools in the Pentaho suite.
Pentaho Data Integration Cookbook, Second Edition provides recipes with easy step-by-step
instructions to accomplish specific tasks. The code for the recipes can be adapted and built
upon to meet individual needs.

What this book covers
Chapter 1, Working with Databases, shows you how to work with relational databases with
Kettle. The recipes show you how to create and share database connections, perform typical
database functions (select, insert, update, and delete), as well as more advanced tricks such
as building and executing queries at runtime.
Chapter 2, Reading and Writing Files, not only shows you how to read and write files, but also
how to work with semi-structured files, and read data from Amazon Web Services.


Chapter 3, Working with Big Data and Cloud Sources, covers how to load and read data from
some of the many different NoSQL data sources as well as from Salesforce.com.
Chapter 4, Manipulating XML Structures, shows you how to read, write, and validate XML.
Simple and complex XML structures are shown as well as more specialized formats such
as RSS feeds.

Chapter 5, File Management, demonstrates how to copy, move, transfer, and encrypt files
and directories.
Chapter 6, Looking for Data, shows you how to search for information through various
methods via databases, web services, files, and more. This chapter also shows you how
to validate data with Kettle's built-in validation steps.
Chapter 7, Understanding and Optimizing Data Flows, details how Kettle moves data through
jobs and transformations and how to optimize data flows.
Chapter 8, Executing and Re-using Jobs and Transformations, shows you how to launch jobs
and transformations in various ways through static or dynamic arguments and parameterization.
Object-oriented transformations through subtransformations are also explained.
Chapter 9, Integrating Kettle and the Pentaho Suite, works with some of the other tools in the
Pentaho suite to show how combining tools provides even more capabilities and functionality
for reporting, dashboards, and more.
Chapter 10, Getting the Most Out of Kettle, works with some of the commonly needed
features (e-mail and logging) as well as building sample data sets, and using Kettle to read
meta information on jobs and transformations via files or Kettle's database repository.
Chapter 11, Utilizing Visualization Tools in Kettle, explains how to work with plugins and
focuses on DataCleaner, AgileBI, and Instaview, an Enterprise feature that allows for fast
analysis of data sources.
Chapter 12, Data Analytics, shows you how to work with the various analytical tools built into
Kettle, focusing on statistics gathering steps and building datasets for Weka.
Appendix A, Data Structures, shows the different data structures used throughout the book.
Appendix B, References, provides a list of books and other resources that will help you
connect with the rest of the Pentaho community and learn more about Kettle and the other
tools that are part of the Pentaho suite.



What you need for this book
PDI is written in Java. Any operating system that can run JVM 1.5 or higher should be able to
run PDI. Some of the recipes will require other software, as listed:
- Hortonworks Sandbox: This is Hadoop in a box, and provides a great environment to learn how to work with NoSQL solutions without having to install everything.
- Web server with ASP support: This is needed for two recipes to show how to work with web services.
- DataCleaner: This is one of the top open source data profiling tools and integrates with Kettle.
- MySQL: All the relational database recipes have scripts for MySQL provided. Feel free to use another relational database for those recipes.

In addition, it's recommended to have access to Excel or Calc and a decent text editor (like
Notepad++ or gedit).
Having access to an Internet connection will be useful for some of the recipes that use
cloud services, as well as making it possible to access the additional links that provide more
information about given topics throughout the book.


Who this book is for
If you are a software developer, data scientist, or anyone else looking for a tool that will help
extract, transform, and load data as well as provide the tools to perform analytics and data
cleansing, then this book is for you! This book does not cover the basics of PDI, SQL, database
theory, data profiling, and data analytics.

Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of
information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Copy the
.jar file containing the driver to the lib directory inside the Kettle installation directory."
A block of code is set as follows:
"lastname","firstname","country","birthyear"
"Larsson","Stieg","Swedish",1954
"King","Stephen","American",1947
"Hiaasen","Carl ","American",1953

When we wish to draw your attention to a particular part of a code block, the relevant lines or
items are set in bold:
<request>
<type>City</type>
<query>Buenos Aires, Argentina</query>
</request>

New terms and important words are shown in bold. Words that you see on the screen, in
menus or dialog boxes for example, appear in the text like this: "clicking on the Next button
moves you to the next screen".
Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book—what you liked or may have disliked. Reader feedback is important for us to develop
titles that you really get the most out of.
To send us general feedback, simply send an e-mail to , and
mention the book title via the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to
get the most from your purchase.

Downloading the example code
You can download the example code files for all Packt books you have purchased from
your account at . If you purchased this book elsewhere, you
can visit and register to have the files e-mailed
directly to you.



Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be
grateful if you would report this to us. By doing so, you can save other readers from frustration
and help us improve subsequent versions of this book. If you find any errata, please report
them by visiting selecting your book,
clicking on the errata submission form link, and entering the details of your errata. Once your
errata are verified, your submission will be accepted and the errata will be uploaded on our
website, or added to any list of existing errata, under the Errata section of that title. Any existing
errata can be viewed by selecting your title.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt,
we take the protection of our copyright and licenses very seriously. If you come across any
illegal copies of our works, in any form, on the Internet, please provide us with the location
address or website name immediately so that we can pursue a remedy.
Please contact us at with a link to the suspected
pirated material.
We appreciate your help in protecting our authors, and our ability to bring you
valuable content.

Questions
You can contact us at if you are having a problem with any
aspect of the book, and we will do our best to address it.




1


Working with Databases
In this chapter, we will cover:
- Connecting to a database
- Getting data from a database
- Getting data from a database by providing parameters
- Getting data from a database by running a query built at runtime
- Inserting or updating rows in a table
- Inserting new rows when a simple primary key has to be generated
- Inserting new rows when the primary key has to be generated based on stored values
- Deleting data from a table
- Creating or altering a table from PDI (design time)
- Creating or altering a table from PDI (runtime)
- Inserting, deleting, or updating a table depending on a field
- Changing the database connection at runtime
- Loading a parent-child table
- Building SQL queries via database metadata
- Performing repetitive database design tasks from PDI

Introduction
Databases are broadly used by organizations to store and administer transactional data such
as customer service history, bank transactions, purchases, sales, and so on. They are also
used to store data warehouse data used for Business Intelligence solutions.


In this chapter, you will learn to deal with databases in Kettle. The first recipe tells you how to
connect to a database, which is a prerequisite for all the other recipes. The rest of the chapter
teaches you how to perform different operations and can be read in any order according to
your needs.
The focus of this chapter is on relational databases (RDBMS).
Thus, the term database is used as a synonym for relational
database throughout the recipes.

Sample databases
Throughout the chapter you will use a couple of sample databases. These databases can be
created and loaded by running the scripts available on the book's website. The scripts are
ready to run under MySQL.
If you work with a different DBMS, you may have to modify
the scripts slightly.

For more information about the structure of the sample databases and the meaning of the
tables and fields, please refer to Appendix A, Data Structures. Feel free to adapt the recipes
to different databases. You could try some well-known databases; for example, Foodmart
(available as part of the Mondrian distribution) or the MySQL sample databases (available
from the MySQL documentation site).
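To give a feel for what those scripts do, here is a minimal MySQL sketch, not the actual script from the book's website: the authors table and its columns are illustrative and simply mirror the sample author data shown in the Preface, while the real scripts build the complete structures described in Appendix A, Data Structures:

CREATE DATABASE IF NOT EXISTS books;
USE books;

-- Illustrative table only; the real books database contains more tables and columns.
CREATE TABLE authors (
    id_author INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    lastname  VARCHAR(50),
    firstname VARCHAR(50),
    country   VARCHAR(30),
    birthyear INT
);

-- Sample rows matching the data used in the Preface examples.
INSERT INTO authors (lastname, firstname, country, birthyear)
VALUES ('Larsson', 'Stieg', 'Swedish', 1954),
       ('King', 'Stephen', 'American', 1947),
       ('Hiaasen', 'Carl', 'American', 1953);

Running such a script under another DBMS usually only requires small changes, for example replacing AUTO_INCREMENT with that engine's identity or sequence mechanism.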

Pentaho BI platform databases

As part of the sample databases used in this chapter you will use the Pentaho BI platform
Demo databases. The Pentaho BI Platform Demo is a preconfigured installation that lets you
explore the capabilities of the Pentaho platform. It relies on the following databases:

Database name   Description
hibernate       Administrative information, including user authentication and authorization data.
quartz          Repository for Quartz, the scheduler used by Pentaho.
sampledata      Data for Steel Wheels, a fictional company that sells all kinds of scale replicas of vehicles.

By default, all those databases are stored in Hypersonic (HSQLDB). The script for creating the
databases in HSQLDB can be found in the Pentaho project's file downloads. Under Business
Intelligence Server | 1.7.1-stable, look for pentaho_sample_data1.7.1.zip. While there are newer
versions of the actual Business Intelligence Server, they all use the same sample dataset.
These databases can be stored in other DBMSs as well. Scripts for creating and loading these
databases in other popular DBMSs, for example MySQL or Oracle, can be found on Prashant
Raju's blog. Besides the scripts, you will find instructions for creating and loading the databases.
Prashant Raju, an expert Pentaho developer, provides
several excellent tutorials related to the Pentaho platform.
If you are interested in knowing more about Pentaho, it's
worth taking a look at his blog.

Connecting to a database
If you intend to work with a database, whether reading, writing, looking up data, and so on, the
first thing you will have to do is create a connection to that database. This recipe will teach
you how to do this.

Getting ready
In order to create the connection, you will need to know the connection settings. At least you
will need the following:
- Host name: Domain name or IP address of the database server.
- Database name: The schema or other database identifier.
- Port number: The port the database listens on. Each database has its own default port.
- Username: The username to access the database.
- Password: The password to access the database.

It's recommended that you also have access to the database at the moment of creating
a connection.


How to do it...
Open Spoon and create a new transformation.
1. Select the View option that appears in the upper-left corner of the screen, right-click
on the Database connections option, and select New. The Database Connection
dialog window appears.
2. Under Connection Type, select the database engine that matches your DBMS.
3. Fill in the Settings options and give the connection a name by typing it in the
Connection Name: textbox. Your window should look like the following:

4. Press the Test button. A message should appear informing you that the connection to
your database is OK.
If you get an error message instead, you should recheck
the data entered, as well as the availability of the database
server. The server might be down, or it might not be
reachable from your machine.


How it works...
A database connection is the definition that allows you to access a database from Kettle.
With the data you provide, Kettle can instantiate real database connections and perform the
different operations related to databases. Once you define a database connection, you will be
able to access that database and execute arbitrary SQL statements: create schema objects
like tables, execute SELECT statements, modify rows, and so on.
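For example, with a connection defined against the sample books database, a Table input step or an Execute SQL script step could run statements such as the following. This is only a sketch: the authors table and its columns are the illustrative ones described earlier and may not match your own schema:

-- Read rows through the connection
SELECT lastname, firstname, birthyear
FROM authors
WHERE country = 'American'
ORDER BY lastname;

-- Modify rows through the same connection
UPDATE authors
SET birthyear = 1947
WHERE lastname = 'King'
  AND firstname = 'Stephen';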
In this recipe you created the connection from the Database connections tree. You may
also create a connection by pressing the New... button in the Configuration window of any
database-related step in a transformation or job entry in a job. Alternatively, there is also a
wizard accessible from the Tools menu or by pressing the F3 key.
Whichever method you choose, a Settings window, like the one you saw in the recipe, shows
up, allowing you to define the connection. This task includes the following:
- Selecting a database engine (Connection Type:)
- Selecting the access method (Access:). Native (JDBC) is the recommended access method, but you can also use a predefined ODBC data source, a JNDI data source, or an Oracle OCI connection.
- Providing the Host name or IP
- Providing the database name
- Entering the username and password for accessing the database

A database connection can only be created with an open transformation or job. Therefore,
in the recipe you were asked to create a transformation. The same could have been achieved
by creating a job instead.

There's more...
The recipe showed the simplest way to create a database connection. However, there is more
to know about creating database connections.


Avoiding creating the same database connection over and over again
If you intend to use the same database in more than one transformation and/or job, it's
recommended that you share the connection. You do this by right-clicking on the database
connection under the Database connections tree and clicking on Share. This way the
database connection will be available to be used in all transformations and jobs. Shared
database connections are recognized because they appear in bold. As an example, take a
look at the following sample screenshot:


The databases books and sampledata are shared; the others are not.
The information about shared connections is saved in a file named shared.xml located in
the Kettle home directory.
