Z RESEARCH, Inc.
Commoditizing Supercomputing and Superstorage

Massive Distributed Storage over InfiniBand RDMA

© 2007 Z RESEARCH
What is GlusterFS?
GlusterFS is a cluster file system that aggregates multiple storage bricks over
InfiniBand RDMA into one large parallel network file system.

GlusterFS is MORE than making data available over a network or organizing data
on disk storage.
• Typical clustered file systems aggregate storage and provide unified views, but:
  - scalability comes with increased cost, reduced reliability, difficult
    management, and longer maintenance and recovery times
  - limited reliability means volume sizes are kept small
  - capacity and I/O performance can be limited
GlusterFS scales capacity and I/O using inexpensive, industry-standard modules.
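As a minimal sketch of how a client attaches to the aggregated volume: the GlusterFS client is pointed at a volume spec file and mounted through FUSE. The spec-file path and mount point below are illustrative, and flags may differ slightly between releases.

  # mount the aggregated volume described by a client-side spec file
  $ glusterfs -f /etc/glusterfs/glusterfs-client.vol /mnt/glusterfs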
GlusterFS Features
1. Fully POSIX compliant
2. Unified VFS
3. More flexible volume management with stackable features (see the server spec sketch after this list)
4. Application-specific scheduling / load balancing
   • round-robin; adaptive least usage (ALU); non-uniform file access (NUFA)
5. Automatic file replication (AFR), snapshot, and undelete
6. Striping for performance
7. Self-heal - no fsck needed
8. Pluggable transport modules (IB verbs, IB-SDP)
9. I/O accelerators - I/O threads, I/O cache, read-ahead and write-behind
10. Policy driven - user/group/directory-level quotas, access control lists (ACL)
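A minimal server-side spec sketch of how these features stack, assuming GlusterFS 1.3-era translator names; the export directory, volume names, and access rule are illustrative.

  volume posix1                          # on-disk storage brick
    type storage/posix
    option directory /export/brick1
  end-volume

  volume iot1                            # I/O-threads accelerator stacked on top
    type performance/io-threads
    subvolumes posix1
  end-volume

  volume server                          # export over a pluggable transport (IB verbs here)
    type protocol/server
    option transport-type ib-verbs/server
    subvolumes iot1
    option auth.ip.iot1.allow *          # allow all clients; restrict in production
  end-volume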
GlusterFS Design
GlusterFS Clustered Filesystem on x86-64 platform

[Architecture diagram]
• Client side: a cluster of clients (supercomputer, data center), each running the
  GLFS client with a clustered volume manager and a clustered I/O scheduler.
• Server side: storage bricks 1..N, each running GLFSD and exporting one or more
  volumes.
• Clients reach the bricks over RDMA (InfiniBand / GigE / 10GigE).
• Storage gateways run the GLFS client and re-export the file system over
  NFS/Samba on TCP/IP, for compatibility with MS Windows and other Unices.
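The storage gateways in the diagram simply re-export a local GlusterFS client mount; a sketch using standard NFS and Samba configuration, assuming the mount lives at /mnt/glusterfs (share name and export options are illustrative).

  # /etc/exports on a gateway: re-export the FUSE mount over NFS
  # (an explicit fsid is needed when exporting a FUSE file system)
  /mnt/glusterfs  *(rw,sync,no_subtree_check,fsid=10)

  # /etc/samba/smb.conf: share the same mount to MS Windows clients
  [gluster]
      path = /mnt/glusterfs
      read only = no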
Stackable Design

[Translator stack diagram]
• Client side: the GlusterFS client sits below the VFS and stacks translators such
  as I/O Cache, Read Ahead, and Unify.
• Transport: TCP/IP (GigE, 10GigE) or InfiniBand RDMA.
• Server side: bricks 1, 2, ... n each run a GlusterFS server with a Server
  protocol translator over a POSIX translator on a local Ext3 file system.
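A minimal client-side sketch of the stack above, assuming 1.3-era translator and option names; the host address and volume names are illustrative.

  volume brick1                          # remote brick reached over RDMA
    type protocol/client
    option transport-type ib-verbs/client
    option remote-host 10.0.0.1
    option remote-subvolume brick
  end-volume

  volume ra                              # read-ahead stacked on the remote brick
    type performance/read-ahead
    subvolumes brick1
  end-volume

  volume ioc                             # I/O cache stacked on read-ahead
    type performance/io-cache
    subvolumes ra
  end-volume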
GlusterFS Function - unify

[Diagram: unify with round-robin scheduling]
• Client view (unify/roundrobin): /files/aaa, /files/bbb, /files/ccc
• Each file is stored whole on one of Server/Head Node 1, 2, or 3; unify presents
  them to the client as a single namespace.
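A minimal sketch of the unify translator with round-robin scheduling, assuming 1.3-era syntax; ss1c, ss2c, ss3c are the protocol/client volumes for the three head nodes and ss-ns is a small namespace brick (names are illustrative).

  volume unify0
    type cluster/unify
    subvolumes ss1c ss2c ss3c
    option namespace ss-ns      # namespace volume required by unify
    option scheduler rr         # place new files round-robin across the nodes
  end-volume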
GlusterFS Function - unify + AFR

[Diagram: unify with round-robin scheduling plus automatic file replication]
• Client view (unify/roundrobin + AFR): /files/aaa, /files/bbb, /files/ccc
• AFR replicates each file onto more than one Server/Head Node, so a single node
  failure does not lose data; the client still sees one unified namespace.
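A minimal sketch of combining AFR with unify on the client side, assuming 1.3-era syntax; the replica layout and volume names are illustrative.

  volume afr1                    # mirror two bricks with automatic file replication
    type cluster/afr
    subvolumes ss1c ss2c
  end-volume

  volume unify0                  # unify the replicated pair with a third brick
    type cluster/unify
    subvolumes afr1 ss3c
    option namespace ss-ns
    option scheduler rr
  end-volume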
GlusterFS Function - stripe

[Diagram: striping across servers]
• Client view (stripe): /files/aaa, /files/bbb, /files/ccc
• Each file is split into stripes spread across Server/Head Nodes 1, 2, and 3, so
  large-file I/O is served by all nodes in parallel.
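A minimal sketch of the stripe translator, assuming 1.3-era syntax; the block-size pattern and value are illustrative.

  volume stripe0
    type cluster/stripe
    subvolumes ss1c ss2c ss3c
    option block-size *:1MB     # stripe unit applied to matching files
  end-volume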
I/O Scheduling
1. Round robin
2. Adaptive least usage (ALU)
3. NUFA
4. Random
5. Custom
volume bricks
  type cluster/unify
  subvolumes ss1c ss2c ss3c ss4c
  option scheduler alu
  option alu.limits.min-free-disk 60GB
  option alu.limits.max-open-files 10000
  option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
  option alu.disk-usage.entry-threshold 2GB    # Units in KB, MB and GB are allowed
  option alu.disk-usage.exit-threshold 60MB    # Units in KB, MB and GB are allowed
  option alu.open-files-usage.entry-threshold 1024
  option alu.open-files-usage.exit-threshold 32
  option alu.stat-refresh.interval 10sec
end-volume
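Any of the other schedulers can be selected in the same unify volume by swapping the scheduler option; a minimal sketch of the alternatives (pick one line; scheduler-specific sub-options are omitted here).

  option scheduler rr         # round-robin
  option scheduler nufa       # non-uniform file access: prefer the local brick
  option scheduler random     # random placement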
Benchmarks
GlusterFS Throughput & Scaling Benchmarks
Benchmark Environment
Method: Multiple 'dd' streams with varying block sizes are read and written from
multiple clients simultaneously (a sketch follows).
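A sketch of the kind of dd streams described above; the file names, block size, and count are illustrative rather than the exact values used in the benchmark.

  # write phase: each client writes its own file to the GlusterFS mount
  $ dd if=/dev/zero of=/mnt/glusterfs/ddfile.$(hostname) bs=1M count=4096

  # read phase: each client reads its file back
  $ dd if=/mnt/glusterfs/ddfile.$(hostname) of=/dev/null bs=1M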
GlusterFS Brick Configuration (16 bricks)
Processor - Dual Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
RAM - 8GB FB-DIMM
Linux Kernel - 2.6.18-5+em64t+ofed111 (Debian)
Disk - SATA-II 500GB
HCA - Mellanox MHGS18-XT/S InfiniBand HCA
Client Configuration (64 clients)
RAM - 4GB DDR2 (533 Mhz)
Processor - Single Intel(R) Pentium(R) D CPU 3.40GHz
Linux Kernel - 2.6.18-5+em64t+ofed111 (Debian)
Disk - SATA-II 500GB
HCA - Mellanox MHGS18-XT/S InfiniBand HCA
Interconnect Switch: Voltaire port InfiniBand Switch (14U)
GlusterFS version 1.3.pre0-BENKI
GlusterFS Performance
Aggregated I/O benchmark on 16 bricks (servers) and 64 clients over the IB Verbs transport.
• Peak aggregated read throughput was 13 GBps.
• Beyond a certain threshold, write performance plateaus because of the disk I/O bottleneck.
• System memory greater than the peak load ensures the best possible performance.
• The ib-verbs transport driver is about 30% faster than the ib-sdp transport driver.
Scalability
Performance improves as the number of bricks is increased.
Throughput rises with each corresponding increase in servers, from 1 to 16.
GlusterFS Value Proposition
✓ A single solution for 10s of terabytes to petabytes
✓ No single point of failure - completely distributed, no centralized meta-data
✓ Non-stop storage - can withstand hardware failures; self-healing; snapshots
✓ Data easily recovered even without GlusterFS
✓ Customizable schedulers
✓ User friendly - installs and upgrades in minutes
✓ Operating system agnostic
✓ Extremely cost effective - deployed on any x86-64 hardware
Thank You!
Backup Slides
GlusterFS vs Lustre Benchmark
Benchmark Environment
Brick Config (10 bricks)
Processor - 2 x AMD Dual-Core Opteron™ Model 275 processors
RAM - 6 GB
Interconnect - InfiniBand 20 Gb/s - Mellanox MT25208 InfiniHost III Ex
Hard disk - Western Digital Corp. WD1200JB-00REA0, ATA DISK drive
Client Config (20 clients)
Processor - 2 x AMD Dual-Core Opteron™ Model 275 processors
RAM - 6 GB
Interconnect - InfiniBand 20 Gb/s - Mellanox MT25208 InfiniHost III Ex
Hard disk - Western Digital Corp. WD1200JB-00REA0, ATA DISK drive
Software Version
Operating System - Red Hat Enterprise GNU/Linux 4 (Update 3)
Linux version - 2.6.9-42
Lustre version - 1.4.9.1
GlusterFS version - 1.3-pre2.3
Directory Listing Benchmark
[Bar chart: GlusterFS vs Lustre - Directory Listing. Time in seconds, lower is
better. Lustre: 1.7 s; GlusterFS: 1.5 s]
$ find /mnt/glusterfs
"find" command navigates across the directory tree structure and prints them to console.
In this case, there were thirteen thousand binary files.
Note: Commands are same for both GlusterFS and Lustre, except the directory part.
Copy Local to Cluster File System
[Bar chart: GlusterFS vs Lustre - Copy Local to Cluster File System. Time in
seconds, lower is better. Lustre: 37 s; GlusterFS: 26 s]
$ cp -r /local/* /mnt/glusterfs/
The cp utility is used to copy files and directories. 12,039 files (595 MB) were
copied into the cluster file system.
Copy Local from Cluster File System
[Bar chart: GlusterFS vs Lustre - Copy Local from Cluster File System. Time in
seconds, lower is better. Lustre: 45 s; GlusterFS: 18 s]
$ cp -r /mnt/glusterfs/* /local/
The cp utility is used to copy files and directories. 12,039 files (595 MB) were
copied from the cluster file system.
Checksum
[Bar chart: GlusterFS vs Lustre - Checksum. Time in seconds, lower is better.
Lustre: 45.1 s; GlusterFS: 44.4 s]
An md5sum calculation is performed for every file across the file system. In this
case there were thirteen thousand binary files.
$ find . -type f -exec md5sum {} \;
Base64 Conversion
[Bar chart: GlusterFS vs Lustre - Base64 Conversion. Time in seconds, lower is
better. Lustre: 25.7 s; GlusterFS: 25.1 s]
Base64 is an algorithm for encoding binary to ASCII and vice-versa. This benchmark was
performed on a 640 MB binary file.
$ base64 encode big-file big-file.base64
Pattern Search
[Bar chart: GlusterFS vs Lustre - Pattern Search. Time in seconds, lower is better.
Lustre: 54.3 s; GlusterFS: 52.1 s]
The grep utility searches a file for a pattern and prints the matching lines to the
console. This benchmark used a 1 GB ASCII Base64 file.
$ grep GNU big-file.base64
Data Compression
[Bar chart: GlusterFS vs Lustre - Data Compression and Decompression. Time in
seconds, lower is better. Reported values: 18.3, 16.5, 14.8, and 10.1 s across the
compression and decompression runs; GlusterFS was faster on both]
The GNU gzip utility compresses files using Lempel-Ziv coding. This benchmark was
performed on a 1 GB TAR binary file.
$ gzip big-file.tar
$ gunzip big-file.tar.gz
Apache Web Server
[Bar chart: GlusterFS vs Lustre - Apache Web Server. Time in minutes, lower is
better. GlusterFS: 3.17 min; Lustre failed to complete]
Apache served 12,039 files (595 MB) over HTTP; a wget client fetched the files
recursively.
** Lustre failed after downloading 33 MB out of 585 MB in 11 minutes.