
Thesis: A Study and Comparison of Several XML Compression Techniques




























- 2014














Field
Specialization
Code: 60480103










- 2014





 




ình.












 

. 


- - 




công tác.
 












Declaration
Acknowledgements
Table of contents
List of symbols and abbreviations
List of tables
List of figures
List of charts
Introduction
Chapter 1. OVERVIEW OF XML
1.1. Overview of XML
1.2. Characteristics of XML
1.3. Comparison of XML and HTML
1.3.1. Similarities between XML and HTML
1.3.2. Differences between XML and HTML
1.4. Structure of an XML document
1.5. Syntax
1.5.1. XML declaration
1.5.2. Document instance
1.5.3. Attributes
1.5.4. Document type declaration
1.6. Document type definition
1.7. XML Schema language
1.8. XSLT
Chapter 2. OVERVIEW OF DATA COMPRESSION
2.1. Data compression
2.1.1.
2.1.2. Classification
2.1.2.1. Lossy compression
2.1.2.2. Lossless compression
2.1.3. Some concepts
2.1.3.1. Compression ratio
2.1.3.2. Compression performance
2.1.3.3. Data redundancy
2.2. XML compression techniques
2.2.1.
2.2.1.1. General-purpose compression techniques
2.2.1.2. Non-queriable XML compression techniques
2.2.1.3. Queriable XML compression techniques
Chapter 3. SEVERAL XML COMPRESSION TECHNIQUES
3.1. XMill
3.1.1. Overview of XMill
3.1.2. Architecture of XMill
3.1.2.1. Separating structure from content
3.1.2.2. Grouping data values based on semantics
3.1.2.3. Semantic compressors
3.2. XGrind
3.2.1. Overview of XGrind
3.2.2. Techniques used in XGrind
3.2.2.1. Metadata compression
3.2.2.2. Compression of enumerated-type attribute values
3.2.2.3. Compression of general element or attribute values
3.2.3. Homomorphic compression
3.2.4. Architecture of XGrind
3.3. XAUST
3.3.1. Overview of XAUST
3.3.2. Arithmetic coding and finite context modeling
3.3.2.1. Arithmetic coding
3.3.2.2. Finite context modeling
3.3.3. Deterministic finite automata
3.3.4. Compression and decompression using XAUST
3.4. XSAQCT
3.4.1. Overview of XSAQCT
3.4.2. Architecture of XSAQCT
3.4.3. Processing attributes and mixed document content
3.4.4. Implementation of XSAQCT
3.4.4.1. Building the annotated tree T_{A,D}
3.4.4.1.1. Characteristics of the annotated tree T_{A,D}
3.4.4.1.2. Implementing the annotated tree T_{A,D}
3.4.4.2. Decompression process of XSAQCT
3.4.4.2.1. Reannotator
3.4.4.2.2. Restorer
3.5. EXI
3.5.1. Overview of EXI
3.5.2. EXI Header
3.5.2.1. EXI Cookie
3.5.2.2. Distinguishing bits
3.5.2.3. Presence bit for EXI options
3.5.2.4. EXI format version
3.5.2.5. EXI Options
3.5.2.6. Padding bits
3.5.3. EXI Body
3.5.3.1. Event Code
3.5.3.2. Event Content
3.5.4. String Table
3.5.5. EXI Grammar
3.5.5.1. Built-In Grammar
3.5.5.2. Schema-informed Grammar
3.5.6. EXI compression process
3.5.6.1. Block
3.5.6.2. Channel
3.5.6.2.1. Structure Channel
3.5.6.2.2. Value Channel
3.5.6.3. Compressed Stream
Chapter 4. IMPLEMENTATION, EXPERIMENTS AND COMPARISON OF SEVERAL COMPRESSION TECHNIQUES
4.1. Test data
4.2.
4.3.
4.3.1.
4.3.2. Compression performance
4.3.3. Compression time
4.3.4.
4.4. Experimental results
CONCLUSION AND FUTURE WORK
REFERENCES






DFA: Deterministic Finite Automata
DTD: Document Type Definition
GPS: Global Positioning System
HTML: HyperText Markup Language
SGML: Standard Generalized Markup Language
XML: Extensible Markup Language
XSD: XML Schema Definition Language
XSLT: Extensible Stylesheet Language Transformations




 2
n ca mt tài liu XML 3
 nén không truy vn [16] 14
 các bn [16] 15
 21
 x lý ng  (Atomic Semantic Compressors) [11] 22
 c thc hin thut toán 3.1 khi to mi cây chú thích ca tài liu D
trong hình 3.13 [20] 41
t s phiên bnh dng EXI 48
 chn EXI [6] 48
 50
 kic thit lp và tùy
chn bit-c s dng [6] 52
 kic thit lp bng true và giá
tr pre-compression ca tùy chn byte-c s dng [6] 52
u d lic xây dng sn trong EXI [6] 53
t lp phân vùng ca String Table [7] 54
p d liu th nghim 63
t thc nghim. 64
t qu thc nghim khi s dng b nén gzip 65
t qu thc nghim khi s dng b nén XMill 65
t qu thc nghim khi s dng b nén XGrind 65
t qu thc nghim khi s dng b nén XAUST 66
t qu thc nghim khi s dng b nén EXI (Exificient) 66






c s d chuyi gia các tài liu XML [18] 6
c s d chuyi mt tài liu XML sang các cách
biu din khác nhau [18] 7
Hình 2.1: Quá trình nén/gii nén d liu 9
Hình 2.2: Quá trình truyn d liu XML mà không có quá trình nén XML [17] 11
Hình 2.3: Quá trình truyn d liu XML có s dng quá trình nén XML [17] 11
Hình 2.4: Phân loi các b nén XML da vào s nhn bit cu trúc ca các tài liu XML
[17] 12
Hình 2.5: Phân loi b nén XML da vào s h tr kh n [17] 13
Hình 3.1: Kin trúc ca XMill [11] 18
Hình 3.2: Mô t quá trình XMill phân tách cu trúc và d liu 19
Hình 3.3: Kin trúc ca b nén XGrind [15] 27
Hình 3.4: DFA ca phn t card trong ví d 3.14 30
Hình 3.5: Kin trúc ca XSAQCT [20] 34
Hình 3.6: Minh ha mt tài lin [20] 35
Hình 3.7: Cây chú thích T
A,D
ca tài liu D trong hình 3.6 [20] 35
Hình 3.8: Quá trình x lý ni dung tài lic trn [20] 36
Hình 3.9: Cây chú thích T
A,D
và các b chn [20] 36
Hình 3.10: Biu din mt tài liu D có chn [20] 38
Figure 3.11: Annotated tree of document D with added dummy nodes
Figure 3.12: Restoring the tree of document D with dummy nodes
Figure 3.13: A document D to which Algorithm 3.1 will be applied [20]
Figure 3.14: Annotated tree of document D in Figure 3.13 [20]
Figure 3.15: Complete annotated tree of document D in Figure 3.13 [20]
Figure 3.16: Restoring the tree of document D from the annotated tree T_{A,D} in Figure 3.15 [20]
Figure 3.17: Basic structure of an EXI Stream [7]
Figure 3.18: EXI Header format [6]
Figure 3.19: EXI Cookie [6]
Figure 3.20: Distinguishing bits [6]
Figure 3.21: EXI Stream Body of the Notebook document in Example 3.23 [7]
Figure 3.22: Initial entries in the URI partition [7]
Figure 3.23: Initial entries in the Prefix partition [7]
Figure 3.24: Initial entries in the LocalName partition [7]
Figure 3.25: Entries initialized in the Value partition [7]
Figure 3.26: Overview of the EXI compression process [6]
Figure 3.27: Channel multiplexing of EXI events [6]
Figure 3.28: Compression of the EXI Body Stream in Figure 3.21 [7]




Bi 66
Bi 67
Binh th 67
Bi nén gzip, XMill, XGrind, XAUST và
EXI 68






XML (Extensible Markup Language) is verbose, which makes compression of XML documents worthwhile. This thesis studies and compares several XML compression techniques: XMill [11,23], XGrind [15,22], XAUST [10,21], XSAQCT [19-20], and EXI [6-8,13].
  

 
The thesis is organized into 4 chapters. Chapter 1 gives an overview of XML: document structure and declaration syntax, the XML Schema Definition Language (XSD), and Extensible Stylesheet Language Transformations (XSLT).
 Tchung
o 
.
  G            



              

 




 
, 

XSLT.
1.1. Overview of XML

 


XML is derived from SGML (Standard Generalized Markup Language).
1.2. Characteristics of XML


máy tính. XML cho 
  XML 
 
 
 . 


  

1.3. Comparison of XML and HTML
1.3.1. Similarities between XML and HTML
XML and HTML are both markup languages descended from SGML.
1.3.2. Differences between XML and HTML
Table 1.1: Comparison of XML and HTML
XML
HTML


khác nhau.
       



        
        



      









1.4. Structure of an XML document
An XML document consists of an optional prolog, which carries information such as declarations, followed by a root element and its child elements.
Table 1.2: Basic components of an XML document
Prolog (optional)
  XML declaration:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
  Document type declaration:
    <!DOCTYPE document SYSTEM "tutorials.dtd">
  Comment:
    <!-- Here is a comment -->
  Processing instructions:
    <?xml-stylesheet type="text/css" href="myStyles.css"?>
Root element and its child elements
    <tutorials>
      <tutorial>
        <name>XML Tutorial</name>
        <url>...</url>
      </tutorial>
      <tutorial>
        <name>HTML Tutorial</name>
        <url>...</url>
      </tutorial>
    </tutorials>
  [25]
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE document system "tutorials.dtd">
<! Here is a comment >
<?xml-stylesheet type="text/css" href="myStyles.css"?>
<tutorials>
4


<tutorial>
<name>XML Tutorial</name>
<url>
</tutorial>
<tutorial>
<name>HTML Tutorial</name>
<url>
</tutorial>
</tutorials>
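
The structure described above (prolog, then a single root element with children) can be checked mechanically. The following is a minimal Python sketch using only the standard library; the `<url>` values and the helper name `tutorial_names` are assumptions made for illustration, since the original URLs are not shown in the text.

```python
import xml.etree.ElementTree as ET

# Illustrative document modelled on the example above. The <url> values are
# placeholders (assumptions); the originals are not given in the text.
DOC = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE document SYSTEM "tutorials.dtd">
<!-- Here is a comment -->
<?xml-stylesheet type="text/css" href="myStyles.css"?>
<tutorials>
  <tutorial>
    <name>XML Tutorial</name>
    <url>http://example.com/xml</url>
  </tutorial>
  <tutorial>
    <name>HTML Tutorial</name>
    <url>http://example.com/html</url>
  </tutorial>
</tutorials>
"""


def tutorial_names(xml_bytes):
    """Parse the document; return the root tag and the text of every <name>."""
    # ET.fromstring raises ET.ParseError when the input is not well-formed.
    # It checks well-formedness only; the external DTD is not loaded.
    root = ET.fromstring(xml_bytes)
    return root.tag, [t.findtext("name") for t in root.findall("tutorial")]


root_tag, names = tutorial_names(DOC)
print(root_tag, names)
```

The parser skips the prolog items (declaration, DOCTYPE, comment, processing instruction) and hands back the root element, which mirrors the split between prolog and document instance.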
1.5. Syntax
1.5.1. XML declaration

Example: XML declaration
<?xml version="1.0"?>
1.5.2. Document instance

Elements in the document instance are nested in a parent-child hierarchy.
 
<book>
<title>Joy of Integration</title>
<author>Joe Smith</author>
</book>
1.5.3. Attributes

<book category="Fiction">

</book>
1.5.4. Document type declaration

<!DOCTYPE book SYSTEM "book.dtd">
1.6. Document type definition


-
 
  

 
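<!DOCTYPE book [
<!ELEMENT book (title, author)>
<!-- #IMPLIED (optional attribute) is an assumption; the default declaration is missing in the source -->
<!ATTLIST book category (Fiction | Non-Fiction) #IMPLIED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
]>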
1.7. XML Schema language

Compared with DTD, XML Schema (XSD) supports namespaces and offers more advanced features such as keys, the any wildcard, and modularization through include and import.

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="book">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="title" type="xsd:string"/>
        <xsd:element name="author" type="xsd:string"/>
      </xsd:sequence>
      <xsd:attribute name="category">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="Fiction"/>
            <xsd:enumeration value="Non-Fiction"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
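
To make the constraints expressed by the schema concrete, the sketch below hand-checks them in Python. It is not an XSD validator (the Python standard library does not include one); the function name `check_book` and the error messages are hypothetical, and only the rules visible in the schema above are covered: a `book` root, a `title` followed by an `author`, and an optional `category` restricted to two values.

```python
import xml.etree.ElementTree as ET

# Values allowed by the enumeration in the schema above.
ALLOWED_CATEGORIES = {"Fiction", "Non-Fiction"}


def check_book(xml_text):
    """Return a list of rule violations for a <book> document (empty if none).

    Hand-written approximation of the schema rules, not real XSD validation.
    """
    book = ET.fromstring(xml_text)
    problems = []
    if book.tag != "book":
        problems.append("root element must be <book>")
    # xsd:sequence: exactly one <title> followed by exactly one <author>.
    if [child.tag for child in book] != ["title", "author"]:
        problems.append("children must be <title> then <author>")
    # The attribute is optional, but if present it must be an enumerated value.
    category = book.get("category")
    if category is not None and category not in ALLOWED_CATEGORIES:
        problems.append("category must be Fiction or Non-Fiction")
    return problems


good = ('<book category="Fiction"><title>Joy of Integration</title>'
        '<author>Joe Smith</author></book>')
bad = '<book category="Poetry"><author>Joe Smith</author></book>'

print(check_book(good))
print(check_book(bad))
```

A real deployment would delegate this to a schema-aware parser; the point here is only to show which properties of the instance document the schema constrains.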
1.8. XSLT

XSLT stands for Extensible Stylesheet Language Transformations.
XSL 




Figure 1.1: XSLT stylesheets used to transform between XML documents [18]

               







Figure 1.2: Transforming an XML document into different representations [18]


 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="book.xsl"?>
<inventory>
  <book category="Fiction">
    <title>Anna Karenina</title>
    <author>Leo Tolstoy</author>
  </book>
  <book category="Non-Fiction">
    <title>Integration for Dummies</title>
    <author>John Doe</author>
  </book>
</inventory>
Example: the stylesheet book.xsl [18]
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:template match="/">
    <xsl:apply-templates />
  </xsl:template>

  <xsl:template match="inventory">
    <table border="1">
      <xsl:for-each select="book">
        <tr>
          <td><xsl:value-of select="@category"/></td>
          <td><xsl:value-of select="title"/></td>
          <td><xsl:value-of select="author"/></td>
        </tr>
      </xsl:for-each>
    </table>
  </xsl:template>
</xsl:stylesheet>

sau:
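
The effect of the stylesheet can be approximated without an XSLT processor. The Python sketch below is not XSLT; it is a hand-written equivalent of the `inventory` template, built on the standard library's ElementTree, and the function name `inventory_to_table` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

INVENTORY = """<inventory>
  <book category="Fiction">
    <title>Anna Karenina</title>
    <author>Leo Tolstoy</author>
  </book>
  <book category="Non-Fiction">
    <title>Integration for Dummies</title>
    <author>John Doe</author>
  </book>
</inventory>"""


def inventory_to_table(xml_text):
    """Build <table border="1"> with one <tr> per <book>, as the template does."""
    inventory = ET.fromstring(xml_text)
    table = ET.Element("table", {"border": "1"})
    for book in inventory.findall("book"):        # xsl:for-each select="book"
        row = ET.SubElement(table, "tr")
        cells = (
            book.get("category", ""),             # value-of select="@category"
            book.findtext("title", ""),           # value-of select="title"
            book.findtext("author", ""),          # value-of select="author"
        )
        for value in cells:
            ET.SubElement(row, "td").text = value
    return ET.tostring(table, encoding="unicode")


html = inventory_to_table(INVENTORY)
print(html)
```

Each book becomes one table row with three cells (category, title, author), which is the shape of output the stylesheet describes.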


 XML là 
 

, 
, .
In XML, a DTD defines the structure of a document. Beyond what DTD offers, XSD supports namespaces and modularization.




 

nén 

i

2.1. Data compression
2.1.1. 



Figure 2.1: The data compression/decompression process
2.1.2. Classification
Data compression is classified into lossy compression and lossless compression.
2.1.2.1. Lossy compression



Examples: JPEG [4], MPEG-4 [4].
2.1.2.2. Lossless compression
Lossless compression reconstructs the original information exactly. It is used for data such as Excel files. Examples: Shannon-Fano [4], LZW [4].
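
The defining property of lossless compression, exact reconstruction, can be demonstrated with a short round trip. This is a minimal Python sketch; it uses `zlib` (DEFLATE) from the standard library rather than the Shannon-Fano or LZW methods named above, and the sample data and the helper name `roundtrip` are assumptions made for illustration.

```python
import zlib


def roundtrip(data: bytes):
    """Compress then decompress; return both the compressed and restored bytes."""
    compressed = zlib.compress(data, 9)   # level 9 = strongest compression
    restored = zlib.decompress(compressed)
    return compressed, restored


# Repetitive XML-like sample data (illustrative only).
original = (b"<book><title>Joy of Integration</title>"
            b"<author>Joe Smith</author></book>") * 50

compressed, restored = roundtrip(original)
print(len(original), len(compressed), restored == original)
```

A lossy codec such as JPEG offers no such guarantee: the decoded output only approximates the input.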


2.1.3. Some concepts
2.1.3.1. Compression ratio


 



2.1.3.2. Compression performance

 

  

 
2.1.3.3. Data redundancy

 


 



 
].

 

nhau.  
 

 


LZW (Lempel-Ziv-Welch).
 






khác nhau.  



        

2.2. XML compression techniques
2.2.1. Tí






Figure 2.2: XML data transmission without XML compression [17]

Figure 2.3: XML data transmission with XML compression [17]



Based on their awareness of XML document structure, XML compressors fall into two groups: general text compressors and XML-conscious compressors.


General-purpose text compressors, such as gzip and bzip2 [2], treat an XML document as plain text. XML-conscious compressors exploit knowledge of the XML structure; they are further divided into schema-independent compressors and schema-dependent compressors.


Figure 2.4: Classification of XML compressors by awareness of XML document structure [17]




XML compressors are also classified by query support into non-queriable XML compressors and queriable XML compressors.





 



       
 





-homomorphic compressors), quá trình mã

 ].

Figure 2.5: Classification of XML compressors by query support [17]
2.2.1.1. General-purpose compression techniques
General-purpose compressors include gzip, bzip2 and PPM [4].


 4
] và mã hóa Huffman.
 -Wheeler (Burrows-

-to-font (move-to-
hóa Huffman. 

PPM 






  

 
 trong khi 
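
The general-purpose compressors discussed in this section can be tried directly on XML text. The sketch below is a minimal Python illustration using the standard `gzip` and `bz2` modules (PPM has no standard-library implementation and is left out). The sample generator `make_sample` is hypothetical, and the compression ratio is assumed here to mean compressed size divided by original size; other definitions exist, and the definition used in the thesis may differ.

```python
import bz2
import gzip


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Assumed definition: compressed size / original size (smaller is better)."""
    return len(compressed) / len(original)


def make_sample(n_records: int = 2000) -> bytes:
    """Generate a repetitive XML document (illustrative data, not the test set)."""
    records = "".join(
        f'<book category="Fiction"><title>Title {i}</title>'
        f"<author>Author {i % 7}</author></book>"
        for i in range(n_records)
    )
    return f"<inventory>{records}</inventory>".encode("utf-8")


sample = make_sample()
results = {
    "gzip": compression_ratio(sample, gzip.compress(sample)),
    "bzip2": compression_ratio(sample, bz2.compress(sample)),
}
for name, ratio in results.items():
    print(f"{name}: {ratio:.3f}")
```

Because the tags repeat in every record, both compressors shrink the sample considerably; the exact figures depend on the data and should not be read as a benchmark.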


2.2.1.2. Non-queriable XML compression techniques

 t. Nhóm này có
 c c.
Trong  

 quá

Table 2.1: List of non-queriable compressors [16]

XMill      No    Dictionary-Based                Gzip, Bzip2, PPM
XWRT       No    Dictionary-Based                Gzip, Bzip2, PPM
XComp      No    Dictionary-Based                Gzip
XMLPPM     No    Multiplexed Hierarchical PPM    PPM
SCMPPM     No    Dictionary-Based                PPM
Exalt      No    Context-Free Grammars           KY Codes
AXECHOP    No    Context-Free Grammars           BWT+MPM
