id sid tid token lemma pos 3222 1 1 THE the DT 3222 1 2 EFFiCiENT EFFiCiENT NNP 3222 1 3 SToraGE storage CD 3222 1 4 oF oF NNP 3222 1 5 TExT text NN 3222 1 6 DoCuMENTS documents NN 3222 1 7 iN in IN 3222 1 8 DiGiTal DiGiTal NNP 3222 1 9 liBrariES libraries NN 3222 1 10 | | NNP 3222 1 11 SkibiŃSki SkibiŃSki NNP 3222 1 12 and and CC 3222 1 13 Swacha Swacha NNP 3222 1 14 143 143 CD 3222 1 15 Przemysław Przemysław NNP 3222 1 16 skibiński skibiński NN 3222 1 17 and and CC 3222 1 18 Jakub Jakub NNP 3222 1 19 swacha swacha VBD 3222 1 20 The the DT 3222 1 21 Efficient efficient JJ 3222 1 22 Storage storage NN 3222 1 23 of of IN 3222 1 24 Text Text NNP 3222 1 25 Documents Documents NNP 3222 1 26 in in IN 3222 1 27 Digital Digital NNP 3222 1 28 Libraries Libraries NNPS 3222 1 29 przemysław przemysław NNP 3222 1 30 Skibiński Skibiński NNP 3222 1 31 ( ( -LRB- 3222 1 32 inikep@ii.uni.wroc.pl inikep@ii.uni.wroc.pl NN 3222 1 33 ) ) -RRB- 3222 1 34 is be VBZ 3222 1 35 [ [ -LRB- 3222 1 36 QY QY NNP 3222 1 37 : : : 3222 1 38 title title NN 3222 1 39 ? ? . 3222 2 1 ] ] -RRB- 3222 2 2 , , , 3222 2 3 institute institute NN 3222 2 4 of of IN 3222 2 5 Computer Computer NNP 3222 2 6 Science Science NNP 3222 2 7 , , , 3222 2 8 University University NNP 3222 2 9 of of IN 3222 2 10 wrocław wrocław NNP 3222 2 11 , , , 3222 2 12 Poland Poland NNP 3222 2 13 . . . 3222 3 1 Jakub Jakub NNP 3222 3 2 Swacha Swacha NNP 3222 3 3 ( ( -LRB- 3222 3 4 jakubs@uoo.univ.szczecin.pl jakubs@uoo.univ.szczecin.pl NNP 3222 3 5 ) ) -RRB- 3222 3 6 is be VBZ 3222 3 7 [ [ -LRB- 3222 3 8 QY QY NNP 3222 3 9 : : : 3222 3 10 title title NN 3222 3 11 ? ? . 3222 4 1 ] ] -RRB- 3222 4 2 , , , 3222 4 3 institute institute NNP 3222 4 4 of of IN 3222 4 5 information information NNP 3222 4 6 Technology Technology NNP 3222 4 7 in in IN 3222 4 8 Management Management NNP 3222 4 9 , , , 3222 4 10 University University NNP 3222 4 11 of of IN 3222 4 12 Szczecin Szczecin NNP 3222 4 13 , , , 3222 4 14 Poland Poland NNP 3222 4 15 . . . 
3222 5 1 Przemysław Przemysław NNP 3222 5 2 Skibiński Skibiński NNP 3222 5 3 and and CC 3222 5 4 Jakub Jakub NNP 3222 5 5 Swacha Swacha NNP 3222 5 6 The the DT 3222 5 7 Efficient efficient JJ 3222 5 8 Storage storage NN 3222 5 9 of of IN 3222 5 10 Text Text NNP 3222 5 11 Documents document NNS 3222 5 12 in in IN 3222 5 13 Digital Digital NNP 3222 5 14 Libraries Libraries NNPS 3222 5 15 In in IN 3222 5 16 this this DT 3222 5 17 paper paper NN 3222 5 18 we -PRON- PRP 3222 5 19 investigate investigate VBP 3222 5 20 the the DT 3222 5 21 possibility possibility NN 3222 5 22 of of IN 3222 5 23 improv- improv- NNP 3222 5 24 ing e VBG 3222 5 25 the the DT 3222 5 26 efficiency efficiency NN 3222 5 27 of of IN 3222 5 28 data datum NNS 3222 5 29 compression compression NN 3222 5 30 , , , 3222 5 31 and and CC 3222 5 32 thus thus RB 3222 5 33 reduc- reduc- NNP 3222 5 34 ing ing NN 3222 5 35 storage storage NN 3222 5 36 requirements requirement NNS 3222 5 37 , , , 3222 5 38 for for IN 3222 5 39 seven seven CD 3222 5 40 widely widely RB 3222 5 41 used use VBN 3222 5 42 text text NN 3222 5 43 document document NN 3222 5 44 formats format NNS 3222 5 45 . . . 3222 6 1 We -PRON- PRP 3222 6 2 propose propose VBP 3222 6 3 an an DT 3222 6 4 open open JJ 3222 6 5 - - HYPH 3222 6 6 source source NN 3222 6 7 text text NN 3222 6 8 compression compression NN 3222 6 9 software software NN 3222 6 10 library library NN 3222 6 11 , , , 3222 6 12 featuring feature VBG 3222 6 13 an an DT 3222 6 14 advanced advanced JJ 3222 6 15 word word NN 3222 6 16 - - HYPH 3222 6 17 substitution substitution NN 3222 6 18 scheme scheme NN 3222 6 19 with with IN 3222 6 20 static static JJ 3222 6 21 and and CC 3222 6 22 semidynamic semidynamic JJ 3222 6 23 word word NN 3222 6 24 dictionaries dictionary NNS 3222 6 25 . . . 
3222 7 1 The the DT 3222 7 2 empirical empirical JJ 3222 7 3 results result NNS 3222 7 4 show show VBP 3222 7 5 an an DT 3222 7 6 average average JJ 3222 7 7 storage storage NN 3222 7 8 space space NN 3222 7 9 reduction reduction NN 3222 7 10 as as RB 3222 7 11 high high JJ 3222 7 12 as as IN 3222 7 13 78 78 CD 3222 7 14 percent percent NN 3222 7 15 compared compare VBN 3222 7 16 to to IN 3222 7 17 uncompressed uncompressed JJ 3222 7 18 documents document NNS 3222 7 19 , , , 3222 7 20 and and CC 3222 7 21 as as RB 3222 7 22 high high JJ 3222 7 23 as as IN 3222 7 24 30 30 CD 3222 7 25 percent percent NN 3222 7 26 com- com- NN 3222 7 27 pared pare VBD 3222 7 28 to to IN 3222 7 29 documents document NNS 3222 7 30 compressed compress VBN 3222 7 31 with with IN 3222 7 32 the the DT 3222 7 33 free free JJ 3222 7 34 compression compression NN 3222 7 35 software software NN 3222 7 36 gzip gzip NNP 3222 7 37 . . . 3222 8 1 I -PRON- PRP 3222 8 2 t t VBP 3222 8 3 is be VBZ 3222 8 4 hard hard JJ 3222 8 5 to to TO 3222 8 6 expect expect VB 3222 8 7 the the DT 3222 8 8 continuing continue VBG 3222 8 9 rapid rapid JJ 3222 8 10 growth growth NN 3222 8 11 of of IN 3222 8 12 global global JJ 3222 8 13 information information NN 3222 8 14 volume volume NN 3222 8 15 not not RB 3222 8 16 to to TO 3222 8 17 affect affect VB 3222 8 18 digital digital IN 3222 8 19 libraries.1 libraries.1 PDT 3222 8 20 The the DT 3222 8 21 growth growth NN 3222 8 22 of of IN 3222 8 23 stored stored JJ 3222 8 24 information information NN 3222 8 25 volume volume NN 3222 8 26 means mean VBZ 3222 8 27 growth growth NN 3222 8 28 in in IN 3222 8 29 storage storage NN 3222 8 30 requirements requirement NNS 3222 8 31 , , , 3222 8 32 which which WDT 3222 8 33 poses pose VBZ 3222 8 34 a a DT 3222 8 35 problem problem NN 3222 8 36 in in IN 3222 8 37 both both CC 3222 8 38 technological technological JJ 3222 8 39 and and CC 3222 8 40 economic economic JJ 3222 8 41 terms term NNS 3222 8 42 . . . 
3222 9 1 Fortunately fortunately RB 3222 9 2 , , , 3222 9 3 the the DT 3222 9 4 digi- digi- NNP 3222 9 5 tal tal NNP 3222 9 6 librarys librarys NNP 3222 9 7 ’ ' '' 3222 9 8 hunger hunger NN 3222 9 9 for for IN 3222 9 10 resources resource NNS 3222 9 11 can can MD 3222 9 12 be be VB 3222 9 13 tamed tame VBN 3222 9 14 with with IN 3222 9 15 data datum NNS 3222 9 16 compression.2 compression.2 CD 3222 9 17 The the DT 3222 9 18 primary primary JJ 3222 9 19 motivation motivation NN 3222 9 20 for for IN 3222 9 21 our -PRON- PRP$ 3222 9 22 research research NN 3222 9 23 was be VBD 3222 9 24 to to TO 3222 9 25 limit limit VB 3222 9 26 the the DT 3222 9 27 data datum NNS 3222 9 28 storage storage NN 3222 9 29 requirements requirement NNS 3222 9 30 of of IN 3222 9 31 the the DT 3222 9 32 student student NN 3222 9 33 thesis thesis NN 3222 9 34 elec- elec- NNP 3222 9 35 tronic tronic NNP 3222 9 36 archive archive NNP 3222 9 37 in in IN 3222 9 38 the the DT 3222 9 39 Institute Institute NNP 3222 9 40 of of IN 3222 9 41 Information Information NNP 3222 9 42 Technology Technology NNP 3222 9 43 in in IN 3222 9 44 Management Management NNP 3222 9 45 at at IN 3222 9 46 the the DT 3222 9 47 University University NNP 3222 9 48 of of IN 3222 9 49 Szczecin Szczecin NNP 3222 9 50 . . . 3222 10 1 The the DT 3222 10 2 current current JJ 3222 10 3 regulations regulation NNS 3222 10 4 state state VBP 3222 10 5 that that IN 3222 10 6 every every DT 3222 10 7 thesis thesis NN 3222 10 8 should should MD 3222 10 9 be be VB 3222 10 10 submitted submit VBN 3222 10 11 in in IN 3222 10 12 both both DT 3222 10 13 printed printed JJ 3222 10 14 and and CC 3222 10 15 electronic electronic JJ 3222 10 16 form form NN 3222 10 17 . . . 
3222 11 1 The the DT 3222 11 2 latter latter JJ 3222 11 3 facilitates facilitate NNS 3222 11 4 automated automate VBN 3222 11 5 processing processing NN 3222 11 6 of of IN 3222 11 7 the the DT 3222 11 8 documents document NNS 3222 11 9 for for IN 3222 11 10 purposes purpose NNS 3222 11 11 such such JJ 3222 11 12 as as IN 3222 11 13 plagiarism plagiarism NN 3222 11 14 detection detection NN 3222 11 15 or or CC 3222 11 16 statistical statistical JJ 3222 11 17 language language NN 3222 11 18 analy- analy- NNP 3222 11 19 sis sis NNP 3222 11 20 . . . 3222 12 1 Considering consider VBG 3222 12 2 the the DT 3222 12 3 introduction introduction NN 3222 12 4 of of IN 3222 12 5 the the DT 3222 12 6 three three CD 3222 12 7 - - HYPH 3222 12 8 cycle cycle NN 3222 12 9 higher high JJR 3222 12 10 education education NN 3222 12 11 system system NN 3222 12 12 ( ( -LRB- 3222 12 13 bachelor bachelor NN 3222 12 14 / / SYM 3222 12 15 master master NNP 3222 12 16 / / SYM 3222 12 17 doctorate doctorate NNP 3222 12 18 ) ) -RRB- 3222 12 19 , , , 3222 12 20 there there EX 3222 12 21 are be VBP 3222 12 22 several several JJ 3222 12 23 hundred hundred CD 3222 12 24 these these DT 3222 12 25 s s NNPS 3222 12 26 added add VBD 3222 12 27 to to IN 3222 12 28 the the DT 3222 12 29 archive archive NN 3222 12 30 every every DT 3222 12 31 year year NN 3222 12 32 . . . 
3222 13 1 Although although IN 3222 13 2 students student NNS 3222 13 3 are be VBP 3222 13 4 asked ask VBN 3222 13 5 to to TO 3222 13 6 submit submit VB 3222 13 7 Microsoft Microsoft NNP 3222 13 8 Word Word NNP 3222 13 9 – – : 3222 13 10 compatible compatible JJ 3222 13 11 documents document NNS 3222 13 12 such such JJ 3222 13 13 as as IN 3222 13 14 DOC DOC NNP 3222 13 15 , , , 3222 13 16 DOCX DOCX NNP 3222 13 17 , , , 3222 13 18 and and CC 3222 13 19 RTF RTF NNP 3222 13 20 , , , 3222 13 21 other other JJ 3222 13 22 popular popular JJ 3222 13 23 formats format NNS 3222 13 24 such such JJ 3222 13 25 as as IN 3222 13 26 TeX TeX NNP 3222 13 27 script script NN 3222 13 28 ( ( -LRB- 3222 13 29 TEX TEX NNP 3222 13 30 ) ) -RRB- 3222 13 31 , , , 3222 13 32 HTML HTML NNP 3222 13 33 , , , 3222 13 34 PS PS NNP 3222 13 35 , , , 3222 13 36 and and CC 3222 13 37 PDF PDF NNP 3222 13 38 are be VBP 3222 13 39 also also RB 3222 13 40 accepted accept VBN 3222 13 41 , , , 3222 13 42 both both DT 3222 13 43 in in IN 3222 13 44 the the DT 3222 13 45 case case NN 3222 13 46 of of IN 3222 13 47 the the DT 3222 13 48 main main JJ 3222 13 49 thesis thesis NN 3222 13 50 document document NN 3222 13 51 , , , 3222 13 52 containing contain VBG 3222 13 53 the the DT 3222 13 54 thesis thesis NN 3222 13 55 and and CC 3222 13 56 any any DT 3222 13 57 appendixes appendix NNS 3222 13 58 that that WDT 3222 13 59 were be VBD 3222 13 60 included include VBN 3222 13 61 in in IN 3222 13 62 the the DT 3222 13 63 printed print VBN 3222 13 64 ver- ver- JJ 3222 13 65 sion sion NN 3222 13 66 , , , 3222 13 67 and and CC 3222 13 68 the the DT 3222 13 69 additional additional JJ 3222 13 70 appendixes appendix NNS 3222 13 71 , , , 3222 13 72 comprising comprise VBG 3222 13 73 mate- mate- NNP 3222 13 74 rials rial NNS 3222 13 75 that that WDT 3222 13 76 were be VBD 3222 13 77 left leave VBN 3222 13 78 out out IN 3222 13 79 of of IN 3222 13 80 the the DT 3222 13 81 printed print VBN 3222 13 82 version version NN 
3222 13 83 ( ( -LRB- 3222 13 84 such such JJ 3222 13 85 as as IN 3222 13 86 detailed detail VBN 3222 13 87 data data NN 3222 13 88 tables table NNS 3222 13 89 , , , 3222 13 90 the the DT 3222 13 91 full full JJ 3222 13 92 source source NN 3222 13 93 code code NN 3222 13 94 of of IN 3222 13 95 programs program NNS 3222 13 96 , , , 3222 13 97 program program NN 3222 13 98 manuals manual NNS 3222 13 99 , , , 3222 13 100 etc etc FW 3222 13 101 . . . 3222 13 102 ) ) -RRB- 3222 13 103 . . . 3222 14 1 Some some DT 3222 14 2 of of IN 3222 14 3 the the DT 3222 14 4 appendixes appendix NNS 3222 14 5 may may MD 3222 14 6 be be VB 3222 14 7 multimedia multimedia NNS 3222 14 8 , , , 3222 14 9 in in IN 3222 14 10 formats format NNS 3222 14 11 such such JJ 3222 14 12 as as IN 3222 14 13 PNG PNG NNP 3222 14 14 , , , 3222 14 15 JPEG JPEG NNP 3222 14 16 , , , 3222 14 17 or or CC 3222 14 18 MPEG.3 MPEG.3 NNP 3222 14 19 Notice Notice NNP 3222 14 20 that that IN 3222 14 21 this this DT 3222 14 22 paper paper NN 3222 14 23 deals deal NNS 3222 14 24 with with IN 3222 14 25 text text NN 3222 14 26 - - HYPH 3222 14 27 document document NN 3222 14 28 com- com- NN 3222 14 29 pression pression NN 3222 14 30 only only RB 3222 14 31 . . . 
3222 15 1 Although although IN 3222 15 2 the the DT 3222 15 3 size size NN 3222 15 4 of of IN 3222 15 5 individual individual JJ 3222 15 6 text text NN 3222 15 7 documents document NNS 3222 15 8 is be VBZ 3222 15 9 often often RB 3222 15 10 significantly significantly RB 3222 15 11 smaller small JJR 3222 15 12 than than IN 3222 15 13 the the DT 3222 15 14 size size NN 3222 15 15 of of IN 3222 15 16 individual individual JJ 3222 15 17 multimedia multimedia NN 3222 15 18 objects object NNS 3222 15 19 , , , 3222 15 20 their -PRON- PRP$ 3222 15 21 collective collective JJ 3222 15 22 vol- vol- JJ 3222 15 23 ume ume NN 3222 15 24 is be VBZ 3222 15 25 large large JJ 3222 15 26 enough enough RB 3222 15 27 to to TO 3222 15 28 make make VB 3222 15 29 the the DT 3222 15 30 compression compression NN 3222 15 31 effort effort NN 3222 15 32 worthwhile worthwhile NNP 3222 15 33 . . . 3222 16 1 The the DT 3222 16 2 reason reason NN 3222 16 3 for for IN 3222 16 4 focusing focus VBG 3222 16 5 on on IN 3222 16 6 text text NN 3222 16 7 - - HYPH 3222 16 8 document document NN 3222 16 9 compression compression NN 3222 16 10 is be VBZ 3222 16 11 that that IN 3222 16 12 most most JJS 3222 16 13 multimedia multimedia NNS 3222 16 14 formats format NNS 3222 16 15 have have VBP 3222 16 16 efficient efficient JJ 3222 16 17 compression compression NN 3222 16 18 schemes scheme NNS 3222 16 19 embedded embed VBN 3222 16 20 , , , 3222 16 21 whereas whereas IN 3222 16 22 text text NN 3222 16 23 document document NN 3222 16 24 formats format VBZ 3222 16 25 usually usually RB 3222 16 26 either either CC 3222 16 27 are be VBP 3222 16 28 uncompressed uncompressed JJ 3222 16 29 or or CC 3222 16 30 use use VB 3222 16 31 schemes scheme NNS 3222 16 32 with with IN 3222 16 33 efficiency efficiency NN 3222 16 34 far far RB 3222 16 35 worse bad JJR 3222 16 36 than than IN 3222 16 37 the the DT 3222 16 38 current current JJ 3222 16 39 state state NN 3222 16 40 of of IN 3222 16 41 the the DT 3222 16 42 art art 
NN 3222 16 43 in in IN 3222 16 44 text text NN 3222 16 45 compression compression NN 3222 16 46 . . . 3222 17 1 Although although IN 3222 17 2 the the DT 3222 17 3 student student NN 3222 17 4 thesis thesis NN 3222 17 5 electronic electronic JJ 3222 17 6 archive archive NN 3222 17 7 was be VBD 3222 17 8 our -PRON- PRP$ 3222 17 9 motivation motivation NN 3222 17 10 , , , 3222 17 11 we -PRON- PRP 3222 17 12 propose propose VBP 3222 17 13 a a DT 3222 17 14 solution solution NN 3222 17 15 that that WDT 3222 17 16 can can MD 3222 17 17 be be VB 3222 17 18 applied apply VBN 3222 17 19 to to IN 3222 17 20 any any DT 3222 17 21 digital digital JJ 3222 17 22 library library NN 3222 17 23 containing contain VBG 3222 17 24 text text NN 3222 17 25 documents document NNS 3222 17 26 . . . 3222 18 1 As as IN 3222 18 2 the the DT 3222 18 3 recent recent JJ 3222 18 4 survey survey NN 3222 18 5 by by IN 3222 18 6 Kahl Kahl NNP 3222 18 7 and and CC 3222 18 8 Williams Williams NNP 3222 18 9 revealed reveal VBD 3222 18 10 , , , 3222 18 11 57.5 57.5 CD 3222 18 12 percent percent NN 3222 18 13 of of IN 3222 18 14 the the DT 3222 18 15 examined examine VBN 3222 18 16 1,117 1,117 CD 3222 18 17 digital digital JJ 3222 18 18 library library NN 3222 18 19 projects project NNS 3222 18 20 consisted consist VBN 3222 18 21 of of IN 3222 18 22 text text NN 3222 18 23 content content NN 3222 18 24 , , , 3222 18 25 so so CC 3222 18 26 there there EX 3222 18 27 are be VBP 3222 18 28 numerous numerous JJ 3222 18 29 libraries library NNS 3222 18 30 that that WDT 3222 18 31 could could MD 3222 18 32 benefit benefit VB 3222 18 33 form form NN 3222 18 34 implementation implementation NN 3222 18 35 of of IN 3222 18 36 the the DT 3222 18 37 proposed propose VBN 3222 18 38 scheme.4 scheme.4 NNP 3222 18 39 In in IN 3222 18 40 this this DT 3222 18 41 paper paper NN 3222 18 42 , , , 3222 18 43 we -PRON- PRP 3222 18 44 describe describe VBP 3222 18 45 a a DT 3222 18 46 state state NN 3222 18 47 - - HYPH 3222 18 
48 of of IN 3222 18 49 - - HYPH 3222 18 50 the the DT 3222 18 51 - - HYPH 3222 18 52 art art NN 3222 18 53 approach approach NN 3222 18 54 to to IN 3222 18 55 text text NN 3222 18 56 - - HYPH 3222 18 57 document document NN 3222 18 58 compression compression NN 3222 18 59 and and CC 3222 18 60 present present VB 3222 18 61 an an DT 3222 18 62 open- open- JJ 3222 18 63 source source NN 3222 18 64 software software NN 3222 18 65 library library NN 3222 18 66 implementing implement VBG 3222 18 67 the the DT 3222 18 68 scheme scheme NN 3222 18 69 that that WDT 3222 18 70 can can MD 3222 18 71 be be VB 3222 18 72 freely freely RB 3222 18 73 used use VBN 3222 18 74 in in IN 3222 18 75 digital digital JJ 3222 18 76 library library NN 3222 18 77 projects project NNS 3222 18 78 . . . 3222 19 1 In in IN 3222 19 2 the the DT 3222 19 3 case case NN 3222 19 4 of of IN 3222 19 5 text text NN 3222 19 6 documents document NNS 3222 19 7 , , , 3222 19 8 improvement improvement NN 3222 19 9 in in IN 3222 19 10 com- com- NN 3222 19 11 pression pression NN 3222 19 12 effectiveness effectiveness NN 3222 19 13 may may MD 3222 19 14 be be VB 3222 19 15 obtained obtain VBN 3222 19 16 in in IN 3222 19 17 two two CD 3222 19 18 ways way NNS 3222 19 19 : : : 3222 19 20 with with IN 3222 19 21 or or CC 3222 19 22 without without IN 3222 19 23 regard regard NN 3222 19 24 to to IN 3222 19 25 their -PRON- PRP$ 3222 19 26 format format NN 3222 19 27 . . . 3222 20 1 The the DT 3222 20 2 more more RBR 3222 20 3 nontextual nontextual JJ 3222 20 4 content content NN 3222 20 5 in in IN 3222 20 6 a a DT 3222 20 7 document document NN 3222 20 8 ( ( -LRB- 3222 20 9 e.g. e.g. 
RB 3222 20 10 , , , 3222 20 11 formatting format VBG 3222 20 12 instructions instruction NNS 3222 20 13 , , , 3222 20 14 structure structure NN 3222 20 15 description description NN 3222 20 16 , , , 3222 20 17 or or CC 3222 20 18 embedded embed VBN 3222 20 19 images image NNS 3222 20 20 ) ) -RRB- 3222 20 21 , , , 3222 20 22 the the DT 3222 20 23 more more RBR 3222 20 24 it -PRON- PRP 3222 20 25 requires require VBZ 3222 20 26 format format NN 3222 20 27 - - HYPH 3222 20 28 specific specific JJ 3222 20 29 processing processing NN 3222 20 30 to to TO 3222 20 31 improve improve VB 3222 20 32 its -PRON- PRP$ 3222 20 33 com- com- NN 3222 20 34 pression pression NN 3222 20 35 ratio ratio NN 3222 20 36 . . . 3222 21 1 This this DT 3222 21 2 is be VBZ 3222 21 3 because because IN 3222 21 4 most most JJS 3222 21 5 document document NN 3222 21 6 formats format NNS 3222 21 7 have have VBP 3222 21 8 their -PRON- PRP$ 3222 21 9 own own JJ 3222 21 10 ways way NNS 3222 21 11 of of IN 3222 21 12 describing describe VBG 3222 21 13 their -PRON- PRP$ 3222 21 14 formatting formatting NN 3222 21 15 , , , 3222 21 16 structure structure NN 3222 21 17 , , , 3222 21 18 and and CC 3222 21 19 nontextual nontextual JJ 3222 21 20 inclusions inclusion NNS 3222 21 21 ( ( -LRB- 3222 21 22 plain plain JJ 3222 21 23 text text NN 3222 21 24 files file NNS 3222 21 25 have have VBP 3222 21 26 no no DT 3222 21 27 inclusions inclusion NNS 3222 21 28 ) ) -RRB- 3222 21 29 . . . 
3222 22 1 For for IN 3222 22 2 this this DT 3222 22 3 reason reason NN 3222 22 4 , , , 3222 22 5 we -PRON- PRP 3222 22 6 have have VBP 3222 22 7 developed develop VBN 3222 22 8 a a DT 3222 22 9 compound compound NN 3222 22 10 scheme scheme NN 3222 22 11 that that WDT 3222 22 12 consists consist VBZ 3222 22 13 of of IN 3222 22 14 several several JJ 3222 22 15 subschemes subscheme NNS 3222 22 16 that that WDT 3222 22 17 can can MD 3222 22 18 be be VB 3222 22 19 turned turn VBN 3222 22 20 on on RP 3222 22 21 and and CC 3222 22 22 off off RB 3222 22 23 or or CC 3222 22 24 run run VB 3222 22 25 with with IN 3222 22 26 different different JJ 3222 22 27 parameters parameter NNS 3222 22 28 . . . 3222 23 1 The the DT 3222 23 2 most most RBS 3222 23 3 suitable suitable JJ 3222 23 4 solution solution NN 3222 23 5 for for IN 3222 23 6 a a DT 3222 23 7 given give VBN 3222 23 8 document document NN 3222 23 9 format format NN 3222 23 10 can can MD 3222 23 11 be be VB 3222 23 12 obtained obtain VBN 3222 23 13 by by IN 3222 23 14 merely merely RB 3222 23 15 choosing choose VBG 3222 23 16 the the DT 3222 23 17 right right JJ 3222 23 18 schemes scheme NNS 3222 23 19 and and CC 3222 23 20 adequate adequate JJ 3222 23 21 parameter parameter NN 3222 23 22 values value NNS 3222 23 23 . . . 
3222 24 1 Experimentally experimentally RB 3222 24 2 , , , 3222 24 3 we -PRON- PRP 3222 24 4 have have VBP 3222 24 5 found find VBN 3222 24 6 the the DT 3222 24 7 optimal optimal JJ 3222 24 8 subscheme subscheme NN 3222 24 9 combinations combination NNS 3222 24 10 for for IN 3222 24 11 the the DT 3222 24 12 fol- fol- NN 3222 24 13 lowing lowing NN 3222 24 14 formats format NNS 3222 24 15 used use VBN 3222 24 16 in in IN 3222 24 17 digital digital JJ 3222 24 18 libraries library NNS 3222 24 19 : : : 3222 24 20 plain plain JJ 3222 24 21 text text NN 3222 24 22 , , , 3222 24 23 TEX TEX NNP 3222 24 24 , , , 3222 24 25 RTF RTF NNP 3222 24 26 , , , 3222 24 27 text text NN 3222 24 28 annotated annotate VBN 3222 24 29 with with IN 3222 24 30 XML xml NN 3222 24 31 , , , 3222 24 32 HTML html NN 3222 24 33 , , , 3222 24 34 as as RB 3222 24 35 well well RB 3222 24 36 as as IN 3222 24 37 the the DT 3222 24 38 device device NN 3222 24 39 - - HYPH 3222 24 40 independent independent JJ 3222 24 41 rendering rendering NN 3222 24 42 formats format NNS 3222 24 43 PS PS NNP 3222 24 44 and and CC 3222 24 45 PDF.5 PDF.5 NNP 3222 24 46 First first RB 3222 24 47 we -PRON- PRP 3222 24 48 discuss discuss VBP 3222 24 49 related related JJ 3222 24 50 work work NN 3222 24 51 in in IN 3222 24 52 text text NN 3222 24 53 compression compression NN 3222 24 54 , , , 3222 24 55 then then RB 3222 24 56 describe describe VB 3222 24 57 the the DT 3222 24 58 basis basis NN 3222 24 59 of of IN 3222 24 60 the the DT 3222 24 61 proposed propose VBN 3222 24 62 scheme scheme NN 3222 24 63 and and CC 3222 24 64 how how WRB 3222 24 65 it -PRON- PRP 3222 24 66 should should MD 3222 24 67 be be VB 3222 24 68 adapted adapt VBN 3222 24 69 for for IN 3222 24 70 particular particular JJ 3222 24 71 document document NN 3222 24 72 formats format NNS 3222 24 73 . . . 
3222 25 1 The the DT 3222 25 2 section section NN 3222 25 3 “ " `` 3222 25 4 Using use VBG 3222 25 5 the the DT 3222 25 6 scheme scheme NN 3222 25 7 in in IN 3222 25 8 a a DT 3222 25 9 digital digital JJ 3222 25 10 library library NN 3222 25 11 project project NN 3222 25 12 ” " '' 3222 25 13 discusses discuss VBZ 3222 25 14 how how WRB 3222 25 15 to to TO 3222 25 16 use use VB 3222 25 17 the the DT 3222 25 18 free free JJ 3222 25 19 software software NN 3222 25 20 library library NN 3222 25 21 that that WDT 3222 25 22 imple- imple- VBZ 3222 25 23 ments ment VBZ 3222 25 24 the the DT 3222 25 25 scheme scheme NN 3222 25 26 . . . 3222 26 1 Then then RB 3222 26 2 we -PRON- PRP 3222 26 3 cover cover VBP 3222 26 4 the the DT 3222 26 5 results result NNS 3222 26 6 of of IN 3222 26 7 experi- experi- JJ 3222 26 8 ments ment NNS 3222 26 9 involving involve VBG 3222 26 10 the the DT 3222 26 11 proposed propose VBN 3222 26 12 scheme scheme NN 3222 26 13 and and CC 3222 26 14 a a DT 3222 26 15 corpus corpus NN 3222 26 16 of of IN 3222 26 17 test test NN 3222 26 18 files file NNS 3222 26 19 in in IN 3222 26 20 each each DT 3222 26 21 of of IN 3222 26 22 the the DT 3222 26 23 tested test VBN 3222 26 24 formats format NNS 3222 26 25 . . . 3222 27 1 n n NNP 3222 27 2 Text Text NNP 3222 27 3 compression compression NN 3222 27 4 There there EX 3222 27 5 are be VBP 3222 27 6 two two CD 3222 27 7 basic basic JJ 3222 27 8 principles principle NNS 3222 27 9 of of IN 3222 27 10 general general JJ 3222 27 11 - - HYPH 3222 27 12 purpose purpose NN 3222 27 13 data datum NNS 3222 27 14 compression compression NN 3222 27 15 . . . 
3222 28 1 The the DT 3222 28 2 first first JJ 3222 28 3 one one CD 3222 28 4 works work VBZ 3222 28 5 on on IN 3222 28 6 the the DT 3222 28 7 level level NN 3222 28 8 of of IN 3222 28 9 char- char- NN 3222 28 10 acter acter NN 3222 28 11 sequences sequence NNS 3222 28 12 , , , 3222 28 13 the the DT 3222 28 14 second second JJ 3222 28 15 one one NN 3222 28 16 works work VBZ 3222 28 17 on on IN 3222 28 18 the the DT 3222 28 19 level level NN 3222 28 20 of of IN 3222 28 21 przemysław przemysław NNP 3222 28 22 Skibiński Skibiński NNP 3222 28 23 ( ( -LRB- 3222 28 24 inikep@ii.uni.wroc.pl inikep@ii.uni.wroc.pl NN 3222 28 25 ) ) -RRB- 3222 28 26 is be VBZ 3222 28 27 associate associate JJ 3222 28 28 Professor Professor NNP 3222 28 29 , , , 3222 28 30 institute institute NN 3222 28 31 of of IN 3222 28 32 Computer Computer NNP 3222 28 33 Science Science NNP 3222 28 34 , , , 3222 28 35 University University NNP 3222 28 36 of of IN 3222 28 37 wrocław wrocław NNP 3222 28 38 , , , 3222 28 39 Poland Poland NNP 3222 28 40 . . . 3222 29 1 Jakub Jakub NNP 3222 29 2 Swacha Swacha NNP 3222 29 3 ( ( -LRB- 3222 29 4 jakubs@uoo.univ.szczecin jakubs@uoo.univ.szczecin NNP 3222 29 5 .pl .pl NN 3222 29 6 ) ) -RRB- 3222 29 7 is be VBZ 3222 29 8 associate associate JJ 3222 29 9 Professor Professor NNP 3222 29 10 , , , 3222 29 11 institute institute NN 3222 29 12 of of IN 3222 29 13 information information NNP 3222 29 14 Technology Technology NNP 3222 29 15 in in IN 3222 29 16 Management Management NNP 3222 29 17 , , , 3222 29 18 University University NNP 3222 29 19 of of IN 3222 29 20 Szczecin Szczecin NNP 3222 29 21 , , , 3222 29 22 Poland Poland NNP 3222 29 23 . . . 3222 30 1 144 144 CD 3222 30 2 iNForMaTioN iNForMaTioN NNP 3222 30 3 TECHNoloGY technology NN 3222 30 4 aND and CC 3222 30 5 liBrariES liBrariES NNP 3222 30 6 | | NNP 3222 30 7 SEpTEMBEr september CD 3222 30 8 2009 2009 CD 3222 30 9 individual individual JJ 3222 30 10 characters character NNS 3222 30 11 . . . 
3222 31 1 In in IN 3222 31 2 the the DT 3222 31 3 first first JJ 3222 31 4 case case NN 3222 31 5 , , , 3222 31 6 the the DT 3222 31 7 idea idea NN 3222 31 8 is be VBZ 3222 31 9 to to TO 3222 31 10 look look VB 3222 31 11 for for IN 3222 31 12 matching match VBG 3222 31 13 character character NN 3222 31 14 sequences sequence NNS 3222 31 15 in in IN 3222 31 16 the the DT 3222 31 17 past past JJ 3222 31 18 buffer buffer NN 3222 31 19 of of IN 3222 31 20 the the DT 3222 31 21 file file NN 3222 31 22 being be VBG 3222 31 23 compressed compress VBN 3222 31 24 and and CC 3222 31 25 replace replace VB 3222 31 26 such such JJ 3222 31 27 sequences sequence NNS 3222 31 28 with with IN 3222 31 29 shorter short JJR 3222 31 30 code code NN 3222 31 31 words word NNS 3222 31 32 ; ; : 3222 31 33 this this DT 3222 31 34 principle principle NN 3222 31 35 underlies underlie VBZ 3222 31 36 the the DT 3222 31 37 algo- algo- JJ 3222 31 38 rithms rithms NNP 3222 31 39 derived derive VBN 3222 31 40 from from IN 3222 31 41 the the DT 3222 31 42 concepts concept NNS 3222 31 43 of of IN 3222 31 44 Arbraham Arbraham NNP 3222 31 45 Lempel Lempel NNP 3222 31 46 and and CC 3222 31 47 Jacob Jacob NNP 3222 31 48 Ziv Ziv NNP 3222 31 49 ( ( -LRB- 3222 31 50 LZ LZ NNP 3222 31 51 - - HYPH 3222 31 52 type).6 type).6 NNP 3222 31 53 In in IN 3222 31 54 the the DT 3222 31 55 second second JJ 3222 31 56 case case NN 3222 31 57 , , , 3222 31 58 the the DT 3222 31 59 idea idea NN 3222 31 60 is be VBZ 3222 31 61 to to TO 3222 31 62 gather gather VB 3222 31 63 frequency frequency NN 3222 31 64 statistics statistic NNS 3222 31 65 for for IN 3222 31 66 characters character NNS 3222 31 67 in in IN 3222 31 68 the the DT 3222 31 69 file file NN 3222 31 70 being be VBG 3222 31 71 compressed compress VBN 3222 31 72 and and CC 3222 31 73 then then RB 3222 31 74 assign assign NNP 3222 31 75 shorter short JJR 3222 31 76 code code NN 3222 31 77 words word NNS 3222 31 78 for for IN 3222 31 79 frequent frequent JJ 3222 31 
80 characters character NNS 3222 31 81 and and CC 3222 31 82 longer long JJR 3222 31 83 ones one NNS 3222 31 84 for for IN 3222 31 85 rare rare JJ 3222 31 86 characters character NNS 3222 31 87 ( ( -LRB- 3222 31 88 this this DT 3222 31 89 is be VBZ 3222 31 90 exactly exactly RB 3222 31 91 how how WRB 3222 31 92 Huffman Huffman NNP 3222 31 93 coding code VBG 3222 31 94 works work NNS 3222 31 95 — — : 3222 31 96 what what WP 3222 31 97 arithmetic arithmetic JJ 3222 31 98 coding code VBG 3222 31 99 assigns assign NNS 3222 31 100 are be VBP 3222 31 101 value value NN 3222 31 102 ranges range VBZ 3222 31 103 rather rather RB 3222 31 104 than than IN 3222 31 105 individual individual JJ 3222 31 106 code code NN 3222 31 107 words).7 words).7 NNP 3222 31 108 As as IN 3222 31 109 the the DT 3222 31 110 characters character NNS 3222 31 111 form form VBP 3222 31 112 words word NNS 3222 31 113 , , , 3222 31 114 and and CC 3222 31 115 words word NNS 3222 31 116 form form NN 3222 31 117 phrases phrase NNS 3222 31 118 , , , 3222 31 119 there there EX 3222 31 120 is be VBZ 3222 31 121 high high JJ 3222 31 122 correlation correlation NN 3222 31 123 between between IN 3222 31 124 subsequent subsequent JJ 3222 31 125 characters character NNS 3222 31 126 . . . 
3222 32 1 To to TO 3222 32 2 produce produce VB 3222 32 3 shorter short JJR 3222 32 4 code code NN 3222 32 5 words word NNS 3222 32 6 , , , 3222 32 7 a a DT 3222 32 8 compression compression NN 3222 32 9 algorithm algorithm NN 3222 32 10 either either CC 3222 32 11 has have VBZ 3222 32 12 to to TO 3222 32 13 observe observe VB 3222 32 14 the the DT 3222 32 15 context context NN 3222 32 16 ( ( -LRB- 3222 32 17 understood understand VBN 3222 32 18 as as IN 3222 32 19 several several JJ 3222 32 20 preceding precede VBG 3222 32 21 characters character NNS 3222 32 22 ) ) -RRB- 3222 32 23 in in IN 3222 32 24 which which WDT 3222 32 25 the the DT 3222 32 26 character character NN 3222 32 27 appeared appear VBD 3222 32 28 and and CC 3222 32 29 maintain maintain VB 3222 32 30 separate separate JJ 3222 32 31 frequency frequency NN 3222 32 32 models model NNS 3222 32 33 for for IN 3222 32 34 different different JJ 3222 32 35 contexts contexts NN 3222 32 36 , , , 3222 32 37 or or CC 3222 32 38 has have VBZ 3222 32 39 to to TO 3222 32 40 first first RB 3222 32 41 decorrelate decorrelate VB 3222 32 42 the the DT 3222 32 43 characters character NNS 3222 32 44 ( ( -LRB- 3222 32 45 by by IN 3222 32 46 sorting sort VBG 3222 32 47 them -PRON- PRP 3222 32 48 according accord VBG 3222 32 49 to to IN 3222 32 50 their -PRON- PRP$ 3222 32 51 contexts contexts NN 3222 32 52 ) ) -RRB- 3222 32 53 and and CC 3222 32 54 then then RB 3222 32 55 use use VB 3222 32 56 an an DT 3222 32 57 adaptive adaptive JJ 3222 32 58 frequency frequency NN 3222 32 59 model model NN 3222 32 60 when when WRB 3222 32 61 compressing compress VBG 3222 32 62 the the DT 3222 32 63 out- out- JJ 3222 32 64 put put NN 3222 32 65 ( ( -LRB- 3222 32 66 as as IN 3222 32 67 the the DT 3222 32 68 characters character NNS 3222 32 69 ’ ’ POS 3222 32 70 dependence dependence NN 3222 32 71 on on IN 3222 32 72 context context NN 3222 32 73 becomes become VBZ 3222 32 74 dependence dependence NN 3222 32 75 on on IN 3222 32 76 
position position NN 3222 32 77 ) ) -RRB- 3222 32 78 . . . 3222 33 1 Whereas whereas IN 3222 33 2 the the DT 3222 33 3 former former JJ 3222 33 4 solution solution NN 3222 33 5 is be VBZ 3222 33 6 the the DT 3222 33 7 foundation foundation NN 3222 33 8 of of IN 3222 33 9 Prediction Prediction NNP 3222 33 10 by by IN 3222 33 11 Partial Partial NNP 3222 33 12 Match Match NNP 3222 33 13 ( ( -LRB- 3222 33 14 PPM PPM NNP 3222 33 15 ) ) -RRB- 3222 33 16 algo- algo- . 3222 33 17 rithms rithms NN 3222 33 18 , , , 3222 33 19 Burrows Burrows NNP 3222 33 20 - - HYPH 3222 33 21 Wheeler Wheeler NNP 3222 33 22 Transform Transform NNP 3222 33 23 ( ( -LRB- 3222 33 24 BWT BWT NNP 3222 33 25 ) ) -RRB- 3222 33 26 compression compression NN 3222 33 27 algorithms algorithm NNS 3222 33 28 are be VBP 3222 33 29 based base VBN 3222 33 30 on on IN 3222 33 31 the the DT 3222 33 32 latter.8 latter.8 NNP 3222 33 33 Witten Witten NNP 3222 33 34 et et FW 3222 33 35 al al NNP 3222 33 36 . . NNP 3222 33 37 , , , 3222 33 38 in in IN 3222 33 39 their -PRON- PRP$ 3222 33 40 seminal seminal JJ 3222 33 41 work work NN 3222 33 42 Managing manage VBG 3222 33 43 Gigabytes Gigabytes NNPS 3222 33 44 , , , 3222 33 45 emphasize emphasize VB 3222 33 46 the the DT 3222 33 47 role role NN 3222 33 48 of of IN 3222 33 49 data datum NNS 3222 33 50 compression compression NN 3222 33 51 in in IN 3222 33 52 text text NN 3222 33 53 storage storage NN 3222 33 54 and and CC 3222 33 55 retrieval retrieval NN 3222 33 56 systems system NNS 3222 33 57 , , , 3222 33 58 stating state VBG 3222 33 59 three three CD 3222 33 60 requirements requirement NNS 3222 33 61 for for IN 3222 33 62 the the DT 3222 33 63 compression compression NN 3222 33 64 process process NN 3222 33 65 : : : 3222 33 66 good good JJ 3222 33 67 compression compression NN 3222 33 68 , , , 3222 33 69 fast fast JJ 3222 33 70 decoding decoding NN 3222 33 71 , , , 3222 33 72 and and CC 3222 33 73 feasibility feasibility NN 3222 33 74 of of IN 3222 33 75 decoding 
decode VBG 3222 33 76 individual individual JJ 3222 33 77 documents document NNS 3222 33 78 with with IN 3222 33 79 minimum minimum NN 3222 33 80 overhead.9 overhead.9 NNP 3222 33 81 The the DT 3222 33 82 choice choice NN 3222 33 83 of of IN 3222 33 84 compression compression NN 3222 33 85 algorithm algorithm NNP 3222 33 86 should should MD 3222 33 87 depend depend VB 3222 33 88 on on IN 3222 33 89 what what WP 3222 33 90 is be VBZ 3222 33 91 more more RBR 3222 33 92 important important JJ 3222 33 93 for for IN 3222 33 94 a a DT 3222 33 95 specific specific JJ 3222 33 96 application application NN 3222 33 97 : : : 3222 33 98 better well JJR 3222 33 99 compression compression NN 3222 33 100 or or CC 3222 33 101 faster fast RBR 3222 33 102 decoding decoding NN 3222 33 103 . . . 3222 34 1 An an DT 3222 34 2 early early JJ 3222 34 3 work work NN 3222 34 4 of of IN 3222 34 5 Jon Jon NNP 3222 34 6 Louis Louis NNP 3222 34 7 Bentley Bentley NNP 3222 34 8 and and CC 3222 34 9 others other NNS 3222 34 10 showed show VBD 3222 34 11 that that IN 3222 34 12 a a DT 3222 34 13 significant significant JJ 3222 34 14 improvement improvement NN 3222 34 15 in in IN 3222 34 16 text text NN 3222 34 17 compression compression NN 3222 34 18 can can MD 3222 34 19 be be VB 3222 34 20 achieved achieve VBN 3222 34 21 by by IN 3222 34 22 treating treat VBG 3222 34 23 a a DT 3222 34 24 text text NN 3222 34 25 document document NN 3222 34 26 as as IN 3222 34 27 a a DT 3222 34 28 stream stream NN 3222 34 29 of of IN 3222 34 30 space space NN 3222 34 31 - - HYPH 3222 34 32 delimited delimit VBN 3222 34 33 words word NNS 3222 34 34 rather rather RB 3222 34 35 than than IN 3222 34 36 individual individual JJ 3222 34 37 characters.10 characters.10 NNP 3222 34 38 This this DT 3222 34 39 technique technique NN 3222 34 40 can can MD 3222 34 41 be be VB 3222 34 42 combined combine VBN 3222 34 43 with with IN 3222 34 44 any any DT 3222 34 45 general general JJ 3222 34 46 - - HYPH 3222 34 47 purpose 
purpose NN 3222 34 48 compression compression NN 3222 34 49 method method NN 3222 34 50 in in IN 3222 34 51 two two CD 3222 34 52 ways way NNS 3222 34 53 : : : 3222 34 54 by by IN 3222 34 55 redesigning redesign VBG 3222 34 56 charac- charac- NNP 3222 34 57 ter ter NN 3222 34 58 - - HYPH 3222 34 59 based base VBN 3222 34 60 algorithms algorithm NNS 3222 34 61 as as IN 3222 34 62 word word NN 3222 34 63 - - HYPH 3222 34 64 based base VBN 3222 34 65 ones one NNS 3222 34 66 or or CC 3222 34 67 by by IN 3222 34 68 implement- implement- NNP 3222 34 69 ing ing NNP 3222 34 70 a a DT 3222 34 71 two two CD 3222 34 72 - - HYPH 3222 34 73 stage stage NN 3222 34 74 scheme scheme NN 3222 34 75 whose whose WP$ 3222 34 76 first first JJ 3222 34 77 step step NN 3222 34 78 is be VBZ 3222 34 79 a a DT 3222 34 80 transform transform NN 3222 34 81 replacing replace VBG 3222 34 82 words word NNS 3222 34 83 with with IN 3222 34 84 dictionary dictionary JJ 3222 34 85 indices index NNS 3222 34 86 and and CC 3222 34 87 whose whose WP$ 3222 34 88 second second JJ 3222 34 89 step step NN 3222 34 90 is be VBZ 3222 34 91 passing pass VBG 3222 34 92 the the DT 3222 34 93 transformed transform VBN 3222 34 94 text text NN 3222 34 95 through through IN 3222 34 96 any any DT 3222 34 97 general- general- NN 3222 34 98 purpose purpose NN 3222 34 99 compressor.11 compressor.11 NNP 3222 34 100 From from IN 3222 34 101 the the DT 3222 34 102 designer designer NN 3222 34 103 ’s ’s POS 3222 34 104 point point NN 3222 34 105 of of IN 3222 34 106 view view NN 3222 34 107 , , , 3222 34 108 although although IN 3222 34 109 the the DT 3222 34 110 first first JJ 3222 34 111 approach approach NN 3222 34 112 provides provide VBZ 3222 34 113 more more JJR 3222 34 114 control control NN 3222 34 115 over over IN 3222 34 116 how how WRB 3222 34 117 the the DT 3222 34 118 text text NN 3222 34 119 is be VBZ 3222 34 120 modeled model VBN 3222 34 121 , , , 3222 34 122 the the DT 3222 34 123 second second JJ 3222 34 124 
approach approach NN 3222 34 125 is be VBZ 3222 34 126 much much RB 3222 34 127 eas- eas- RB 3222 34 128 ier ier JJ 3222 34 129 to to TO 3222 34 130 implement implement VB 3222 34 131 and and CC 3222 34 132 upgrade upgrade VB 3222 34 133 to to IN 3222 34 134 future future JJ 3222 34 135 general general JJ 3222 34 136 - - HYPH 3222 34 137 purpose purpose NN 3222 34 138 compressors.12 compressors.12 NNP 3222 34 139 Notice Notice NNP 3222 34 140 that that IN 3222 34 141 the the DT 3222 34 142 separation separation NN 3222 34 143 of of IN 3222 34 144 the the DT 3222 34 145 word- word- NN 3222 34 146 replacement replacement NN 3222 34 147 stage stage NN 3222 34 148 from from IN 3222 34 149 the the DT 3222 34 150 compression compression NN 3222 34 151 stage stage NN 3222 34 152 does do VBZ 3222 34 153 not not RB 3222 34 154 imply imply VB 3222 34 155 that that IN 3222 34 156 two two CD 3222 34 157 distinct distinct JJ 3222 34 158 programs program NNS 3222 34 159 have have VBP 3222 34 160 to to TO 3222 34 161 be be VB 3222 34 162 used use VBN 3222 34 163 — — : 3222 34 164 if if IN 3222 34 165 only only RB 3222 34 166 an an DT 3222 34 167 appropriate appropriate JJ 3222 34 168 general general JJ 3222 34 169 - - HYPH 3222 34 170 purpose purpose NN 3222 34 171 compression compression NN 3222 34 172 software software NN 3222 34 173 library library NN 3222 34 174 is be VBZ 3222 34 175 available available JJ 3222 34 176 , , , 3222 34 177 a a DT 3222 34 178 single single JJ 3222 34 179 utility utility NN 3222 34 180 can can MD 3222 34 181 use use VB 3222 34 182 it -PRON- PRP 3222 34 183 to to TO 3222 34 184 compress compress VB 3222 34 185 the the DT 3222 34 186 output output NN 3222 34 187 of of IN 3222 34 188 the the DT 3222 34 189 transform transform NN 3222 34 190 it -PRON- PRP 3222 34 191 first first RB 3222 34 192 performed perform VBD 3222 34 193 . . . 
3222 35 1 An an DT 3222 35 2 important important JJ 3222 35 3 element element NN 3222 35 4 of of IN 3222 35 5 every every DT 3222 35 6 word word NN 3222 35 7 - - HYPH 3222 35 8 based base VBN 3222 35 9 scheme scheme NN 3222 35 10 is be VBZ 3222 35 11 the the DT 3222 35 12 dictionary dictionary NN 3222 35 13 of of IN 3222 35 14 words word NNS 3222 35 15 that that WDT 3222 35 16 lists list VBZ 3222 35 17 character character NN 3222 35 18 sequences sequence NNS 3222 35 19 that that WDT 3222 35 20 should should MD 3222 35 21 be be VB 3222 35 22 treated treat VBN 3222 35 23 as as IN 3222 35 24 single single JJ 3222 35 25 entities entity NNS 3222 35 26 . . . 3222 36 1 The the DT 3222 36 2 dictionary dictionary NN 3222 36 3 can can MD 3222 36 4 be be VB 3222 36 5 dynamic dynamic JJ 3222 36 6 ( ( -LRB- 3222 36 7 i.e. i.e. FW 3222 36 8 , , , 3222 36 9 constructed construct VBN 3222 36 10 on on IN 3222 36 11 - - HYPH 3222 36 12 line line NN 3222 36 13 during during IN 3222 36 14 the the DT 3222 36 15 com- com- NN 3222 36 16 pression pression NN 3222 36 17 of of IN 3222 36 18 every every DT 3222 36 19 document),13 document),13 NNP 3222 36 20 static static NN 3222 36 21 ( ( -LRB- 3222 36 22 i.e. i.e. 
FW 3222 36 23 , , , 3222 36 24 constructed construct VBN 3222 36 25 off off IN 3222 36 26 - - HYPH 3222 36 27 line line NN 3222 36 28 before before IN 3222 36 29 the the DT 3222 36 30 compression compression NN 3222 36 31 stage stage NN 3222 36 32 and and CC 3222 36 33 once once RB 3222 36 34 for for IN 3222 36 35 every every DT 3222 36 36 document document NN 3222 36 37 of of IN 3222 36 38 a a DT 3222 36 39 given give VBN 3222 36 40 class class NN 3222 36 41 — — : 3222 36 42 typically typically RB 3222 36 43 , , , 3222 36 44 the the DT 3222 36 45 language language NN 3222 36 46 of of IN 3222 36 47 the the DT 3222 36 48 document document NN 3222 36 49 determines determine VBZ 3222 36 50 its -PRON- PRP$ 3222 36 51 class),14 class),14 JJ 3222 36 52 or or CC 3222 36 53 semidynamic semidynamic JJ 3222 36 54 ( ( -LRB- 3222 36 55 i.e. i.e. FW 3222 36 56 , , , 3222 36 57 constructed construct VBN 3222 36 58 off off IN 3222 36 59 - - HYPH 3222 36 60 line line NN 3222 36 61 before before IN 3222 36 62 compression compression NN 3222 36 63 stage stage NN 3222 36 64 but but CC 3222 36 65 indi- indi- FW 3222 36 66 vidually vidually RB 3222 36 67 for for IN 3222 36 68 every every DT 3222 36 69 document).15 document).15 NN 3222 36 70 Semidynamic Semidynamic NNP 3222 36 71 dictionar- dictionar- NN 3222 36 72 ies ie NNS 3222 36 73 must must MD 3222 36 74 be be VB 3222 36 75 stored store VBN 3222 36 76 along along RB 3222 36 77 with with IN 3222 36 78 the the DT 3222 36 79 compressed compressed JJ 3222 36 80 document document NN 3222 36 81 . . . 
3222 37 1 Dynamic dynamic JJ 3222 37 2 dictionaries dictionary NNS 3222 37 3 are be VBP 3222 37 4 reconstructed reconstruct VBN 3222 37 5 during during IN 3222 37 6 decom- decom- JJ 3222 37 7 pression pression NN 3222 37 8 ( ( -LRB- 3222 37 9 which which WDT 3222 37 10 makes make VBZ 3222 37 11 the the DT 3222 37 12 decoding decode VBG 3222 37 13 slower slow RBR 3222 37 14 than than IN 3222 37 15 in in IN 3222 37 16 the the DT 3222 37 17 other other JJ 3222 37 18 cases case NNS 3222 37 19 ) ) -RRB- 3222 37 20 . . . 3222 38 1 When when WRB 3222 38 2 the the DT 3222 38 3 static static JJ 3222 38 4 dictionary dictionary NN 3222 38 5 is be VBZ 3222 38 6 used use VBN 3222 38 7 , , , 3222 38 8 it -PRON- PRP 3222 38 9 must must MD 3222 38 10 be be VB 3222 38 11 distributed distribute VBN 3222 38 12 with with IN 3222 38 13 the the DT 3222 38 14 decoder decoder NN 3222 38 15 ; ; : 3222 38 16 since since IN 3222 38 17 a a DT 3222 38 18 single single JJ 3222 38 19 dictionary dictionary NN 3222 38 20 is be VBZ 3222 38 21 used use VBN 3222 38 22 to to TO 3222 38 23 compress compress VB 3222 38 24 multiple multiple JJ 3222 38 25 files file NNS 3222 38 26 , , , 3222 38 27 it -PRON- PRP 3222 38 28 usually usually RB 3222 38 29 attains attain VBZ 3222 38 30 the the DT 3222 38 31 best good JJS 3222 38 32 compression compression NN 3222 38 33 ratios ratio NNS 3222 38 34 , , , 3222 38 35 but but CC 3222 38 36 it -PRON- PRP 3222 38 37 is be VBZ 3222 38 38 only only RB 3222 38 39 effective effective JJ 3222 38 40 with with IN 3222 38 41 docu- docu- NN 3222 38 42 ments ment NNS 3222 38 43 of of IN 3222 38 44 the the DT 3222 38 45 class class NN 3222 38 46 it -PRON- PRP 3222 38 47 was be VBD 3222 38 48 originally originally RB 3222 38 49 prepared prepare VBN 3222 38 50 for for IN 3222 38 51 . . . 
3222 39 1 n n LS 3222 39 2 The the DT 3222 39 3 basic basic JJ 3222 39 4 compression compression NN 3222 39 5 scheme scheme NN 3222 39 6 The the DT 3222 39 7 basis basis NN 3222 39 8 of of IN 3222 39 9 our -PRON- PRP$ 3222 39 10 approach approach NN 3222 39 11 is be VBZ 3222 39 12 a a DT 3222 39 13 word word NN 3222 39 14 - - HYPH 3222 39 15 based base VBN 3222 39 16 , , , 3222 39 17 lossless lossless JJ 3222 39 18 text text NN 3222 39 19 compression compression NN 3222 39 20 scheme scheme NN 3222 39 21 , , , 3222 39 22 dubbed dub VBN 3222 39 23 Compression Compression NNP 3222 39 24 for for IN 3222 39 25 Textual Textual NNP 3222 39 26 Digital Digital NNP 3222 39 27 Libraries Libraries NNPS 3222 39 28 ( ( -LRB- 3222 39 29 CTDL CTDL NNP 3222 39 30 ) ) -RRB- 3222 39 31 . . . 3222 40 1 The the DT 3222 40 2 scheme scheme NN 3222 40 3 consists consist VBZ 3222 40 4 of of IN 3222 40 5 up up IN 3222 40 6 to to TO 3222 40 7 four four CD 3222 40 8 stages stage NNS 3222 40 9 : : : 3222 40 10 1 1 LS 3222 40 11 . . . 3222 40 12 document document NN 3222 40 13 decompression decompression NNP 3222 40 14 2 2 CD 3222 40 15 . . . 3222 40 16 dictionary dictionary NNP 3222 40 17 composition composition NN 3222 40 18 3 3 CD 3222 40 19 . . . 3222 40 20 text text NN 3222 40 21 transform transform NN 3222 40 22 4 4 CD 3222 40 23 . . . 3222 40 24 compression compression NN 3222 40 25 Stages stage NNS 3222 40 26 1–2 1–2 CD 3222 40 27 are be VBP 3222 40 28 optional optional JJ 3222 40 29 . . . 3222 41 1 The the DT 3222 41 2 first first JJ 3222 41 3 is be VBZ 3222 41 4 for for IN 3222 41 5 retrieving retrieve VBG 3222 41 6 tex- tex- XX 3222 41 7 tual tual JJ 3222 41 8 content content NN 3222 41 9 from from IN 3222 41 10 files file NNS 3222 41 11 compressed compress VBN 3222 41 12 poorly poorly RB 3222 41 13 with with IN 3222 41 14 general- general- NN 3222 41 15 purpose purpose NN 3222 41 16 methods method NNS 3222 41 17 . . . 
3222 42 1 It -PRON- PRP 3222 42 2 is be VBZ 3222 42 3 only only RB 3222 42 4 executed execute VBN 3222 42 5 for for IN 3222 42 6 compressed compress VBN 3222 42 7 input input NN 3222 42 8 documents document NNS 3222 42 9 . . . 3222 43 1 It -PRON- PRP 3222 43 2 uses use VBZ 3222 43 3 an an DT 3222 43 4 embedded embed VBN 3222 43 5 decompressor decompressor NN 3222 43 6 for for IN 3222 43 7 files file NNS 3222 43 8 compressed compress VBN 3222 43 9 using use VBG 3222 43 10 the the DT 3222 43 11 Deflate Deflate NNP 3222 43 12 algorithm,16 algorithm,16 NNS 3222 43 13 but but CC 3222 43 14 an an DT 3222 43 15 external external JJ 3222 43 16 tool tool NN 3222 43 17 — — : 3222 43 18 Precomp precomp NN 3222 43 19 — — : 3222 43 20 is be VBZ 3222 43 21 used use VBN 3222 43 22 to to TO 3222 43 23 decode decode VB 3222 43 24 natively natively RB 3222 43 25 compressed compress VBN 3222 43 26 PDF PDF NNP 3222 43 27 documents.17 documents.17 NNP 3222 43 28 The the DT 3222 43 29 second second JJ 3222 43 30 stage stage NN 3222 43 31 is be VBZ 3222 43 32 for for IN 3222 43 33 constructing construct VBG 3222 43 34 the the DT 3222 43 35 dictionary dictionary NN 3222 43 36 of of IN 3222 43 37 the the DT 3222 43 38 most most RBS 3222 43 39 frequent frequent JJ 3222 43 40 words word NNS 3222 43 41 in in IN 3222 43 42 the the DT 3222 43 43 processed process VBN 3222 43 44 document document NN 3222 43 45 . . . 3222 44 1 Doing do VBG 3222 44 2 so so RB 3222 44 3 is be VBZ 3222 44 4 a a DT 3222 44 5 good good JJ 3222 44 6 idea idea NN 3222 44 7 when when WRB 3222 44 8 the the DT 3222 44 9 compressed compress VBN 3222 44 10 documents document NNS 3222 44 11 have have VBP 3222 44 12 no no DT 3222 44 13 common common JJ 3222 44 14 set set NN 3222 44 15 of of IN 3222 44 16 words word NNS 3222 44 17 . . . 
3222 45 1 If if IN 3222 45 2 there there EX 3222 45 3 are be VBP 3222 45 4 many many JJ 3222 45 5 docu- docu- NN 3222 45 6 ments ment NNS 3222 45 7 in in IN 3222 45 8 the the DT 3222 45 9 same same JJ 3222 45 10 language language NN 3222 45 11 , , , 3222 45 12 a a DT 3222 45 13 common common JJ 3222 45 14 dictionary dictionary JJ 3222 45 15 fares fare NNS 3222 45 16 better better RB 3222 45 17 — — : 3222 45 18 it -PRON- PRP 3222 45 19 usually usually RB 3222 45 20 does do VBZ 3222 45 21 not not RB 3222 45 22 pay pay VB 3222 45 23 off off RP 3222 45 24 to to TO 3222 45 25 store store VB 3222 45 26 an an DT 3222 45 27 individual individual JJ 3222 45 28 dictionary dictionary NN 3222 45 29 with with IN 3222 45 30 each each DT 3222 45 31 file file NN 3222 45 32 because because IN 3222 45 33 they -PRON- PRP 3222 45 34 all all DT 3222 45 35 contain contain VBP 3222 45 36 similar similar JJ 3222 45 37 lists list NNS 3222 45 38 of of IN 3222 45 39 words word NNS 3222 45 40 . . . 3222 46 1 For for IN 3222 46 2 this this DT 3222 46 3 reason reason NN 3222 46 4 we -PRON- PRP 3222 46 5 have have VBP 3222 46 6 developed develop VBN 3222 46 7 two two CD 3222 46 8 variants variant NNS 3222 46 9 of of IN 3222 46 10 the the DT 3222 46 11 scheme scheme NN 3222 46 12 . . . 3222 47 1 The the DT 3222 47 2 basic basic JJ 3222 47 3 CTDL CTDL NNP 3222 47 4 includes include VBZ 3222 47 5 stage stage NN 3222 47 6 2 2 CD 3222 47 7 ; ; : 3222 47 8 therefore therefore RB 3222 47 9 it -PRON- PRP 3222 47 10 can can MD 3222 47 11 use use VB 3222 47 12 a a DT 3222 47 13 document document NN 3222 47 14 - - HYPH 3222 47 15 specific specific JJ 3222 47 16 semidynamic semidynamic JJ 3222 47 17 dictionary dictionary NNP 3222 47 18 in in IN 3222 47 19 the the DT 3222 47 20 third third JJ 3222 47 21 stage stage NN 3222 47 22 . . . 
3222 48 1 The the DT 3222 48 2 CTDL+ CTDL+ NNP 3222 48 3 variant variant JJ 3222 48 4 uses use VBZ 3222 48 5 a a DT 3222 48 6 static static JJ 3222 48 7 dictionary dictionary JJ 3222 48 8 common common NN 3222 48 9 for for IN 3222 48 10 all all DT 3222 48 11 files file NNS 3222 48 12 in in IN 3222 48 13 the the DT 3222 48 14 same same JJ 3222 48 15 lan- lan- NN 3222 48 16 guage guage NN 3222 48 17 ; ; : 3222 48 18 therefore therefore RB 3222 48 19 it -PRON- PRP 3222 48 20 can can MD 3222 48 21 omit omit VB 3222 48 22 stage stage VB 3222 48 23 2 2 CD 3222 48 24 . . . 3222 49 1 During during IN 3222 49 2 stage stage NN 3222 49 3 2 2 CD 3222 49 4 , , , 3222 49 5 all all PDT 3222 49 6 the the DT 3222 49 7 potential potential JJ 3222 49 8 dictionary dictionary JJ 3222 49 9 items item NNS 3222 49 10 that that WDT 3222 49 11 meet meet VBP 3222 49 12 the the DT 3222 49 13 word word NN 3222 49 14 requirements requirement NNS 3222 49 15 are be VBP 3222 49 16 extracted extract VBN 3222 49 17 from from IN 3222 49 18 the the DT 3222 49 19 document document NN 3222 49 20 and and CC 3222 49 21 then then RB 3222 49 22 sorted sort VBD 3222 49 23 according accord VBG 3222 49 24 to to IN 3222 49 25 their -PRON- PRP$ 3222 49 26 frequency frequency NN 3222 49 41 to to TO 3222 49 42 form form VB 3222 49 43 a a DT 3222 49 44 dictionary dictionary NN 3222 49 45 . . . 
3222 50 1 The the DT 3222 50 2 requirements requirement NNS 3222 50 3 define define VBP 3222 50 4 the the DT 3222 50 5 mini- mini- NNP 3222 50 6 mum mum NNP 3222 50 7 length length NNP 3222 50 8 and and CC 3222 50 9 frequency frequency NN 3222 50 10 of of IN 3222 50 11 a a DT 3222 50 12 word word NN 3222 50 13 in in IN 3222 50 14 the the DT 3222 50 15 document document NN 3222 50 16 ( ( -LRB- 3222 50 17 by by IN 3222 50 18 default default NN 3222 50 19 , , , 3222 50 20 2 2 CD 3222 50 21 and and CC 3222 50 22 6 6 CD 3222 50 23 respectively respectively RB 3222 50 24 ) ) -RRB- 3222 50 25 as as RB 3222 50 26 well well RB 3222 50 27 as as IN 3222 50 28 its -PRON- PRP$ 3222 50 29 content content NN 3222 50 30 . . . 3222 51 1 Only only RB 3222 51 2 the the DT 3222 51 3 following follow VBG 3222 51 4 kinds kind NNS 3222 51 5 of of IN 3222 51 6 strings string NNS 3222 51 7 are be VBP 3222 51 8 accepted accept VBN 3222 51 9 into into IN 3222 51 10 the the DT 3222 51 11 dictionary dictionary NN 3222 51 12 : : : 3222 51 13 n n LS 3222 51 14 a a DT 3222 51 15 sequence sequence NN 3222 51 16 of of IN 3222 51 17 lowercase lowercase NN 3222 51 18 and and CC 3222 51 19 uppercase uppercase JJ 3222 51 20 letters letter NNS 3222 51 21 ( ( -LRB- 3222 51 22 “ " `` 3222 51 23 a”–“z a”–“z FW 3222 51 24 ” " '' 3222 51 25 , , , 3222 51 26 “ " `` 3222 51 27 A”–“Z a”–“z NN 3222 51 28 ” " '' 3222 51 29 ) ) -RRB- 3222 51 30 and and CC 3222 51 31 characters character NNS 3222 51 32 with with IN 3222 51 33 ASCII ASCII NNP 3222 51 34 code code NN 3222 51 35 values value NNS 3222 51 36 from from IN 3222 51 37 range range NN 3222 51 38 128–255 128–255 CD 3222 51 39 ( ( -LRB- 3222 51 40 thus thus RB 3222 51 41 it -PRON- PRP 3222 51 42 supports support VBZ 3222 51 43 any any DT 3222 51 44 typical typical JJ 3222 51 45 8-bit 8-bit NNP 3222 51 46 text text NN 3222 51 47 encoding encoding NN 3222 51 48 and and CC 3222 51 49 also also RB 3222 51 50 UTF-8 UTF-8 NNP 3222 51 51 ) ) -RRB- 3222 51 52 n n NN 
3222 51 53 URL url NN 3222 51 54 address address NN 3222 51 55 prefixes prefix NNS 3222 51 56 of of IN 3222 51 57 the the DT 3222 51 58 form form NN 3222 51 59 “ " `` 3222 51 60 http:// http:// JJ 3222 51 61 domain/ domain/ NN 3222 51 62 , , , 3222 51 63 ” " '' 3222 51 64 where where WRB 3222 51 65 domain domain NN 3222 51 66 is be VBZ 3222 51 67 any any DT 3222 51 68 combination combination NN 3222 51 69 of of IN 3222 51 70 letters letter NNS 3222 51 71 , , , 3222 51 72 digits digit NNS 3222 51 73 , , , 3222 51 74 dots dot NNS 3222 51 75 , , , 3222 51 76 and and CC 3222 51 77 dashes dash VBZ 3222 51 78 n n DT 3222 51 79 e e NNP 3222 51 80 - - HYPH 3222 51 81 mails mail NNS 3222 51 82 — — : 3222 51 83 patterns pattern NNS 3222 51 84 of of IN 3222 51 85 the the DT 3222 51 86 form form NN 3222 51 87 “ " `` 3222 51 88 login@domain login@domain NNP 3222 51 89 , , , 3222 51 90 ” " '' 3222 51 91 where where WRB 3222 51 92 login login NNP 3222 51 93 and and CC 3222 51 94 domain domain NN 3222 51 95 are be VBP 3222 51 96 any any DT 3222 51 97 combination combination NN 3222 51 98 of of IN 3222 51 99 letters letter NNS 3222 51 100 , , , 3222 51 101 digits digit NNS 3222 51 102 , , , 3222 51 103 dots dot NNS 3222 51 104 , , , 3222 51 105 and and CC 3222 51 106 dashes dash VBZ 3222 51 107 n n DT 3222 51 108 runs run NNS 3222 51 109 of of IN 3222 51 110 spaces space NNS 3222 51 111 Stage Stage NNP 3222 51 112 3 3 CD 3222 51 113 begins begin VBZ 3222 51 114 with with IN 3222 51 115 parsing parse VBG 3222 51 116 the the DT 3222 51 117 text text NN 3222 51 118 into into IN 3222 51 119 tokens token NNS 3222 51 120 . . . 
3222 52 1 The the DT 3222 52 2 tokens token NNS 3222 52 3 are be VBP 3222 52 4 defined define VBN 3222 52 5 by by IN 3222 52 6 their -PRON- PRP$ 3222 52 7 content content NN 3222 52 8 ; ; : 3222 52 9 as as IN 3222 52 10 four four CD 3222 52 11 types type NNS 3222 52 12 of of IN 3222 52 13 content content NN 3222 52 14 are be VBP 3222 52 15 distinguished distinguish VBN 3222 52 16 , , , 3222 52 17 there there EX 3222 52 18 are be VBP 3222 52 19 also also RB 3222 52 20 four four CD 3222 52 21 classes class NNS 3222 52 22 of of IN 3222 52 23 tokens token NNS 3222 52 24 : : : 3222 52 25 words word NNS 3222 52 26 , , , 3222 52 27 numbers number NNS 3222 52 28 , , , 3222 52 29 special special JJ 3222 52 30 tokens token NNS 3222 52 31 , , , 3222 52 32 and and CC 3222 52 33 characters character NNS 3222 52 34 . . . 3222 53 1 Every every DT 3222 53 2 token token NN 3222 53 3 is be VBZ 3222 53 4 then then RB 3222 53 5 encoded encode VBN 3222 53 6 in in IN 3222 53 7 a a DT 3222 53 8 way way NN 3222 53 9 that that WDT 3222 53 10 depends depend VBZ 3222 53 11 on on IN 3222 53 12 the the DT 3222 53 13 class class NN 3222 53 14 it -PRON- PRP 3222 53 15 belongs belong VBZ 3222 53 16 to to TO 3222 53 17 . . . 3222 54 1 The the DT 3222 54 2 words word NNS 3222 54 3 are be VBP 3222 54 4 those those DT 3222 54 5 character character NN 3222 54 6 sequences sequence NNS 3222 54 7 that that WDT 3222 54 8 are be VBP 3222 54 9 listed list VBN 3222 54 10 in in IN 3222 54 11 the the DT 3222 54 12 dictionary dictionary NN 3222 54 13 . . . 
3222 55 1 Every every DT 3222 55 2 word word NN 3222 55 3 is be VBZ 3222 55 4 replaced replace VBN 3222 55 5 with with IN 3222 55 6 its -PRON- PRP$ 3222 55 7 diction- diction- NN 3222 55 8 ary ary NNP 3222 55 9 index index NN 3222 55 10 , , , 3222 55 11 which which WDT 3222 55 12 is be VBZ 3222 55 13 then then RB 3222 55 14 encoded encode VBN 3222 55 15 using use VBG 3222 55 16 symbols symbol NNS 3222 55 17 that that WDT 3222 55 18 are be VBP 3222 55 19 rare rare JJ 3222 55 20 or or CC 3222 55 21 nonexistent nonexistent JJ 3222 55 22 in in IN 3222 55 23 the the DT 3222 55 24 input input NN 3222 55 25 document document NN 3222 55 26 . . . 3222 56 1 Indexes index NNS 3222 56 2 are be VBP 3222 56 3 encoded encode VBN 3222 56 4 with with IN 3222 56 5 code code NN 3222 56 6 words word NNS 3222 56 7 that that WDT 3222 56 8 are be VBP 3222 56 9 between between IN 3222 56 10 one one CD 3222 56 11 and and CC 3222 56 12 four four CD 3222 56 13 bytes byte NNS 3222 56 14 long long JJ 3222 56 15 , , , 3222 56 16 with with IN 3222 56 17 lower low JJR 3222 56 18 indexes index NNS 3222 56 19 ( ( -LRB- 3222 56 20 denoting denote VBG 3222 56 21 more more JJR 3222 56 22 frequent frequent JJ 3222 56 23 words word NNS 3222 56 24 ) ) -RRB- 3222 56 25 being be VBG 3222 56 26 assigned assign VBN 3222 56 27 shorter short JJR 3222 56 28 code code NN 3222 56 29 words word NNS 3222 56 30 . . . 
3222 57 1 The the DT 3222 57 2 numbers number NNS 3222 57 3 are be VBP 3222 57 4 sequences sequence NNS 3222 57 5 of of IN 3222 57 6 decimal decimal JJ 3222 57 7 digits digit NNS 3222 57 8 , , , 3222 57 9 which which WDT 3222 57 10 are be VBP 3222 57 11 encoded encode VBN 3222 57 12 with with IN 3222 57 13 a a DT 3222 57 14 dense dense JJ 3222 57 15 binary binary JJ 3222 57 16 code code NN 3222 57 17 , , , 3222 57 18 and and CC 3222 57 19 , , , 3222 57 20 similarly similarly RB 3222 57 21 to to IN 3222 57 22 letters letter NNS 3222 57 23 , , , 3222 57 24 placed place VBN 3222 57 25 in in IN 3222 57 26 a a DT 3222 57 27 separate separate JJ 3222 57 28 location location NN 3222 57 29 in in IN 3222 57 30 the the DT 3222 57 31 output output NN 3222 57 32 file file NN 3222 57 33 . . . 3222 58 1 The the DT 3222 58 2 special special JJ 3222 58 3 tokens token NNS 3222 58 4 can can MD 3222 58 5 be be VB 3222 58 6 decimal decimal JJ 3222 58 7 fractions fraction NNS 3222 58 8 , , , 3222 58 9 IP IP NNP 3222 58 10 numeri- numeri- JJ 3222 58 11 cal cal NN 3222 58 12 addresses address NNS 3222 58 13 , , , 3222 58 14 dates date NNS 3222 58 15 , , , 3222 58 16 times time NNS 3222 58 17 , , , 3222 58 18 and and CC 3222 58 19 numerical numerical JJ 3222 58 20 ranges range NNS 3222 58 21 . . . 
3222 59 1 As as IN 3222 59 2 they -PRON- PRP 3222 59 3 have have VBP 3222 59 4 a a DT 3222 59 5 strict strict JJ 3222 59 6 format format NN 3222 59 7 and and CC 3222 59 8 differ differ VBP 3222 59 9 only only RB 3222 59 10 in in IN 3222 59 11 numerical numerical JJ 3222 59 12 values value NNS 3222 59 13 , , , 3222 59 14 they -PRON- PRP 3222 59 15 are be VBP 3222 59 16 encoded encode VBN 3222 59 17 as as IN 3222 59 18 sequences sequence NNS 3222 59 19 of of IN 3222 59 20 numbers.18 numbers.18 NNP 3222 59 21 Finally finally RB 3222 59 22 , , , 3222 59 23 the the DT 3222 59 24 characters character NNS 3222 59 25 are be VBP 3222 59 26 the the DT 3222 59 27 tokens token NNS 3222 59 28 that that WDT 3222 59 29 do do VBP 3222 59 30 not not RB 3222 59 31 belong belong VB 3222 59 32 to to IN 3222 59 33 any any DT 3222 59 34 of of IN 3222 59 35 the the DT 3222 59 36 aforementioned aforementioned JJ 3222 59 37 group group NN 3222 59 38 . . . 3222 60 1 They -PRON- PRP 3222 60 2 are be VBP 3222 60 3 sim- sim- RB 3222 60 4 ply ply RB 3222 60 5 copied copied JJ 3222 60 6 to to IN 3222 60 7 the the DT 3222 60 8 output output NN 3222 60 9 file file NN 3222 60 10 , , , 3222 60 11 with with IN 3222 60 12 the the DT 3222 60 13 exception exception NN 3222 60 14 of of IN 3222 60 15 those those DT 3222 60 16 rare rare JJ 3222 60 17 characters character NNS 3222 60 18 that that WDT 3222 60 19 were be VBD 3222 60 20 used use VBN 3222 60 21 to to TO 3222 60 22 construct construct VB 3222 60 23 code code NN 3222 60 24 words word NNS 3222 60 25 ; ; : 3222 60 26 they -PRON- PRP 3222 60 27 are be VBP 3222 60 28 copied copy VBN 3222 60 29 as as RB 3222 60 30 well well RB 3222 60 31 , , , 3222 60 32 but but CC 3222 60 33 have have VBP 3222 60 34 to to TO 3222 60 35 be be VB 3222 60 36 preceded precede VBN 3222 60 37 with with IN 3222 60 38 a a DT 3222 60 39 special special JJ 3222 60 40 escape escape NN 3222 60 41 symbol symbol NN 3222 60 42 . . . 
3222 61 1 The the DT 3222 61 2 specialized specialized JJ 3222 61 3 transform transform NN 3222 61 4 variants variant NNS 3222 61 5 ( ( -LRB- 3222 61 6 see see VB 3222 61 7 the the DT 3222 61 8 next next JJ 3222 61 9 sec- sec- JJ 3222 61 10 tion tion NN 3222 61 11 ) ) -RRB- 3222 61 12 distinguish distinguish VB 3222 61 13 three three CD 3222 61 14 additional additional JJ 3222 61 15 classes class NNS 3222 61 16 from from IN 3222 61 17 the the DT 3222 61 18 charac- charac- NNP 3222 61 19 ter ter NN 3222 61 20 class class NN 3222 61 21 : : : 3222 61 22 letters letter NNS 3222 61 23 ( ( -LRB- 3222 61 24 words word NNS 3222 61 25 not not RB 3222 61 26 in in IN 3222 61 27 the the DT 3222 61 28 dictionary dictionary NN 3222 61 29 ) ) -RRB- 3222 61 30 , , , 3222 61 31 single single JJ 3222 61 32 white white JJ 3222 61 33 spaces space NNS 3222 61 34 , , , 3222 61 35 and and CC 3222 61 36 multiple multiple JJ 3222 61 37 white white JJ 3222 61 38 spaces space NNS 3222 61 39 . . . 3222 62 1 Stage stage NN 3222 62 2 4 4 CD 3222 62 3 could could MD 3222 62 4 use use VB 3222 62 5 any any DT 3222 62 6 general general JJ 3222 62 7 - - HYPH 3222 62 8 purpose purpose NN 3222 62 9 compression compression NN 3222 62 10 method method NN 3222 62 11 to to TO 3222 62 12 encode encode VB 3222 62 13 the the DT 3222 62 14 output output NN 3222 62 15 of of IN 3222 62 16 stage stage NN 3222 62 17 3 3 CD 3222 62 18 . . . 
3222 63 1 For for IN 3222 63 2 this this DT 3222 63 3 role role NN 3222 63 4 , , , 3222 63 5 we -PRON- PRP 3222 63 6 have have VBP 3222 63 7 investigated investigate VBN 3222 63 8 several several JJ 3222 63 9 open open JJ 3222 63 10 - - HYPH 3222 63 11 licensed licensed JJ 3222 63 12 , , , 3222 63 13 general- general- NN 3222 63 14 purpose purpose NN 3222 63 15 compression compression NN 3222 63 16 algorithms algorithm NNS 3222 63 17 that that WDT 3222 63 18 differ differ VBP 3222 63 19 in in IN 3222 63 20 speed speed NN 3222 63 21 and and CC 3222 63 22 efficiency efficiency NN 3222 63 23 . . . 3222 64 1 As as IN 3222 64 2 we -PRON- PRP 3222 64 3 believe believe VBP 3222 64 4 that that IN 3222 64 5 document document NN 3222 64 6 access access NN 3222 64 7 speed speed NN 3222 64 8 is be VBZ 3222 64 9 important important JJ 3222 64 10 to to IN 3222 64 11 textual textual JJ 3222 64 12 digital digital JJ 3222 64 13 libraries library NNS 3222 64 14 , , , 3222 64 15 we -PRON- PRP 3222 64 16 have have VBP 3222 64 17 decided decide VBN 3222 64 18 to to TO 3222 64 19 focus focus VB 3222 64 20 on on IN 3222 64 21 LZ LZ NNP 3222 64 22 – – : 3222 64 23 type type NN 3222 64 24 algorithms algorithm NNS 3222 64 25 because because IN 3222 64 26 they -PRON- PRP 3222 64 27 offer offer VBP 3222 64 28 the the DT 3222 64 29 best good JJS 3222 64 30 decompression decompression NN 3222 64 31 times time NNS 3222 64 32 . . . 
3222 65 1 CTDL CTDL NNP 3222 65 2 has have VBZ 3222 65 3 two two CD 3222 65 4 embedded embed VBN 3222 65 5 back- back- NN 3222 65 6 end end NN 3222 65 7 compressors compressor NNS 3222 65 8 : : : 3222 65 9 the the DT 3222 65 10 standard standard JJ 3222 65 11 Deflate Deflate NNP 3222 65 12 and and CC 3222 65 13 LZMA LZMA NNP 3222 65 14 , , , 3222 65 15 well- well- NN 3222 65 16 known know VBN 3222 65 17 for for IN 3222 65 18 its -PRON- PRP$ 3222 65 19 ability ability NN 3222 65 20 to to TO 3222 65 21 attain attain VB 3222 65 22 high high JJ 3222 65 23 compression compression NN 3222 65 24 ratios.19 ratios.19 NNP 3222 65 25 n n CC 3222 65 26 Adapting adapt VBG 3222 65 27 the the DT 3222 65 28 transform transform NN 3222 65 29 for for IN 3222 65 30 individual individual JJ 3222 65 31 text text NN 3222 65 32 document document NN 3222 65 33 formats format VBZ 3222 65 34 The the DT 3222 65 35 text text NN 3222 65 36 document document NN 3222 65 37 formats format NNS 3222 65 38 have have VBP 3222 65 39 individual individual JJ 3222 65 40 character- character- NN 3222 65 41 istics istic NNS 3222 65 42 ; ; : 3222 65 43 therefore therefore RB 3222 65 44 the the DT 3222 65 45 compression compression NN 3222 65 46 ratio ratio NN 3222 65 47 can can MD 3222 65 48 be be VB 3222 65 49 improved improve VBN 3222 65 50 by by IN 3222 65 51 adapting adapt VBG 3222 65 52 the the DT 3222 65 53 transform transform NN 3222 65 54 for for IN 3222 65 55 a a DT 3222 65 56 particular particular JJ 3222 65 57 format format NN 3222 65 58 . . . 
3222 66 1 As as IN 3222 66 2 we -PRON- PRP 3222 66 3 noted note VBD 3222 66 4 in in IN 3222 66 5 the the DT 3222 66 6 introduction introduction NN 3222 66 7 , , , 3222 66 8 we -PRON- PRP 3222 66 9 propose propose VBP 3222 66 10 a a DT 3222 66 11 set set NN 3222 66 12 of of IN 3222 66 13 sub- sub- JJ 3222 66 14 schemes scheme NNS 3222 66 15 ( ( -LRB- 3222 66 16 modifications modification NNS 3222 66 17 of of IN 3222 66 18 the the DT 3222 66 19 original original JJ 3222 66 20 processing processing NN 3222 66 21 steps step NNS 3222 66 22 or or CC 3222 66 23 additional additional JJ 3222 66 24 processing processing NN 3222 66 25 steps step NNS 3222 66 26 ) ) -RRB- 3222 66 27 that that WDT 3222 66 28 can can MD 3222 66 29 help help VB 3222 66 30 compression compression NN 3222 66 31 — — : 3222 66 32 provided provide VBN 3222 66 33 the the DT 3222 66 34 issue issue NN 3222 66 35 that that IN 3222 66 36 a a DT 3222 66 37 given give VBN 3222 66 38 subscheme subscheme NN 3222 66 39 addresses address NNS 3222 66 40 is be VBZ 3222 66 41 valid valid JJ 3222 66 42 for for IN 3222 66 43 the the DT 3222 66 44 document document NN 3222 66 45 format format NN 3222 66 46 being be VBG 3222 66 47 compressed compress VBN 3222 66 48 . . . 3222 67 1 There there EX 3222 67 2 are be VBP 3222 67 3 two two CD 3222 67 4 groups group NNS 3222 67 5 of of IN 3222 67 6 subschemes subscheme NNS 3222 67 7 : : : 3222 67 8 the the DT 3222 67 9 first first JJ 3222 67 10 consists consist NNS 3222 67 11 of of IN 3222 67 12 solu- solu- NNS 3222 67 13 tions tion NNS 3222 67 14 that that WDT 3222 67 15 can can MD 3222 67 16 be be VB 3222 67 17 applied apply VBN 3222 67 18 to to IN 3222 67 19 more more JJR 3222 67 20 than than IN 3222 67 21 one one CD 3222 67 22 document document NN 3222 67 23 format format NN 3222 67 24 . . . 
3222 68 1 It -PRON- PRP 3222 68 2 includes include VBZ 3222 68 3 n n IN 3222 68 4 changing change VBG 3222 68 5 the the DT 3222 68 6 minimum minimum JJ 3222 68 7 word word NN 3222 68 8 frequency frequency NN 3222 68 9 threshold threshold NN 3222 68 10 ( ( -LRB- 3222 68 11 the the DT 3222 68 12 “ " `` 3222 68 13 MinFr MinFr NNP 3222 68 14 ” " '' 3222 68 15 column column NN 3222 68 16 in in IN 3222 68 17 table table NN 3222 68 18 1 1 CD 3222 68 19 ) ) -RRB- 3222 68 20 that that IN 3222 68 21 a a DT 3222 68 22 word word NN 3222 68 23 must must MD 3222 68 24 pass pass VB 3222 68 25 to to TO 3222 68 26 be be VB 3222 68 27 included include VBN 3222 68 28 in in IN 3222 68 29 the the DT 3222 68 30 semidynamic semidynamic JJ 3222 68 31 dictionary dictionary NNP 3222 68 32 ( ( -LRB- 3222 68 33 notice notice NN 3222 68 34 that that IN 3222 68 35 no no DT 3222 68 36 word word NN 3222 68 37 can can MD 3222 68 38 be be VB 3222 68 39 added add VBN 3222 68 40 to to IN 3222 68 41 a a DT 3222 68 42 static static JJ 3222 68 43 dic- dic- NN 3222 68 44 tionary tionary NN 3222 68 45 ) ) -RRB- 3222 68 46 ; ; : 3222 68 47 n n LS 3222 68 48 using use VBG 3222 68 49 spaceless spaceless NN 3222 68 50 word word NN 3222 68 51 model model NN 3222 68 52 ( ( -LRB- 3222 68 53 “ " `` 3222 68 54 WdSpc WdSpc NNP 3222 68 55 ” " '' 3222 68 56 column column NN 3222 68 57 in in IN 3222 68 58 table table NN 3222 68 59 1 1 CD 3222 68 60 ) ) -RRB- 3222 68 61 in in IN 3222 68 62 which which WDT 3222 68 63 a a DT 3222 68 64 single single JJ 3222 68 65 space space NN 3222 68 66 between between IN 3222 68 67 two two CD 3222 68 68 words word NNS 3222 68 69 is be VBZ 3222 68 70 not not RB 3222 68 71 encoded encode VBN 3222 68 72 at at RB 3222 68 73 all all RB 3222 68 74 ; ; : 3222 68 75 instead instead RB 3222 68 76 , , , 3222 68 77 a a DT 3222 68 78 flag flag NN 3222 68 79 is be VBZ 3222 68 80 used use VBN 3222 68 81 to to TO 3222 68 82 mark mark VB 3222 68 83 two two CD 3222 68 84 neighboring neighboring JJ 
3222 68 85 words word NNS 3222 68 86 that that WDT 3222 68 87 are be VBP 3222 68 88 not not RB 3222 68 89 separated separate VBN 3222 68 90 by by IN 3222 68 91 a a DT 3222 68 92 space space NN 3222 68 93 ; ; : 3222 68 94 n n JJ 3222 68 95 run run VBN 3222 68 96 - - HYPH 3222 68 97 length length NN 3222 68 98 encoding encoding NN 3222 68 99 of of IN 3222 68 100 multiple multiple JJ 3222 68 101 spaces space NNS 3222 68 102 ( ( -LRB- 3222 68 103 “ " `` 3222 68 104 SpRuns SpRuns NNP 3222 68 105 ” " '' 3222 68 106 column column NN 3222 68 107 in in IN 3222 68 108 table table NN 3222 68 109 1 1 CD 3222 68 110 ) ) -RRB- 3222 68 111 ; ; : 3222 68 112 n n DT 3222 68 113 letter letter NN 3222 68 114 containers container NNS 3222 68 115 ( ( -LRB- 3222 68 116 “ " `` 3222 68 117 LetCnt LetCnt NNP 3222 68 118 ” " '' 3222 68 119 column column NN 3222 68 120 in in IN 3222 68 121 table table NN 3222 68 122 1 1 CD 3222 68 123 ) ) -RRB- 3222 68 124 , , , 3222 68 125 that that RB 3222 68 126 is is RB 3222 68 127 , , , 3222 68 128 removing remove VBG 3222 68 129 sequences sequence NNS 3222 68 130 of of IN 3222 68 131 letters letter NNS 3222 68 132 ( ( -LRB- 3222 68 133 belonging belong VBG 3222 68 134 to to IN 3222 68 135 words word NNS 3222 68 136 that that WDT 3222 68 137 are be VBP 3222 68 138 not not RB 3222 68 139 included include VBN 3222 68 140 in in IN 3222 68 141 the the DT 3222 68 142 dictionary dictionary NN 3222 68 143 ) ) -RRB- 3222 68 144 to to IN 3222 68 145 a a DT 3222 68 146 separate separate JJ 3222 68 147 location location NN 3222 68 148 in in IN 3222 68 149 the the DT 3222 68 150 output output NN 3222 68 151 file file NN 3222 68 152 ( ( -LRB- 3222 68 153 and and CC 3222 68 154 leaving leave VBG 3222 68 155 a a DT 3222 68 156 flag flag NN 3222 68 157 at at IN 3222 68 158 their -PRON- PRP$ 3222 68 159 original original JJ 3222 68 160 position position NN 3222 68 161 ) ) -RRB- 3222 68 162 . . . 
3222 69 1 Table table NN 3222 69 2 1 1 CD 3222 69 3 shows show VBZ 3222 69 4 the the DT 3222 69 5 assignment assignment NN 3222 69 6 of of IN 3222 69 7 the the DT 3222 69 8 mentioned mention VBN 3222 69 9 sub- sub- DT 3222 69 10 schemes scheme NNS 3222 69 11 to to IN 3222 69 12 document document NN 3222 69 13 formats format NNS 3222 69 14 , , , 3222 69 15 with with IN 3222 69 16 “ " `` 3222 69 17 + + NNS 3222 69 18 ” " '' 3222 69 19 denoting denote VBG 3222 69 20 that that IN 3222 69 21 a a DT 3222 69 22 given give VBN 3222 69 23 subscheme subscheme NN 3222 69 24 should should MD 3222 69 25 be be VB 3222 69 26 applied apply VBN 3222 69 27 when when WRB 3222 69 28 processing process VBG 3222 69 29 a a DT 3222 69 30 given give VBN 3222 69 31 document document NN 3222 69 32 format format NN 3222 69 33 . . . 3222 70 1 Notice notice VB 3222 70 2 that that IN 3222 70 3 we -PRON- PRP 3222 70 4 use use VBP 3222 70 5 different different JJ 3222 70 6 subschemes subscheme NNS 3222 70 7 for for IN 3222 70 8 the the DT 3222 70 9 same same JJ 3222 70 10 format format NN 3222 70 11 depending depend VBG 3222 70 12 on on IN 3222 70 13 whether whether IN 3222 70 14 a a DT 3222 70 15 semidynamic semidynamic NN 3222 70 16 ( ( -LRB- 3222 70 17 CTDL CTDL NNP 3222 70 18 ) ) -RRB- 3222 70 19 or or CC 3222 70 20 static static JJ 3222 70 21 ( ( -LRB- 3222 70 22 CTDL+ CTDL+ NNP 3222 70 23 ) ) -RRB- 3222 70 24 dictionary dictionary NNP 3222 70 25 is be VBZ 3222 70 26 used use VBN 3222 70 27 . . . 3222 71 1 The the DT 3222 71 2 remaining remain VBG 3222 71 3 subschemes subscheme NNS 3222 71 4 are be VBP 3222 71 5 applied apply VBN 3222 71 6 for for IN 3222 71 7 only only RB 3222 71 8 one one CD 3222 71 9 document document NN 3222 71 10 format format NN 3222 71 11 . . . 
3222 72 1 They -PRON- PRP 3222 72 2 attain attain VBP 3222 72 3 an an DT 3222 72 4 improvement improvement NN 3222 72 5 in in IN 3222 72 6 com- com- NN 3222 72 7 pression pression NN 3222 72 8 performance performance NN 3222 72 9 by by IN 3222 72 10 changing change VBG 3222 72 11 the the DT 3222 72 12 definition definition NN 3222 72 13 of of IN 3222 72 14 acceptable acceptable JJ 3222 72 15 dictionary dictionary JJ 3222 72 16 words word NNS 3222 72 17 , , , 3222 72 18 and and CC 3222 72 19 , , , 3222 72 20 in in IN 3222 72 21 one one CD 3222 72 22 case case NN 3222 72 23 ( ( -LRB- 3222 72 24 PS PS NNP 3222 72 25 ) ) -RRB- 3222 72 26 , , , 3222 72 27 by by IN 3222 72 28 changing change VBG 3222 72 29 the the DT 3222 72 30 definition definition NN 3222 72 31 of of IN 3222 72 32 number number NN 3222 72 33 strings string NNS 3222 72 34 . . . 3222 73 1 The the DT 3222 73 2 encoder encoder NN 3222 73 3 for for IN 3222 73 4 the the DT 3222 73 5 simplest simple JJS 3222 73 6 of of IN 3222 73 7 the the DT 3222 73 8 examined examine VBN 3222 73 9 for- for- IN 3222 73 10 mats mat NNS 3222 73 11 — — : 3222 73 12 plain plain JJ 3222 73 13 text text NN 3222 73 14 files file NNS 3222 73 15 — — : 3222 73 16 performs perform VBZ 3222 73 17 no no DT 3222 73 18 additional additional JJ 3222 73 19 format- format- NN 3222 73 20 specific specific JJ 3222 73 21 processing processing NN 3222 73 22 . . . 3222 74 1 The the DT 3222 74 2 first first JJ 3222 74 3 such such JJ 3222 74 4 modification modification NN 3222 74 5 is be VBZ 3222 74 6 in in IN 3222 74 7 the the DT 3222 74 8 TEX TEX NNP 3222 74 9 encoder encoder NN 3222 74 10 . . . 
3222 75 1 The the DT 3222 75 2 difference difference NN 3222 75 3 is be VBZ 3222 75 4 that that IN 3222 75 5 words word NNS 3222 75 6 beginning begin VBG 3222 75 7 with with IN 3222 75 8 “ " `` 3222 75 9 \ \ NNP 3222 75 10 ” " '' 3222 75 11 ( ( -LRB- 3222 75 12 TEX TEX NNP 3222 75 13 146 146 CD 3222 75 14 iNForMaTioN iNForMaTioN NNP 3222 75 15 TECHNoloGY TECHNoloGY NNP 3222 75 16 aND and CC 3222 75 17 liBrariES liBrariES NNP 3222 75 18 | | NNP 3222 75 19 SEpTEMBEr september CD 3222 75 20 2009 2009 CD 3222 75 21 instructions instruction NNS 3222 75 22 ) ) -RRB- 3222 75 23 are be VBP 3222 75 24 now now RB 3222 75 25 accepted accept VBN 3222 75 26 in in IN 3222 75 27 the the DT 3222 75 28 dictionary dictionary NN 3222 75 29 . . . 3222 76 1 The the DT 3222 76 2 modification modification NN 3222 76 3 for for IN 3222 76 4 PDF PDF NNP 3222 76 5 documents document NNS 3222 76 6 is be VBZ 3222 76 7 similar similar JJ 3222 76 8 . . . 3222 77 1 In in IN 3222 77 2 this this DT 3222 77 3 case case NN 3222 77 4 , , , 3222 77 5 bracketed bracket VBN 3222 77 6 words word NNS 3222 77 7 ( ( -LRB- 3222 77 8 PDF PDF NNP 3222 77 9 entities entity NNS 3222 77 10 ) ) -RRB- 3222 77 11 — — : 3222 77 12 for for IN 3222 77 13 example example NN 3222 77 14 “ " `` 3222 77 15 ( ( -LRB- 3222 77 16 abc)”—are abc)”—are NNP 3222 77 17 accept- accept- NNS 3222 77 18 able able JJ 3222 77 19 as as IN 3222 77 20 dictionary dictionary JJ 3222 77 21 entries entry NNS 3222 77 22 . . . 
3222 78 1 Notice notice VB 3222 78 2 that that IN 3222 78 3 PDF PDF NNP 3222 78 4 files file NNS 3222 78 5 are be VBP 3222 78 6 internally internally RB 3222 78 7 compressed compress VBN 3222 78 8 by by IN 3222 78 9 default default NN 3222 78 10 — — : 3222 78 11 the the DT 3222 78 12 transform transform NN 3222 78 13 can can MD 3222 78 14 be be VB 3222 78 15 applied apply VBN 3222 78 16 after after IN 3222 78 17 decompressing decompress VBG 3222 78 18 them -PRON- PRP 3222 78 19 into into IN 3222 78 20 textual textual JJ 3222 78 21 format format NN 3222 78 22 . . . 3222 79 1 The the DT 3222 79 2 Precomp Precomp NNP 3222 79 3 tool tool NN 3222 79 4 is be VBZ 3222 79 5 used use VBN 3222 79 6 for for IN 3222 79 7 this this DT 3222 79 8 purpose purpose NN 3222 79 9 . . . 3222 80 1 The the DT 3222 80 2 subscheme subscheme NN 3222 80 3 for for IN 3222 80 4 PS PS NNP 3222 80 5 files file NNS 3222 80 6 features feature VBZ 3222 80 7 two two CD 3222 80 8 modifications modification NNS 3222 80 9 : : : 3222 80 10 Its -PRON- PRP$ 3222 80 11 dictionary dictionary NN 3222 80 12 accepts accept VBZ 3222 80 13 words word NNS 3222 80 14 begin- begin- JJ 3222 80 15 ning ning NN 3222 80 16 with with IN 3222 80 17 “ " `` 3222 80 18 / / , 3222 80 19 ” " '' 3222 80 20 and and CC 3222 80 21 “ " `` 3222 80 22 \ \ NN 3222 80 23 ” " '' 3222 80 24 or or CC 3222 80 25 ending end VBG 3222 80 26 with with IN 3222 80 27 “ " `` 3222 80 28 ( ( -LRB- 3222 80 29 “ " `` 3222 80 30 , , , 3222 80 31 and and CC 3222 80 32 its -PRON- PRP$ 3222 80 33 number number NN 3222 80 34 tokens token NNS 3222 80 35 can can MD 3222 80 36 contain contain VB 3222 80 37 not not RB 3222 80 38 only only RB 3222 80 39 deci- deci- XX 3222 80 40 mal mal NNP 3222 80 41 but but CC 3222 80 42 also also RB 3222 80 43 hexadecimal hexadecimal JJ 3222 80 44 digits digit NNS 3222 80 45 ( ( -LRB- 3222 80 46 though though IN 3222 80 47 a a DT 3222 80 48 single single JJ 3222 80 49 number number NN 3222 80 50 must must MD 3222 80 51 
have have VB 3222 80 52 at at RB 3222 80 53 least least RBS 3222 80 54 one one CD 3222 80 55 decimal decimal JJ 3222 80 56 digit digit NN 3222 80 57 ) ) -RRB- 3222 80 58 . . . 3222 81 1 The the DT 3222 81 2 hexadecimal hexadecimal JJ 3222 81 3 number number NN 3222 81 4 must must MD 3222 81 5 be be VB 3222 81 6 at at RB 3222 81 7 least least RBS 3222 81 8 6 6 CD 3222 81 9 digits digit NNS 3222 81 10 long long JJ 3222 81 11 , , , 3222 81 12 and and CC 3222 81 13 is be VBZ 3222 81 14 encoded encode VBN 3222 81 15 with with IN 3222 81 16 a a DT 3222 81 17 flag flag NN 3222 81 18 : : : 3222 81 19 a a DT 3222 81 20 byte byte NN 3222 81 21 containing contain VBG 3222 81 22 its -PRON- PRP$ 3222 81 23 length length NN 3222 81 24 ( ( -LRB- 3222 81 25 numbers number NNS 3222 81 26 with with IN 3222 81 27 more more JJR 3222 81 28 than than IN 3222 81 29 261 261 CD 3222 81 30 digits digit NNS 3222 81 31 are be VBP 3222 81 32 split split VBN 3222 81 33 into into IN 3222 81 34 parts part NNS 3222 81 35 ) ) -RRB- 3222 81 36 and and CC 3222 81 37 a a DT 3222 81 38 sequence sequence NN 3222 81 39 of of IN 3222 81 40 bytes byte NNS 3222 81 41 , , , 3222 81 42 each each DT 3222 81 43 containing contain VBG 3222 81 44 two two CD 3222 81 45 digits digit NNS 3222 81 46 from from IN 3222 81 47 the the DT 3222 81 48 number number NN 3222 81 49 ( ( -LRB- 3222 81 50 if if IN 3222 81 51 the the DT 3222 81 52 number number NN 3222 81 53 of of IN 3222 81 54 digits digit NNS 3222 81 55 is be VBZ 3222 81 56 odd odd JJ 3222 81 57 , , , 3222 81 58 the the DT 3222 81 59 last last JJ 3222 81 60 byte byte NN 3222 81 61 contains contain VBZ 3222 81 62 only only RB 3222 81 63 one one CD 3222 81 64 digit digit NN 3222 81 65 ) ) -RRB- 3222 81 66 . . . 
3222 82 1 For for IN 3222 82 2 RTF RTF NNP 3222 82 3 documents document NNS 3222 82 4 , , , 3222 82 5 the the DT 3222 82 6 dictionary dictionary NN 3222 82 7 accepts accept VBZ 3222 82 8 the the DT 3222 82 9 “ " `` 3222 82 10 \”-preceded \”-preceded JJ 3222 82 11 words word NNS 3222 82 12 , , , 3222 82 13 like like IN 3222 82 14 the the DT 3222 82 15 TEX TEX NNP 3222 82 16 files file NNS 3222 82 17 . . . 3222 83 1 Moreover moreover RB 3222 83 2 , , , 3222 83 3 the the DT 3222 83 4 hexadecimal hexadecimal JJ 3222 83 5 numbers number NNS 3222 83 6 are be VBP 3222 83 7 encoded encode VBN 3222 83 8 in in IN 3222 83 9 the the DT 3222 83 10 same same JJ 3222 83 11 way way NN 3222 83 12 as as IN 3222 83 13 in in IN 3222 83 14 the the DT 3222 83 15 PS PS NNP 3222 83 16 subscheme subscheme NN 3222 83 17 so so IN 3222 83 18 that that IN 3222 83 19 RTF RTF NNP 3222 83 20 documents document NNS 3222 83 21 containing contain VBG 3222 83 22 images image NNS 3222 83 23 can can MD 3222 83 24 be be VB 3222 83 25 significantly significantly RB 3222 83 26 reduced reduce VBN 3222 83 27 in in IN 3222 83 28 size size NN 3222 83 29 . . . 3222 84 1 Specialization specialization NN 3222 84 2 for for IN 3222 84 3 XML xml NN 3222 84 4 is be VBZ 3222 84 5 roughly roughly RB 3222 84 6 the the DT 3222 84 7 transform transform NN 3222 84 8 described describe VBN 3222 84 9 in in IN 3222 84 10 our -PRON- PRP$ 3222 84 11 earlier early JJR 3222 84 12 article article NN 3222 84 13 , , , 3222 84 14 “ " `` 3222 84 15 Revisiting Revisiting NNP 3222 84 16 Dictionary- Dictionary- NNP 3222 84 17 Based Based NNP 3222 84 18 Compression Compression NNP 3222 84 19 . . . 
3222 84 20 ”20 ”20 UH 3222 84 21 It -PRON- PRP 3222 84 22 allows allow VBZ 3222 84 23 for for IN 3222 84 24 XML xml NN 3222 84 25 start start NN 3222 84 26 tags tag NNS 3222 84 27 and and CC 3222 84 28 entities entity NNS 3222 84 29 to to TO 3222 84 30 be be VB 3222 84 31 added add VBN 3222 84 32 to to IN 3222 84 33 dictionary dictionary NNP 3222 84 34 , , , 3222 84 35 and and CC 3222 84 36 it -PRON- PRP 3222 84 37 replaces replace VBZ 3222 84 38 every every DT 3222 84 39 end end NN 3222 84 40 tag tag NN 3222 84 41 respecting respect VBG 3222 84 42 the the DT 3222 84 43 XML xml NN 3222 84 44 well well NN 3222 84 45 - - HYPH 3222 84 46 formedness formedness NN 3222 84 47 rule rule NN 3222 84 48 ( ( -LRB- 3222 84 49 i.e. i.e. FW 3222 84 50 , , , 3222 84 51 closing close VBG 3222 84 52 the the DT 3222 84 53 element element NN 3222 84 54 opened open VBD 3222 84 55 most most RBS 3222 84 56 recently recently RB 3222 84 57 ) ) -RRB- 3222 84 58 with with IN 3222 84 59 a a DT 3222 84 60 single single JJ 3222 84 61 flag flag NN 3222 84 62 . . . 3222 85 1 It -PRON- PRP 3222 85 2 also also RB 3222 85 3 uses use VBZ 3222 85 4 a a DT 3222 85 5 single single JJ 3222 85 6 flag flag NN 3222 85 7 to to TO 3222 85 8 denote denote VB 3222 85 9 XML xml NN 3222 85 10 attribute attribute NN 3222 85 11 value value NN 3222 85 12 begin begin VB 3222 85 13 and and CC 3222 85 14 end end VB 3222 85 15 marks mark NNS 3222 85 16 . . . 3222 86 1 HTML html NN 3222 86 2 documents document NNS 3222 86 3 are be VBP 3222 86 4 handled handle VBN 3222 86 5 similarly similarly RB 3222 86 6 . . . 
3222 87 1 The the DT 3222 87 2 only only JJ 3222 87 3 dif- dif- JJ 3222 87 4 ference ference NN 3222 87 5 is be VBZ 3222 87 6 that that IN 3222 87 7 the the DT 3222 87 8 tags tag NNS 3222 87 9 that that WDT 3222 87 10 , , , 3222 87 11 according accord VBG 3222 87 12 to to IN 3222 87 13 the the DT 3222 87 14 HTML HTML NNP 3222 87 15 4.01 4.01 CD 3222 87 16 specification specification NN 3222 87 17 , , , 3222 87 18 are be VBP 3222 87 19 not not RB 3222 87 20 expected expect VBN 3222 87 21 to to TO 3222 87 22 be be VB 3222 87 23 followed follow VBN 3222 87 24 by by IN 3222 87 25 an an DT 3222 87 26 end- end- NN 3222 87 27 tag tag NN 3222 87 28 ( ( -LRB- 3222 87 29 BASE BASE NNP 3222 87 30 , , , 3222 87 31 LINK LINK NNP 3222 87 32 , , , 3222 87 33 XBASEHREF xbasehref NN 3222 87 34 , , , 3222 87 35 BR BR NNP 3222 87 36 , , , 3222 87 37 META META NNP 3222 87 38 , , , 3222 87 39 HR hr NN 3222 87 40 , , , 3222 87 41 IMG IMG NNP 3222 87 42 , , , 3222 87 43 AREA AREA NNP 3222 87 44 , , , 3222 87 45 INPUT INPUT NNP 3222 87 46 , , , 3222 87 47 EMBED EMBED NNS 3222 87 48 , , , 3222 87 49 PARAM PARAM NNS 3222 87 50 and and CC 3222 87 51 COL col NN 3222 87 52 ) ) -RRB- 3222 87 53 are be VBP 3222 87 54 ignored ignore VBN 3222 87 55 by by IN 3222 87 56 the the DT 3222 87 57 mechanism mechanism NN 3222 87 58 replacing replace VBG 3222 87 59 closing close VBG 3222 87 60 tags tag NNS 3222 87 61 ( ( -LRB- 3222 87 62 so so IN 3222 87 63 that that IN 3222 87 64 it -PRON- PRP 3222 87 65 can can MD 3222 87 66 guess guess VB 3222 87 67 the the DT 3222 87 68 correct correct JJ 3222 87 69 closing closing NN 3222 87 70 tag tag NN 3222 87 71 even even RB 3222 87 72 after after IN 3222 87 73 the the DT 3222 87 74 singular singular JJ 3222 87 75 tags tag NNS 3222 87 76 were be VBD 3222 87 77 encountered).21 encountered).21 NNP 3222 87 78 n n CC 3222 87 79 Using use VBG 3222 87 80 the the DT 3222 87 81 scheme scheme NN 3222 87 82 in in IN 3222 87 83 a a DT 3222 87 84 digital digital JJ 3222 87 85 
library library NN 3222 87 86 project project NN 3222 87 87 Many many JJ 3222 87 88 textual textual JJ 3222 87 89 digital digital JJ 3222 87 90 libraries library NNS 3222 87 91 seriously seriously RB 3222 87 92 lack lack VBP 3222 87 93 text text NN 3222 87 94 compres- compres- VBZ 3222 87 95 sion sion NN 3222 87 96 capabilities capability NNS 3222 87 97 , , , 3222 87 98 and and CC 3222 87 99 popular popular JJ 3222 87 100 digital digital JJ 3222 87 101 library library NN 3222 87 102 systems system NNS 3222 87 103 , , , 3222 87 104 such such JJ 3222 87 105 as as IN 3222 87 106 Greenstone Greenstone NNP 3222 87 107 , , , 3222 87 108 have have VBP 3222 87 109 no no DT 3222 87 110 embedded embed VBN 3222 87 111 efficient efficient JJ 3222 87 112 text text NN 3222 87 113 compression.22 compression.22 NNP 3222 87 114 Therefore therefore RB 3222 87 115 we -PRON- PRP 3222 87 116 have have VBP 3222 87 117 decided decide VBN 3222 87 118 to to TO 3222 87 119 develop develop VB 3222 87 120 CTDL CTDL NNP 3222 87 121 as as IN 3222 87 122 an an DT 3222 87 123 open open JJ 3222 87 124 - - HYPH 3222 87 125 source source NN 3222 87 126 software software NN 3222 87 127 library library NN 3222 87 128 . . . 3222 88 1 The the DT 3222 88 2 library library NN 3222 88 3 is be VBZ 3222 88 4 free free JJ 3222 88 5 to to TO 3222 88 6 use use VB 3222 88 7 and and CC 3222 88 8 can can MD 3222 88 9 be be VB 3222 88 10 downloaded download VBN 3222 88 11 from from IN 3222 88 12 www.ii.uni.wroc www.ii.uni.wroc NNP 3222 88 13 .pl/~inikep .pl/~inikep . 3222 88 14 / / SYM 3222 88 15 research research NN 3222 88 16 / / SYM 3222 88 17 CTDL CTDL NNP 3222 88 18 / / SYM 3222 88 19 CTDL09.zip CTDL09.zip NNP 3222 88 20 . . . 3222 89 1 The the DT 3222 89 2 library library NN 3222 89 3 does do VBZ 3222 89 4 not not RB 3222 89 5 require require VB 3222 89 6 any any DT 3222 89 7 additional additional JJ 3222 89 8 nonstan- nonstan- JJ 3222 89 9 dard dard NN 3222 89 10 libraries library NNS 3222 89 11 . . . 
3222 90 1 It -PRON- PRP 3222 90 2 has have VBZ 3222 90 3 both both CC 3222 90 4 the the DT 3222 90 5 text text NN 3222 90 6 transform transform NN 3222 90 7 and and CC 3222 90 8 back back NN 3222 90 9 - - HYPH 3222 90 10 end end NN 3222 90 11 compressors compressor NNS 3222 90 12 embedded embed VBN 3222 90 13 . . . 3222 91 1 However however RB 3222 91 2 , , , 3222 91 3 compressing compress VBG 3222 91 4 PDF pdf NN 3222 91 5 documents document NNS 3222 91 6 requires require VBZ 3222 91 7 them -PRON- PRP 3222 91 8 to to TO 3222 91 9 be be VB 3222 91 10 decompressed decompress VBN 3222 91 11 first first RB 3222 91 12 with with IN 3222 91 13 the the DT 3222 91 14 free free JJ 3222 91 15 Precomp Precomp NNP 3222 91 16 tool tool NN 3222 91 17 . . . 3222 92 1 The the DT 3222 92 2 compression compression NN 3222 92 3 routines routine NNS 3222 92 4 are be VBP 3222 92 5 wrapped wrap VBN 3222 92 6 in in IN 3222 92 7 a a DT 3222 92 8 code code NN 3222 92 9 selecting select VBG 3222 92 10 the the DT 3222 92 11 best good JJS 3222 92 12 algorithm algorithm NNP 3222 92 13 depending depend VBG 3222 92 14 on on IN 3222 92 15 the the DT 3222 92 16 chosen choose VBN 3222 92 17 compression compression NN 3222 92 18 mode mode NN 3222 92 19 and and CC 3222 92 20 the the DT 3222 92 21 input input NN 3222 92 22 document document NN 3222 92 23 format format NN 3222 92 24 . . . 
3222 93 1 The the DT 3222 93 2 interface interface NN 3222 93 3 of of IN 3222 93 4 the the DT 3222 93 5 library library NN 3222 93 6 consists consist VBZ 3222 93 7 of of IN 3222 93 8 only only RB 3222 93 9 two two CD 3222 93 10 functions function NNS 3222 93 11 : : : 3222 93 12 CTDL_encode CTDL_encode NNP 3222 93 13 and and CC 3222 93 14 CTDL_decode CTDL_decode NNP 3222 93 15 , , , 3222 93 16 for for IN 3222 93 17 , , , 3222 93 18 respectively respectively RB 3222 93 19 , , , 3222 93 20 com- com- NN 3222 93 21 pressing pressing NN 3222 93 22 and and CC 3222 93 23 decompressing decompress VBG 3222 93 24 documents document NNS 3222 93 25 . . . 3222 94 1 CTDL_encode CTDL_encode NNP 3222 94 2 takes take VBZ 3222 94 3 the the DT 3222 94 4 following follow VBG 3222 94 5 parameters parameter NNS 3222 94 6 : : : 3222 94 7 n n NNP 3222 94 8 char char NNP 3222 94 9 * * NFP 3222 94 10 filename filename NN 3222 94 11 — — : 3222 94 12 name name NN 3222 94 13 of of IN 3222 94 14 the the DT 3222 94 15 input input NN 3222 94 16 ( ( -LRB- 3222 94 17 uncompressed uncompressed JJ 3222 94 18 ) ) -RRB- 3222 94 19 document document NN 3222 94 20 n n CC 3222 94 21 char char NNP 3222 94 22 * * NFP 3222 94 23 filename_out filename_out NN 3222 94 24 — — : 3222 94 25 name name NN 3222 94 26 of of IN 3222 94 27 the the DT 3222 94 28 output output NN 3222 94 29 ( ( -LRB- 3222 94 30 com- com- NN 3222 94 31 pressed press VBD 3222 94 32 ) ) -RRB- 3222 94 33 document document NN 3222 94 34 n n NNP 3222 94 35 EFileType EFileType NNP 3222 94 36 ftype ftype NN 3222 94 37 — — : 3222 94 38 format format NN 3222 94 39 of of IN 3222 94 40 the the DT 3222 94 41 input input NN 3222 94 42 document document NN 3222 94 43 , , , 3222 94 44 defined define VBN 3222 94 45 as as IN 3222 94 46 : : : 3222 94 47 enum enum NNP 3222 94 48 EFileType EFileType NFP 3222 94 49 { { -LRB- 3222 94 50 HTML html NN 3222 94 51 , , , 3222 94 52 PDF PDF NNP 3222 94 53 , , , 3222 94 54 PS PS NNP 3222 94 55 , , , 3222 94 56 RTF RTF 
NNP 3222 94 57 , , , 3222 94 58 TEX TEX NNP 3222 94 59 , , , 3222 94 60 TXT txt NN 3222 94 61 , , , 3222 94 62 XML xml NN 3222 94 63 } } -RRB- 3222 94 64 ; ; : 3222 94 65 n n NNP 3222 94 66 EDictionaryType EDictionaryType NNP 3222 94 67 dtype dtype NN 3222 94 68 — — : 3222 94 69 dictionary dictionary JJ 3222 94 70 type type NN 3222 94 71 , , , 3222 94 72 defined define VBN 3222 94 73 as as IN 3222 94 74 : : : 3222 94 75 enum enum NNP 3222 94 76 EDictionaryType EDictionaryType NNP 3222 94 77 { { -LRB- 3222 94 78 Static Static NNP 3222 94 79 , , , 3222 94 80 SemiDynamic SemiDynamic NNP 3222 94 81 } } -RRB- 3222 94 82 ; ; : 3222 94 83 CTDL_decode CTDL_decode NNP 3222 94 84 takes take VBZ 3222 94 85 the the DT 3222 94 86 following follow VBG 3222 94 87 parameters parameter NNS 3222 94 88 : : : 3222 94 89 n n NNP 3222 94 90 char char NNP 3222 94 91 * * NFP 3222 94 92 filename filename NN 3222 94 93 — — : 3222 94 94 name name NN 3222 94 95 of of IN 3222 94 96 the the DT 3222 94 97 input input NN 3222 94 98 ( ( -LRB- 3222 94 99 compressed compressed JJ 3222 94 100 ) ) -RRB- 3222 94 101 document document NN 3222 94 102 n n CC 3222 94 103 char char NNP 3222 94 104 * * NFP 3222 94 105 filename_out filename_out NN 3222 94 106 — — : 3222 94 107 name name NN 3222 94 108 of of IN 3222 94 109 the the DT 3222 94 110 output output NN 3222 94 111 ( ( -LRB- 3222 94 112 decom- decom- NNP 3222 94 113 pressed pressed JJ 3222 94 114 ) ) -RRB- 3222 94 115 document document NN 3222 94 116 Table table NN 3222 94 117 1 1 CD 3222 94 118 . . . 
3222 95 1 Universal universal JJ 3222 95 2 transform transform NN 3222 95 3 optimizations optimization NNS 3222 95 4 CTDL CTDL NNP 3222 95 5 Settings Settings NNP 3222 95 6 CTDL+ CTDL+ NNP 3222 95 7 Settings Settings NNP 3222 95 8 Format Format NNP 3222 95 9 MinFr MinFr NNP 3222 95 10 WdSpc WdSpc NNP 3222 95 11 SpRuns SpRuns NNP 3222 95 12 LetCnt LetCnt NNP 3222 95 13 WdSpc WdSpc NNP 3222 95 14 SpRuns SpRuns NNP 3222 95 15 LetCnt LetCnt NNP 3222 95 16 HTML HTML VBD 3222 95 17 3 3 CD 3222 95 18 + + SYM 3222 95 19 + + SYM 3222 95 20 + + SYM 3222 95 21 + + SYM 3222 95 22 + + SYM 3222 95 23 - - HYPH 3222 95 24 PDF PDF NNP 3222 95 25 3 3 CD 3222 95 26 - - HYPH 3222 95 27 - - HYPH 3222 95 28 - - HYPH 3222 95 29 - - HYPH 3222 95 30 - - HYPH 3222 95 31 - - HYPH 3222 95 32 PS PS NNP 3222 95 33 6 6 CD 3222 95 34 - - HYPH 3222 95 35 + + SYM 3222 95 36 - - HYPH 3222 95 37 - - HYPH 3222 95 38 + + SYM 3222 95 39 - - HYPH 3222 95 40 RTF RTF NNP 3222 95 41 3 3 CD 3222 95 42 + + SYM 3222 95 43 - - HYPH 3222 95 44 + + SYM 3222 95 45 + + SYM 3222 95 46 - - HYPH 3222 95 47 - - : 3222 95 48 TEX tex NN 3222 95 49 3 3 CD 3222 95 50 + + SYM 3222 95 51 + + SYM 3222 95 52 + + SYM 3222 95 53 + + SYM 3222 95 54 + + SYM 3222 95 55 + + SYM 3222 95 56 TXT txt NN 3222 95 57 6 6 CD 3222 95 58 + + SYM 3222 95 59 + + SYM 3222 95 60 + + SYM 3222 95 61 + + SYM 3222 95 62 + + SYM 3222 95 63 + + SYM 3222 95 64 XML xml NN 3222 95 65 3 3 CD 3222 95 66 + + SYM 3222 95 67 + + SYM 3222 95 68 + + SYM 3222 95 69 + + SYM 3222 95 70 + + SYM 3222 95 71 - - : 3222 95 86 The the DT 3222 95 87 library library NN 3222 95 88 was be VBD 3222 95
89 written write VBN 3222 95 90 in in IN 3222 95 91 the the DT 3222 95 92 C++ C++ NNP 3222 95 93 programming programming NN 3222 95 94 language language NN 3222 95 95 , , , 3222 95 96 but but CC 3222 95 97 a a DT 3222 95 98 compiled compiled JJ 3222 95 99 static static JJ 3222 95 100 library library NN 3222 95 101 is be VBZ 3222 95 102 also also RB 3222 95 103 distributed distribute VBN 3222 95 104 ; ; : 3222 95 105 thus thus RB 3222 95 106 it -PRON- PRP 3222 95 107 can can MD 3222 95 108 be be VB 3222 95 109 used use VBN 3222 95 110 in in IN 3222 95 111 any any DT 3222 95 112 language language NN 3222 95 113 that that WDT 3222 95 114 can can MD 3222 95 115 link link VB 3222 95 116 such such JJ 3222 95 117 libraries library NNS 3222 95 118 . . . 3222 96 1 Currently currently RB 3222 96 2 , , , 3222 96 3 the the DT 3222 96 4 library library NN 3222 96 5 is be VBZ 3222 96 6 compatible compatible JJ 3222 96 7 with with IN 3222 96 8 two two CD 3222 96 9 platforms platform NNS 3222 96 10 : : : 3222 96 11 Microsoft Microsoft NNP 3222 96 12 Windows Windows NNP 3222 96 13 and and CC 3222 96 14 Linux Linux NNP 3222 96 15 . . . 3222 97 1 To to TO 3222 97 2 use use VB 3222 97 3 static static JJ 3222 97 4 dictionaries dictionary NNS 3222 97 5 , , , 3222 97 6 the the DT 3222 97 7 respective respective JJ 3222 97 8 dictionary dictionary JJ 3222 97 9 file file NN 3222 97 10 must must MD 3222 97 11 be be VB 3222 97 12 available available JJ 3222 97 13 . . . 
3222 98 1 The the DT 3222 98 2 library library NN 3222 98 3 is be VBZ 3222 98 4 sup- sup- RB 3222 98 5 plied ply VBN 3222 98 6 with with IN 3222 98 7 an an DT 3222 98 8 English english JJ 3222 98 9 dictionary dictionary NN 3222 98 10 trained train VBN 3222 98 11 on on IN 3222 98 12 a a DT 3222 98 13 3 3 CD 3222 98 14 GB GB NNP 3222 98 15 text text NN 3222 98 16 corpus corpus NN 3222 98 17 from from IN 3222 98 18 Project Project NNP 3222 98 19 Gutenberg.23 Gutenberg.23 NNP 3222 98 20 Seven seven CD 3222 98 21 other other JJ 3222 98 22 dictionaries dictionary NNS 3222 98 23 — — : 3222 98 24 German german JJ 3222 98 25 , , , 3222 98 26 Spanish spanish JJ 3222 98 27 , , , 3222 98 28 Finnish finnish JJ 3222 98 29 , , , 3222 98 30 French french JJ 3222 98 31 , , , 3222 98 32 Italian italian JJ 3222 98 33 , , , 3222 98 34 Polish polish JJ 3222 98 35 , , , 3222 98 36 and and CC 3222 98 37 Russian russian JJ 3222 98 38 — — : 3222 98 39 can can MD 3222 98 40 be be VB 3222 98 41 freely freely RB 3222 98 42 downloaded download VBN 3222 98 43 from from IN 3222 98 44 www.ii.uni.wroc.pl/~inikep/ www.ii.uni.wroc.pl/~inikep/ NNP 3222 98 45 research research NN 3222 98 46 / / SYM 3222 98 47 dicts dict NNS 3222 98 48 . . . 3222 99 1 There there EX 3222 99 2 also also RB 3222 99 3 is be VBZ 3222 99 4 a a DT 3222 99 5 tool tool NN 3222 99 6 that that WDT 3222 99 7 helps help VBZ 3222 99 8 create create VB 3222 99 9 a a DT 3222 99 10 new new JJ 3222 99 11 dictionary dictionary NN 3222 99 12 from from IN 3222 99 13 any any DT 3222 99 14 given give VBN 3222 99 15 corpus corpus NN 3222 99 16 of of IN 3222 99 17 documents document NNS 3222 99 18 , , , 3222 99 19 available available JJ 3222 99 20 from from IN 3222 99 21 Skibiński Skibiński NNP 3222 99 22 upon upon IN 3222 99 23 request request NN 3222 99 24 via via IN 3222 99 25 e e NN 3222 99 26 - - NN 3222 99 27 mail mail NN 3222 99 28 ( ( -LRB- 3222 99 29 inikep@ii.uni inikep@ii.uni ADD 3222 99 30 .wroc.pl .wroc.pl . 
3222 99 31 ) ) -RRB- 3222 99 32 . . . 3222 100 1 The the DT 3222 100 2 library library NN 3222 100 3 can can MD 3222 100 4 be be VB 3222 100 5 used use VBN 3222 100 6 to to TO 3222 100 7 reduce reduce VB 3222 100 8 the the DT 3222 100 9 storage storage NN 3222 100 10 require- require- NN 3222 100 11 ments ment NNS 3222 100 12 or or CC 3222 100 13 also also RB 3222 100 14 to to TO 3222 100 15 reduce reduce VB 3222 100 16 the the DT 3222 100 17 time time NN 3222 100 18 of of IN 3222 100 19 delivering deliver VBG 3222 100 20 a a DT 3222 100 21 requested requested JJ 3222 100 22 document document NN 3222 100 23 to to IN 3222 100 24 the the DT 3222 100 25 library library NN 3222 100 26 user user NN 3222 100 27 . . . 3222 101 1 In in IN 3222 101 2 the the DT 3222 101 3 first first JJ 3222 101 4 case case NN 3222 101 5 , , , 3222 101 6 the the DT 3222 101 7 decom- decom- JJ 3222 101 8 pression pression NN 3222 101 9 must must MD 3222 101 10 be be VB 3222 101 11 done do VBN 3222 101 12 on on IN 3222 101 13 the the DT 3222 101 14 server server NN 3222 101 15 side side NN 3222 101 16 . . . 3222 102 1 In in IN 3222 102 2 the the DT 3222 102 3 second second JJ 3222 102 4 case case NN 3222 102 5 , , , 3222 102 6 it -PRON- PRP 3222 102 7 must must MD 3222 102 8 be be VB 3222 102 9 done do VBN 3222 102 10 on on IN 3222 102 11 the the DT 3222 102 12 client client NN 3222 102 13 side side NN 3222 102 14 , , , 3222 102 15 which which WDT 3222 102 16 is be VBZ 3222 102 17 pos- pos- VBN 3222 102 18 sible sible JJ 3222 102 19 because because IN 3222 102 20 stand stand VB 3222 102 21 - - HYPH 3222 102 22 alone alone RB 3222 102 23 decompressors decompressor NNS 3222 102 24 are be VBP 3222 102 25 available available JJ 3222 102 26 for for IN 3222 102 27 Microsoft Microsoft NNP 3222 102 28 Windows Windows NNP 3222 102 29 and and CC 3222 102 30 Linux Linux NNP 3222 102 31 . . . 
3222 103 1 Obviously obviously RB 3222 103 2 , , , 3222 103 3 a a DT 3222 103 4 library library NN 3222 103 5 can can MD 3222 103 6 support support VB 3222 103 7 both both DT 3222 103 8 options option NNS 3222 103 9 by by IN 3222 103 10 providing provide VBG 3222 103 11 the the DT 3222 103 12 user user NN 3222 103 13 with with IN 3222 103 14 a a DT 3222 103 15 choice choice NN 3222 103 16 whether whether IN 3222 103 17 a a DT 3222 103 18 document document NN 3222 103 19 should should MD 3222 103 20 be be VB 3222 103 21 delivered deliver VBN 3222 103 22 compressed compressed JJ 3222 103 23 or or CC 3222 103 24 not not RB 3222 103 25 . . . 3222 104 1 If if IN 3222 104 2 documents document NNS 3222 104 3 are be VBP 3222 104 4 to to TO 3222 104 5 be be VB 3222 104 6 decompressed decompress VBN 3222 104 7 client client NN 3222 104 8 - - HYPH 3222 104 9 side side NN 3222 104 10 , , , 3222 104 11 the the DT 3222 104 12 basic basic JJ 3222 104 13 CTDL CTDL NNP 3222 104 14 , , , 3222 104 15 using use VBG 3222 104 16 a a DT 3222 104 17 semidynamic semidynamic JJ 3222 104 18 dictionary dictionary NN 3222 104 19 , , , 3222 104 20 seems seem VBZ 3222 104 21 handier handy JJR 3222 104 22 , , , 3222 104 23 since since IN 3222 104 24 it -PRON- PRP 3222 104 25 does do VBZ 3222 104 26 not not RB 3222 104 27 require require VB 3222 104 28 the the DT 3222 104 29 user user NN 3222 104 30 to to TO 3222 104 31 obtain obtain VB 3222 104 32 the the DT 3222 104 33 static static JJ 3222 104 34 dictionary dictionary NN 3222 104 35 that that WDT 3222 104 36 was be VBD 3222 104 37 used use VBN 3222 104 38 to to TO 3222 104 39 compress compress VB 3222 104 40 the the DT 3222 104 41 downloaded downloaded JJ 3222 104 42 document document NN 3222 104 43 . . . 
3222 105 1 Still still RB 3222 105 2 , , , 3222 105 3 the the DT 3222 105 4 size size NN 3222 105 5 of of IN 3222 105 6 such such PDT 3222 105 7 a a DT 3222 105 8 dictionary dictionary NN 3222 105 9 is be VBZ 3222 105 10 usually usually RB 3222 105 11 small small JJ 3222 105 12 , , , 3222 105 13 so so IN 3222 105 14 it -PRON- PRP 3222 105 15 does do VBZ 3222 105 16 not not RB 3222 105 17 disqualify disqualify VB 3222 105 18 CTDL+ CTDL+ NNP 3222 105 19 from from IN 3222 105 20 this this DT 3222 105 21 kind kind NN 3222 105 22 of of IN 3222 105 23 use use NN 3222 105 24 . . . 3222 106 1 n n DT 3222 106 2 Experimental Experimental NNP 3222 106 3 results result NNS 3222 106 4 We -PRON- PRP 3222 106 5 tested test VBD 3222 106 6 CTDL CTDL NNP 3222 106 7 experimentally experimentally RB 3222 106 8 on on IN 3222 106 9 a a DT 3222 106 10 benchmark benchmark JJ 3222 106 11 set set NN 3222 106 12 of of IN 3222 106 13 text text NN 3222 106 14 documents document NNS 3222 106 15 . . . 3222 107 1 The the DT 3222 107 2 purpose purpose NN 3222 107 3 of of IN 3222 107 4 the the DT 3222 107 5 tests test NNS 3222 107 6 was be VBD 3222 107 7 to to TO 3222 107 8 compare compare VB 3222 107 9 the the DT 3222 107 10 storage storage NN 3222 107 11 requirements requirement NNS 3222 107 12 of of IN 3222 107 13 different different JJ 3222 107 14 document document NN 3222 107 15 formats format NNS 3222 107 16 in in IN 3222 107 17 compressed compressed JJ 3222 107 18 and and CC 3222 107 19 uncompressed uncompressed JJ 3222 107 20 form form NN 3222 107 21 . . . 
3222 108 1 In in IN 3222 108 2 selecting select VBG 3222 108 3 the the DT 3222 108 4 test test NN 3222 108 5 files file NNS 3222 108 6 we -PRON- PRP 3222 108 7 wanted want VBD 3222 108 8 to to TO 3222 108 9 achieve achieve VB 3222 108 10 the the DT 3222 108 11 following follow VBG 3222 108 12 goals goal NNS 3222 108 13 : : : 3222 108 14 n n RB 3222 108 15 test test VBP 3222 108 16 all all PDT 3222 108 17 the the DT 3222 108 18 formats format NNS 3222 108 19 listed list VBN 3222 108 20 in in IN 3222 108 21 table table NN 3222 108 22 1 1 CD 3222 108 23 ( ( -LRB- 3222 108 24 therefore therefore RB 3222 108 25 we -PRON- PRP 3222 108 26 decided decide VBD 3222 108 27 to to TO 3222 108 28 choose choose VB 3222 108 29 documents document NNS 3222 108 30 that that WDT 3222 108 31 produced produce VBD 3222 108 32 no no DT 3222 108 33 errors error NNS 3222 108 34 during during IN 3222 108 35 document document JJ 3222 108 36 format format NN 3222 108 37 conversion conversion NN 3222 108 38 ) ) -RRB- 3222 108 39 n n LS 3222 108 40 obtain obtain VBP 3222 108 41 verifiable verifiable JJ 3222 108 42 results result NNS 3222 108 43 ( ( -LRB- 3222 108 44 therefore therefore RB 3222 108 45 we -PRON- PRP 3222 108 46 decided decide VBD 3222 108 47 to to TO 3222 108 48 use use VB 3222 108 49 documents document NNS 3222 108 50 that that WDT 3222 108 51 can can MD 3222 108 52 be be VB 3222 108 53 easily easily RB 3222 108 54 obtained obtain VBN 3222 108 55 from from IN 3222 108 56 the the DT 3222 108 57 Internet internet NN 3222 108 58 ) ) -RRB- 3222 108 59 n n CC 3222 108 60 measure measure VBP 3222 108 61 the the DT 3222 108 62 actual actual JJ 3222 108 63 compression compression NN 3222 108 64 improvement improvement NN 3222 108 65 from from IN 3222 108 66 applying apply VBG 3222 108 67 the the DT 3222 108 68 proposed propose VBN 3222 108 69 scheme scheme NN 3222 108 70 ( ( -LRB- 3222 108 71 apart apart RB 3222 108 72 from from IN 3222 108 73 the the DT 3222 108 74 RTF RTF NNP 3222 108 
75 format format NN 3222 108 76 , , , 3222 108 77 the the DT 3222 108 78 scheme scheme NN 3222 108 79 is be VBZ 3222 108 80 neutral neutral JJ 3222 108 81 to to IN 3222 108 82 the the DT 3222 108 83 images image NNS 3222 108 84 embedded embed VBN 3222 108 85 in in IN 3222 108 86 documents document NNS 3222 108 87 ; ; : 3222 108 88 therefore therefore RB 3222 108 89 we -PRON- PRP 3222 108 90 decided decide VBD 3222 108 91 to to TO 3222 108 92 use use VB 3222 108 93 documents document NNS 3222 108 94 that that WDT 3222 108 95 have have VBP 3222 108 96 no no DT 3222 108 97 embedded embed VBN 3222 108 98 images image NNS 3222 108 99 ) ) -RRB- 3222 108 100 For for IN 3222 108 101 these these DT 3222 108 102 reasons reason NNS 3222 108 103 , , , 3222 108 104 we -PRON- PRP 3222 108 105 used use VBD 3222 108 106 the the DT 3222 108 107 following follow VBG 3222 108 108 procedure procedure NN 3222 108 109 for for IN 3222 108 110 selecting select VBG 3222 108 111 documents document NNS 3222 108 112 to to IN 3222 108 113 the the DT 3222 108 114 test test NN 3222 108 115 set set NN 3222 108 116 . . . 3222 109 1 First first RB 3222 109 2 , , , 3222 109 3 we -PRON- PRP 3222 109 4 searched search VBD 3222 109 5 the the DT 3222 109 6 Project Project NNP 3222 109 7 Gutenberg Gutenberg NNP 3222 109 8 library library NN 3222 109 9 for for IN 3222 109 10 TEX TEX NNP 3222 109 11 documents document NNS 3222 109 12 , , , 3222 109 13 as as IN 3222 109 14 this this DT 3222 109 15 format format NN 3222 109 16 can can MD 3222 109 17 most most RBS 3222 109 18 reliably reliably RB 3222 109 19 be be VB 3222 109 20 transformed transform VBN 3222 109 21 into into IN 3222 109 22 the the DT 3222 109 23 other other JJ 3222 109 24 formats format NNS 3222 109 25 . . . 
3222 110 1 From from IN 3222 110 2 the the DT 3222 110 3 fifty fifty CD 3222 110 4 - - HYPH 3222 110 5 one one CD 3222 110 6 retrieved retrieve VBN 3222 110 7 documents document NNS 3222 110 8 , , , 3222 110 9 we -PRON- PRP 3222 110 10 removed remove VBD 3222 110 11 all all PDT 3222 110 12 those those DT 3222 110 13 containing contain VBG 3222 110 14 images image NNS 3222 110 15 as as RB 3222 110 16 well well RB 3222 110 17 as as IN 3222 110 18 those those DT 3222 110 19 that that WDT 3222 110 20 the the DT 3222 110 21 htlatex htlatex NN 3222 110 22 tool tool NN 3222 110 23 failed fail VBD 3222 110 24 to to TO 3222 110 25 convert convert VB 3222 110 26 to to IN 3222 110 27 HTML html NN 3222 110 28 . . . 3222 111 1 In in IN 3222 111 2 the the DT 3222 111 3 eleven eleven CD 3222 111 4 remaining remain VBG 3222 111 5 documents document NNS 3222 111 6 , , , 3222 111 7 there there EX 3222 111 8 were be VBD 3222 111 9 four four CD 3222 111 10 Jane Jane NNP 3222 111 11 Austen Austen NNP 3222 111 12 books book NNS 3222 111 13 ; ; : 3222 111 14 this this DT 3222 111 15 overrepresentation overrepresentation NN 3222 111 16 was be VBD 3222 111 17 handled handle VBN 3222 111 18 by by IN 3222 111 19 removing remove VBG 3222 111 20 three three CD 3222 111 21 of of IN 3222 111 22 them -PRON- PRP 3222 111 23 . . . 3222 112 1 The the DT 3222 112 2 resulting result VBG 3222 112 3 eight eight CD 3222 112 4 documents document NNS 3222 112 5 are be VBP 3222 112 6 given give VBN 3222 112 7 in in IN 3222 112 8 table table NN 3222 112 9 2 2 CD 3222 112 10 . . . 3222 113 1 From from IN 3222 113 2 the the DT 3222 113 3 TEX TEX NNP 3222 113 4 files file NNS 3222 113 5 we -PRON- PRP 3222 113 6 generated generate VBD 3222 113 7 HTML HTML NNP 3222 113 8 , , , 3222 113 9 PDF PDF NNP 3222 113 10 , , , 3222 113 11 and and CC 3222 113 12 PS PS NNP 3222 113 13 documents document NNS 3222 113 14 . . . 
3222 114 1 Then then RB 3222 114 2 we -PRON- PRP 3222 114 3 used use VBD 3222 114 4 Word Word NNP 3222 114 5 2007 2007 CD 3222 114 6 to to TO 3222 114 7 transform transform VB 3222 114 8 HTML html NN 3222 114 9 documents document NNS 3222 114 10 into into IN 3222 114 11 RTF RTF NNP 3222 114 12 , , , 3222 114 13 DOC DOC NNP 3222 114 14 , , , 3222 114 15 and and CC 3222 114 16 XML xml NN 3222 114 17 ( ( -LRB- 3222 114 18 thus thus RB 3222 114 19 this this DT 3222 114 20 is be VBZ 3222 114 21 the the DT 3222 114 22 Microsoft Microsoft NNP 3222 114 23 Word Word NNP 3222 114 24 XML xml NN 3222 114 25 format format NN 3222 114 26 , , , 3222 114 27 not not RB 3222 114 28 the the DT 3222 114 29 Project Project NNP 3222 114 30 Gutenberg Gutenberg NNP 3222 114 31 XML xml NN 3222 114 32 format format NN 3222 114 33 ) ) -RRB- 3222 114 34 . . . 3222 115 1 The the DT 3222 115 2 TXT txt NN 3222 115 3 files file NNS 3222 115 4 were be VBD 3222 115 5 downloaded download VBN 3222 115 6 from from IN 3222 115 7 Project Project NNP 3222 115 8 Gutenberg Gutenberg NNP 3222 115 9 . . . 3222 116 1 The the DT 3222 116 2 tests test NNS 3222 116 3 were be VBD 3222 116 4 conducted conduct VBN 3222 116 5 on on IN 3222 116 6 a a DT 3222 116 7 low low JJ 3222 116 8 - - HYPH 3222 116 9 end end NN 3222 116 10 AMD AMD NNP 3222 116 11 Sempron Sempron NNP 3222 116 12 3000 3000 CD 3222 116 13 + + SYM 3222 116 14 1.80 1.80 CD 3222 116 15 GHz GHz NNS 3222 116 16 system system NN 3222 116 17 with with IN 3222 116 18 512 512 CD 3222 116 19 MB mb NN 3222 116 20 RAM ram NN 3222 116 21 and and CC 3222 116 22 a a DT 3222 116 23 Seagate Seagate NNP 3222 116 24 80 80 CD 3222 116 25 GB GB NNP 3222 116 26 ATA ATA NNP 3222 116 27 drive drive NN 3222 116 28 , , , 3222 116 29 running run VBG 3222 116 30 Windows Windows NNP 3222 116 31 XP XP NNP 3222 116 32 SP2 SP2 NNP 3222 116 33 . . . 
3222 117 1 For for IN 3222 117 2 comparison comparison NN 3222 117 3 purposes purpose NNS 3222 117 4 , , , 3222 117 5 we -PRON- PRP 3222 117 6 used use VBD 3222 117 7 three three CD 3222 117 8 general general JJ 3222 117 9 - - HYPH 3222 117 10 purpose purpose NN 3222 117 11 compression compression NN 3222 117 12 programs program NNS 3222 117 13 : : : 3222 117 14 n n NNP 3222 117 15 gzip gzip NN 3222 117 16 implementing implement VBG 3222 117 17 Deflate deflate NN 3222 117 18 n n IN 3222 117 19 bzip2 bzip2 IN 3222 117 20 implementing implement VBG 3222 117 21 a a DT 3222 117 22 BWT BWT NNP 3222 117 23 - - HYPH 3222 117 24 based base VBN 3222 117 25 compression compression NN 3222 117 26 algorithm algorithm NN 3222 117 27 Table Table NNP 3222 117 28 2 2 CD 3222 117 29 . . . 3222 118 1 Test test NN 3222 118 2 set set VBN 3222 118 3 documents document NNS 3222 118 4 specification specification NN 3222 118 5 File File NNP 3222 118 6 Name Name NNP 3222 118 7 Title Title NNP 3222 118 8 Author Author NNP 3222 118 9 TEx TEx NNS 3222 118 10 Size Size NNP 3222 118 11 ( ( -LRB- 3222 118 12 bytes bytes NN 3222 118 13 ) ) -RRB- 3222 118 14 13601-t 13601-t VBZ 3222 118 15 Expositions exposition NNS 3222 118 16 of of IN 3222 118 17 Holy Holy NNP 3222 118 18 Scripture scripture NN 3222 118 19 : : : 3222 118 20 Romans Romans NNPS 3222 118 21 Corinthians Corinthians NNPS 3222 118 22 Maclaren Maclaren NNP 3222 118 23 1,443,056 1,443,056 CD 3222 118 24 16514-t 16514-t CD 3222 118 25 A a DT 3222 118 26 Little little JJ 3222 118 27 Cook Cook NNP 3222 118 28 Book Book NNP 3222 118 29 for for IN 3222 118 30 a a DT 3222 118 31 Little little JJ 3222 118 32 Girl Girl NNP 3222 118 33 Benton Benton NNP 3222 118 34 220,480 220,480 CD 3222 118 35 1noam10 1noam10 NN 3222 118 36 t t NN 3222 118 37 North North NNP 3222 118 38 America America NNP 3222 118 39 , , , 3222 118 40 V. V. 
NNP 3222 118 41 1 1 CD 3222 118 42 Trollope Trollope NNP 3222 118 43 804,813 804,813 CD 3222 118 44 2ws2610 2ws2610 CD 3222 118 45 Hamlet Hamlet NNP 3222 118 46 Shakespeare Shakespeare NNP 3222 118 47 194,527 194,527 CD 3222 118 48 alice30 alice30 CD 3222 118 49 Alice Alice NNP 3222 118 50 in in IN 3222 118 51 Wonderland Wonderland NNP 3222 118 52 Carroll Carroll NNP 3222 118 53 165,844 165,844 CD 3222 118 54 cdscs10 cdscs10 NN 3222 118 55 t t NN 3222 118 56 Some some DT 3222 118 57 Christmas Christmas NNP 3222 118 58 Stories Stories NNPS 3222 118 59 Dickens Dickens NNP 3222 118 60 127,684 127,684 CD 3222 118 61 grimm10 grimm10 NN 3222 118 62 t t NN 3222 118 63 Fairy Fairy NNP 3222 118 64 Tales Tales NNP 3222 118 65 Grimm Grimm NNP 3222 118 66 535,842 535,842 CD 3222 118 67 pandp12 pandp12 NNP 3222 118 68 t t NN 3222 118 69 Pride Pride NNP 3222 118 70 and and CC 3222 118 71 Prejudice Prejudice NNP 3222 118 72 Austen Austen NNP 3222 118 73 727,415 727,415 CD 3222 118 74 148 148 CD 3222 118 75 iNForMaTioN information NN 3222 118 76 TECHNoloGY technology NN 3222 118 77 aND and CC 3222 118 78 liBrariES liBrariES NNP 3222 118 79 | | NNP 3222 118 80 SEpTEMBEr september CD 3222 118 81 2009 2009 CD 3222 118 82 n n CD 3222 118 83 PPMVC PPMVC NNP 3222 118 84 implementing implement VBG 3222 118 85 a a DT 3222 118 86 PPM ppm NN 3222 118 87 - - HYPH 3222 118 88 derived derive VBN 3222 118 89 compression compression NN 3222 118 90 algorithm24 algorithm24 NNP 3222 118 91 Tables Tables NNP 3222 118 92 3–10 3–10 CD 3222 118 93 show show VBP 3222 118 94 n n CC 3222 118 95 the the DT 3222 118 96 bitrate bitrate NN 3222 118 97 attained attain VBN 3222 118 98 on on IN 3222 118 99 each each DT 3222 118 100 test test NN 3222 118 101 file file NN 3222 118 102 by by IN 3222 118 103 the the DT 3222 118 104 Deflate Deflate NNP 3222 118 105 - - HYPH 3222 118 106 based base VBN 3222 118 107 gzip gzip NN 3222 118 108 in in IN 3222 118 109 default default NN 3222 118 110 mode mode NN 3222 118 
111 , , , 3222 118 112 the the DT 3222 118 113 proposed propose VBN 3222 118 114 compression compression NN 3222 118 115 scheme scheme NN 3222 118 116 in in IN 3222 118 117 the the DT 3222 118 118 semidynamic semidynamic JJ 3222 118 119 and and CC 3222 118 120 static static JJ 3222 118 121 variants variant NNS 3222 118 122 with with IN 3222 118 123 Deflate Deflate NNP 3222 118 124 as as IN 3222 118 125 the the DT 3222 118 126 back back JJ 3222 118 127 - - HYPH 3222 118 128 end end NN 3222 118 129 compression compression NN 3222 118 130 algorithm algorithm NN 3222 118 131 , , , 3222 118 132 7-zip 7-zip NNP 3222 118 133 in in IN 3222 118 134 LZMA lzma NN 3222 118 135 mode mode NN 3222 118 136 , , , 3222 118 137 the the DT 3222 118 138 proposed propose VBN 3222 118 139 compression compression NN 3222 118 140 scheme scheme NN 3222 118 141 in in IN 3222 118 142 the the DT 3222 118 143 semidynamic semidynamic JJ 3222 118 144 and and CC 3222 118 145 static static JJ 3222 118 146 variants variant NNS 3222 118 147 with with IN 3222 118 148 LZMA lzma NN 3222 118 149 as as IN 3222 118 150 the the DT 3222 118 151 back back JJ 3222 118 152 - - HYPH 3222 118 153 end end NN 3222 118 154 compression compression NN 3222 118 155 algorithm algorithm NN 3222 118 156 , , , 3222 118 157 bzip2 bzip2 NN 3222 118 158 and and CC 3222 118 159 PPMVC PPMVC NNS 3222 118 160 ; ; : 3222 118 161 n n CC 3222 118 162 the the DT 3222 118 163 average average JJ 3222 118 164 bitrate bitrate NN 3222 118 165 attained attain VBD 3222 118 166 on on IN 3222 118 167 the the DT 3222 118 168 whole whole JJ 3222 118 169 test test NN 3222 118 170 corpus corpus NN 3222 118 171 ; ; : 3222 118 172 and and CC 3222 118 173 n n CC 3222 118 174 the the DT 3222 118 175 total total JJ 3222 118 176 compression compression NN 3222 118 177 and and CC 3222 118 178 decompression decompression NN 3222 118 179 times time NNS 3222 118 180 ( ( -LRB- 3222 118 181 in in IN 3222 118 182 
seconds second NNS 3222 118 183 ) ) -RRB- 3222 118 184 for for IN 3222 118 185 the the DT 3222 118 186 whole whole JJ 3222 118 187 test test NN 3222 118 188 corpus corpus NNP 3222 118 189 , , , 3222 118 190 measured measure VBN 3222 118 191 on on IN 3222 118 192 the the DT 3222 118 193 test test NN 3222 118 194 platform platform NN 3222 118 195 ( ( -LRB- 3222 118 196 they -PRON- PRP 3222 118 197 are be VBP 3222 118 198 total total JJ 3222 118 199 elapsed elapse VBN 3222 118 200 times time NNS 3222 118 201 including include VBG 3222 118 202 program program NN 3222 118 203 initialization initialization NN 3222 118 204 and and CC 3222 118 205 disk disk NN 3222 118 206 operations operation NNS 3222 118 207 ) ) -RRB- 3222 118 208 . . . 3222 119 1 Bitrates bitrate NNS 3222 119 2 are be VBP 3222 119 3 given give VBN 3222 119 4 in in IN 3222 119 5 output output NN 3222 119 6 bits bit NNS 3222 119 7 per per IN 3222 119 8 character character NN 3222 119 9 of of IN 3222 119 10 an an DT 3222 119 11 uncompressed uncompressed JJ 3222 119 12 document document NN 3222 119 13 in in IN 3222 119 14 a a DT 3222 119 15 given give VBN 3222 119 16 format format NN 3222 119 17 , , , 3222 119 18 so so IN 3222 119 19 a a DT 3222 119 20 smaller small JJR 3222 119 21 Table table NN 3222 119 22 3 3 CD 3222 119 23 . . . 
3222 120 1 Compression compression NN 3222 120 2 efficiency efficiency NN 3222 120 3 and and CC 3222 120 4 times time NNS 3222 120 5 for for IN 3222 120 6 the the DT 3222 120 7 TXT txt NN 3222 120 8 documents document NNS 3222 120 9 Deflate deflate JJ 3222 120 10 LZMA LZMA NNS 3222 120 11 bzip2 bzip2 VBP 3222 120 12 PPMVC PPMVC NNP 3222 120 13 File File NNP 3222 120 14 Name Name NNP 3222 120 15 gzip gzip NN 3222 120 16 CTDL CTDL NNP 3222 120 17 CTDL+ CTDL+ NNP 3222 120 18 7-zip 7-zip NNP 3222 120 19 CTDL CTDL NNP 3222 120 20 CTDL+ CTDL+ NNP 3222 120 21 13601-t 13601-t VBD 3222 120 22 2.944 2.944 CD 3222 120 23 2.244 2.244 CD 3222 120 24 2.101 2.101 CD 3222 120 25 2.337 2.337 CD 3222 120 26 2.057 2.057 CD 3222 120 27 1.919 1.919 CD 3222 120 28 2.158 2.158 CD 3222 120 29 1.863 1.863 CD 3222 120 30 16514-t 16514-t CD 3222 120 31 2.566 2.566 CD 3222 120 32 2.150 2.150 CD 3222 120 33 1.969 1.969 CD 3222 120 34 2.228 2.228 CD 3222 120 35 1.993 1.993 CD 3222 120 36 1.838 1.838 CD 3222 120 37 2.010 2.010 CD 3222 120 38 1.780 1.780 CD 3222 120 39 1noam10 1noam10 NN 3222 120 40 t t NNP 3222 120 41 2.967 2.967 CD 3222 120 42 2.337 2.337 CD 3222 120 43 2.109 2.109 CD 3222 120 44 2.432 2.432 CD 3222 120 45 2.151 2.151 CD 3222 120 46 1.958 1.958 CD 3222 120 47 2.160 2.160 CD 3222 120 48 1.946 1.946 CD 3222 120 49 2ws2610 2ws2610 CD 3222 120 50 3.217 3.217 CD 3222 120 51 2.874 2.874 CD 3222 120 52 2.459 2.459 CD 3222 120 53 2.871 2.871 CD 3222 120 54 2.659 2.659 CD 3222 120 55 2.312 2.312 CD 3222 120 56 2.565 2.565 CD 3222 120 57 2.343 2.343 CD 3222 120 58 alice30 alice30 CD 3222 120 59 2.906 2.906 CD 3222 120 60 2.533 2.533 CD 3222 120 61 2.184 2.184 CD 3222 120 62 2.585 2.585 CD 3222 120 63 2.360 2.360 CD 3222 120 64 2.056 2.056 CD 3222 120 65 2.341 2.341 CD 3222 120 66 2.090 2.090 CD 3222 120 67 cdscs10 cdscs10 NN 3222 120 68 t t NN 3222 120 69 3.222 3.222 CD 3222 120 70 2.898 2.898 CD 3222 120 71 2.298 2.298 CD 3222 120 72 2.928 2.928 CD 3222 120 73 2.721 2.721 CD 3222 120 74 
2.192 2.192 CD 3222 120 75 2.694 2.694 CD 3222 120 76 2.436 2.436 CD 3222 120 77 grimm10 grimm10 NNS 3222 120 78 t t NN 3222 120 79 2.832 2.832 CD 3222 120 80 2.275 2.275 CD 3222 120 81 2.090 2.090 CD 3222 120 82 2.357 2.357 CD 3222 120 83 2.079 2.079 CD 3222 120 84 1.931 1.931 CD 3222 120 85 2.112 2.112 CD 3222 120 86 1.886 1.886 CD 3222 120 87 pandp12 pandp12 NNP 3222 120 88 t t NN 3222 120 89 2.901 2.901 CD 3222 120 90 2.251 2.251 CD 3222 120 91 2.097 2.097 CD 3222 120 92 2.366 2.366 CD 3222 120 93 2.061 2.061 CD 3222 120 94 1.930 1.930 CD 3222 120 95 2.032 2.032 CD 3222 120 96 1.835 1.835 CD 3222 120 97 Average average JJ 3222 120 98 2.944 2.944 CD 3222 120 99 2.445 2.445 CD 3222 120 100 2.163 2.163 CD 3222 120 101 2.513 2.513 CD 3222 120 102 2.260 2.260 CD 3222 120 103 2.017 2.017 CD 3222 120 104 2.259 2.259 CD 3222 120 105 2.022 2.022 CD 3222 120 106 Comp Comp NNP 3222 120 107 . . . 3222 121 1 Time Time NNP 3222 121 2 0.688 0.688 CD 3222 121 3 1.234 1.234 CD 3222 121 4 0.954 0.954 CD 3222 121 5 6.688 6.688 CD 3222 121 6 2.640 2.640 CD 3222 121 7 2.281 2.281 CD 3222 121 8 2.110 2.110 CD 3222 121 9 3.281 3.281 CD 3222 121 10 Dec. December NNP 3222 121 11 Time Time NNP 3222 121 12 0.125 0.125 CD 3222 121 13 0.454 0.454 CD 3222 121 14 0.546 0.546 CD 3222 121 15 0.343 0.343 CD 3222 121 16 0.610 0.610 CD 3222 121 17 0.656 0.656 CD 3222 121 18 0.703 0.703 CD 3222 121 19 3.453 3.453 CD 3222 121 20 Table table NN 3222 121 21 4 4 CD 3222 121 22 . . . 
3222 122 1 Compression compression NN 3222 122 2 efficiency efficiency NN 3222 122 3 and and CC 3222 122 4 times time NNS 3222 122 5 for for IN 3222 122 6 the the DT 3222 122 7 TEX TEX NNP 3222 122 8 documents document NNS 3222 122 9 Deflate deflate JJ 3222 122 10 LZMA LZMA NNS 3222 122 11 bzip2 bzip2 VBP 3222 122 12 PPMVC PPMVC NNP 3222 122 13 File File NNP 3222 122 14 Name Name NNP 3222 122 15 gzip gzip NN 3222 122 16 CTDL CTDL NNP 3222 122 17 CTDL+ CTDL+ NNP 3222 122 18 7-zip 7-zip NNP 3222 122 19 CTDL CTDL NNP 3222 122 20 CTDL+ CTDL+ NNP 3222 122 21 13601-t 13601-t VBD 3222 122 22 2.927 2.927 CD 3222 122 23 2.233 2.233 CD 3222 122 24 2.092 2.092 CD 3222 122 25 2.328 2.328 CD 3222 122 26 2.049 2.049 CD 3222 122 27 1.913 1.913 CD 3222 122 28 2.146 2.146 CD 3222 122 29 1.852 1.852 CD 3222 122 30 16514-t 16514-t CD 3222 122 31 2.277 2.277 CD 3222 122 32 1.904 1.904 CD 3222 122 33 1.794 1.794 CD 3222 122 34 1.957 1.957 CD 3222 122 35 1.744 1.744 CD 3222 122 36 1.645 1.645 CD 3222 122 37 1.746 1.746 CD 3222 122 38 1.534 1.534 CD 3222 122 39 1noam10 1noam10 NN 3222 122 40 t t NNP 3222 122 41 2.976 2.976 CD 3222 122 42 2.370 2.370 CD 3222 122 43 2.142 2.142 CD 3222 122 44 2.445 2.445 CD 3222 122 45 2.186 2.186 CD 3222 122 46 1.986 1.986 CD 3222 122 47 2.195 2.195 CD 3222 122 48 1.976 1.976 CD 3222 122 49 2ws2610 2ws2610 CD 3222 122 50 3.206 3.206 CD 3222 122 51 2.906 2.906 CD 3222 122 52 2.482 2.482 CD 3222 122 53 2.864 2.864 CD 3222 122 54 2.674 2.674 CD 3222 122 55 2.323 2.323 CD 3222 122 56 2.562 2.562 CD 3222 122 57 2.340 2.340 CD 3222 122 58 alice30 alice30 CD 3222 122 59 2.897 2.897 CD 3222 122 60 2.526 2.526 CD 3222 122 61 2.183 2.183 CD 3222 122 62 2.573 2.573 CD 3222 122 63 2.350 2.350 CD 3222 122 64 2.048 2.048 CD 3222 122 65 2.332 2.332 CD 3222 122 66 2.085 2.085 CD 3222 122 67 cdscs10 cdscs10 NN 3222 122 68 t t NN 3222 122 69 3.224 3.224 CD 3222 122 70 2.931 2.931 CD 3222 122 71 2.328 2.328 CD 3222 122 72 2.941 2.941 CD 3222 122 73 2.759 2.759 CD 3222 122 
74 2.222 2.222 CD 3222 122 75 2.723 2.723 CD 3222 122 76 2.466 2.466 CD 3222 122 77 grimm10 grimm10 CD 3222 122 78 t t NN 3222 122 79 2.831 2.831 CD 3222 122 80 2.304 2.304 CD 3222 122 81 2.120 2.120 CD 3222 122 82 2.364 2.364 CD 3222 122 83 2.113 2.113 CD 3222 122 84 1.960 1.960 CD 3222 122 85 2.143 2.143 CD 3222 122 86 1.910 1.910 CD 3222 122 87 pandp12 pandp12 NNP 3222 122 88 t t NN 3222 122 89 2.881 2.881 CD 3222 122 90 2.239 2.239 CD 3222 122 91 2.090 2.090 CD 3222 122 92 2.346 2.346 CD 3222 122 93 2.049 2.049 CD 3222 122 94 1.916 1.916 CD 3222 122 95 2.013 2.013 CD 3222 122 96 1.817 1.817 CD 3222 122 97 Average average NN 3222 122 98 2.902 2.902 CD 3222 122 99 2.427 2.427 CD 3222 122 100 2.154 2.154 CD 3222 122 101 2.477 2.477 CD 3222 122 102 2.241 2.241 CD 3222 122 103 2.002 2.002 CD 3222 122 104 2.233 2.233 CD 3222 122 105 1.998 1.998 CD 3222 122 106 Comp Comp NNP 3222 122 107 . . . 3222 123 1 Time Time NNP 3222 123 2 0.688 0.688 CD 3222 123 3 1.250 1.250 CD 3222 123 4 0.969 0.969 CD 3222 123 5 6.718 6.718 CD 3222 123 6 2.703 2.703 CD 3222 123 7 2.406 2.406 CD 3222 123 8 2.140 2.140 CD 3222 123 9 3.329 3.329 CD 3222 123 10 Dec. December NNP 3222 123 11 Time Time NNP 3222 123 12 0.109 0.109 CD 3222 123 13 0.453 0.453 CD 3222 123 14 0.547 0.547 CD 3222 123 15 0.360 0.360 CD 3222 123 16 0.609 0.609 CD 3222 123 17 0.672 0.672 CD 3222 123 18 0.703 0.703 CD 3222 123 19 3.485 3.485 CD 3222 123 20 bitrate bitrate NN 3222 123 21 ( ( -LRB- 3222 123 22 of of IN 3222 123 23 , , , 3222 123 24 e.g. e.g. 
RB 3222 123 25 , , , 3222 123 26 RTF RTF NNP 3222 123 27 documents document NNS 3222 123 28 compared compare VBN 3222 123 29 to to IN 3222 123 30 the the DT 3222 123 31 plain plain JJ 3222 123 32 text text NN 3222 123 33 ) ) -RRB- 3222 123 34 does do VBZ 3222 123 35 not not RB 3222 123 36 mean mean VB 3222 123 37 the the DT 3222 123 38 file file NN 3222 123 39 is be VBZ 3222 123 40 smaller small JJR 3222 123 41 , , , 3222 123 42 only only RB 3222 123 43 that that IN 3222 123 44 the the DT 3222 123 45 compression compression NN 3222 123 46 was be VBD 3222 123 47 better well JJR 3222 123 48 . . . 3222 124 1 Uncompressed uncompressed JJ 3222 124 2 files file NNS 3222 124 3 have have VBP 3222 124 4 a a DT 3222 124 5 bitrate bitrate NN 3222 124 6 of of IN 3222 124 7 8 8 CD 3222 124 8 bits bit NNS 3222 124 9 per per IN 3222 124 10 character character NN 3222 124 11 . . . 3222 125 1 Looking look VBG 3222 125 2 at at IN 3222 125 3 the the DT 3222 125 4 results result NNS 3222 125 5 obtained obtain VBN 3222 125 6 for for IN 3222 125 7 TXT txt NN 3222 125 8 documents document NNS 3222 125 9 ( ( -LRB- 3222 125 10 table table NN 3222 125 11 3 3 CD 3222 125 12 ) ) -RRB- 3222 125 13 , , , 3222 125 14 we -PRON- PRP 3222 125 15 can can MD 3222 125 16 see see VB 3222 125 17 an an DT 3222 125 18 average average JJ 3222 125 19 improvement improvement NN 3222 125 20 of of IN 3222 125 21 17 17 CD 3222 125 22 percent percent NN 3222 125 23 for for IN 3222 125 24 CTDL CTDL NNP 3222 125 25 and and CC 3222 125 26 27 27 CD 3222 125 27 percent percent NN 3222 125 28 for for IN 3222 125 29 CTDL+ CTDL+ NNP 3222 125 30 compared compare VBN 3222 125 31 to to IN 3222 125 32 the the DT 3222 125 33 baseline baseline NN 3222 125 34 Deflate Deflate NNP 3222 125 35 implementation implementation NN 3222 125 36 . . . 
3222 126 1 Compared compare VBN 3222 126 2 to to IN 3222 126 3 the the DT 3222 126 4 baseline baseline JJ 3222 126 5 LZMA lzma NN 3222 126 6 implementation implementation NN 3222 126 7 , , , 3222 126 8 the the DT 3222 126 9 improvement improvement NN 3222 126 10 is be VBZ 3222 126 11 10 10 CD 3222 126 12 percent percent NN 3222 126 13 for for IN 3222 126 14 CTDL CTDL NNP 3222 126 15 and and CC 3222 126 16 20 20 CD 3222 126 17 percent percent NN 3222 126 18 for for IN 3222 126 19 CTDL+ CTDL+ NNP 3222 126 20 . . . 3222 127 1 Also also RB 3222 127 2 , , , 3222 127 3 CTDL+ CTDL+ NNP 3222 127 4 combined combine VBN 3222 127 5 with with IN 3222 127 6 LZMA lzma NN 3222 127 7 compresses compress NNS 3222 127 8 TXT txt NN 3222 127 9 documents document NNS 3222 127 10 31 31 CD 3222 127 11 percent percent NN 3222 127 12 better well JJR 3222 127 13 than than IN 3222 127 14 gzip gzip NN 3222 127 15 , , , 3222 127 16 11 11 CD 3222 127 17 percent percent NN 3222 127 18 better well RBR 3222 127 19 than than IN 3222 127 20 bzip2 bzip2 NN 3222 127 21 , , , 3222 127 22 and and CC 3222 127 23 slightly slightly RB 3222 127 24 better well JJR 3222 127 25 than than IN 3222 127 26 the the DT 3222 127 27 state state NN 3222 127 28 - - HYPH 3222 127 29 of of IN 3222 127 30 - - HYPH 3222 127 31 the the DT 3222 127 32 - - HYPH 3222 127 33 art art NN 3222 127 34 PPMVC ppmvc NN 3222 127 35 implementation implementation NN 3222 127 36 . . . 
3222 128 1 In in IN 3222 128 2 case case NN 3222 128 3 of of IN 3222 128 4 TEX TEX NNP 3222 128 5 documents document NNS 3222 128 6 ( ( -LRB- 3222 128 7 table table NN 3222 128 8 4 4 CD 3222 128 9 ) ) -RRB- 3222 128 10 , , , 3222 128 11 the the DT 3222 128 12 gzip gzip NN 3222 128 13 results result NNS 3222 128 14 were be VBD 3222 128 15 improved improve VBN 3222 128 16 , , , 3222 128 17 on on IN 3222 128 18 average average JJ 3222 128 19 , , , 3222 128 20 by by IN 3222 128 21 16 16 CD 3222 128 22 percent percent NN 3222 128 23 using use VBG 3222 128 24 CTDL CTDL NNP 3222 128 25 and and CC 3222 128 26 by by IN 3222 128 27 26 26 CD 3222 128 28 percent percent NN 3222 128 29 using use VBG 3222 128 30 CTDL+ CTDL+ NNP 3222 128 31 ; ; : 3222 128 32 the the DT 3222 128 33 numbers number NNS 3222 128 34 for for IN 3222 128 35 LZMA LZMA NNS 3222 128 36 are be VBP 3222 128 37 10 10 CD 3222 128 38 percent percent NN 3222 128 39 for for IN 3222 128 40 CTDL CTDL NNP 3222 128 41 and and CC 3222 128 42 19 19 CD 3222 128 43 percent percent NN 3222 128 44 for for IN 3222 128 45 CTDL+ CTDL+ NNP 3222 128 46 . . . 3222 129 1 In in IN 3222 129 2 a a DT 3222 129 3 cross cross JJ 3222 129 4 - - JJ 3222 129 5 method method JJ 3222 129 6 comparison comparison NN 3222 129 7 , , , 3222 129 8 CTDL+ CTDL+ NNP 3222 129 9 with with IN 3222 129 10 LZMA lzma NN 3222 129 11 beats beat NNS 3222 129 12 gzip gzip NN 3222 129 13 by by IN 3222 129 14 31 31 CD 3222 129 15 percent percent NN 3222 129 16 , , , 3222 129 17 bzip2 bzip2 NN 3222 129 18 by by IN 3222 129 19 10 10 CD 3222 129 20 percent percent NN 3222 129 21 , , , 3222 129 22 and and CC 3222 129 23 attains attain VBZ 3222 129 24 results result NNS 3222 129 25 very very RB 3222 129 26 close close RB 3222 129 27 to to IN 3222 129 28 PPMVC PPMVC NNP 3222 129 29 . . . 
3222 130 1 On on IN 3222 130 2 average average JJ 3222 130 3 , , , 3222 130 4 Deflate deflate NN 3222 130 5 - - HYPH 3222 130 6 based base VBN 3222 130 7 CTDL CTDL NNP 3222 130 8 compressed compress VBD 3222 130 9 XML xml NN 3222 130 10 documents document NNS 3222 130 11 20 20 CD 3222 130 12 percent percent NN 3222 130 13 better well JJR 3222 130 14 than than IN 3222 130 15 the the DT 3222 130 16 baseline baseline NN 3222 130 17 algorithm algorithm NNP 3222 130 18 ( ( -LRB- 3222 130 19 table table NN 3222 130 20 5 5 CD 3222 130 21 ) ) -RRB- 3222 130 22 , , , 3222 130 23 and and CC 3222 130 24 with with IN 3222 130 25 CTDL+ CTDL+ NNP 3222 130 26 the the DT 3222 130 27 improvement improvement NN 3222 130 28 rises rise VBZ 3222 130 29 to to IN 3222 130 30 26 26 CD 3222 130 31 percent percent NN 3222 130 32 . . . 3222 131 1 CTDL CTDL NNP 3222 131 2 improves improve VBZ 3222 131 3 LZMA lzma NN 3222 131 4 compression compression NN 3222 131 5 by by IN 3222 131 6 11 11 CD 3222 131 7 percent percent NN 3222 131 8 , , , 3222 131 9 and and CC 3222 131 10 CTDL+ CTDL+ NNP 3222 131 11 improves improve VBZ 3222 131 12 it -PRON- PRP 3222 131 13 by by IN 3222 131 14 18 18 CD 3222 131 15 percent percent NN 3222 131 16 . . . 3222 132 1 CTDL+ ctdl+ VB 3222 132 2 with with IN 3222 132 3 LZMA lzma NN 3222 132 4 beats beat NNS 3222 132 5 gzip gzip NN 3222 132 6 by by IN 3222 132 7 33 33 CD 3222 132 8 percent percent NN 3222 132 9 , , , 3222 132 10 bzip2 bzip2 NN 3222 132 11 by by IN 3222 132 12 8 8 CD 3222 132 13 percent percent NN 3222 132 14 , , , 3222 132 15 and and CC 3222 132 16 loses lose VBZ 3222 132 17 only only RB 3222 132 18 4 4 CD 3222 132 19 percent percent NN 3222 132 20 to to IN 3222 132 21 PPMVC PPMVC NNP 3222 132 22 . . . 
3222 133 1 Similar similar JJ 3222 133 2 results result NNS 3222 133 3 were be VBD 3222 133 4 obtained obtain VBN 3222 133 5 for for IN 3222 133 6 HTML html NN 3222 133 7 documents document NNS 3222 133 8 ( ( -LRB- 3222 133 9 table table NN 3222 133 10 6 6 CD 3222 133 11 ) ) -RRB- 3222 133 12 : : : 3222 133 13 they -PRON- PRP 3222 133 14 were be VBD 3222 133 15 compressed compress VBN 3222 133 16 with with IN 3222 133 17 CTDL CTDL NNP 3222 133 18 and and CC 3222 133 19 Deflate Deflate NNP 3222 133 20 18 18 CD 3222 133 21 percent percent NN 3222 133 22 better well JJR 3222 133 23 than than IN 3222 133 24 with with IN 3222 133 25 the the DT 3222 133 26 Deflate Deflate NNP 3222 133 27 algorithm algorithm NN 3222 133 28 alone alone RB 3222 133 29 , , , 3222 133 30 and and CC 3222 133 31 27 27 CD 3222 133 32 percent percent NN 3222 133 33 better well RBR 3222 133 34 with with IN 3222 133 35 CTDL+ CTDL+ NNP 3222 133 36 . . . 3222 134 1 LZMA lzma NN 3222 134 2 compression compression NN 3222 134 3 efficiency efficiency NN 3222 134 4 is be VBZ 3222 134 5 improved improve VBN 3222 134 6 by by IN 3222 134 7 11 11 CD 3222 134 8 percent percent NN 3222 134 9 with with IN 3222 134 10 CTDL CTDL NNP 3222 134 11 and and CC 3222 134 12 20 20 CD 3222 134 13 percent percent NN 3222 134 14 with with IN 3222 134 15 CTDL+ CTDL+ NNP 3222 134 16 . . . 3222 135 1 CTDL+ ctdl+ VB 3222 135 2 with with IN 3222 135 3 LZMA lzma NN 3222 135 4 beats beat NNS 3222 135 5 gzip gzip NN 3222 135 6 by by IN 3222 135 7 33 33 CD 3222 135 8 percent percent NN 3222 135 9 , , , 3222 135 10 bzip2 bzip2 NN 3222 135 11 by by IN 3222 135 12 9 9 CD 3222 135 13 percent percent NN 3222 135 14 , , , 3222 135 15 and and CC 3222 135 16 loses lose VBZ 3222 135 17 only only RB 3222 135 18 2 2 CD 3222 135 19 percent percent NN 3222 135 20 to to IN 3222 135 21 PPMVC PPMVC NNP 3222 135 22 . . . 
3222 136 1 For for IN 3222 136 2 RTF RTF NNP 3222 136 3 documents document NNS 3222 136 4 ( ( -LRB- 3222 136 5 table table NN 3222 136 6 7 7 CD 3222 136 7 ) ) -RRB- 3222 136 8 , , , 3222 136 9 the the DT 3222 136 10 gzip gzip NN 3222 136 11 results result NNS 3222 136 12 were be VBD 3222 136 13 improved improve VBN 3222 136 14 , , , 3222 136 15 on on IN 3222 136 16 average average JJ 3222 136 17 , , , 3222 136 18 by by IN 3222 136 19 18 18 CD 3222 136 20 percent percent NN 3222 136 21 using use VBG 3222 136 22 CTDL CTDL NNP 3222 136 23 , , , 3222 136 24 and and CC 3222 136 25 25 25 CD 3222 136 26 percent percent NN 3222 136 27 using use VBG 3222 136 28 CTDL+ CTDL+ NNP 3222 136 29 ; ; : 3222 136 30 the the DT 3222 136 31 numbers number NNS 3222 136 32 for for IN 3222 136 33 LZMA LZMA NNS 3222 136 34 are be VBP 3222 136 35 respec- respec- VBG 3222 136 36 tively tively RB 3222 136 37 9 9 CD 3222 136 38 percent percent NN 3222 136 39 for for IN 3222 136 40 CTDL CTDL NNP 3222 136 41 and and CC 3222 136 42 17 17 CD 3222 136 43 percent percent NN 3222 136 44 for for IN 3222 136 45 CTDL+ CTDL+ NNP 3222 136 46 . . . 3222 137 1 In in IN 3222 137 2 a a DT 3222 137 3 cross cross JJ 3222 137 4 - - JJ 3222 137 5 method method JJ 3222 137 6 comparison comparison NN 3222 137 7 , , , 3222 137 8 CTDL+ CTDL+ NNP 3222 137 9 with with IN 3222 137 10 LZMA lzma NN 3222 137 11 beats beat NNS 3222 137 12 gzip gzip NN 3222 137 13 by by IN 3222 137 14 34 34 CD 3222 137 15 percent percent NN 3222 137 16 , , , 3222 137 17 bzip2 bzip2 NN 3222 137 18 by by IN 3222 137 19 7 7 CD 3222 137 20 percent percent NN 3222 137 21 , , , 3222 137 22 and and CC 3222 137 23 loses lose VBZ 3222 137 24 5 5 CD 3222 137 25 percent percent NN 3222 137 26 to to IN 3222 137 27 PPMVC PPMVC NNP 3222 137 28 . . . 
3222 138 1 Although although IN 3222 138 2 there there EX 3222 138 3 is be VBZ 3222 138 4 no no DT 3222 138 5 mode mode NN 3222 138 6 designed design VBN 3222 138 7 especially especially RB 3222 138 8 for for IN 3222 138 9 DOC DOC NNP 3222 138 10 documents document NNS 3222 138 11 in in IN 3222 138 12 CTDL CTDL NNP 3222 138 13 ( ( -LRB- 3222 138 14 table table NN 3222 138 15 8) 8) CD 3222 138 16 , , , 3222 138 17 the the DT 3222 138 18 basic basic JJ 3222 138 19 TXT TXT NNP 3222 138 20 mode mode NN 3222 138 21 was be VBD 3222 138 22 used use VBN 3222 138 23 , , , 3222 138 24 as as IN 3222 138 25 it -PRON- PRP 3222 138 26 was be VBD 3222 138 27 found find VBN 3222 138 28 experimentally experimentally RB 3222 138 29 to to TO 3222 138 30 be be VB 3222 138 31 the the DT 3222 138 32 best good JJS 3222 138 33 choice choice NN 3222 138 34 available available JJ 3222 138 35 . . . 3222 139 1 The the DT 3222 139 2 results result NNS 3222 139 3 show show VBP 3222 139 4 it -PRON- PRP 3222 139 5 managed manage VBD 3222 139 6 to to TO 3222 139 7 improve improve VB 3222 139 8 Deflate deflate NN 3222 139 9 - - HYPH 3222 139 10 based base VBN 3222 139 11 compression compression NN 3222 139 12 by by IN 3222 139 13 9 9 CD 3222 139 14 percent percent NN 3222 139 15 using use VBG 3222 139 16 CTDL CTDL NNP 3222 139 17 , , , 3222 139 18 and and CC 3222 139 19 by by IN 3222 139 20 21 21 CD 3222 139 21 percent percent NN 3222 139 22 using use VBG 3222 139 23 CTDL+ CTDL+ NNP 3222 139 24 , , , 3222 139 25 whereas whereas IN 3222 139 26 LZMA lzma NN 3222 139 27 - - HYPH 3222 139 28 based base VBN 3222 139 29 compression compression NN 3222 139 30 was be VBD 3222 139 31 improved improve VBN 3222 139 32 respectively respectively RB 3222 139 33 by by IN 3222 139 34 4 4 CD 3222 139 35 percent percent NN 3222 139 36 for for IN 3222 139 37 CTDL CTDL NNP 3222 139 38 and and CC 3222 139 39 14 14 CD 3222 139 40 percent percent NN 3222 139 41 for for IN 3222 139 42 CTDL+ CTDL+ NNP 3222 139 43 . . . 
3222 140 1 Combined combine VBN 3222 140 2 with with IN 3222 140 3 LZMA lzma NN 3222 140 4 , , , 3222 140 5 CTDL+ CTDL+ NNP 3222 140 6 compresses compress VBZ 3222 140 7 DOC DOC NNP 3222 140 8 documents document NNS 3222 140 9 30 30 CD 3222 140 10 percent percent NN 3222 140 11 better well JJR 3222 140 12 than than IN 3222 140 13 gzip gzip NN 3222 140 14 , , , 3222 140 15 13 13 CD 3222 140 16 percent percent NN 3222 140 17 better well JJR 3222 140 18 than than IN 3222 140 19 bzip2 bzip2 NN 3222 140 20 , , , 3222 140 21 and and CC 3222 140 22 1 1 CD 3222 140 23 percent percent NN 3222 140 24 bet- bet- NN 3222 140 25 ter ter NN 3222 140 26 than than IN 3222 140 27 PPMVC PPMVC NNP 3222 140 28 . . . 3222 141 1 In in IN 3222 141 2 case case NN 3222 141 3 of of IN 3222 141 4 PS PS NNP 3222 141 5 documents document NNS 3222 141 6 ( ( -LRB- 3222 141 7 table table NN 3222 141 8 9 9 CD 3222 141 9 ) ) -RRB- 3222 141 10 , , , 3222 141 11 the the DT 3222 141 12 gzip gzip NN 3222 141 13 results result NNS 3222 141 14 were be VBD 3222 141 15 improved improve VBN 3222 141 16 , , , 3222 141 17 on on IN 3222 141 18 average average JJ 3222 141 19 , , , 3222 141 20 by by IN 3222 141 21 5 5 CD 3222 141 22 percent percent NN 3222 141 23 using use VBG 3222 141 24 CTDL CTDL NNP 3222 141 25 , , , 3222 141 26 and and CC 3222 141 27 by by IN 3222 141 28 8 8 CD 3222 141 29 percent percent NN 3222 141 30 using use VBG 3222 141 31 CTDL+ CTDL+ NNP 3222 141 32 ; ; : 3222 141 33 the the DT 3222 141 34 numbers number NNS 3222 141 35 for for IN 3222 141 36 LZMA LZMA NNS 3222 141 37 improved improve VBD 3222 141 38 3 3 CD 3222 141 39 percent percent NN 3222 141 40 for for IN 3222 141 41 CTDL CTDL NNP 3222 141 42 and and CC 3222 141 43 5 5 CD 3222 141 44 percent percent NN 3222 141 45 for for IN 3222 141 46 CTDL+ CTDL+ NNP 3222 141 47 . . . 
3222 142 1 In in IN 3222 142 2 a a DT 3222 142 3 cross cross JJ 3222 142 4 - - JJ 3222 142 5 method method JJ 3222 142 6 comparison comparison NN 3222 142 7 , , , 3222 142 8 CTDL+ CTDL+ NNP 3222 142 9 with with IN 3222 142 10 LZMA lzma NN 3222 142 11 beats beat NNS 3222 142 12 gzip gzip NN 3222 142 13 by by IN 3222 142 14 8 8 CD 3222 142 15 percent percent NN 3222 142 16 , , , 3222 142 17 losing lose VBG 3222 142 18 5 5 CD 3222 142 19 percent percent NN 3222 142 20 to to TO 3222 142 21 bzip2 bzip2 VB 3222 142 22 and and CC 3222 142 23 7 7 CD 3222 142 24 percent percent NN 3222 142 25 to to IN 3222 142 26 PPMVC PPMVC NNP 3222 142 27 . . . 3222 143 1 Finally finally RB 3222 143 2 , , , 3222 143 3 CTDL CTDL NNP 3222 143 4 improved improve VBD 3222 143 5 Deflate deflate NN 3222 143 6 - - HYPH 3222 143 7 based base VBN 3222 143 8 compression compression NN 3222 143 9 of of IN 3222 143 10 PDF PDF NNP 3222 143 11 documents document NNS 3222 143 12 ( ( -LRB- 3222 143 13 table table NN 3222 143 14 10 10 CD 3222 143 15 ) ) -RRB- 3222 143 16 by by IN 3222 143 17 9 9 CD 3222 143 18 percent percent NN 3222 143 19 using use VBG 3222 143 20 CTDL CTDL NNP 3222 143 21 and and CC 3222 143 22 10 10 CD 3222 143 23 percent percent NN 3222 143 24 using use VBG 3222 143 25 CTDL+ CTDL+ NNP 3222 143 26 ( ( -LRB- 3222 143 27 compared compare VBN 3222 143 28 to to IN 3222 143 29 gzip gzip VB 3222 143 30 ; ; : 3222 143 31 the the DT 3222 143 32 numbers number NNS 3222 143 33 are be VBP 3222 143 34 Table table NN 3222 143 35 5 5 CD 3222 143 36 . . . 
3222 144 1 Compression compression NN 3222 144 2 efficiency efficiency NN 3222 144 3 and and CC 3222 144 4 times time NNS 3222 144 5 for for IN 3222 144 6 the the DT 3222 144 7 XML xml NN 3222 144 8 documents document NNS 3222 144 9 Deflate deflate JJ 3222 144 10 LZMA LZMA NNS 3222 144 11 bzip2 bzip2 VBP 3222 144 12 PPMVC PPMVC NNP 3222 144 13 File File NNP 3222 144 14 Name Name NNP 3222 144 15 gzip gzip NN 3222 144 16 CTDL CTDL NNP 3222 144 17 CTDL+ CTDL+ NNP 3222 144 18 7-zip 7-zip NNP 3222 144 19 CTDL CTDL NNP 3222 144 20 CTDL+ CTDL+ NNP 3222 144 21 13601-t 13601-t VBD 3222 144 22 2.046 2.046 CD 3222 144 23 1.551 1.551 CD 3222 144 24 1.514 1.514 CD 3222 144 25 1.585 1.585 CD 3222 144 26 1.405 1.405 CD 3222 144 27 1.339 1.339 CD 3222 144 28 1.451 1.451 CD 3222 144 29 1.242 1.242 CD 3222 144 30 16514-t 16514-t CD 3222 144 31 0.871 0.871 CD 3222 144 32 0.698 0.698 CD 3222 144 33 0.670 0.670 CD 3222 144 34 0.703 0.703 CD 3222 144 35 0.612 0.612 CD 3222 144 36 0.590 0.590 CD 3222 144 37 0.599 0.599 CD 3222 144 38 0.552 0.552 CD 3222 144 39 1noam10 1noam10 CD 3222 144 40 t t NNP 3222 144 41 2.383 2.383 CD 3222 144 42 1.870 1.870 CD 3222 144 43 1.736 1.736 CD 3222 144 44 1.914 1.914 CD 3222 144 45 1.711 1.711 CD 3222 144 46 1.575 1.575 CD 3222 144 47 1.724 1.724 CD 3222 144 48 1.515 1.515 CD 3222 144 49 2ws2610 2ws2610 CD 3222 144 50 0.691 0.691 CD 3222 144 51 0.539 0.539 CD 3222 144 52 0.497 0.497 CD 3222 144 53 0.561 0.561 CD 3222 144 54 0.474 0.474 CD 3222 144 55 0.440 0.440 CD 3222 144 56 0.461 0.461 CD 3222 144 57 0.422 0.422 CD 3222 144 58 alice30 alice30 CD 3222 144 59 1.477 1.477 CD 3222 144 60 1.258 1.258 CD 3222 144 61 1.140 1.140 CD 3222 144 62 1.248 1.248 CD 3222 144 63 1.131 1.131 CD 3222 144 64 1.034 1.034 CD 3222 144 65 1.116 1.116 CD 3222 144 66 0.999 0.999 CD 3222 144 67 cdscs10 cdscs10 NN 3222 144 68 t t NN 3222 144 69 2.106 2.106 CD 3222 144 70 1.892 1.892 CD 3222 144 71 1.576 1.576 CD 3222 144 72 1.862 1.862 CD 3222 144 73 1.741 1.741 CD 3222 144 74 
1.462 1.462 CD 3222 144 75 1.721 1.721 CD 3222 144 76 1.538 1.538 CD 3222 144 77 grimm10 grimm10 CD 3222 144 78 t t NN 3222 144 79 1.878 1.878 CD 3222 144 80 1.485 1.485 CD 3222 144 81 1.422 1.422 CD 3222 144 82 1.521 1.521 CD 3222 144 83 1.337 1.337 CD 3222 144 84 1.276 1.276 CD 3222 144 85 1.337 1.337 CD 3222 144 86 1.198 1.198 CD 3222 144 87 pandp12 pandp12 NNP 3222 144 88 t t NNP 3222 144 89 1.875 1.875 CD 3222 144 90 1.404 1.404 CD 3222 144 91 1.349 1.349 CD 3222 144 92 1.465 1.465 CD 3222 144 93 1.263 1.263 CD 3222 144 94 1.207 1.207 CD 3222 144 95 1.252 1.252 CD 3222 144 96 1.105 1.105 CD 3222 144 97 Average average JJ 3222 144 98 1.666 1.666 CD 3222 144 99 1.337 1.337 CD 3222 144 100 1.238 1.238 CD 3222 144 101 1.357 1.357 CD 3222 144 102 1.209 1.209 CD 3222 144 103 1.115 1.115 CD 3222 144 104 1.208 1.208 CD 3222 144 105 1.071 1.071 CD 3222 144 106 Comp Comp NNP 3222 144 107 . . . 3222 145 1 Time time NN 3222 145 2 0.750 0.750 CD 3222 145 3 1.844 1.844 CD 3222 145 4 1.390 1.390 CD 3222 145 5 10.79 10.79 CD 3222 145 6 4.891 4.891 CD 3222 145 7 5.828 5.828 CD 3222 145 8 7.047 7.047 CD 3222 145 9 3.688 3.688 CD 3222 145 10 Dec. 
December NNP 3222 145 11 Time Time NNP 3222 145 12 0.141 0.141 CD 3222 145 13 0.672 0.672 CD 3222 145 14 0.750 0.750 CD 3222 145 15 0.421 0.421 CD 3222 145 16 0.859 0.859 CD 3222 145 17 0.953 0.953 CD 3222 145 18 1.140 1.140 CD 3222 145 19 3.907 3.907 CD 3222 145 20 150 150 CD 3222 145 21 iNForMaTioN information NN 3222 145 22 TECHNoloGY technology NN 3222 145 23 aND and CC 3222 145 24 liBrariES liBrariES NNP 3222 145 25 | | NNP 3222 145 26 SEpTEMBEr september CD 3222 145 27 2009 2009 CD 3222 145 28 much much RB 3222 145 29 higher high JJR 3222 145 30 if if IN 3222 145 31 compared compare VBN 3222 145 32 to to IN 3222 145 33 the the DT 3222 145 34 embedded embed VBN 3222 145 35 PDF PDF NNP 3222 145 36 compres- compres- JJ 3222 145 37 sion sion NN 3222 145 38 — — : 3222 145 39 see see VB 3222 145 40 “ " `` 3222 145 41 native native JJ 3222 145 42 ” " '' 3222 145 43 column column NN 3222 145 44 in in IN 3222 145 45 table table NN 3222 145 46 10 10 CD 3222 145 47 ) ) -RRB- 3222 145 48 ; ; : 3222 145 49 the the DT 3222 145 50 numbers number NNS 3222 145 51 for for IN 3222 145 52 LZMA LZMA NNS 3222 145 53 are be VBP 3222 145 54 respectively respectively RB 3222 145 55 7 7 CD 3222 145 56 percent percent NN 3222 145 57 for for IN 3222 145 58 CTDL CTDL NNP 3222 145 59 and and CC 3222 145 60 10 10 CD 3222 145 61 percent percent NN 3222 145 62 for for IN 3222 145 63 CTDL+ CTDL+ NNP 3222 145 64 . . . 
3222 146 1 Combined combine VBN 3222 146 2 with with IN 3222 146 3 LZMA lzma NN 3222 146 4 , , , 3222 146 5 CTDL+ CTDL+ NNP 3222 146 6 compresses compress VBZ 3222 146 7 PDF PDF NNP 3222 146 8 documents document NNS 3222 146 9 28 28 CD 3222 146 10 percent percent NN 3222 146 11 better well JJR 3222 146 12 than than IN 3222 146 13 gzip gzip NN 3222 146 14 , , , 3222 146 15 4 4 CD 3222 146 16 percent percent NN 3222 146 17 bet- bet- NN 3222 146 18 ter ter NN 3222 146 19 than than IN 3222 146 20 bzip2 bzip2 NN 3222 146 21 , , , 3222 146 22 and and CC 3222 146 23 5 5 CD 3222 146 24 percent percent NN 3222 146 25 worse bad JJR 3222 146 26 than than IN 3222 146 27 PPMVC PPMVC NNP 3222 146 28 . . . 3222 147 1 The the DT 3222 147 2 results result NNS 3222 147 3 presented present VBN 3222 147 4 in in IN 3222 147 5 tables table NNS 3222 147 6 3–10 3–10 CD 3222 147 7 show show VBP 3222 147 8 that that IN 3222 147 9 CTDL CTDL NNP 3222 147 10 manages manage VBZ 3222 147 11 to to TO 3222 147 12 improve improve VB 3222 147 13 compression compression NN 3222 147 14 efficiency efficiency NN 3222 147 15 of of IN 3222 147 16 the the DT 3222 147 17 gen- gen- NN 3222 147 18 eral eral JJ 3222 147 19 - - HYPH 3222 147 20 purpose purpose NN 3222 147 21 algorithms algorithm NNS 3222 147 22 it -PRON- PRP 3222 147 23 is be VBZ 3222 147 24 based base VBN 3222 147 25 on on IN 3222 147 26 . . . 
3222 148 1 The the DT 3222 148 2 scale scale NN 3222 148 3 of of IN 3222 148 4 improvement improvement NN 3222 148 5 varies vary VBZ 3222 148 6 between between IN 3222 148 7 document document NN 3222 148 8 types type NNS 3222 148 9 , , , 3222 148 10 but but CC 3222 148 11 for for IN 3222 148 12 most most JJS 3222 148 13 of of IN 3222 148 14 them -PRON- PRP 3222 148 15 it -PRON- PRP 3222 148 16 is be VBZ 3222 148 17 more more JJR 3222 148 18 than than IN 3222 148 19 20 20 CD 3222 148 20 percent percent NN 3222 148 21 for for IN 3222 148 22 CTDL+ CTDL+ NNP 3222 148 23 and and CC 3222 148 24 10 10 CD 3222 148 25 percent percent NN 3222 148 26 for for IN 3222 148 27 CTDL CTDL NNP 3222 148 28 . . . 3222 149 1 The the DT 3222 149 2 smallest small JJS 3222 149 3 improvement improvement NN 3222 149 4 is be VBZ 3222 149 5 achieved achieve VBN 3222 149 6 in in IN 3222 149 7 case case NN 3222 149 8 of of IN 3222 149 9 PS PS NNP 3222 149 10 ( ( -LRB- 3222 149 11 about about RB 3222 149 12 5 5 CD 3222 149 13 percent percent NN 3222 149 14 ) ) -RRB- 3222 149 15 . . . 
3222 150 1 Figure figure NN 3222 150 2 1 1 CD 3222 150 3 shows show VBZ 3222 150 4 the the DT 3222 150 5 same same JJ 3222 150 6 results result NNS 3222 150 7 in in IN 3222 150 8 another another DT 3222 150 9 perspective perspective NN 3222 150 10 : : : 3222 150 11 the the DT 3222 150 12 bars bar NNS 3222 150 13 show show VBP 3222 150 14 how how WRB 3222 150 15 much much RB 3222 150 16 better well JJR 3222 150 17 compression compression NN 3222 150 18 ratios ratio NNS 3222 150 19 were be VBD 3222 150 20 obtained obtain VBN 3222 150 21 for for IN 3222 150 22 the the DT 3222 150 23 same same JJ 3222 150 24 documents document NNS 3222 150 25 using use VBG 3222 150 26 different different JJ 3222 150 27 compression compression NN 3222 150 28 schemes scheme NNS 3222 150 29 com- com- NN 3222 150 30 pared pare VBD 3222 150 31 to to TO 3222 150 32 gzip gzip VB 3222 150 33 with with IN 3222 150 34 default default NN 3222 150 35 options option NNS 3222 150 36 ( ( -LRB- 3222 150 37 0 0 CD 3222 150 38 percent percent NN 3222 150 39 means mean VBZ 3222 150 40 no no DT 3222 150 41 improvement improvement NN 3222 150 42 ) ) -RRB- 3222 150 43 . . . 3222 151 1 Compared compare VBN 3222 151 2 to to IN 3222 151 3 gzip gzip VB 3222 151 4 , , , 3222 151 5 CTDL CTDL NNP 3222 151 6 offers offer VBZ 3222 151 7 a a DT 3222 151 8 significantly significantly RB 3222 151 9 better well JJR 3222 151 10 compression compression NN 3222 151 11 ratio ratio NN 3222 151 12 at at IN 3222 151 13 the the DT 3222 151 14 expense expense NN 3222 151 15 of of IN 3222 151 16 longer long JJR 3222 151 17 processing processing NN 3222 151 18 time time NN 3222 151 19 . . . 3222 152 1 The the DT 3222 152 2 relative relative JJ 3222 152 3 difference difference NN 3222 152 4 is be VBZ 3222 152 5 especially especially RB 3222 152 6 high high JJ 3222 152 7 in in IN 3222 152 8 case case NN 3222 152 9 of of IN 3222 152 10 decompression decompression NN 3222 152 11 . . . 
3222 153 1 However however RB 3222 153 2 , , , 3222 153 3 in in IN 3222 153 4 absolute absolute JJ 3222 153 5 terms term NNS 3222 153 6 , , , 3222 153 7 even even RB 3222 153 8 in in IN 3222 153 9 the the DT 3222 153 10 worst bad JJS 3222 153 11 case case NN 3222 153 12 of of IN 3222 153 13 PDF PDF NNP 3222 153 14 , , , 3222 153 15 the the DT 3222 153 16 average average JJ 3222 153 17 delay delay NN 3222 153 18 between between IN 3222 153 19 CTDL+ CTDL+ NNP 3222 153 20 and and CC 3222 153 21 gzip gzip NNP 3222 153 22 is be VBZ 3222 153 23 below below IN 3222 153 24 180 180 CD 3222 153 25 ms m NNS 3222 153 26 for for IN 3222 153 27 compression compression NN 3222 153 28 and and CC 3222 153 29 90 90 CD 3222 153 30 ms m NNS 3222 153 31 for for IN 3222 153 32 decompression decompression NN 3222 153 33 per per IN 3222 153 34 file file NN 3222 153 35 . . . 3222 154 1 Taking take VBG 3222 154 2 into into IN 3222 154 3 consideration consideration NN 3222 154 4 the the DT 3222 154 5 low low JJ 3222 154 6 - - HYPH 3222 154 7 end end NN 3222 154 8 specification specification NN 3222 154 9 of of IN 3222 154 10 the the DT 3222 154 11 test test NN 3222 154 12 computer computer NN 3222 154 13 , , , 3222 154 14 these these DT 3222 154 15 results result NNS 3222 154 16 Table table NN 3222 154 17 6 6 CD 3222 154 18 . . . 
3222 155 1 Compression compression NN 3222 155 2 efficiency efficiency NN 3222 155 3 and and CC 3222 155 4 times time NNS 3222 155 5 for for IN 3222 155 6 the the DT 3222 155 7 HTML html NN 3222 155 8 documents document NNS 3222 155 9 Deflate deflate JJ 3222 155 10 LZMA LZMA NNS 3222 155 11 bzip2 bzip2 VBP 3222 155 12 PPMVC PPMVC NNP 3222 155 13 File File NNP 3222 155 14 Name Name NNP 3222 155 15 gzip gzip NN 3222 155 16 CTDL CTDL NNP 3222 155 17 CTDL+ CTDL+ NNP 3222 155 18 7-zip 7-zip NNP 3222 155 19 CTDL CTDL NNP 3222 155 20 CTDL+ CTDL+ NNP 3222 155 21 13601-t 13601-t VBD 3222 155 22 2.696 2.696 CD 3222 155 23 2.054 2.054 CD 3222 155 24 1.940 1.940 CD 3222 155 25 2.121 2.121 CD 3222 155 26 1.868 1.868 CD 3222 155 27 1.751 1.751 CD 3222 155 28 1.932 1.932 CD 3222 155 29 1.670 1.670 CD 3222 155 30 16514-t 16514-t CD 3222 155 31 1.726 1.726 CD 3222 155 32 1.405 1.405 CD 3222 155 33 1.310 1.310 CD 3222 155 34 1.436 1.436 CD 3222 155 35 1.258 1.258 CD 3222 155 36 1.180 1.180 CD 3222 155 37 1.257 1.257 CD 3222 155 38 1.113 1.113 CD 3222 155 39 1noam10 1noam10 NN 3222 155 40 t t NNP 3222 155 41 2.768 2.768 CD 3222 155 42 2.159 2.159 CD 3222 155 43 1.972 1.972 CD 3222 155 44 2.244 2.244 CD 3222 155 45 1.979 1.979 CD 3222 155 46 1.815 1.815 CD 3222 155 47 1.973 1.973 CD 3222 155 48 1.785 1.785 CD 3222 155 49 2ws2610 2ws2610 CD 3222 155 50 2.084 2.084 CD 3222 155 51 1.747 1.747 CD 3222 155 52 1.504 1.504 CD 3222 155 53 1.743 1.743 CD 3222 155 54 1.525 1.525 CD 3222 155 55 1.344 1.344 CD 3222 155 56 1.499 1.499 CD 3222 155 57 1.303 1.303 CD 3222 155 58 alice30 alice30 CD 3222 155 59 2.451 2.451 CD 3222 155 60 2.124 2.124 CD 3222 155 61 1.829 1.829 CD 3222 155 62 2.128 2.128 CD 3222 155 63 1.929 1.929 CD 3222 155 64 1.701 1.701 CD 3222 155 65 1.888 1.888 CD 3222 155 66 1.684 1.684 CD 3222 155 67 cdscs10 cdscs10 NN 3222 155 68 t t NNP 3222 155 69 2.880 2.880 CD 3222 155 70 2.593 2.593 CD 3222 155 71 2.084 2.084 CD 3222 155 72 2.597 2.597 CD 3222 155 73 2.410 2.410 CD 3222 155 
74 1.966 1.966 CD 3222 155 75 2.348 2.348 CD 3222 155 76 2.131 2.131 CD 3222 155 77 grimm10 grimm10 NNS 3222 155 78 t t NN 3222 155 79 2.603 2.603 CD 3222 155 80 2.074 2.074 CD 3222 155 81 1.916 1.916 CD 3222 155 82 2.138 2.138 CD 3222 155 83 1.883 1.883 CD 3222 155 84 1.752 1.752 CD 3222 155 85 1.889 1.889 CD 3222 155 86 1.688 1.688 CD 3222 155 87 pandp12 pandp12 NNP 3222 155 88 t t NNP 3222 155 89 2.640 2.640 CD 3222 155 90 2.037 2.037 CD 3222 155 91 1.891 1.891 CD 3222 155 92 2.120 2.120 CD 3222 155 93 1.826 1.826 CD 3222 155 94 1.717 1.717 CD 3222 155 95 1.777 1.777 CD 3222 155 96 1.596 1.596 CD 3222 155 97 Average average NN 3222 155 98 2.481 2.481 CD 3222 155 99 2.024 2.024 CD 3222 155 100 1.806 1.806 CD 3222 155 101 2.066 2.066 CD 3222 155 102 1.835 1.835 CD 3222 155 103 1.653 1.653 CD 3222 155 104 1.820 1.820 CD 3222 155 105 1.621 1.621 CD 3222 155 106 Comp Comp NNP 3222 155 107 . . . 3222 156 1 Time time NN 3222 156 2 0.750 0.750 CD 3222 156 3 1.438 1.438 CD 3222 156 4 1.078 1.078 CD 3222 156 5 8.203 8.203 CD 3222 156 6 3.421 3.421 CD 3222 156 7 3.328 3.328 CD 3222 156 8 2.672 2.672 CD 3222 156 9 3.500 3.500 CD 3222 156 10 Dec. December NNP 3222 156 11 Time Time NNP 3222 156 12 0.140 0.140 CD 3222 156 13 0.515 0.515 CD 3222 156 14 0.594 0.594 CD 3222 156 15 0.359 0.359 CD 3222 156 16 0.688 0.688 CD 3222 156 17 0.750 0.750 CD 3222 156 18 0.812 0.812 CD 3222 156 19 3.672 3.672 CD 3222 156 20 Table table NN 3222 156 21 7 7 CD 3222 156 22 . . . 
3222 157 1 Compression compression NN 3222 157 2 efficiency efficiency NN 3222 157 3 and and CC 3222 157 4 times time NNS 3222 157 5 for for IN 3222 157 6 the the DT 3222 157 7 RTF RTF NNP 3222 157 8 documents document NNS 3222 157 9 Deflate deflate JJ 3222 157 10 LZMA LZMA NNS 3222 157 11 bzip2 bzip2 VBP 3222 157 12 PPMVC PPMVC NNP 3222 157 13 File File NNP 3222 157 14 Name Name NNP 3222 157 15 gzip gzip NN 3222 157 16 CTDL CTDL NNP 3222 157 17 CTDL+ CTDL+ NNP 3222 157 18 7-zip 7-zip NNP 3222 157 19 CTDL CTDL NNP 3222 157 20 CTDL+ CTDL+ NNP 3222 157 21 13601-t 13601-t VBD 3222 157 22 1.882 1.882 CD 3222 157 23 1.431 1.431 CD 3222 157 24 1.372 1.372 CD 3222 157 25 1.428 1.428 CD 3222 157 26 1.267 1.267 CD 3222 157 27 1.200 1.200 CD 3222 157 28 1.300 1.300 CD 3222 157 29 1.120 1.120 CD 3222 157 30 16514-t 16514-t CD 3222 157 31 0.834 0.834 CD 3222 157 32 0.701 0.701 CD 3222 157 33 0.696 0.696 CD 3222 157 34 0.662 0.662 CD 3222 157 35 0.601 0.601 CD 3222 157 36 0.591 0.591 CD 3222 157 37 0.568 0.568 CD 3222 157 38 0.529 0.529 CD 3222 157 39 1noam10 1noam10 CD 3222 157 40 t t NNP 3222 157 41 2.244 2.244 CD 3222 157 42 1.774 1.774 CD 3222 157 43 1.637 1.637 CD 3222 157 44 1.765 1.765 CD 3222 157 45 1.594 1.594 CD 3222 157 46 1.462 1.462 CD 3222 157 47 1.601 1.601 CD 3222 157 48 1.404 1.404 CD 3222 157 49 2ws2610 2ws2610 CD 3222 157 50 0.784 0.784 CD 3222 157 51 0.630 0.630 CD 3222 157 52 0.581 0.581 CD 3222 157 53 0.629 0.629 CD 3222 157 54 0.545 0.545 CD 3222 157 55 0.500 0.500 CD 3222 157 56 0.520 0.520 CD 3222 157 57 0.485 0.485 CD 3222 157 58 alice30 alice30 CD 3222 157 59 1.382 1.382 CD 3222 157 60 1.196 1.196 CD 3222 157 61 1.065 1.065 CD 3222 157 62 1.134 1.134 CD 3222 157 63 1.046 1.046 CD 3222 157 64 0.948 0.948 CD 3222 157 65 0.995 0.995 CD 3222 157 66 0.922 0.922 CD 3222 157 67 cdscs10 cdscs10 NNP 3222 157 68 t t NNP 3222 157 69 2.059 2.059 CD 3222 157 70 1.882 1.882 CD 3222 157 71 1.558 1.558 CD 3222 157 72 1.784 1.784 CD 3222 157 73 1.704 1.704 CD 3222 157 
74 1.432 1.432 CD 3222 157 75 1.645 1.645 CD 3222 157 76 1.488 1.488 CD 3222 157 77 grimm10 grimm10 NNS 3222 157 78 t t NN 3222 157 79 1.618 1.618 CD 3222 157 80 1.301 1.301 CD 3222 157 81 1.227 1.227 CD 3222 157 82 1.285 1.285 CD 3222 157 83 1.150 1.150 CD 3222 157 84 1.082 1.082 CD 3222 157 85 1.149 1.149 CD 3222 157 86 1.010 1.010 CD 3222 157 87 pandp12 pandp12 NNP 3222 157 88 t t NN 3222 157 89 1.742 1.742 CD 3222 157 90 1.340 1.340 CD 3222 157 91 1.264 1.264 CD 3222 157 92 1.336 1.336 CD 3222 157 93 1.169 1.169 CD 3222 157 94 1.115 1.115 CD 3222 157 95 1.142 1.142 CD 3222 157 96 1.012 1.012 CD 3222 157 97 Average average JJ 3222 157 98 1.568 1.568 CD 3222 157 99 1.282 1.282 CD 3222 157 100 1.175 1.175 CD 3222 157 101 1.253 1.253 CD 3222 157 102 1.135 1.135 CD 3222 157 103 1.041 1.041 CD 3222 157 104 1.115 1.115 CD 3222 157 105 0.996 0.996 CD 3222 157 106 Comp Comp NNP 3222 157 107 . . . 3222 158 1 Time time NN 3222 158 2 0.766 0.766 CD 3222 158 3 2.047 2.047 CD 3222 158 4 1.500 1.500 CD 3222 158 5 12.62 12.62 CD 3222 158 6 6.500 6.500 CD 3222 158 7 7.562 7.562 CD 3222 158 8 8.032 8.032 CD 3222 158 9 3.922 3.922 CD 3222 158 10 Dec. 
December NNP 3222 158 11 Time Time NNP 3222 158 12 0.156 0.156 CD 3222 158 13 0.688 0.688 CD 3222 158 14 0.766 0.766 CD 3222 158 15 0.469 0.469 CD 3222 158 16 0.875 0.875 CD 3222 158 17 0.953 0.953 CD 3222 158 18 1.312 1.312 CD 3222 158 19 4.157 4.157 CD 3222 158 20 THE the DT 3222 158 21 EFFiCiENT efficient PRP$ 3222 158 22 SToraGE storage CD 3222 158 23 oF oF NNP 3222 158 24 TExT text NN 3222 158 25 DoCuMENTS documents NN 3222 158 26 iN in IN 3222 158 27 DiGiTal DiGiTal NNP 3222 158 28 liBrariES libraries NN 3222 158 29 | | NNP 3222 158 30 SkibiŃSki SkibiŃSki NNP 3222 158 31 and and CC 3222 158 32 Swacha Swacha NNP 3222 158 33 151 151 CD 3222 158 34 certainly certainly RB 3222 158 35 seem seem VBP 3222 158 36 good good JJ 3222 158 37 enough enough RB 3222 158 38 for for IN 3222 158 39 practical practical JJ 3222 158 40 applications application NNS 3222 158 41 . . . 3222 159 1 Compared compare VBN 3222 159 2 to to IN 3222 159 3 LZMA lzma NN 3222 159 4 , , , 3222 159 5 CTDL CTDL NNP 3222 159 6 offers offer VBZ 3222 159 7 better well JJR 3222 159 8 compression compression NN 3222 159 9 and and CC 3222 159 10 a a DT 3222 159 11 shorter short JJR 3222 159 12 compression compression NN 3222 159 13 time time NN 3222 159 14 at at IN 3222 159 15 the the DT 3222 159 16 expense expense NN 3222 159 17 of of IN 3222 159 18 longer long JJR 3222 159 19 decompression decompression NN 3222 159 20 time time NN 3222 159 21 . . . 
3222 160 1 Notice notice VB 3222 160 2 that that IN 3222 160 3 the the DT 3222 160 4 absolute absolute JJ 3222 160 5 gain gain NN 3222 160 6 in in IN 3222 160 7 compression compression NN 3222 160 8 time time NN 3222 160 9 is be VBZ 3222 160 10 several several JJ 3222 160 11 times time NNS 3222 160 12 the the DT 3222 160 13 loss loss NN 3222 160 14 in in IN 3222 160 15 decompres- decompres- NNP 3222 160 16 sion sion NN 3222 160 17 time time NN 3222 160 18 , , , 3222 160 19 and and CC 3222 160 20 the the DT 3222 160 21 decompression decompression NN 3222 160 22 time time NN 3222 160 23 remains remain VBZ 3222 160 24 short short JJ 3222 160 25 , , , 3222 160 26 noticeably noticeably RB 3222 160 27 shorter short JJR 3222 160 28 than than IN 3222 160 29 bzip2 bzip2 NNP 3222 160 30 ’s ’s , 3222 160 31 and and CC 3222 160 32 several several JJ 3222 160 33 times time NNS 3222 160 34 shorter short JJR 3222 160 35 than than IN 3222 160 36 PPMVC PPMVC NNP 3222 160 37 ’s ’s NNP 3222 160 38 . . . 3222 161 1 CTDL+ CTDL+ NNP 3222 161 2 beats beat VBZ 3222 161 3 bzip2 bzip2 VBP 3222 161 4 ( ( -LRB- 3222 161 5 with with IN 3222 161 6 the the DT 3222 161 7 sole sole JJ 3222 161 8 excep- excep- XX 3222 161 9 tion tion NN 3222 161 10 of of IN 3222 161 11 PS PS NNP 3222 161 12 documents document NNS 3222 161 13 ) ) -RRB- 3222 161 14 in in IN 3222 161 15 terms term NNS 3222 161 16 of of IN 3222 161 17 compression compression NN 3222 161 18 ratio ratio NN 3222 161 19 and and CC 3222 161 20 achieves achieve VBZ 3222 161 21 results result NNS 3222 161 22 that that WDT 3222 161 23 are be VBP 3222 161 24 mostly mostly RB 3222 161 25 very very RB 3222 161 26 close close JJ 3222 161 27 to to IN 3222 161 28 the the DT 3222 161 29 resource- resource- JJ 3222 161 30 hungry hungry JJ 3222 161 31 PPMVC PPMVC NNP 3222 161 32 . . . 
3222 162 1 n n LS 3222 162 2 Conclusions Conclusions NNPS 3222 162 3 In in IN 3222 162 4 this this DT 3222 162 5 paper paper NN 3222 162 6 we -PRON- PRP 3222 162 7 addressed address VBD 3222 162 8 the the DT 3222 162 9 problem problem NN 3222 162 10 of of IN 3222 162 11 compressing compress VBG 3222 162 12 text text NN 3222 162 13 documents document NNS 3222 162 14 . . . 3222 163 1 Although although IN 3222 163 2 individual individual JJ 3222 163 3 text text NN 3222 163 4 documents document NNS 3222 163 5 rarely rarely RB 3222 163 6 exceed exceed VBP 3222 163 7 several several JJ 3222 163 8 megabytes megabyte NNS 3222 163 9 in in IN 3222 163 10 size size NN 3222 163 11 , , , 3222 163 12 their -PRON- PRP$ 3222 163 13 entire entire JJ 3222 163 14 col- col- NN 3222 163 15 lections lection NNS 3222 163 16 can can MD 3222 163 17 have have VB 3222 163 18 very very RB 3222 163 19 large large JJ 3222 163 20 storage storage NN 3222 163 21 space space NN 3222 163 22 requirements requirement NNS 3222 163 23 . . . 
3222 164 1 Although although IN 3222 164 2 text text NN 3222 164 3 documents document NNS 3222 164 4 are be VBP 3222 164 5 often often RB 3222 164 6 compressed compress VBN 3222 164 7 with with IN 3222 164 8 general general JJ 3222 164 9 - - HYPH 3222 164 10 purpose purpose NN 3222 164 11 methods method NNS 3222 164 12 such such JJ 3222 164 13 as as IN 3222 164 14 Deflate Deflate NNP 3222 164 15 , , , 3222 164 16 much much RB 3222 164 17 better well JJR 3222 164 18 compression compression NN 3222 164 19 can can MD 3222 164 20 be be VB 3222 164 21 obtained obtain VBN 3222 164 22 with with IN 3222 164 23 a a DT 3222 164 24 scheme scheme NN 3222 164 25 specialized specialize VBN 3222 164 26 for for IN 3222 164 27 text text NN 3222 164 28 , , , 3222 164 29 and and CC 3222 164 30 even even RB 3222 164 31 better well RBR 3222 164 32 if if IN 3222 164 33 the the DT 3222 164 34 scheme scheme NN 3222 164 35 is be VBZ 3222 164 36 additionally additionally RB 3222 164 37 specialized specialize VBN 3222 164 38 for for IN 3222 164 39 individual individual JJ 3222 164 40 document document NN 3222 164 41 formats format NNS 3222 164 42 . . . 3222 165 1 We -PRON- PRP 3222 165 2 have have VBP 3222 165 3 developed develop VBN 3222 165 4 such such PDT 3222 165 5 a a DT 3222 165 6 scheme scheme NN 3222 165 7 ( ( -LRB- 3222 165 8 CTDL CTDL NNP 3222 165 9 ) ) -RRB- 3222 165 10 , , , 3222 165 11 beginning begin VBG 3222 165 12 with with IN 3222 165 13 a a DT 3222 165 14 text text NN 3222 165 15 transform transform NN 3222 165 16 designed design VBN 3222 165 17 earlier early RBR 3222 165 18 for for IN 3222 165 19 XML xml NN 3222 165 20 documents document NNS 3222 165 21 and and CC 3222 165 22 Table table NN 3222 165 23 8 8 CD 3222 165 24 . . . 
3222 166 1 Compression compression NN 3222 166 2 efficiency efficiency NN 3222 166 3 and and CC 3222 166 4 times time NNS 3222 166 5 for for IN 3222 166 6 the the DT 3222 166 7 DOC DOC NNP 3222 166 8 documents document NNS 3222 166 9 Deflate deflate JJ 3222 166 10 LZMA LZMA NNS 3222 166 11 bzip2 bzip2 VBP 3222 166 12 PPMVC PPMVC NNP 3222 166 13 File File NNP 3222 166 14 Name Name NNP 3222 166 15 gzip gzip NN 3222 166 16 CTDL CTDL NNP 3222 166 17 CTDL+ CTDL+ NNP 3222 166 18 7-zip 7-zip NNP 3222 166 19 CTDL CTDL NNP 3222 166 20 CTDL+ CTDL+ NNP 3222 166 21 13601-t 13601-t VBD 3222 166 22 2.798 2.798 CD 3222 166 23 2.183 2.183 CD 3222 166 24 2.062 2.062 CD 3222 166 25 2.181 2.181 CD 3222 166 26 1.976 1.976 CD 3222 166 27 1.854 1.854 CD 3222 166 28 2.115 2.115 CD 3222 166 29 1.818 1.818 CD 3222 166 30 16514-t 16514-t CD 3222 166 31 2.226 2.226 CD 3222 166 32 2.213 2.213 CD 3222 166 33 2.073 2.073 CD 3222 166 34 1.712 1.712 CD 3222 166 35 1.712 1.712 CD 3222 166 36 1.652 1.652 CD 3222 166 37 1.919 1.919 CD 3222 166 38 1.686 1.686 CD 3222 166 39 1noam10 1noam10 CD 3222 166 40 t t NN 3222 166 41 2.851 2.851 CD 3222 166 42 2.250 2.250 CD 3222 166 43 2.025 2.025 CD 3222 166 44 2.289 2.289 CD 3222 166 45 2.057 2.057 CD 3222 166 46 1.869 1.869 CD 3222 166 47 2.113 2.113 CD 3222 166 48 1.870 1.870 CD 3222 166 49 2ws2610 2ws2610 CD 3222 166 50 2.497 2.497 CD 3222 166 51 2.499 2.499 CD 3222 166 52 2.210 2.210 CD 3222 166 53 2.095 2.095 CD 3222 166 54 2.095 2.095 CD 3222 166 55 1.890 1.890 CD 3222 166 56 2.251 2.251 CD 3222 166 57 1.999 1.999 CD 3222 166 58 alice30 alice30 CD 3222 166 59 2.744 2.744 CD 3222 166 60 2.714 2.714 CD 3222 166 61 2.270 2.270 CD 3222 166 62 2.345 2.345 CD 3222 166 63 2.345 2.345 CD 3222 166 64 2.038 2.038 CD 3222 166 65 2.348 2.348 CD 3222 166 66 2.058 2.058 CD 3222 166 67 cdscs10 cdscs10 NN 3222 166 68 t t NN 3222 166 69 2.916 2.916 CD 3222 166 70 2.891 2.891 CD 3222 166 71 2.231 2.231 CD 3222 166 72 2.559 2.559 CD 3222 166 73 2.560 2.560 CD 3222 166 74 
2.062 2.062 CD 3222 166 75 2.475 2.475 CD 3222 166 76 2.196 2.196 CD 3222 166 77 grimm10 grimm10 CD 3222 166 78 t t NN 3222 166 79 2.691 2.691 CD 3222 166 80 2.677 2.677 CD 3222 166 81 2.059 2.059 CD 3222 166 82 2.179 2.179 CD 3222 166 83 2.179 2.179 CD 3222 166 84 1.856 1.856 CD 3222 166 85 2.075 2.075 CD 3222 166 86 1.833 1.833 CD 3222 166 87 pandp12 pandp12 NNP 3222 166 88 t t NNP 3222 166 89 2.761 2.761 CD 3222 166 90 2.171 2.171 CD 3222 166 91 2.050 2.050 CD 3222 166 92 2.189 2.189 CD 3222 166 93 1.955 1.955 CD 3222 166 94 1.843 1.843 CD 3222 166 95 1.983 1.983 CD 3222 166 96 1.770 1.770 CD 3222 166 97 Average average JJ 3222 166 98 2.686 2.686 CD 3222 166 99 2.450 2.450 CD 3222 166 100 2.123 2.123 CD 3222 166 101 2.194 2.194 CD 3222 166 102 2.110 2.110 CD 3222 166 103 1.883 1.883 CD 3222 166 104 2.160 2.160 CD 3222 166 105 1.904 1.904 CD 3222 166 106 Comp Comp NNP 3222 166 107 . . . 3222 167 1 Time time NN 3222 167 2 0.718 0.718 CD 3222 167 3 1.312 1.312 CD 3222 167 4 1.031 1.031 CD 3222 167 5 7.078 7.078 CD 3222 167 6 4.063 4.063 CD 3222 167 7 3.001 3.001 CD 3222 167 8 2.250 2.250 CD 3222 167 9 3.421 3.421 CD 3222 167 10 Dec. December NNP 3222 167 11 Time Time NNP 3222 167 12 0.125 0.125 CD 3222 167 13 0.375 0.375 CD 3222 167 14 0.547 0.547 CD 3222 167 15 0.344 0.344 CD 3222 167 16 0.547 0.547 CD 3222 167 17 0.718 0.718 CD 3222 167 18 0.735 0.735 CD 3222 167 19 3.625 3.625 CD 3222 167 20 Table table NN 3222 167 21 9 9 CD 3222 167 22 . . . 
3222 168 1 Compression compression NN 3222 168 2 efficiency efficiency NN 3222 168 3 and and CC 3222 168 4 times time NNS 3222 168 5 for for IN 3222 168 6 the the DT 3222 168 7 PS PS NNP 3222 168 8 documents document NNS 3222 168 9 Deflate Deflate NNP 3222 168 10 LZMA LZMA NNS 3222 168 11 bzip2 bzip2 VBP 3222 168 12 PPMVC PPMVC NNP 3222 168 13 File File NNP 3222 168 14 Name Name NNP 3222 168 15 gzip gzip NN 3222 168 16 CTDL CTDL NNP 3222 168 17 CTDL+ CTDL+ NNP 3222 168 18 7-zip 7-zip NNP 3222 168 19 CTDL CTDL NNP 3222 168 20 CTDL+ CTDL+ NNP 3222 168 21 13601-t 13601-t VBD 3222 168 22 2.847 2.847 CD 3222 168 23 2.634 2.634 CD 3222 168 24 2.589 2.589 CD 3222 168 25 2.213 2.213 CD 3222 168 26 2.105 2.105 CD 3222 168 27 2.074 2.074 CD 3222 168 28 2.011 2.011 CD 3222 168 29 1.778 1.778 CD 3222 168 30 16514-t 16514-t CD 3222 168 31 3.226 3.226 CD 3222 168 32 3.129 3.129 CD 3222 168 33 3.039 3.039 CD 3222 168 34 2.730 2.730 CD 3222 168 35 2.707 2.707 CD 3222 168 36 2.699 2.699 CD 3222 168 37 2.613 2.613 CD 3222 168 38 2.505 2.505 CD 3222 168 39 1noam10 1noam10 CD 3222 168 40 t t NNP 3222 168 41 2.718 2.718 CD 3222 168 42 2.551 2.551 CD 3222 168 43 2.490 2.490 CD 3222 168 44 2.147 2.147 CD 3222 168 45 2.060 2.060 CD 3222 168 46 2.015 2.015 CD 3222 168 47 1.892 1.892 CD 3222 168 48 1.694 1.694 CD 3222 168 49 2ws2610 2ws2610 CD 3222 168 50 3.064 3.064 CD 3222 168 51 2.922 2.922 CD 3222 168 52 2.795 2.795 CD 3222 168 53 2.600 2.600 CD 3222 168 54 2.521 2.521 CD 3222 168 55 2.450 2.450 CD 3222 168 56 2.336 2.336 CD 3222 168 57 2.186 2.186 CD 3222 168 58 alice30 alice30 CD 3222 168 59 3.224 3.224 CD 3222 168 60 3.154 3.154 CD 3222 168 61 3.026 3.026 CD 3222 168 62 2.750 2.750 CD 3222 168 63 2.745 2.745 CD 3222 168 64 2.691 2.691 CD 3222 168 65 2.553 2.553 CD 3222 168 66 2.400 2.400 CD 3222 168 67 cdscs10 cdscs10 NN 3222 168 68 t t NNP 3222 168 69 3.110 3.110 CD 3222 168 70 3.029 3.029 CD 3222 168 71 2.890 2.890 CD 3222 168 72 2.657 2.657 CD 3222 168 73 2.683 2.683 CD 3222 168 
74 2.579 2.579 CD 3222 168 75 2.447 2.447 CD 3222 168 76 2.276 2.276 CD 3222 168 77 grimm10 grimm10 NNS 3222 168 78 t t NN 3222 168 79 2.833 2.833 CD 3222 168 80 2.664 2.664 CD 3222 168 81 2.597 2.597 CD 3222 168 82 2.288 2.288 CD 3222 168 83 2.200 2.200 CD 3222 168 84 2.162 2.162 CD 3222 168 85 2.074 2.074 CD 3222 168 86 1.863 1.863 CD 3222 168 87 pandp12 pandp12 NNP 3222 168 88 t t NN 3222 168 89 2.814 2.814 CD 3222 168 90 2.533 2.533 CD 3222 168 91 2.468 2.468 CD 3222 168 92 2.193 2.193 CD 3222 168 93 2.049 2.049 CD 3222 168 94 1.998 1.998 CD 3222 168 95 1.858 1.858 CD 3222 168 96 1.644 1.644 CD 3222 168 97 Average average JJ 3222 168 98 2.980 2.980 CD 3222 168 99 2.827 2.827 CD 3222 168 100 2.737 2.737 CD 3222 168 101 2.447 2.447 CD 3222 168 102 2.384 2.384 CD 3222 168 103 2.334 2.334 CD 3222 168 104 2.223 2.223 CD 3222 168 105 2.043 2.043 CD 3222 168 106 Comp Comp NNP 3222 168 107 . . . 3222 169 1 Time Time NNP 3222 169 2 1.328 1.328 CD 3222 169 3 3.015 3.015 CD 3222 169 4 2.500 2.500 CD 3222 169 5 14.23 14.23 CD 3222 169 6 10.96 10.96 CD 3222 169 7 11.09 11.09 CD 3222 169 8 4.171 4.171 CD 3222 169 9 5.765 5.765 CD 3222 169 10 Dec. 
December NNP 3222 169 11 Time Time NNP 3222 169 12 0.203 0.203 CD 3222 169 13 0.688 0.688 CD 3222 169 14 0.781 0.781 CD 3222 169 15 0.609 0.609 CD 3222 169 16 1.063 1.063 CD 3222 169 17 1.125 1.125 CD 3222 169 18 1.360 1.360 CD 3222 169 19 6.063 6.063 CD 3222 169 20 152 152 CD 3222 169 21 iNForMaTioN iNForMaTioN NNS 3222 169 22 TECHNoloGY technology NN 3222 169 23 aND and CC 3222 169 24 liBrariES liBrariES NNP 3222 169 25 | | NNP 3222 169 26 SEpTEMBEr september CD 3222 169 27 2009 2009 CD 3222 169 28 modifying modify VBG 3222 169 29 it -PRON- PRP 3222 169 30 for for IN 3222 169 31 the the DT 3222 169 32 requirements requirement NNS 3222 169 33 of of IN 3222 169 34 each each DT 3222 169 35 of of IN 3222 169 36 the the DT 3222 169 37 investigated investigate VBN 3222 169 38 docu- docu- NN 3222 169 39 ment ment JJ 3222 169 40 formats format NNS 3222 169 41 . . . 3222 170 1 It -PRON- PRP 3222 170 2 has have VBZ 3222 170 3 two two CD 3222 170 4 operation operation NN 3222 170 5 modes mode NNS 3222 170 6 : : : 3222 170 7 basic basic JJ 3222 170 8 CTDL CTDL NNP 3222 170 9 and and CC 3222 170 10 CTDL+ CTDL+ NNP 3222 170 11 ( ( -LRB- 3222 170 12 the the DT 3222 170 13 latter latter JJ 3222 170 14 uses use VBZ 3222 170 15 a a DT 3222 170 16 common common JJ 3222 170 17 word word NN 3222 170 18 dictionary dictionary JJ 3222 170 19 for for IN 3222 170 20 improved improve VBN 3222 170 21 compres- compres- NN 3222 170 22 sion sion NN 3222 170 23 ) ) -RRB- 3222 170 24 and and CC 3222 170 25 uses use VBZ 3222 170 26 two two CD 3222 170 27 back back JJ 3222 170 28 - - HYPH 3222 170 29 end end NN 3222 170 30 com- com- NN 3222 170 31 pression pression NN 3222 170 32 algorithms algorithm NNS 3222 170 33 : : : 3222 170 34 Deflate deflate NN 3222 170 35 and and CC 3222 170 36 LZMA lzma NN 3222 170 37 ( ( -LRB- 3222 170 38 differing differ VBG 3222 170 39 in in IN 3222 170 40 compression compression NN 3222 170 41 speed speed NN 3222 170 42 and and CC 3222 170 43 efficiency efficiency NN 
3222 170 44 ) ) -RRB- 3222 170 45 . . . 3222 171 1 The the DT 3222 171 2 improvement improvement NN 3222 171 3 in in IN 3222 171 4 com- com- NN 3222 171 5 pression pression NN 3222 171 6 efficiency efficiency NN 3222 171 7 , , , 3222 171 8 which which WDT 3222 171 9 can can MD 3222 171 10 be be VB 3222 171 11 observed observe VBN 3222 171 12 in in IN 3222 171 13 the the DT 3222 171 14 experimental experimental JJ 3222 171 15 results result NNS 3222 171 16 , , , 3222 171 17 amounts amount VBZ 3222 171 18 to to IN 3222 171 19 a a DT 3222 171 20 significant significant JJ 3222 171 21 reduction reduction NN 3222 171 22 of of IN 3222 171 23 data datum NNS 3222 171 24 storage storage NN 3222 171 25 require- require- NN 3222 171 26 ments ment NNS 3222 171 27 , , , 3222 171 28 giving give VBG 3222 171 29 the the DT 3222 171 30 reasons reason NNS 3222 171 31 to to TO 3222 171 32 use use VB 3222 171 33 the the DT 3222 171 34 library library NN 3222 171 35 in in IN 3222 171 36 both both CC 3222 171 37 new new JJ 3222 171 38 and and CC 3222 171 39 exist- exist- JJ 3222 171 40 ing ing NNP 3222 171 41 digital digital JJ 3222 171 42 library library NN 3222 171 43 projects project NNS 3222 171 44 instead instead RB 3222 171 45 of of IN 3222 171 46 general general JJ 3222 171 47 - - HYPH 3222 171 48 purpose purpose NN 3222 171 49 compression compression NN 3222 171 50 programs program NNS 3222 171 51 . . . 
3222 172 1 To to TO 3222 172 2 facilitate facilitate VB 3222 172 3 this this DT 3222 172 4 pro- pro- NN 3222 172 5 cess cess NN 3222 172 6 , , , 3222 172 7 we -PRON- PRP 3222 172 8 implemented implement VBD 3222 172 9 the the DT 3222 172 10 scheme scheme NN 3222 172 11 as as IN 3222 172 12 an an DT 3222 172 13 open open JJ 3222 172 14 - - HYPH 3222 172 15 source source NN 3222 172 16 software software NN 3222 172 17 library library NN 3222 172 18 under under IN 3222 172 19 the the DT 3222 172 20 same same JJ 3222 172 21 name name NN 3222 172 22 , , , 3222 172 23 freely freely RB 3222 172 24 avail- avail- XX 3222 172 25 able able JJ 3222 172 26 at at IN 3222 172 27 http://www.ii.uni.wroc http://www.ii.uni.wroc NNP 3222 172 28 . . . 3222 173 1 p p NNP 3222 173 2 l l NN 3222 173 3 / / SYM 3222 173 4 ~ ~ NFP 3222 173 5 i i NN 3222 173 6 n n CC 3222 173 7 i i NNP 3222 173 8 k k NNP 3222 173 9 e e NNP 3222 173 10 p p NNP 3222 173 11 / / SYM 3222 173 12 re re NNP 3222 173 13 s s NNP 3222 173 14 e e NNP 3222 173 15 a a DT 3222 173 16 rc rc NNP 3222 173 17 h h NNP 3222 173 18 / / SYM 3222 173 19 C c NN 3222 173 20 T t NN 3222 173 21 D D NNP 3222 173 22 L l NN 3222 173 23 / / SYM 3222 173 24 CTDL09.zip CTDL09.zip NNP 3222 173 25 . . . 
3222 174 1 Although although IN 3222 174 2 the the DT 3222 174 3 scheme scheme NN 3222 174 4 and and CC 3222 174 5 the the DT 3222 174 6 library library NN 3222 174 7 are be VBP 3222 174 8 now now RB 3222 174 9 complete complete JJ 3222 174 10 , , , 3222 174 11 we -PRON- PRP 3222 174 12 plan plan VBP 3222 174 13 future future JJ 3222 174 14 extensions extension NNS 3222 174 15 aiming aim VBG 3222 174 16 both both DT 3222 174 17 to to TO 3222 174 18 increase increase VB 3222 174 19 the the DT 3222 174 20 level level NN 3222 174 21 of of IN 3222 174 22 specializa- specializa- JJ 3222 174 23 tions tion NNS 3222 174 24 for for IN 3222 174 25 currently currently RB 3222 174 26 handled handle VBN 3222 174 27 docu- docu- NN 3222 174 28 ment ment JJ 3222 174 29 formats format NNS 3222 174 30 and and CC 3222 174 31 to to TO 3222 174 32 extend extend VB 3222 174 33 the the DT 3222 174 34 list list NN 3222 174 35 of of IN 3222 174 36 handled handle VBN 3222 174 37 document document NN 3222 174 38 formats format NNS 3222 174 39 . . . 3222 175 1 Table table NN 3222 175 2 10 10 CD 3222 175 3 . . . 
3222 176 1 Compression compression NN 3222 176 2 efficiency efficiency NN 3222 176 3 and and CC 3222 176 4 times time NNS 3222 176 5 for for IN 3222 176 6 the the DT 3222 176 7 ( ( -LRB- 3222 176 8 uncompressed uncompressed JJ 3222 176 9 ) ) -RRB- 3222 176 10 PDF PDF NNP 3222 176 11 documents document NNS 3222 176 12 Deflate deflate JJ 3222 176 13 LZMA LZMA NNS 3222 176 14 bzip2 bzip2 VBP 3222 176 15 PPMVC PPMVC NNP 3222 176 16 File File NNP 3222 176 17 Name Name NNP 3222 176 18 native native JJ 3222 176 19 gzip gzip NN 3222 176 20 CTDL CTDL NNP 3222 176 21 CTDL+ CTDL+ NNP 3222 176 22 7-zip 7-zip NNP 3222 176 23 CTDL CTDL NNP 3222 176 24 CTDL+ CTDL+ NNP 3222 176 25 13601-t 13601-t VBD 3222 176 26 3.443 3.443 CD 3222 176 27 2.624 2.624 CD 3222 176 28 2.191 2.191 CD 3222 176 29 2.200 2.200 CD 3222 176 30 1.986 1.986 CD 3222 176 31 1.708 1.708 CD 3222 176 32 1.656 1.656 CD 3222 176 33 1.852 1.852 CD 3222 176 34 1.659 1.659 CD 3222 176 35 16514-t 16514-t CD 3222 176 36 4.370 4.370 CD 3222 176 37 2.839 2.839 CD 3222 176 38 2.836 2.836 CD 3222 176 39 2.810 2.810 CD 3222 176 40 2.422 2.422 CD 3222 176 41 2.422 2.422 CD 3222 176 42 2.328 2.328 CD 3222 176 43 2.378 2.378 CD 3222 176 44 2.241 2.241 CD 3222 176 45 1noam10 1noam10 CD 3222 176 46 t t XX 3222 176 47 3.379 3.379 CD 3222 176 48 2.522 2.522 CD 3222 176 49 2.103 2.103 CD 3222 176 50 2.094 2.094 CD 3222 176 51 1.924 1.924 CD 3222 176 52 1.659 1.659 CD 3222 176 53 1.603 1.603 CD 3222 176 54 1.770 1.770 CD 3222 176 55 1.587 1.587 CD 3222 176 56 2ws2610 2ws2610 CD 3222 176 57 3.519 3.519 CD 3222 176 58 2.204 2.204 CD 3222 176 59 2.346 2.346 CD 3222 176 60 2.248 2.248 CD 3222 176 61 1.781 1.781 CD 3222 176 62 1.947 1.947 CD 3222 176 63 1.860 1.860 CD 3222 176 64 1.625 1.625 CD 3222 176 65 1.480 1.480 CD 3222 176 66 alice30 alice30 CD 3222 176 67 3.886 3.886 CD 3222 176 68 2.863 2.863 CD 3222 176 69 2.753 2.753 CD 3222 176 70 2.668 2.668 CD 3222 176 71 2.429 2.429 CD 3222 176 72 2.308 2.308 CD 3222 176 73 2.216 2.216 CD 
3222 176 74 2.315 2.315 CD 3222 176 75 2.137 2.137 CD 3222 176 76 cdscs10 cdscs10 NN 3222 176 77 t t NN 3222 176 78 3.684 3.684 CD 3222 176 79 2.835 2.835 CD 3222 176 80 2.688 2.688 CD 3222 176 81 2.557 2.557 CD 3222 176 82 2.399 2.399 CD 3222 176 83 2.276 2.276 CD 3222 176 84 2.164 2.164 CD 3222 176 85 2.260 2.260 CD 3222 176 86 2.079 2.079 CD 3222 176 87 grimm10 grimm10 CD 3222 176 88 t t NN 3222 176 89 3.543 3.543 CD 3222 176 90 2.557 2.557 CD 3222 176 91 2.135 2.135 CD 3222 176 92 2.120 2.120 CD 3222 176 93 2.008 2.008 CD 3222 176 94 1.713 1.713 CD 3222 176 95 1.661 1.661 CD 3222 176 96 1.858 1.858 CD 3222 176 97 1.696 1.696 CD 3222 176 98 pandp12 pandp12 NN 3222 176 99 t t NNP 3222 176 100 3.552 3.552 CD 3222 176 101 2.684 2.684 CD 3222 176 102 2.267 2.267 CD 3222 176 103 2.256 2.256 CD 3222 176 104 2.071 2.071 CD 3222 176 105 1.831 1.831 CD 3222 176 106 1.769 1.769 CD 3222 176 107 1.870 1.870 CD 3222 176 108 1.705 1.705 CD 3222 176 109 Average average JJ 3222 176 110 3.672 3.672 CD 3222 176 111 2.641 2.641 CD 3222 176 112 2.415 2.415 CD 3222 176 113 2.369 2.369 CD 3222 176 114 2.128 2.128 CD 3222 176 115 1.983 1.983 CD 3222 176 116 1.907 1.907 CD 3222 176 117 1.991 1.991 CD 3222 176 118 1.823 1.823 CD 3222 176 119 Comp Comp NNP 3222 176 120 . . . 3222 177 1 Time Time NNP 3222 177 2 n n NNP 3222 177 3 / / SYM 3222 177 4 a a DT 3222 177 5 1.594 1.594 CD 3222 177 6 3.672 3.672 CD 3222 177 7 3.250 3.250 CD 3222 177 8 19.62 19.62 CD 3222 177 9 13.31 13.31 CD 3222 177 10 16.32 16.32 CD 3222 177 11 5.641 5.641 CD 3222 177 12 7.375 7.375 CD 3222 177 13 Dec. December NNP 3222 177 14 Time Time NNP 3222 177 15 n n NNP 3222 177 16 / / SYM 3222 177 17 a a DT 3222 177 18 0.219 0.219 CD 3222 177 19 0.844 0.844 CD 3222 177 20 0.969 0.969 CD 3222 177 21 0.719 0.719 CD 3222 177 22 1.219 1.219 CD 3222 177 23 1.360 1.360 CD 3222 177 24 1.765 1.765 CD 3222 177 25 7.859 7.859 CD 3222 177 26 Figure figure NN 3222 177 27 1 1 CD 3222 177 28 . . . 
3222 178 1 Compression compression NN 3222 178 2 improvement improvement NN 3222 178 3 relative relative JJ 3222 178 4 to to TO 3222 178 5 gzip gzip VB 3222 178 6 THE THE NNP 3222 178 7 EFFiCiENT EFFiCiENT NNP 3222 178 8 SToraGE storage CD 3222 178 9 oF oF NNP 3222 178 10 TExT text NN 3222 178 11 DoCuMENTS documents NN 3222 178 12 iN in IN 3222 178 13 DiGiTal DiGiTal NNP 3222 178 14 liBrariES libraries NN 3222 178 15 | | NNP 3222 178 16 SkibiŃSki SkibiŃSki NNP 3222 178 17 and and CC 3222 178 18 Swacha Swacha NNP 3222 178 19 153 153 CD 3222 178 20 Acknowledgements Acknowledgements NNPS 3222 178 21 Szymon Szymon NNP 3222 178 22 Grabowski Grabowski NNP 3222 178 23 is be VBZ 3222 178 24 the the DT 3222 178 25 coauthor coauthor NN 3222 178 26 of of IN 3222 178 27 the the DT 3222 178 28 XML XML NNP 3222 178 29 - - HYPH 3222 178 30 WRT WRT NNP 3222 178 31 transform transform NN 3222 178 32 , , , 3222 178 33 which which WDT 3222 178 34 served serve VBD 3222 178 35 as as IN 3222 178 36 the the DT 3222 178 37 basis basis NN 3222 178 38 for for IN 3222 178 39 the the DT 3222 178 40 CTDL CTDL NNP 3222 178 41 library library NN 3222 178 42 . . . 3222 179 1 References reference NNS 3222 179 2 1 1 CD 3222 179 3 . . . 3222 180 1 John John NNP 3222 180 2 F. F. NNP 3222 180 3 Gantz Gantz NNP 3222 180 4 et et FW 3222 180 5 al al NNP 3222 180 6 . . NNP 3222 180 7 , , , 3222 180 8 The the DT 3222 180 9 Diverse diverse JJ 3222 180 10 and and CC 3222 180 11 Exploding Exploding NNP 3222 180 12 Digital Digital NNP 3222 180 13 Universe Universe NNP 3222 180 14 : : : 3222 180 15 An an DT 3222 180 16 Updated Updated NNP 3222 180 17 Forecast Forecast NNP 3222 180 18 of of IN 3222 180 19 Worldwide Worldwide NNP 3222 180 20 Information Information NNP 3222 180 21 Growth Growth NNP 3222 180 22 Through through IN 3222 180 23 2011 2011 CD 3222 180 24 ( ( -LRB- 3222 180 25 Framingham Framingham NNP 3222 180 26 , , , 3222 180 27 Mass. 
Massachusetts NNP 3222 181 1 : : : 3222 181 2 IDC IDC NNP 3222 181 3 , , , 3222 181 4 2008 2008 CD 3222 181 5 ) ) -RRB- 3222 181 6 , , , 3222 181 7 http://www http://www ADD 3222 181 8 .emc.com .emc.com NNP 3222 181 9 / / SYM 3222 181 10 collateral collateral NN 3222 181 11 / / SYM 3222 181 12 analyst analyst NN 3222 181 13 - - HYPH 3222 181 14 reports report NNS 3222 181 15 / / SYM 3222 181 16 diverse diverse RB 3222 181 17 - - HYPH 3222 181 18 exploding explode VBG 3222 181 19 - - HYPH 3222 181 20 digital digital NNP 3222 181 21 -universe.pdf -universe.pdf , 3222 181 22 ( ( -LRB- 3222 181 23 accessed access VBN 3222 181 24 May May NNP 3222 181 25 7 7 CD 3222 181 26 , , , 3222 181 27 2009 2009 CD 3222 181 28 ) ) -RRB- 3222 181 29 . . . 3222 182 1 2 2 LS 3222 182 2 . . . 3222 183 1 Timothy Timothy NNP 3222 183 2 C. C. NNP 3222 183 3 Bell Bell NNP 3222 183 4 , , , 3222 183 5 Alistair Alistair NNP 3222 183 6 Moffat Moffat NNP 3222 183 7 , , , 3222 183 8 and and CC 3222 183 9 Ian Ian NNP 3222 183 10 H. H. NNP 3222 183 11 Witten Witten NNP 3222 183 12 , , , 3222 183 13 “ " `` 3222 183 14 Com- Com- NNP 3222 183 15 pressing press VBG 3222 183 16 the the DT 3222 183 17 Digital Digital NNP 3222 183 18 Library Library NNP 3222 183 19 , , , 3222 183 20 ” " '' 3222 183 21 in in IN 3222 183 22 Proceedings Proceedings NNP 3222 183 23 of of IN 3222 183 24 Digital Digital NNP 3222 183 25 Libraries Libraries NNPS 3222 183 26 ‘ ‘ POS 3222 183 27 94 94 CD 3222 183 28 ( ( -LRB- 3222 183 29 College College NNP 3222 183 30 Station Station NNP 3222 183 31 : : : 3222 183 32 Texas Texas NNP 3222 183 33 A&M A&M NNP 3222 183 34 Univ Univ NNP 3222 183 35 . . . 3222 184 1 1994 1994 CD 3222 184 2 ) ) -RRB- 3222 184 3 : : : 3222 184 4 41 41 CD 3222 184 5 . . . 3222 185 1 3 3 LS 3222 185 2 . . . 3222 186 1 Ian Ian NNP 3222 186 2 H. H. 
NNP 3222 186 3 Witten Witten NNP 3222 186 4 and and CC 3222 186 5 David David NNP 3222 186 6 Bainbridge Bainbridge NNP 3222 186 7 , , , 3222 186 8 How how WRB 3222 186 9 to to TO 3222 186 10 Build build VB 3222 186 11 a a DT 3222 186 12 Digital Digital NNP 3222 186 13 Library Library NNP 3222 186 14 ( ( -LRB- 3222 186 15 San San NNP 3222 186 16 Francisco Francisco NNP 3222 186 17 : : : 3222 186 18 Morgan Morgan NNP 3222 186 19 Kaufmann Kaufmann NNP 3222 186 20 , , , 3222 186 21 2002 2002 CD 3222 186 22 ) ) -RRB- 3222 186 23 . . . 3222 187 1 4 4 LS 3222 187 2 . . . 3222 188 1 Chad Chad NNP 3222 188 2 M. M. NNP 3222 188 3 Kahl Kahl NNP 3222 188 4 and and CC 3222 188 5 Sarah Sarah NNP 3222 188 6 C. C. NNP 3222 188 7 Williams Williams NNP 3222 188 8 , , , 3222 188 9 “ " `` 3222 188 10 Accessing Accessing NNP 3222 188 11 Digital Digital NNP 3222 188 12 Libraries library NNS 3222 188 13 : : : 3222 188 14 A a DT 3222 188 15 Study Study NNP 3222 188 16 of of IN 3222 188 17 ARL ARL NNP 3222 188 18 Members Members NNPS 3222 188 19 ’ ’ POS 3222 188 20 Digital Digital NNP 3222 188 21 Projects Projects NNPS 3222 188 22 , , , 3222 188 23 ” " '' 3222 188 24 The the DT 3222 188 25 Jour- Jour- NNP 3222 188 26 nal nal NN 3222 188 27 of of IN 3222 188 28 Academic Academic NNP 3222 188 29 Librarianship Librarianship NNP 3222 188 30 32 32 CD 3222 188 31 , , , 3222 188 32 no no UH 3222 188 33 . . . 3222 189 1 4 4 CD 3222 189 2 ( ( -LRB- 3222 189 3 2006 2006 CD 3222 189 4 ) ) -RRB- 3222 189 5 : : : 3222 189 6 364 364 CD 3222 189 7 . . . 3222 190 1 5 5 CD 3222 190 2 . . . 3222 191 1 Donald Donald NNP 3222 191 2 E. E. NNP 3222 191 3 Knuth Knuth NNP 3222 191 4 , , , 3222 191 5 TeX TeX VBD 3222 191 6 : : : 3222 191 7 The the DT 3222 191 8 Program Program NNP 3222 191 9 ( ( -LRB- 3222 191 10 Reading Reading NNP 3222 191 11 , , , 3222 191 12 Mass. 
Massachusetts NNP 3222 192 1 : : : 3222 192 2 Addison Addison NNP 3222 192 3 - - HYPH 3222 192 4 Wesley Wesley NNP 3222 192 5 , , , 3222 192 6 1986 1986 CD 3222 192 7 ) ) -RRB- 3222 192 8 ; ; : 3222 192 9 Microsoft Microsoft NNP 3222 192 10 Technical Technical NNP 3222 192 11 Support Support NNP 3222 192 12 , , , 3222 192 13 Rich Rich NNP 3222 192 14 Text Text NNP 3222 192 15 For- For- NNP 3222 192 16 mat mat NN 3222 192 17 ( ( -LRB- 3222 192 18 RTF RTF NNP 3222 192 19 ) ) -RRB- 3222 192 20 Version version NN 3222 192 21 1.5 1.5 CD 3222 192 22 Specification Specification NNP 3222 192 23 , , , 3222 192 24 1997 1997 CD 3222 192 25 , , , 3222 192 26 http://www.biblioscape http://www.biblioscape NNP 3222 192 27 .com .com . 3222 192 28 / / SYM 3222 192 29 rtf15_spec.htm rtf15_spec.htm NNP 3222 192 30 ( ( -LRB- 3222 192 31 accessed access VBN 3222 192 32 May May NNP 3222 192 33 7 7 CD 3222 192 34 , , , 3222 192 35 2009 2009 CD 3222 192 36 ) ) -RRB- 3222 192 37 ; ; : 3222 192 38 Tim Tim NNP 3222 192 39 Bray Bray NNP 3222 192 40 et et NNP 3222 192 41 al al NNP 3222 192 42 . . NNP 3222 192 43 , , , 3222 192 44 eds eds NNP 3222 192 45 . . 
NNP 3222 192 46 , , , 3222 192 47 Extensible Extensible NNP 3222 192 48 Markup Markup NNP 3222 192 49 Language Language NNP 3222 192 50 ( ( -LRB- 3222 192 51 XML xml NN 3222 192 52 ) ) -RRB- 3222 192 53 1.0 1.0 CD 3222 192 54 ( ( -LRB- 3222 192 55 Fourth Fourth NNP 3222 192 56 Edition Edition NNP 3222 192 57 ) ) -RRB- 3222 192 58 , , , 3222 192 59 2006 2006 CD 3222 192 60 , , , 3222 192 61 http://www.w3.org/TR/2006/REC-xml-20060816 http://www.w3.org/TR/2006/REC-xml-20060816 NNP 3222 192 62 ( ( -LRB- 3222 192 63 accessed access VBN 3222 192 64 May May NNP 3222 192 65 7 7 CD 3222 192 66 , , , 3222 192 67 2009 2009 CD 3222 192 68 ) ) -RRB- 3222 192 69 ; ; : 3222 192 70 Dave Dave NNP 3222 192 71 Raggett Raggett NNP 3222 192 72 , , , 3222 192 73 Arnaud Arnaud NNP 3222 192 74 Le Le NNP 3222 192 75 Hors Hors NNP 3222 192 76 , , , 3222 192 77 and and CC 3222 192 78 Ian Ian NNP 3222 192 79 Jacobs Jacobs NNP 3222 192 80 , , , 3222 192 81 eds eds NNP 3222 192 82 . . NNP 3222 192 83 , , , 3222 192 84 W3C W3C NNP 3222 192 85 HTML HTML NNP 3222 192 86 4.01 4.01 CD 3222 192 87 Specification Specification NNP 3222 192 88 , , , 3222 192 89 1999 1999 CD 3222 192 90 , , , 3222 192 91 http://www.w3.org/ http://www.w3.org/ NNP 3222 192 92 TR TR NNP 3222 192 93 / / SYM 3222 192 94 REC REC NNP 3222 192 95 - - HYPH 3222 192 96 html40/ html40/ NNP 3222 192 97 ( ( -LRB- 3222 192 98 accessed access VBN 3222 192 99 May May NNP 3222 192 100 7 7 CD 3222 192 101 , , , 3222 192 102 2009 2009 CD 3222 192 103 ) ) -RRB- 3222 192 104 ; ; : 3222 192 105 PostScript PostScript NNP 3222 192 106 Language Language NNP 3222 192 107 Reference Reference NNP 3222 192 108 , , , 3222 192 109 3rd 3rd JJ 3222 192 110 ed ed NN 3222 192 111 . . . 3222 193 1 ( ( -LRB- 3222 193 2 Reading Reading NNP 3222 193 3 , , , 3222 193 4 Mass. 
Massachusetts NNP 3222 194 1 : : : 3222 194 2 Addison Addison NNP 3222 194 3 - - HYPH 3222 194 4 Wesley Wesley NNP 3222 194 5 , , , 3222 194 6 1999 1999 CD 3222 194 7 ) ) -RRB- 3222 194 8 , , , 3222 194 9 http://www.adobe.com/devnet/postscript/pdfs/PLRM.pdf http://www.adobe.com/devnet/postscript/pdfs/PLRM.pdf NNP 3222 194 10 ( ( -LRB- 3222 194 11 accessed access VBN 3222 194 12 May May NNP 3222 194 13 7 7 CD 3222 194 14 , , , 3222 194 15 2009 2009 CD 3222 194 16 ) ) -RRB- 3222 194 17 ; ; : 3222 194 18 PDF PDF NNP 3222 194 19 Reference Reference NNP 3222 194 20 , , , 3222 194 21 6th 6th JJ 3222 194 22 ed ed NNP 3222 194 23 . . NNP 3222 194 24 , , , 3222 194 25 version version NN 3222 194 26 1.7 1.7 CD 3222 194 27 , , , 3222 194 28 2006 2006 CD 3222 194 29 , , , 3222 194 30 http://www.adobe.com/devnet/acrobat/pdfs/pdf http://www.adobe.com/devnet/acrobat/pdfs/pdf NNP 3222 194 31 _ _ NNP 3222 194 32 reference_1-7.pdf reference_1-7.pdf NNP 3222 194 33 ( ( -LRB- 3222 194 34 accessed access VBN 3222 194 35 May May NNP 3222 194 36 7 7 CD 3222 194 37 , , , 3222 194 38 2009 2009 CD 3222 194 39 ) ) -RRB- 3222 194 40 . . . 3222 195 1 6 6 CD 3222 195 2 . . . 3222 196 1 Jacob Jacob NNP 3222 196 2 Ziv Ziv NNP 3222 196 3 and and CC 3222 196 4 Abraham Abraham NNP 3222 196 5 Lempel Lempel NNP 3222 196 6 , , , 3222 196 7 “ " `` 3222 196 8 A a DT 3222 196 9 Universal Universal NNP 3222 196 10 Algorithm Algorithm NNP 3222 196 11 for for IN 3222 196 12 Sequential Sequential NNP 3222 196 13 Data Data NNP 3222 196 14 Compression Compression NNP 3222 196 15 , , , 3222 196 16 ” " '' 3222 196 17 IEEE IEEE NNP 3222 196 18 Transactions transaction NNS 3222 196 19 on on IN 3222 196 20 Informa- Informa- NNP 3222 196 21 tion tion NN 3222 196 22 Theory theory NN 3222 196 23 23 23 CD 3222 196 24 , , , 3222 196 25 no no UH 3222 196 26 . . . 3222 197 1 3 3 CD 3222 197 2 ( ( -LRB- 3222 197 3 1977 1977 CD 3222 197 4 ) ) -RRB- 3222 197 5 : : : 3222 197 6 337 337 CD 3222 197 7 . . . 
3222 198 1 7 7 LS 3222 198 2 . . . 3222 199 1 Ian Ian NNP 3222 199 2 H. H. NNP 3222 199 3 Witten Witten NNP 3222 199 4 , , , 3222 199 5 Alistair Alistair NNP 3222 199 6 Moffat Moffat NNP 3222 199 7 , , , 3222 199 8 and and CC 3222 199 9 Timothy Timothy NNP 3222 199 10 C. C. NNP 3222 199 11 Bell Bell NNP 3222 199 12 , , , 3222 199 13 Man- Man- NNP 3222 199 14 aging age VBG 3222 199 15 Gigabytes gigabyte NNS 3222 199 16 : : : 3222 199 17 Compressing compressing NN 3222 199 18 and and CC 3222 199 19 Indexing Indexing NNP 3222 199 20 Documents Documents NNPS 3222 199 21 and and CC 3222 199 22 Images Images NNPS 3222 199 23 , , , 3222 199 24 2nd 2nd JJ 3222 199 25 ed ed NN 3222 199 26 . . . 3222 200 1 ( ( -LRB- 3222 200 2 San San NNP 3222 200 3 Francisco Francisco NNP 3222 200 4 : : : 3222 200 5 Morgan Morgan NNP 3222 200 6 Kaufmann Kaufmann NNP 3222 200 7 , , , 3222 200 8 1999 1999 CD 3222 200 9 ) ) -RRB- 3222 200 10 . . . 3222 201 1 8 8 LS 3222 201 2 . . . 3222 202 1 John John NNP 3222 202 2 G. G. NNP 3222 202 3 Cleary Cleary NNP 3222 202 4 and and CC 3222 202 5 Ian Ian NNP 3222 202 6 H. H. NNP 3222 202 7 Witten Witten NNP 3222 202 8 , , , 3222 202 9 “ " `` 3222 202 10 Data Data NNPS 3222 202 11 Compression Compression NNP 3222 202 12 using use VBG 3222 202 13 Adaptive Adaptive NNP 3222 202 14 Coding Coding NNP 3222 202 15 and and CC 3222 202 16 Partial Partial NNP 3222 202 17 String String NNP 3222 202 18 Matching Matching NNP 3222 202 19 , , , 3222 202 20 ” " '' 3222 202 21 IEEE IEEE NNP 3222 202 22 Transactions transaction NNS 3222 202 23 on on IN 3222 202 24 Communication Communication NNP 3222 202 25 32 32 CD 3222 202 26 , , , 3222 202 27 no no UH 3222 202 28 . . . 3222 203 1 4 4 LS 3222 203 2 , , , 3222 203 3 ( ( -LRB- 3222 203 4 1984 1984 CD 3222 203 5 ) ) -RRB- 3222 203 6 : : : 3222 203 7 396 396 CD 3222 203 8 ; ; : 3222 203 9 Michael Michael NNP 3222 203 10 Burrows Burrows NNP 3222 203 11 and and CC 3222 203 12 David David NNP 3222 203 13 J. J. 
NNP 3222 203 14 Wheeler Wheeler NNP 3222 203 15 , , , 3222 203 16 “ " `` 3222 203 17 A a DT 3222 203 18 Block block NN 3222 203 19 - - HYPH 3222 203 20 Sorting Sorting NNP 3222 203 21 Lossless Lossless NNP 3222 203 22 Data Data NNP 3222 203 23 Compression Compression NNP 3222 203 24 Algorithm Algorithm NNP 3222 203 25 , , , 3222 203 26 ” " '' 3222 203 27 Digital Digital NNP 3222 203 28 Equipment Equipment NNP 3222 203 29 Corporation Corporation NNP 3222 203 30 SRC SRC NNP 3222 203 31 Research Research NNP 3222 203 32 Report Report NNP 3222 203 33 124 124 CD 3222 203 34 , , , 3222 203 35 1994 1994 CD 3222 203 36 , , , 3222 203 37 www.hpl.hp.com/techreports/ www.hpl.hp.com/techreports/ NNP 3222 203 38 Compaq Compaq NNP 3222 203 39 - - HYPH 3222 203 40 DEC DEC NNP 3222 203 41 / / SYM 3222 203 42 SRC SRC NNP 3222 203 43 - - HYPH 3222 203 44 RR-124.pdf RR-124.pdf NNP 3222 203 45 ( ( -LRB- 3222 203 46 accessed access VBN 3222 203 47 May May NNP 3222 203 48 7 7 CD 3222 203 49 , , , 3222 203 50 2009 2009 CD 3222 203 51 ) ) -RRB- 3222 203 52 . . . 3222 204 1 9 9 CD 3222 204 2 . . . 3222 205 1 Witten Witten NNP 3222 205 2 , , , 3222 205 3 Moffat Moffat NNP 3222 205 4 , , , 3222 205 5 and and CC 3222 205 6 Bell Bell NNP 3222 205 7 , , , 3222 205 8 Managing manage VBG 3222 205 9 Gigabytes Gigabytes NNPS 3222 205 10 . . . 3222 206 1 10 10 CD 3222 206 2 . . . 3222 207 1 Jon Jon NNP 3222 207 2 Louis Louis NNP 3222 207 3 Bentley Bentley NNP 3222 207 4 et et NNP 3222 207 5 al al NNP 3222 207 6 . . NNP 3222 207 7 , , , 3222 207 8 “ " `` 3222 207 9 A a DT 3222 207 10 Locally Locally NNP 3222 207 11 Adaptive Adaptive NNP 3222 207 12 Data data NN 3222 207 13 Com- Com- NNP 3222 207 14 pression pression NN 3222 207 15 Scheme Scheme NNP 3222 207 16 , , , 3222 207 17 ” " '' 3222 207 18 Communications communication NNS 3222 207 19 of of IN 3222 207 20 the the DT 3222 207 21 ACM ACM NNP 3222 207 22 29 29 CD 3222 207 23 , , , 3222 207 24 no no UH 3222 207 25 . . . 
3222 208 1 4 4 CD 3222 208 2 ( ( -LRB- 3222 208 3 1986 1986 CD 3222 208 4 ) ) -RRB- 3222 208 5 : : : 3222 208 6 320 320 CD 3222 208 7 ; ; : 3222 208 8 R. R. NNP 3222 208 9 Nigel Nigel NNP 3222 208 10 Horspool Horspool NNP 3222 208 11 and and CC 3222 208 12 Gordon Gordon NNP 3222 208 13 V. V. NNP 3222 208 14 Cormack Cormack NNP 3222 208 15 , , , 3222 208 16 “ " `` 3222 208 17 Constructing construct VBG 3222 208 18 Word Word NNP 3222 208 19 - - HYPH 3222 208 20 Based Based NNP 3222 208 21 Text Text NNP 3222 208 22 Compression Compression NNP 3222 208 23 Algorithms Algorithms NNP 3222 208 24 , , , 3222 208 25 ” " '' 3222 208 26 Proceedings proceeding NNS 3222 208 27 of of IN 3222 208 28 the the DT 3222 208 29 Data Data NNP 3222 208 30 Compression Compression NNP 3222 208 31 Conference Conference NNP 3222 208 32 ( ( -LRB- 3222 208 33 Snowbird Snowbird NNP 3222 208 34 , , , 3222 208 35 Utah Utah NNP 3222 208 36 , , , 3222 208 37 1992 1992 CD 3222 208 38 ) ) -RRB- 3222 208 39 : : : 3222 208 40 62 62 CD 3222 208 41 . . . 3222 209 1 11 11 CD 3222 209 2 . . . 3222 210 1 See see VB 3222 210 2 for for IN 3222 210 3 example example NN 3222 210 4 Andrei Andrei NNP 3222 210 5 V. V. NNP 3222 210 6 Kadach Kadach NNP 3222 210 7 , , , 3222 210 8 “ " `` 3222 210 9 Text Text NNP 3222 210 10 and and CC 3222 210 11 Hypertext Hypertext NNP 3222 210 12 Compression Compression NNP 3222 210 13 , , , 3222 210 14 ” " '' 3222 210 15 Programming Programming NNP 3222 210 16 & & CC 3222 210 17 Computer Computer NNP 3222 210 18 Software Software NNP 3222 210 19 23 23 CD 3222 210 20 , , , 3222 210 21 no no UH 3222 210 22 . . . 
3222 211 1 4 4 CD 3222 211 2 ( ( -LRB- 3222 211 3 1997 1997 CD 3222 211 4 ) ) -RRB- 3222 211 5 : : : 3222 211 6 212 212 CD 3222 211 7 ; ; : 3222 211 8 Alistair Alistair NNP 3222 211 9 Moffat Moffat NNP 3222 211 10 , , , 3222 211 11 “ " `` 3222 211 12 Word word NN 3222 211 13 - - HYPH 3222 211 14 based base VBN 3222 211 15 text text NN 3222 211 16 compression compression NN 3222 211 17 , , , 3222 211 18 ” " '' 3222 211 19 Software software NN 3222 211 20 — — : 3222 211 21 Practice Practice NNP 3222 211 22 & & CC 3222 211 23 Experience Experience NNP 3222 211 24 2 2 CD 3222 211 25 , , , 3222 211 26 no no UH 3222 211 27 . . . 3222 212 1 19 19 CD 3222 212 2 ( ( -LRB- 3222 212 3 1989 1989 CD 3222 212 4 ) ) -RRB- 3222 212 5 : : : 3222 212 6 185 185 CD 3222 212 7 ; ; : 3222 212 8 Przemysław Przemysław NNP 3222 212 9 Skibiński Skibiński NNP 3222 212 10 , , , 3222 212 11 Szymon Szymon NNP 3222 212 12 Grabowski Grabowski NNP 3222 212 13 , , , 3222 212 14 and and CC 3222 212 15 Sebastian Sebastian NNP 3222 212 16 Deo- Deo- NNP 3222 212 17 rowicz rowicz NN 3222 212 18 , , , 3222 212 19 “ " `` 3222 212 20 Revisiting Revisiting NNP 3222 212 21 Dictionary Dictionary NNP 3222 212 22 - - HYPH 3222 212 23 Based base VBN 3222 212 24 Compression Compression NNP 3222 212 25 , , , 3222 212 26 ” " '' 3222 212 27 Software software NN 3222 212 28 — — : 3222 212 29 Practice Practice NNP 3222 212 30 & & CC 3222 212 31 Experience Experience NNP 3222 212 32 35 35 CD 3222 212 33 , , , 3222 212 34 no no UH 3222 212 35 . . . 3222 213 1 15 15 CD 3222 213 2 ( ( -LRB- 3222 213 3 2005 2005 CD 3222 213 4 ) ) -RRB- 3222 213 5 : : : 3222 213 6 1455 1455 CD 3222 213 7 . . . 3222 214 1 12 12 CD 3222 214 2 . . . 
3222 215 1 Przemysław Przemysław NNP 3222 215 2 Skibiński Skibiński NNP 3222 215 3 , , , 3222 215 4 Jakub Jakub NNP 3222 215 5 Swacha Swacha NNP 3222 215 6 , , , 3222 215 7 and and CC 3222 215 8 Szymon Szymon NNP 3222 215 9 Grabowski Grabowski NNP 3222 215 10 , , , 3222 215 11 “ " `` 3222 215 12 A a DT 3222 215 13 Highly highly RB 3222 215 14 Efficient efficient JJ 3222 215 15 XML xml NN 3222 215 16 Compression Compression NNP 3222 215 17 Scheme Scheme NNP 3222 215 18 for for IN 3222 215 19 the the DT 3222 215 20 Web web NN 3222 215 21 , , , 3222 215 22 ” " '' 3222 215 23 Proceedings proceeding NNS 3222 215 24 of of IN 3222 215 25 the the DT 3222 215 26 34th 34th JJ 3222 215 27 International International NNP 3222 215 28 Conference Conference NNP 3222 215 29 on on IN 3222 215 30 Cur- Cur- NNP 3222 215 31 rent rent NN 3222 215 32 Trends trend NNS 3222 215 33 in in IN 3222 215 34 Theory Theory NNP 3222 215 35 and and CC 3222 215 36 Practice Practice NNP 3222 215 37 of of IN 3222 215 38 Computer Computer NNP 3222 215 39 Science Science NNP 3222 215 40 , , , 3222 215 41 LNCS LNCS NNP 3222 215 42 4910 4910 CD 3222 215 43 ( ( -LRB- 3222 215 44 2008 2008 CD 3222 215 45 ) ) -RRB- 3222 215 46 : : : 3222 215 47 766 766 CD 3222 215 48 . . . 3222 216 1 13 13 CD 3222 216 2 . . . 3222 217 1 Jon Jon NNP 3222 217 2 Louis Louis NNP 3222 217 3 Bentley Bentley NNP 3222 217 4 et et NNP 3222 217 5 al al NNP 3222 217 6 . . NNP 3222 217 7 , , , 3222 217 8 “ " `` 3222 217 9 A a DT 3222 217 10 Locally Locally NNP 3222 217 11 Adaptive Adaptive NNP 3222 217 12 Data data NN 3222 217 13 Com- Com- NNP 3222 217 14 pression pression NN 3222 217 15 Scheme Scheme NNP 3222 217 16 , , , 3222 217 17 ” " '' 3222 217 18 Communications communication NNS 3222 217 19 of of IN 3222 217 20 the the DT 3222 217 21 ACM ACM NNP 3222 217 22 29 29 CD 3222 217 23 , , , 3222 217 24 no no UH 3222 217 25 . . . 
3222 218 1 4 4 CD 3222 218 2 ( ( -LRB- 3222 218 3 1986 1986 CD 3222 218 4 ) ) -RRB- 3222 218 5 : : : 3222 218 6 320 320 CD 3222 218 7 . . . 3222 219 1 14 14 CD 3222 219 2 . . . 3222 220 1 Skibiński Skibiński NNP 3222 220 2 , , , 3222 220 3 Grabowski Grabowski NNP 3222 220 4 , , , 3222 220 5 and and CC 3222 220 6 Deorowicz Deorowicz NNP 3222 220 7 , , , 3222 220 8 “ " `` 3222 220 9 Revisiting revisit VBG 3222 220 10 Dic- Dic- NNP 3222 220 11 tionary tionary NN 3222 220 12 - - HYPH 3222 220 13 Based base VBN 3222 220 14 Compression Compression NNP 3222 220 15 , , , 3222 220 16 ” " '' 3222 220 17 1455 1455 CD 3222 220 18 . . . 3222 221 1 15 15 CD 3222 221 2 . . . 3222 222 1 Skibiński Skibiński NNP 3222 222 2 , , , 3222 222 3 Swacha Swacha NNP 3222 222 4 , , , 3222 222 5 and and CC 3222 222 6 Grabowski Grabowski NNP 3222 222 7 , , , 3222 222 8 “ " `` 3222 222 9 A a DT 3222 222 10 Highly highly RB 3222 222 11 Efficient efficient JJ 3222 222 12 XML xml NN 3222 222 13 Compression Compression NNP 3222 222 14 Scheme Scheme NNP 3222 222 15 for for IN 3222 222 16 the the DT 3222 222 17 Web web NN 3222 222 18 , , , 3222 222 19 ” " '' 3222 222 20 766 766 CD 3222 222 21 . . . 3222 223 1 16 16 CD 3222 223 2 . . . 
3222 224 1 Peter Peter NNP 3222 224 2 Deutsch Deutsch NNP 3222 224 3 , , , 3222 224 4 “ " `` 3222 224 5 DEFLATE DEFLATE NNP 3222 224 6 Compressed Compressed NNP 3222 224 7 Data Data NNP 3222 224 8 Format Format NNP 3222 224 9 Specification Specification NNP 3222 224 10 version version NN 3222 224 11 1.3 1.3 CD 3222 224 12 , , , 3222 224 13 ” " '' 3222 224 14 RFC1951 RFC1951 NNP 3222 224 15 , , , 3222 224 16 Network Network NNP 3222 224 17 Working Working NNP 3222 224 18 Group Group NNP 3222 224 19 , , , 3222 224 20 1996 1996 CD 3222 224 21 , , , 3222 224 22 www.ietf.org/rfc/rfc1951.txt www.ietf.org/rfc/rfc1951.txt NNP 3222 224 23 ( ( -LRB- 3222 224 24 accessed access VBN 3222 224 25 May May NNP 3222 224 26 7 7 CD 3222 224 27 , , , 3222 224 28 2009 2009 CD 3222 224 29 ) ) -RRB- 3222 224 30 . . . 3222 225 1 17 17 CD 3222 225 2 . . . 3222 226 1 Christian Christian NNP 3222 226 2 Schneider Schneider NNP 3222 226 3 , , , 3222 226 4 Precomp Precomp NNP 3222 226 5 — — : 3222 226 6 A a DT 3222 226 7 Command Command NNP 3222 226 8 Line Line NNP 3222 226 9 Precom- Precom- NNP 3222 226 10 pressor pressor NN 3222 226 11 , , , 3222 226 12 2009 2009 CD 3222 226 13 , , , 3222 226 14 http://schnaader.info/precomp.html http://schnaader.info/precomp.html NN 3222 226 15 ( ( -LRB- 3222 226 16 accessed access VBN 3222 226 17 May May NNP 3222 226 18 7 7 CD 3222 226 19 , , , 3222 226 20 2009 2009 CD 3222 226 21 ) ) -RRB- 3222 226 22 . . . 3222 227 1 18 18 CD 3222 227 2 . . . 
3222 228 1 The the DT 3222 228 2 technical technical JJ 3222 228 3 details detail NNS 3222 228 4 of of IN 3222 228 5 the the DT 3222 228 6 algorithm algorithm NNP 3222 228 7 constructing construct VBG 3222 228 8 code code NN 3222 228 9 words word NNS 3222 228 10 and and CC 3222 228 11 assigning assign VBG 3222 228 12 them -PRON- PRP 3222 228 13 to to IN 3222 228 14 indexes index NNS 3222 228 15 , , , 3222 228 16 and and CC 3222 228 17 encoding encode VBG 3222 228 18 num- num- JJ 3222 228 19 bers ber NNS 3222 228 20 and and CC 3222 228 21 special special JJ 3222 228 22 tokens token NNS 3222 228 23 , , , 3222 228 24 are be VBP 3222 228 25 given give VBN 3222 228 26 in in IN 3222 228 27 Skibiński Skibiński NNP 3222 228 28 , , , 3222 228 29 Swacha Swacha NNP 3222 228 30 , , , 3222 228 31 and and CC 3222 228 32 Grabowski Grabowski NNP 3222 228 33 , , , 3222 228 34 “ " `` 3222 228 35 A a DT 3222 228 36 Highly highly RB 3222 228 37 Efficient efficient JJ 3222 228 38 XML xml NN 3222 228 39 Compression Compression NNP 3222 228 40 Scheme Scheme NNP 3222 228 41 for for IN 3222 228 42 the the DT 3222 228 43 Web web NN 3222 228 44 , , , 3222 228 45 ” " '' 3222 228 46 766 766 CD 3222 228 47 . . . 3222 229 1 19 19 CD 3222 229 2 . . . 3222 230 1 David David NNP 3222 230 2 Solomon Solomon NNP 3222 230 3 , , , 3222 230 4 Data Data NNP 3222 230 5 Compression Compression NNP 3222 230 6 : : : 3222 230 7 The the DT 3222 230 8 Complete Complete NNP 3222 230 9 Reference Reference NNP 3222 230 10 , , , 3222 230 11 4th 4th JJ 3222 230 12 ed ed NN 3222 230 13 . . . 3222 231 1 ( ( -LRB- 3222 231 2 London London NNP 3222 231 3 : : : 3222 231 4 Springer Springer NNP 3222 231 5 - - HYPH 3222 231 6 Verlag Verlag NNP 3222 231 7 , , , 3222 231 8 2006 2006 CD 3222 231 9 ) ) -RRB- 3222 231 10 . . . 3222 232 1 20 20 CD 3222 232 2 . . . 
3222 233 1 Skibiński Skibiński NNP 3222 233 2 , , , 3222 233 3 Swacha Swacha NNP 3222 233 4 , , , 3222 233 5 and and CC 3222 233 6 Grabowski Grabowski NNP 3222 233 7 , , , 3222 233 8 “ " `` 3222 233 9 A a DT 3222 233 10 Highly highly RB 3222 233 11 Efficient efficient JJ 3222 233 12 XML xml NN 3222 233 13 Compression Compression NNP 3222 233 14 Scheme Scheme NNP 3222 233 15 for for IN 3222 233 16 the the DT 3222 233 17 Web web NN 3222 233 18 , , , 3222 233 19 ” " '' 3222 233 20 766 766 CD 3222 233 21 . . . 3222 234 1 21 21 CD 3222 234 2 . . . 3222 235 1 Dave Dave NNP 3222 235 2 Raggett Raggett NNP 3222 235 3 , , , 3222 235 4 Arnaud Arnaud NNP 3222 235 5 Le Le NNP 3222 235 6 Hors Hors NNP 3222 235 7 , , , 3222 235 8 and and CC 3222 235 9 Ian Ian NNP 3222 235 10 Jacobs Jacobs NNP 3222 235 11 , , , 3222 235 12 eds eds NNP 3222 235 13 . . NNP 3222 235 14 , , , 3222 235 15 W3C W3C NNP 3222 235 16 HTML HTML NNP 3222 235 17 4.01 4.01 CD 3222 235 18 Specification Specification NNP 3222 235 19 , , , 3222 235 20 1999 1999 CD 3222 235 21 , , , 3222 235 22 http://www.w3.org/TR/REC http://www.w3.org/TR/REC -LRB- 3222 235 23 -html40/ -html40/ NFP 3222 235 24 ( ( -LRB- 3222 235 25 accessed access VBN 3222 235 26 May May NNP 3222 235 27 7 7 CD 3222 235 28 , , , 3222 235 29 2009 2009 CD 3222 235 30 ) ) -RRB- 3222 235 31 . . . 3222 236 1 22 22 CD 3222 236 2 . . . 3222 237 1 Ian Ian NNP 3222 237 2 H. H. 
NNP 3222 237 3 Witten Witten NNP 3222 237 4 , , , 3222 237 5 David David NNP 3222 237 6 Bainbridge Bainbridge NNP 3222 237 7 , , , 3222 237 8 and and CC 3222 237 9 Stefan Stefan NNP 3222 237 10 Boddie Boddie NNP 3222 237 11 , , , 3222 237 12 “ " `` 3222 237 13 Greenstone greenstone NN 3222 237 14 : : : 3222 237 15 Open open VB 3222 237 16 Source source NN 3222 237 17 DL DL NNP 3222 237 18 Software Software NNP 3222 237 19 , , , 3222 237 20 ” " '' 3222 237 21 Communications communication NNS 3222 237 22 of of IN 3222 237 23 the the DT 3222 237 24 ACM ACM NNP 3222 237 25 44 44 CD 3222 237 26 , , , 3222 237 27 no no UH 3222 237 28 . . . 3222 238 1 5 5 CD 3222 238 2 ( ( -LRB- 3222 238 3 2001 2001 CD 3222 238 4 ) ) -RRB- 3222 238 5 : : : 3222 238 6 47 47 CD 3222 238 7 . . . 3222 239 1 23 23 CD 3222 239 2 . . . 3222 240 1 Project Project NNP 3222 240 2 Gutenberg Gutenberg NNP 3222 240 3 , , , 3222 240 4 2008 2008 CD 3222 240 5 , , , 3222 240 6 http://www.gutenberg.org/ http://www.gutenberg.org/ NN 3222 240 7 ( ( -LRB- 3222 240 8 accessed access VBN 3222 240 9 May May NNP 3222 240 10 7 7 CD 3222 240 11 , , , 3222 240 12 2009 2009 CD 3222 240 13 ) ) -RRB- 3222 240 14 . . . 3222 241 1 24 24 CD 3222 241 2 . . . 
3222 242 1 Przemysław Przemysław NNP 3222 242 2 Skibiński Skibiński NNP 3222 242 3 and and CC 3222 242 4 Szymon Szymon NNP 3222 242 5 Grabowski Grabowski NNP 3222 242 6 , , , 3222 242 7 “ " `` 3222 242 8 Variable- Variable- NNP 3222 242 9 Length Length NNP 3222 242 10 Contexts Contexts NNP 3222 242 11 for for IN 3222 242 12 PPM PPM NNP 3222 242 13 , , , 3222 242 14 ” " '' 3222 242 15 Proceedings proceeding NNS 3222 242 16 of of IN 3222 242 17 the the DT 3222 242 18 IEEE IEEE NNP 3222 242 19 Data Data NNP 3222 242 20 Compres- Compres- NNP 3222 242 21 sion sion NN 3222 242 22 Conference Conference NNP 3222 242 23 ( ( -LRB- 3222 242 24 Snowbird Snowbird NNP 3222 242 25 , , , 3222 242 26 Utah Utah NNP 3222 242 27 , , , 3222 242 28 2004 2004 CD 3222 242 29 ) ) -RRB- 3222 242 30 : : : 3222 242 31 409 409 CD 3222 242 32 . . .