incbwt/README

   1 General Information
   2 ===================
   3
   4 This is an implementation of the Run-Length Compressed Suffix Array (RLCSA) [1,2] and its incremental construction [2]. The implementation includes experimental support for LCP information [3].
   5
   6 Copyright 2007 - 2010, Jouni Siren, except when otherwise noted. See LICENSE for further information.
   7
   8 This package contains an implementation of the algorithm presented in "Faster Suffix Sorting" by N. Jesper Larsson (jesper@cs.lth.se) and Kunihiko Sadakane (sada@is.s.u-tokyo.ac.jp). Copyright 1999, N. Jesper Larsson.
   9
  10
  11 Compiling
  12 ---------
  13
  14 The code should compile in both 32-bit and 64-bit environments. Uncomment SIZE_FLAGS in the makefile to use 64-bit integers in the 64-bit version.
  15
  16 Parallelism is supported by libstdc++ Parallel Mode and by MCSTL. Uncomment either version of PARALLEL_FLAGS to compile the parallel version of the library, and set MCSTL_ROOT if necessary. GCC 4.2 or newer is required for the MCSTL version.
  17
  18 Uncomment VECTOR_FLAGS to use a faster encoding for the run-length encoded bit vectors in .rlcsa.array. This increases the size somewhat.
  19
  20 32-bit integers limit the size of the collection to less than 4 gigabytes. The size of individual input files is limited to less than 2 gigabytes in both 32-bit and 64-bit versions.
  21
  22 Note that if 32-bit integers are used, then the bit-aligned arrays are limited to less than 512 megabytes (2^32 bits) in size. Hence if n is the collection size in characters and d is the sample rate, then (n / d) log ceil(n / d) must be less than 2^32. Otherwise the suffix array samples cannot be stored.
  23
  24
  25 Index Construction
  26 ------------------
  27
  28 The naming conventions for files are:
  29
  30   base_name - the sequences
  31   base_name.rlcsa.array - most of the CSA
  32   base_name.rlcsa.sa_samples - suffix array samples for locate and display
  33   base_name.rlcsa.parameters - some of the index parameters
  34   base_name.lcp_samples - sampled LCP array
  35   base_name.plcp - run-length encoded PLCP array
  36
  37 A typical parameter file looks like:
  38
  39   RLCSA_BLOCK_SIZE = 32
  40   SAMPLE_RATE = 64
  41   SUPPORT_DISPLAY = 1
  42   SUPPORT_LOCATE = 1
  43
  44 parallel_build is used to construct the index, as described in [2]. The program takes 2 to 4 parameters:
  45
  46   use '-n' as the first parameter to stop before merging the partial indexes
  47   a list of input files (a text file, one file name per line)
  48   base name of the output
  49   number of threads to use (optional)
  50
  51 The default parameters are 32 bytes for block size and 512 for sample rate. To modify these, one should create the parameter file before running the construction program. Each input file should be a concatenation of C-style '\0'-terminated strings. The files must be smaller than 2 GB each.
  52
  53 build_rlcsa indexes a text file using a single thread. The program takes 2-4 parameters:
  54
  55   name of the input file
  56   buffer size (in megabytes)
  57   block size (optional)
  58   sample rate (optional)
  59
  60 Every line is inserted as a separate sequence. The lines must not contain character '\0'. The buffer size must be less than 2 gigabytes.
  61
  62
  63 Operations
  64 ----------
  65
  66 A number of operations have been implemented. The most important ones are the following:
  67
  68   pair_type count(const std::string& pattern)
  69   Returns the suffix array range corresponding to the matches of the pattern. The range is reported as a closed interval.
  70
  71   usint* locate(pair_type range)
  72   usint* locate(pair_type range, usint* data)
  73   usint locate(usint index)
  74   These return the suffix array values at given range/position. The user is responsible for the allocated data.
  75
  76   uchar* display(usint sequence)
  77   uchar* display(usint sequence, pair_type range)
  78   uchar* display(usint sequence, pair_type range, uchar* data)
  79   These return a substring of the given sequence, as determined by the closed interval 'range'. The user is responsible for the allocated data.
  80
  81   uchar* display(usint position, usint len, usint context)
  82   This is intended for displaying an occurrence of a pattern of length 'len' at position 'position' with 'context' extra characters on both sides.
  83
  84   uchar* readBWT()
  85   uchar* readBWT(pair_type range)
  86   Returns the BWT of the collection or a part of it. The user is responsible for the allocated string. Note that unlike the suffix array, the BWT includes all end markers.
  87
  88   pair_type getSequenceRange(usint number)
  89   T^{number} -> [sp, ep], where T^{number} === T[sp, ep]
  90
  91   pair_type getSequenceRangeForPosition(usint value)
  92   [sp, ep], where sp <= value <= ep
  93
  94   usint getSequenceForPosition(usint value)
  95   i, where T^{i} === T[sp, ep] and sp <= value <= ep
  96
  97   usint* getSequenceForPosition(usint* value, usint length)
  98   As above, but for multiple positions at once.
  99
 100   pair_type getRelativePosition(usint value)
 101   (i, offset), where T^{i}[offset] === T[value]
 102
 103 Locate and display can only be used when the corresponding parameter (SUPPORT_LOCATE or SUPPORT_DISPLAY) has value 1 and the suffix array samples have been created during the construction. If both of the parameters are missing or have value 0, the suffix array samples will not be loaded into memory.
 104
 105 These operations are const and hence thread-safe.
 106
 107
 108 Construction Interface
 109 ----------------------
 110
 111 Class RLCSABuilder provides other possibilities for index construction. The constructor takes four parameters:
 112
 113   block size for the Psi vectors in bytes
 114   suffix array sample rate
 115   buffer size for construction in bytes
 116   number of threads to use
 117
 118 If suffix array sample rate is set to 0, the samples will not be created. The buffer size must be less than 2 gigabytes.
 119
 120 Function insertSequence is called to insert a new sequence into the collection. The parameters are:
 121
 122   sequence as a char array
 123   length of the sequence (not including the trailing 0, if present)
 124   should we free the memory used by the sequence
 125
 126 Function insertFromFile can be used to merge existing indexes into the collection. It takes the base name of the index as a parameter. The sequences and the index should both be available.
 127
 128 Function getRLCSA is used to finish the construction and get the final index. After the call, the builder no longer contains the index. The caller is responsible for freeing the index.
 129
 130 For example, the following inserts the sequences into the collection one at a time:
 131
 132   // Use a buffer of n megabytes.
 133   RLCSABuilder builder(block_size, sample_rate, n * MEGABYTE, threads);
 134
 135   // For each sequence:
 136   builder.insertSequence(sequence, length, false);
 137
 138   // If succesful, write the index to disk.
 139   if(builder.isOk())
 140   {
 141     RLCSA* rlcsa = builder.getRLCSA();
 142     rlcsa->writeTo(base_name);
 143     delete rlcsa;
 144   }
 145
 146
 147 Incremental construction for multiple sequences
 148 -----------------------------------------------
 149
 150 When there are multiple sequences in one increment, character '\0' is assumed to represent the end of sequence marker. Hence the sequences themselves cannot contain character '\0'. This is always the case when using RLCSABuilder to build the partial indexes.
 151
 152 If there is just one sequence in the increment, character '\0' is considered a normal character. This requires setting multiple_sequences = false in the RLCSA constructor. Note that RLCSABuilder cannot be used to merge these indexes, as it assumes character '\0' an end of sequence marker.
 153
 154
 155 LCP Support
 156 -----------
 157
 158 The implementation includes experimental support for two representations of the LCP array: run-length encoded PLCP array and the sampled LCP array. sample_lcp and build_plcp can be used to build the representations. lcp_test was used in the experiments reported in [3].
 159
 160
 161 Other Programs
 162 --------------
 163
 164 extract_sequence can be used to extract individual sequences from the index.
 165
 166 rlcsa_grep is an example program offering limited grep-like functionality on indexes built with build_rlcsa.
 167
 168 rlcsa_test and display_test were used in the experiments reported in [2]. Some small modifications have been required to use the current interface.
 169
 170 The rest of the programs have not been used recently. They might no longer work correctly.
 171
 172
 173 Technical Information
 174 =====================
 175
 176 A collection of sequences
 177 -------------------------
 178
 179 The index contains a collection C of sequences T1, T2, ..., Tr. When constructing the index, each Ti is assumed to be followed by an implicit end of sequence marker $. The markers are assumed to be less than any character in the alphabet, and their mutual order is defined by the sequences they are following.
 180
 181 The end of sequence markers are not included in the sequences or their suffix arrays. They are implicitly included in the values of Psi array as the first r values. If Psi(i) = j, then the pointer to the suffix starting at position SA[i] + 1 is stored at SA[Psi(i) - r]. These values are not stored in the index, however, making the Psi array effectively a collection of cycles instead of a single cycle.
 182
 183 We use a bit vector E to mark the ending positions of each sequence. The indexed sequence is implicitly padded with empty characters so that each sequence starts at a position divisible by sample rate d. The starting position of sequence k > 0 is d * ((E.select(k - 1) / d) + 1).
 184
 185
 186 Suffix array samples
 187 --------------------
 188
 189 For a given sample rate d, we store the positions of each sequence divisible by d. The sampled positions of suffix array are marked in a bit vector S, while the multipliers of d are stored in an array A in the same order. Another array B contains the inverse permutation of A.
 190
 191 When locating, we can use S.valueAfter(i - r) to get (j, k = S.rank(j) - 1) for the first sampled j >= i - r in the suffix array order. If j == i - r, we can get the suffix array value as d * A[k]. If i < r, we have reached the implicit end of sequence marker. In this case, the suffix array value is E.select(i) + 1 (this value should only be used to derive SA values for earlier positions). Note that i is a value of Psi, not an index of the suffix array.
 192
 193 When displaying text starting from position i, j = B[i / d] gives us the sample used as a starting point. The sample is located in position k = S.select(j) of suffix array corresponding to position k + r of Psi.
 194
 195
 196 Data formats
 197 ------------
 198
 199 .rlcsa.array
 200   distribution of characters (CHARS * sizeof(usint) bytes)
 201   RLEVector for each character appearing in the text
 202   DeltaVector E
 203   sample rate d (sizeof(usint) bytes)
 204
 205 .rlcsa.sa_samples
 206   DeltaVector S
 207   array A (number_of_samples items of length(number_of_samples - 1) bits)
 208
 209 Any bit vector
 210   universe size (sizeof(usint) bytes)
 211   item count (sizeof(usint) bytes)
 212   number of blocks (sizeof(usint) bytes)
 213   block size in words (sizeof(usint) bytes)
 214   block data (number_of_blocks * block_size words)
 215   sample array (2 * (number_of_blocks + 1) items of length(size) bits)
 216
 217
 218 LCP Information
 219 ---------------
 220
 221 The (P)LCP representations have been genralized to support multiple sequences. As the end markers are not included in the collection, the LCP values corresponding to the last characters of the sequences can be 1 or 0. The padding characters between the sequences are also assigned LCP values in the PLCP representation to ease its use. The sampled LCP array is used in a similar way as the SA samples in locate.
 222
 223 Data formats:
 224
 225 .lcp_samples
 226   DeltaVector for the sampled positions
 227   Array for the sampled values
 228
 229 .plcp
 230   DeltaVector
 231
 232 Array
 233   item count (sizeof(usint bytes))
 234   number of blocks (sizeof(usint) bytes)
 235   block size in words (sizeof(usint) bytes)
 236   block data (number_of_blocks * block_size words)
 237   file.write((char*)&(this->number_of_blocks), sizeof(this->number_of_blocks));
 238   file.write((char*)&(this->block_size), sizeof(this->block_size));
 239   file.write((char*)(this->array), this->block_size * this->number_of_blocks * sizeof(usint));
 240   sample array (number_of_blocks + 1 items of length(items) bits)
 241
 242
 243 References
 244 ==========
 245
 246 [1] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki: Storage and Retrieval of Highly Repetitive Sequence Collections.
 247 Journal of Computational Biology 17(3):281-308, 2010.
 248
 249 [2] Jouni Sirén: Compressed Suffix Arrays for Massive Data.
 250 In SPIRE 2009, Springer LNCS 5721, pp. 63-74, Saariselkä, Finland, August 25-27, 2009.
 251
 252 [3] Jouni Sirén: Sampled Longest Common Prefix Array.
 253 In CPM 2010, Springer LNCS 6129, pp. 227-237, New York, USA, June 21-23, 2010.