Abstract:
A data warehouse is a collection of a large amount of data gathered from
numerous sources in one organization. Those data must be efficiently organized so
that they can be analyzed and the summarized results can be used to support a
particular decision. Thus, querying data in a data warehouse must be fast.
Implementing indexing on those data can help improve the query response time
significantly. One indexing technique commonly used is bitmap indexing since it has
the advantage of dealing with replicated data in a data warehouse where the cardinality
of the data is often more than one. Another common scenario found in a data
warehouse is that most queries involve a time domain such as a monthly or yearly
report of sales volume for a particular product. In addition, time domain used is finite
and has little different granularity ranging from just days to years. Hence, bitmap
indices for time seem to be a good fit.
To improve both time and space requirements for bitmap indices, there are
several bitmap indexing techniques designed and implemented. However, due to the
vast amount of data and numerous data cardinalities, most bitmap techniques aim to
reduce the amount of space required to store bitmap indices. Therefore, most of them
propose to encode or compress bitmap indices.
This thesis proposes a model to evaluate the performance of five common bitmap
indexing techniques. They are simple bitmap (SB), simple bitmap with range encoding
(SBR), simple bitmap with interval encoding (SBI), bit-sliced indexing (BI) and
encoded bitmap indexing (EB). TPC-H benchmark data and five selected TPC-H
queries are used to evaluate the model. In addition, the query response time of both
exact and range queries under various time granularities are measured for all five
bitmap indexing methods. The experimental results show that the simple bitmap gives
the best response time for exact queries, and the simple bitmap with interval encoding
gives the best response time for range queries. Even though the simple bitmap
techniques require more space and generate more bitmap vectors, their performance
gain should considerably overcome the cost of storage space, which is now getting
lower every day.