Data streams, unbounded sets of continuous data, pose unique challenges towards issues of indexing and membership queries. Queries over data streams of the nature, flagging stock-value fluctuations in stock update streams, detecting news headlines of interest in RSS feeds, object lookup over in a cache summary etc., need to consider large or unbounded data sets. Bloom filter, a probabilistic data structures with finite memory requirements, is a useful data structure to index and lookup stream of data items. The basic bloom filter mechanism has been extended to adapt to the continuous indexing and querying requiring of unbounded data.
While all extensions consider the unbounded constraint, we argue that another characteristic, data-importance, is an important factor towards indexing and querying. The notion of data importance is related to application- specific priority of different items of a data set—some stocks are more important that others, cost of fetching objects from a cache are directly related to their size etc. As part of this work, we make a case for the need to develop solutions that are importance-aware and towards this we propose Importance -aware Bloom Filter (IBF). As part of IBF, we provide a set insertion and deletion heuristics to make the bloom filter importance-aware.
Our comparison of IBF with other bloom filter-based mechanisms shows that IBF performs very well—it has low false positives, and false negatives that occur due to deletions decrease with data-importance. Our precision and recall measurements that reflect the metric in terms of data-importance are also encouraging, and we believe that IBF provides a practical framework to balance the application-specific requirements. There exist numerous challenges in storing and querying a data tream. Our work focuses on indexing an unbounded stream of data items to answer set membership queries. Since data streams are continuous and unbounded, it is impractical to store the entire data during query processing, especially so in finite-memory systems.
In such contexts, it is difficult to provide precise answers to queries. To limit the amount data to be stored, or to be processed per data item, time-based sliding window  techniques have been used for continuous queries (CQ) . A continuous query executes over successive instances of the sliding time-window. In our work, we consider a landmark window, where one end of the time-window is fixed, for example start-of-day, and the other end of the window is unspecified. With this setting, the system would index all or partial data items and answer membership queries. Another important observation that motivates our work is the notion of importance of data items (or tuples). The following examples introduce data-importance and motivates the need to be importance-aware during indexing, querying and measuring performance of query accuracy.
Scenario 1: Consider a portfolio of stocks and an importance related to each. Possible importance functions can be based on individual or combinations of stock values, number of stocks, past and predicted fluctuations etc. Over the period of a day, the value of each stock fluctuates. An user is interested in knowing which stocks changed by more than 10% of its opening value. A less accurate answer regarding less important stocks is tolerable, whereas an inaccurate answer on more important stocks can result in sub-optimal actions to maximize portfolio value.
Scenario 2: A webcrawler starts crawling from a particular web page, collects links on that page, and then crawls these links recursively, similar to a breadth-first-traversal of a graph. It is obvious that a some links are popular and will appear on multiple web pages, e.g., a BBC news article may appear on international news listing of several web-sites. Crawlers need to accurately identify such popular links and avoid crawling them multiple times. An importance function in such cases can be correlated with the popularity of a web page or a web-site. A crawlers expectation of accuracy regarding the question of whether a page has been previously crawled can be related to its importance— popular pages require more accurate answers.
Scenario 3: Consider a data structure associated with a cache that can be looked-up to determine if an object is present in the cache. Objects stored in the cache are of different sizes and the cost to fetch them from a remote server is a function of their size. Assuming the look-up service provides a probabilistic answer, a requirement that aims to minimize cost would be less tolerable to false negatives for larger objects as compared to objects with smaller sizes (less important objects). These examples motivate us to look at schemes that exploit data importance during indexing