After attempting to tackle a rather formidable treatise on information retrieval (Baeza-Yates & Ribeiro, 1999), I’ve run into a few roadblocks. I need some definitions:
int mask = 5; /* binary 101 */
if (x & mask) ...
A bit mask might also be used to set certain bits using bitwise OR, or to invert them using bitwise exclusive OR.
Source: http://dict.die.net/bit%20mask/ (attributed to: The Free On-line Dictionary of Computing (09 FEB 02))
Producing hash values for accessing data or for security. A hash value (or simply hash), also called a message digest, is a number generated from a string of text. The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.
Hashes play a role in security systems where they're used to ensure that transmitted messages have not been tampered with. The sender generates a hash of the message, encrypts it, and sends it with the message itself. The recipient then decrypts both the message and the hash, produces another hash from the received message, and compares the two hashes. If they're the same, there is a very high probability that the message was transmitted intact.
Hashing is also a common method of accessing data records. Consider, for example, a list of names:
· John Smith
· Sarah Jones
· Roger Adams
To create an index, called a hash table, for these records, you would apply a formula to each name to produce a unique numeric value. So you might get something like:
· 1345873 John smith
· 3097905 Sarah Jones
· 4060964 Roger Adams
Then to search for the record containing Sarah Jones, you just need to reapply the formula, which directly yields the index key to the record. This is much more efficient than searching through all the records till the matching record is found.
For text files, the second important thing is the number of distinct words or vocabulary of each document. We use the Heaps’ Law. This is a very precise law ruling the growth of the vocabulary in natural language texts... Hence, the vocabulary of a text grows sublinearly with the text size, in a proportion close to its square root.
Definition: A compact representation of a trie where all nodes with one child are merged with their parents.
See also suffix tree, compact DAWG.
Note: A compact directed acyclic word graph (DAWG) merges common suffix trees to save additional space.
Definition: The end characters of a string. More formally a string v is a suffix of a string u if u=u'v for some string u'.
Definition: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents.
See also multi suffix tree, Patricia tree, suffix array, directed acyclic word graph.
Note: A suffix tree is a Patricia tree corresponding to the suffixes of a given string. A directed acyclic word graph (DAWG) is a more compact form.
Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.
Note: The name comes from retrieval and is pronounced, "tree."
Baeza-Yates, R., & Ribeiro, B. d. A. N. (1999). Modern information retrieval. New York: ACM Press.