Basic IR Definitions
After attempting to tackle a rather formidable treatise on information retrieval (Baeza-Yates & Ribeiro, 1999), I’ve run into a few roadblocks. I need some definitions:
Bit Mask
A pattern of binary values which is combined with some value using bitwise AND with the result that bits in the value in positions where the mask is zero are also set to zero. For example, if, in C, we want to test if bits 0 or 2 of x are set, we can write
int mask = 5; /* binary 101 */
if (x & mask) ...
A bit mask might also be used to set certain bits using bitwise OR, or to invert them using bitwise exclusive OR.
Source: http://dict.die.net/bit%20mask/ (attributed to: The Free On-line Dictionary of Computing (09 FEB 02))
Hashing
Producing hash values for accessing data or for security. A hash value (or simply hash), also called a message digest, is a number generated from a string of text. The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.
Hashes play a role in security systems where they're used to ensure that transmitted messages have not been tampered with. The sender generates a hash of the message, encrypts it, and sends it with the message itself. The recipient then decrypts both the message and the hash, produces another hash from the received message, and compares the two hashes. If they're the same, there is a very high probability that the message was transmitted intact.
Hashing is also a common method of accessing data records. Consider, for example, a list of names:
· John Smith
· Sarah Jones
· Roger Adams
To create an index, called a hash table, for these records, you would apply a formula to each name to produce a unique numeric value. So you might get something like:
· 1345873 John smith
· 3097905 Sarah Jones
· 4060964 Roger Adams
Then to search for the record containing Sarah Jones, you just need to reapply the formula, which directly yields the index key to the record. This is much more efficient than searching through all the records till the matching record is found.
Source: http://www.webopedia.com/TERM/h/hashing.html
Heap’s Law
For text files, the second important thing is the number of distinct words or vocabulary of each document. We use the Heaps’ Law. This is a very precise law ruling the growth of the vocabulary in natural language texts... Hence, the vocabulary of a text grows sublinearly with the text size, in a proportion close to its square root.
Source: http://www.pnclink.org/annual/annual1998/1998pdf/yates.pdf
Patricia Tree
Definition: A compact representation of a trie where all nodes with one child are merged with their parents.
See also suffix tree, compact DAWG.
Note: A compact directed acyclic word graph (DAWG) merges common suffix trees to save additional space.
Source: http://www.nist.gov/dads/HTML/patriciatree.html
Suffix
Definition: The end characters of a string. More formally a string v is a suffix of a string u if u=u'v for some string u'.
Source: http://www.nist.gov/dads/HTML/suffix.html
Suffix Tree
Definition: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents.
See also multi suffix tree, Patricia tree, suffix array, directed acyclic word graph.
Note: A suffix tree is a Patricia tree corresponding to the suffixes of a given string. A directed acyclic word graph (DAWG) is a more compact form.
Source: http://www.nist.gov/dads/HTML/suffixtree.html
Trie
Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.
Note: The name comes from retrieval and is pronounced, "tree."
Source: http://www.nist.gov/dads/HTML/trie.html
References
Baeza-Yates, R., & Ribeiro, B. d. A. N. (1999). Modern information retrieval. New York: ACM Press.
After attempting to tackle a rather formidable treatise on information retrieval (Baeza-Yates & Ribeiro, 1999), I’ve run into a few roadblocks. I need some definitions:
Bit Mask
int mask = 5; /* binary 101 */
if (x & mask) ...
A bit mask might also be used to set certain bits using bitwise OR, or to invert them using bitwise exclusive OR.
Source: http://dict.die.net/bit%20mask/ (attributed to: The Free On-line Dictionary of Computing (09 FEB 02))
Hashing
Producing hash values for accessing data or for security. A hash value (or simply hash), also called a message digest, is a number generated from a string of text. The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.
Hashes play a role in security systems where they're used to ensure that transmitted messages have not been tampered with. The sender generates a hash of the message, encrypts it, and sends it with the message itself. The recipient then decrypts both the message and the hash, produces another hash from the received message, and compares the two hashes. If they're the same, there is a very high probability that the message was transmitted intact.
Hashing is also a common method of accessing data records. Consider, for example, a list of names:
· John Smith
· Sarah Jones
· Roger Adams
To create an index, called a hash table, for these records, you would apply a formula to each name to produce a unique numeric value. So you might get something like:
· 1345873 John smith
· 3097905 Sarah Jones
· 4060964 Roger Adams
Then to search for the record containing Sarah Jones, you just need to reapply the formula, which directly yields the index key to the record. This is much more efficient than searching through all the records till the matching record is found.
Source: http://www.webopedia.com/TERM/h/hashing.html
Heap’s Law
For text files, the second important thing is the number of distinct words or vocabulary of each document. We use the Heaps’ Law. This is a very precise law ruling the growth of the vocabulary in natural language texts... Hence, the vocabulary of a text grows sublinearly with the text size, in a proportion close to its square root.
Source: http://www.pnclink.org/annual/annual1998/1998pdf/yates.pdf
Patricia Tree
Definition: A compact representation of a trie where all nodes with one child are merged with their parents.
See also suffix tree, compact DAWG.
Note: A compact directed acyclic word graph (DAWG) merges common suffix trees to save additional space.
Source: http://www.nist.gov/dads/HTML/patriciatree.html
Suffix
Definition: The end characters of a string. More formally a string v is a suffix of a string u if u=u'v for some string u'.
Source: http://www.nist.gov/dads/HTML/suffix.html
Suffix Tree
Definition: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents.
See also multi suffix tree, Patricia tree, suffix array, directed acyclic word graph.
Note: A suffix tree is a Patricia tree corresponding to the suffixes of a given string. A directed acyclic word graph (DAWG) is a more compact form.
Source: http://www.nist.gov/dads/HTML/suffixtree.html
Trie
Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.
Note: The name comes from retrieval and is pronounced, "tree."
Source: http://www.nist.gov/dads/HTML/trie.html
References
Baeza-Yates, R., & Ribeiro, B. d. A. N. (1999). Modern information retrieval. New York: ACM Press.
11 Comments:
XUnzAb The best blog you have!
rbg5Yi Good job!
Wonderful blog.
Please write anything else!
actually, that's brilliant. Thank you. I'm going to pass that on to a couple of people.
Good job!
Good job!
Good job!
Nice Article.
Thanks to author.
2XNx37 write more, thanks.
Post a Comment
Subscribe to Post Comments [Atom]
<< Home