Dictionaries
Dictionaries manage mappings between sets of strings (usually words or n-grams) and integers. A basic dictionary can be created from a matrix of strings (a CSMat) or an SBMat.
> val a = csrow("you", "me", "them")
> val d = Dict(a)
> d(1)
me
> d("them")
2
> d(irow(0,2))
you,them
It's often useful to maintain counts of terms (e.g. for tf-idf weighting), so dictionaries accept a second argument, which is a set of counts.
> val a = csrow("you", "me", "them"); val c = irow(1,2,3)
> val d = Dict(a,c)
> d.count(0)
1.0
> d.count("them")
3.0
Counts are implemented with doubles, rather than ints or floats. This gives plenty of precision for large dictionaries, and also the flexibility to use non-integer counts.
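To make the two-way lookup and the double-valued counts concrete, here is a minimal sketch in plain Scala. It is not BIDMat's `Dict`: `SimpleDict`, its method names, and the `-1` "not found" convention are all hypothetical and chosen only for illustration.

```scala
// Hypothetical stand-in for illustration only; this is not BIDMat's Dict.
object SimpleDictSketch {
  final case class SimpleDict(terms: Vector[String], counts: Vector[Double]) {
    require(terms.length == counts.length, "one count per term")

    // Reverse index: term -> position. If a term repeats, the last position wins.
    private val index: Map[String, Int] = terms.zipWithIndex.toMap

    // Integer -> string lookup.
    def term(i: Int): String = terms(i)

    // String -> integer lookup; -1 marks "not found" (an assumption of this sketch).
    def id(s: String): Int = index.getOrElse(s, -1)

    // Counts are stored as Doubles, so non-integer weights are allowed.
    def count(s: String): Double = {
      val i = id(s)
      if (i >= 0) counts(i) else 0.0
    }
  }

  // Same terms and counts as the example above.
  val d: SimpleDict = SimpleDict(Vector("you", "me", "them"), Vector(1.0, 2.0, 3.0))
}
```

With this sketch, `SimpleDictSketch.d.term(1)` gives `"me"` and `SimpleDictSketch.d.id("them")` gives `2`, mirroring the transcript above.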
BIDMat, like many other systems, encodes strings as integer references into dictionaries. There is no universal dictionary, and dictionaries are typically created by merging local dictionaries built from subsets of a corpus. When a dictionary is changed, previous references to that dictionary need to be updated. The dictionary class (Dict) includes an operator --> which computes a mapping from one dictionary into another. Let's suppose d1 and d2 are two dictionaries, and i1 and i2 are IMats of references into them. We can first create a merged dictionary d and then compute mappings and updated references like this:
> val d = Dict.union(d1,d2)
> val map1 = d1 --> d
> val map2 = d2 --> d
Now we can use the maps to update the indices. Suppose:
> val d1 = Dict(csrow("it","was","the","best","of","times"))
> val i1 = irow(0,1,2,3,4,5)
> val d2 = Dict(csrow("it","was","the","worst","of","times"))
> val i2 = irow(0,1,2,3,4,5)
> d1(i1)
it,was,the,best,of,times
> d2(i2)
it,was,the,worst,of,times
> val d = Dict.union(d1,d2)
> d.cstr.t
it,was,the,best,of,times,worst
> val dm1 = d1 --> d
...
> val dm2 = d2 --> d
...
> val i1x = dm1(i1)
0,1,2,3,4,5
> val i2x = dm2(i2)
0,1,2,6,4,5
> d(i1x \ i2x)
it,was,the,best,of,times,it,was,the,worst,of,times
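The idea behind the union and the `-->` mapping can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not BIDMat's implementation: the helper names `union`, `mapping`, and `remap` are hypothetical, the merged dictionary is assumed to keep first-seen order, and `-1` is used here to mark a term missing from the target.

```scala
// Hypothetical helpers for illustration; not BIDMat's Dict.union or -->.
object DictMergeSketch {
  // Merged dictionary: all terms of a, then terms of b not already seen.
  def union(a: Vector[String], b: Vector[String]): Vector[String] =
    (a ++ b).distinct

  // mapping(from, to)(i) is the position in `to` of the term from(i),
  // or -1 if that term does not occur in `to`.
  def mapping(from: Vector[String], to: Vector[String]): Vector[Int] = {
    val index = to.zipWithIndex.toMap
    from.map(t => index.getOrElse(t, -1))
  }

  // Rewrite old references so they point into the merged dictionary.
  def remap(m: Vector[Int], refs: Vector[Int]): Vector[Int] =
    refs.map(i => m(i))

  val d1: Vector[String] = Vector("it", "was", "the", "best", "of", "times")
  val d2: Vector[String] = Vector("it", "was", "the", "worst", "of", "times")
  val d: Vector[String] = union(d1, d2)

  val i2: Vector[Int] = Vector(0, 1, 2, 3, 4, 5)
  val i2x: Vector[Int] = remap(mapping(d2, d), i2)
}
```

Here `DictMergeSketch.i2x` evaluates to `Vector(0, 1, 2, 6, 4, 5)`: only "worst" moves, because it is the one term of d2 that was appended at the end of the merged dictionary.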
Since it's common to need the maps when merging dictionaries, a single function is provided that returns the merged dictionary together with both maps:
> val (d, dm1, dm2) = Dict.union3(d1,d2)
By default, the dictionary constructor does not check whether the input strings are unique, so it's possible to create dictionaries with repeated entries. You can use the flatten method to create a valid dictionary from a redundant one:
> val dd = Dict(csrow("the","time","has","come","the","walrus","said"))
> dd.cstr
the,time,has,come,the,walrus,said
> val d = dd.flatten
> d.cstr.t
the,time,has,come,walrus,said
IDicts extend the functionality of string dictionaries to n-grams. An IDict is built from an IMat argument whose rows represent n-grams. A set of bigrams would be represented as an IMat with two columns, trigrams with a 3-column IMat, and unigrams with a single-column IMat. As with Dicts, the IDict constructor does not look for repeated rows. To construct a valid IDict from a collection of n-grams, use this command:
> val a = IMat(floor(3*rand(7,3)))
1 1 1
1 0 1
1 0 0
1 1 1
1 0 1
1 1 0
0 0 0
> val da = IDict.dictFromData(a)
> da.grams
0 0 0
1 0 0
1 0 1
1 1 0
1 1 1
which produces a unique set of (sorted) n-grams. You can also use the flatten method on an already-created dictionary:
> val a = IMat(floor(3*rand(7,3)))
> val da = IDict(a)
> da.grams
1 1 1
1 0 1
1 0 0
1 1 1
1 0 1
1 1 0
0 0 0
> val fa = da.flatten
> fa.grams
0 0 0
1 0 0
1 0 1
1 1 0
1 1 1
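The effect shown above, reducing a collection of n-gram rows to a unique, sorted set, can be sketched in plain Scala. This is only an illustration of the behavior visible in the transcript, not BIDMat's code: `dictFromRows` and `compareRows` are hypothetical names, rows are modeled as `Vector[Int]`, and lexicographic row order is assumed because it matches the output shown.

```scala
// Hypothetical sketch; rows of a Vector[Vector[Int]] stand in for IMat rows.
object NgramDictSketch {
  // Lexicographic comparison of two rows: the first differing entry decides,
  // and a shorter row sorts first if it is a prefix of the other.
  def compareRows(a: Vector[Int], b: Vector[Int]): Int = {
    val firstDiff = a.zip(b).collectFirst {
      case (x, y) if x != y => Integer.compare(x, y)
    }
    firstDiff.getOrElse(Integer.compare(a.length, b.length))
  }

  // Drop repeated rows, then sort what remains.
  def dictFromRows(rows: Vector[Vector[Int]]): Vector[Vector[Int]] =
    rows.distinct.sortWith((a, b) => compareRows(a, b) < 0)

  // The seven trigram rows from the example above.
  val a: Vector[Vector[Int]] = Vector(
    Vector(1, 1, 1), Vector(1, 0, 1), Vector(1, 0, 0), Vector(1, 1, 1),
    Vector(1, 0, 1), Vector(1, 1, 0), Vector(0, 0, 0))

  val grams: Vector[Vector[Int]] = dictFromRows(a)
}
```

Applied to the seven rows above, the sketch leaves the same five rows as the transcript, in the same order.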
IDicts are used to support n-gram features. Each n-gram feature is an index into a row of an IDict dictionary. The entries in an IDict are integers, each of which in turn points to a string in a Dict dictionary. You can merge and compute remappings for IDicts just as you would for Dicts.
> val d = union(d1, d2)
> val d1m = d1 --> d
> val d2m = d2 --> d
which can be done with a single union2 function call, as before:
> val (d, d1m, d2m) = union2(d1, d2)
Since an IDict contains references to a string dictionary, those references also need to change when the string dictionary changes. That mapping should happen before doing anything else with the IDict representation. Here's an example of fully merging some bigram data:
> bg1:IMat // IMat containing bigram data (references bd1)
> d1:Dict // String dictionary
> bd1:IDict // bigram dictionary, k x 2, references d1
> bg2:IMat // IMat containing bigram data (references bd2)
> d2:Dict // String dictionary
> bd2:IDict // bigram dictionary, k x 2, references d2
> val (d, d1m, d2m) = union2(d1, d2)
> val b1 = IDict(d1m(bd1.grams))
> val b2 = IDict(d2m(bd2.grams))
> val (bd, bm1, bm2) = union2(b1, b2)
> val dat1 = bm1(bg1) // New data, references global dictionaries bd, d
> val dat2 = bm2(bg2) // New data, references global dictionaries bd, d
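The order of operations in this merge (string dictionaries first, then bigram rows, then bigram dictionaries, then data) can be traced with a small plain-Scala sketch. Everything here is hypothetical and for illustration only: the tiny vocabularies, the `positions` helper, and the first-seen ordering of the merged dictionaries are assumptions, and none of it uses BIDMat's `Dict`, `IDict`, or `union2`.

```scala
// Hypothetical end-to-end sketch of merging two local bigram datasets.
object BigramMergeSketch {
  // For each element of `from`, its position in `to` (-1 if absent).
  def positions[A](from: Vector[A], to: Vector[A]): Vector[Int] = {
    val index = to.zipWithIndex.toMap
    from.map(x => index.getOrElse(x, -1))
  }

  // Local dataset 1: words, bigram rows (word ids into d1), data (row ids into bd1).
  val d1: Vector[String] = Vector("best", "of", "times")
  val bd1: Vector[Vector[Int]] = Vector(Vector(0, 1), Vector(1, 2)) // "best of", "of times"
  val bg1: Vector[Int] = Vector(0, 1, 1)

  // Local dataset 2, with its own numbering.
  val d2: Vector[String] = Vector("worst", "of", "times")
  val bd2: Vector[Vector[Int]] = Vector(Vector(1, 2), Vector(0, 1)) // "of times", "worst of"
  val bg2: Vector[Int] = Vector(1, 0)

  // Step 1: merge the string dictionaries and compute word-id maps.
  val d: Vector[String] = (d1 ++ d2).distinct
  val d1m: Vector[Int] = positions(d1, d)
  val d2m: Vector[Int] = positions(d2, d)

  // Step 2: rewrite bigram rows to global word ids BEFORE merging them,
  // otherwise equal rows from different sources would not compare equal.
  val b1: Vector[Vector[Int]] = bd1.map(row => row.map(w => d1m(w)))
  val b2: Vector[Vector[Int]] = bd2.map(row => row.map(w => d2m(w)))

  // Step 3: merge the bigram dictionaries and compute row-id maps.
  val bd: Vector[Vector[Int]] = (b1 ++ b2).distinct
  val bm1: Vector[Int] = positions(b1, bd)
  val bm2: Vector[Int] = positions(b2, bd)

  // Step 4: rewrite the data so it references the global dictionaries bd and d.
  val dat1: Vector[Int] = bg1.map(i => bm1(i))
  val dat2: Vector[Int] = bg2.map(i => bm2(i))

  // Decode a global bigram id back to text, to check nothing changed meaning.
  def decode(i: Int): String = bd(i).map(w => d(w)).mkString(" ")
}
```

In this sketch the bigram "of times" is row 1 in the first local dictionary and row 0 in the second, yet after the merge both datasets refer to it by the same global row, which is the point of remapping the rows in step 2 before taking the union in step 3.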