Matthew Johnson
8/22/2006 11:34:00 PM
I read the post regarding a taxonomy of hash-like data structures
with interest. It is something I have been thinking about a bit
lately (as well as the more general subject of all collections). It
may be appropriate to back up a step and consider the idea I
mentioned in another thread regarding the naming of concepts. Doing
that will allow more powerful duck typing in the design.
Ruby already defines Enumerable and Comparable. By depending upon
#each Enumerable the requirements / concept that must be met by
classes that include it. To keep things simple we can call the
concpet Enumerable as well. Here are some other potential concepts
that may fit nicely in Ruby (please focus more on the definition /
duck typing than on the names):
Integral
supports to_int
MutableEnumerable
depends on each, add, remove
Hashable
depends up hash
Associative
depends on fetch, store, possibly remove, each key, and each value
Unique
depends a single value per key, duplicate keys could be either
rejected or overwritten
Multi Valued
depends upon values being Enumerable
Sequencable
depends on each, insert, remove, and possibly fetch and store
Sorted / Ordered
see below
This allows us to define mixins which depend upon these operations or
properties. For example, a MutableEnumerable mixin can be defined
that implements set operations, compact, clear, remove_if, etc. An
Sequencable mixin can be defined that implements all sorts of
operations such as append, concat, splice, sort, etc. The dependency
of a Sorted / Ordered mixin would be a bit different. It would
depend upon a property of the a class mixing it in rather than
operations. For example it could define merge and binary search
operations.
A Pair concept depending upon first and second may also be
desirable. It allows things such as implementing each for
Associative that takes a single parameter.
It is clear that there are some relationships between the various
concepts. For example, some classes implementing Associative
behavior are likely to depend upon keys that implement Hashable
behavior and Sequencable classes are also Associative if we assume
Integral keys. Also, in some cases a user may know about the state
of a particular object that would allow them to leverage a mixin that
otherwise wouldn't apply to a class. For example, extending an
instance of Array with Sorted if the array is known to be sorted.
Now returning to the thread at hand we can see that the difference
between the associative array and hash hierarchies is that the hash
hierarchy depends upon Hashable keys (please forgive me if this is
stating the obvious). Of course they would also have different
implementations of primitive methods. However, many higher level
operations that depend upon those primitives would be the same and
could indeed be implemented in mixins that expect these primitive
operations. The concrete implementations would also be welcome to
implement optimizations and additional parameter to higher level
operations that are not possible without knowledge of the
implementation of the data structure, but this is a bonus and is not
strictly necessary. By clearly defining small groups of *naturally*
cohesive primitive operations and naming the group (the concepts) we
can gain the ability to create concrete implementations with a
variety of properties that all support the rich higher level
operations that can be supported by their primitives for free (almost
for free - the mixins do each need to be implemented once, but will
then be available to all concrete implementations support the
operations the depend upon). We are now exploiting the power of
combinations rather than letting it run away from us.
The properties I see being important are:
Associativity
Integral (ex. Array), Hashable (ex. Hash), any object (ex.
associative array)
Ordering
insertion order, key order (by <=>), value order (by <=>), any sort
function
Quantity
any value, unique valued, multi-valued
Sequencability
Mutability
probably less important in Ruby, but immutable variants may be
desirable for various reasons
The taxonomy in terms of the concepts (please keep in mind that there
are probably better names for the concepts than the names I am using):
> a. associative array
> i. unordered
Associative, MutableEnumerable, Sequencable?
> ii. insertion order-ordered
Associative, MutableEnumerable, Sorted
> iii. sort function-ordered
> 1. manual (adhoc)
Associative, MutableEnumerable, Sequencable
> 2. automatic (upon insertion)
Associative, MutableEnumerable, Sorted
>
> b. hash
> i. unordered
Associative, MutableEnumerable, Hashable elements
> ii. insertion order-ordered
Associative, MutableEnumerable, Sorted, Hashable elements
> iii. sort function-ordered
> 1. manual (adhoc)
Associative, MutableEnumerable, Sequencable, Hashable elements
> 2. automatic (upon insertion)
Associative, MutableEnumerable, Sorted, Hashable elements
As we can see, sequencability was not explicitly mentioned in the
previous post, although it was implied by the ad-hoc sort. In order
to be ad-hoc sortable by an arbitrary sort function the ability to re-
order elements must be supported. Once we have this we actually have
much more general capabilities of arbitrarily sequencing elements
whether the sequence is sorted or not. We can also see that
automatic ordering by key, value, and automatic ordering of arrays
may also be useful. We can generate further types of collection
structures based on a unique value per key or enumerable values in
addition to the default of an arbitrary quantity of values per key.
> For instance are a.i. and a.ii. really the same thing by virtue of
> there not being a hashing function?
I would consider a.i to be most likely insertion ordered by default
(when created with literal syntax for example), but should proabably
be Sequencable and thus the ordering may not always be maintained.
Also, insert and other such operations may be supported thus allowing
items to be inserted without being "insertion sorted" (in terms of
the actual chronological insertion order).
> Does a.i. make any sense and can be done away with from the get-go?
It is definitely an independent type of data structure with its own
use cases.
> And are a.ii. and b.ii, and a.iii. and b.iii. really the same thing
> also? Not if the hash still retained the hash as you would it
> expect it to.
No. Using hashable elements is different and is likely to increase
the performance of fetch operations.
> And is there a term which could reasonably cover all of a. and b.?
> I might have used the term 'hash' to refer to anything which is a
> key-value pair, but this is apparently confusing or inaccurate.
Both a and b are associative.
> I tried set, but that's wrong. I tried enumerable, but that's
> wrong too.
A set essentially an associative collection where the elements are
objects. Conceptually a hash table, an array, and an associative
array are associative based on key or index pairs.
> And so are these three sorts of data structures (array, associative
> array, and hash) sufficiently different to warrant none subclassing
> from the other? If so, then it would also probably warrant a
> syntax addition and a core implementation (as discussed). And they
> are presumably sufficiently similar that it might make sense to
> subclass all of them from the same base class; list for instance,
> and then mixin Enumerable there instead. Although one may as well
> just put the methods in directly to List then however?
I hope I have been able to shed some light on the similarities and
differences of them. The implementation of primitive operations are
different leading to different potential operations and performance
characteristics while many of the higher level operations are the
same and can be shared via mixins.
> There have been concerns that a sort function-ordered or an
> insertion-ordered, or just ordered hash will cause slowdowns and
> memory bloat, so I was wondering if it was possible to adjust
> according to need. So, thank you to Charles, as I am reminded of
> what I did in compsci all that time ago!
The ability to select the features you require for a given
application would be very desirable indeed. Here's a syntax idea -
let's borrow the idea of regex options where we can have /a regex/ix
and use option characters to specify what concepts should be
supported. {}s could be sequencable, {}i could be insertion ordered,
{}k would be key ordered {}v would be value ordered, etc. The same
could then be implemented for associative arrays [key => value]v for
a value ordered associative array. We could even define ordering for
regular arrays if desired [a, b, c]o would be ordered by <=> on the
value of a, b, and c and maintain ordering when items are added.
Some of these options may be of debatable value, but I think the
notation is elegant, concise, and consistent with our current notation.
This post has discussed duck typed concepts in the context of
collections, but the usefulness of duck typed concepts are certainly
not limited to collections. Their essence is to identify and name a
set *naturally* cohesive group of requirements on a type. These
requirements can be operations, operations of an associated type (ex.
Enumerable requiring elements to support <=> for sort operations to
work),
Matthew