Asp Forum - Re: A use case for an ordered hash

thoran@thoran.com

8/22/2006 5:27:00 PM

Hi,

I switched from 'my' thread because this seems more appropriate to
this thread...

Previously "Re: Why Does Hash Apparently Reorder Its Internal
Representation And Other Associated Ponderings"

Or, Is It Possible To Have One's Cake And Eat It Too?

First, a taxa of the different sorts of beasts of which we are
speaking might help:

a. associative array
i. unordered
ii. insertion order-ordered
iii. sort function-ordered
1. manual (adhoc)
2. automatic (upon insertion)

b. hash
i. unordered
ii. insertion order-ordered
iii. sort function-ordered
1. manual (adhoc)
2. automatic (upon insertion)

Does this cover it or is there a redundancy or something missing? For
instance are a.i. and a.ii. really the same thing by virtue of there
not being a hashing function? Does a.i. make any sense and can be
done away with from the get-go? And are a.ii. and b.ii, and a.iii.
and b.iii. really the same thing also? Not if the hash still retained
the hash as you would it expect it to.

And is there a term which could reasonably cover all of a. and b.? I
might have used the term 'hash' to refer to anything which is a
key-value pair, but this is apparently confusing or inaccurate.

I tried set, but that's wrong. I tried enumerable, but that's wrong
too.

I propose that the term, list, be used to cover anything which would
be able to mix in Enumerable and that the constituent parts of a list
are called items, rather than an element, which is a set/array term
and not appropriate for a hash or associative array. Still that
doesn't give a name for associative lists. I love that when it just
comes out like that... AssociativeList it is then!

And so are these three sorts of data structures (array, associative
array, and hash) sufficiently different to warrant none subclassing
from the other? If so, then it would also probably warrant a syntax
addition and a core implementation (as discussed). And they are
presumably sufficiently similar that it might make sense to subclass
all of them from the same base class; list for instance, and then
mixin Enumerable there instead. Although one may as well just put the
methods in directly to List then however?

There have been concerns that a sort function-ordered or an
insertion-ordered, or just ordered hash will cause slowdowns and
memory bloat, so I was wondering if it was possible to adjust
according to need. So, thank you to Charles, as I am reminded of what
I did in compsci all that time ago!

Now given that it is possible or desirable to have a Hash redefine the
number of buckets, or rehash, is there any reason why it can't also
reassign its internal type at a deeper level depending on access?

Thought experiment time...

What if all Hashes were able to be ordered, but rehashable for storage
type depending on how it is instantiated and how it is accessed? One
could specify storage organisation at instantiation or later, or allow
it to decide for itself what sort of hash it will be.

m = [:a => 'a'] # m is for map. I could go for this term too as it
has brevity.
=> [:a => 'a']

m[0]
=> [:a => 'a'] # m remains as an associative array.

m[:a]
=> {:a => 'a'} # m becomes an explicit hash.

m[0]
=> {:a => 'a'} # m becomes an associative array.

m[:a]
=> {:a => 'a'} # m becomes an explicit hash.

m[[0]]
=> [:a => 'a'] # m becomes an associative array.

{} = m, or
m {}= m, or
m.to_h
=> {:a => 'a'}

This would be read as "make m equal to itself if it is already a hash,
else make it so". Similarly, going the other way:

[]= m, or
m []= m, or
m.to_assoc
=> [:a => 'a']

Then perhaps one might like to also do something like this?...

n = [:b => 'b']
=> [:b => 'b']

o {}= n
=> {:b => 'b'}

Useful? Anyhow, that was an aside.

Now, why not take this further?... What about a generalized list? As
in:

l = [:a => 'a', :b, {:c => 'c'}] # which is really, or at least
appears to be as if it were: {0 => [:a => 'a'], 1 => :b, 2 => {:c =>
'c'}}

The method of access determines how it is to be treated...

l{0}
=> {:a => 'a'} # or should that be {0 => [:a => 'a']}?

l[[0]]
=> 'a'

l[0]
=> [:a => 'a']

l{1}
=> {1 => :b} # or should that be {1 => {:b => nil}}?

l[1]
=> [1 => :b] # or should that simply be :b?

l[[1]]
=> :b

l[2]
=> {:c => 'c'}

l[[2]]
=> 'c'

l{2}
=> {2 => {:c => 'c'}} # or should that be {:c => 'c'}?

Having only lists might then do away with the literal instantiation of
an empty AssociativeArray problem (assuming a double bracket
notation):

a = [[]] # Is this a nested array or an associative array?
=> [[]]

list = []
array = [] # Really a list.
hash = {} # Explicit hash list sub-type declaration.

list = {:a => 'a'}
=> {:a => 'a'}

By assigning a non key-value pair to the hash it becomes a more
general list internally and this is represented by the form of
notation, from {} to [].

list[1] = 'b'
=> [:a => 'a', 'b']

In a list the numeric value of an item is always present, so an
assignment which uses a numeric index will assume that it is a simple
placement at the position given by the numeric index of whatever
object is presented to the list.

An assignment which treats this as an associative array will work the
same as if it were an assignment without:

list[[2]] = 'c'
=> [:a => 'a', 'b', 'c']

Another example might be good:

list[3] = {:d => 'd'}
=> [:a => 'a', 'b', 'c', {:d => 'd'}]

list[3]
=> {:d => 'd'}

list[[4]] = {:e => 'e'} # Allowable?
=> [:a => 'a', 'b', 'c', {:d => 'd'}, {:e => 'e'}]

list[4]
=> {:e => 'e'}

list[[4]]
=> 'e'

However,

list[0]
=> [:a => 'a']

list[[0]]
=> 'a'

This is all of the top of my head, so it probably needs some work...

And as for a dynamically altering Hash, List also could change its
internal organisation on the fly depending on need. Such that if all
elements are homogenous, then the internal organisation would be
optimised for that homogeneity. If that changes to be heterogenous,
then a more general and less optimised form of internal organisation
can be chosen dynamically to accomodate.

I realise the overhead if types and methods of access changed all the
time, but does it?

t

P.S. The irony of my apparently having no thoughts about language
design is choking me.

1 Answer

Matthew Johnson

8/22/2006 11:34:00 PM

I read the post regarding a taxonomy of hash-like data structures
with interest. It is something I have been thinking about a bit
lately (as well as the more general subject of all collections). It
may be appropriate to back up a step and consider the idea I
mentioned in another thread regarding the naming of concepts. Doing
that will allow more powerful duck typing in the design.

Ruby already defines Enumerable and Comparable. By depending upon
#each Enumerable the requirements / concept that must be met by
classes that include it. To keep things simple we can call the
concpet Enumerable as well. Here are some other potential concepts
that may fit nicely in Ruby (please focus more on the definition /
duck typing than on the names):

Integral
supports to_int
MutableEnumerable
depends on each, add, remove
Hashable
depends up hash
Associative
depends on fetch, store, possibly remove, each key, and each value
Unique
depends a single value per key, duplicate keys could be either
rejected or overwritten
Multi Valued
depends upon values being Enumerable
Sequencable
depends on each, insert, remove, and possibly fetch and store
Sorted / Ordered
see below

This allows us to define mixins which depend upon these operations or
properties. For example, a MutableEnumerable mixin can be defined
that implements set operations, compact, clear, remove_if, etc. An
Sequencable mixin can be defined that implements all sorts of
operations such as append, concat, splice, sort, etc. The dependency
of a Sorted / Ordered mixin would be a bit different. It would
depend upon a property of the a class mixing it in rather than
operations. For example it could define merge and binary search
operations.

A Pair concept depending upon first and second may also be
desirable. It allows things such as implementing each for
Associative that takes a single parameter.

It is clear that there are some relationships between the various
concepts. For example, some classes implementing Associative
behavior are likely to depend upon keys that implement Hashable
behavior and Sequencable classes are also Associative if we assume
Integral keys. Also, in some cases a user may know about the state
of a particular object that would allow them to leverage a mixin that
otherwise wouldn't apply to a class. For example, extending an
instance of Array with Sorted if the array is known to be sorted.

Now returning to the thread at hand we can see that the difference
between the associative array and hash hierarchies is that the hash
hierarchy depends upon Hashable keys (please forgive me if this is
stating the obvious). Of course they would also have different
implementations of primitive methods. However, many higher level
operations that depend upon those primitives would be the same and
could indeed be implemented in mixins that expect these primitive
operations. The concrete implementations would also be welcome to
implement optimizations and additional parameter to higher level
operations that are not possible without knowledge of the
implementation of the data structure, but this is a bonus and is not
strictly necessary. By clearly defining small groups of *naturally*
cohesive primitive operations and naming the group (the concepts) we
can gain the ability to create concrete implementations with a
variety of properties that all support the rich higher level
operations that can be supported by their primitives for free (almost
for free - the mixins do each need to be implemented once, but will
then be available to all concrete implementations support the
operations the depend upon). We are now exploiting the power of
combinations rather than letting it run away from us.

The properties I see being important are:
Associativity
Integral (ex. Array), Hashable (ex. Hash), any object (ex.
associative array)
Ordering
insertion order, key order (by <=>), value order (by <=>), any sort
function
Quantity
any value, unique valued, multi-valued
Sequencability
Mutability
probably less important in Ruby, but immutable variants may be
desirable for various reasons

The taxonomy in terms of the concepts (please keep in mind that there
are probably better names for the concepts than the names I am using):

> a. associative array
> i. unordered
Associative, MutableEnumerable, Sequencable?
> ii. insertion order-ordered
Associative, MutableEnumerable, Sorted
> iii. sort function-ordered
> 1. manual (adhoc)
Associative, MutableEnumerable, Sequencable
> 2. automatic (upon insertion)
Associative, MutableEnumerable, Sorted
>
> b. hash
> i. unordered
Associative, MutableEnumerable, Hashable elements
> ii. insertion order-ordered
Associative, MutableEnumerable, Sorted, Hashable elements
> iii. sort function-ordered
> 1. manual (adhoc)
Associative, MutableEnumerable, Sequencable, Hashable elements
> 2. automatic (upon insertion)
Associative, MutableEnumerable, Sorted, Hashable elements

As we can see, sequencability was not explicitly mentioned in the
previous post, although it was implied by the ad-hoc sort. In order
to be ad-hoc sortable by an arbitrary sort function the ability to re-
order elements must be supported. Once we have this we actually have
much more general capabilities of arbitrarily sequencing elements
whether the sequence is sorted or not. We can also see that
automatic ordering by key, value, and automatic ordering of arrays
may also be useful. We can generate further types of collection
structures based on a unique value per key or enumerable values in
addition to the default of an arbitrary quantity of values per key.

> For instance are a.i. and a.ii. really the same thing by virtue of
> there not being a hashing function?

I would consider a.i to be most likely insertion ordered by default
(when created with literal syntax for example), but should proabably
be Sequencable and thus the ordering may not always be maintained.
Also, insert and other such operations may be supported thus allowing
items to be inserted without being "insertion sorted" (in terms of
the actual chronological insertion order).

> Does a.i. make any sense and can be done away with from the get-go?

It is definitely an independent type of data structure with its own
use cases.

> And are a.ii. and b.ii, and a.iii. and b.iii. really the same thing
> also? Not if the hash still retained the hash as you would it
> expect it to.

No. Using hashable elements is different and is likely to increase
the performance of fetch operations.

> And is there a term which could reasonably cover all of a. and b.?
> I might have used the term 'hash' to refer to anything which is a
> key-value pair, but this is apparently confusing or inaccurate.

Both a and b are associative.

> I tried set, but that's wrong. I tried enumerable, but that's
> wrong too.

A set essentially an associative collection where the elements are
objects. Conceptually a hash table, an array, and an associative
array are associative based on key or index pairs.

> And so are these three sorts of data structures (array, associative
> array, and hash) sufficiently different to warrant none subclassing
> from the other? If so, then it would also probably warrant a
> syntax addition and a core implementation (as discussed). And they
> are presumably sufficiently similar that it might make sense to
> subclass all of them from the same base class; list for instance,
> and then mixin Enumerable there instead. Although one may as well
> just put the methods in directly to List then however?

I hope I have been able to shed some light on the similarities and
differences of them. The implementation of primitive operations are
different leading to different potential operations and performance
characteristics while many of the higher level operations are the
same and can be shared via mixins.

> There have been concerns that a sort function-ordered or an
> insertion-ordered, or just ordered hash will cause slowdowns and
> memory bloat, so I was wondering if it was possible to adjust
> according to need. So, thank you to Charles, as I am reminded of
> what I did in compsci all that time ago!

The ability to select the features you require for a given
application would be very desirable indeed. Here's a syntax idea -
let's borrow the idea of regex options where we can have /a regex/ix
and use option characters to specify what concepts should be
supported. {}s could be sequencable, {}i could be insertion ordered,
{}k would be key ordered {}v would be value ordered, etc. The same
could then be implemented for associative arrays [key => value]v for
a value ordered associative array. We could even define ordering for
regular arrays if desired [a, b, c]o would be ordered by <=> on the
value of a, b, and c and maintain ordering when items are added.
Some of these options may be of debatable value, but I think the
notation is elegant, concise, and consistent with our current notation.

This post has discussed duck typed concepts in the context of
collections, but the usefulness of duck typed concepts are certainly
not limited to collections. Their essence is to identify and name a
set *naturally* cohesive group of requirements on a type. These
requirements can be operations, operations of an associated type (ex.
Enumerable requiring elements to support <=> for sort operations to
work),

Matthew

comp.lang.ruby

Re: A use case for an ordered hash

thoran@thoran.com

Matthew Johnson

x Login to ForumsZone