On Tue, 31 May 2011 13:06:21 +0000 (UTC), dutour@quatramaran.ens.fr
(Mathieu Dutour) wrote:
>I want to be able to store some numbers (in an interval [1..N] with N small
>about 200) in a set X and be able to do the following
>---test if a number i is inside
>---remove an element of X
>---add an element to X.
>---get one element from X if it is non-empty.
>And I want the maximum speed.
>
>Right now, I am using the following structure
>
>typedef struct {
> int MaxElement;
> int *ListElementInSubset;
> int *ListNext;
> int *ListPrev;
>} IntegerSubsetStorage;
>
>That is the code reads as
>---MaxElement is the number N
>---ListElementInSubset has size N and is a 0/1 vector to answer to the
> first question.
>---ListNext and ListPrev are array of size N+2 that indicate what is the
> next element in X and what is the previous element in X (this allows
> to answer the last 3 questions).
>
>All this works fine and I can answer all questions without any iterative
>loop. But can it be done better?
Nice little problem. Here is a list of the different answers you got
plus a couple not mentioned.
Storage:
Mathieu Dutour -- Doubly linked list in two arrays + a {1,0} array
Gene -- Doubly linked list in one array
Stefan Ram -- Int array with {1,0} entries with linear search
China Blue Angels -- Bit vector, keep min and max
Roberto Waltman -- Bit vector, short search for get
Richard Harter -- Tiered bit vector (see below)
Richard Harter -- The devil's list algorithm (see below)
Richard Harter -- The packed set algorithm
One of the problems with asking for the "best" solution is that there
is no best solution for all environments. "Fast" depends upon sundry
considerations. Thus in the past century execution times were roughly
proportional to instructions. In this century cache misses are often
the dominant factor affecting execution times. In algorithm analysis
it is customary to characterize algorithms by their big O behaviour.
However this often is inappropriate for small n. Another factor is
that the cost of very simple functions is dominated by context
switching costs. Etc.
All of that said, it is useful to look at (a) the instruction cost,
and (b) the storage used. The storage costs are:
Storage use:
Devil's list algorithm: 3 words per item
Doubly linked list: 3 words per item (Dotour)
Doubly linked list: 2 words per item (Gene)
Packed set algorithm: 2 words per item
Int list, {0,1} entries: 1 word per item
Bit vector: 1 bit per item
Instruction costs:
The linked list variants are all O(1); however they all take about 5
statements per operation with dereferencing costs. The integer list
with all entries 0 or 1 requires the fewest statements per operation
except for "get" which has a potential O(n) cost. (See below).
The bit vector variants all have about 3-5 bit manipulations per
operation. They are cheaper than the linked list variants because
they have lower dereferencing costs. The "keep min and max" variant
has O(n) costs and cannot be recommended. The others are O(1) for all
operations.
Manipulating bit vectors:
One disadvantage of using bit vectors is that they can be word size
dependent; I prefer to use unsigned char arrays for that reason; YMMV.
Another is that you need to be familiar with the bit hacks. Frex
(check these, it's been ages and I probably have goofed.)
unsigned char v[25];
v[x>>3] |= 1<<(x&7); // Sets the x'th bit 1
v[x>>3] &= ~(1<<(x&7)); // Unsets the x'th bit
The formula for finding the lowest bit in x that is set is
x & ~(x-1)
Use this for the "get" operation.
Tiered bit vectors:
The idea here is to a bit vector for the set/unset elements, and a
second bit vector for the zero/non-zero chars/words in the base bit
vector. Thus if you are setting a bit in the j'th char, also set the
j'th bit in the upper bit vector (one char in this case). If you are
unsetting a bit in the j'th char and doing so clears the j'th char,
then unset the j'th bit in the upper bit vector.
For the get operation use the lowest bit formula to find a set bit in
upper bit vector; this tells you which char to extract an int from.
Tiered bit vectors are more expensive per operation (get excepted)
than a simple bit vector but they are O(log n) for gets rather than
the O(n) search needed for a simple bit vector. For n ~= 200 a tiered
bit vector is overkill.
The Devil's list algorithm:
The devil's algorithm for deleting an item from a singly linked list
is as follows: Let x be the the node holding the item; replace the
contents of x by the contents of x->link. This actually works
provided two conditions are met: (a) the list is terminated with a
node that will never be deleted, and (b) all references to the former
x->link node must be updated.
For this problem we need a singly linked list with nodes {i,link} and
an array of node pointers that, for each element, holds a pointer to
the corresponding list node. When an item is deleted the entry for
the deleted integer is cleared and the entry moved integer is updated.
It turns out that the Devil's list algorithm uses slightly fewer
statements than a doubly linked list implementation; however it uses
more storage.
The packed set algorithm:
In the packed set algorithm the integers in the set are kept in an
array without gaps. When an item is added it is added to the end of
the array and the array length is incremented. When an item is
deleted the item at the end replaces the deleted item and the array
length is decremented.
In this context we need a second array. Each location in the second
array holds the location of the integer in the packed array. When an
item is deleted/added the entry in the second array is updated.
The packed set algorithm is an O(1) algorithm. It uses fewer
statements per operation than using a doubly linked list and no more
space.
Behaviour of the integer list search algorithm:
Briefly, the search tend to take O(m) probes where m is the number of
empty slots in the array. The reason is that the search scans from
one side to the other. Over time the end not searched becomes full.
New additions are made into the empty end are rapidly removed. This
effect is minimal if m is small or if the ratio of delete to gets is
large.
Conclusions:
The choices from best to worst probably are:
Simple bit vector
Tiered bit vector
* Array of {0,1} integers
Packed set array
Doubly linked list
Devil's list algorithm
* "Get" may be expensive
In any particular situation the results may be different.