Asp Forum - Re: Ranking algorithm

Victor Bazarov

10/16/2008 4:16:00 PM

Giuliano Bertoletti wrote:
> Hello,
>
> I'm looking for a clever way to solve this problem.
>
> Suppose I've been given M > N items, each one with a score associated.
> I need to extract the top N items, sorted by score (decreasing order).
>
> The point is that I would typically have M >> N (far greater than) and I
> cannot simply store all M items in memory and perform a sort operation.
>
> I need to process items on the fly, as soon as I see them.
>
> My best guess is to use a list and delete items at the end as soon as
> its size grows larger than N, but I have a linear time item insertion
> (to keep it properly sorted) which I would like to avoid if possible.
>
> Any suggestions ?

Start with an empty 'std::set' and after inserting another value, check
the size, and if it's greater than N, remove the first in the set. That
way you can always keep the size of the set equal to N, and have those
elements sorted. Insertion into a set is has O(log(N)), I believe.

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask

23 Answers

Giuliano Bertoletti

10/16/2008 6:11:00 PM

Good. It seems to be working.

I just have to be carefull when defining the < operator since it has to
correctly handle different items with the same score.

For example:

======================= cut =======================

class CItem {
public:
char name[20];
int score;

public:
bool operator<(const CItem &I) const
{
// if i were to compare only scores, items with the same
// score would just overwrite one another.

if(score != I.score) return score < I.score;
return strcmp(name,I.name) < 0;
}
};

======================= cut =======================

Thank you

Regards,
Giuliano Bertoletti.

Victor Bazarov ha scritto:
> Giuliano Bertoletti wrote:
>> Hello,
>>
>> I'm looking for a clever way to solve this problem.
>>
>> Suppose I've been given M > N items, each one with a score associated.
>> I need to extract the top N items, sorted by score (decreasing order).
>>
>> The point is that I would typically have M >> N (far greater than) and
>> I cannot simply store all M items in memory and perform a sort operation.
>>
>> I need to process items on the fly, as soon as I see them.
>>
>> My best guess is to use a list and delete items at the end as soon as
>> its size grows larger than N, but I have a linear time item insertion
>> (to keep it properly sorted) which I would like to avoid if possible.
>>
>> Any suggestions ?
>
> Start with an empty 'std::set' and after inserting another value, check
> the size, and if it's greater than N, remove the first in the set. That
> way you can always keep the size of the set equal to N, and have those
> elements sorted. Insertion into a set is has O(log(N)), I believe.
>
> V

Juha Nieminen

10/17/2008 11:49:00 AM

Victor Bazarov wrote:
> Start with an empty 'std::set' and after inserting another value, check
> the size, and if it's greater than N, remove the first in the set.

If N is very large, std::set can be a real memory hog.

If a more memory-efficient algorithm is needed, then a heap can be
used. (The exact same type of heap as used in heap sorting.) Simply make
the element comparison so that the element with the lowest score ends up
as the first element of the heap. When the heap grows one element larger
than N, remove this first element.

Inserting an element in the heap is O(lg n), and removing the first
element is also O(lg n) (which is the reason why heap sort is
O(n lg n).) This is the same complexity as with std::set, the difference
being that the heap is much more memory-efficient: an array to store N+1
elements is enough.

Victor Bazarov

10/17/2008 12:58:00 PM

Juha Nieminen wrote:
> Victor Bazarov wrote:
>> Start with an empty 'std::set' and after inserting another value, check
>> the size, and if it's greater than N, remove the first in the set.
>
> If N is very large, std::set can be a real memory hog.
>
> If a more memory-efficient algorithm is needed, then a heap can be
> used. (The exact same type of heap as used in heap sorting.) Simply make
> the element comparison so that the element with the lowest score ends up
> as the first element of the heap. When the heap grows one element larger
> than N, remove this first element.
>
> Inserting an element in the heap is O(lg n), and removing the first
> element is also O(lg n) (which is the reason why heap sort is
> O(n lg n).) This is the same complexity as with std::set, the difference
> being that the heap is much more memory-efficient: an array to store N+1
> elements is enough.

Inserting in the middle would mean either reallocation or moving of the
trailing elements, would it? That's why it will actually be slower than
the set. Right?

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask

Kai-Uwe Bux

10/17/2008 1:44:00 PM

Victor Bazarov wrote:

> Juha Nieminen wrote:
>> Victor Bazarov wrote:
>>> Start with an empty 'std::set' and after inserting another value, check
>>> the size, and if it's greater than N, remove the first in the set.
>>
>> If N is very large, std::set can be a real memory hog.
>>
>> If a more memory-efficient algorithm is needed, then a heap can be
>> used. (The exact same type of heap as used in heap sorting.) Simply make
>> the element comparison so that the element with the lowest score ends up
>> as the first element of the heap. When the heap grows one element larger
>> than N, remove this first element.
>>
>> Inserting an element in the heap is O(lg n), and removing the first
>> element is also O(lg n) (which is the reason why heap sort is
>> O(n lg n).) This is the same complexity as with std::set, the difference
>> being that the heap is much more memory-efficient: an array to store N+1
>> elements is enough.
>
> Inserting in the middle would mean either reallocation or moving of the
> trailing elements, would it?

No. Juha is talking about heap operations, [25.3.6]. Insertion and deletion
of an element means swapping at most log(N) elements.

> That's why it will actually be slower than the set. Right?

Depends on how costly 2*log(N) swaps are in comparison to allocation and
deallocation of a node. The number of comparisons is the same.

Best

Kai-Uwe Bux

James Kanze

10/17/2008 4:48:00 PM

On Oct 17, 3:43 pm, Kai-Uwe Bux <jkherci...@gmx.net> wrote:
> Victor Bazarov wrote:
> > Juha Nieminen wrote:
> >> Victor Bazarov wrote:
> >>> Start with an empty 'std::set' and after inserting another
> >>> value, check the size, and if it's greater than N, remove
> >>> the first in the set.

> >> If N is very large, std::set can be a real memory hog.

> >> If a more memory-efficient algorithm is needed, then a heap
> >> can be used. (The exact same type of heap as used in heap
> >> sorting.) Simply make the element comparison so that the
> >> element with the lowest score ends up as the first element
> >> of the heap. When the heap grows one element larger than N,
> >> remove this first element.

> >> Inserting an element in the heap is O(lg n), and removing
> >> the first element is also O(lg n) (which is the reason why
> >> heap sort is O(n lg n).) This is the same complexity as
> >> with std::set, the difference being that the heap is much
> >> more memory-efficient: an array to store N+1 elements is
> >> enough.

> > Inserting in the middle would mean either reallocation or
> > moving of the trailing elements, would it?

> No. Juha is talking about heap operations, [25.3.6]. Insertion
> and deletion of an element means swapping at most log(N)
> elements.

It may also involve some copying if you need to increase the
array size in order to do the insert. Depending on the data
types involved (and the size of the array), either:

v.resize( v.size() + 1 ) ;
std::push_heap( v.begin(), v.end(), newElement ) ;

or

v.insert( std::upperBound( v.begin(), v.end(), newElement ),
newElement ) ;

may be faster.

Of course, since the OP knows the target size to begin with, he
can construct the vector with the correct size, filled with
dummy elements which are guaranteed to rank lower, and then use
the heap operations (inversing the comparison function, so that
the first element is the lowest rank); comparing each element he
sees with the first element, then if it ranks higher, swapping
the first element with the last and using push_heap. If the
element type is optimized for swapping (as are all of the
standard containers---but not std::string, so if he needs to
store a string, he might consider using std::vector<char>
instead), this should be a lot faster than the alternatives.

> > That's why it will actually be slower than the set. Right?

> Depends on how costly 2*log(N) swaps are in comparison to
> allocation and deallocation of a node. The number of
> comparisons is the same.

Just a guess, but if he uses std::string, std::set will be
faster; if he uses std::vector<char>, in a struct, and provides
an intelligent swap, the heap operations will be faster.
Unless, of course, the poorer locality of std::set causes more
page misses.

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Stephen Horne

10/18/2008 12:33:00 PM

On Fri, 17 Oct 2008 09:48:16 -0700 (PDT), James Kanze
<james.kanze@gmail.com> wrote:

>Of course, since the OP knows the target size to begin with, he
>can construct the vector with the correct size, filled with
>dummy elements which are guaranteed to rank lower

Sounds to me as if the standard heap algorithms approach is a bit
fiddly.

priority_queue doesn't seem to provide access to the underlying
container, which is a shame - it would allow the use of the reserve
method for an underlying vector.

Using heaps may be inappropriate anyway, if there's a need to read off
the current ranking order non-destructively. An std::set can do this
with a simple and efficient traversal. With a heap, you probably end
up extracting all the items any, then rebuilding the heap before
continuing. Though with heap algorithms, at least you have the option
of copying the heap and destructively reading the copy.

*If* std::set manages to cause memory issues, this is probably a case
for a custom allocator.

Juha Nieminen

10/19/2008 10:01:00 AM

Stephen Horne wrote:
> Sounds to me as if the standard heap algorithms approach is a bit
> fiddly.

I don't understand why.

> Using heaps may be inappropriate anyway, if there's a need to read off
> the current ranking order non-destructively.

It didn't sound like the original poster would want to read the
current "top n highest-scored entries, sorted by score" in the middle of
processing the data (rather than doing it after all the data has been
processed).

Anyways, if he needs to do that, he could simply *sort* the array of
elements. Given that he already has a heap built into it, he could just
as well continue with heap sort to get a sorted array. The starting
heapifying step can be skipped in this heap sort because the array is
already heapified. After he has done this, if he needs to continue
processing the input, he can re-heapify the array in O(n lg n). (And in
fact it might be possible to heapify an array which is known to be
sorted in O(n). I don't know this for sure, though.)

Sure, doing this is not as fast as with a std::set, but the whole idea
was to consume less memory. All that is needed here is memory for an
array of N+1 elements.

> *If* std::set manages to cause memory issues, this is probably a case
> for a custom allocator.

A custom allocator won't help. Each element of std::set takes, besides
the element itself, three pointers plus a flag, plus whatever ancillary
data the memory allocator requires. Even if you are able to create an
allocator which doesn't require *any* ancillary data for an allocated
block, you still can't get rid of those three pointers and the flag.

With a heap each element only requires the size of the element and
that's it, because a heap can be built into an array.

Martin Eisenberg

10/19/2008 12:15:00 PM

Juha Nieminen wrote:

> (And in fact it might be possible to heapify an array which is
> known to be sorted in O(n). I don't know this for sure, though.)

A sorted array *is* a heap, no?

Martin

--
Whatever you do will be insignificant,
but it is very important that you do it.
--Mahatma Gandhi

Kai-Uwe Bux

10/19/2008 12:33:00 PM

Martin Eisenberg wrote:

> Juha Nieminen wrote:
>
>> (And in fact it might be possible to heapify an array which is
>> known to be sorted in O(n). I don't know this for sure, though.)
>
> A sorted array *is* a heap, no?

No. A heap satisfies the rule

a[k] >= a[2k+1] and a[k] >= a[2k+2]

provided the entries do exist. The functions std::make_heap and
std::sort_heap convert between heaps and sorted sequences.

Best

Kai-Uwe Bux

Martin Eisenberg

10/19/2008 11:25:00 PM

Kai-Uwe Bux wrote:

> Martin Eisenberg wrote:
>
>> Juha Nieminen wrote:
>>
>>> (And in fact it might be possible to heapify an array which is
>>> known to be sorted in O(n). I don't know this for sure, though.)
>>
>> A sorted array *is* a heap, no?
>
> No. A heap satisfies the rule
>
> a[k] >= a[2k+1] and a[k] >= a[2k+2]

Either I'm being thick or you have overinterpreted my words. I do
think that sortedness implies heapness (but not the converse).

Martin

--
Let others praise ancient times;
I am glad I was born in these.
--Ovid

comp.lang.c++

Re: Ranking algorithm

Victor Bazarov

Giuliano Bertoletti

Juha Nieminen

Victor Bazarov

Kai-Uwe Bux

James Kanze

Stephen Horne

Juha Nieminen

Martin Eisenberg

Kai-Uwe Bux

Martin Eisenberg

x Login to ForumsZone