Asp Forum - Unofficial quiz: Here's an algorithmic question for you

Hal E. Fulton

10/9/2004 7:26:00 PM

This is actually related to an interesting problem given to me in college
by one of my professors.

I'm thinking of digging out that disk and resurrecting the problem as a
quiz. So I won't go into detail on it here.

The original problem:
- 100 text fragments come from four different sources.
- There are 25 from each source.
- They're now scrambled.
- The sources are not labeled or named as such (their order
does not matter).
- Do whatever textual or statistical analysis you deem appropriate,
and separate them into four buckets, approximating their original
buckets as well as you can.

The question I'm asking here is related only to the evalution of the
quality of a given result:

Given four bins e,f,g,h (the results of our guesswork) and given
the four original bins a,b,c,d -- evaluate how well we did.

For simplicity, assume these bins are just arrays of integers.

Note that:
- each bin has exactly 25 items
- the union of a,b,c,d is 0..99
- the union of e,f,g,h is also 0..99
- there is no special correspondence between a and e,
b and f, etc.

So my real questions are:

1. How would you solve the problem of evaluating the correctness
of the partitioning? This would have to be "fuzzy" in some
sense, of course -- maybe a Float answer between 0 and 1.
2. And the meta-question: Is the "correctness" ambiguous? I think
not, but I haven't examined it thoroughly.

Cheers,
Hal

3 Answers

Mikael Brockman

10/9/2004 7:36:00 PM

Hal Fulton <hal9000@hypermetrics.com> writes:

> This is actually related to an interesting problem given to me in college
> by one of my professors.
>
> I'm thinking of digging out that disk and resurrecting the problem as a
> quiz. So I won't go into detail on it here.
>
> The original problem:
> - 100 text fragments come from four different sources.
> - There are 25 from each source.
> - They're now scrambled.
> - The sources are not labeled or named as such (their order
> does not matter).
> - Do whatever textual or statistical analysis you deem appropriate,
> and separate them into four buckets, approximating their original
> buckets as well as you can.
>
> The question I'm asking here is related only to the evalution of the
> quality of a given result:
>
> Given four bins e,f,g,h (the results of our guesswork) and given
> the four original bins a,b,c,d -- evaluate how well we did.
>
> For simplicity, assume these bins are just arrays of integers.
>
> Note that:
> - each bin has exactly 25 items
> - the union of a,b,c,d is 0..99
> - the union of e,f,g,h is also 0..99
> - there is no special correspondence between a and e,
> b and f, etc.
>
> So my real questions are:
>
> 1. How would you solve the problem of evaluating the correctness
> of the partitioning? This would have to be "fuzzy" in some
> sense, of course -- maybe a Float answer between 0 and 1.
> 2. And the meta-question: Is the "correctness" ambiguous? I think
> not, but I haven't examined it thoroughly.
>
>
> Cheers,
> Hal

You may find http://arxiv.org/abs/cs.... interesting:

>> We present a fully automatic method for music classification, based
>> only on compression of strings that represent the music pieces. The
>> method uses no background knowledge about music whatsoever: it is
>> completely general and can, without change, be used in different
>> areas like linguistic classification and genomics. It is based on an
>> ideal theory of the information content in individual objects
>> (Kolmogorov complexity), information distance, and a universal
>> similarity metric. Experiments show that the method distinguishes
>> reasonably well between various musical genres and can even cluster
>> pieces by composer.

Mauricio Fernández

10/9/2004 11:52:00 PM

On Sun, Oct 10, 2004 at 04:26:28AM +0900, Hal Fulton wrote:
> This is actually related to an interesting problem given to me in college
> by one of my professors.
>
> I'm thinking of digging out that disk and resurrecting the problem as a
> quiz. So I won't go into detail on it here.
>
> The original problem:
> - 100 text fragments come from four different sources.
> - There are 25 from each source.
> - They're now scrambled.
> - The sources are not labeled or named as such (their order
> does not matter).
> - Do whatever textual or statistical analysis you deem appropriate,
> and separate them into four buckets, approximating their original
> buckets as well as you can.

This implies that the sources have different characteristic properties ---
meaning that all the fragments from the same source are in some (non
obvious) way similar, right?

If so, your problem can be seen as a clusterization problem, which you
can solve using any of the numerous methods in the literature (say for
instance K-means).

> The question I'm asking here is related only to the evalution of the
> quality of a given result:
>
> Given four bins e,f,g,h (the results of our guesswork) and given
> the four original bins a,b,c,d -- evaluate how well we did.
>
> For simplicity, assume these bins are just arrays of integers.
>
> Note that:
> - each bin has exactly 25 items
> - the union of a,b,c,d is 0..99
> - the union of e,f,g,h is also 0..99
> - there is no special correspondence between a and e,
> b and f, etc.
>
> So my real questions are:
>
> 1. How would you solve the problem of evaluating the correctness
> of the partitioning? This would have to be "fuzzy" in some
> sense, of course -- maybe a Float answer between 0 and 1.

mmm what about the following:

Define a mapping from elements in your working set to a feature vector.
Compute the average feature vector for each of A={a,...,d} and B={e,...,h},
note them V_i and W_j (for i,j=1..4)
then you can define a distance as the minimum of
sum(|V_i-W_a(i)|^2, i=1..4) with a(i) representing the association
between averages in A and B
(there are 24 such mappings)

you might want to add a term like
K*sum(sum(|W_j-elem_i|^2,i=1..25), j=1..4)
to represent the distance from the elements in each cluster to the
average.

If you use the method suggested in the other msg in the thread (the one
based on the Kolmogorov complexity), you could take
length(gzip(concat(ORIG BUCKET, sort(CANDIDATE BUCKET))))
as a distance estimation in the 1st equation, but this can fail
depending on the nature of your sources.

> 2. And the meta-question: Is the "correctness" ambiguous? I think
> not, but I haven't examined it thoroughly.

In which sense?

--
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Robert Klemme

10/10/2004 1:53:00 PM

"Mauricio Fernández" <batsman.geo@yahoo.com> schrieb im Newsbeitrag
news:20041009235149.GA10987@student.ei.uni-stuttgart.de...
>> So my real questions are:
>>
>> 1. How would you solve the problem of evaluating the correctness
>> of the partitioning? This would have to be "fuzzy" in some
>> sense, of course -- maybe a Float answer between 0 and 1.
>
> mmm what about the following:
>
> Define a mapping from elements in your working set to a feature vector.
> Compute the average feature vector for each of A={a,...,d} and
> B={e,...,h},
> note them V_i and W_j (for i,j=1..4)
> then you can define a distance as the minimum of
> sum(|V_i-W_a(i)|^2, i=1..4) with a(i) representing the association
> between averages in A and B
> (there are 24 such mappings)

That's interesting. I took a bit less formal mathematical approach of
thinking, but I think the solution is quite similar. Here's what I'd have
done: for all permutations of e,f,g,h divide the size of the set
intersection of each pair by 25 (the set size) and then compute the average
of these values for all four pairs. Now choose the permutation of e,f,g,h
with the highest value.

As the faculty function is involved for a general solution there could be
numerous optimizaions. For example, it is reasonable to first calculate the
match indicators (intersection size / set size) for all pairs with O(n^2),
order the result buckets per original bucket with decreasing match indicator
and use that to guide the permutation. If you're lucky (i.e. if the bucket
finding algorithm worked well) every original bucket has a different
calculated bucket as first element in the ordered list and you're done
immediately.

If not, it's more difficult if you want to find the optimal permutation. If
you just want to detect failure of the bucket finding algorithm it might as
well suffice to recognize that two of the original buckets have the same
best match calculated bucket (which then both must have a match indicator <=
0.5 which seems pretty bad IMHO).

Another line of action could be to rule out pairs below a certain match
indicator (say 0.8) to reduce the number of permutations to consider.

Kind regards

robert

comp.lang.ruby

Unofficial quiz: Here's an algorithmic question for you

Hal E. Fulton

Mikael Brockman

Mauricio Fernández

Robert Klemme

x Login to ForumsZone