Mauricio Fernández
10/9/2004 11:52:00 PM
On Sun, Oct 10, 2004 at 04:26:28AM +0900, Hal Fulton wrote:
> This is actually related to an interesting problem given to me in college
> by one of my professors.
>
> I'm thinking of digging out that disk and resurrecting the problem as a
> quiz. So I won't go into detail on it here.
>
> The original problem:
> - 100 text fragments come from four different sources.
> - There are 25 from each source.
> - They're now scrambled.
> - The sources are not labeled or named as such (their order
> does not matter).
> - Do whatever textual or statistical analysis you deem appropriate,
> and separate them into four buckets, approximating their original
> buckets as well as you can.
This implies that the sources have different characteristic properties ---
meaning that all the fragments from the same source are in some (non
obvious) way similar, right?
If so, your problem can be seen as a clusterization problem, which you
can solve using any of the numerous methods in the literature (say for
instance K-means).
> The question I'm asking here is related only to the evalution of the
> quality of a given result:
>
> Given four bins e,f,g,h (the results of our guesswork) and given
> the four original bins a,b,c,d -- evaluate how well we did.
>
> For simplicity, assume these bins are just arrays of integers.
>
> Note that:
> - each bin has exactly 25 items
> - the union of a,b,c,d is 0..99
> - the union of e,f,g,h is also 0..99
> - there is no special correspondence between a and e,
> b and f, etc.
>
> So my real questions are:
>
> 1. How would you solve the problem of evaluating the correctness
> of the partitioning? This would have to be "fuzzy" in some
> sense, of course -- maybe a Float answer between 0 and 1.
mmm what about the following:
Define a mapping from elements in your working set to a feature vector.
Compute the average feature vector for each of A={a,...,d} and B={e,...,h},
note them V_i and W_j (for i,j=1..4)
then you can define a distance as the minimum of
sum(|V_i-W_a(i)|^2, i=1..4) with a(i) representing the association
between averages in A and B
(there are 24 such mappings)
you might want to add a term like
K*sum(sum(|W_j-elem_i|^2,i=1..25), j=1..4)
to represent the distance from the elements in each cluster to the
average.
If you use the method suggested in the other msg in the thread (the one
based on the Kolmogorov complexity), you could take
length(gzip(concat(ORIG BUCKET, sort(CANDIDATE BUCKET))))
as a distance estimation in the 1st equation, but this can fail
depending on the nature of your sources.
> 2. And the meta-question: Is the "correctness" ambiguous? I think
> not, but I haven't examined it thoroughly.
In which sense?
--
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com