Olaf Klischat
7/17/2005 10:01:00 PM
Olaf Klischat <klischat@cs.tu-berlin.de> writes:
> Paolo Capriotti <p.capriotti@gmail.com> writes:
>
>> It's wrong anyways. You both are assuming that all these events are
>> independent, and multiply their probability, but they are not. One
>> neat way to see this is remarking that you didn't use the fact that
>> the intervals are 5000000. If there were only one such interval (that
>> is, if you were doing an 1 out of 200 sampling), the probability would
>> have been obviously 1. If your formula were correct, I could apply it
>> to this case, obtaining a contradiction.
>
> Yeah. You have a point there ;).
>
> I still think that 0.3688 and 0.3688**20 are not far off because the
> interval size (200) and number (20) is much smaller than MEMBERS and
> LIMIT, so the occurance of an event like "the sampling contains one
> number from a given (preselected) interval" does not influence the
> probability of another such event (for another, disjunct interval)
> very much. Right?
I've pondered some more and I think, considering "event dependency"
(i.e. changed probabilities once you've selected some numbers), the
actual probability that exactly one number occurs in a given
200-numbers interval isn't (199/200)**199=0.36880183, but
(1-4_999_999/999_999_999)*
(1-4_999_999/999_999_998)*
(1-4_999_999/999_999_997)*
.
.
.. *
(1-4_999_999/999_999_801)
= 0.368801868
(so the approximation has an error of only 1e-7), or in the general
case:
# probability that exactly one number occurs in a given interval of
# size intervalsize in a sampling with parameters members and limit
def p(members,limit,intervalsize)
(1...intervalsize).inject(intervalsize * members/limit){|x,i|
x * (1-(members-1)/(limit-i))
}
end
I've now written a "check" script that gets a sample on stdin and
determines this value (all consecutive intervals of size limit/members
are checked), and compares it to the theoretically expected one:
--------------------------------
#!/usr/bin/ruby
# probability that exactly one number occurs in a given interval of
# size intervalsize in a sampling with parameters members and limit
def p(members,limit,intervalsize)
(1...intervalsize).inject(intervalsize * members/limit){|x,i|
x * (1-(members-1)/(limit-i))
}
end
members,limit = ARGV.map{|x|x.to_i}
warn "#{limit} mod #{members} != 0 -- inaccurate results" if limit%members!=0
interval = limit/members
puts "expected: #{p(members.to_f,limit.to_f,interval)}"
found=0
begin
last=0; count=0;
loop do
num=STDIN.readline.to_i
new=num/interval
if new==last then
count+=1
else
last=new
found+=1 if count==1
count=1
end
end
rescue EOFError
end
found+=1 if count==1
puts "actual: #{found.to_f/(limit/interval)}"
--------------------------------
For the big_sample.txt here, this gives:
olaf@tack:~/doc/mydocs/ruby/quiz$ ./check 5_000_000 1_000_000_000 <big_sample.txt
expected: 0.368801867760757
actual: 0.368865
olaf@tack:~/doc/mydocs/ruby/quiz$
Other runs:
olaf@tack:~/doc/mydocs/ruby/quiz$ ./sample 5000 70000 | ./check 5000 70000
expected: 0.381630036177205
actual: 0.386
olaf@tack:~/doc/mydocs/ruby/quiz$
olaf@tack:~/doc/mydocs/ruby/quiz$ ./sample 1 100 | ./check 1 100
expected: 1.0
actual: 1.0
olaf@tack:~/doc/mydocs/ruby/quiz$
The "cheated" samples that have one value in each interval will always
yield actual: 1.0.
All things considered, this seems to be quite a good indicator to tell
the really random solutions from the not-so-random ones :)
Olaf