
August 22nd, 2003, 01:52 AM
Major General
Join Date: Oct 2002
Posts: 2,174
Re: Scam Or Not?
Quote:
Originally posted by DavidG:
The biggest software company in the world that has written some of the most complex programs can't remove duplicate addresses from a list??? What's wrong with this picture.
They could; it's just a matter of the computer time required. The standard algorithm for removing duplicates goes something like:
code:
for(i = 0; i < max; i++)
{
    /* compare entry i against every entry before it */
    for(j = i - 1; j >= 0; j--)
    {
        if(entry(i) == entry(j))
        {
            clear(i);    /* mark the later copy as a duplicate */
        }
    }
}
If they have 10^9 entries, the statement
code:
if(entry(i) == entry(j))
gets run at most (1 + 2 + 3 + ... + ((10^9) - 1)) times - roughly half of (10^9)^2, call it 10^18 times.

It is almost impossible to hold 10^9 e-mail addresses in live memory at once: if you allow, say, 100 bytes per entry, that works out to 10^11 bytes - about one hundred gigabytes - of RAM for a single project; not likely. So each comparison has to pay a disk access. If you then assign a disk access time of, say, 10^-6 seconds per entry, and multiply that by the number of entries accessed (roughly 10^18 accesses), you get an estimate of how long the algorithm will take: 10^12 seconds. That's roughly 16,666,666,666 minutes, 277,777,777 hours, 11,574,074 days, or 31,688 years. Throw 10,000 machines at the task, and it still takes a little over three years (actually more than that, due to communication time between them). It isn't that they couldn't; it's that eliminating the duplicates would cost more resources than doing so would save them.
Granted, there are several ways to shave time off of the above analysis, but that just gives a general idea of what it would take.
__________________
Of course, by the time I finish this post, it will already be obsolete. C'est la vie.