August 22nd, 2003, 01:52 AM
Jack Simth
Major General
Join Date: Oct 2002
Posts: 2,174
Thanks: 0
Thanked 0 Times in 0 Posts
Re: Scam Or Not?
Quote:
Originally posted by DavidG:
The biggest software company in the world that has written some of the most complex programs can't remove duplicate addresses from a list??? What's wrong with this picture.
They could; it's just a matter of the computer time required. The brute-force algorithm for removing duplicates goes something like:
code:
/* Brute force: check each entry against all of the earlier ones. */
for (i = 0; i < max; i++)
{
    for (j = i - 1; j >= 0; j--)
    {
        if (entry(i) == entry(j))  /* duplicate of an earlier entry */
        {
            clear(i);              /* blank out the later copy */
            break;                 /* no need to keep scanning */
        }
    }
}
If they have 10^9 entries, the statement
code:
if (entry(i) == entry(j))
gets run at most (1 + 2 + 3 + ... + (10^9 - 1) + 10^9) times. That sum is roughly half of (10^9)^2, so call it 10^18 comparisons.
It is almost impossible to hold 10^9 e-mail addresses in live memory at once: at, say, 100 bytes per entry, that works out to 10^11 bytes (about one hundred gigabytes) of RAM for a single project; not likely. So disk access times apply when fetching entries. Assign a disk access time of, say, 10^-6 seconds per entry, multiply by the number of accesses (roughly 10^18), and you get an estimate of how long the algorithm will take: 10^12 seconds. That's roughly 16,666,666,666 minutes, 277,777,777 hours, 11,574,074 days, or 31,688 years. Throw 10,000 machines at the task and it still takes a little over three years (actually more, due to communication time between them). It isn't that they couldn't; it's that eliminating the duplicates would cost more resources than doing so would save them.
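For anyone who wants to check that arithmetic, here is a minimal C++ sketch of the unit conversion (the 10^-6 seconds per access is the assumption above):
code:
#include <cstdio>

int main() {
    const double comparisons = 1e18;  // ~ (10^9)^2 worst-case checks
    const double sec_each    = 1e-6;  // assumed disk access time per check
    const double seconds     = comparisons * sec_each;  // 10^12 seconds

    std::printf("%.0f minutes\n", seconds / 60);
    std::printf("%.0f hours\n",   seconds / 3600);
    std::printf("%.0f days\n",    seconds / 86400);
    std::printf("%.0f years\n",   seconds / (86400 * 365.25));
    return 0;
}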
Granted, there are several ways to shave time off of the above analysis, but that just gives a general idea of what it would take.
__________________
Of course, by the time I finish this post, it will already be obsolete. C'est la vie.

August 22nd, 2003, 01:59 AM
Imperator Fyron
Shrapnel Fanatic
Join Date: Jul 2001
Location: Southern CA, USA
Posts: 18,394
Thanks: 0
Thanked 12 Times in 10 Posts
Re: Scam Or Not?
You do not have to load every single address into active memory at once. In fact, with that loop, each address is deleted from active memory (essentially) after it is checked against the one you are comparing it to.
[ August 22, 2003, 01:00: Message edited by: Imperator Fyron ]

August 22nd, 2003, 02:18 AM
Jack Simth
Major General
Join Date: Oct 2002
Posts: 2,174
Thanks: 0
Thanked 0 Times in 0 Posts
Re: Scam Or Not?
Quote:
Originally posted by Imperator Fyron:
You do not have to load every single address into active memory at once. In fact, with that loop, each address is deleted from active memory (essentially) after it is checked against the one you are comparing it to.
Apparently you missed the numerous disclaimers: the "goes something like" rather than "is"; my note at the bottom:
Quote:
Granted, there are several ways to shave time off of the above analysis, but that just gives a general idea of what it would take.
and the "at most" to indicate it was a worst-case (for that algorithm, worst-case = no duplicates) analysis.
In the worst case, the last entry checked must be checked against every other entry, and so all must be available (in memory, or accessed from the disk). The point was to give a general idea of what was required, not the exact algorithm needed. Things would be further complicated by the likelihood that it isn't a matter of a single database of addresses being worked with. There are a zillion (exaggeration) assumptions in my analysis, and several valid shortcuts that could be built into the algorithm. It's an estimate to support the point DavidG had doubts about, not an exact analysis for that particular number set.
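To give one example of such a shortcut (just an illustration, not what the analysis above assumed): if you can spare the memory for a set of the addresses seen so far, a hash lookup turns the n^2 comparisons into roughly n. A minimal C++ sketch:
code:
#include <string>
#include <unordered_set>
#include <vector>

// One possible shortcut: remember every address seen so far in a hash
// set, so each entry is checked once instead of against all the others.
std::vector<std::string> dedup(const std::vector<std::string>& addrs) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> unique;
    for (const std::string& a : addrs) {
        if (seen.insert(a).second) {  // .second is true on first sight
            unique.push_back(a);
        }
    }
    return unique;
}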
__________________
Of course, by the time I finish this post, it will already be obsolete. C'est la vie.

August 22nd, 2003, 02:22 AM
DavidG
Lieutenant Colonel
Join Date: Jan 2002
Location: Dundas, Ontario, Canada
Posts: 1,498
Thanks: 0
Thanked 0 Times in 0 Posts
Re: Scam Or Not?
Quote:
Originally posted by Jack Simth:
Quote:
Originally posted by DavidG:
The biggest software company in the world that has written some of the most complex programs can't remove duplicate addresses from a list??? What's wrong with this picture.
They could; it's just a matter of the computer time required. The brute-force algorithm for removing duplicates goes something like:
Well, I'm not going to dispute your math (frankly, I didn't take the time to really understand it). But I have MSN Messenger at work. I signed up for the service and provided my e-mail address to MS exactly ONE time, and yet I got that message 10 times. Face it, MS got something screwed up. Other corporations with large databases, like Symantec, seem to get things OK.

August 22nd, 2003, 02:25 AM
DavidG
Lieutenant Colonel
Join Date: Jan 2002
Location: Dundas, Ontario, Canada
Posts: 1,498
Thanks: 0
Thanked 0 Times in 0 Posts
Re: Scam Or Not?
Quote:
Originally posted by Jack Simth:
However, apparently, it's possible to get on their lists more than once
Perhaps this is the problem. Now surely this shouldn't be hard to avoid.

August 22nd, 2003, 02:26 AM
Jack Simth
Major General
Join Date: Oct 2002
Posts: 2,174
Thanks: 0
Thanked 0 Times in 0 Posts
Re: Scam Or Not?
Quote:
Originally posted by DavidG:
Well, I'm not going to dispute your math (frankly, I didn't take the time to really understand it). But I have MSN Messenger at work. I signed up for the service and provided my e-mail address to MS exactly ONE time, and yet I got that message 10 times. Face it, MS got something screwed up. Other corporations with large databases, like Symantec, seem to get things OK.
Granted; MS could have done much better. The analysis only applies when going back to fix it, not when compiling the list originally. It is easier to maintain a no-duplicates list than it is to change a list to the no-duplicates variety.
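As a sketch of the maintain-as-you-go idea (the names here are made up for illustration): each signup does one cheap membership check, so the giant cleanup job never accumulates. In C++:
code:
#include <string>
#include <unordered_set>

// Hypothetical signup list that refuses duplicates as they arrive,
// so there is never a 10^9-entry cleanup job to run afterwards.
class MailingList {
public:
    // Returns false if the address was already on the list.
    bool add(const std::string& address) {
        return addresses_.insert(address).second;
    }
private:
    std::unordered_set<std::string> addresses_;
};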
__________________
Of course, by the time I finish this post, it will already be obsolete. C'est la vie.

August 22nd, 2003, 02:30 AM
Shrapnel Fanatic
Join Date: Feb 2001
Location: Waterloo, Ontario, Canada
Posts: 11,451
Thanks: 1
Thanked 4 Times in 4 Posts
Re: Scam Or Not?
Best would be an algorithm which checks new addresses against the list before even putting them in!
Keep N small in the first place, and there's less trouble later.
Each submitting machine could keep a filter list of the last couple of submissions, to cut down on the work the main server has to do.
If you had a sorted list, then the duplicate checking would be really easy.
Decent insertion routines would help a lot too.
Bucket sort to servers holding a piece of the list, then insert using your favorite routine.
Get the n^2 work done as the addresses drip in, so you have years to spend on the problem instead of rushing it just before trying to send the e-mails.
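A minimal C++ sketch of the sorted-list version of this (illustrative only): a binary search finds the insertion point and catches duplicates in the same step, and bucketing is then just a question of which server owns which slice of the alphabet.
code:
#include <algorithm>
#include <string>
#include <vector>

// Sorted-list insert: std::lower_bound is a binary search, so each new
// address costs O(log n) comparisons and duplicates are caught for free.
bool insert_unique(std::vector<std::string>& sorted_list,
                   const std::string& address) {
    auto pos = std::lower_bound(sorted_list.begin(), sorted_list.end(),
                                address);
    if (pos != sorted_list.end() && *pos == address) {
        return false;                 // already present; reject it
    }
    sorted_list.insert(pos, address); // keeps the list sorted
    return true;
}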
__________________
Things you want: