Tuesday, November 30, 2010

Spam spam spam spam...

This paper uses regular-expression matching on URLs to help detect email spam, especially spam campaigns. The authors have a really interesting tool, but I worry about the haste with which they dismiss some sources of false positives:

Legitimate emails sent by a big company advertising a product or event could also be bursty. But they will be unlikely sent from hosts spanning more than a few ASes. One false positive case could be email flash crowd, where people forward each other a few popular URL links. We expect such events to be very rare. In our experience of using three months of data and the source AS threshold of 20, we did not encounter a single such event.
Three months of data from a single email provider seems like too small a sample to conclude that the system will not mislabel some spontaneous, viral, non-spam URL. The lack of "flash crowd" data could cause real problems for the email providers' customers if they ever take part in such an event; the results of this study give us no confidence that such an event would be distinguished from spam. I also wonder about the assumption that large companies will span only a few ASes. Does that hold for CDNs? For global companies? I think time and trends need more consideration here.
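
To make the worry concrete, here is a minimal sketch of only the source-AS part of the heuristic, as I read the quoted passage. It is not the authors' code: the function name `flag_urls`, the `(url, source_as)` input shape, the example URLs and AS numbers, and the choice of a strict "more than 20" comparison are all my own assumptions, and the burstiness and regular-expression steps are left out entirely.

```python
from collections import defaultdict

# The source AS threshold of 20 comes from the quoted passage; whether the
# paper compares with > or >= is an assumption on my part.
AS_THRESHOLD = 20


def flag_urls(messages, as_threshold=AS_THRESHOLD):
    """Return the URLs whose senders span more than as_threshold distinct ASes.

    messages: iterable of (url, source_as) pairs (hypothetical input shape).
    """
    ases_per_url = defaultdict(set)
    for url, source_as in messages:
        ases_per_url[url].add(source_as)
    return {url for url, ases in ases_per_url.items() if len(ases) > as_threshold}


# Made-up data: a company mailing from 3 ASes, and a viral link that
# ordinary users forward to each other from 25 different ASes.
company = [("http://example.com/sale", asn) for asn in (100, 101, 102)]
viral = [("http://example.org/funny", 200 + i) for i in range(25)]

flagged = flag_urls(company + viral)
print(flagged)
```

In this toy data the company mailing stays under the threshold, but the forwarded link is flagged purely because its senders are spread across many ASes, which is exactly the flash crowd case the authors say they never observed.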
