Email spam is still a major problem on today’s Internet. Most users who have email accounts receive spam regularly. While the mail filters of most providers catch the bulk of it, enough spam still slips through the cracks.
Google has launched a new text classifier on Gmail that promises better detection rates, fewer false positives and improved performance. Called RETVec, short for Resilient & Efficient Text Vectorizer, it improves the spam detection rate on Gmail by 38% and reduces false positives by almost 20%.
Google says that RETVec achieves this by “combining a novel, highly-compact character encoder, an augmentation-driven training regime, and the use of metric learning”.
Thanks to its architecture, RETVec works with every language and all UTF-8 characters out of the box, without the need for text preprocessing.
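To make the idea of a compact character encoder more concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each character is represented by the binary digits of its Unicode code point in a fixed 24-bit width. The function name `encode_chars` and its parameters are hypothetical and not taken from RETVec's code.

```python
from typing import List

# Assumption: 24 bits per character, wide enough for any Unicode
# code point (the maximum, 0x10FFFF, needs 21 bits).
BITS_PER_CHAR = 24


def encode_chars(word: str, max_chars: int = 16, bits: int = BITS_PER_CHAR) -> List[List[int]]:
    """Encode a word as a fixed-size grid of bits, one row per character.

    Hypothetical helper for illustration; not RETVec's actual API.
    """
    rows = []
    for ch in word[:max_chars]:
        code_point = ord(ch)
        # Most significant bit first.
        rows.append([(code_point >> shift) & 1 for shift in range(bits - 1, -1, -1)])
    # Pad short words with all-zero rows so every word has the same shape.
    while len(rows) < max_chars:
        rows.append([0] * bits)
    return rows


if __name__ == "__main__":
    # Latin, Cyrillic and Japanese input all go through the same code path.
    for word in ["spam", "спам", "スパム"]:
        grid = encode_chars(word)
        print(word, len(grid), len(grid[0]))
```

Because the encoding works on code points rather than on a vocabulary, no language-specific tokenizer or word list is needed, which is the property the article describes.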
Spammers and malicious actors use different methods to bypass spam filters. Common methods include the use of homoglyphs, which are characters that look very much alike, and the use of invisible characters.
Google claims that Gmail’s new anti-spam system is better suited to identify these tactics and deal with them accordingly.
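The following Python sketch shows why these tricks work against a simple keyword filter. The disguised string, the tiny homoglyph table and all function names are made up for this example. The hand-written cleanup at the end is the kind of rule-based preprocessing that conventional filters rely on; it is not how RETVec works, which according to Google learns this resilience during training.

```python
import unicodedata

# Tiny, hand-picked homoglyph table for illustration only;
# real confusable tables are far larger.
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
}


def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf), such as the zero-width space."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def fold_homoglyphs(text: str) -> str:
    """Replace known look-alike characters with their Latin counterparts."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def naive_filter(text: str, keyword: str = "free") -> bool:
    """Return True if the keyword appears in the text as a plain substring."""
    return keyword in text.lower()


# "free money" with a Cyrillic "е" and a zero-width space hidden in the first word.
disguised = "fr\u0435\u200be money"

if __name__ == "__main__":
    print(naive_filter(disguised))
    print(naive_filter(fold_homoglyphs(strip_invisible(disguised))))
```

The first check misses the keyword even though the text looks like "free money" to a human reader; the second one finds it, but only because the lookup table happens to contain the substituted character.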
The company tested the new model internally for a time to better understand its effectiveness. Google says that, based on these internal tests, it found RETVec to be “highly effective for security and anti-abuse applications”.
RETVec in detail
RETVec is released as open source. You may visit the GitHub project website for access to the source. There, you will also find more information, including the paper and links to demos.
On GitHub, Google describes RETVec for a developer audience as follows:
RETVec is trained to be resilient against character-level manipulations including insertion, deletion, typos, homoglyphs, LEET substitution, and more. The RETVec model is trained on top of a novel character encoder which can encode all UTF-8 characters and words efficiently.
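The manipulations named in the description can be sketched in a few lines of Python. This is only an illustration of what such character-level perturbations look like; the function names and the small LEET table are hypothetical, and the code is not RETVec's actual augmentation pipeline.

```python
import random

# Small illustrative LEET table; real substitutions are more varied.
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}


def insert_char(word: str, pos: int, ch: str) -> str:
    """Insertion: add a character at the given position."""
    return word[:pos] + ch + word[pos:]


def delete_char(word: str, pos: int) -> str:
    """Deletion: remove the character at the given position."""
    return word[:pos] + word[pos + 1:]


def swap_adjacent(word: str, pos: int) -> str:
    """Typo: swap the characters at pos and pos + 1, if both exist."""
    if pos + 1 >= len(word):
        return word
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]


def leet(word: str) -> str:
    """LEET substitution: replace letters with look-alike digits."""
    return "".join(LEET.get(ch, ch) for ch in word)


def augment(word: str, rng: random.Random) -> str:
    """Return one randomly perturbed variant of the word."""
    if not word:
        return word
    pos = rng.randrange(len(word))
    ops = [
        lambda: insert_char(word, pos, rng.choice("abcdefghijklmnopqrstuvwxyz")),
        lambda: delete_char(word, pos),
        lambda: swap_adjacent(word, pos),
        lambda: leet(word),
    ]
    return rng.choice(ops)()


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print([augment("prizes", rng) for _ in range(5)])
```

A model that sees many such variants of the same word during training can learn to map them to similar representations, which is the general idea behind an augmentation-driven training regime.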
Google notes that RETVec may also be a choice for “on-device and web use cases”. The technology is supported natively in TensorFlow Lite and there is also a custom JavaScript implementation.
Closing Words
Gmail users benefit from the new anti-spam filter. A 38% improvement in spam detection is significant, especially considering Gmail’s daily mail volume. Google benefits from the deployment as well, as performance improves thanks to the lightweight nature of the new text vectorizer.
Now You: do you use Gmail?