Profanity Filtering 101: Embedding

  • By Brian Pontarelli
  • CleanSpeak
  • September 3, 2013

The sixth in a series of posts about the finer points of profanity filtering…

Embedding

Embedded words occur when a dictionary word or proper name contains profanity:

  1. Don’t assume profanity filters are inaccurate
  2. Harry Lipshitz has a hard time creating accounts on web sites
  3. This has been documented as the Scunthorpe problem

CleanSpeak’s sophisticated profanity filter looks for dictionary words that contain profanity and safely ignores them during the filtering process. Poorly written filters often trip over these simple cases and flag a large number of dictionary words as profanity. CleanSpeak draws on a large set of dictionary words and proper names in real time, over 140,000 in all, to handle this situation correctly and avoid a potentially large number of false positives without hindering performance.
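To make the idea concrete, here is a minimal Python sketch of the general technique. It is not CleanSpeak’s implementation. The word lists and the function names (`naive_flags`, `embedded_aware_flags`) are hypothetical and exist only for illustration. A real filter would load a far larger dictionary and handle much more than plain substring matches.

```python
import re

# Hypothetical, tiny word lists for illustration only.
BLOCKLIST = {"ass", "shit"}
SAFE_WORDS = {"assume", "lipshitz", "class", "passage"}

WORD_RE = re.compile(r"[a-z]+")


def naive_flags(text):
    """Flag every word that contains a blocklisted substring.

    This is the approach that produces false positives on embedded words.
    """
    words = WORD_RE.findall(text.lower())
    return [w for w in words if any(bad in w for bad in BLOCKLIST)]


def embedded_aware_flags(text):
    """Run the same substring check, but ignore known dictionary words and names."""
    return [w for w in naive_flags(text) if w not in SAFE_WORDS]
```

With these lists, the naive check flags "assume" in the first example sentence above, while the embedded-aware check returns nothing for it. A standalone blocklisted word is still flagged by both.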

Learn more about the Scunthorpe problem online at:

Further Reading:

Profanity Filtering 101: The Grawlix

Profanity Filtering 101: Character Replacements & Leet Speak

Profanity Filtering 101: Phonetics

Profanity Filtering 101: Repeat Characters

Profanity Filtering 101: Separators