Inversoft Profanity Filter Documentation

Getting started using the Inversoft Profanity Filter

This guide will help you get started using the Inversoft Profanity Filter. This covers the basics of using the regular expression filter and the Inversoft Profanity Database. For more advanced topics, consult the JavaDoc for the filter.

The ProfanityFilter interface

The main interface to the filter is com.inversoft.profanity.ProfanityFilter. This interface contains all the methods to check for, locate and replace profanity within Strings. This guide walks through how to locate all of the profanity in a String, which is the most common usage of the filter.

The findProfanity method has two variations. The first version returns an array of ProfanityResult objects, one for each profanity found. The second takes a ProfanityListener, which is called when the filter finds profanity within the String. This second method does not return a value since the listener is called for each profanity and can store the results as they are encountered. To keep things simple, we will look at the first version of this method. The JavaDoc contains information about all the other methods on this interface.

The signature of this method looks like this:

  ProfanityResult[] findProfanity(String str, int tolerance, String... types);

This method can be called from JDK 1.4 by passing an array of Strings or null as the final parameter.

The str parameter is the String to be searched. The tolerance parameter defines how lenient the filter should be with respect to profanity. The types parameter defines the types of profanity to search for. To support these parameters, the Inversoft Profanity Database stores two additional attributes for each definition it contains. These attributes, together with the parameters passed to the filter method, can also speed up filtering by reducing the amount of work the filter has to do. These attributes are:

  • Rating
  • Type

The rating defines on a scale from 1 to 10 the severity of the word where 10 is the most severe and offensive and 1 is the least. The type defines the category of the word such as Slang, Swear, Drug, etc. The tolerance parameter determines the lowest rating of words to use from the database. For example, if 5 is passed as the tolerance to the filter, the filter will only search for words whose rating is 5-10. Likewise, the types parameter defines which categories of words to search for. Passing in new String[]{"Swear"} will only search for words whose type is Swear. By reducing the total set of words to search for, the filtering time is reduced.
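To make the effect of tolerance and types concrete, here is a minimal, self-contained sketch of the selection rule described above. It does not use the Inversoft classes: Definition, ToleranceSketch and select are hypothetical names, and the words and ratings are made up for illustration. Treating a null or empty types argument as "all types" is an assumption of this sketch, not documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one database entry; not an Inversoft class.
final class Definition {
    final String word;
    final int rating;   // 1 (least severe) to 10 (most severe)
    final String type;  // category such as "Swear", "Slang" or "Drug"

    Definition(String word, int rating, String type) {
        this.word = word;
        this.rating = rating;
        this.type = type;
    }
}

class ToleranceSketch {
    // Keeps only the definitions whose rating is at least the tolerance and
    // whose type is one of the requested types. Assumption of this sketch:
    // a null or empty types argument means "all types".
    static List<Definition> select(List<Definition> all, int tolerance, String... types) {
        List<String> wanted = (types == null) ? new ArrayList<String>() : Arrays.asList(types);
        List<Definition> selected = new ArrayList<Definition>();
        for (Definition d : all) {
            boolean typeMatches = wanted.isEmpty() || wanted.contains(d.type);
            if (d.rating >= tolerance && typeMatches) {
                selected.add(d);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up entries; a real database would hold far more.
        List<Definition> db = Arrays.asList(
            new Definition("darn", 2, "Slang"),
            new Definition("heck", 5, "Swear"),
            new Definition("weed", 6, "Drug"),
            new Definition("blast", 8, "Swear"));

        // Tolerance 5 with type "Swear" leaves two of the four entries to search for.
        for (Definition d : select(db, 5, "Swear")) {
            System.out.println(d.word + " (" + d.rating + ", " + d.type + ")");
        }
    }
}
```

The smaller the selected list, the fewer words the filter has to look for, which is where the speed-up comes from.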

The result of this method is an array of ProfanityResult objects. These Objects contain information about a match found in the String. The offset, length and profanity matched are all contained within the ProfanityResult. This array is sorted by the offset (start position) of the match within the String.
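As an illustration of what the offset and length can be used for, the following sketch masks each matched span with asterisks. Match is a hypothetical stand-in for ProfanityResult that models only the offset and length, and the sketch assumes that offsets are character indexes into the String. The matches are hard-coded here rather than produced by the filter.

```java
// Hypothetical stand-in for ProfanityResult; only offset and length are modeled.
final class Match {
    final int offset;
    final int length;

    Match(int offset, int length) {
        this.offset = offset;
        this.length = length;
    }
}

class MaskSketch {
    // Replaces every matched span with '*' characters. Each span is replaced
    // by the same number of characters, so later offsets remain valid.
    static String mask(String str, Match[] matches) {
        char[] chars = str.toCharArray();
        for (Match m : matches) {
            for (int i = m.offset; i < m.offset + m.length; i++) {
                chars[i] = '*';
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String str = "well darn, that is a heck of a mess";
        // Hard-coded matches for "darn" (offset 5) and "heck" (offset 21).
        Match[] matches = { new Match(5, 4), new Match(21, 4) };
        System.out.println(mask(str, matches)); // well ****, that is a **** of a mess
    }
}
```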

The RegexProfanityFilter class

The implementation of the ProfanityFilter interface that we will be using is com.inversoft.profanity.RegexProfanityFilter. This class uses regular expressions to locate profanity within Strings. In order to create an instance of this class we will need instances of three other interfaces. These are:

  • com.inversoft.profanity.ProfanitySource
  • com.inversoft.profanity.CommonWordSource
  • com.inversoft.profanity.ExclusionWordSource

The ProfanitySource interface

The ProfanitySource is how the filter locates the profanity to search for. This is an interface to allow custom data sources to be used. The default implementation of this interface uses the Inversoft Profanity Database XML file, which is loaded from the file system, classpath or a URL. This implementation is com.inversoft.profanity.CachingFullXMLProfanitySource. This class loads the XML file into memory and caches it there for faster access. The constructor for this class looks like this:

  public CachingFullXMLProfanitySource(String resource, long reloadSeconds, int tolerance);

The resource parameter is either a file path to the Inversoft Profanity Database XML file, a location of the file in the classpath, or a URL to load the database from. The reloadSeconds parameter defines how often the database file should be reloaded. If resource points to a file on the file system, or to a URL that correctly returns the Last-Modified header, this parameter specifies how often the file or URL is checked for changes, and the database is reloaded only if it has changed. For all other resources, this parameter controls how often the database is reloaded. Finally, the tolerance parameter controls which words from the database are loaded. The database might contain hundreds of words, and loading all of them into memory might consume too many resources. Therefore, this parameter can be set so that only words whose rating is equal to or greater than the tolerance are loaded. For example, if tolerance is 5, only words whose rating is 5-10 are loaded.

This class can be constructed like this:

  ProfanitySource source = new CachingFullXMLProfanitySource("http://example.com/badwords-english.xml",
      Long.MAX_VALUE, 0);

This tells the class to load all the words in the database and to never reload.
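The "check periodically, reload only when changed" behavior described for reloadSeconds can be pictured with the sketch below. It is not the library's implementation: ReloadCheckSketch and its methods are hypothetical names, the last-modified timestamp comes from a supplier (for a local file this could be file::lastModified), and the current time is passed in explicitly to keep the example deterministic.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of "check every reloadSeconds, reload only when changed".
class ReloadCheckSketch {
    private final LongSupplier lastModified; // e.g. file::lastModified
    private final long intervalMillis;
    private long lastCheckMillis;
    private long lastModifiedSeen;
    private int loadCount;

    ReloadCheckSketch(LongSupplier lastModified, long reloadSeconds, long nowMillis) {
        this.lastModified = lastModified;
        // Guard against overflow so that Long.MAX_VALUE means "never check again".
        this.intervalMillis = reloadSeconds > Long.MAX_VALUE / 1000
            ? Long.MAX_VALUE
            : reloadSeconds * 1000;
        this.lastCheckMillis = nowMillis;
        this.lastModifiedSeen = lastModified.getAsLong();
        load();
    }

    // A real source would parse the XML database here.
    private void load() {
        loadCount++;
    }

    // Returns true only if the check interval has elapsed AND the timestamp changed.
    synchronized boolean reloadIfChanged(long nowMillis) {
        if (nowMillis - lastCheckMillis < intervalMillis) {
            return false; // too soon to check again
        }
        lastCheckMillis = nowMillis;
        long modified = lastModified.getAsLong();
        if (modified == lastModifiedSeen) {
            return false; // checked, but nothing changed
        }
        lastModifiedSeen = modified;
        load();
        return true;
    }

    synchronized int getLoadCount() {
        return loadCount;
    }
}
```

With a reloadSeconds of Long.MAX_VALUE, as in the constructor call above, the interval never elapses, so the data is loaded once and never checked again.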

The CommonWordSource interface

The CommonWordSource defines a list of words that are ignored during processing. This increases the performance of the filter because it reduces the amount of the String that needs to be searched. Therefore, although providing an instance of this interface is optional, it is highly recommended. The default implementation of this interface loads a standard set of common words that ships with the Inversoft Profanity Filter and has been tuned to work in conjunction with the Inversoft Profanity Database. This class is com.inversoft.profanity.ResourceBundleCommonWordSource and can be constructed like this:

  ResourceBundleCommonWordSource source = new ResourceBundleCommonWordSource();

This will use the default source that comes with the filter.

The ExclusionWordSource interface

The ExclusionWordSource interface defines a list of words that are ignored during processing because they contain profanity but are not profanity. For example, the words assume and peacock both contain profanity but are not themselves profanity. This list is used to contextualize words in the String during processing. The default implementation of this interface uses the Inversoft Profanity Database as its source for exclusions. Each word in the Inversoft Profanity Database contains a list of exclusion words and these are used by the filter to reduce false positive matches during processing. This implementation is com.inversoft.profanity.CachingXMLExclusionWordSource and can be constructed in the same manner as the ProfanitySource implementation. It takes a resource, reload time and tolerance, all of which have the same meaning as they do for the ProfanitySource. This class can be constructed like this:

  ExclusionWordSource source = new CachingXMLExclusionWordSource("badwords-english.xml",
      Long.MAX_VALUE, 0);

This tells the class to load all the exclusion words from the database and to never reload.
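To show why exclusion words reduce false positives, here is a deliberately naive sketch that uses the assume example from above. It is only an illustration of the idea, not how RegexProfanityFilter works internally: ExclusionSketch and find are hypothetical names, matching is a plain substring search, and the input is assumed to be simple ASCII text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ExclusionSketch {
    // Returns the start offsets of 'profanity' in 'str', skipping occurrences
    // whose surrounding run of letters is listed in 'exclusions' (lower case).
    static List<Integer> find(String str, String profanity, Set<String> exclusions) {
        List<Integer> offsets = new ArrayList<Integer>();
        if (profanity.isEmpty()) {
            return offsets; // nothing to search for
        }
        String lower = str.toLowerCase(Locale.ROOT);
        String needle = profanity.toLowerCase(Locale.ROOT);
        int from = 0;
        int index;
        while ((index = lower.indexOf(needle, from)) >= 0) {
            // Expand the match to the boundaries of the word that contains it.
            int start = index;
            while (start > 0 && Character.isLetter(lower.charAt(start - 1))) {
                start--;
            }
            int end = index + needle.length();
            while (end < lower.length() && Character.isLetter(lower.charAt(end))) {
                end++;
            }
            // Report the match only if the containing word is not an exclusion.
            if (!exclusions.contains(lower.substring(start, end))) {
                offsets.add(index);
            }
            from = index + needle.length();
        }
        return offsets;
    }

    public static void main(String[] args) {
        Set<String> exclusions = new HashSet<String>(Arrays.asList("assume"));
        // The occurrence inside "assume" is skipped; the standalone word is reported.
        System.out.println(find("I assume he is an ass", "ass", exclusions)); // [18]
    }
}
```

Without the exclusion list, the same call would also report offset 2, which is the false positive the exclusion words are meant to prevent.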

Now that we have all the classes required to create the RegexProfanityFilter, we can construct it like this:

  ProfanitySource profanitySource = new CachingFullXMLProfanitySource("badwords-english.xml",
      Long.MAX_VALUE, 0);
  ResourceBundleCommonWordSource commonWordSource = new ResourceBundleCommonWordSource();
  ExclusionWordSource exclusionWordSource = new CachingXMLExclusionWordSource("badwords-english.xml",
      Long.MAX_VALUE, 0);
  RegexProfanityFilter filter = new RegexProfanityFilter(profanitySource, commonWordSource,
      exclusionWordSource);

Calling the filter

Once the filter has been instantiated it is a simple matter of calling the findProfanity method. Here is an example of calling the method:

  // Example call using JDK 5.0
  String str = getStringFromSomewhere();
  ProfanityResult[] results = filter.findProfanity(str, 4, "Swear", "Slang");

  // using JDK 1.4 this would become
  // ProfanityResult[] results = filter.findProfanity(str, 4, new String[]{"Swear", "Slang"});

  // If nothing was found the array is empty and the loop simply does not run.
  for (int i = 0; i < results.length; i++) {
      System.out.println("Found profanity at " + results[i].getOffset());
  }

Considerations

There are a few considerations when using the Inversoft Profanity Filter. First, the ProfanitySource implementations usually take a few seconds to load because of the size of the Inversoft Profanity Database. Therefore, it is a good idea to create the source during application startup and reuse the same instance. All the implementations that come with the filter are thread safe.

Similarly, the ExclusionWordSource implementations also take a little time to create and should be created at startup. Using the reload feature of the sources can cause the application to pause due to synchronization requirements, and should therefore be avoided unless the source is constructed using a file location on the local file system. In that case, the source classes check whether the file has changed and do not reload it if it has not.
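One common way to follow the "create once, reuse everywhere" advice in plain Java is a static final field, sketched below. SlowFilter is a hypothetical stand-in for an expensive-to-build filter and FilterRegistry is a hypothetical holder class; neither is part of the Inversoft API. In a real application the field would hold the RegexProfanityFilter built from the three sources.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a filter that is slow to construct.
final class SlowFilter {
    static final AtomicInteger CONSTRUCTED = new AtomicInteger();

    SlowFilter() {
        // A real filter would load its sources here, which can take seconds.
        CONSTRUCTED.incrementAndGet();
    }
}

class FilterRegistry {
    // Built when FilterRegistry is initialized. The JVM runs class
    // initialization exactly once, even if many threads call get() at once.
    private static final SlowFilter FILTER = new SlowFilter();

    static SlowFilter get() {
        return FILTER;
    }

    public static void main(String[] args) {
        // Touch the registry during startup so the loading cost is paid here
        // rather than on the first request.
        SlowFilter first = FilterRegistry.get();
        SlowFilter second = FilterRegistry.get();
        System.out.println(first == second);          // true: same instance
        System.out.println(SlowFilter.CONSTRUCTED.get()); // 1: built only once
    }
}
```

Because class initialization happens on first use, calling FilterRegistry.get() once from the application's startup code is what moves the loading cost to startup.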