Inversoft Profanity Filter Documentation | Regex Filter

The Regex Filter

This guide shows how to create a Regex Filter by hand instead of using the ProfanityFilterFactory. Creating a filter by hand gives you finer-grained control over the performance and functionality of the filter.

The com.inversoft.profanity.RegexProfanityFilter class uses regular expressions to locate profanity within Strings. To create an instance of this class, we need implementations of two interfaces:

  • com.inversoft.profanity.ProfanitySource
  • com.inversoft.profanity.CommonWordSource

The ProfanitySource interface

The ProfanitySource supplies the filter with the profanity to search for. It is an interface so that custom data sources can be used. Two implementations of this interface can be used in conjunction with the Regex Filter:

  • com.inversoft.profanity.CachingFullXMLProfanitySource
  • com.inversoft.profanity.StaticFullXMLProfanitySource

One of these implementations must be used because they load the regular expressions from the database.

The Caching profanity source

The Caching implementation loads the database into memory and periodically checks to see if the database has been updated. If a new database is deployed, this implementation will load the new database into memory and begin using the new information.

The Caching profanity source uses synchronization in order to check and reload the database. This can cause performance bottlenecks in some applications. In most cases, you don't need this implementation of the profanity source, because more often than not you will restart the application when a new database is deployed. In addition, most applications don't update the database frequently enough to warrant the overhead of synchronization.

This profanity source can be constructed using this constructor:

  public CachingFullXMLProfanitySource(long reloadSeconds, int tolerance, String... resources);

The reloadSeconds parameter defines how often the database file should be reloaded.

The tolerance parameter controls which words from the database are loaded. The database might contain hundreds of words and loading all the words into memory might consume too many resources. Therefore, this parameter can be set so that only words whose rating is equal to or greater than the tolerance are loaded. For example, if tolerance is 5 only words whose rating is 5-10 are loaded.
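The effect of the tolerance parameter can be illustrated with a small, self-contained sketch. It is not the library's loading code: RatedWord, ToleranceSketch and the sample entries are hypothetical stand-ins, and the only thing taken from the description above is the rule that an entry is kept when its rating is equal to or greater than the tolerance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one database entry (the real database format is not shown here).
class RatedWord {
    final String word;
    final int rating;

    RatedWord(String word, int rating) {
        this.word = word;
        this.rating = rating;
    }
}

class ToleranceSketch {
    // Keeps only the entries whose rating is equal to or greater than the tolerance.
    static List<String> load(List<RatedWord> database, int tolerance) {
        List<String> loaded = new ArrayList<>();
        for (RatedWord entry : database) {
            if (entry.rating >= tolerance) {
                loaded.add(entry.word);
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<RatedWord> database = Arrays.asList(
            new RatedWord("mild", 2),
            new RatedWord("medium", 5),
            new RatedWord("severe", 9));

        System.out.println(load(database, 5)); // prints [medium, severe]
        System.out.println(load(database, 0)); // prints [mild, medium, severe]
    }
}
```

A tolerance of 0 therefore keeps every entry, while a higher tolerance trades coverage for a smaller memory footprint.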

The resources parameter is a list of file paths, URLs or classpath resources that point to one or more profanity databases. Each resource can be of a different type (e.g. one URL, one file and one classpath entry). If a resource references a file on the file system, or a URL that correctly returns the Last-Modified header, the reloadSeconds parameter is honored and that resource is checked periodically to determine whether the database has been updated. If a resource is a classpath entry, or a URL that doesn't return the Last-Modified header correctly, it is not reloaded.

This class can be constructed like this:

  ProfanitySource source = new CachingFullXMLProfanitySource(60, 0,
      "http://example.com/profanity-database-2.0.xml");

This tells the class to load all the words in the database and to check if the database should be reloaded every 60 seconds.
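The reload behavior described in this section can be sketched in plain Java. This is only an illustration under stated assumptions, not the implementation inside CachingFullXMLProfanitySource: ReloadingHolder, its constructor arguments and the injected clock are all hypothetical. It shows the two ideas from the text: a synchronized check that every caller passes through, and a reload that happens only when the last-modified value has changed after the reload interval has elapsed.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical illustration of a periodically reloading holder; not the library's code.
class ReloadingHolder<T> {
    private final long reloadMillis;
    private final LongSupplier lastModified; // e.g. a file timestamp or a Last-Modified header
    private final Supplier<T> loader;        // loads the data from the resource
    private final LongSupplier clock;        // injected so the sketch is easy to test
    private T value;
    private long loadedVersion;
    private long nextCheck;

    ReloadingHolder(long reloadSeconds, LongSupplier lastModified,
                    Supplier<T> loader, LongSupplier clock) {
        this.reloadMillis = reloadSeconds * 1000L;
        this.lastModified = lastModified;
        this.loader = loader;
        this.clock = clock;
        this.loadedVersion = lastModified.getAsLong();
        this.value = loader.get();
        this.nextCheck = clock.getAsLong() + reloadMillis;
    }

    // Every caller takes this lock, which is the synchronization cost mentioned above.
    synchronized T get() {
        long now = clock.getAsLong();
        if (now >= nextCheck) {
            nextCheck = now + reloadMillis;
            long current = lastModified.getAsLong();
            if (current != loadedVersion) {
                loadedVersion = current;
                value = loader.get();
            }
        }
        return value;
    }
}

class ReloadSketch {
    public static void main(String[] args) {
        long[] now = {0L};
        long[] modified = {100L};
        int[] loads = {0};
        ReloadingHolder<String> holder = new ReloadingHolder<>(
            60, () -> modified[0], () -> "database v" + (++loads[0]), () -> now[0]);

        System.out.println(holder.get()); // prints database v1
        modified[0] = 200L;               // a new database is deployed
        now[0] = 30_000L;                 // only 30 seconds have passed, so no check yet
        System.out.println(holder.get()); // prints database v1
        now[0] = 60_000L;                 // the 60 second interval has elapsed
        System.out.println(holder.get()); // prints database v2
    }
}
```

A resource with no usable last-modified value corresponds to a lastModified supplier that never changes, in which case the sketch never reloads.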

The Static profanity source

The Static implementation of the profanity source loads the database into memory once and does not check for updates. This source can be much faster than the caching source, depending on the number of concurrent threads that use the profanity filter at the same time. This profanity source is the recommended version for most applications.

This profanity source can be constructed using this constructor:

  public StaticFullXMLProfanitySource(int tolerance, String... resources);

These parameters have the same meaning as the parameters for the Caching profanity source and are described above. The only difference is that this source does not perform any reloading.

This class can be constructed like this:

  ProfanitySource source = new StaticFullXMLProfanitySource(4,
      "http://example.com/profanity-database-2.0.xml");

This tells the class to load all the words in the database whose rating is 4 or greater.

The CommonWordSource interface

The CommonWordSource defines a list of words that are ignored during processing. This increases the performance of the filter because it reduces the amount of the String that needs to be searched. Therefore, although providing an instance of this interface is optional, it is highly recommended. The default implementation of this interface loads a standard set of common words that ships with the Inversoft Profanity Filter and has been tuned to work in conjunction with the Inversoft Profanity Database. This class is com.inversoft.profanity.ResourceBundleCommonWordSource and can be constructed like this:

  ResourceBundleCommonWordSource source = new ResourceBundleCommonWordSource();

This will use the default source that comes with the filter.
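To make the performance argument concrete, here is a minimal sketch of the general idea of skipping common words before any pattern matching takes place. It does not reproduce how RegexProfanityFilter or ResourceBundleCommonWordSource work internally: the class name, the tiny COMMON set and the placeholder FLAGGED pattern are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration; the word list and pattern are placeholders.
class CommonWordSketch {
    static final Set<String> COMMON = new HashSet<>(Arrays.asList("the", "is", "a", "this"));
    static final Pattern FLAGGED = Pattern.compile("d[a@]rn", Pattern.CASE_INSENSITIVE);

    // Drops common words so that less text has to be matched against the patterns.
    static List<String> candidates(String text) {
        List<String> remaining = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty() && !COMMON.contains(token.toLowerCase(Locale.ROOT))) {
                remaining.add(token);
            }
        }
        return remaining;
    }

    // Runs the (comparatively expensive) regular expression only on the remaining tokens.
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        for (String token : candidates(text)) {
            if (FLAGGED.matcher(token).find()) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "This is a d@rn shame";
        System.out.println(candidates(text)); // prints [d@rn, shame]
        System.out.println(find(text));       // prints [d@rn]
    }
}
```

In this example three of the five tokens are discarded by a cheap set lookup before the regular expression runs at all.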

Constructing the Regex Filter

Now that we have all the classes required to create the RegexProfanityFilter, we can construct it like this:

  ProfanitySource profanitySource = new StaticFullXMLProfanitySource(5, "profanity-database-2.0.xml");
  CommonWordSource commonWordSource = new ResourceBundleCommonWordSource();
  ProfanityFilter filter = new RegexProfanityFilter(profanitySource, commonWordSource);

Calling the filter

Once the filter has been instantiated, using it is a simple matter of calling the findProfanity method. Here is an example of calling the method:

  // Example call using JDK 5.0
  String str = getStringFromSomewhere();
  ProfanityResult[] results = filter.findProfanity(str, 4, "Swear", "Slang");

  // using JDK 1.4 this would become
  // ProfanityResult[] results = filter.findProfanity(str, 4, new String[]{"Swear", "Slang"});

  if (results.length > 0) {
     for (int i = 0; i < results.length; i++) {
         System.out.println("Found profanity at " + results[i].getOffset());
     }
  }

Considerations

There are a few considerations when using the Inversoft Profanity Filter. First, the ProfanitySource implementations usually take a few seconds to load because of the size of the Inversoft Profanity Database. Therefore, it is a good idea to create the ProfanitySource during application startup and reuse the same instance. All the implementations that come with the filter are thread-safe.
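One common way to follow this advice is to build the expensive object once, in a static initializer, and hand the same instance to every caller. The sketch below uses a hypothetical SlowFilter class as a stand-in for a filter whose construction is slow. In a real application the static field would hold the RegexProfanityFilter built as shown earlier, and this pattern is only one option among several (a dependency injection container or a servlet context listener would serve the same purpose).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of "create once at startup, reuse everywhere".
class SharedFilterSketch {
    // Counts how many times the expensive construction runs.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Stand-in for an object that is slow to construct.
    static final class SlowFilter {
        SlowFilter() {
            LOADS.incrementAndGet();
        }
    }

    // Built exactly once, when the class is initialized.
    private static final SlowFilter INSTANCE = new SlowFilter();

    static SlowFilter filter() {
        return INSTANCE;
    }

    public static void main(String[] args) {
        SlowFilter first = filter();
        SlowFilter second = filter();
        System.out.println(first == second); // prints true
        System.out.println(LOADS.get());     // prints 1
    }
}
```

Because the JVM initializes a class only once, and in a thread-safe manner, no additional locking is needed to share the instance.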