The Regex Filter
This guide will help you create a Regex Filter by hand instead of using the
ProfanityFilterFactory. Creating a filter by hand allows you to have finer grained
control over the performance and functionality of the filter.
The com.inversoft.profanity.RegexProfanityFilter class uses regular expressions
to locate profanity within Strings. In order to create an instance of this class we will need
instances of two other interfaces. These are:
- com.inversoft.profanity.ProfanitySource
- com.inversoft.profanity.CommonWordSource
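As background on how regular expressions can locate words and report their positions, here is a minimal sketch that uses only java.util.regex from the JDK. It is not the RegexProfanityFilter implementation. The class name RegexLocateSketch, the method findOffsets and the placeholder pattern are all made up for illustration, and the code assumes a current JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: not part of the Inversoft API.
class RegexLocateSketch {

    // Returns the start offset of every match of the pattern in the input.
    static List<Integer> findOffsets(Pattern pattern, String input) {
        List<Integer> offsets = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            offsets.add(matcher.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Placeholder word; a real database entry would be a tuned expression.
        Pattern pattern = Pattern.compile("\\bd+a+r+n+\\b", Pattern.CASE_INSENSITIVE);
        // Prints [5, 21]
        System.out.println(findOffsets(pattern, "Well darn, that is a DAAARN shame."));
    }
}
```

The pattern tolerates repeated letters and mixed case, which is the kind of variation a plain substring search would miss.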
The ProfanitySource interface
The ProfanitySource is how the filter locates the profanity to search for. This
is an interface to allow custom data sources to be used. There are two different implementations
of this interface that can be used in conjunction with the Regex Filter:
- com.inversoft.profanity.CachingFullXMLProfanitySource
- com.inversoft.profanity.StaticFullXMLProfanitySource
These implementations must be used because they load the regular expressions from the database.
The Caching profanity source
The Caching implementation loads the database into memory and periodically checks to see if the database has been updated. If a new database is deployed, this implementation will load the new database into memory and begin using the new information.
The Caching profanity source uses synchronization in order to check and reload the database. This can cause performance bottlenecks for some applications. In most cases, you don't need to use this implementation of the profanity source because more often than not you will restart the application when a new database is deployed. In addition, most applications don't update the database frequently enough to warrant the overhead of synchronization.
This profanity source can be constructed using this constructor:
public CachingFullXMLProfanitySource(long reloadSeconds, int tolerance, String... resources);
The reloadSeconds parameter defines how often the database file should be reloaded.
The tolerance parameter controls which words from the database are loaded. The
database might contain hundreds of words and loading all the words into memory might consume
too many resources. Therefore, this parameter can be set so that only words whose rating is
equal to or greater than the tolerance are loaded. For example, if tolerance is 5, only words
whose rating is 5-10 are loaded.
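The tolerance rule can be illustrated with a small stand-alone sketch. It assumes the database can be viewed as a map from word to rating. The class ToleranceSketch and the method wordsToLoad are hypothetical names and do not come from the Inversoft API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the tolerance rule; not the library's loader.
class ToleranceSketch {

    // Keeps only the words whose rating is equal to or greater than the tolerance.
    static List<String> wordsToLoad(Map<String, Integer> ratings, int tolerance) {
        List<String> loaded = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : ratings.entrySet()) {
            if (entry.getValue() >= tolerance) {
                loaded.add(entry.getKey());
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        Map<String, Integer> ratings = new LinkedHashMap<>();
        ratings.put("wordA", 3);  // placeholder entries
        ratings.put("wordB", 5);
        ratings.put("wordC", 10);
        // Prints [wordB, wordC]
        System.out.println(wordsToLoad(ratings, 5));
    }
}
```

With a tolerance of 0 every entry passes the check, so the whole database is loaded.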
The resources parameter is a list of file paths, URLs or classpath resources that
point to one or more profanity databases. Each resource can be a different type (e.g. one URL,
one file and one classpath entry). If any of the resources references a file on
the file system or a URL that correctly returns the last modified header, the
reloadSeconds parameter will be honored and those resources will be periodically
checked to determine if the database has been updated. If a resource is a classpath entry or
a URL that doesn't return the last modified header correctly, it will not be reloaded.
This class can be constructed like this:
ProfanitySource source = new CachingFullXMLProfanitySource(60, 0,
"http://example.com/profanity-database-2.0.xml");
This tells the class to load all the words in the database and to check if the database should be reloaded every 60 seconds.
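To show where the synchronization cost of a periodically reloading source comes from, here is a rough sketch of the general technique: a synchronized accessor that compares a last-modified value once per reload interval. The real CachingFullXMLProfanitySource may work differently. ReloadingHolder and its constructor parameters are assumptions made for this example, and the code assumes a current JDK.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical illustration; the real caching source may differ.
class ReloadingHolder<T> {
    private final long reloadMillis;
    private final LongSupplier lastModified; // e.g. file::lastModified
    private final Supplier<T> loader;        // e.g. parses the database
    private final LongSupplier clock;        // e.g. System::currentTimeMillis
    private T value;
    private long loadedVersion;
    private long nextCheckAt;

    ReloadingHolder(long reloadSeconds, LongSupplier lastModified,
                    Supplier<T> loader, LongSupplier clock) {
        this.reloadMillis = reloadSeconds * 1000L;
        this.lastModified = lastModified;
        this.loader = loader;
        this.clock = clock;
        this.loadedVersion = lastModified.getAsLong();
        this.value = loader.get();
        this.nextCheckAt = clock.getAsLong() + reloadMillis;
    }

    // Every caller passes through this lock, which is the contention cost.
    synchronized T get() {
        long now = clock.getAsLong();
        if (now >= nextCheckAt) {
            nextCheckAt = now + reloadMillis;
            long current = lastModified.getAsLong();
            if (current != loadedVersion) {
                // The resource changed since the last load, so load it again.
                loadedVersion = current;
                value = loader.get();
            }
        }
        return value;
    }
}
```

A source that loads once and never checks again needs no lock on the read path, which is why it can be faster when many threads use the filter at the same time.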
The Static profanity source
The Static implementation of the profanity source loads the database into memory once and does not check for updates. This source can be much faster than the caching source, depending on the number of concurrent threads that use the profanity filter at the same time. This profanity source is the recommended version for most applications.
This profanity source can be constructed using this constructor:
public StaticFullXMLProfanitySource(int tolerance, String... resources);
These parameters have the same meaning as the parameters for the Caching profanity
source and are described above. The only difference is that this source does not perform any
reloading.
This class can be constructed like this:
ProfanitySource source = new StaticFullXMLProfanitySource(4,
"http://example.com/profanity-database-2.0.xml");
This tells the class to load all the words in the database whose rating is 4 or greater.
The CommonWordSource interface
The CommonWordSource defines a list of words that are ignored during processing.
This increases the performance of the filter because it reduces the amount of the String that
needs to be searched. Therefore, although providing an instance of this interface is optional,
it is highly recommended. The default implementation of this interface loads a standard set
of common words that ships with the Inversoft Profanity Filter and has been tuned to work
in conjunction with the Inversoft Profanity Database. This class is
com.inversoft.profanity.ResourceBundleCommonWordSource and can be constructed like
this:
ResourceBundleCommonWordSource source = new ResourceBundleCommonWordSource();
This will use the default source that comes with the filter.
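The performance benefit of a common word list can be illustrated with a simplified sketch. It assumes the text is split on non-word characters and that common words are matched case-insensitively. The actual filter may tokenize and match differently. CommonWordSketch and candidates are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not the library's common word handling.
class CommonWordSketch {

    // Returns the tokens that still need to be searched after common words are skipped.
    static List<String> candidates(String text, Set<String> commonWords) {
        List<String> remaining = new ArrayList<>();
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split yields an empty first token if the text starts with punctuation
            }
            if (!commonWords.contains(token.toLowerCase(Locale.ROOT))) {
                remaining.add(token);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "on"));
        // Prints [cat, sat, mat]
        System.out.println(candidates("The cat sat on the mat", common));
    }
}
```

Here half of the tokens are discarded by a cheap set lookup before any regular expression would need to run.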
Constructing the Regex Filter
Now that we have all the classes required to create the RegexProfanityFilter, we can
construct it like this:
ProfanitySource profanitySource = new StaticFullXMLProfanitySource(5, "profanity-database-2.0.xml");
CommonWordSource commonWordSource = new ResourceBundleCommonWordSource();
ProfanityFilter filter = new RegexProfanityFilter(profanitySource, commonWordSource);
Calling the filter
Once the filter has been instantiated, it is a simple matter of calling the findProfanity
method. Here is an example of calling the method:
// Example call using JDK 5.0
String str = getStringFromSomewhere();
ProfanityResult[] results = filter.findProfanity(str, 4, "Swear", "Slang");

// using JDK 1.4 this would become
// ProfanityResult[] results = filter.findProfanity(str, 4, new String[]{"Swear", "Slang"});

if (results.length > 0) {
    for (int i = 0; i < results.length; i++) {
        System.out.println("Found profanity at " + results[i].getOffset());
    }
}
Considerations
There are a few considerations when using the Inversoft Profanity Filter. First, the
ProfanitySource implementations usually take a few seconds to load because of the
size of the Inversoft Profanity Database. Therefore, it is a good idea to create the source
during application start up and reuse the same instance. All the implementations that come
with the filter are thread safe.
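One common way to build an expensive object once and share it across threads is the holder idiom sketched below. This is a general Java pattern and not something the filter requires. ExpensiveFilter is a stand-in for the real source and filter, and the LOADS counter exists only to make the single construction visible.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of building a costly object once and reusing it.
class FilterHolderSketch {
    // Counts constructions so the example can show that only one happens.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Stand-in for an expensive, thread-safe object such as a profanity filter.
    static final class ExpensiveFilter {
        ExpensiveFilter() {
            LOADS.incrementAndGet();
        }
    }

    // The JVM initializes this nested class once, on first use, in a thread-safe way.
    private static final class Holder {
        static final ExpensiveFilter INSTANCE = new ExpensiveFilter();
    }

    static ExpensiveFilter get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        get(); // call once during start up to pay the load cost early
        System.out.println(get() == get()); // the same instance every time
    }
}
```

Because the holder is initialized lazily, calling get() once during application start up keeps the load delay out of the first user request.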










