Getting started using the Inversoft Profanity Filter
This guide will help you get started using the Inversoft Profanity Filter. It covers the basics of the regular expression filter and the Inversoft Profanity Database. For more advanced topics, consult the JavaDoc for the filter.
The ProfanityFilter interface
The main interface to the filter is com.inversoft.profanity.ProfanityFilter. This
interface contains all the methods to check for, locate, and replace profanity within Strings.
This guide will walk through how to locate all of the profanity in a String, which is the most
common usage of the filter.
The findProfanity method has two variations. The first version returns an array of
ProfanityResult objects, one for each profanity found. The second takes a
ProfanityListener, which is called when the filter finds profanity within the String.
This second method does not return a value since the listener is called for each profanity and
can store the results as they are encountered. To keep things simple, we will look at the first
version of this method. The JavaDoc contains information about all the other methods on this
interface.
The signature of this method looks like this:
ProfanityResult[] findProfanity(String str, int tolerance, String... types);
Because the final parameter is a varargs parameter, this method can also be called from JDK 1.4 by passing an array of Strings, or null, as the final parameter.
The str parameter is the String to be searched. The tolerance parameter defines
how lenient the filter should be with respect to profanity. The types parameter defines the
types of profanity to search for. To support these parameters, the Inversoft Profanity
Database stores two additional attributes with each definition it contains. Because these
attributes and the parameters passed to the filter method narrow the set of words to search
for, they also speed up filtering. These attributes are:
- Rating
- Type
The rating defines the severity of the word on a scale from 1 to 10, where 10 is the most
severe and offensive and 1 is the least. The type defines the category of the word, such as
Slang, Swear, or Drug. The tolerance parameter determines the lowest rating of words to use from the database.
For example, if 5 is passed as the tolerance to the filter, the filter will only search for
words whose rating is 5-10. Likewise, the types parameter defines which categories of words to
search for. Passing in new String[]{"Swear"} will only search for words whose type
is Swear. By reducing the total set of words to search for, the filtering time is
reduced.
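The effect of the tolerance and types parameters on the search set can be sketched as follows. This is not the filter's implementation: the Entry class, the select method, and the placeholder words are hypothetical stand-ins that only illustrate how a rating threshold and a list of types shrink the set of words to search for. Treating a null or empty types array as "all types" is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one database definition; not part of the Inversoft API.
class Entry {
    final String word;
    final int rating;   // 1 (least severe) to 10 (most severe)
    final String type;  // e.g. "Swear", "Slang", "Drug"

    Entry(String word, int rating, String type) {
        this.word = word;
        this.rating = rating;
        this.type = type;
    }
}

class ToleranceSketch {
    // Keeps the words rated at or above the tolerance whose type was requested.
    // Assumption: a null or empty types array means "all types".
    static List<String> select(List<Entry> entries, int tolerance, String... types) {
        List<String> allowed = (types == null) ? new ArrayList<String>() : Arrays.asList(types);
        List<String> selected = new ArrayList<String>();
        for (Entry e : entries) {
            boolean ratingOk = e.rating >= tolerance;
            boolean typeOk = allowed.isEmpty() || allowed.contains(e.type);
            if (ratingOk && typeOk) {
                selected.add(e.word);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("wordA", 8, "Swear"),
            new Entry("wordB", 3, "Swear"),
            new Entry("wordC", 6, "Slang"),
            new Entry("wordD", 9, "Drug"));
        System.out.println(select(entries, 5, "Swear")); // prints [wordA]
        System.out.println(select(entries, 5));          // prints [wordA, wordC, wordD]
    }
}
```

With a tolerance of 5 and the single type Swear, only one of the four placeholder entries remains to be searched for.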
The result of this method is an array of ProfanityResult objects. Each object describes one match found in the String: its offset, its length, and the profanity that was matched. The array is sorted by the offset (start position) of the match within the String.
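One common use of the offset and length is to mask each match in the original text. The sketch below shows the idea under stated assumptions: because this guide does not list the accessor names for the length and the matched word, it uses its own hypothetical Match class rather than ProfanityResult, and the word "darn" is only a placeholder for a match.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for ProfanityResult, holding only an offset and a length.
class Match {
    final int offset;
    final int length;

    Match(int offset, int length) {
        this.offset = offset;
        this.length = length;
    }
}

class MaskSketch {
    // Replaces every matched character with '*' and leaves the rest of the text unchanged.
    static String mask(String text, List<Match> matches) {
        StringBuilder sb = new StringBuilder(text);
        for (Match m : matches) {
            for (int i = m.offset; i < m.offset + m.length; i++) {
                sb.setCharAt(i, '*');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The placeholder match starts at offset 5 and is 4 characters long.
        String text = "Well darn, that hurt";
        System.out.println(mask(text, Arrays.asList(new Match(5, 4)))); // prints Well ****, that hurt
    }
}
```

Because masking replaces characters one for one, the offsets of later matches stay valid while earlier ones are being masked.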
The RegexProfanityFilter class
The implementation of the ProfanityFilter interface that we will be using is
com.inversoft.profanity.RegexProfanityFilter. This class uses regular expressions
to locate profanity within Strings. To create an instance of this class we will need
implementations of three other interfaces. These are:
- com.inversoft.profanity.ProfanitySource
- com.inversoft.profanity.CommonWordSource
- com.inversoft.profanity.ExclusionWordSource
The ProfanitySource interface
The ProfanitySource is how the filter locates the profanity to search for. This
is an interface to allow custom data sources to be used. The default implementation of this
interface uses the Inversoft Profanity Database XML file, which is loaded from the file system,
classpath or a URL. This implementation is com.inversoft.profanity.CachingFullXMLProfanitySource.
This class loads the XML file into memory and caches it there for faster access. The
constructor for this class looks like this:
public CachingFullXMLProfanitySource(String resource, long reloadSeconds, int tolerance);
The resource parameter is either a file path to the Inversoft Profanity Database
XML file, a location in the classpath of the file or a URL to load the database from. The
reloadSeconds parameter defines how often the database file should be reloaded.
If resource points to a file on the file system or a URL that correctly returns
the last modified header, this parameter specifies how often the file/URL should be checked to
determine if it has changed. In this case, only if the file/URL has changed is it reloaded.
For all other resources, this parameter controls how often the database is reloaded. Finally,
the tolerance parameter controls which words from the database are loaded. The database might
contain hundreds of words and loading all the words into memory might consume too many
resources. Therefore, this parameter can be set so that only words whose rating is equal to or
greater than the tolerance are loaded. For example, if tolerance is 5 only words whose rating
is 5-10 are loaded.
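The check-then-reload behavior described for files and URLs can be sketched as follows. This is an illustration only, not the code of CachingFullXMLProfanitySource: the class and method names are hypothetical, and the current time and the resource's last-modified timestamp are passed in explicitly so the logic is easy to follow.

```java
// Hypothetical sketch of "check every reloadSeconds, reload only if changed".
class ReloadCheckSketch {
    private final long reloadSeconds;
    private long lastCheckMillis;
    private long loadedTimestamp; // last-modified time of the copy held in memory
    private int loadCount;

    ReloadCheckSketch(long reloadSeconds, long nowMillis, long lastModified) {
        this.reloadSeconds = reloadSeconds;
        this.lastCheckMillis = nowMillis;
        load(lastModified);
    }

    // Returns true only when this call reloaded the database.
    boolean refreshIfNeeded(long nowMillis, long lastModified) {
        // Compare in seconds so that Long.MAX_VALUE ("never") cannot overflow.
        if ((nowMillis - lastCheckMillis) / 1000L < reloadSeconds) {
            return false; // too soon to check again
        }
        lastCheckMillis = nowMillis;
        if (lastModified == loadedTimestamp) {
            return false; // checked, but the resource has not changed
        }
        load(lastModified);
        return true;
    }

    private void load(long lastModified) {
        loadedTimestamp = lastModified; // a real source would parse the XML here
        loadCount++;
    }

    int getLoadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        ReloadCheckSketch s = new ReloadCheckSketch(60, 0L, 1000L);
        System.out.println(s.refreshIfNeeded(30000L, 2000L));  // false: only 30 seconds elapsed
        System.out.println(s.refreshIfNeeded(60000L, 2000L));  // true: interval elapsed, timestamp changed
        System.out.println(s.refreshIfNeeded(120000L, 2000L)); // false: interval elapsed, nothing changed
    }
}
```

In real code the timestamp would typically come from java.io.File.lastModified() or java.net.URLConnection.getLastModified(). With a reload interval of Long.MAX_VALUE the first check never passes, so nothing is ever reloaded.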
This class can be constructed like this:
ProfanitySource source = new CachingFullXMLProfanitySource("http://example.com/badwords-english.xml",
    Long.MAX_VALUE, 0);
This tells the class to load all the words in the database and to never reload.
The CommonWordSource interface
The CommonWordSource defines a list of words that are ignored during processing.
This increases the performance of the filter because it reduces the amount of the String that
needs to be searched. Therefore, although providing an instance of this interface is optional,
it is highly recommended. The default implementation of this interface loads a standard set
of common words that ships with the Inversoft Profanity Filter and has been tuned to work
in conjunction with the Inversoft Profanity Database. This class is
com.inversoft.profanity.ResourceBundleCommonWordSource and can be constructed like
this:
ResourceBundleCommonWordSource source = new ResourceBundleCommonWordSource();
This will use the default source that comes with the filter.
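The idea of skipping common words before searching can be sketched as follows. How the filter applies its common word list internally is not described in this guide, so the whitespace-and-punctuation tokenizing, the class name, and the sample word list below are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: drop common words so that less text remains to be searched.
class CommonWordSketch {
    static List<String> wordsToSearch(String text, Set<String> commonWords) {
        List<String> remaining = new ArrayList<String>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !commonWords.contains(token)) {
                remaining.add(token);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<String>(Arrays.asList("the", "and", "a", "of"));
        // Three of the eight words are common, so only five remain to be checked.
        System.out.println(wordsToSearch("The quick brown fox and the lazy dog", common));
    }
}
```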
The ExclusionWordSource interface
The ExclusionWordSource interface defines a list of words that are ignored during
processing because they contain profanity but are not profanity. For example, the words
assume and peacock both contain profanity but are not themselves
profanity. This list is used to contextualize words in the String during processing. The
default implementation of this interface uses the Inversoft Profanity Database as its source
for exclusions. Each word in the Inversoft Profanity Database contains a list of exclusion words
and these are used by the filter to reduce false positive matches during processing. This
implementation is com.inversoft.profanity.CachingXMLExclusionWordSource and can
be constructed in the same manner as the ProfanitySource implementation. It takes
a resource, reload time and tolerance, all of which have the same meaning as they do for the
ProfanitySource. This class can be constructed like this:
ExclusionWordSource source = new CachingXMLExclusionWordSource("badwords-english.xml",
    Long.MAX_VALUE, 0);
This tells the class to load all the exclusion words from the database and to never reload.
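The way an exclusion list suppresses false positives can be sketched as follows. This is not the filter's algorithm: the class name, the word-by-word matching, and the placeholder words ("darn" as the word to find, "darning" as its exclusion) are assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a word containing the search term is ignored when it is an exclusion.
class ExclusionSketch {
    static List<String> findHits(String text, String badWord, Set<String> exclusions) {
        List<String> hits = new ArrayList<String>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.contains(badWord) && !exclusions.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> exclusions = new HashSet<String>(Arrays.asList("darning"));
        // "darning" contains the search term but is excluded, so only "darn" is reported.
        System.out.println(findHits("Darning socks? Darn it.", "darn", exclusions)); // prints [darn]
    }
}
```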
Now that we have all the classes required to create the RegexProfanityFilter, we can
construct it like this:
ProfanitySource profanitySource = new CachingFullXMLProfanitySource("badwords-english.xml",
    Long.MAX_VALUE, 0);
ResourceBundleCommonWordSource commonWordSource = new ResourceBundleCommonWordSource();
ExclusionWordSource exclusionWordSource = new CachingXMLExclusionWordSource("badwords-english.xml",
    Long.MAX_VALUE, 0);
RegexProfanityFilter filter = new RegexProfanityFilter(profanitySource, commonWordSource,
    exclusionWordSource);
Calling the filter
Once the filter has been instantiated it is a simple matter of calling the findProfanity
method. Here is an example of calling the method:
// Example call using JDK 5.0
String str = getStringFromSomewhere();
ProfanityResult[] results = filter.findProfanity(str, 4, "Swear", "Slang");

// Using JDK 1.4 this would become:
// ProfanityResult[] results = filter.findProfanity(str, 4, new String[]{"Swear", "Slang"});

if (results.length > 0) {
    for (int i = 0; i < results.length; i++) {
        System.out.println("Found profanity at " + results[i].getOffset());
    }
}
Considerations
There are a few considerations when using the Inversoft Profanity Filter. First, the
ProfanitySource implementations usually take a few seconds to load because of the
size of the Inversoft Profanity Database. Therefore, it is a good idea to create this during
application start up and reuse the same instance. All the implementations that come with the
filter are thread safe.
Similarly, the ExclusionWordSource implementations also take a little time to
create and should be created at startup. Using the reload feature of the sources can cause the
application to pause due to synchronization requirements, so it should be avoided unless
the source is constructed using a file location on the local file system. In that case the
source classes check whether the file has changed and, if it has not, no reload is
performed.
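Creating the sources once at startup and reusing them can be sketched with a static holder. The SlowSource class below is a hypothetical stand-in for a slow-to-construct object such as a ProfanitySource; none of these names come from the Inversoft API, and this is only one of several reasonable ways to share a single instance.

```java
// Hypothetical stand-in for a slow-to-construct object such as a ProfanitySource.
class SlowSource {
    static int constructions = 0;

    SlowSource() {
        constructions++; // a real source would parse the database here
    }
}

class SharedSource {
    // Built once, when the class is initialized. The JVM runs class initialization
    // exactly once, even when several threads use the class at the same time.
    private static final SlowSource INSTANCE = new SlowSource();

    static SlowSource get() {
        return INSTANCE;
    }

    public static void main(String[] args) {
        SlowSource first = SharedSource.get();  // e.g. called during application start up
        SlowSource second = SharedSource.get(); // later callers reuse the same object
        System.out.println(first == second);          // prints true
        System.out.println(SlowSource.constructions); // prints 1
    }
}
```

Calling the accessor once during application start up moves the loading cost out of the first user request.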










