Database Reference
This guide is a reference for the contents of the Inversoft Profanity Database. It is intended for anyone who wants to understand, extend, or create a database.
The database is an XML file that conforms to the XML schema located at http://www.inversoft.com/schemas/profanity-2.0/database. This schema also contains a good amount of documentation for the database and can be used to validate a database to ensure it is correctly formed.
The root element
The Inversoft Profanity Database contains a root element named entries that should
be placed into the XML namespace http://www.inversoft.com/schemas/profanity-2.0/database.
Although using the namespace isn't required, it helps during validation and avoids XML naming
conflicts with other documents. The Inversoft Profanity Filter will load any XML file that
contains the correct root element, even if it is not in the correct namespace. This might change in
a future release, so it is best to always use the namespace.
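As a rough illustration of the namespace behavior described above, the following sketch parses a small database with Python's standard xml.etree.ElementTree module and accepts the root element whether or not it is namespaced. The sample document and the helper names local_name and load_root are hypothetical and are not part of the Inversoft Profanity Filter.

```python
import xml.etree.ElementTree as ET

NAMESPACE = "http://www.inversoft.com/schemas/profanity-2.0/database"

# Hypothetical sample database; a real database carries more child elements.
SAMPLE = f"""<entries xmlns="{NAMESPACE}">
  <entry>
    <phrase>foo</phrase>
  </entry>
</entries>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to qualified tags."""
    return tag.rsplit("}", 1)[-1]


def load_root(xml_text):
    """Parse a database and check the root element name, with or without the namespace."""
    root = ET.fromstring(xml_text)
    if local_name(root.tag) != "entries":
        raise ValueError(f"unexpected root element: {root.tag}")
    return root


root = load_root(SAMPLE)
# True only when the root element is in the recommended namespace.
uses_namespace = root.tag == f"{{{NAMESPACE}}}entries"
```

Checking uses_namespace separately lets a tool warn about databases that load today but would fail if the namespace became mandatory.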
The entry element
The root element contains a child element for each entry in the database. These elements are
named entry and contain the information about a single word in the database. The
information about an entry is stored in child elements of the entry element.
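The entry structure can be read generically, without knowing every child element in advance. This is a minimal sketch that collects each entry into a dictionary keyed by child element name. The sample data and the function name read_entries are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with only two of the documented child elements.
SAMPLE = """<entries>
  <entry>
    <phrase>foo</phrase>
    <rating>3</rating>
  </entry>
  <entry>
    <phrase>bar</phrase>
    <rating>7</rating>
  </entry>
</entries>"""


def read_entries(xml_text):
    """Return one dict per entry element, mapping child element names to their text."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root:
        fields = {}
        for child in entry:
            name = child.tag.rsplit("}", 1)[-1]  # drop any namespace prefix
            fields[name] = (child.text or "").strip()
        entries.append(fields)
    return entries


entries = read_entries(SAMPLE)
```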
The phrase element
The phrase element contains the profane word or phrase. For example, this
element might contain profanity using this format:
<phrase>foo</phrase>
<phrase>bar</phrase>
<phrase>baz</phrase>
The rating element
The rating element contains the rating for the word and is an integer that ranges
from 1 to 10. The higher the rating, the more profane and offensive the word is. Here is a rough
outline of the various ratings:
- 1
- Not offensive, but might be considered inappropriate in very few contexts
- 2 - 4
- Each step in this range increases the contexts where the words might be considered more offensive. It is generally best to consult the database to determine if these words should be included when filtering.
- 5
- This value is the cusp between non-offensive and offensive. Words with a rating of 5 are usually considered offensive and should be filtered in most cases. For more adult-oriented content, this level might be too restrictive. These words are generally more difficult to use as an insult directed at someone, although this might not always be the case.
- 6
- These words are usually considered offensive and should be filtered in most cases, even in more adult-oriented content.
- 7
- All words rated 7 and above are generally offensive and can often be directed at someone, which makes them good candidates for filtering.
- 8 - 10
- Words in this range simply increase in vulgarity and offensiveness.
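One way an application might use the rating scale is to pick a threshold and filter every entry at or above it. The sketch below assumes the entries have already been read into (phrase, rating) pairs. The function name phrases_to_filter and the default threshold of 5 are illustrative choices based on the outline above, not settings of the product.

```python
# Hypothetical (phrase, rating) pairs taken from a database.
ENTRIES = [("foo", 2), ("bar", 5), ("baz", 9)]


def phrases_to_filter(entries, threshold=5):
    """Return the phrases whose rating is at or above the threshold."""
    if not 1 <= threshold <= 10:
        raise ValueError("threshold must be between 1 and 10")
    return [phrase for phrase, rating in entries if rating >= threshold]


# A general audience might filter from the cusp rating upward,
# while adult-oriented content might only filter the higher ratings.
general_audience = phrases_to_filter(ENTRIES)
adult_audience = phrases_to_filter(ENTRIES, threshold=7)
```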
The type element
The type element defines the category of the entry. This allows entries to be
ignored based on the type. The types are as follows:
- Alcohol
- Any alcohol-related words that some applications might want to filter, such as beer, ale, wine, etc.
- Drug
- Any drug-related words or slang that some applications might want to filter, such as pot, weed, bong, etc.
- Religion
- Currently this category is rather small, but includes words that might offend certain users who have strong religious beliefs. This includes words and phrases such as "god damnit", "anti-christ", etc.
- Racial
- This category contains all racial slurs. This includes words such as "nigger".
- Swear
- Swear is a very broad category and includes all words that are only considered direct swears. These words usually have no other meaning than profanity. The Swear category includes words such as shit, ass, fuck, etc.
- Slang
- Slang is a very broad category and includes most words that are not considered direct swears. The slang category includes words such as airhead, bang, bimbo, etc.
- Youth
- This category contains all the words that are dictionary words and have no slang meanings but might be deemed inappropriate for younger users. This includes words such as bestiality, bisexual, homosexual, etc.
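Ignoring entries by type could look like the following sketch. It assumes the text of the type element matches the category names listed above exactly. The entry dictionaries and the IGNORED_TYPES set are hypothetical.

```python
# Categories this hypothetical application chooses not to filter.
IGNORED_TYPES = {"Alcohol", "Religion"}

# Hypothetical entries, reduced to the two fields used here.
ENTRIES = [
    {"phrase": "foo", "type": "Swear"},
    {"phrase": "bar", "type": "Alcohol"},
    {"phrase": "baz", "type": "Youth"},
]


def active_entries(entries, ignored_types):
    """Keep only the entries whose type is not in the ignored set."""
    return [entry for entry in entries if entry["type"] not in ignored_types]


active_phrases = [entry["phrase"] for entry in active_entries(ENTRIES, IGNORED_TYPES)]
```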
The alternatives element
The alternatives element is a comma-separated list of alternative spellings,
conjugations and words that might be used in place of the database entry. The list should
contain no whitespace between an alternative, the comma, and the next alternative.
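The format rule for alternatives can be checked while splitting the list. This is a small sketch with a hypothetical function name, parse_alternatives, that rejects empty items and items with leading or trailing whitespace. Spaces inside an alternative are left alone, on the assumption that an alternative may itself be a multi-word phrase.

```python
def parse_alternatives(text):
    """Split a comma-separated alternatives list and reject stray whitespace."""
    if not text:
        return []
    alternatives = text.split(",")
    for alternative in alternatives:
        if not alternative or alternative != alternative.strip():
            raise ValueError(f"malformed alternative: {alternative!r}")
    return alternatives


alternatives = parse_alternatives("fooed,fooing,phoo")
```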
The regex_list element
The regex_list element is a list of regular expressions that can be used to find
the database entry within a block of text. Each regular expression is defined on a separate line
and must not contain any whitespace before or after the regular expression.
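The one-expression-per-line layout can be processed as in the sketch below. It uses Python's re module purely for illustration, so the database's actual regular expression dialect is an assumption here, and the sample patterns are hypothetical. Blank lines are skipped, and lines with surrounding whitespace are rejected in line with the rule above.

```python
import re

# Hypothetical text content of a regex_list element, one expression per line.
REGEX_LIST_TEXT = "f+o+\nb+a+r+"


def compile_regex_list(text):
    """Compile each non-empty line of a regex_list value into a pattern."""
    patterns = []
    for line in text.split("\n"):
        if not line:
            continue  # skip blank lines between expressions
        if line != line.strip():
            raise ValueError(f"whitespace around expression: {line!r}")
        patterns.append(re.compile(line))
    return patterns


def find_matches(patterns, content):
    """Return every substring of content matched by any of the patterns."""
    return [match.group(0) for pattern in patterns for match in pattern.finditer(content)]


matches = find_matches(compile_regex_list(REGEX_LIST_TEXT), "fooo and baar")
```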
The excludes element
The excludes element is a comma-separated list of words that should be excluded
from the filtering process. In most cases these are words that contain the profanity, but are
not themselves profanity. For example, the word 'cockney' contains 'cock' but is not profanity,
so it belongs in the excludes list. On the other hand, 'cocksucker' is considered profanity and should not be excluded.
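The effect of the excludes list can be sketched as a substring search that skips any word named in the list. The placeholder phrase, the sample excludes value, and the function name find_profanity are hypothetical, and the real filter's matching logic may differ.

```python
import re


def find_profanity(content, phrase, excludes):
    """Return words that contain the phrase but are not in the excludes list."""
    excluded = {word.lower() for word in excludes.split(",") if word}
    hits = []
    for match in re.finditer(r"[A-Za-z']+", content):
        word = match.group(0)
        lowered = word.lower()
        if phrase in lowered and lowered not in excluded:
            hits.append(word)
    return hits


# 'Food' contains the placeholder phrase but is excluded; 'foohead' is not.
hits = find_profanity("Food fight, you foohead", phrase="foo", excludes="food,foolish")
```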
The definition element
The definition element contains a short description of the word or phrase.
The language_code element
The language_code element is the ISO 639-1 language code of the database entry.
This controls part of the locale of the database entry. It defines the language in which the entry
is most generally used and considered profane or offensive.
The country_code element
The country_code element is the ISO 3166 country code of the database entry. This
controls part of the locale of the database entry. It defines the country where the entry is
most generally used and considered profane or offensive.
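Together, language_code and country_code let an application select the entries that apply to a given locale. The sketch below assumes two-letter country codes and uses hypothetical sample entries. The function name entries_for_locale is illustrative only.

```python
# Hypothetical entries; the two-letter country codes are an assumption.
ENTRIES = [
    {"phrase": "foo", "language_code": "en", "country_code": "US"},
    {"phrase": "bar", "language_code": "en", "country_code": "GB"},
    {"phrase": "baz", "language_code": "fr", "country_code": "FR"},
]


def entries_for_locale(entries, language, country=None):
    """Return phrases for a language, optionally narrowed to one country."""
    selected = []
    for entry in entries:
        if entry["language_code"] != language:
            continue
        if country is not None and entry["country_code"] != country:
            continue
        selected.append(entry["phrase"])
    return selected


english = entries_for_locale(ENTRIES, "en")
british = entries_for_locale(ENTRIES, "en", "GB")
```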
The is_* elements
The database contains a number of additional elements that define the grammatical use of the
word within a sentence. These element names all start with is_. The following
elements control the function of the entry in a sentence:
- is_adverb
- is_adjective
- is_noun
- is_verb
- is_command
- is_exclamation
The is_embeddable element
The is_embeddable element controls whether or not the word can be embedded inside
another word. An example of embedding would be the word 'ass' inside 'assface'.
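The difference embedding makes can be shown with two regular expressions: one that matches the phrase anywhere, and one that only matches it as a whole word. This is a sketch of the idea using a placeholder phrase, not the filter's actual matching code. The function name build_pattern and the case-insensitive matching are assumptions.

```python
import re


def build_pattern(phrase, embeddable):
    """Match the phrase anywhere if embeddable, otherwise only as a whole word."""
    escaped = re.escape(phrase)
    if embeddable:
        return re.compile(escaped, re.IGNORECASE)
    return re.compile(rf"\b{escaped}\b", re.IGNORECASE)


embedded_hit = build_pattern("foo", embeddable=True).search("foobar")
whole_word_miss = build_pattern("foo", embeddable=False).search("foobar")
whole_word_hit = build_pattern("foo", embeddable=False).search("a foo here")
```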
The is_phrase element
The is_phrase element determines if the phrase element contains a word or a
phrase. A phrase contains more than one word, separated by spaces or a dash.
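When building a database, the definition of a phrase above can be used to sanity-check the is_phrase value against the phrase text. The helper below is a hypothetical check, not part of the product. It counts the words left after splitting on spaces and dashes.

```python
import re


def is_phrase(text):
    """Return True when the text holds more than one word, split on spaces or dashes."""
    words = [part for part in re.split(r"[ -]+", text.strip()) if part]
    return len(words) > 1


single_word = is_phrase("foo")
spaced = is_phrase("foo bar")
dashed = is_phrase("foo-bar")
```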