Inversoft Profanity Database Documentation

Database Reference

This guide is a reference for the contents of the Inversoft Profanity Database. It is intended for anyone who wants to understand the database, add entries to it, or create a new one.

The database is an XML file that conforms to the XML schema located at http://www.inversoft.com/schemas/profanity-2.0/database. This schema also contains a good amount of documentation for the database and can be used to validate a database to ensure it is correctly formed.

The root element

The Inversoft Profanity Database contains a root element named entries that should be placed into the XML namespace http://www.inversoft.com/schemas/profanity-2.0/database. Although using the namespace isn't required, it helps during validation and avoids XML naming conflicts with other documents. The Inversoft Profanity Filter will load any XML file that contains the correct element, even if it is not in the correct namespace. This might change in a future release, so using namespaces is always best.
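The namespace rule above can be illustrated with a minimal Python sketch. It is not the Inversoft Profanity Filter's own loader; the helper names local_name and parse_root are hypothetical. It accepts a root element named entries whether or not it is in the namespace, and reports which case applies.

```python
import xml.etree.ElementTree as ET

# Namespace taken from this guide.
NAMESPACE = "http://www.inversoft.com/schemas/profanity-2.0/database"

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]

def parse_root(xml_text):
    """Return (root, in_namespace); raise ValueError if the root is not 'entries'."""
    root = ET.fromstring(xml_text)
    if local_name(root.tag) != "entries":
        raise ValueError("unexpected root element: " + root.tag)
    return root, root.tag == "{%s}entries" % NAMESPACE

namespaced = '<entries xmlns="%s"/>' % NAMESPACE
plain = "<entries/>"
```

Calling parse_root(namespaced) reports that the root is in the namespace, while parse_root(plain) still succeeds but reports that it is not.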

The entry element

The root element contains a child element for each entry in the database. These elements are named entry and contain the information about a single word in the database. The information about an entry is stored in child elements of the entry element.
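As a rough illustration of this layout, the following Python sketch reads each entry element into a dictionary keyed by child element name. The sample data and the helper name entries_as_dicts are made up for this example, and the sketch assumes the child elements share the root element's namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the placeholder words and ratings are not real entries.
SAMPLE = """<entries xmlns="http://www.inversoft.com/schemas/profanity-2.0/database">
  <entry><phrase>foo</phrase><rating>3</rating></entry>
  <entry><phrase>bar</phrase><rating>7</rating></entry>
</entries>"""

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]

def entries_as_dicts(xml_text):
    """Map every entry element to {child element name: child text}."""
    root = ET.fromstring(xml_text)
    result = []
    for entry in root:
        if local_name(entry.tag) != "entry":
            continue  # ignore anything that is not an entry
        fields = {}
        for child in entry:
            fields[local_name(child.tag)] = (child.text or "").strip()
        result.append(fields)
    return result
```

All values are returned as strings; converting a field such as the rating to an integer is left to the caller.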

The phrase element

The phrase element contains the profanity word or phrase. For example, this element might contain profanity using this format:

  <phrase>foo</phrase>

  <phrase>bar</phrase>

  <phrase>baz</phrase>

The rating element

The rating element contains the rating for the word and is an integer that ranges from 1 to 10. The higher the rating, the more profane and offensive the word is. Here is a rough outline of the various ratings:

1
Not offensive, but might be considered inappropriate in very few contexts
2 - 4
Each step in this range increases the contexts where the words might be considered more offensive. It is generally best to consult the database to determine if these words should be included when filtering.
5
This value is the cusp between non-offensive and offensive. Words with a rating of 5 are usually considered offensive and should be filtered in most cases. For more adult-oriented content, this level might be too restrictive. These words are generally more difficult to use as an insult directed at someone, although this might not always be the case.
6
These words are usually considered offensive and should be filtered in most cases, even in more adult-oriented content.
7
All words rated 7 and above are generally offensive and can often be directed at someone, which makes them good candidates for filtering.
8 - 10
Words in this range simply increase in vulgarity and offensiveness.
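One way an application might act on these ratings is a simple threshold check, sketched below in Python. The function name should_filter and the default threshold of 5 are assumptions made for this example, chosen because 5 is described above as the cusp; they are not settings of the Inversoft Profanity Filter.

```python
def should_filter(rating, threshold=5):
    """Return True when an entry's rating meets or exceeds the threshold.

    The default threshold of 5 is a hypothetical choice for this sketch.
    """
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    return rating >= threshold
```

A site with more adult-oriented content might raise the threshold, for example should_filter(5, threshold=6), so that words rated 5 pass through.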

The type element

The type element defines the category of the entry. This allows entries to be ignored based on the type. The types are as follows:

Alcohol
Any alcohol-related words that some applications might want to filter, such as beer, ale, wine, etc.
Drug
Any drug-related words or slang that some applications might want to filter, such as pot, weed, bong, etc.
Religion
Currently this category is rather small, but includes words that might offend certain users who have strong religious beliefs. This includes words and phrases such as "god damnit", "anti-christ", etc.
Racial
This category contains all racial slurs. This includes words such as "nigger".
Swear
Swear is a very broad category and includes all words that are only considered direct swears. These words usually have no other meaning than profanity. The Swear category includes words such as shit, ass, fuck, etc.
Slang
Slang is a very broad category and includes most words that are not considered direct swears. The slang category includes words such as airhead, bang, bimbo, etc.
Youth
This category contains all the words that are dictionary words and have no slang meanings but might be deemed inappropriate for younger users. This includes words such as bestiality, bisexual, homosexual, etc.
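Ignoring entries by type could look like the Python sketch below. It assumes that entries have already been read into dictionaries and that the text of the type element matches the category names listed above exactly, including capitalization; the spelling used in the XML itself may differ. The function name active_entries and the sample entries are hypothetical.

```python
# Category names as listed in this guide; the exact spelling in the XML is an assumption.
KNOWN_TYPES = {"Alcohol", "Drug", "Religion", "Racial", "Swear", "Slang", "Youth"}

def active_entries(entries, ignored_types):
    """Return the entries whose type is not in ignored_types."""
    unknown = set(ignored_types) - KNOWN_TYPES
    if unknown:
        raise ValueError("unknown types: " + ", ".join(sorted(unknown)))
    return [entry for entry in entries if entry["type"] not in ignored_types]

# Placeholder entries for illustration only.
sample = [
    {"phrase": "foo", "type": "Alcohol"},
    {"phrase": "bar", "type": "Slang"},
]
```

For example, active_entries(sample, {"Alcohol"}) keeps only the Slang entry.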

The alternatives element

The alternatives element is a comma-separated list of alternative spellings, conjugations and words that might be used in place of the database entry. The list should contain no whitespace between an alternative, the comma that follows it and the next alternative.
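The format can be checked with a short Python sketch. The function name parse_alternatives is hypothetical, and rejecting stray whitespace (rather than silently trimming it) is a design choice of this example, not documented behavior of the filter.

```python
def parse_alternatives(text):
    """Split the text of an alternatives element into a list of alternatives.

    Raises ValueError when an alternative has whitespace next to a comma.
    """
    if not text:
        return []
    alternatives = []
    for item in text.split(","):
        if item != item.strip():
            raise ValueError("whitespace around alternative: %r" % item)
        if item:
            alternatives.append(item)
    return alternatives
```

Whitespace inside an alternative, as in a multi-word phrase, is left untouched; only whitespace next to a comma is rejected.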

The regex_list element

The regex_list element is a list of regular expressions that can be used to find the database entry within a block of text. Each regular expression is defined on a separate line and must not contain any whitespace before or after the regular expression.
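A line-by-line reading of this element might look like the Python sketch below. It assumes the expressions are compatible with Python's re module, which this guide does not state; the regular expression dialect the filter itself uses may differ. The function name compile_regex_list and the sample patterns are hypothetical.

```python
import re

def compile_regex_list(text):
    """Compile one regular expression per non-empty line of a regex_list element."""
    patterns = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines between expressions
        if line != line.strip():
            raise ValueError("whitespace around expression: %r" % line)
        patterns.append(re.compile(line))
    return patterns

# Placeholder patterns for illustration only.
patterns = compile_regex_list("f+o+\nb[a4]r")
```

Each compiled pattern can then be applied with its search method, for example patterns[0].search("xx fooo").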

The excludes element

The excludes element is a comma-separated list of words that should be excluded from the filtering process. In most cases these are words that contain the profanity but are not themselves profanity. For example, the word 'cockney' contains 'cock' and is not profanity. On the other hand, 'cocksucker' is considered profanity.
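The effect of an excludes list can be sketched in Python as follows. This is a simplified illustration, not the filter's matching algorithm: it splits text on whitespace only, ignores punctuation, and the function name find_hits is hypothetical.

```python
def find_hits(text, phrase, excludes_text):
    """Return the words in text that contain phrase and are not excluded.

    excludes_text is the comma-separated content of an excludes element.
    """
    excludes = {word for word in excludes_text.split(",") if word}
    hits = []
    for word in text.lower().split():
        if phrase in word and word not in excludes:
            hits.append(word)
    return hits
```

With the example above, find_hits("a cockney accent", "cock", "cockney") finds nothing, while the same call with an empty excludes list would flag 'cockney'.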

The definition element

The definition element contains a short description of the word or phrase.

The language_code element

The language_code element is the ISO 639-1 language code of the database entry. This controls part of the locale of the database entry. It defines the language in which the entry is most commonly used and considered profane or offensive.

The country_code element

The country_code element is the ISO 3166 country code of the database entry. This controls part of the locale of the database entry. It defines the country where the entry is most commonly used and considered profane or offensive.
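Because the two codes together describe the locale of an entry, an application might combine them into a single tag. The Python sketch below does this under several assumptions that this guide does not confirm: that the country code may be absent, that an underscore-separated tag such as en_US is wanted, and that the codes should be normalized to lower and upper case. The function name locale_tag is hypothetical.

```python
def locale_tag(language_code, country_code=None):
    """Combine a language code and an optional country code into a tag like 'en_US'."""
    language = language_code.strip().lower()
    # ISO 639-1 codes are two ASCII letters.
    if len(language) != 2 or not (language.isascii() and language.isalpha()):
        raise ValueError("not a two-letter language code: %r" % language_code)
    if not country_code:
        return language
    return language + "_" + country_code.strip().upper()
```

For example, locale_tag("en", "us") produces "en_US", and locale_tag("EN") produces "en".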

The is_* elements

The database contains a number of additional elements that define the grammatical use of the word within a sentence. The names of these elements all start with is_. The following elements describe the function of the entry in a sentence:

  • is_adverb
  • is_adjective
  • is_noun
  • is_verb
  • is_command
  • is_exclamation

The is_embeddable element

The is_embeddable element controls whether or not the word can be embedded inside another word. An example of embedding would be the word 'ass' inside 'assface'.
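The difference between embedded and whole-word matching can be sketched in Python with word boundaries. This is an illustration of the idea only, not the filter's matching logic, and the function name matches is hypothetical.

```python
import re

def matches(phrase, text, embeddable):
    """Return True if phrase occurs in text.

    When embeddable is False, the phrase must appear as a whole word.
    """
    if embeddable:
        pattern = re.escape(phrase)
    else:
        pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None
```

Using the example above, matches("ass", "assface", True) is a hit, while matches("ass", "assface", False) is not, because the word appears only as part of a longer word.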

The is_phrase element

The is_phrase element determines if the phrase element contains a word or a phrase. A phrase contains more than one word, separated by spaces or a dash.
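The distinction between a word and a phrase can be expressed as a small Python check, for instance to cross-check the flag when adding entries. The function name looks_like_phrase is hypothetical, and treating the hyphen character as the dash is an assumption of this sketch.

```python
import re

def looks_like_phrase(text):
    """Return True when text holds more than one word, split on spaces or dashes."""
    words = [word for word in re.split(r"[ -]+", text.strip()) if word]
    return len(words) > 1
```

Both "foo bar" and "foo-bar" count as phrases here, while "foo" is a single word.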