Spell Checking | Documentation

Selecting the Spell check option turns on Total Validator's spell checking system. By default, the language code of every word on every tested web page is detected, and the matching dictionary for that language, together with some language-specific rules, is used to spell check it.

If a word isn't found in the matching dictionary, a spelling mistake is displayed in the results. A list of suggested corrections may also be displayed.

This page explains how to get the best out of the spell checking system using the many Spell check options available. These have changed considerably since v10, so users of older releases of Total Validator may also need to check the Migration section.

Internal dictionaries

Six internal dictionaries are built into Total Validator to cover five West European languages: English, French, German, Italian and Spanish. These internal dictionaries are named after the language codes they apply to: en-US, en-GB, fr, de, it, es. They are also used to check words for any specific sub-language codes, such as fr-CA and fr-FR. English is a special case: the en-US dictionary is used for all en languages such as en-CA, except for en-GB, which will always use the en-GB dictionary. You can, however, switch to using the en-GB dictionary for all en languages (except for en-US, of course).
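
The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not Total Validator's own code: the function name pick_internal_dictionary and the prefer_british flag (standing in for the option that switches en languages to en-GB) are hypothetical.

```python
# Hypothetical sketch of the documented rule; not Total Validator's own code.
NON_ENGLISH_DICTIONARIES = {"fr", "de", "it", "es"}


def pick_internal_dictionary(lang_code, prefer_british=False):
    """Return the name of the internal dictionary for a code, or None."""
    code = lang_code.strip().lower()
    primary = code.split("-")[0]
    if primary == "en":
        if code == "en-gb":
            return "en-GB"  # en-GB always uses its own dictionary
        if code == "en-us":
            return "en-US"  # en-US is never switched to en-GB
        # Every other English code follows the configured default.
        return "en-GB" if prefer_british else "en-US"
    if primary in NON_ENGLISH_DICTIONARIES:
        return primary  # fr-CA and fr-FR both map to fr
    return None  # no internal dictionary for this language
```

Under these assumptions, fr-CA resolves to fr, en-CA resolves to en-US by default, and a code such as zh-Hans-CN resolves to nothing.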

All dictionaries used for spell checking are just plain, UTF-8 encoded text files consisting of one word per line. To save having to list every possible variation of each word for each language, rules specific to each internal dictionary are used to detect plurals, apostrophes, and other similar features, so that a single word in the dictionary may match many actual words that are found on web pages. For example, 'address' will match 'addresses'.
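
To give a feel for how one dictionary entry can match several word forms, here is a deliberately simplified Python sketch. The real rules are not documented on this page, so the suffix list below is purely an assumption for illustration, as is the function name matches_dictionary.

```python
# Illustrative only: the suffix list is an assumption, not the real rule set.
ASSUMED_ENGLISH_SUFFIXES = ("'s", "es", "s")


def matches_dictionary(word, dictionary):
    """Return True if the word, or the word minus an assumed suffix, is listed."""
    candidate = word.lower()
    if candidate in dictionary:
        return True
    for suffix in ASSUMED_ENGLISH_SUFFIXES:
        # Strip the suffix and look up what remains.
        if candidate.endswith(suffix) and candidate[: -len(suffix)] in dictionary:
            return True
    return False
```

With a dictionary containing only 'address', this sketch accepts 'addresses' and rejects 'addressing'.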

Language detection

To determine which dictionary to use for a word, the system looks at the language code in the lang attribute of the element containing that word. For example:

<p lang='en-CA'>This will be detected as Canadian English</p>

If there is no lang attribute, it looks at the parent element, and then its parent, right up to the <html> element:

<div lang='en-CA'><p><span>This will still be detected as Canadian English</span></p></div>

If there are no lang attributes, then the system looks for a <meta> tag specifying the language: <meta http-equiv='Content-Language' content='en'>. If that doesn't exist, it then looks for a Content-Language HTTP header sent by the website. Finally, it will use the Default code option, assuming it is set. If it fails to find any matching language code then words in the element will not be spell checked.
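
The lookup order described above can be summarised as a short Python sketch. The Element class is a minimal stand-in for a page element, and the names detect_language, meta_language, header_language and default_code are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Minimal stand-in for a page element (hypothetical)."""
    tag: str
    lang: Optional[str] = None
    parent: Optional["Element"] = None


def detect_language(element, meta_language=None, header_language=None,
                    default_code=None):
    """Return the first language code found, or None if there is none."""
    # 1. The element's own lang attribute, then each ancestor up to <html>.
    node = element
    while node is not None:
        if node.lang is not None:
            return node.lang  # may be blank or malformed; checked later
        node = node.parent
    # 2. The <meta> tag, 3. the HTTP header, 4. the Default code option.
    for candidate in (meta_language, header_language, default_code):
        if candidate is not None:
            return candidate
    return None  # nothing found: the words are not spell checked
```

A None result corresponds to the case where no language code is found and the words are skipped.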

Note that if the detected language code is blank (''), malformed (!rubbish!), or one for which there is no dictionary (zh-Hans-CN), then the words will not be checked. Special words such as upper-case words, words in attributes, and words with digits in, will also be ignored, unless you set the appropriate Words to check option to include them.

As mentioned above, dictionaries for the top-level language are special in that they will match any sub-code of that language, so the French fr dictionary will match fr-CA and fr-FR as well as fr. The Ignore codes option can be used to prevent specific language codes, such as fr-CA, from being checked. Better still, if you provide an External dictionary for fr-CA, this will be used in place of the top-level one for that specific code. Beware that if you provide an External dictionary for fr, it will be used for all sub-codes such as fr-CA and fr-FR.

Results

When a word is checked against a dictionary and is unrecognised, it will be marked as a spelling mistake on the page report. But if you think that the word is correct and just missing from the dictionary, you can click on it to add it to a personal dictionary in the Personal dictionaries folder. Future testing will use these personal dictionaries, so the word will no longer be marked as misspelt.

As you view each page report, any words that you have clicked on will no longer be marked, making it quicker to correct mistakes.

If there are a lot of specialist words on a website, to save clicking on every unrecognised word on every page, you can use the Save unrecognised option to save every unrecognised word to personal dictionaries. You can then check these to ensure that all the words added really are correctly spelt. Note that this option should only be used once, otherwise it will mask misspelt words in subsequent tests.

Personal dictionaries are named (in lower-case) after the language code used to check the words within them, together with a .dic suffix. For example, fr-CA words will be saved in a file called fr-ca.dic (even if the fr dictionary was used to check them). The Show codes option may be used to display the language code (and hence the file a word will be saved to) on the page reports. This may be useful if you have pages with lots of different languages on them and wish to know which personal dictionaries words will be saved to, as described in the next section.

Personal dictionaries

As described above, when you click on 'unrecognised' words on the page reports, or use the Save unrecognised option, words are added to personal dictionaries. You can also manually add words to these, or create your own. But please note that these must be plain text files consisting of one word per line, ideally with no duplicates, and must be saved using UTF-8 encoding, otherwise some of the words within them may be ignored.
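
If you maintain personal dictionaries with a script, something along the following lines keeps the files in the expected format: UTF-8, one word per line, no duplicates, with a lower-case file name. This is a minimal sketch; the function name add_to_personal_dictionary and the folder argument are hypothetical and not part of Total Validator.

```python
from pathlib import Path


def add_to_personal_dictionary(folder, lang_code, new_words):
    """Merge words into <folder>/<lower-case code>.dic and return the path."""
    path = Path(folder) / f"{lang_code.lower()}.dic"
    existing = []
    if path.exists():
        existing = path.read_text(encoding="utf-8").splitlines()
    # dict.fromkeys drops duplicates while keeping first-seen order;
    # blank lines are skipped.
    merged = list(dict.fromkeys(
        word.strip() for word in [*existing, *new_words] if word.strip()
    ))
    path.write_text("\n".join(merged) + "\n", encoding="utf-8")
    return path
```

For example, passing the code fr-CA writes to a file called fr-ca.dic in the given folder.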

When a spell check runs, the words in all of the files in the personal dictionaries folder prefixed with a matching language code, and with the suffix .dic, will be used. This means that words in fr-ca.mydictionary.dic, fr-ca.dic, and fr.dic, will be used to check fr-ca words.

If a file is found in the personal dictionaries folder which ends with the suffix .dic, but has no prefix, or uses an invalid code, then it is added to all the dictionaries. This is so you can create personal dictionaries for words that apply to all languages, such as brand names like 'Google'.
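
The file matching described in this section can be sketched as follows. The pattern used to decide what counts as a valid language code is a simplification assumed for this example, and the function name applies_to is hypothetical.

```python
import re

# Simplified shape of a language code (assumption): 2-3 letters, then
# optional subtags of 2-8 letters or digits, e.g. fr, fr-ca, zh-hans-cn.
CODE_PATTERN = re.compile(r"^[a-z]{2,3}(-[a-z0-9]{2,8})*$")


def applies_to(file_name, lang_code):
    """Return True if a personal dictionary file is used for lang_code."""
    name = file_name.lower()
    if not name.endswith(".dic"):
        return False
    # The prefix is everything before the first "." in the file name.
    prefix = name[: -len(".dic")].split(".")[0]
    if not CODE_PATTERN.match(prefix):
        return True  # no prefix or invalid code: used for all languages
    code = lang_code.lower()
    return code == prefix or code.startswith(prefix + "-")
```

Under these assumptions, fr-ca.mydictionary.dic, fr-ca.dic and fr.dic all apply to fr-ca words, de.dic does not, and a file such as brands.dic applies to every language.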

External dictionaries

You can also add your own external dictionaries for languages not built into Total Validator. This may also be used to provide a dictionary for a specific country code such as fr-CA, or even to replace the internal dictionaries. Just add a list of the paths to these external files using the External dictionaries option.

As with all dictionaries, these must be plain text files consisting of one word per line, ideally with no duplicates, and must be saved using UTF-8 encoding. Dictionary file names must also start with a valid language code prefix ending with a ".", and the whole file name must end with the suffix .dic, otherwise they will be ignored. For example, fr-ca.dic and pt-PT.mydictionary.dic are okay, but not fr-CA, nor fr-CA-dic, nor fr-CA-mydictionary.dic. Note also that any language codes which appear in the Ignore codes option will be ignored during the spell check.
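
A quick way to check file names before listing them is a small script such as the one below. It is a sketch only: the pattern for a valid language code is a simplification assumed here, and external_dictionary_code is a hypothetical helper name.

```python
import re
from pathlib import PurePath

# Simplified shape of a language code (assumption): 2-3 letters, then
# optional subtags of 2-8 letters or digits.
CODE_PATTERN = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*$")


def external_dictionary_code(path):
    """Return the lower-case language code for a file name, or None if invalid."""
    name = PurePath(path).name
    if not name.lower().endswith(".dic"):
        return None  # the whole name must end with .dic
    prefix = name.split(".")[0]  # the part before the first "."
    if not CODE_PATTERN.match(prefix):
        return None  # the name must start with a valid code
    return prefix.lower()
```

With this pattern, the examples above behave as described: fr-ca.dic and pt-PT.mydictionary.dic are accepted, while fr-CA, fr-CA-dic and fr-CA-mydictionary.dic are rejected.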

Dictionary file names are also case-insensitive, so if you list fr-ca.dic,fr-CA.dic only one of these will be used (with no guarantee which one). Also, if there is a naming conflict such as fr.mydic.dic,fr.mydic2.dic, only one of these will be used. Any dictionaries with names which match the internal ones will be used instead of them, such as fr.dic and en-GB.dic.

Just like the internal dictionaries, any external dictionary with a language code which is just the name of the language, such as pt.dic will be used to check words in any country or region-specific sub-codes such as pt-BR as well as pt, unless you supply a specific dictionary for the specific sub-code such as pt-br.dic.
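
The precedence between a sub-code dictionary and a top-level one can be sketched like this. The function name pick_external_dictionary is hypothetical, and the sketch ignores the Ignore codes option and the internal dictionaries; a result of None simply means that no external dictionary applies.

```python
def pick_external_dictionary(lang_code, external_paths):
    """Return the external dictionary path used for lang_code, or None."""
    by_code = {}
    for path in external_paths:
        file_name = path.replace("\\", "/").rsplit("/", 1)[-1].lower()
        if file_name.endswith(".dic"):
            # Names are case-insensitive; this sketch keeps the last
            # duplicate, although the real choice is not guaranteed.
            by_code[file_name.split(".")[0]] = path
    code = lang_code.lower()
    # Prefer an exact sub-code match, then fall back to the top-level language.
    for key in (code, code.split("-")[0]):
        if key in by_code:
            return by_code[key]
    return None
```

For example, with only pt.dic listed, pt-BR words are checked against pt.dic; once pt-br.dic is also listed, it takes over for pt-BR while pt.dic continues to serve pt.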

If you supply an external dictionary for a language matching one of the five West European languages, such as fr-CA, then the special language-specific rules we have, such as detecting plurals, will also be applied. For other languages you will have to list every variation of every word. Also, we have only tested the system with Western languages, so there is no guarantee it will work with languages which are significantly different (please let us know if it doesn't and we will try to fix things).

Fast creation

A quick way of creating an external dictionary is to use the Save unknown option. All the words which are marked with a valid language code, but for which no matching dictionary exists, will be saved to the Personal dictionaries folder. These are saved in files named after the matching language code with a .dic suffix. For example, pt-br.dic.

You can also do this for country-specific or regional language sub-codes for which there is a top-level dictionary, by listing the code in the Ignore codes option. For example, to create your own dictionary for fr-CA, put fr-CA in the Ignore codes option and then use the Save unknown option. The top-level fr dictionary will then ignore fr-CA words, and they will be saved to the fr-ca.dic dictionary file.

Note that for some glyph-based languages like Chinese, all the words are technically upper case, so you may need to set the UPPER CASE words option to ensure that words are saved, and then use the same option when spell checking them.

As these dictionaries are stored in the Personal dictionaries folder, it may be better to move them somewhere else where they are less likely to be overwritten. Also, you will still need to list them in the External dictionaries option for them to be used, and remember to remove the language code from the Ignore codes option, so they are no longer ignored.

Ignoring words

You can skip spell checking for any words where there is a matching dictionary, using one of three methods:

Add the matching language code to the Ignore codes option. Note that adding a top-level language code such as fr will skip all country and region-specific sub-codes as well
Mark the section with the HTML5 spellcheck="false" attribute
Mark the section using the -tv-ignore:E31 or -tv-ignore-spellcheck class attributes as described in Ignoring issues

Migrating from pre-v11 releases of Total Validator

The internal dictionaries are now stored in the dics folder, rather than the dicts folder in the application folder, so if you've changed any of these you may need to move them into the new folder and rename them to match the new names.

Similarly, personal dictionaries are now stored by default in the dics folder, rather than the dicts folder within the results folder, so you may need to move these and rename them using the new filename format for them to be used.

The command line options have also changed. Using some of the old options will throw an error and stop the test from running. This has been done so that confusing results are not produced.