This is a Markov-based (sort of) classification demonstration program.
It is released under the GNU General Public License V2.0
Written by Linus Walleij

WHAT IT IS
----------

It is a program that uses the statistical properties of written natural
languages, in computer-readable form, to do two things:

1. Determine whether a certain word is likely to be a word in a given
   language, and thus to a certain extent detect nonsense words.

2. Calculate a probability that a sufficiently long piece of text
   belongs to a certain natural language.


HOW TO INSTALL IT
-----------------

Just untar the .tar.gz file with "tar xvfz markov-classifier.tar.gz" and enter
the created directory with "cd markov-classifier". Run all examples from there.
This is not a reusable module. It is an example of what can be done with
these methods.

The base directory contains example programs and a few test files, the
directory "stats" contains frequency data for a few selected languages,
and the directory "ispell-dictionaries" contains the dictionaries that
were used to generate these frequency files.


EXAMPLES
--------
$ ./guess-language.pl testme.txt

 Hello. I am a language guesser program.
 I can sort of tell what language a text is written in.
 This looks like Italian!

$ ./classify-words.pl -l en classifyme.en

 Hello. I am a word classifying program.
 I can tell whether words are meaningful or nonsensical.
 Classifying words in file classifyme.en
 Classifying word "rhododendron"... meaningful (score: 0.142863599603693)
 Classifying word "happiness"... meaningful (score: 0.166520392998076)
 Classifying word "gfasdgafghda"... nonsense (score: 0.0738799965448918)
 Classifying word "asfdfagsdfgfd"... nonsense (score: -0.298013889001338)
 Classifying word "stoneroller"... meaningful (score: 0.117259309085975)

The other programs are used for generating the frequencies.


HOW IT WORKS
------------

The program uses Markov probabilities. First, a large text corpus is analyzed
to obtain probabilities. This program used "Ispell" dictionaries to
obtain statistical data, but in practice any text corpus typical of the
language you want to profile may be used.

The probabilities are based on two-letter combinations. For a given word:

 F O O B A R

The word is analyzed so that the number of times a certain letter follows
a given two-letter combination is stored in a table. For this example the
table will be:

 After:        Follows:
 ----------------------
 FO            O
 OO            B
 OB            A
 BA            R

This is repeated for several hundred words. Then the probability of a certain
letter following a two-letter combination is calculated (a value between 0 and 1).
In this minimal example, the probability of O following FO, A following OB, etc.
will be 1, but a larger corpus yields more realistic probabilities.
All non-existing two-letter combinations implicitly have the probability 0.
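
The table-building step described above can be sketched as follows. This is
an illustration in Python, not the Perl code shipped with this package. The
function name build_probabilities and the lower-casing of input words are
assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_probabilities(words):
    """Map each two-letter combination to {next_letter: probability}.

    Hypothetical helper for illustration; words are lower-cased first,
    which is an assumption and not something the README specifies.
    """
    counts = defaultdict(Counter)
    for word in words:
        word = word.lower()
        # Slide a three-letter window: two letters of context, one follower.
        for i in range(len(word) - 2):
            counts[word[i:i + 2]][word[i + 2]] += 1

    table = {}
    for pair, followers in counts.items():
        total = sum(followers.values())
        # Relative frequency of each follower: a value between 0 and 1.
        table[pair] = {letter: n / total for letter, n in followers.items()}
    return table

table = build_probabilities(["foobar"])
for pair, followers in table.items():
    print(pair.upper(), {k.upper(): v for k, v in followers.items()})
```

Combinations that never occur are simply absent from the table, which
corresponds to the implicit probability 0 mentioned above.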

When analyzing a word to see if it belongs to a certain language, the programs
process one word at a time, look up each pair of consecutive letters in the
table, and add the probability that the next letter follows that pair to the
"score" for the word. The score is weighted by the word length for fairness.

If a previously unknown letter occurs after a known two-letter combination, a
penalty is applied. If a previously unknown two-letter combination occurs,
an even bigger penalty is applied. The penalty is also weighted.
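
The scoring and penalty rules above can be sketched like this. It is a
Python illustration only, not the Perl implementation. The penalty values
(0.5 and 1.0), the division by word length as the weighting, and the name
score_word are all assumptions, since this README does not state them.

```python
def score_word(word, table, letter_penalty=0.5, pair_penalty=1.0):
    """Score a word against a {pair: {follower: probability}} table.

    The penalty sizes and the length weighting are assumed values chosen
    for illustration; only their ordering (unknown pair costs more than
    unknown follower) comes from the README.
    """
    word = word.lower()
    if len(word) < 3:
        return 0.0  # no two-letter context plus follower to look at

    score = 0.0
    for i in range(len(word) - 2):
        pair, follower = word[i:i + 2], word[i + 2]
        followers = table.get(pair)
        if followers is None:
            score -= pair_penalty        # unknown two-letter combination
        elif follower not in followers:
            score -= letter_penalty      # known pair, unknown follower
        else:
            score += followers[follower]
    return score / len(word)             # weight by word length

# Toy table from the FOOBAR example, not real language statistics.
table = {"fo": {"o": 1.0}, "oo": {"b": 1.0}, "ob": {"a": 1.0}, "ba": {"r": 1.0}}
for w in ("foobar", "foobaz", "xyzzy"):
    print(w, score_word(w, table))
```

With this toy table, "foobar" scores highest, "foobaz" loses points for the
unknown letter after BA, and "xyzzy" ends up negative because none of its
two-letter combinations are known.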

To determine if a word is nonsense, the total score is compared against some
limit (like 0.1), and if the score is below this limit, the word is judged to
be nonsense. This method is not foolproof but quite good!
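
As a small illustration of the limit check, the following Python sketch
applies the example limit of 0.1 to three scores taken from the EXAMPLES
section. The function name classify is hypothetical, and the actual limit
used by classify-words.pl may differ.

```python
NONSENSE_LIMIT = 0.1  # the example limit mentioned in the text

def classify(score, limit=NONSENSE_LIMIT):
    """Return "nonsense" for scores below the limit, else "meaningful"."""
    return "nonsense" if score < limit else "meaningful"

# Scores copied from the sample output in the EXAMPLES section.
samples = [
    ("rhododendron", 0.142863599603693),
    ("gfasdgafghda", 0.0738799965448918),
    ("asfdfagsdfgfd", -0.298013889001338),
]
for word, score in samples:
    print(f'Classifying word "{word}"... {classify(score)} (score: {score})')
```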

To determine what natural language a piece of text belongs to, the scores for
all words are added up. This is repeated for all language statistics, and the
language with the highest score "wins".
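
The language-guessing step can be sketched as below. This is a Python
illustration, not the code of guess-language.pl. The tiny tables are toy
data invented for the example (not the contents of the "stats" directory),
and the penalty values, the length weighting, and the function names are
assumptions.

```python
import re

def score_word(word, table, letter_penalty=0.5, pair_penalty=1.0):
    """Length-weighted score of one word; penalty sizes are assumed."""
    if len(word) < 3:
        return 0.0
    score = 0.0
    for i in range(len(word) - 2):
        followers = table.get(word[i:i + 2])
        if followers is None:
            score -= pair_penalty
        elif word[i + 2] not in followers:
            score -= letter_penalty
        else:
            score += followers[word[i + 2]]
    return score / len(word)

def guess_language(text, tables):
    """Sum word scores per language and return (best_language, totals)."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    totals = {
        lang: sum(score_word(w, table) for w in words)
        for lang, table in tables.items()
    }
    return max(totals, key=totals.get), totals

# Toy tables for illustration only; real tables come from a large corpus.
tables = {
    "en": {"th": {"e": 1.0}, "he": {"r": 1.0}, "er": {"e": 1.0}},
    "it": {"pe": {"r": 1.0}, "er": {"c": 1.0}, "rc": {"h": 1.0}, "ch": {"e": 1.0}},
}
best, totals = guess_language("There, the", tables)
print(best, totals)
```

Because every word is scored against every language table, the running time
grows with the number of languages, which is acceptable for a handful of
frequency files.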
