ANNOUNCEMENT: Text::Statistics::Latin 0.04

on 09.07.2007 21:52:09 by Rodrigo Panchiniak Fernandes

Text::Statistics::Latin 0.04 has been released.

Description:

Text::Statistics::Latin creates a seven-column CSV output file with
one line per token per text. Its input is a UTF-8 Latin-encoded corpus
whose file names follow the sequence
'1 (1).txt', '1 (2).txt', ..., '1 (n).txt', that is, the pattern
1 \(([1-9]|[1-9][0-9]+)\)\.txt
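
As an illustration of the naming rule above, here is a minimal sketch that checks candidate file names against the quoted pattern. The helper name corpus_file_id is hypothetical and not part of the module; anchoring the pattern with ^ and $ is also an assumption made for the example.

```perl
use strict;
use warnings;

# Pattern quoted in the announcement, anchored so the whole name must match.
my $corpus_name = qr/^1 \(([1-9]|[1-9][0-9]+)\)\.txt$/;

# Hypothetical helper (not part of the module): returns the document
# id n for a matching name, or undef for any other name.
sub corpus_file_id {
    my ($name) = @_;
    return $name =~ $corpus_name ? $1 : undef;
}

for my $name ('1 (1).txt', '1 (12).txt', '1 (0).txt', '2 (3).txt') {
    my $id = corpus_file_id($name);
    printf "%-12s %s\n", $name,
        defined $id ? "document $id" : "not part of the corpus";
}
```

The first two names are accepted as documents 1 and 12; the last two are rejected because the id may not start with 0 and the prefix must be "1 ".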
The columns store statistical information:
(1) number of word forms in document d;
(2) number of tokens in d;
(3) ID number of d, i.e., n;
(4) frequency of term t in d;
(5) corpus frequency of t;
(6) document frequency of t (number of documents where t occurs at
least once);
(7) t, the UTF-8 Latin-encoded token string.
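
To make the seven columns concrete, here is a rough sketch that computes them for a tiny in-memory corpus. It is not the module's implementation: the toy texts, the whitespace tokenisation, the comma separator, the helper name rows, and the reading of "one line per token per text" as one row per distinct token per document are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Toy corpus keyed by document id n; the texts are made up for illustration.
my %corpus = (
    1 => 'arma virumque cano',
    2 => 'cano arma arma',
);

my (%tf, %cf, %df, %tokens_in);
for my $d (keys %corpus) {
    # Assumption: tokens are whitespace-separated strings.
    for my $t (split ' ', $corpus{$d}) {
        $tf{$d}{$t}++;      # (4) frequency of t in d
        $cf{$t}++;          # (5) corpus frequency of t
        $tokens_in{$d}++;   # (2) number of tokens in d
    }
    $df{$_}++ for keys %{ $tf{$d} };   # (6) document frequency of t
}

# Hypothetical helper: one row per distinct token per text,
# with the fields in the column order listed above.
sub rows {
    my @rows;
    for my $d (sort { $a <=> $b } keys %tf) {
        my $forms = keys %{ $tf{$d} };   # (1) number of word forms in d
        for my $t (sort keys %{ $tf{$d} }) {
            push @rows, join ',', $forms, $tokens_in{$d}, $d,
                $tf{$d}{$t}, $cf{$t}, $df{$t}, $t;
        }
    }
    return @rows;
}

print "$_\n" for rows();
```

For this toy corpus the row for "arma" in document 2 reads 2,3,2,2,3,2,arma: two word forms and three tokens in document 2, "arma" occurring twice there, three times in the corpus, and in both documents.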

The main output file is named '1 (n + 5).txt' and is stored in the same
directory as the corpus itself, together with residual files for each
input file, which carry the ad hoc extensions .txu and .txv.

Example:
use Text::Statistics::Latin;
&LATIN("4"); # 3 (4-1) texts will be analysed.

Notes:

(1) 1 \(([1-9]|[1-9][0-9]+)\)\.txt is the pattern Windows Explorer
uses when renaming sets of files.
(2) This module can be used for testing information retrieval
weighting functions or text indexing.
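
As one example of the weighting functions mentioned in note (2), the sketch below derives a plain tf-idf weight, tf * log(N / df), from columns (4) and (6) of a row. The helper name tf_idf, the comma separator, the sample rows, and the choice of this particular tf-idf variant are assumptions; the module itself does not compute weights.

```perl
use strict;
use warnings;

# Hypothetical helper: tf-idf weight from columns (4) and (6) of one row,
# given the number of documents N in the corpus.
sub tf_idf {
    my ($row, $n_docs) = @_;
    # Columns: (1) forms, (2) tokens, (3) id, (4) tf, (5) cf, (6) df, (7) t
    my (undef, undef, undef, $tf, undef, $df) = split /,/, $row;
    return $tf * log($n_docs / $df);   # natural logarithm
}

# Made-up rows in the seven-column order; the separator is an assumption.
for my $row ('3,3,1,1,1,1,virumque', '2,3,2,2,3,2,arma') {
    printf "%s => %.4f\n", $row, tf_idf($row, 2);
}
```

A token that occurs in every document gets weight 0 under this variant, which is why smoothed forms of idf are often preferred in practice.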

Research supported by CAPES BEX-09323-5