Class LangProfile

java.lang.Object
com.optimaize.langdetect.cybozu.util.LangProfile
All Implemented Interfaces:
Serializable

@Deprecated public class LangProfile extends Object implements Serializable
Deprecated.
Replaced by LanguageProfile.
LangProfile is a language profile class. Users do not use this class directly. TODO split into builder and immutable class. TODO currently this only makes n-grams with the space before a word included, not n-grams with the space after the word. Example: "foo" creates " fo" as a 3-gram, but not "oo ". Either this is a bug or, if intended, it needs documentation.
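The leading-space behaviour noted in the TODO can be illustrated with a minimal sketch. It assumes, as the note describes, that one space is prepended to the word and none is appended. `NGramSketch` and `grams` are hypothetical names used only for illustration; they are not part of this library, and the real n-gram extraction may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of langdetect: lists the n-grams of length n
// for one word when only a leading space is added.
class NGramSketch {
    static List<String> grams(String word, int n) {
        String padded = " " + word;   // space before the word, none after
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            result.add(padded.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        // The 3-grams of "foo" include " fo" but never "oo ".
        System.out.println(grams("foo", 3));
    }
}
```

Under this assumption a trailing-space n-gram such as "oo " can never appear, because the padded string ends with the last letter of the word.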
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    private Map<String,Integer>
    freq
    Deprecated.
    Key = n-gram, value = count.
    private static final int
    LESS_FREQ_RATIO
    Deprecated.
    Explanation by example: If the most frequent n-gram occurs 1 million times, then 1'000'000 / this (100'000) = 10.
    private static final int
    MINIMUM_FREQ
    Deprecated.
    n-grams that occur fewer times than this can be removed using omitLessFreq().
    private String
    name
    Deprecated.
    The language name (identifier).
    private int[]
    nWords
    Deprecated.
    Tells how many occurrences of n-grams exist per gram length.
    private static final long
    serialVersionUID
    Deprecated.
     
  • Constructor Summary

    Constructors
    Constructor
    Description
    LangProfile()
    Deprecated.
    Constructor for JSONIC
    LangProfile(String name)
    Deprecated.
    Normal Constructor
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    add(@NotNull String gram)
    Deprecated.
    Add n-gram to profile
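    Map<String,Integer>
    getFreq()
    Deprecated.
     
    String
    getName()
    Deprecated.
     
    int[]
    getNWords()
    Deprecated.
     
    void
    omitLessFreq()
    Deprecated.
    Removes n-grams that occur fewer times than MINIMUM_FREQ to get rid of rare n-grams.
    void
    setFreq(Map<String,Integer> freq)
    Deprecated.
     
    void
    setName(String name)
    Deprecated.
     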
    void
    setNWords(int[] nWords)
    Deprecated.
     

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • serialVersionUID

      private static final long serialVersionUID
      Deprecated.
    • MINIMUM_FREQ

      private static final int MINIMUM_FREQ
      Deprecated.
      n-grams that occur fewer times than this can be removed using omitLessFreq(). The effective value can change; see LESS_FREQ_RATIO.
    • LESS_FREQ_RATIO

      private static final int LESS_FREQ_RATIO
      Deprecated.
      Explanation by example: If the most frequent n-gram occurs 1 million times, then 1'000'000 / this (100'000) = 10. Since 10 is larger than MINIMUM_FREQ (2), MINIMUM_FREQ remains at 2, and all n-grams that occur fewer than 2 times can be removed as noise using omitLessFreq(). If the most frequent n-gram occurs 5'000 times, then 5'000 / this (100'000) = 0.05. Since 0.05 is smaller than MINIMUM_FREQ (2), the effective minimum becomes 0, and omitLessFreq() removes no n-grams for insignificance.
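The arithmetic in this explanation can be written out as a small sketch. It follows the description above literally and is not taken from the class's source, so the real omitLessFreq() may compute its threshold differently. `ThresholdSketch` and `effectiveMinimum` are hypothetical names; the constant values 100'000 and 2 are the ones quoted in the description.

```java
// Hypothetical sketch of the arithmetic described above. The constants mirror
// the values quoted in the text; they are not read from LangProfile itself.
class ThresholdSketch {
    static final int LESS_FREQ_RATIO = 100_000;
    static final int MINIMUM_FREQ = 2;

    // Effective minimum count as the text describes it: the scaled value
    // replaces MINIMUM_FREQ only when it is the smaller of the two.
    static int effectiveMinimum(int mostFrequentCount) {
        int scaled = mostFrequentCount / LESS_FREQ_RATIO; // integer division truncates 0.05 to 0
        return Math.min(scaled, MINIMUM_FREQ);
    }

    public static void main(String[] args) {
        System.out.println(effectiveMinimum(1_000_000)); // scaled = 10, minimum stays 2
        System.out.println(effectiveMinimum(5_000));     // scaled = 0, minimum becomes 0
    }
}
```

Note that Java integer division yields 0 rather than 0.05 in the second case, which matches the stated outcome that no n-grams are removed.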
    • name

      private String name
      Deprecated.
      The language name (identifier).
    • freq

      private Map<String,Integer> freq
      Deprecated.
      Key = n-gram, value = count. All n-grams are stored here (1-grams, 2-grams, 3-grams).
    • nWords

      private int[] nWords
      Deprecated.
      Tells how many occurrences of n-grams exist per gram length. When 1-grams, 2-grams and 3-grams are made (as is currently the case), this contains 3 entries: element 0 = number of occurrences of 1-grams; element 1 = number of occurrences of 2-grams; element 2 = number of occurrences of 3-grams. Example: if there are 57 distinct 1-grams (the English language has about that many) and the training text is fairly long, then the occurrence count is in the millions.
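The relationship between the frequency map and the per-length counter can be sketched as follows. `NWordsSketch` is a hypothetical stand-in and not the real LangProfile; it assumes that each added n-gram increments both its own count and the counter for its length.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, not the real LangProfile: keeps a per-length
// occurrence counter next to the n-gram frequency map.
class NWordsSketch {
    final Map<String, Integer> freq = new HashMap<>();
    final int[] nWords = new int[3]; // index 0 = 1-grams, 1 = 2-grams, 2 = 3-grams

    void add(String gram) {
        int len = gram.length();
        if (len < 1 || len > nWords.length) return; // ignore unsupported lengths
        nWords[len - 1]++;                 // one more occurrence of this gram length
        freq.merge(gram, 1, Integer::sum); // one more occurrence of this n-gram
    }

    public static void main(String[] args) {
        NWordsSketch p = new NWordsSketch();
        for (String g : new String[] {"a", "b", "a", "ab", "ab", "abc"}) p.add(g);
        System.out.println(Arrays.toString(p.nWords)); // [3, 2, 1]
        System.out.println(p.freq.get("a"));           // 2
    }
}
```

Here the map holds four distinct n-grams while the counter holds six occurrences in total, which is the distinction the example in the description draws between distinct 1-grams and their occurrence count.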
  • Constructor Details

    • LangProfile

      public LangProfile()
      Deprecated.
      Constructor for JSONIC
    • LangProfile

      public LangProfile(String name)
      Deprecated.
      Normal Constructor
      Parameters:
      name - language name
  • Method Details

    • add

      public void add(@NotNull String gram)
      Deprecated.
      Adds an n-gram to the profile.
      Parameters:
      gram - the n-gram to add
    • omitLessFreq

      public void omitLessFreq()
      Deprecated.
      Removes n-grams that occur fewer times than MINIMUM_FREQ to get rid of rare n-grams. Also removes ASCII n-grams if the total number of ASCII n-grams is less than one third of the total. This is done because non-Latin text (such as Chinese) often has some Latin noise in between. TODO split the 2 cleaning steps into separate methods. TODO distinguish ASCII/Latin: currently it looks for Latin only; it should include characters with diacritics, e.g. Vietnamese. TODO the current code counts ASCII but removes any Latin. Is that desired? If so, this needs documentation.
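The two cleaning steps described above can be sketched as separate methods, as the first TODO suggests. This is an illustration under stated assumptions, not the code of omitLessFreq(): `CleanSketch` and its methods are hypothetical names, "ASCII n-gram" is taken to mean an n-gram containing at least one ASCII letter, and the one-third test is applied to occurrence counts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two cleaning steps, written as separate methods.
// It is not the implementation of omitLessFreq().
class CleanSketch {
    // Step 1: drop n-grams whose count is below minFreq.
    static void removeRare(Map<String, Integer> freq, int minFreq) {
        freq.values().removeIf(count -> count < minFreq);
    }

    // Step 2: if n-grams containing ASCII letters account for less than one
    // third of all occurrences, treat them as noise and drop them.
    static void removeAsciiNoise(Map<String, Integer> freq) {
        long total = 0;
        long ascii = 0;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            total += e.getValue();
            if (hasAsciiLetter(e.getKey())) ascii += e.getValue();
        }
        if (ascii * 3 < total) {
            freq.keySet().removeIf(CleanSketch::hasAsciiLetter);
        }
    }

    static boolean hasAsciiLetter(String gram) {
        for (int i = 0; i < gram.length(); i++) {
            char c = gram.charAt(i);
            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("\u4e2d", 10); // Chinese character
        freq.put("\u6587", 8);  // Chinese character
        freq.put("\u5b57", 1);  // rare, removed in step 1
        freq.put("a", 3);       // Latin noise, removed in step 2
        removeRare(freq, 2);
        removeAsciiNoise(freq);
        System.out.println(freq.size()); // 2
    }
}
```

Keeping the steps separate makes the ASCII versus Latin question raised in the TODOs easier to address, since only `hasAsciiLetter` would need to change.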
    • getName

      public String getName()
      Deprecated.
    • setName

      public void setName(String name)
      Deprecated.
    • getFreq

      public Map<String,Integer> getFreq()
      Deprecated.
    • setFreq

      public void setFreq(Map<String,Integer> freq)
      Deprecated.
    • getNWords

      public int[] getNWords()
      Deprecated.
    • setNWords

      public void setNWords(int[] nWords)
      Deprecated.