[CMU Sphinx] How to Generate a Language Model (LM) File

Typical Usage

[Figure: Simplified toolkit framework]

Given a large corpus of text in a file a.text, but no specified vocabulary:

  • Compute the word unigram counts 

    cat a.text | text2wfreq > a.wfreq
  • Convert the word unigram counts into a vocabulary consisting of the 20,000 most common words 

    cat a.wfreq | wfreq2vocab -top 20000 > a.vocab
  • Generate a binary id 3-gram of the training text, based on this vocabulary

    cat a.text | text2idngram -vocab a.vocab > a.idngram
  • Convert the idngram into a binary format language model 

    idngram2lm -idngram a.idngram -vocab a.vocab -binary a.binlm
  • Compute the perplexity of the language model, with respect to some test text b.text

    evallm -binary a.binlm
    Reading in language model from file a.binlm
    Done.
    evallm : perplexity -text b.text 
    Computing perplexity of the language model with respect 
    to the text b.text 
    Perplexity = 128.15, Entropy = 7.00 bits 
    Computation based on 8842804 words. 
    Number of 3-grams hit = 6806674 (76.97%) 
    Number of 2-grams hit = 1766798 (19.98%) 
    Number of 1-grams hit = 269332 (3.05%) 
    1218322 OOVs (12.11%) and 576763 context cues were removed from the calculation. 
    evallm : quit
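
As a quick sanity check on the transcript above: perplexity is 2 raised to the entropy in bits, and the three n-gram hit counts should add up to the reported word total. The sketch below redoes that arithmetic with awk. The variable names are illustrative only, and the figures are copied from the sample output. Because the entropy is printed rounded to 7.00 bits, 2^7.00 comes out as 128 rather than exactly 128.15.

```shell
# Figures copied from the evallm transcript above.
entropy_bits=7.00
hits_3gram=6806674
hits_2gram=1766798
hits_1gram=269332

# Perplexity is 2 raised to the entropy (in bits).
ppl_from_entropy=$(LC_ALL=C awk -v h="$entropy_bits" 'BEGIN { printf "%.0f", 2 ^ h }')

# The hit counts should sum to the "Computation based on ... words" figure.
total_words=$((hits_3gram + hits_2gram + hits_1gram))

# Share of words predicted with a full 3-gram, as a percentage.
pct_3gram=$(LC_ALL=C awk -v n="$hits_3gram" -v t="$total_words" \
    'BEGIN { printf "%.2f", 100 * n / t }')

echo "2^$entropy_bits = $ppl_from_entropy"
echo "total words = $total_words, 3-gram hit rate = $pct_3gram%"
```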

Alternatively, some of these processes can be piped together:

cat a.text | text2wfreq | wfreq2vocab -top 20000 > a.vocab
cat a.text | text2idngram -vocab a.vocab | \
   idngram2lm -vocab a.vocab -idngram - \
   -binary a.binlm -spec_num 5000000 15000000
echo "perplexity -text b.text" | evallm -binary a.binlm 
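
To get a feel for what the first piped command produces without the toolkit installed, the following is a rough stand-in built from standard Unix tools. It is not the toolkit's implementation: the function name top_vocab is hypothetical, tokens are assumed to be whitespace-separated, ties are broken alphabetically, and the final alphabetical sort is an assumption about the vocabulary file's ordering.

```shell
# Rough stand-in for: text2wfreq | wfreq2vocab -top N
# Assumption: words are whitespace-separated tokens.
top_vocab() {
    # $1 = number of most frequent words to keep
    tr -s '[:space:]' '\n' |              # one token per line
        grep -v '^$' |                    # drop empty lines
        LC_ALL=C sort |                   # group identical tokens
        uniq -c |                         # count each token
        LC_ALL=C sort -k1,1nr -k2,2 |     # most frequent first, ties by word
        head -n "$1" |                    # keep the top N
        awk '{ print $2 }' |              # drop the counts
        LC_ALL=C sort                     # alphabetical order (assumed)
}

# Tiny sample corpus: "the" occurs 3 times, "cat" twice, the rest once.
sample_vocab=$(printf 'the cat sat on the mat\nthe cat ran\n' | top_vocab 2)
printf '%s\n' "$sample_vocab"
```

With the toolkit available, the real commands above should be preferred, since they also handle details such as context cues that this sketch ignores.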

==============================================================================================================

To run these commands on Windows, use the type command instead of cat.
The table below lists basic Linux commands and their Windows command prompt (DOS) equivalents.

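Task                          Linux            /    Windows (DOS)
List files                    ls               /    dir
Create a directory            mkdir            /    mkdir , md
Remove a directory            rmdir            /    rmdir , rd
Show the directory tree       ls -R            /    tree
Delete a file                 rm               /    del , erase
Copy a file                   cp               /    copy
Move a file                   mv               /    move
Rename a file                 mv               /    rename
Change directory              cd               /    cd
Show the current directory    pwd              /    cd
Clear the screen              clear            /    cls
Command interpreter           sh, csh, bash    /    command.com , cmd.exe
Show file contents            cat              /    type
Help, manual                  man              /    help
Exit the shell / DOS window   exit             /    exit
Show the time                 date             /    time
Print text as-is              echo             /    echo
Show environment variables    set, env         /    set
Show the search path          echo $PATH       /    path
Version information           uname -a         /    ver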
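
One way to sidestep the cat/type difference noted above is input redirection with <, which both Unix shells and the Windows command prompt (cmd.exe) accept. The sketch below shows that the two forms feed a program the same input. Here wc -w is only a stand-in for a toolkit program such as text2wfreq, and the sample file is made up for illustration.

```shell
# wc -w stands in for a toolkit program (assumption for illustration only).
sample_file=$(mktemp)
printf 'one two three\nfour five\n' > "$sample_file"

# Unix-only form: cat piped into the program.
piped=$(cat "$sample_file" | wc -w | tr -d ' ')

# Redirection form: the same syntax works in sh and in cmd.exe.
redirected=$(wc -w < "$sample_file" | tr -d ' ')

echo "piped=$piped redirected=$redirected"
rm -f "$sample_file"
```

Applied to the first step of the walkthrough, this would read text2wfreq < a.text > a.wfreq on either platform.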