ANYKS Spell-checker (ASC)

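Hello! This is my third article on Habr. Earlier I wrote about the ALM language model; now I want to introduce ASC, a typo correction system built on top of ALM.

Yes, there are plenty of typo correction systems, each with its own strengths and weaknesses. Among the open ones I would single out JamSpell as one of the most promising, and that is the one we will compare against. DeepPavlov has a similar system that many readers will think of, but I never managed to get along with it.

Features:

Correction of errors in words within a Levenshtein distance of up to 4.
Correction of typos in words (insertion, deletion, substitution, transposition).
Context-aware yo-fication (restoring the letter «ё» where it was typed as «е»).
Context-aware capitalization of the first letter of a word (proper names and titles).
Context-aware splitting of words that were merged together into separate words.
Text analysis without correcting the original text.
Searching the text for errors, typos and wrong context.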
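
The four typo types listed above map onto an edit distance with transpositions. The following is a minimal sketch of the "optimal string alignment" variant, in which swapping two adjacent letters costs one edit; plain Levenshtein distance would count it as two. The function names and the default limit of 4 are illustrative assumptions, and this is not how ASC itself is implemented.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertion, deletion,
    substitution and transposition of adjacent letters cost 1 each."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i letters of a
    for j in range(cols):
        d[0][j] = j  # insert all j letters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent letters swapped
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def within_limit(word: str, candidate: str, limit: int = 4) -> bool:
    """True if the candidate is close enough to be considered a correction."""
    return edit_distance(word, candidate) <= limit


print(edit_distance("hlelo", "hello"))    # one transposition
print(within_limit("speling", "spelling"))
```

A real corrector would not compare the word against every dictionary entry like this; the distance only decides which candidates are admissible, and the language model then ranks them by context.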

Supported operating systems:

MacOS X
FreeBSD
Linux

The system is written in C++11 and has a port for Python 3.

Ready-made dictionaries

Name	Size (GB)	RAM (GB)	N-gram size	Language
wittenbell-3-big.asc	1.97	15.6	3	RU
wittenbell-3-middle.asc	1.24	9.7	3	RU
mkneserney-3-middle.asc	1.33	9.7	3	RU
wittenbell-3-single.asc	0.772	5.14	3	RU
wittenbell-5-single.asc	1.37	10.7	5	RU

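Testing

The system was tested on data from the 2016 Dialog21 "typo correction" competition. The trained binary dictionary used for the test was wittenbell-3-middle.asc.

Test	Precision	Recall	F-measure
Typo correction mode	76.97	62.71	69.11
Error correction mode	73.72	60.53	66.48

I don't think any further data is needed; anyone who wishes can repeat the test, and all the materials used are listed below.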
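
The third column of the table is consistent with the balanced F-measure, the harmonic mean of precision and recall. A short sketch that recomputes it from the two other columns, assuming that this is indeed how the evaluation script defines it (the function name is illustrative):

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the two rows of the table above
print(round(f_measure(76.97, 62.71), 2))  # 69.11
print(round(f_measure(73.72, 60.53), 2))  # 66.48
```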

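Materials used in testing

test.txt: the text used for testing
correct.txt: the text with the correct variants
evaluate.py: a Python 3 script that computes the correction results

Now it is interesting to compare how typo correction systems behave under equal conditions: we will train two different correctors on the same text data and run the same test on both.

For the comparison, let's take JamSpell, the typo correction system I mentioned above.

ASC vs JamSpell

Installation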

ASC

$ git clone --recursive https://github.com/anyks/asc.git
$ cd ./asc

$ mkdir ./build
$ cd ./build

$ cmake ..
$ make

JamSpell

$ git clone https://github.com/bakwc/JamSpell.git
$ cd ./JamSpell

$ mkdir ./build
$ cd ./build

$ cmake ..
$ make

Training

ASC

train.json

{
  "ext": "txt",
  "size": 3,
  "alter": {"е":"ё"},
  "debug": 1,
  "threads": 0,
  "method": "train",
  "allow-unk": true,
  "reset-unk": true,
  "confidence": true,
  "interpolate": true,
  "mixed-dicts": true,
  "only-token-words": true,
  "locale": "en_US.UTF-8",
  "smoothing": "wittenbell",
  "pilots": ["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"],
  "corpus": "./texts/correct.txt",
  "w-bin": "./dictionary/3-middle.asc",
  "w-vocab": "./train/lm.vocab",
  "w-arpa": "./train/lm.arpa",
  "mix-restwords": "./similars/letters.txt",
  "alphabet": "абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz",
  "bin-code": "ru",
  "bin-name": "Russian",
  "bin-author": "You name",
  "bin-copyright": "You company LLC",
  "bin-contacts": "site: https://example.com, e-mail: info@example.com",
  "bin-lictype": "MIT",
  "bin-lictext": "... License text ...",
  "embedding-size": 28,
  "embedding": {
      "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
      "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
      "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
      "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
      "ч": 21, "щ": 21, "ш": 21, "ь": 22, "ы": 23, "ъ": 22,
      "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
      "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
      "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
      "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
      "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
      "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
      "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
      "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
      "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
  }
}

$ ./asc -r-json ./train.json

Python3

import asc

asc.setSize(3)
asc.setAlmV2()
asc.setThreads(0)
asc.setLocale("en_US.UTF-8")

asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.allowUnk)
asc.setOption(asc.options_t.resetUnk)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.tokenWords)
asc.setOption(asc.options_t.confidence)
asc.setOption(asc.options_t.interpolate)

asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")

asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})

def statusArpa1(status):
    print("Build arpa", status)

def statusArpa2(status):
    print("Write arpa", status)

def statusVocab(status):
    print("Write vocab", status)

def statusIndex(text, status):
    print(text, status)

def status(text, status):
    print(text, status)

asc.collectCorpus("./texts/correct.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)

asc.buildArpa(statusArpa1)

asc.writeArpa("./train/lm.arpa", statusArpa2)

asc.writeVocab("./train/lm.vocab", statusVocab)

asc.setCode("RU")
asc.setLictype("MIT")
asc.setName("Russian")
asc.setAuthor("You name")
asc.setCopyright("You company LLC")
asc.setLictext("... License text ...")
asc.setContacts("site: https://example.com, e-mail: info@example.com")

asc.setEmbedding({
     "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
     "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
     "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
     "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
     "ч": 21, "щ": 21, "ш": 21, "ь": 22, "ы": 23, "ъ": 22,
     "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
     "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
     "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
     "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
     "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
     "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
     "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
     "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
     "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
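
The mixed-dicts option and the setSubstitutes call deal with words in which a letter was typed in the wrong alphabet but looks the same on screen. Below is a minimal sketch of that idea, assuming a Latin-to-Cyrillic table of look-alike letters. The helpers is_cyrillic and fix_mixed_word are hypothetical and not part of the asc module, and ASC's own handling may differ.

```python
# Latin letters and the Cyrillic letters they are easily confused with
SUBSTITUTES = {
    'p': 'р', 'c': 'с', 'o': 'о', 't': 'т', 'k': 'к', 'e': 'е',
    'a': 'а', 'h': 'н', 'x': 'х', 'b': 'в', 'm': 'м',
}


def is_cyrillic(ch: str) -> bool:
    """True for lowercase Russian letters (U+0430..U+044F plus U+0451)."""
    return '\u0430' <= ch <= '\u044f' or ch == '\u0451'


def fix_mixed_word(word: str) -> str:
    """Replace Latin look-alikes, but only inside words that already
    contain Cyrillic letters; purely Latin words are left untouched."""
    if not any(is_cyrillic(ch) for ch in word.lower()):
        return word
    fixed = []
    for ch in word:
        repl = SUBSTITUTES.get(ch.lower())
        if repl is None:
            fixed.append(ch)
        else:
            # Keep the case of the original letter
            fixed.append(repl.upper() if ch.isupper() else repl)
    return ''.join(fixed)


# The word below is written with Latin "o" between Cyrillic consonants
mixed = "\u043co\u043bo\u043ao"
print(fix_mixed_word(mixed) == "\u043c\u043e\u043b\u043e\u043a\u043e")
print(fix_mixed_word("cat"))
```

Restricting the replacement to words that already contain Cyrillic letters is a design choice of this sketch: it keeps genuine English words intact in mixed-language text.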

JamSpell

$ ./main/jamspell train ../test_data/alphabet_ru.txt ../test_data/correct.txt ./model.bin

Testing

ASC

spell.json

{
    "debug": 1,
    "threads": 0,
    "method": "spell",
    "spell-verbose": true,
    "confidence": true,
    "mixed-dicts": true,
    "asc-split": true,
    "asc-alter": true,
    "asc-esplit": true,
    "asc-rsplit": true,
    "asc-uppers": true,
    "asc-hyphen": true,
    "asc-wordrep": true,
    "r-text": "./texts/test.txt",
    "w-text": "./texts/output.txt",
    "r-bin": "./dictionary/3-middle.asc"
}

$ ./asc -r-json ./spell.json

Python3

import asc

asc.setAlmV2()
asc.setThreads(0)

asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.ascSplit)
asc.setOption(asc.options_t.ascAlter)
asc.setOption(asc.options_t.ascESplit)
asc.setOption(asc.options_t.ascRSplit)
asc.setOption(asc.options_t.ascUppers)
asc.setOption(asc.options_t.ascHyphen)
asc.setOption(asc.options_t.ascWordRep)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.confidence)

def status(text, status):
    print(text, status)

asc.loadIndex("./dictionary/3-middle.asc", "", status)

f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')

for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])

f2.close()
f1.close()

JamSpell

For JamSpell the test is run from a small C++ program instead of Python:

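#include <fstream>
#include <iostream>
#include <locale>
#include <string>
#include <jamspell/spell_corrector.hpp>

// If BOOST is used for the conversion
#ifdef USE_BOOST_CONVERT
	#include <boost/locale/encoding_utf.hpp>
// Otherwise use the standard library
#else
	#include <codecvt>
#endif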

using namespace std;

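/**
 * convert Converts a wide string to a UTF-8 string
 * @param  str wide string to convert
 * @return     UTF-8 encoded string
 */
const string convert(const wstring & str){
	// Conversion result
	string result = "";
	// If a string was passed
	if(!str.empty()){
// If BOOST is used
#ifdef USE_BOOST_CONVERT
		// Bring in the converter
		using boost::locale::conv::utf_to_utf;
		// Convert the string to UTF-8
		result = utf_to_utf <char> (str.c_str(), str.c_str() + str.size());
// Otherwise use the standard library
#else
		// Conversion type for UTF-8
		using convert_type = codecvt_utf8 <wchar_t, 0x10ffff, little_endian>;
		// Create the converter
		wstring_convert <convert_type, wchar_t> conv;
		// wstring_convert <codecvt_utf8 <wchar_t>> conv;
		// Convert the string to UTF-8
		result = conv.to_bytes(str);
#endif
	}
	// Return the result
	return result;
}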

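/**
 * convert Converts a UTF-8 string to a wide string
 * @param  str UTF-8 encoded string to convert
 * @return     wide string
 */
const wstring convert(const string & str){
	// Conversion result
	wstring result = L"";
	// If a string was passed
	if(!str.empty()){
// If BOOST is used
#ifdef USE_BOOST_CONVERT
		// Bring in the converter
		using boost::locale::conv::utf_to_utf;
		// Convert the string from UTF-8
		result = utf_to_utf <wchar_t> (str.c_str(), str.c_str() + str.size());
// Otherwise use the standard library
#else
		// Create the converter
		// wstring_convert <codecvt_utf8 <wchar_t>> conv;
		wstring_convert <codecvt_utf8_utf16 <wchar_t, 0x10ffff, little_endian>> conv;
		// Convert the string from UTF-8
		result = conv.from_bytes(str);
#endif
	}
	// Return the result
	return result;
}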

/**
 * safeGetline Reads one line from a stream, accepting \n, \r\n and \r line endings
 * @param  is input stream
 * @param  t  string that receives the line
 * @return    the input stream
 */
istream & safeGetline(istream & is, string & t){
	// Clear the output string
	t.clear();

	istream::sentry se(is, true);
	streambuf * sb = is.rdbuf();

	for(;;){
		int c = sb->sbumpc();
		switch(c){
			case '\n': return is;
			case '\r':
				if(sb->sgetc() == '\n') sb->sbumpc();
				return is;
			case streambuf::traits_type::eof():
				if(t.empty()) is.setstate(ios::eofbit);
				return is;
			default: t += (char) c;
		}
	}
}

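/**
* main Entry point of the application
*/
int main(){
	// Create the corrector
	NJamSpell::TSpellCorrector corrector;
	// Load the trained model
	corrector.LoadLangModel("model.bin");
	// Open the file with the test text
	ifstream file1("./test_data/test.txt", ios::in);
	// If the input file is open
	if(file1.is_open()){
		// Current line and its corrected version
		string line = "", res = "";
		// Open the file for the results
		ofstream file2("./test_data/output.txt", ios::out);
		// If the output file is open
		if(file2.is_open()){
			// Read the input file line by line
			while(file1.good()){
				// Read the next line
				safeGetline(file1, line);
				// If the line is not empty, correct it
				if(!line.empty()){
					// Correct the line
					res = convert(corrector.FixFragment(convert(line)));
					// If a result was obtained, write it to the file
					if(!res.empty()){
						// Append a line break
						res.append("\n");
						// Write the result to the file
						file2.write(res.c_str(), res.size());
					}
				}
			}
			// Close the output file
			file2.close();
		}
		// Close the input file
		file1.close();
	}
    return 0;
}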

$ g++ -std=c++11 -I../JamSpell -L./build/jamspell -L./build/contrib/cityhash -L./build/contrib/phf -ljamspell_lib -lcityhash -lphf ./test.cpp -o ./bin/test

$ ./bin/test

Results

Obtaining the results

$ python3 evaluate.py ./texts/test.txt ./texts/correct.txt ./texts/output.txt

ASC

Precision	Recall	F-measure
92.13	82.51	87.05

JamSpell

Precision	Recall	F-measure
77.87	63.36	69.87

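One of the main features of ASC is that it can learn from dirty data. It is practically impossible to find an openly available text corpus that is free of errors and typos. Fixing terabytes of data by hand is not feasible, yet you still have to work with it somehow.

The training approach I propose

Build a language model from the dirty data
Remove all rare words and N-grams from the built language model
Add single words so that the typo correction system works more correctly
Build the binary dictionary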
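
The second step rests on the observation that in a large dirty corpus a typo is usually much rarer than the correct spelling. A minimal sketch of that idea, assuming a plain count threshold and an optional whitelist; ALM itself prunes by a word weight rather than a raw count, so the names and the threshold here are purely illustrative:

```python
from collections import Counter


def prune_rare(counts: Counter, min_count: int, whitelist=frozenset()) -> Counter:
    """Keep words seen at least min_count times, plus whitelisted words."""
    return Counter({
        word: count
        for word, count in counts.items()
        if count >= min_count or word in whitelist
    })


# "teh" is a typo that occurs once, the correct "the" occurs twice
tokens = "the cat sat on the mat teh cat".split()
counts = Counter(tokens)

print(sorted(prune_rare(counts, min_count=2)))
print(sorted(prune_rare(counts, min_count=2, whitelist={"mat"})))
```

The whitelist plays the role of the third step: rare but valid words that pruning would throw away are added back explicitly.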

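Let's get started

Suppose we have several corpora on different topics; it is more logical to process them separately and then merge the results.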
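
Processing the corpora separately works because N-gram counts are additive: counting each corpus on its own and summing the results gives the same totals as counting everything at once. A small sketch under that assumption, with illustrative function names and toy token lists in place of real corpora:

```python
from collections import Counter
from typing import Iterable, List


def count_ngrams(tokens: List[str], n: int = 3) -> Counter:
    """Count all N-grams of size n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def merge_counts(parts: Iterable[Counter]) -> Counter:
    """Sum N-gram counts collected from separate corpora."""
    total = Counter()
    for part in parts:
        total.update(part)
    return total


news = count_ngrams("a b c d".split())
forum = count_ngrams("b c d e".split())
merged = merge_counts([news, forum])

print(merged[("b", "c", "d")])  # seen once in each corpus
print(len(merged))
```

This also means each topic can be pruned with its own threshold before merging, which is the order the following sections use.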

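Collecting the corpus with ALM

collect.json

{
	"size": 3,
	"debug": 1,
	"threads": 0,
	"ext": "txt",
	"method": "train",
	"allow-unk": true,
	"mixed-dicts": true,
	"only-token-words": true,
	"smoothing": "wittenbell",
	"locale": "en_US.UTF-8",
	"w-abbr": "./output/alm.abbr",
	"w-map": "./output/alm.map",
	"w-vocab": "./output/alm.vocab",
	"w-words": "./output/words.txt",
	"corpus": "./texts/corpus",
	"abbrs": "./abbrs/abbrs.txt",
	"goodwords": "./texts/whitelist/words.txt",
	"badwords": "./texts/blacklist/garbage.txt",
	"mix-restwords": "./texts/similars/letters.txt",
	"alphabet": "абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz"
}

$ ./alm -r-json ./collect.json

size: N-gram size, set to 3
debug: show the progress indicator
threads: number of threads (0 means all available)
ext: extension of the text files in the corpus directory
allow-unk: allow the 〈unk〉 token in the language model
mixed-dicts: allow correction of words with letters substituted from another alphabet
only-token-words: collect only N-grams that consist of regular words, not arbitrary token sequences
smoothing: wittenbell (it does not matter at this stage, but some algorithm has to be specified)
locale: environment locale (important)
w-abbr: file for the collected suffixes of numeric abbreviations
w-map: file for the sequence map, saved as an intermediate result
w-vocab: file for the collected vocabulary
w-words: file for the list of collected unique words (just in case)
corpus: directory with the corpus text data
abbrs: file with commonly used abbreviations
goodwords: whitelist of words prepared in advance
badwords: blacklist of words prepared in advance
mix-restwords: file with look-alike letters of different alphabets
alphabet: alphabet used for training (it must be the same at every stage)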

Python

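import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available threads (0 means the maximum)
alm.setThreads(0)
# Set the locale (important)
alm.setLocale("en_US.UTF-8")
# Set the alphabet (it must be the same at every stage)
alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Set the look-alike letters of different alphabets
alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})

# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)
# Allow correction of words with letters substituted from another alphabet
alm.setOption(alm.options_t.mixDicts)
# Collect only N-grams that consist of regular words
alm.setOption(alm.options_t.tokenWords)

# Use Witten-Bell smoothing (irrelevant at this stage, but an algorithm must be specified)
alm.init(alm.smoothing_t.wittenBell)

# Load the commonly used abbreviations
with open('./abbrs/abbrs.txt', encoding='utf-8') as f:
    for abbr in f:
        alm.addAbbr(abbr.rstrip("\n"))

# Load the whitelist of words
with open('./texts/whitelist/words.txt', encoding='utf-8') as f:
    for word in f:
        alm.addGoodword(word.rstrip("\n"))

# Load the blacklist of words
with open('./texts/blacklist/garbage.txt', encoding='utf-8') as f:
    for word in f:
        alm.addBadword(word.rstrip("\n"))

def status(text, status):
    print(text, status)

def statusWords(status):
    print("Write words", status)

def statusVocab(status):
    print("Write vocab", status)

def statusMap(status):
    print("Write map", status)

def statusSuffix(status):
    print("Write suffix", status)

# Collect the data from the corpus directory
alm.collectCorpus("./texts/corpus", status)
# Save the list of collected unique words
alm.writeWords("./output/words.txt", statusWords)
# Save the collected vocabulary
alm.writeVocab("./output/alm.vocab", statusVocab)
# Save the sequence map
alm.writeMap("./output/alm.map", statusMap)
# Save the suffixes of numeric abbreviations
alm.writeSuffix("./output/alm.abbr", statusSuffix)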

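Pruning the collected corpus with ALM

prune.json

{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "method": "vprune",
    "vprune-wltf": -15.0,
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-map": "./corpus1/alm.map",
    "r-vocab": "./corpus1/alm.vocab",
    "w-map": "./output/alm.map",
    "w-vocab": "./output/alm.vocab",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "alphabet": "абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz"
}

$ ./alm -r-json ./prune.json

size: N-gram size, set to 3
debug: show the progress indicator
allow-unk: allow the 〈unk〉 token in the language model
vprune-wltf: minimum allowed word weight in the vocabulary (anything below it is removed)
locale: environment locale (important)
smoothing: wittenbell (it does not matter at this stage, but some algorithm has to be specified)
r-map: sequence map collected at the previous stage
r-vocab: vocabulary collected at the previous stage
w-map: file for the pruned sequence map
w-vocab: file for the pruned vocabulary
goodwords: whitelist of words prepared in advance
badwords: blacklist of words prepared in advance
alphabet: alphabet used for training (it must be the same at every stage)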

Python

import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available threads (0 means the maximum)
alm.setThreads(0)
# Set the locale (important)
alm.setLocale("en_US.UTF-8")
# Set the alphabet (it must be the same at every stage)
alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")

# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)

# Use Witten-Bell smoothing (irrelevant at this stage, but an algorithm must be specified)
alm.init(alm.smoothing_t.wittenBell)

# Load the whitelist of words
with open('./texts/whitelist/words.txt', encoding='utf-8') as f:
    for word in f:
        alm.addGoodword(word.rstrip("\n"))

# Load the blacklist of words
with open('./texts/blacklist/garbage.txt', encoding='utf-8') as f:
    for word in f:
        alm.addBadword(word.rstrip("\n"))

def statusPrune(status):
    print("Prune data", status)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

# Load the vocabulary
alm.readVocab("./corpus1/alm.vocab", statusReadVocab)
# Load the sequence map
alm.readMap("./corpus1/alm.map", statusReadMap)
# Prune the vocabulary
alm.pruneVocab(-15.0, 0, 0, statusPrune)
# Save the pruned vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Save the pruned sequence map
alm.writeMap("./output/alm.map", statusWriteMap)

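Merging the collected data with ALM

merge.json

{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "method": "merge",
    "mixed-dicts": true,
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-words": "./texts/words",
    "r-map": "./corpus1",
    "r-vocab": "./corpus1",
    "w-map": "./output/alm.map",
    "w-vocab": "./output/alm.vocab",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "mix-restwords": "./texts/similars/letters.txt",
    "alphabet": "абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz"
}

$ ./alm -r-json ./merge.json

size: N-gram size, set to 3
debug: show the progress indicator
allow-unk: allow the 〈unk〉 token in the language model
mixed-dicts: allow correction of words with letters substituted from another alphabet
locale: environment locale (important)
smoothing: wittenbell (it does not matter at this stage, but some algorithm has to be specified)
r-words: directory with the words to add to the vocabulary
r-map: directory with the sequence map files, collected and pruned at the previous stages
r-vocab: directory with the vocabulary files, collected and pruned at the previous stages
w-map: file for the merged sequence map
w-vocab: file for the merged vocabulary
goodwords: whitelist of words prepared in advance
badwords: blacklist of words prepared in advance
alphabet: alphabet used for training (it must be the same at every stage)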

Python

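import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available threads (0 means the maximum)
alm.setThreads(0)
# Set the locale (important)
alm.setLocale("en_US.UTF-8")
# Set the alphabet (it must be the same at every stage)
alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Set the look-alike letters of different alphabets
alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})

# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)
# Allow correction of words with letters substituted from another alphabet
alm.setOption(alm.options_t.mixDicts)

# Use Witten-Bell smoothing (irrelevant at this stage, but an algorithm must be specified)
alm.init(alm.smoothing_t.wittenBell)

# Load the whitelist of words
with open('./texts/whitelist/words.txt', encoding='utf-8') as f:
    for word in f:
        alm.addGoodword(word.rstrip("\n"))

# Load the blacklist of words
with open('./texts/blacklist/garbage.txt', encoding='utf-8') as f:
    for word in f:
        alm.addBadword(word.rstrip("\n"))

# Load the words that have to be added to the vocabulary
with open('./texts/words.txt', encoding='utf-8') as f:
    for word in f:
        alm.addWord(word.rstrip("\n"))

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

# Load the vocabularies from the directory
alm.readVocab("./corpus1", statusReadVocab)
# Load the sequence maps from the directory
alm.readMap("./corpus1", statusReadMap)
# Save the merged vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Save the merged sequence map
alm.writeMap("./output/alm.map", statusWriteMap)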

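Training the language model with ALM

train.json

{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "reset-unk": true,
    "interpolate": true,
    "method": "train",
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-map": "./output/alm.map",
    "r-vocab": "./output/alm.vocab",
    "w-arpa": "./output/alm.arpa",
    "w-words": "./output/words.txt",
    "alphabet": "абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz"
}

$ ./alm -r-json ./train.json

size: N-gram size, set to 3
debug: show the progress indicator
allow-unk: allow the 〈unk〉 token in the language model
reset-unk: reset the frequency of the 〈unk〉 token in the language model
interpolate: use interpolation in the calculations
locale: environment locale (important)
smoothing: wittenbell
r-map: sequence map, collected and merged at the previous stages
r-vocab: vocabulary, collected and merged at the previous stages
w-arpa: ARPA file to write the trained language model to
w-words: file for the list of unique words (just in case)
alphabet: alphabet used for training (it must be the same at every stage)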
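
For readers unfamiliar with the smoothing and interpolate parameters: Witten-Bell smoothing reserves probability mass for unseen continuations in proportion to the number of distinct words that were seen after a given history. Below is a minimal sketch of the textbook interpolated form for bigrams, P(w|h) = (c(h,w) + T(h) * P(w)) / (c(h) + T(h)), where T(h) is the number of distinct followers of h. It illustrates the formula only and is not ALM's implementation; the function names are assumptions.

```python
from collections import Counter, defaultdict


def train_bigram_wb(tokens):
    """Return prob(w, h): interpolated Witten-Bell bigram probability."""
    unigrams = Counter(tokens)
    total = len(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    hist_count = Counter()        # c(h): how often h occurs as a history
    followers = defaultdict(set)  # distinct words seen after h
    for (h, w), c in bigrams.items():
        hist_count[h] += c
        followers[h].add(w)

    def prob(w, h):
        p_uni = unigrams[w] / total  # lower-order (unigram) estimate
        c_h = hist_count[h]
        t_h = len(followers.get(h, ()))
        if c_h == 0:
            return p_uni             # unseen history: back off entirely
        return (bigrams[(h, w)] + t_h * p_uni) / (c_h + t_h)

    return prob


prob = train_bigram_wb("a b a c a b".split())
print(prob("b", "a") > prob("c", "a"))        # seen twice vs seen once
print(prob("a", "a") > 0)                      # unseen bigram still gets mass
print(sum(prob(w, "a") for w in "abc"))        # sums to 1 up to rounding
```

The point relevant to typo correction is the second line of output: an N-gram that never occurred in the corpus does not get zero probability, so a correct but rare phrase is not automatically rejected.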

Python

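import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available threads (0 means the maximum)
alm.setThreads(0)
# Set the locale (important)
alm.setLocale("en_US.UTF-8")
# Set the alphabet (it must be the same at every stage)
alm.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Set the look-alike letters of different alphabets
alm.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})

# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)
# Reset the frequency of the <unk> token in the language model
alm.setOption(alm.options_t.resetUnk)
# Allow correction of words with letters substituted from another alphabet
alm.setOption(alm.options_t.mixDicts)
# Use interpolation in the calculations
alm.setOption(alm.options_t.interpolate)

# Use Witten-Bell smoothing
alm.init(alm.smoothing_t.wittenBell)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusBuildArpa(status):
    print("Build ARPA", status)

def statusWriteMap(status):
    print("Write map", status)

def statusWriteArpa(status):
    print("Write ARPA", status)

def statusWords(status):
    print("Write words", status)

# Load the vocabulary
alm.readVocab("./output/alm.vocab", statusReadVocab)
# Load the sequence map
alm.readMap("./output/alm.map", statusReadMap)

# Train the language model
alm.buildArpa(statusBuildArpa)

# Save the language model in the ARPA format
alm.writeArpa("./output/alm.arpa", statusWriteArpa)

# Save the list of unique words
alm.writeWords("./output/words.txt", statusWords)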

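Training the ASC spell checker

train.json

{
	"size": 3,
	"debug": 1,
	"threads": 0,
	"confidence": true,
	"mixed-dicts": true,
	"method": "train",
	"alter": {"е":"ё"},
	"locale": "en_US.UTF-8",
	"smoothing": "wittenbell",
	"pilots": ["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"],
	"w-bin": "./dictionary/3-single.asc",
	"r-abbr": "./output/alm.abbr",
	"r-vocab": "./output/alm.vocab",
	"r-arpa": "./output/alm.arpa",
	"abbrs": "./texts/abbrs/abbrs.txt",
	"goodwords": "./texts/whitelist/words.txt",
	"badwords": "./texts/blacklist/garbage.txt",
	"alters": "./texts/alters/yoficator.txt",
	"upwords": "./texts/words/upp",
	"mix-restwords": "./texts/similars/letters.txt",
	"alphabet": "абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz",
	"bin-code": "ru",
	"bin-name": "Russian",
	"bin-author": "You name",
	"bin-copyright": "You company LLC",
	"bin-contacts": "site: https://example.com, e-mail: info@example.com",
	"bin-lictype": "MIT",
	"bin-lictext": "... License text ...",
	"embedding-size": 28,
	"embedding": {
	    "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
	    "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
	    "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
	    "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
	    "ч": 21, "щ": 21, "ш": 21, "ь": 22, "ы": 23, "ъ": 22,
	    "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
	    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
	    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
	    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
	    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
	    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
	    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
	    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
	    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
	}
}

$ ./asc -r-json ./train.json

size: N-gram size, set to 3
debug: show the progress indicator
threads: number of threads (0 means all available)
confidence: load the ARPA data as is, without reprocessing the words
mixed-dicts: allow correction of words with letters substituted from another alphabet
alter: alternative letters (letters that are commonly typed as another letter; in our case «ё», typed as «е»)
locale: environment locale (important)
smoothing: wittenbell (it does not matter at this stage, but some algorithm has to be specified)
pilots: single-letter words (words that consist of one letter)
w-bin: file for the binary dictionary
r-abbr: suffixes of numeric abbreviations collected at the previous stages
r-vocab: vocabulary collected at the previous stages
r-arpa: ARPA file built at the previous stage
abbrs: file with commonly used abbreviations
goodwords: whitelist of words prepared in advance
badwords: blacklist of words prepared in advance
alters: file with words that contain alternative letters and are always spelled unambiguously
upwords: directory with words that must always start with a capital letter (names, surnames, cities...)
mix-restwords: file with look-alike letters of different alphabets
alphabet: alphabet used for training (it must be the same at every stage)
bin-code: language code of the dictionary
bin-name: name of the dictionary
bin-author: author of the dictionary
bin-copyright: copyright holder of the dictionary
bin-contacts: contact details of the dictionary author
bin-lictype: license type of the dictionary
bin-lictext: license text of the dictionary
embedding-size: size of the embedding
embedding: embedding parameters (not a strict dependency; affects how accurately candidates are selected)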

Python

import asc

# Set the N-gram size to 3
asc.setSize(3)
# Use all available threads (0 means the maximum)
asc.setThreads(0)
# Set the locale (important)
asc.setLocale("en_US.UTF-8")

# Allow correction of the case of the first letter of a word
asc.setOption(asc.options_t.uppers)
# Allow the <unk> token in the language model
asc.setOption(asc.options_t.allowUnk)
# Reset the frequency of the <unk> token in the language model
asc.setOption(asc.options_t.resetUnk)
# Allow correction of words with letters substituted from another alphabet
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as is, without reprocessing the words
asc.setOption(asc.options_t.confidence)

# Set the alphabet (it must be the same at every stage)
asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Set the single-letter words (words that consist of one letter)
asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
# Set the look-alike letters of different alphabets
asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})

# Load the whitelist of words
with open('./texts/whitelist/words.txt', encoding='utf-8') as f:
    for word in f:
        asc.addGoodword(word.rstrip("\n"))

# Load the blacklist of words
with open('./texts/blacklist/garbage.txt', encoding='utf-8') as f:
    for word in f:
        asc.addBadword(word.rstrip("\n"))

# Load the suffixes of numeric abbreviations
with open('./output/alm.abbr', encoding='utf-8') as f:
    for word in f:
        asc.addSuffix(word.rstrip("\n"))

# Load the commonly used abbreviations
with open('./texts/abbrs/abbrs.txt', encoding='utf-8') as f:
    for abbr in f:
        asc.addAbbr(abbr.rstrip("\n"))

# Load the words that must always start with a capital letter (names, surnames, cities...)
with open('./texts/words/upp/words.txt', encoding='utf-8') as f:
    for word in f:
        asc.addUWord(word.rstrip("\n"))

# Set the alternative letter
asc.addAlt("е", "ё")

# Load the words with alternative letters that are always spelled unambiguously
# (tab-separated pairs, one pair per line)
with open('./texts/alters/yoficator.txt', encoding='utf-8') as f:
    for words in f:
        words = words.rstrip("\n").split('\t')
        asc.addAlt(words[0], words[1])

def statusIndex(text, status):
    print(text, status)

def statusBuildIndex(status):
    print("Build index", status)

def statusArpa(status):
    print("Read arpa", status)

def statusVocab(status):
    print("Read vocab", status)

# Load the language model in the ARPA format
asc.readArpa("./output/alm.arpa", statusArpa)
# Load the vocabulary
asc.readVocab("./output/alm.vocab", statusVocab)

# Set the language code of the dictionary
asc.setCode("RU")
# Set the license type of the dictionary
asc.setLictype("MIT")
# Set the name of the dictionary
asc.setName("Russian")
# Set the author of the dictionary
asc.setAuthor("You name")
# Set the copyright holder of the dictionary
asc.setCopyright("You company LLC")
# Set the license text of the dictionary
asc.setLictext("... License text ...")
# Set the contact details of the dictionary author
asc.setContacts("site: https://example.com, e-mail: info@example.com")

# Set the embedding parameters (not a strict dependency; affects candidate selection)
asc.setEmbedding({
    "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
    "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
    "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
    "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
    "ч": 21, "щ": 21, "ш": 21, "ь": 22, "ы": 23, "ъ": 22,
    "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

# Build the index of the binary dictionary
asc.buildIndex(statusBuildIndex)

# Save the binary dictionary
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)

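I understand that not everyone can train their own binary dictionary; it takes text corpora and considerable computing resources. For that reason ASC can also work with nothing but an ARPA file as its main dictionary.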
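
For reference, an ARPA file is plain text: a \data\ header with N-gram counts, then one \N-grams: section per order, where each line holds a log10 probability, the words of the N-gram and an optional back-off weight, and finally an \end\ marker. A minimal reader sketch follows, assuming whitespace-separated fields; read_arpa and the tiny sample model are illustrative and unrelated to ASC's own loader.

```python
import io
from typing import Dict, Tuple

NGrams = Dict[int, Dict[Tuple[str, ...], Tuple[float, float]]]


def read_arpa(stream) -> NGrams:
    """Parse an ARPA model into {order: {words: (log10 prob, back-off)}}."""
    ngrams: NGrams = {}
    order = 0
    for raw in stream:
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue  # blank lines and the header carry no N-grams
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])  # "\2-grams:" gives 2
            ngrams[order] = {}
            continue
        if order == 0:
            continue  # ignore anything before the first section
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The back-off weight is optional; treat a missing one as 0.0
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        ngrams[order][words] = (logprob, backoff)
    return ngrams


SAMPLE = "\n".join([
    "\\data\\",
    "ngram 1=3",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-0.5 <s> -0.3",
    "-0.7 hello -0.2",
    "-0.9 </s>",
    "",
    "\\2-grams:",
    "-0.1 <s> hello",
    "",
    "\\end\\",
])

model = read_arpa(io.StringIO(SAMPLE))
print(model[1][("hello",)])
print(model[2][("<s>", "hello")])
```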

Usage example

spell.json

{
    "ad": 13,
    "cw": 38120,
    "debug": 1,
    "threads": 0,
    "method": "spell",
    "alter": {"е":"ё"},
    "asc-split": true,
    "asc-alter": true,
    "confidence": true,
    "asc-esplit": true,
    "asc-rsplit": true,
    "asc-uppers": true,
    "asc-hyphen": true,
    "mixed-dicts": true,
    "asc-wordrep": true,
    "spell-verbose": true,
    "r-text": "./texts/test.txt",
    "w-text": "./texts/output.txt",
    "upwords": "./texts/words/upp",
    "r-arpa": "./dictionary/alm.arpa",
    "r-abbr": "./dictionary/alm.abbr",
    "abbrs": "./texts/abbrs/abbrs.txt",
    "alters": "./texts/alters/yoficator.txt",
    "mix-restwords": "./similars/letters.txt",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "pilots": ["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"],
    "alphabet": "абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz",
    "embedding-size": 28,
    "embedding": {
        "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
        "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
        "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
        "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
        "ч": 21, "щ": 21, "ш": 21, "ь": 22, "ы": 23, "ъ": 22,
        "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
        "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
        "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
        "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
        "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
        "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
        "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
        "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
        "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
    }
}

$ ./asc -r-json ./spell.json

Python

import asc

# Use all available threads (0 means the maximum)
asc.setThreads(0)
# Allow correction of the case of the first letter of a word
asc.setOption(asc.options_t.uppers)
# Allow splitting of merged words
asc.setOption(asc.options_t.ascSplit)
# Allow restoring of alternative letters
asc.setOption(asc.options_t.ascAlter)
# Allow splitting of merged words that contain errors
asc.setOption(asc.options_t.ascESplit)
# Allow removal of extra spaces between parts of a word
asc.setOption(asc.options_t.ascRSplit)
# Allow correction of letter case
asc.setOption(asc.options_t.ascUppers)
# Allow splitting of words by hyphens
asc.setOption(asc.options_t.ascHyphen)
# Allow removal of repeated words
asc.setOption(asc.options_t.ascWordRep)
# Allow correction of words with letters substituted from another alphabet
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as is, without reprocessing the words
asc.setOption(asc.options_t.confidence)

# Set the alphabet (it must be the same at every stage)
asc.setAlphabet("абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz")
# Set the single-letter words (words that consist of one letter)
asc.setPilots(["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"])
# Set the look-alike letters of different alphabets
asc.setSubstitutes({'p':'р','c':'с','o':'о','t':'т','k':'к','e':'е','a':'а','h':'н','x':'х','b':'в','m':'м'})

# Load the whitelist of words
with open('./texts/whitelist/words.txt', encoding='utf-8') as f:
    for word in f:
        asc.addGoodword(word.rstrip("\n"))

# Load the blacklist of words
with open('./texts/blacklist/garbage.txt', encoding='utf-8') as f:
    for word in f:
        asc.addBadword(word.rstrip("\n"))

# Load the suffixes of numeric abbreviations
with open('./output/alm.abbr', encoding='utf-8') as f:
    for word in f:
        asc.addSuffix(word.rstrip("\n"))

# Load the commonly used abbreviations
with open('./texts/abbrs/abbrs.txt', encoding='utf-8') as f:
    for abbr in f:
        asc.addAbbr(abbr.rstrip("\n"))

# Load the words that must always start with a capital letter (names, surnames, cities...)
with open('./texts/words/upp/words.txt', encoding='utf-8') as f:
    for word in f:
        asc.addUWord(word.rstrip("\n"))

# Set the alternative letter
asc.addAlt("е", "ё")

# Load the words with alternative letters that are always spelled unambiguously
# (tab-separated pairs, one pair per line)
with open('./texts/alters/yoficator.txt', encoding='utf-8') as f:
    for words in f:
        words = words.rstrip("\n").split('\t')
        asc.addAlt(words[0], words[1])

def statusArpa(status):
    print("Read arpa", status)

def statusIndex(status):
    print("Build index", status)

# Load the language model in the ARPA format
asc.readArpa("./dictionary/alm.arpa", statusArpa)

# Set the corpus statistics (38120 words and 13 documents were used in training)
asc.setAdCw(38120, 13)

# Set the embedding parameters (not a strict dependency; affects candidate selection)
asc.setEmbedding({
    "а": 0, "б": 1, "в": 2, "г": 3, "д": 4, "е": 5,
    "ё": 5, "ж": 6, "з": 7, "и": 8, "й": 8, "к": 9,
    "л": 10, "м": 11, "н": 12, "о": 0, "п": 13, "р": 14,
    "с": 15, "т": 16, "у": 17, "ф": 18, "х": 19, "ц": 20,
    "ч": 21, "щ": 21, "ш": 21, "ь": 22, "ы": 23, "ъ": 22,
    "э": 5, "ю": 24, "я": 25, "<": 26, ">": 26, "~": 26,
    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

# Build the index of the dictionary
asc.buildIndex(statusIndex)

f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')

for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])

f2.close()
f1.close()

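P.S. For those who do not want to collect and train anything at all, I have set up a web version of ASC. Keep in mind also that a typo correction system is not omniscient, and it is impossible to feed the entire Russian language into it. ASC will not correct just any text, so it has to be trained separately for each subject area.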
