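Hello, this is my third article on Habr; earlier I wrote about the ALM language model. Now I would like to introduce ASC, a typo correction system built on top of ALM.
Yes, there are many typo correction systems, and each has its own strengths and weaknesses. Among the open ones I would single out JamSpell as one of the most promising, and it is the one we will compare against. DeepPavlov has a similar system that many readers may think of, but I never managed to get along with it.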
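Features:
- Corrects errors in words that are within a Levenshtein distance of 4.
- Corrects typos in words (insertion, deletion, substitution, transposition).
- Yo-fication (restoring the Russian letter «ё») based on context.
- Sets the case of the first letter of a word (proper names and titles) based on context.
- Splits merged words into separate words based on context.
- Analyzes text without correcting the original.
- Searches text for errors, typos, and wrong context.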
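To make the first two features concrete, here is a minimal sketch of an edit distance that counts insertions, deletions, substitutions and transpositions of adjacent characters (the "optimal string alignment" variant of Damerau-Levenshtein). It is only an illustration, not ASC's internal algorithm; the function names `osa_distance` and `within_limit` are hypothetical, and the limit of 4 is taken from the list above.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insertion, deletion, substitution and
    transposition of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def within_limit(typo: str, candidate: str, limit: int = 4) -> bool:
    """True if the candidate is close enough to be considered a correction."""
    return osa_distance(typo, candidate) <= limit


print(osa_distance("helo", "hello"), within_limit("helo", "hello"))
```

A real corrector would not compare against every dictionary word this way; it would first narrow the candidates with an index and then rank them with the language model.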
Supported operating systems:
- MacOS X
- FreeBSD
- Linux
The system is written in C++11 and has bindings for Python 3.
Ready-made dictionaries
Name | Size (GB) | RAM (GB) | N-gram size | Language |
---|---|---|---|---|
wittenbell-3-big.asc | 1.97 | 15.6 | 3 | RU |
wittenbell-3-middle.asc | 1.24 | 9.7 | 3 | RU |
mkneserney-3-middle.asc | 1.33 | 9.7 | 3 | RU |
wittenbell-3-single.asc | 0.772 | 5.14 | 3 | RU |
wittenbell-5-single.asc | 1.37 | 10.7 | 5 | RU |
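Testing
Data from the 2016 Dialog21 "typo correction" competition was used to test the system. The trained binary dictionary used for testing: wittenbell-3-middle.asc
Test | Precision | Recall | F-measure |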
---|---|---|---|
Typo correction mode | 76.97 | 62.71 | 69.11 |
Error correction mode | 73.72 | 60.53 | 66.48 |
I see no need to add other data; anyone who wishes can repeat the test, and all the materials used are listed below.
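The F-measure values in the table are consistent with the usual harmonic mean of precision and recall (F1). A small sketch for checking them, assuming that definition; `f_measure` is a hypothetical helper, not part of evaluate.py:

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1), in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the table above
for name, p, r in [
    ("typo correction mode", 76.97, 62.71),
    ("error correction mode", 73.72, 60.53),
]:
    print(f"{name}: F = {f_measure(p, r):.2f}")
```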
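Materials used in testing
- test.txt - the test text
- correct.txt - the text with the correct variants
- evaluate.py - a Python 3 script that scores the correction results
Now it is interesting to compare how typo correction systems perform under equal conditions: we will train two different systems on the same text data and run the same test.
For the comparison, let's take JamSpell, the typo correction system I mentioned above.
ASC vs JamSpell
Installation
ASC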
$ git clone --recursive https://github.com/anyks/asc.git
$ cd ./asc
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
JamSpell
$ git clone https://github.com/bakwc/JamSpell.git
$ cd ./JamSpell
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
Training
ASC
train.json
{
"ext": "txt",
"size": 3,
"alter": {"":""},
"debug": 1,
"threads": 0,
"method": "train",
"allow-unk": true,
"reset-unk": true,
"confidence": true,
"interpolate": true,
"mixed-dicts": true,
"only-token-words": true,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"corpus": "./texts/correct.txt",
"w-bin": "./dictionary/3-middle.asc",
"w-vocab": "./train/lm.vocab",
"w-arpa": "./train/lm.arpa",
"mix-restwords": "./similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
Python3
import asc
asc.setSize(3)
asc.setAlmV2()
asc.setThreads(0)
asc.setLocale("en_US.UTF-8")
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.allowUnk)
asc.setOption(asc.options_t.resetUnk)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.tokenWords)
asc.setOption(asc.options_t.confidence)
asc.setOption(asc.options_t.interpolate)
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
def statusArpa1(status):
    print("Build arpa", status)

def statusArpa2(status):
    print("Write arpa", status)

def statusVocab(status):
    print("Write vocab", status)

def statusIndex(text, status):
    print(text, status)

def status(text, status):
    print(text, status)

asc.collectCorpus("./texts/correct.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)
asc.buildArpa(statusArpa1)
asc.writeArpa("./train/lm.arpa", statusArpa2)
asc.writeVocab("./train/lm.vocab", statusVocab)
asc.setCode("RU")
asc.setLictype("MIT")
asc.setName("Russian")
asc.setAuthor("You name")
asc.setCopyright("You company LLC")
asc.setLictext("... License text ...")
asc.setContacts("site: https://example.com, e-mail: info@example.com")
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
JamSpell
$ ./main/jamspell train ../test_data/alphabet_ru.txt ../test_data/correct.txt ./model.bin
Testing
ASC
spell.json
{
"debug": 1,
"threads": 0,
"method": "spell",
"spell-verbose": true,
"confidence": true,
"mixed-dicts": true,
"asc-split": true,
"asc-alter": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"asc-wordrep": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"r-bin": "./dictionary/3-middle.asc"
}
$ ./asc -r-json ./spell.json
Python3
import asc
asc.setAlmV2()
asc.setThreads(0)
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.ascSplit)
asc.setOption(asc.options_t.ascAlter)
asc.setOption(asc.options_t.ascESplit)
asc.setOption(asc.options_t.ascRSplit)
asc.setOption(asc.options_t.ascUppers)
asc.setOption(asc.options_t.ascHyphen)
asc.setOption(asc.options_t.ascWordRep)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.confidence)
def status(text, status):
    print(text, status)

asc.loadIndex("./dictionary/3-middle.asc", "", status)
f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')
for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])
f2.close()
f1.close()
JamSpell
C++ test program (test.cpp)
#include <fstream>
#include <iostream>
#include <jamspell/spell_corrector.hpp>
// BOOST
#ifdef USE_BOOST_CONVERT
#include <boost/locale/encoding_utf.hpp>
//
#else
#include <codecvt>
#endif
using namespace std;
/**
 * convert Converts a wide string to a UTF-8 encoded string
 * @param str wide string to convert
 * @return    UTF-8 encoded string
 */
const string convert(const wstring & str){
//
string result = "";
//
if(!str.empty()){
// BOOST
#ifdef USE_BOOST_CONVERT
//
using boost::locale::conv::utf_to_utf;
// utf-8
result = utf_to_utf <char> (str.c_str(), str.c_str() + str.size());
//
#else
// UTF-8
using convert_type = codecvt_utf8 <wchar_t, 0x10ffff, little_endian>;
//
wstring_convert <convert_type, wchar_t> conv;
// wstring_convert <codecvt_utf8 <wchar_t>> conv;
// utf-8
result = conv.to_bytes(str);
#endif
}
//
return result;
}
/**
 * convert Converts a UTF-8 encoded string to a wide string
 * @param str UTF-8 encoded string to convert
 * @return    wide string
 */
const wstring convert(const string & str){
//
wstring result = L"";
//
if(!str.empty()){
// BOOST
#ifdef USE_BOOST_CONVERT
//
using boost::locale::conv::utf_to_utf;
// utf-8
result = utf_to_utf <wchar_t> (str.c_str(), str.c_str() + str.size());
//
#else
//
// wstring_convert <codecvt_utf8 <wchar_t>> conv;
wstring_convert <codecvt_utf8_utf16 <wchar_t, 0x10ffff, little_endian>> conv;
// utf-8
result = conv.from_bytes(str);
#endif
}
//
return result;
}
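/**
 * safeGetline Reads one line from a stream, accepting \n, \r\n and \r line endings
 * @param is input stream to read from
 * @param t  string that receives the line
 * @return   the input stream
 */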
istream & safeGetline(istream & is, string & t){
//
t.clear();
istream::sentry se(is, true);
streambuf * sb = is.rdbuf();
for(;;){
int c = sb->sbumpc();
switch(c){
case '\n': return is;
case '\r':
if(sb->sgetc() == '\n') sb->sbumpc();
return is;
case streambuf::traits_type::eof():
if(t.empty()) is.setstate(ios::eofbit);
return is;
default: t += (char) c;
}
}
}
/**
 * main Corrects test.txt line by line with JamSpell and writes the result to output.txt
 */
int main(){
//
NJamSpell::TSpellCorrector corrector;
//
corrector.LoadLangModel("model.bin");
//
ifstream file1("./test_data/test.txt", ios::in);
//
if(file1.is_open()){
//
string line = "", res = "";
//
ofstream file2("./test_data/output.txt", ios::out);
//
if(file2.is_open()){
//
while(file1.good()){
//
safeGetline(file1, line);
// ,
if(!line.empty()){
//
res = convert(corrector.FixFragment(convert(line)));
// ,
if(!res.empty()){
//
res.append("\n");
//
file2.write(res.c_str(), res.size());
}
}
}
//
file2.close();
}
//
file1.close();
}
return 0;
}
$ g++ -std=c++11 -I../JamSpell ./test.cpp -L./build/jamspell -L./build/contrib/cityhash -L./build/contrib/phf -ljamspell_lib -lcityhash -lphf -o ./bin/test
$ ./bin/test
Results
Obtaining the results
$ python3 evaluate.py ./texts/test.txt ./texts/correct.txt ./texts/output.txt
ASC
Precision | Recall | F-measure |
---|---|---|
92.13 | 82.51 | 87.05 |
JamSpell
Precision | Recall | F-measure |
---|---|---|
77.87 | 63.36 | 69.87 |
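One of ASC's main capabilities is learning from dirty data. It is practically impossible to find publicly available text corpora that are free of errors and typos. A lifetime would not be enough to fix terabytes of data by hand, yet you need to work with it somehow.
The training approach I propose
- Build a language model from the dirty data.
- Remove all rare words and N-grams from the assembled language model.
- Add single words so that the typo correction system works more correctly.
- Build the binary dictionary.
Let's get started
Suppose we have several corpora on different topics; it is more logical to train on them separately and then combine the results.
Assembling the corpus with ALM
collect.json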
{
"size": 3,
"debug": 1,
"threads": 0,
"ext": "txt",
"method": "train",
"allow-unk": true,
"mixed-dicts": true,
"only-token-words": true,
"smoothing": "wittenbell",
"locale": "en_US.UTF-8",
"w-abbr": "./output/alm.abbr",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"w-words": "./output/words.txt",
"corpus": "./texts/corpus",
"abbrs": "./abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./collect.json
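- size: N-gram size, set to 3
- debug: show a progress indicator while data is processed
- threads: number of threads; 0 uses all available threads
- ext: extension of the corpus files
- allow-unk: allow the 〈unk〉 token in the language model
- mixed-dicts: allow correcting words that contain substituted letters from other languages
- only-token-words: collect not whole N-grams as they are, but only sequences of real words
- smoothing: use the wittenbell smoothing algorithm
- locale: set the environment locale (optional)
- w-abbr: save the collected suffixes of numeric abbreviations
- w-map: save the sequence map as an intermediate result
- w-vocab: save the collected vocabulary
- w-words: save the list of collected unique words (just in case)
- corpus: directory with the corpus text data to build from
- abbrs: commonly used abbreviations to use in training
- goodwords: pre-prepared whitelist of words
- badwords: pre-prepared blacklist of words
- mix-restwords: file with look-alike letters of different languages
- alphabet: alphabet of the dictionary (optional)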
Python
import alm
# N-gram size is 3
alm.setSize(3)
# Use all available threads
alm.setThreads(0)
# Set the environment locale (optional)
alm.setLocale("en_US.UTF-8")
# Set the dictionary alphabet (optional)
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Look-alike letters from other languages and their replacements
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Allow correcting words with substituted letters from other languages
alm.setOption(alm.options_t.mixDicts)
# Collect sequences of real words rather than whole N-grams as they are
alm.setOption(alm.options_t.tokenWords)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)
# Load commonly used abbreviations
f = open('./abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    alm.addAbbr(abbr)
f.close()
# Load the whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()
# Load the blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

def status(text, status):
    print(text, status)

def statusWords(status):
    print("Write words", status)

def statusVocab(status):
    print("Write vocab", status)

def statusMap(status):
    print("Write map", status)

def statusSuffix(status):
    print("Write suffix", status)

# Collect the corpus
alm.collectCorpus("./texts/corpus", status)
# Save the list of collected unique words
alm.writeWords("./output/words.txt", statusWords)
# Save the collected vocabulary
alm.writeVocab("./output/alm.vocab", statusVocab)
# Save the sequence map
alm.writeMap("./output/alm.map", statusMap)
# Save the collected suffixes of numeric abbreviations
alm.writeSuffix("./output/alm.abbr", statusSuffix)
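What the collection step produces can be pictured with a few lines of plain Python. The sketch below only illustrates the idea behind counting sequences of real words instead of raw N-grams; it does not reproduce ALM's tokenizer or file formats, and `is_word` and `count_ngrams` are hypothetical names.

```python
from collections import Counter

ALPHABET = set("abcdefghijklmnopqrstuvwxyz")


def is_word(token: str) -> bool:
    """A 'real word' here is a non-empty token made only of alphabet letters."""
    return bool(token) and all(ch in ALPHABET for ch in token)


def count_ngrams(lines, size=3):
    """Count N-grams of order 1..size inside runs of consecutive real words."""
    counts = Counter()
    for line in lines:
        run = []  # current sequence of consecutive real words
        # The empty string at the end is a sentinel that flushes the last run
        for raw in line.lower().split() + [""]:
            token = raw.strip(".,!?;:\"'()")
            if is_word(token):
                run.append(token)
                continue
            # A non-word token (number, garbage, sentinel) ends the run
            for order in range(1, size + 1):
                for i in range(len(run) - order + 1):
                    counts[tuple(run[i:i + order])] += 1
            run = []
    return counts


counts = count_ngrams(["the cat sat on 3 mats", "The cat ran."], size=2)
print(counts[("the", "cat")], counts[("on", "mats")])
```

Because the number "3" breaks the run, no bigram is counted across it, which keeps noise out of the model.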
Pruning the assembled corpus with ALM
prune.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "vprune",
"vprune-wltf": -15.0,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./corpus1/alm.map",
"r-vocab": "./corpus1/alm.vocab",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./prune.json
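- size: N-gram size, set to 3
- debug: show a progress indicator while data is processed
- allow-unk: allow the 〈unk〉 token in the language model
- vprune-wltf: minimum allowed word weight in the vocabulary (everything below it is removed)
- locale: set the environment locale (optional)
- smoothing: use the wittenbell smoothing algorithm
- r-map: sequence map collected at the previous step
- r-vocab: vocabulary collected at the previous step
- w-map: save the pruned sequence map
- w-vocab: save the pruned vocabulary
- goodwords: pre-prepared whitelist of words
- badwords: pre-prepared blacklist of words
- alphabet: alphabet of the dictionary (optional)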
Python
import alm
# N-gram size is 3
alm.setSize(3)
# Use all available threads
alm.setThreads(0)
# Set the environment locale (optional)
alm.setLocale("en_US.UTF-8")
# Set the dictionary alphabet (optional)
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)
# Load the whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()
# Load the blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

def statusPrune(status):
    print("Prune data", status)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

# Load the vocabulary
alm.readVocab("./corpus1/alm.vocab", statusReadVocab)
# Load the sequence map
alm.readMap("./corpus1/alm.map", statusReadMap)
# Prune the vocabulary with a word-weight threshold of -15.0
alm.pruneVocab(-15.0, 0, 0, statusPrune)
# Save the pruned vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Save the pruned sequence map
alm.writeMap("./output/alm.map", statusWriteMap)
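The effect of pruning can be shown with a toy example. ALM prunes by a word-weight threshold (the -15.0 above); the sketch below swaps that for a plain minimum count, which is an assumption made purely for illustration, and the helper names `prune_vocab` and `prune_ngrams` are hypothetical.

```python
from collections import Counter


def prune_vocab(word_counts, min_count=2, whitelist=(), blacklist=()):
    """Drop rare words, always keep whitelisted ones, always drop blacklisted ones."""
    keep, drop = set(whitelist), set(blacklist)
    return Counter({
        word: count
        for word, count in word_counts.items()
        if word not in drop and (count >= min_count or word in keep)
    })


def prune_ngrams(ngram_counts, vocab):
    """Keep only the N-grams whose words all survived vocabulary pruning."""
    return Counter({
        ngram: count
        for ngram, count in ngram_counts.items()
        if all(word in vocab for word in ngram)
    })


words = Counter({"hello": 10, "helo": 1, "asc": 1, "spam": 50})
vocab = prune_vocab(words, min_count=2, whitelist={"asc"}, blacklist={"spam"})
ngrams = prune_ngrams(Counter({("hello", "asc"): 1, ("helo", "asc"): 1}), vocab)
print(sorted(vocab), sorted(ngrams))
```

The rare typo "helo" disappears together with every N-gram that contains it, while the whitelisted word survives despite its low count.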
Merging the collected data with ALM
merge.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "merge",
"mixed-dicts": true,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-words": "./texts/words",
"r-map": "./corpus1",
"r-vocab": "./corpus1",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./merge.json
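- size: N-gram size, set to 3
- debug: show a progress indicator while data is processed
- allow-unk: allow the 〈unk〉 token in the language model
- mixed-dicts: allow correcting words that contain substituted letters from other languages
- locale: set the environment locale (optional)
- smoothing: use the wittenbell smoothing algorithm
- r-words: directory with word lists to add to the vocabulary
- r-map: directory with the sequence map files collected at the previous steps
- r-vocab: directory with the vocabulary files collected at the previous steps
- w-map: save the merged sequence map
- w-vocab: save the merged vocabulary
- goodwords: pre-prepared whitelist of words
- badwords: pre-prepared blacklist of words
- alphabet: alphabet of the dictionary (optional)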
Python
import alm
# N-gram size is 3
alm.setSize(3)
# Use all available threads
alm.setThreads(0)
# Set the environment locale (optional)
alm.setLocale("en_US.UTF-8")
# Set the dictionary alphabet (optional)
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Look-alike letters from other languages and their replacements
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Allow correcting words with substituted letters from other languages
alm.setOption(alm.options_t.mixDicts)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)
# Load the whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()
# Load the blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()
# Load the single words to add to the vocabulary
f = open('./texts/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addWord(word)
f.close()

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

# Load the vocabulary files from the directory
alm.readVocab("./corpus1", statusReadVocab)
# Load the sequence map files from the directory
alm.readMap("./corpus1", statusReadMap)
# Save the merged vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Save the merged sequence map
alm.writeMap("./output/alm.map", statusWriteMap)
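Conceptually, merging corpora that were collected separately comes down to adding up their counts and then making sure the extra single words are present. A minimal sketch under that assumption follows; it works on in-memory counters rather than ALM's map and vocab files, and `merge_counts` and `add_words` are hypothetical helpers.

```python
from collections import Counter


def merge_counts(*corpora: Counter) -> Counter:
    """Sum N-gram counts collected from several corpora."""
    merged = Counter()
    for counts in corpora:
        merged.update(counts)  # Counter.update adds counts instead of overwriting
    return merged


def add_words(counts: Counter, words) -> Counter:
    """Make sure each extra single word is present as a unigram."""
    for word in words:
        key = (word,)
        if counts[key] == 0:
            counts[key] = 1
    return counts


news = Counter({("the",): 5, ("the", "market"): 2})
sport = Counter({("the",): 3, ("the", "match"): 1})
merged = add_words(merge_counts(news, sport), ["referee"])
print(merged[("the",)], merged[("referee",)])
```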
Training the language model with ALM
train.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"reset-unk": true,
"interpolate": true,
"method": "train",
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./output/alm.map",
"r-vocab": "./output/alm.vocab",
"w-arpa": "./output/alm.arpa",
"w-words": "./output/words.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./train.json
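- size: N-gram size, set to 3
- debug: show a progress indicator while data is processed
- allow-unk: allow the 〈unk〉 token in the language model
- reset-unk: reset the frequency of the 〈unk〉 token in the language model
- interpolate: use interpolation in the calculations
- locale: set the environment locale (optional)
- smoothing: use the wittenbell smoothing algorithm
- r-map: sequence map collected at the previous steps
- r-vocab: vocabulary collected at the previous steps
- w-arpa: path for the resulting ARPA language model file
- w-words: save the list of unique words (just in case)
- alphabet: alphabet of the dictionary (optional)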
Python
import alm
# N-gram size is 3
alm.setSize(3)
# Use all available threads
alm.setThreads(0)
# Set the environment locale (optional)
alm.setLocale("en_US.UTF-8")
# Set the dictionary alphabet (optional)
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Look-alike letters from other languages and their replacements
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Reset the frequency of the <unk> token
alm.setOption(alm.options_t.resetUnk)
# Allow correcting words with substituted letters from other languages
alm.setOption(alm.options_t.mixDicts)
# Use interpolation in the calculations
alm.setOption(alm.options_t.interpolate)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusBuildArpa(status):
    print("Build ARPA", status)

def statusWriteMap(status):
    print("Write map", status)

def statusWriteArpa(status):
    print("Write ARPA", status)

def statusWords(status):
    print("Write words", status)

# Load the vocabulary
alm.readVocab("./output/alm.vocab", statusReadVocab)
# Load the sequence map
alm.readMap("./output/alm.map", statusReadMap)
# Build the language model
alm.buildArpa(statusBuildArpa)
# Save the language model in ARPA format
alm.writeArpa("./output/alm.arpa", statusWriteArpa)
# Save the list of unique words
alm.writeWords("./output/words.txt", statusWords)
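For readers unfamiliar with the smoothing option used throughout, here is a minimal sketch of interpolated Witten-Bell smoothing for a bigram model. It uses the textbook formula P(w|h) = (c(h,w) + T(h) * P(w)) / (c(h) + T(h)), where T(h) is the number of distinct words seen after h. It is a toy illustration, not ALM's implementation, and `train_bigram_wb` is a hypothetical name.

```python
from collections import Counter, defaultdict


def train_bigram_wb(tokens):
    """Return prob(word, hist): an interpolated Witten-Bell bigram estimate."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    followers = defaultdict(set)  # distinct words seen after each history
    history = Counter()           # how often each word occurs as a history
    for (hist, word), count in bigrams.items():
        followers[hist].add(word)
        history[hist] += count
    total = sum(unigrams.values())

    def prob(word, hist):
        p_unigram = unigrams[word] / total
        c_hist = history[hist]
        if c_hist == 0:
            return p_unigram  # unseen history: fall back to the unigram estimate
        t_hist = len(followers[hist])
        # Mass proportional to t_hist is handed to the lower-order model
        return (bigrams[(hist, word)] + t_hist * p_unigram) / (c_hist + t_hist)

    return prob


prob = train_bigram_wb("a b a c a b".split())
print(prob("b", "a"), sum(prob(w, "a") for w in "abc"))
```

The more different words follow a history, the more probability mass goes to the lower-order model, so unseen continuations still get a non-zero probability.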
Training the ASC typo corrector
train.json
{
"size": 3,
"debug": 1,
"threads": 0,
"confidence": true,
"mixed-dicts": true,
"method": "train",
"alter": {"":""},
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"w-bin": "./dictionary/3-single.asc",
"r-abbr": "./output/alm.abbr",
"r-vocab": "./output/alm.vocab",
"r-arpa": "./output/alm.arpa",
"abbrs": "./texts/abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alters": "./texts/alters/yoficator.txt",
"upwords": "./texts/words/upp",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
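- size: N-gram size, set to 3
- debug: show a progress indicator while data is processed
- threads: number of threads; 0 uses all available threads
- confidence: load the ARPA data as-is, without re-tokenization
- mixed-dicts: allow correcting words that contain substituted letters from other languages
- alter: alternative letters (letters used in place of the main ones)
- locale: set the environment locale (optional)
- smoothing: use the wittenbell smoothing algorithm
- pilots: pilot words (single-letter words)
- w-bin: path for the resulting binary dictionary
- r-abbr: suffixes of numeric abbreviations collected at the previous steps
- r-vocab: vocabulary collected at the previous steps
- r-arpa: ARPA file built at the previous step
- abbrs: commonly used abbreviations to use in training
- goodwords: pre-prepared whitelist of words
- badwords: pre-prepared blacklist of words
- alters: file with words that contain alternative letters (optional)
- upwords: directory with lists of words that always start with a capital letter (proper names and the like)
- mix-restwords: file with look-alike letters of different languages
- alphabet: alphabet of the dictionary (optional)
- bin-code: language code of the dictionary
- bin-name: name of the dictionary
- bin-author: author of the dictionary
- bin-copyright: copyright holder of the dictionary
- bin-contacts: contact details of the author
- bin-lictype: license type of the dictionary
- bin-lictext: license text of the dictionary
- embedding-size: size of the embedding
- embedding: embedding table that maps each character to an index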
Python
import asc
# N-gram size is 3
asc.setSize(3)
# Use all available threads
asc.setThreads(0)
# Set the environment locale (optional)
asc.setLocale("en_US.UTF-8")
# Allow correcting the case of words
asc.setOption(asc.options_t.uppers)
# Allow the <unk> token
asc.setOption(asc.options_t.allowUnk)
# Reset the frequency of the <unk> token
asc.setOption(asc.options_t.resetUnk)
# Allow correcting words with substituted letters from other languages
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as-is, without re-tokenization
asc.setOption(asc.options_t.confidence)
# Set the dictionary alphabet (optional)
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the pilot words (single-letter words)
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
# Look-alike letters from other languages and their replacements
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Load the whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()
# Load the blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()
# Load the suffixes of numeric abbreviations
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()
# Load commonly used abbreviations
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()
# Load words that always start with a capital letter
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()
# Register an alternative letter
asc.addAlt("", "")
# Load words with alternative letters (optional)
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()

def statusIndex(text, status):
    print(text, status)

def statusBuildIndex(status):
    print("Build index", status)

def statusArpa(status):
    print("Read arpa", status)

def statusVocab(status):
    print("Read vocab", status)

# Load the language model in ARPA format
asc.readArpa("./output/alm.arpa", statusArpa)
# Load the vocabulary
asc.readVocab("./output/alm.vocab", statusVocab)
# Set the language code of the dictionary
asc.setCode("RU")
# Set the license type
asc.setLictype("MIT")
# Set the dictionary name
asc.setName("Russian")
# Set the dictionary author
asc.setAuthor("You name")
# Set the copyright holder
asc.setCopyright("You company LLC")
# Set the license text
asc.setLictext("... License text ...")
# Set the contact details
asc.setContacts("site: https://example.com, e-mail: info@example.com")
# Set the embedding table and its size
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
# Build the index
asc.buildIndex(statusBuildIndex)
# Save the binary dictionary
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
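I understand that not everyone will be able to train their own binary dictionary; it takes text corpora and considerable computing resources. For that reason ASC can also work with just a single ARPA file as its main dictionary.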
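An ARPA file is a plain-text language model: a header with N-gram counts, then one section per order where each line holds a log10 probability, the N-gram itself and an optional back-off weight. The sketch below reads such a file into a dictionary to show what the format carries; it is not how ASC loads the model, and `parse_arpa` and the sample data are hypothetical.

```python
SAMPLE = r"""
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.5 <s> -0.3
-0.7 hello -0.2
-0.9 </s>

\2-grams:
-0.2 <s> hello
-0.4 hello </s>

\end\
"""


def parse_arpa(text):
    """Map each N-gram (tuple of words) to (log10 probability, back-off weight)."""
    ngrams = {}
    order = 0  # order of the section being read; 0 while still in the header
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            continue
        if order == 0:
            continue
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The back-off weight is optional; a missing one is treated as 0.0
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        ngrams[words] = (logprob, backoff)
    return ngrams


model = parse_arpa(SAMPLE)
print(len(model), model[("hello",)])
```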
Working example
spell.json
{
"ad": 13,
"cw": 38120,
"debug": 1,
"threads": 0,
"method": "spell",
"alter": {"":""},
"asc-split": true,
"asc-alter": true,
"confidence": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"mixed-dicts": true,
"asc-wordrep": true,
"spell-verbose": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"upwords": "./texts/words/upp",
"r-arpa": "./dictionary/alm.arpa",
"r-abbr": "./dictionary/alm.abbr",
"abbrs": "./texts/abbrs/abbrs.txt",
"alters": "./texts/alters/yoficator.txt",
"mix-restwords": "./similars/letters.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./spell.json
Python
import asc
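# Use all available threads
asc.setThreads(0)
# Allow correcting the case of words
asc.setOption(asc.options_t.uppers)
# Allow splitting merged words
asc.setOption(asc.options_t.ascSplit)
# Allow replacing alternative letters in words
asc.setOption(asc.options_t.ascAlter)
# Allow splitting merged words that contain errors
asc.setOption(asc.options_t.ascESplit)
# Allow removing splits inside words
asc.setOption(asc.options_t.ascRSplit)
# Allow correcting the case of letters in words
asc.setOption(asc.options_t.ascUppers)
# Allow splitting on hyphens
asc.setOption(asc.options_t.ascHyphen)
# Allow removing repeated words
asc.setOption(asc.options_t.ascWordRep)
# Allow correcting words with substituted letters from other languages
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as-is, without re-tokenization
asc.setOption(asc.options_t.confidence)
# Set the dictionary alphabet (optional)
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the pilot words (single-letter words)
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
# Look-alike letters from other languages and their replacements
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Load the whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()
# Load the blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()
# Load the suffixes of numeric abbreviations
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()
# Load commonly used abbreviations
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()
# Load words that always start with a capital letter
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()
# Register an alternative letter
asc.addAlt("", "")
# Load words with alternative letters (optional)
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()

def statusArpa(status):
    print("Read arpa", status)

def statusIndex(status):
    print("Build index", status)

# Load the language model in ARPA format
asc.readArpa("./dictionary/alm.arpa", statusArpa)
# Set the corpus statistics (38120 words and 13 documents)
asc.setAdCw(38120, 13)
# Set the embedding table and its size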
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
# Build the index
asc.buildIndex(statusIndex)
f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')
for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])
f2.close()
f1.close()
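P.S. For those who do not want to collect and train anything at all, I have put up a web version of ASC. Keep in mind as well that a typo correction system is not omniscient, and it is impossible to feed the entire Russian language into it. ASC will not correct just any text, so it needs to be trained separately for each subject area.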