Hello! Today we continue the topic of clustering and classification of large text data with machine learning in Java. This article is a continuation of the first one.
It covers both the theory and the implementation of the algorithms I used.
1. Tokenization
Theory:
Tokenization is the first stage of processing: the incoming stream of text is split into separate units, tokens (in the simplest case, words). Token boundaries are usually defined by whitespace and punctuation, but the rules are less trivial than they look: abbreviations, hyphenated and compound words, numbers, dates, quotation marks and parenthesised fragments all need special handling, and the choice of rules affects every later stage of the pipeline.
In this work the source texts are extracted from PDF files, so in addition to ordinary words the stream contains extraction artifacts (broken words, stray characters, service symbols) that have to be filtered out later. The output of this stage is a sequence of candidate terms that is passed on to stop-word filtering and lemmatization.
Code:
Iterator<String> finalIterator = new WordIterator(reader);

import java.io.BufferedReader;
import java.io.IOError;
import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Iterates over whitespace-separated tokens, reading the input line by line.
public class WordIterator implements Iterator<String> {
    private static final Pattern notWhiteSpace = Pattern.compile("\\S+");
    private final BufferedReader br;
    private Matcher matcher;
    private String curLine;
    private String next;

    public WordIterator(BufferedReader br) {
        this.br = br;
        curLine = null;
        advance();
    }

    // Reads ahead to the next token, pulling in new lines as needed.
    private void advance() {
        try {
            while (true) {
                if (curLine == null || !matcher.find()) {
                    String line = br.readLine();
                    if (line == null) { // end of input: no more tokens
                        next = null;
                        br.close();
                        return;
                    }
                    matcher = notWhiteSpace.matcher(line);
                    curLine = line;
                    if (!matcher.find())
                        continue; // skip lines without tokens
                }
                next = curLine.substring(matcher.start(), matcher.end());
                break;
            }
        } catch (IOException ioe) {
            throw new IOError(ioe);
        }
    }

    public boolean hasNext() { return next != null; }

    public String next() {
        String token = next; // null once the input is exhausted
        advance();
        return token;
    }
}
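For completeness, a small usage sketch (an illustration only; "document.txt" is a placeholder file name, and the java.io / java.util imports are omitted):
public static void main(String[] args) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader("document.txt"));
    Iterator<String> tokens = new WordIterator(reader);
    while (tokens.hasNext())
        System.out.println(tokens.next()); // one token per line
}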
2. Stop words
Theory:
In any language there are words that occur in almost every text and carry little meaning of their own: articles, prepositions, conjunctions, pronouns, auxiliary verbs and other function words such as "a", "the", "of", "and", "is". Such words are called stop words; the concept is usually traced back to Hans Peter Luhn's work of 1958. Because they appear in nearly every document, they are useless for telling one document from another, so they are removed before building the term-document representation.
There is no single universal stop list: its contents depend on the language, the subject area and the task, and for a particular collection it is often extended with frequent domain-specific words that do not help to distinguish documents. Removing stop words noticeably reduces the dictionary size and the dimensionality of the feature space without hurting the quality of clustering or classification.
There are several approaches to building a stop list:
- The classic approach: take a ready-made, manually compiled stop list (general-purpose or adapted to the domain).
- Methods based on Zipf's law ("Z-methods"): remove the most frequent terms (TF-High), terms that occur only once in the collection (TF1) and terms with a low inverse document frequency (IDF).
- Mutual information (MI): in a supervised setting the mutual information between a term and a class is computed; terms with low mutual information contribute little to distinguishing the classes and are treated as stop-word candidates.
- Term-Based Random Sampling (TBRS): a method in which stop words are detected directly from the documents. It iterates over randomly selected chunks of data and ranks the features in each chunk by how informative they are, using the Kullback-Leibler divergence measure, as in the following equation:
d_x(t) = P_x(t) · log2( P_x(t) / P(t) )
where P_x(t) is the normalized frequency of term t within chunk x,
and P(t) is the normalized frequency of term t in the entire collection.
The final stop list is then built by taking the least informative terms across all documents and removing any duplicates (a small illustrative sketch of the ranking step is given below).
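To make the ranking step more concrete, here is a minimal sketch (an illustration only, not part of the system described in this article). It assumes that raw term counts for one sampled chunk and for the whole collection are already available as maps; the class and method names are invented for the example.
import java.util.HashMap;
import java.util.Map;

// Scores each term of one sampled chunk with the Kullback-Leibler weight
// d_x(t) = P_x(t) * log2(P_x(t) / P(t)); the lowest-scoring (least informative)
// terms are stop-word candidates.
public class TbrsScorer {
    public static Map<String, Double> klWeights(Map<String, Integer> chunkCounts,
                                                Map<String, Integer> corpusCounts) {
        double chunkTotal = 0, corpusTotal = 0;
        for (int c : chunkCounts.values()) chunkTotal += c;
        for (int c : corpusCounts.values()) corpusTotal += c;

        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : chunkCounts.entrySet()) {
            // assumes every term of the chunk also occurs in the collection counts
            double px = e.getValue() / chunkTotal;                 // P_x(t): normalized frequency in the chunk
            double p = corpusCounts.get(e.getKey()) / corpusTotal; // P(t): normalized frequency in the collection
            weights.put(e.getKey(), px * (Math.log(px / p) / Math.log(2)));
        }
        return weights;
    }
}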
Code:
TokenFilter filter = new TokenFilter().loadFromResource("stopwords.txt");
if (!filter.accept(token)) continue;

import java.io.BufferedReader;
import java.io.IOError;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;

// Rejects tokens that are present in a stop-word list loaded from a classpath resource.
public class TokenFilter {
    private Set<String> tokens;
    private boolean excludeTokens;
    private TokenFilter parent;

    // Loads the stop-word list (one word per line) from the classpath.
    public TokenFilter loadFromResource(String fileName) {
        try {
            ClassLoader classLoader = getClass().getClassLoader();
            InputStream is = classLoader.getResourceAsStream(fileName);
            BufferedReader br = new BufferedReader(
                new InputStreamReader(is, Charset.defaultCharset()));
            Set<String> words = new HashSet<String>();
            for (String line = null; (line = br.readLine()) != null; )
                words.add(line.trim());
            br.close();
            this.tokens = words;
            this.excludeTokens = true; // the loaded words are treated as an exclusion list
            this.parent = null;
        } catch (Exception e) {
            throw new IOError(e);
        }
        return this;
    }

    // A token is accepted if it passes the parent filter (when one is set), is not a stop word,
    // is longer than two characters and consists of letters only.
    public boolean accept(String token) {
        token = token.toLowerCase().replaceAll("[\\. \\d]", "");
        return (parent == null || parent.accept(token))
            && (tokens.contains(token) ^ excludeTokens)
            && token.length() > 2
            && token.matches("[a-z-]+"); // the original character class was lost; Latin letters and hyphens assumed
    }
}
File:
....
3. Lemmatization
Theory:
After tokenization and stop-word removal the same word still appears in many inflected forms, which inflates the dictionary and spreads the statistics of what is essentially one term over several entries.
Lemmatization is the process of reducing the various inflected forms of a word to a single base (dictionary) form, the lemma, using morphological analysis of the word and, where necessary, its part of speech. For example, working, works and work are all represented by the single form work, and computers, computing and computer by compute. Unlike stemming, which simply truncates word endings by heuristic rules, lemmatization relies on a vocabulary and morphological rules and returns a real word, which makes the resulting features easier to interpret and compare.
Over the years, many tools providing lemmatization have been developed. Although they use different processing approaches, they all rely on a word lexicon, a set of rules, or a combination of both as resources for morphological analysis. The best-known lemmatization tools are:
- WordNet ‒ a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms. Its morphological processor can serve as a lemmatizer: an inflected form is reduced to a base form found in the WordNet dictionary with the help of a small set of detachment rules and exception lists.
- CLEAR ‒ an NLP toolkit whose morphological analyzer produces lemmas using dictionaries (including WordNet data) together with its own rules; the lemmatizer is part of a larger pipeline that also performs tokenization, POS tagging and parsing.
- GENIA tagger ‒ a part-of-speech tagger developed for biomedical texts that outputs base forms along with POS tags. Its models were trained on the GENIA and PennBioIE corpora, so it performs best on texts from that domain, and the quality of its lemmas depends on the quality of the POS tags.
- TreeTagger ‒ a tool for annotating text with part-of-speech and lemma information; it uses decision trees to estimate tag probabilities and supports a large number of languages. As with the GENIA tagger, correct lemmas depend on correct POS tags.
- Norm / LuiNorm ‒ normalization tools from the UMLS SPECIALIST lexical tools. They reduce words to a canonical form, handling case, inflection, spelling variants and punctuation, and are widely used for biomedical and medical text.
- MorphAdorner ‒ a Java toolkit that "adorns" texts with lemmas, standard spellings and POS tags; it was designed with early modern English in mind but handles contemporary English as well.
- morpha ‒ a morphological analyzer and lemmatizer for English built from roughly 1,400 rules (implemented with flex) plus an exception list derived from WordNet containing about 5,000 to 6,000 entries; despite its simplicity, morpha is fast and accurate.
Code:
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Pipeline that tokenizes, splits sentences, tags parts of speech and produces lemmas.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

String token = documentTokens.next().replaceAll("[^a-zA-Z]", "").toLowerCase(); // documentTokens: the token iterator from the tokenization step
Annotation lemmaText = new Annotation(token);
pipeline.annotate(lemmaText);
List<CoreLabel> lemmaToken = lemmaText.get(TokensAnnotation.class);
String word = "";
for (CoreLabel t : lemmaToken) {
    word = t.get(LemmaAnnotation.class); // the lemma of the token
}
4. TF-IDF
Theory:
Term Frequency-Inverse Document Frequency (TF-IDF) is the most widely used algorithm for computing the weight of a term (a keyword of a document) in modern information-retrieval systems. The weight is a statistical measure of how important a word is to a document that belongs to a collection of documents (a corpus). The value grows in proportion to the number of times the word appears in the document and is offset by how frequently the word occurs in the corpus as a whole.
Term frequency (TF) measures how often a term occurs in a document. Because documents differ greatly in length, the raw count is usually normalized in one way or another; several common weighting schemes are listed below. In the simplest case TF is just the number of occurrences of term t in document D:
tf(t, D) = f(t, D),
where f(t, D) is the raw count of term t in document D.
Other frequently used variants are:
- boolean "frequency": tf(t, D) = 1 if t occurs in D and 0 otherwise;
- frequency normalized by the length of the document:
tf(t, D) = f(t, D) / Σ_{t' ∈ D} f(t', D)
- logarithmically scaled frequency:
tf(t, D) = log(1 + f(t, D))
- augmented frequency, where the count is divided by the frequency of the most frequent term in the document to avoid a bias towards longer documents:
tf(t, D) = 0.5 + 0.5 · f(t, D) / max{ f(t', D) : t' ∈ D }
Inverse document frequency (IDF) measures how much information a term carries, that is, whether it is common or rare across the whole collection. It is the logarithm of the ratio of the total number of documents N to the number of documents that contain the term:
idf(t, D) = log( N / |{ d ∈ D : t ∈ d }| )
TF-IDF is the product of the two measures. It takes its largest values for terms that occur often inside a particular document but rarely in the rest of the collection, so the weighting naturally filters out terms that are common everywhere. TF-IDF is computed as:
tfidf(t, D) = tf(t, D) · idf(t, D)
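As a purely illustrative calculation (the numbers are invented, not taken from the corpus used in this work): if the word cat occurs 3 times in a document of 100 words, then tf = 3 / 100 = 0.03; if 1,000 documents out of a collection of 10,000,000 contain cat, then idf = log(10,000,000 / 1,000) = 4 (decimal logarithm), and the resulting weight is tfidf = 0.03 · 4 = 0.12.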
Code:
import gnu.trove.map.TObjectIntMap;
import gnu.trove.map.hash.TObjectIntHashMap;

// Term counter; the wrapper class is reconstructed around the original count() method.
public class Counter<T> {
    private final TObjectIntMap<T> counts = new TObjectIntHashMap<T>();
    private int sum; // total number of counted objects

    public int count(T obj) {
        int count = counts.get(obj);
        count++;
        counts.put(obj, count);
        sum++;
        return count;
    }
}
public synchronized int addColumn(SparseArray<? extends Number> column) {
if (column.length() > numRows)
numRows = column.length();
int[] nonZero = column.getElementIndices();
nonZeroValues += nonZero.length;
try {
matrixDos.writeInt(nonZero.length);
for (int i : nonZero) {
matrixDos.writeInt(i); // write the row index
matrixDos.writeFloat(column.get(i).floatValue());
}
} catch (IOException ioe) {
throw new IOError(ioe);
}
return ++curCol;
}
public interface SparseArray<T> {
int cardinality();
T get(int index);
int[] getElementIndices();
int length();
void set(int index, T obj);
<E> E[] toArray(E[] array);
}
// Applies a global transform (e.g. the TF-IDF weighting below) to every value
// of a dense binary matrix file and writes the result to outFile.
public File transform(File inputFile, File outFile, GlobalTransform transform) {
try {
DataInputStream dis = new DataInputStream(
new BufferedInputStream(new FileInputStream(inputFile)));
int rows = dis.readInt();
int cols = dis.readInt();
DataOutputStream dos = new DataOutputStream(
new BufferedOutputStream(new FileOutputStream(outFile)));
dos.writeInt(rows);
dos.writeInt(cols);
for (int row = 0; row < rows; ++row) {
for (int col = 0; col < cols; ++col) {
double val = dis.readFloat();
dos.writeFloat((float) transform.transform(row, col, val));
}
}
dos.close();
return outFile;
} catch (IOException ioe) {
throw new IOError(ioe);
}
}
// TF-IDF weighting of one matrix cell: rows correspond to terms, columns to documents.
public double transform(int row, int column, double value) {
    // term frequency normalized by the length of the document
    double tf = value / docTermCount[column];
    // inverse document frequency; the cast guards against integer division, +1 avoids division by zero
    double idf = Math.log((double) totalDocCount / (termDocCount[row] + 1));
    return tf * idf;
}
// Runs the external SVDLIBC "svd" binary on the given matrix file and loads
// the resulting U, S and V^T matrices back into memory.
public void factorize(MatrixFile mFile, int dimensions) {
try {
String formatString = "";
switch (mFile.getFormat()) {
case SVDLIBC_DENSE_BINARY:
formatString = " -r db ";
break;
case SVDLIBC_DENSE_TEXT:
formatString = " -r dt ";
break;
case SVDLIBC_SPARSE_BINARY:
formatString = " -r sb ";
break;
case SVDLIBC_SPARSE_TEXT:
break;
default:
throw new UnsupportedOperationException(
"Format type is not accepted");
}
File outputMatrixFile = File.createTempFile("svdlibc", ".dat");
outputMatrixFile.deleteOnExit();
String outputMatrixPrefix = outputMatrixFile.getAbsolutePath();
LOG.fine("creating SVDLIBC factor matrices at: " +
outputMatrixPrefix);
String commandLine = "svd -o " + outputMatrixPrefix + formatString +
" -w dt " +
" -d " + dimensions + " " + mFile.getFile().getAbsolutePath();
LOG.fine(commandLine);
Process svdlibc = Runtime.getRuntime().exec(commandLine);
BufferedReader stdout = new BufferedReader(
new InputStreamReader(svdlibc.getInputStream()));
BufferedReader stderr = new BufferedReader(
new InputStreamReader(svdlibc.getErrorStream()));
StringBuilder output = new StringBuilder("SVDLIBC output:\n");
for (String line = null; (line = stderr.readLine()) != null; ) {
output.append(line).append("\n");
}
LOG.fine(output.toString());
int exitStatus = svdlibc.waitFor();
LOG.fine("svdlibc exit status: " + exitStatus);
if (exitStatus == 0) {
File Ut = new File(outputMatrixPrefix + "-Ut");
File S = new File(outputMatrixPrefix + "-S");
File Vt = new File(outputMatrixPrefix + "-Vt");
U = MatrixIO.readMatrix(
Ut, Format.SVDLIBC_DENSE_TEXT,
Type.DENSE_IN_MEMORY, true); // U
scaledDataClasses = false;
V = MatrixIO.readMatrix(
Vt, Format.SVDLIBC_DENSE_TEXT,
Type.DENSE_IN_MEMORY); // V
scaledClassFeatures = false;
singularValues = readSVDLIBCsingularVector(S, dimensions);
} else {
    // stderr has already been fully consumed above, so log the captured output
    LOG.warning("svdlibc exited with error status. " +
        "output:\n" + output.toString());
}
} catch (IOException ioe) {
LOG.log(Level.SEVERE, "SVDLIBC", ioe);
} catch (InterruptedException ie) {
LOG.log(Level.SEVERE, "SVDLIBC", ie);
}
}
public MatrixBuilder getBuilder() {
return new SvdlibcSparseBinaryMatrixBuilder();
}
private static double[] readSVDLIBCsingularVector(File sigmaMatrixFile,
int dimensions)
throws IOException {
BufferedReader br = new BufferedReader(new FileReader(sigmaMatrixFile));
double[] m = new double[dimensions];
int readDimensions = Integer.parseInt(br.readLine());
if (readDimensions != dimensions)
throw new RuntimeException(
"SVDLIBC generated the incorrect number of " +
"dimensions: " + readDimensions + " versus " + dimensions);
int i = 0;
for (String line = null; (line = br.readLine()) != null; )
m[i++] = Double.parseDouble(line);
return m;
}
The SVD implementation in Java is taken from the S-Space package.
5. Aylien API
The Aylien Text Analysis API is a web API for analysing natural-language text.
It offers a number of functions: article extraction, summarization, entity extraction, sentiment analysis, language detection, classification and others; in this work only one of them is used, classification.
Classification can be performed against the IPTC news codes and, as is done here, against the IAB-QAG contextual taxonomy.
The IAB-QAG contextual taxonomy was developed by the IAB (Interactive Advertising Bureau) together with taxonomy experts from academia to define content categories on at least two levels, making content classification more consistent. The first level is a broad category, and the second gives a more detailed description of the root categories (Fig. 6).
To use this API you need to obtain a key and an ID on the official site. Then, with these credentials, you can call the POST and GET methods from Java code.
private static TextAPIClient client = new TextAPIClient("<application id>", "<application key>");
Classification can then be used by passing in the data to classify.
ClassifyByTaxonomyParams.Builder builder = ClassifyByTaxonomyParams.newBuilder();
URL url = new URL("http://techcrunch.com/2015/07/16/microsoft-will-never-give-up-on-mobile");
builder.setUrl(url);
builder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(builder.build());
for (TaxonomyCategory c: response.getCategories()) {
System.out.println(c);
}
The response from the service is returned in JSON format:
{
"categories": [
{
"confident": true,
"id": "IAB19-36",
"label": "Windows",
"links": [
{
"link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19-36",
"rel": "self"
},
{
"link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
"rel": "parent"
}
],
"score": 0.5675236066291172
},
{
"confident": true,
"id": "IAB19",
"label": "Technology & Computing",
"links": [
{
"link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
"rel": "self"
}
],
"score": 0.46704140928338533
}
],
"language": "en",
"taxonomy": "iab-qag",
"text": "When Microsoft announced its wrenching..."
}
This API is used to classify the clusters that will be obtained with the unsupervised-learning clustering methods.
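Since the clusters themselves are plain text rather than URLs, the same call can presumably be made with raw text. A minimal sketch, assuming the SDK's builder also exposes setText alongside setUrl (clusterText is a hypothetical variable holding the representative text of one cluster):
ClassifyByTaxonomyParams.Builder builder = ClassifyByTaxonomyParams.newBuilder();
builder.setText(clusterText); // hypothetical variable: representative text of one cluster
builder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(builder.build());
for (TaxonomyCategory c : response.getCategories()) {
    System.out.println(c); // same output format as in the URL example above
}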
Afterword
There are alternatives and ready-made libraries for each of the algorithms described above; you only need to look for them. If you liked the article, or if you have ideas or questions, please leave a comment. The third part will be mostly a summary, devoted to the architecture of the whole system: a description of the algorithms, what is used and in what order.
It will also show the results produced after each algorithm is applied, as well as the final result of the whole work.