Clustering and classification of large text data with machine learning in Java. Part 2: Algorithms




Hi! Today I continue the topic of clustering and classifying large text data with machine learning in Java. This article is a continuation of the first one.





It covers the theory behind the algorithms I used as well as their implementation.





1. Tokenization



Theory:



Tokenization is the process of breaking a stream of text into tokens: words, phrases or other meaningful elements that become the input for the subsequent processing steps (filtering, lemmatization, weighting and so on).

At first glance the task looks trivial, since words are separated by spaces, but punctuation makes it ambiguous: a period can end a sentence, mark an abbreviation or sit inside a number; apostrophes and hyphens may or may not belong to the token; digits and letter case also have to be handled somewhere in the pipeline.

In this project the source documents are plain text obtained from PDF files, so the text can also carry conversion artifacts that have to be tolerated. For this work a simple approach is sufficient: the text is read line by line, and each line is split into tokens on whitespace; punctuation, digits and case are dealt with later by the token filter and the lemmatizer.





Code:



Iterator<String> finalIterator = new WordIterator(reader);


// tokenizer state: a pattern matching runs of non-whitespace, the current line,
// its matcher, and the next token to be returned
private static final Pattern notWhiteSpace = Pattern.compile("\\S+");
private final BufferedReader br;
private Matcher matcher;
private String curLine;
private String next;

public WordIterator(BufferedReader br) {
    this.br = br;
    curLine = null;
    advance();
}

// Reads ahead to the next whitespace-delimited token, pulling new lines from the
// reader as needed; sets next to null (and closes the reader) at end of input.
private void advance() {
    try {
        while (true) {
            if (curLine == null || !matcher.find()) {
                String line = br.readLine();
                if (line == null) {
                    next = null;
                    br.close();
                    return;
                }
                matcher = notWhiteSpace.matcher(line);
                curLine = line;
                if (!matcher.find())
                    continue;
            }
            next = curLine.substring(matcher.start(), matcher.end());
            break;
        }
    } catch (IOException ioe) {
        throw new IOError(ioe);
    }
}
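
For completeness, here is a sketch of the hasNext()/next() pair that advance() is written to support; these methods are not part of the excerpt above, so their exact original wording is an assumption:

public boolean hasNext() {
    return next != null;
}

public String next() {
    if (next == null)
        throw new java.util.NoSuchElementException();
    String token = next;   // return the token prepared by the previous advance()
    advance();             // and read ahead to the following one
    return token;
}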


2. Stop words



Theory:



Stop words are words that carry little independent meaning and are filtered out before further processing: articles, prepositions, conjunctions, pronouns, auxiliary verbs and the like (a, the, and, of, in, is, it and so on). The idea of discarding such words goes back to Hans Peter Luhn, one of the pioneers of information retrieval, around 1958. Removing them sharply reduces the size of the index and of the feature space, but it has to be done with care: in phrase queries such as "to be or not to be" almost every word is a stop word, so dropping them blindly destroys the meaning.

There is no universal stop list; its contents depend on the language, the domain and the task at hand. In this work the list is kept in a plain text file, and every token produced by the tokenizer is checked against it before being passed on.

The main approaches to building and applying a stop list are the following (a rough sketch of the frequency-based variants is given right after the list):



  • The classic method: tokens are filtered against a precompiled stop list. Ready-made lists exist for most languages and can be used as-is, trimmed, or extended for a particular domain.
  • Methods based on Zipf's law: remove the most frequent terms of the collection (TF-High), terms that occur only once in the whole collection (TF1), and terms with a low inverse document frequency (IDF).
  • Mutual information (MI): the mutual information between a term and a document class is computed; terms with a low value contribute little to distinguishing the classes and can be treated as stop words.
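
A rough sketch of those frequency-based heuristics (the method and its parameters are illustrative, not code from the project):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Flags stop-word candidates by simple corpus statistics: TF-High, TF1 and low IDF. */
static Set<String> frequencyStopWords(List<List<String>> documents,
                                      int topTerms, double maxDocFraction) {
    Map<String, Integer> termFreq = new HashMap<>();   // total occurrences of each term
    Map<String, Integer> docFreq = new HashMap<>();    // number of documents containing the term
    for (List<String> doc : documents) {
        for (String t : doc)
            termFreq.merge(t, 1, Integer::sum);
        for (String t : new HashSet<>(doc))
            docFreq.merge(t, 1, Integer::sum);
    }

    Set<String> stop = new HashSet<>();
    // TF1: terms that occur exactly once in the whole collection
    termFreq.forEach((t, f) -> { if (f == 1) stop.add(t); });
    // low IDF: terms present in more than maxDocFraction of all documents
    docFreq.forEach((t, df) -> { if (df > maxDocFraction * documents.size()) stop.add(t); });
    // TF-High: the topTerms most frequent terms overall
    termFreq.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(topTerms)
            .forEach(e -> stop.add(e.getKey()));
    return stop;
}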


Term-based random sampling (TBRS): a method for detecting stop words directly from the documents themselves. It iterates over randomly selected chunks of the collection and ranks the terms of each chunk by how informative they are, using the Kullback-Leibler divergence measure shown in the following equation:



d_x(t) = P_x(t) · log2( P_x(t) / P(t) )



where P_x(t) is the normalized frequency of term t within chunk x,

and P(t) is the normalized frequency of term t in the whole collection.

The final stop list is then built by taking the least informative terms over all documents and removing any duplicates.
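
A minimal sketch of the per-chunk weighting step of TBRS, assuming the normalized frequencies P_x(t) and P(t) have already been computed (all names here are illustrative):

import java.util.HashMap;
import java.util.Map;

/** Ranks the terms of one randomly sampled chunk by the Kullback-Leibler measure d_x(t). */
static Map<String, Double> rankChunkTerms(Map<String, Double> chunkFreq,    // P_x(t), normalized within the chunk
                                          Map<String, Double> corpusFreq) { // P(t), normalized over the collection
    Map<String, Double> weights = new HashMap<>();
    for (Map.Entry<String, Double> e : chunkFreq.entrySet()) {
        double px = e.getValue();
        double p = corpusFreq.getOrDefault(e.getKey(), Double.MIN_VALUE);
        // d_x(t) = P_x(t) * log2( P_x(t) / P(t) )
        weights.put(e.getKey(), px * (Math.log(px / p) / Math.log(2)));
    }
    return weights;   // the lowest-weighted (least informative) terms are stop-word candidates
}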





Code:

TokenFilter filter = new TokenFilter().loadFromResource("stopwords.txt");
if (!filter.accept(token)) continue;


private Set<String> tokens;
private boolean excludeTokens;
private TokenFilter parent;

public TokenFilter loadFromResource(String fileName) {
		try {
			ClassLoader classLoader = getClass().getClassLoader();
			String str = IOUtils.toString(
					classLoader.getResourceAsStream(fileName),
					Charset.defaultCharset());
			InputStream is = new ByteArrayInputStream(str.getBytes());
			BufferedReader br = new BufferedReader(new InputStreamReader(is));

			Set<String> words = new HashSet<String>();
			for (String line = null; (line = br.readLine()) != null;)
				words.add(line);
			br.close();

			this.tokens = words;
			this.excludeTokens = true;
			this.parent = null;
		} catch (Exception e) {
			throw new IOError(e);
		}
		return this;
	}
public boolean accept(String token) {
		// normalize: lower-case and strip periods, spaces and digits before testing
		token = token.toLowerCase().replaceAll("[\\. \\d]", "");
		// keep the token only if the parent filter (if any) accepts it, it is not in the
		// stop list (excludeTokens == true), it is longer than two characters and it
		// consists of letters and hyphens only
		return (parent == null || parent.accept(token))
				&& tokens.contains(token) ^ excludeTokens && token.length() > 2
				&& token.matches("^[a-z-]+$");
	}


The stop-word file (stopwords.txt) is a plain list of words, one per line:

....


3. Lemmatization



Theory:



After tokenization and stop-word filtering the remaining tokens still appear in many inflected forms, so the next step is to bring them to a common base form.



Lemmatization is the process of grouping the inflected forms of a word so that they can be analysed as a single item identified by the word's lemma, its dictionary form. Unlike stemming, it does not simply chop off endings: it relies on a vocabulary and on morphological analysis, and it normally takes the part of speech and the context of the word into account. For example, the forms working, works and worked are all reduced to the lemma work, whereas a stemmer would reduce computers, computing and computer to a truncated root such as comput(e) without checking whether the result is a real word. Because lemmatization yields genuine dictionary words, the resulting terms are easier to interpret and to match against lexical resources.



Over the years many tools providing lemmatization have been developed. Although they use different processing methods, they all rely on a word dictionary, a set of rules, or a combination of the two as the resource for morphological analysis. The best-known lemmatization tools are:



  • WordNet ‒ a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets). Its built-in morphological processor applies a small set of detachment rules together with an exception list to map an inflected form onto a lemma that exists in the dictionary, and several Java libraries expose this functionality.
  • CLEAR ‒ the morphological analyzer of the ClearNLP toolkit. It combines dictionary data (partly derived from WordNet) with its own rules and is designed to be used as a component of larger NLP pipelines (tagging, parsing and so on).
  • GENIA ‒ a POS tagger for biomedical text that also outputs the base form of every token. It was trained on newswire and biomedical corpora (including GENIA and PennBioIE), so it performs well on domain-specific text, although its lexicon is smaller than WordNet's.
  • TreeTagger ‒ a language-independent tool that annotates text with part-of-speech and lemma information using a parameter file (lexicon); parameter files exist for many languages. Words missing from the lexicon are left without a lemma, so coverage depends on the lexicon used.
  • Norm / LuiNorm ‒ normalization tools from the SPECIALIST lexical tools of the UMLS (Unified Medical Language System). They map a word to a canonical form by lower-casing it and stripping punctuation and inflection, which suits biomedical vocabulary well; they do not take the part of speech into account.
  • MorphAdorner – a Java toolkit developed at Northwestern University that "adorns" texts with lemmas, standard spellings and POS tags; it handles both modern and early-modern English.
  • morpha – a robust morphological analyzer for English built from roughly 1,400 rules compiled into a finite-state machine, supplemented by an exception list of several thousand entries derived in part from WordNet. It is fast and is used in many NLP pipelines.


Code:



Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

String token = documentTokens.next().replaceAll("[^a-zA-Z]", "").toLowerCase();
Annotation lemmaText = new Annotation(token);
pipeline.annotate(lemmaText);
List<CoreLabel> lemmaToken = lemmaText.get(TokensAnnotation.class);
String word = "";
for (CoreLabel t : lemmaToken) {
    word = t.get(LemmaAnnotation.class);  // the lemma (dictionary form) of the token
}
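
Wrapped into a small reusable class, the same pipeline lemmatizes an arbitrary string; this is only a sketch of how the snippet above could be packaged, not code from the project:

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class Lemmatizer {
    private final StanfordCoreNLP pipeline;

    public Lemmatizer() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        pipeline = new StanfordCoreNLP(props);
    }

    /** Returns the lemma of every token in the given text. */
    public List<String> lemmatize(String text) {
        Annotation document = new Annotation(text);
        pipeline.annotate(document);
        List<String> lemmas = new ArrayList<>();
        for (CoreLabel token : document.get(TokensAnnotation.class))
            lemmas.add(token.get(LemmaAnnotation.class));
        return lemmas;
    }
}

For example, new Lemmatizer().lemmatize("computers are computing") should return something like [computer, be, compute].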


4. TF-IDF



Theory:



Term frequency - inverse document frequency (TF-IDF) is the most widely used algorithm for computing the weight of a term (a keyword of a document) in modern information retrieval systems. The weight is a statistical measure of how important a word is to a document within a collection or corpus: it grows proportionally to the number of times the word appears in the document and is offset by how frequent the word is in the corpus as a whole.

...

Term frequency (TF) measures how often a term occurs in a document. Because documents differ greatly in length, a term can occur far more times in a long document than in a short one, so the raw count is usually normalized in one way or another. In its simplest form TF is just the raw count of term t in document D:



tf(t,D) = f(t,D),



where f(t,D) is the number of times term t occurs in document D.

Other common variants of the term-frequency weight are:

boolean "frequency": tf(t,D) = 1 if t occurs in D and 0 otherwise;

frequency normalized by the length of the document:



tf(t,D) = f(t,D) / Σ_{t'∈D} f(t',D)



logarithmically scaled frequency:



tf(t,D) = log( 1 + f(t,D) )



augmented frequency, in which the raw count is divided by the count of the most frequent term in the document, to prevent a bias toward longer documents:



tf(t,D) = 0.5 + 0.5 · f(t,D) / max{ f(t',D) : t' ∈ D }



Inverse document frequency (IDF) measures how much information the term provides, that is, how common or rare it is across the whole collection of N documents. It is the logarithm of the total number of documents divided by the number of documents that contain the term:



idf(t,D) = log( N / |{ d ∈ D : t ∈ d }| )



TF-IDF is then the product of the two measures. A high weight is obtained when the term occurs frequently in the given document but rarely in the rest of the collection, so the weights naturally filter out common terms:



tfidf(t,D) = tf(t,D) · idf(t,D)
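
A quick worked example with made-up numbers: if a term occurs 5 times in a document of 100 words, tf = 5/100 = 0.05; if the collection contains 1,000 documents and 10 of them contain the term, idf = log(1000/10) = 2 (base-10 logarithm), so tfidf = 0.05 · 2 = 0.1. A term that occurs in almost every document gets an idf close to zero and therefore a negligible weight no matter how frequent it is.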





Code:

private final TObjectIntMap<T> counts;   // per-term occurrence counts (Trove primitive map)
private int sum;                         // total number of tokens counted so far

public int count(T obj) {
    int count = counts.get(obj);
    count++;
    counts.put(obj, count);
    sum++;
    return count;
}


// Appends one document, represented as a sparse column of term weights, to the matrix being built on disk.
public synchronized int addColumn(SparseArray<? extends Number> column) {
     if (column.length() > numRows)
         numRows = column.length();
    
     int[] nonZero = column.getElementIndices();
     nonZeroValues += nonZero.length;
     try {
         matrixDos.writeInt(nonZero.length);
         for (int i : nonZero) {
             matrixDos.writeInt(i); // write the row index
             matrixDos.writeFloat(column.get(i).floatValue());
         }
     } catch (IOException ioe) {
         throw new IOError(ioe);
     }
     return ++curCol;
}


public interface SparseArray<T> {
    int cardinality();
    T get(int index);
    int[] getElementIndices();
    int length();
    void set(int index, T obj);
    <E> E[] toArray(E[] array);
}


public File transform(File inputFile, File outFile, GlobalTransform transform) {
     try {
         DataInputStream dis = new DataInputStream(
             new BufferedInputStream(new FileInputStream(inputFile)));
         int rows = dis.readInt();
         int cols = dis.readInt();
         DataOutputStream dos = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream(outFile)));
         dos.writeInt(rows);
         dos.writeInt(cols);
         for (int row = 0; row < rows; ++row) {
             for (int col = 0; col < cols; ++col) {
                 double val = dis.readFloat();
                 dos.writeFloat((float) transform.transform(row, col, val));
             }
         }
         dos.close();
         return outFile;
     } catch (IOException ioe) {
         throw new IOError(ioe);
     }
}

public double transform(int row, int column, double value) {
        // term frequency: the raw count normalized by the total number of terms in the document (column)
        double tf = value / docTermCount[column];
        // inverse document frequency: log of the total document count over the number of documents containing the term (row)
        double idf = Math.log(totalDocCount / (termDocCount[row] + 1));
        return tf * idf;
}


public void factorize(MatrixFile mFile, int dimensions) {
        try {
            String formatString = "";
            switch (mFile.getFormat()) {
            case SVDLIBC_DENSE_BINARY:
                formatString = " -r db ";
                break;
            case SVDLIBC_DENSE_TEXT:
                formatString = " -r dt ";
                break;
            case SVDLIBC_SPARSE_BINARY:
                formatString = " -r sb ";
                break;
            case SVDLIBC_SPARSE_TEXT:
                break;
            default:
                throw new UnsupportedOperationException(
                    "Format type is not accepted");
            }

            File outputMatrixFile = File.createTempFile("svdlibc", ".dat");
            outputMatrixFile.deleteOnExit();
            String outputMatrixPrefix = outputMatrixFile.getAbsolutePath();

            LOG.fine("creating SVDLIBC factor matrices at: " + 
                              outputMatrixPrefix);
            String commandLine = "svd -o " + outputMatrixPrefix + formatString +
                " -w dt " + 
                " -d " + dimensions + " " + mFile.getFile().getAbsolutePath();
            LOG.fine(commandLine);
            Process svdlibc = Runtime.getRuntime().exec(commandLine);
            BufferedReader stdout = new BufferedReader(
                new InputStreamReader(svdlibc.getInputStream()));
            BufferedReader stderr = new BufferedReader(
                new InputStreamReader(svdlibc.getErrorStream()));

            StringBuilder output = new StringBuilder("SVDLIBC output:\n");
            for (String line = null; (line = stderr.readLine()) != null; ) {
                output.append(line).append("\n");
            }
            LOG.fine(output.toString());
            
            int exitStatus = svdlibc.waitFor();
            LOG.fine("svdlibc exit status: " + exitStatus);

            if (exitStatus == 0) {
                File Ut = new File(outputMatrixPrefix + "-Ut");
                File S  = new File(outputMatrixPrefix + "-S");
                File Vt = new File(outputMatrixPrefix + "-Vt");
                U = MatrixIO.readMatrix(
                        Ut, Format.SVDLIBC_DENSE_TEXT, 
                        Type.DENSE_IN_MEMORY, true); // the U matrix (left singular vectors)
                scaledDataClasses = false; 
                
                V = MatrixIO.readMatrix(
                        Vt, Format.SVDLIBC_DENSE_TEXT,
                        Type.DENSE_IN_MEMORY); // the V matrix (right singular vectors)
                scaledClassFeatures = false;


                singularValues =  readSVDLIBCsingularVector(S, dimensions);
            } else {
                StringBuilder sb = new StringBuilder();
                for (String line = null; (line = stderr.readLine()) != null; )
                    sb.append(line).append("\n");
                // warning or error?
                LOG.warning("svdlibc exited with error status.  " + 
                               "stderr:\n" + sb.toString());
            }
        } catch (IOException ioe) {
            LOG.log(Level.SEVERE, "SVDLIBC", ioe);
        } catch (InterruptedException ie) {
            LOG.log(Level.SEVERE, "SVDLIBC", ie);
        }
    }

    public MatrixBuilder getBuilder() {
        return new SvdlibcSparseBinaryMatrixBuilder();
    }

    private static double[] readSVDLIBCsingularVector(File sigmaMatrixFile,
                                                      int dimensions)
            throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(sigmaMatrixFile));
        double[] m = new double[dimensions];

        int readDimensions = Integer.parseInt(br.readLine());
        if (readDimensions != dimensions)
            throw new RuntimeException(
                    "SVDLIBC generated the incorrect number of " +
                    "dimensions: " + readDimensions + " versus " + dimensions);

        int i = 0;
        for (String line = null; (line = br.readLine()) != null; )
            m[i++] = Double.parseDouble(line);
        return m;
    }


The SVD code above is a Java wrapper around SVDLIBC taken from the S-Space package.
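
A minimal usage sketch, assuming the factorize() method above lives in a class used together with the MatrixBuilder and SparseArray types shown earlier; the wiring below (names such as documentColumns and the MatrixFile constructor) is an assumption for illustration, not the verified S-Space API:

// Hypothetical wiring: write the sparse term-document matrix with the builder,
// then reduce it to 300 dimensions with SVDLIBC via factorize().
MatrixBuilder builder = getBuilder();              // SvdlibcSparseBinaryMatrixBuilder
for (SparseArray<Double> documentColumn : documentColumns)
    builder.addColumn(documentColumn);             // one column per document (TF-IDF weights)
builder.finish();                                  // flush the matrix file to disk

MatrixFile mFile = new MatrixFile(builder.getFile(), builder.getMatrixFormat());
factorize(mFile, 300);                             // U, singularValues and V are filled in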



5. Aylien API



The Aylien Text Analysis API is a package of APIs for extracting meaning and insight from textual content.

Among other things, the Aylien API offers article extraction, summarization, sentiment analysis and document classification. In this work only the classification endpoint is used.



Classification can be performed against several taxonomies, for example the IPTC subject codes and the IAB-QAG contextual taxonomy, which is the one used here.



The IAB-QAG contextual taxonomy was developed by the IAB (Interactive Advertising Bureau) together with taxonomy experts from academia to define content categories on at least two levels, making content classification more consistent. The first level is a broad set of categories, and the second level describes them in more detail in a tree-like structure (Fig. 6).

To use this API you need to obtain a key and an application ID on the official site. With those credentials you can then call the POST and GET methods from Java code.



private static TextAPIClient client = new TextAPIClient(" ", " "); // application ID and key


After that you can call the classification, passing in the data you want to classify.



ClassifyByTaxonomyParams.Builder builder = ClassifyByTaxonomyParams.newBuilder();
URL url = new URL("http://techcrunch.com/2015/07/16/microsoft-will-never-give-up-on-mobile");
builder.setUrl(url);
builder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(builder.build());
for (TaxonomyCategory c: response.getCategories()) {
  System.out.println(c);
}


The service returns its response in JSON format:



{
  "categories": [
    {
      "confident": true,
      "id": "IAB19-36",
      "label": "Windows",
      "links": [
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19-36",
          "rel": "self"
        },
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
          "rel": "parent"
        }
      ],
      "score": 0.5675236066291172
    },
    {
      "confident": true,
      "id": "IAB19",
      "label": "Technology & Computing",
      "links": [
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
          "rel": "self"
        }
      ],
      "score": 0.46704140928338533
    }
  ],
  "language": "en",
  "taxonomy": "iab-qag",
  "text": "When Microsoft announced its wrenching..."
}


This API is used to classify the clusters that will be obtained with the unsupervised clustering method.



Afterword



For the algorithms described above there are alternative approaches and ready-made libraries; you only have to look for them. If you liked the article, or if you have ideas or questions, please leave a comment. The third part will be a summary and will mainly deal with the architecture of the system: a description of the algorithms used, what they are applied to and in what order.



It will also show the results obtained after applying each algorithm, as well as the final result of the work.



