Clustering and classification of large text data with machine learning in Java. Part 2: Algorithms




Hi! Today I continue the topic of clustering and classifying large text data with machine learning in Java. This article is a continuation of the first one.





It covers the theory behind the algorithms I used as well as their implementation.





1. Tokenization



Theory:



Tokenization is the process of breaking a stream of text into tokens: words, phrases or other meaningful elements that become the input for the subsequent processing steps (filtering, lemmatization, weighting and so on).

At first glance the task looks trivial, since words are separated by spaces, but punctuation makes it ambiguous: a period can end a sentence, mark an abbreviation or sit inside a number; apostrophes and hyphens may or may not belong to the token; digits and letter case also have to be handled somewhere in the pipeline.

In this project the source documents are plain text obtained from PDF files, so the text can also carry conversion artifacts that have to be tolerated. For this work a simple approach is sufficient: the text is read line by line, and each line is split into tokens on whitespace; punctuation, digits and case are dealt with later by the token filter and the lemmatizer.





Code:



Iterator<String> finalIterator = new WordIterator(reader);


// tokenizer state: a pattern matching runs of non-whitespace, the current line,
// its matcher, and the next token to be returned
private static final Pattern notWhiteSpace = Pattern.compile("\\S+");
private final BufferedReader br;
private Matcher matcher;
private String curLine;
private String next;

public WordIterator(BufferedReader br) {
    this.br = br;
    curLine = null;
    advance();
}

// Reads ahead to the next whitespace-delimited token, pulling new lines from the
// reader as needed; sets next to null (and closes the reader) at end of input.
private void advance() {
    try {
        while (true) {
            if (curLine == null || !matcher.find()) {
                String line = br.readLine();
                if (line == null) {
                    next = null;
                    br.close();
                    return;
                }
                matcher = notWhiteSpace.matcher(line);
                curLine = line;
                if (!matcher.find())
                    continue;
            }
            next = curLine.substring(matcher.start(), matcher.end());
            break;
        }
    } catch (IOException ioe) {
        throw new IOError(ioe);
    }
}
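
For completeness, here is a sketch of the hasNext()/next() pair that advance() is written to support; these methods are not part of the excerpt above, so their exact original wording is an assumption:

public boolean hasNext() {
    return next != null;
}

public String next() {
    if (next == null)
        throw new java.util.NoSuchElementException();
    String token = next;   // return the token prepared by the previous advance()
    advance();             // and read ahead to the following one
    return token;
}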


2. Stop words



Theory:



Stop words are words that carry little independent meaning and are filtered out before further processing: articles, prepositions, conjunctions, pronouns, auxiliary verbs and the like (a, the, and, of, in, is, it and so on). The idea of discarding such words goes back to Hans Peter Luhn, one of the pioneers of information retrieval, around 1958. Removing them sharply reduces the size of the index and of the feature space, but it has to be done with care: in phrase queries such as "to be or not to be" almost every word is a stop word, so dropping them blindly destroys the meaning.

There is no universal stop list; its contents depend on the language, the domain and the task at hand. In this work the list is kept in a plain text file, and every token produced by the tokenizer is checked against it before being passed on.

The main approaches to building and applying a stop list are the following (a rough sketch of the frequency-based variants is given right after the list):



  • The classic method: tokens are filtered against a precompiled stop list. Ready-made lists exist for most languages and can be used as-is, trimmed, or extended for a particular domain.
  • Methods based on Zipf's law: remove the most frequent terms of the collection (TF-High), terms that occur only once in the whole collection (TF1), and terms with a low inverse document frequency (IDF).
  • Mutual information (MI): the mutual information between a term and a document class is computed; terms with a low value contribute little to distinguishing the classes and can be treated as stop words.
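
A rough sketch of those frequency-based heuristics (the method and its parameters are illustrative, not code from the project):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Flags stop-word candidates by simple corpus statistics: TF-High, TF1 and low IDF. */
static Set<String> frequencyStopWords(List<List<String>> documents,
                                      int topTerms, double maxDocFraction) {
    Map<String, Integer> termFreq = new HashMap<>();   // total occurrences of each term
    Map<String, Integer> docFreq = new HashMap<>();    // number of documents containing the term
    for (List<String> doc : documents) {
        for (String t : doc)
            termFreq.merge(t, 1, Integer::sum);
        for (String t : new HashSet<>(doc))
            docFreq.merge(t, 1, Integer::sum);
    }

    Set<String> stop = new HashSet<>();
    // TF1: terms that occur exactly once in the whole collection
    termFreq.forEach((t, f) -> { if (f == 1) stop.add(t); });
    // low IDF: terms present in more than maxDocFraction of all documents
    docFreq.forEach((t, df) -> { if (df > maxDocFraction * documents.size()) stop.add(t); });
    // TF-High: the topTerms most frequent terms overall
    termFreq.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(topTerms)
            .forEach(e -> stop.add(e.getKey()));
    return stop;
}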


Term-based random sampling (TBRS): a method for detecting stop words directly from the documents themselves. It iterates over randomly selected chunks of the collection and ranks the terms of each chunk by how informative they are, using the Kullback-Leibler divergence measure shown in the following equation:



d_x(t) = P_x(t) · log2( P_x(t) / P(t) )



where P_x(t) is the normalized frequency of term t within chunk x,

and P(t) is the normalized frequency of term t in the whole collection.

The final stop list is then built by taking the least informative terms over all documents and removing any duplicates.
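
A minimal sketch of the per-chunk weighting step of TBRS, assuming the normalized frequencies P_x(t) and P(t) have already been computed (all names here are illustrative):

import java.util.HashMap;
import java.util.Map;

/** Ranks the terms of one randomly sampled chunk by the Kullback-Leibler measure d_x(t). */
static Map<String, Double> rankChunkTerms(Map<String, Double> chunkFreq,    // P_x(t), normalized within the chunk
                                          Map<String, Double> corpusFreq) { // P(t), normalized over the collection
    Map<String, Double> weights = new HashMap<>();
    for (Map.Entry<String, Double> e : chunkFreq.entrySet()) {
        double px = e.getValue();
        double p = corpusFreq.getOrDefault(e.getKey(), Double.MIN_VALUE);
        // d_x(t) = P_x(t) * log2( P_x(t) / P(t) )
        weights.put(e.getKey(), px * (Math.log(px / p) / Math.log(2)));
    }
    return weights;   // the lowest-weighted (least informative) terms are stop-word candidates
}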





Code:

TokenFilter filter = new TokenFilter().loadFromResource("stopwords.txt");
if (!filter.accept(token)) continue;


private Set<String> tokens;
private boolean excludeTokens;
private TokenFilter parent;

public TokenFilter loadFromResource(String fileName) {
		try {
			ClassLoader classLoader = getClass().getClassLoader();
			String str = IOUtils.toString(
					classLoader.getResourceAsStream(fileName),
					Charset.defaultCharset());
			InputStream is = new ByteArrayInputStream(str.getBytes());
			BufferedReader br = new BufferedReader(new InputStreamReader(is));

			Set<String> words = new HashSet<String>();
			for (String line = null; (line = br.readLine()) != null;)
				words.add(line);
			br.close();

			this.tokens = words;
			this.excludeTokens = true;
			this.parent = null;
		} catch (Exception e) {
			throw new IOError(e);
		}
		return this;
	}
public boolean accept(String token) {
		// normalize: lower-case and strip periods, spaces and digits before testing
		token = token.toLowerCase().replaceAll("[\\. \\d]", "");
		// keep the token only if the parent filter (if any) accepts it, it is not in the
		// stop list (excludeTokens == true), it is longer than two characters and it
		// consists of letters and hyphens only
		return (parent == null || parent.accept(token))
				&& tokens.contains(token) ^ excludeTokens && token.length() > 2
				&& token.matches("^[a-z-]+$");
	}


The stop-word file (stopwords.txt) is a plain list of words, one per line:

....


3. Lemmatization



Theory:



After tokenization and stop-word filtering the remaining tokens still appear in many inflected forms, so the next step is to bring them to a common base form.



Lemmatization is the process of grouping the inflected forms of a word so that they can be analysed as a single item identified by the word's lemma, its dictionary form. Unlike stemming, it does not simply chop off endings: it relies on a vocabulary and on morphological analysis, and it normally takes the part of speech and the context of the word into account. For example, the forms working, works and worked are all reduced to the lemma work, whereas a stemmer would reduce computers, computing and computer to a truncated root such as comput(e) without checking whether the result is a real word. Because lemmatization yields genuine dictionary words, the resulting terms are easier to interpret and to match against lexical resources.



Over the years many tools providing lemmatization have been developed. Although they use different processing methods, they all rely on a word dictionary, a set of rules, or a combination of the two as the resource for morphological analysis. The best-known lemmatization tools are:



  • WordNet ‒ a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets). Its built-in morphological processor applies a small set of detachment rules together with an exception list to map an inflected form onto a lemma that exists in the dictionary, and several Java libraries expose this functionality.
  • CLEAR ‒ the morphological analyzer of the ClearNLP toolkit. It combines dictionary data (partly derived from WordNet) with its own rules and is designed to be used as a component of larger NLP pipelines (tagging, parsing and so on).
  • GENIA ‒ a POS tagger for biomedical text that also outputs the base form of every token. It was trained on newswire and biomedical corpora (including GENIA and PennBioIE), so it performs well on domain-specific text, although its lexicon is smaller than WordNet's.
  • TreeTagger ‒ a language-independent tool that annotates text with part-of-speech and lemma information using a parameter file (lexicon); parameter files exist for many languages. Words missing from the lexicon are left without a lemma, so coverage depends on the lexicon used.
  • Norm / LuiNorm ‒ normalization tools from the SPECIALIST lexical tools of the UMLS (Unified Medical Language System). They map a word to a canonical form by lower-casing it and stripping punctuation and inflection, which suits biomedical vocabulary well; they do not take the part of speech into account.
  • MorphAdorner – a Java toolkit developed at Northwestern University that "adorns" texts with lemmas, standard spellings and POS tags; it handles both modern and early-modern English.
  • morpha – a robust morphological analyzer for English built from roughly 1,400 rules compiled into a finite-state machine, supplemented by an exception list of several thousand entries derived in part from WordNet. It is fast and is used in many NLP pipelines.


Code:



Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

String token = documentTokens.next().replaceAll("[^a-zA-Z]", "").toLowerCase();
Annotation lemmaText = new Annotation(token);
pipeline.annotate(lemmaText);
List<CoreLabel> lemmaToken = lemmaText.get(TokensAnnotation.class);
String word = "";
for (CoreLabel t : lemmaToken) {
    word = t.get(LemmaAnnotation.class);  // the lemma (dictionary form) of the token
}
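
Wrapped into a small reusable class, the same pipeline lemmatizes an arbitrary string; this is only a sketch of how the snippet above could be packaged, not code from the project:

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class Lemmatizer {
    private final StanfordCoreNLP pipeline;

    public Lemmatizer() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        pipeline = new StanfordCoreNLP(props);
    }

    /** Returns the lemma of every token in the given text. */
    public List<String> lemmatize(String text) {
        Annotation document = new Annotation(text);
        pipeline.annotate(document);
        List<String> lemmas = new ArrayList<>();
        for (CoreLabel token : document.get(TokensAnnotation.class))
            lemmas.add(token.get(LemmaAnnotation.class));
        return lemmas;
    }
}

For example, new Lemmatizer().lemmatize("computers are computing") should return something like [computer, be, compute].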


4. TF-IDF



Theory:



Term frequency - inverse document frequency (TF-IDF) is the most widely used algorithm for computing the weight of a term (a keyword of a document) in modern information retrieval systems. The weight is a statistical measure of how important a word is to a document within a collection or corpus: it grows proportionally to the number of times the word appears in the document and is offset by how frequent the word is in the corpus as a whole.

...

Term frequency (TF) measures how often a term occurs in a document. Because documents differ greatly in length, a term can occur far more times in a long document than in a short one, so the raw count is usually normalized in one way or another. In its simplest form TF is just the raw count of term t in document D:



tf(t,D) = f(t,D),



where f(t,D) is the number of times term t occurs in document D.

Other common variants of the term-frequency weight are:

boolean "frequency": tf(t,D) = 1 if t occurs in D and 0 otherwise;

frequency normalized by the length of the document:



tf(t,D) = f(t,D) / Σ_{t'∈D} f(t',D)



logarithmically scaled frequency:



tf(t,D) = log( 1 + f(t,D) )



augmented frequency, in which the raw count is divided by the count of the most frequent term in the document, to prevent a bias toward longer documents:



tf(t,D) = 0.5 + 0.5 · f(t,D) / max{ f(t',D) : t' ∈ D }



Inverse document frequency (IDF) measures how much information the term provides, that is, how common or rare it is across the whole collection of N documents. It is the logarithm of the total number of documents divided by the number of documents that contain the term:



idf(t,D) = log( N / |{ d ∈ D : t ∈ d }| )



TF-IDF is then the product of the two measures. A high weight is obtained when the term occurs frequently in the given document but rarely in the rest of the collection, so the weights naturally filter out common terms:



tfidf(t,D) = tf(t,D) · idf(t,D)
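
A quick worked example with made-up numbers: if a term occurs 5 times in a document of 100 words, tf = 5/100 = 0.05; if the collection contains 1,000 documents and 10 of them contain the term, idf = log(1000/10) = 2 (base-10 logarithm), so tfidf = 0.05 · 2 = 0.1. A term that occurs in almost every document gets an idf close to zero and therefore a negligible weight no matter how frequent it is.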





Code:

private final TObjectIntMap<T> counts;   // per-term occurrence counts (Trove primitive map)
private int sum;                         // total number of tokens counted so far

public int count(T obj) {
    int count = counts.get(obj);
    count++;
    counts.put(obj, count);
    sum++;
    return count;
}


// Appends one document, represented as a sparse column of term weights, to the matrix being built on disk.
public synchronized int addColumn(SparseArray<? extends Number> column) {
     if (column.length() > numRows)
         numRows = column.length();
    
     int[] nonZero = column.getElementIndices();
     nonZeroValues += nonZero.length;
     try {
         matrixDos.writeInt(nonZero.length);
         for (int i : nonZero) {
             matrixDos.writeInt(i); // write the row index
             matrixDos.writeFloat(column.get(i).floatValue());
         }
     } catch (IOException ioe) {
         throw new IOError(ioe);
     }
     return ++curCol;
}


public interface SparseArray<T> {
    int cardinality();
    T get(int index);
    int[] getElementIndices();
    int length();
    void set(int index, T obj);
    <E> E[] toArray(E[] array);
}


public File transform(File inputFile, File outFile, GlobalTransform transform) {
     try {
         DataInputStream dis = new DataInputStream(
             new BufferedInputStream(new FileInputStream(inputFile)));
         int rows = dis.readInt();
         int cols = dis.readInt();
         DataOutputStream dos = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream(outFile)));
         dos.writeInt(rows);
         dos.writeInt(cols);
         for (int row = 0; row < rows; ++row) {
             for (int col = 0; col < cols; ++col) {
                 double val = dis.readFloat();
                 dos.writeFloat((float) transform.transform(row, col, val));
             }
         }
         dos.close();
         return outFile;
     } catch (IOException ioe) {
         throw new IOError(ioe);
     }
}

public double transform(int row, int column, double value) {
        // term frequency: the raw count normalized by the total number of terms in the document (column)
        double tf = value / docTermCount[column];
        // inverse document frequency: log of the total document count over the number of documents containing the term (row)
        double idf = Math.log(totalDocCount / (termDocCount[row] + 1));
        return tf * idf;
}


public void factorize(MatrixFile mFile, int dimensions) {
        try {
            String formatString = "";
            switch (mFile.getFormat()) {
            case SVDLIBC_DENSE_BINARY:
                formatString = " -r db ";
                break;
            case SVDLIBC_DENSE_TEXT:
                formatString = " -r dt ";
                break;
            case SVDLIBC_SPARSE_BINARY:
                formatString = " -r sb ";
                break;
            case SVDLIBC_SPARSE_TEXT:
                break;
            default:
                throw new UnsupportedOperationException(
                    "Format type is not accepted");
            }

            File outputMatrixFile = File.createTempFile("svdlibc", ".dat");
            outputMatrixFile.deleteOnExit();
            String outputMatrixPrefix = outputMatrixFile.getAbsolutePath();

            LOG.fine("creating SVDLIBC factor matrices at: " + 
                              outputMatrixPrefix);
            String commandLine = "svd -o " + outputMatrixPrefix + formatString +
                " -w dt " + 
                " -d " + dimensions + " " + mFile.getFile().getAbsolutePath();
            LOG.fine(commandLine);
            Process svdlibc = Runtime.getRuntime().exec(commandLine);
            BufferedReader stdout = new BufferedReader(
                new InputStreamReader(svdlibc.getInputStream()));
            BufferedReader stderr = new BufferedReader(
                new InputStreamReader(svdlibc.getErrorStream()));

            StringBuilder output = new StringBuilder("SVDLIBC output:\n");
            for (String line = null; (line = stderr.readLine()) != null; ) {
                output.append(line).append("\n");
            }
            LOG.fine(output.toString());
            
            int exitStatus = svdlibc.waitFor();
            LOG.fine("svdlibc exit status: " + exitStatus);

            if (exitStatus == 0) {
                File Ut = new File(outputMatrixPrefix + "-Ut");
                File S  = new File(outputMatrixPrefix + "-S");
                File Vt = new File(outputMatrixPrefix + "-Vt");
                U = MatrixIO.readMatrix(
                        Ut, Format.SVDLIBC_DENSE_TEXT, 
                        Type.DENSE_IN_MEMORY, true); // the U matrix (left singular vectors)
                scaledDataClasses = false; 
                
                V = MatrixIO.readMatrix(
                        Vt, Format.SVDLIBC_DENSE_TEXT,
                        Type.DENSE_IN_MEMORY); // the V matrix (right singular vectors)
                scaledClassFeatures = false;


                singularValues =  readSVDLIBCsingularVector(S, dimensions);
            } else {
                StringBuilder sb = new StringBuilder();
                for (String line = null; (line = stderr.readLine()) != null; )
                    sb.append(line).append("\n");
                // warning or error?
                LOG.warning("svdlibc exited with error status.  " + 
                               "stderr:\n" + sb.toString());
            }
        } catch (IOException ioe) {
            LOG.log(Level.SEVERE, "SVDLIBC", ioe);
        } catch (InterruptedException ie) {
            LOG.log(Level.SEVERE, "SVDLIBC", ie);
        }
    }

    public MatrixBuilder getBuilder() {
        return new SvdlibcSparseBinaryMatrixBuilder();
    }

    private static double[] readSVDLIBCsingularVector(File sigmaMatrixFile,
                                                      int dimensions)
            throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(sigmaMatrixFile));
        double[] m = new double[dimensions];

        int readDimensions = Integer.parseInt(br.readLine());
        if (readDimensions != dimensions)
            throw new RuntimeException(
                    "SVDLIBC generated the incorrect number of " +
                    "dimensions: " + readDimensions + " versus " + dimensions);

        int i = 0;
        for (String line = null; (line = br.readLine()) != null; )
            m[i++] = Double.parseDouble(line);
        return m;
    }


The SVD code above is a Java wrapper around SVDLIBC taken from the S-Space package.
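
A minimal usage sketch, assuming the factorize() method above lives in a class used together with the MatrixBuilder and SparseArray types shown earlier; the wiring below (names such as documentColumns and the MatrixFile constructor) is an assumption for illustration, not the verified S-Space API:

// Hypothetical wiring: write the sparse term-document matrix with the builder,
// then reduce it to 300 dimensions with SVDLIBC via factorize().
MatrixBuilder builder = getBuilder();              // SvdlibcSparseBinaryMatrixBuilder
for (SparseArray<Double> documentColumn : documentColumns)
    builder.addColumn(documentColumn);             // one column per document (TF-IDF weights)
builder.finish();                                  // flush the matrix file to disk

MatrixFile mFile = new MatrixFile(builder.getFile(), builder.getMatrixFormat());
factorize(mFile, 300);                             // U, singularValues and V are filled in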



5. Aylien API



The Aylien Text Analysis API is a package of APIs for extracting meaning and insight from textual content.

Among other things, the Aylien API offers article extraction, summarization, sentiment analysis and document classification. In this work only the classification endpoint is used.



Classification can be performed against several taxonomies, for example the IPTC subject codes and the IAB-QAG contextual taxonomy, which is the one used here.



The IAB-QAG contextual taxonomy was developed by the IAB (Interactive Advertising Bureau) together with taxonomy experts from academia to define content categories on at least two levels, making content classification more consistent. The first level is a broad set of categories, and the second level describes them in more detail in a tree-like structure (Fig. 6).

To use this API you need to obtain a key and an application ID on the official site. With those credentials you can then call the POST and GET methods from Java code.



private static TextAPIClient client = new TextAPIClient(" ", " "); // application ID and key


After that you can call the classification, passing in the data you want to classify.



ClassifyByTaxonomyParams.Builder builder = ClassifyByTaxonomyParams.newBuilder();
URL url = new URL("http://techcrunch.com/2015/07/16/microsoft-will-never-give-up-on-mobile");
builder.setUrl(url);
builder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(builder.build());
for (TaxonomyCategory c: response.getCategories()) {
  System.out.println(c);
}


The service returns its response in JSON format:



{
  "categories": [
    {
      "confident": true,
      "id": "IAB19-36",
      "label": "Windows",
      "links": [
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19-36",
          "rel": "self"
        },
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
          "rel": "parent"
        }
      ],
      "score": 0.5675236066291172
    },
    {
      "confident": true,
      "id": "IAB19",
      "label": "Technology & Computing",
      "links": [
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
          "rel": "self"
        }
      ],
      "score": 0.46704140928338533
    }
  ],
  "language": "en",
  "taxonomy": "iab-qag",
  "text": "When Microsoft announced its wrenching..."
}


This API is used to classify the clusters that will be obtained with the unsupervised clustering method.



Afterword



For the algorithms described above there are alternative approaches and ready-made libraries; you only have to look for them. If you liked the article, or if you have ideas or questions, please leave a comment. The third part will be a summary and will mainly deal with the architecture of the system: a description of the algorithms used, what they are applied to and in what order.



It will also show the results obtained after applying each algorithm, as well as the final result of the work.



