In this post I want to follow up on this article and show how to use Wikipedia's WikiExtractor more flexibly by filtering articles by category.

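It all started when I needed definitions for various terms. A term and its definition usually form the first sentence of a Wikipedia page. Taking the simplest route, I extracted all the articles and quickly pulled out everything I needed with regular expressions. The problem was that the definitions came to more than 500 MB and contained far too much that I did not need: named entities, cities, years, and so on.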

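I correctly assumed that the WikiExtractor tool (I will be using a different version; the link is below) has some kind of filter, and it turned out to be a filter by category. Categories are tags on articles that organize pages into a hierarchy. I happily supplied the category "Exact sciences", naively believing that every article related to the exact sciences would be included. No miracle occurred: each page carries only its own small set of categories, and a single page holds no information about how those categories relate to one another. So if I want the pages on exact sciences, I have to list every subcategory of "Exact sciences".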

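No matter, I thought, there must be a service that will hand me all the categories below a given starting point. Unfortunately, all I found was this one, where you can only look at how the categories are related. Walking the categories by hand did not work either, but I was amused to learn that their structure is not a tree, as I had always assumed, but a directed graph with cycles. The hierarchy also drifts a lot: jumping ahead, starting from "Mathematics" you can easily reach Alexander I. So I had to rebuild this graph locally and somehow extract the list of categories I was interested in.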

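So the task is: starting from some vertex, get the list of all categories connected to it, with some way to limit them.

The work was done on a machine running Ubuntu 16.04, but I believe the instructions below should cause no problems on 18.04.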

Downloading and deploying the data

First, download all the necessary data from here, namely:

ruwiki-latest-pages-articles.xml.bz2
ruwiki-latest-categorylinks.sql.gz
ruwiki-latest-category.sql.gz
ruwiki-latest-page.sql.gz

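The categorylinks table records the links of the form [[Category:Title]] found on pages. We are interested in two of its columns: cl_from, the id of the page that contains the link, and cl_to, the name of the category it points to. To turn that id into a title we need the page table, with its columns page_id and page_title. The category table supplies the category names in cat_title. The pages-articles.xml file contains the articles themselves.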

To work with these dumps we need mysql. It can be installed with:

sudo apt-get install mysql-server  mysql-client

After that, log in to mysql and create a database for each dump.

$ mysql -u username -p
mysql> create database category;
mysql> create database categorylinks;
mysql> create database page;

Next, unpack the .sql.gz dumps and load each one into its database.

$  mysql -u username -p category < ruwiki-latest-category.sql
$  mysql -u username -p categorylinks < ruwiki-latest-categorylinks.sql
$  mysql -u username -p page < ruwiki-latest-page.sql

Now join the tables into a list of (category, parent category) pairs and export it to csv.

mysql> select page_title, cl_to
    from categorylinks.categorylinks join page.page on cl_from = page_id
    where page_title in (select cat_title from category.category)
    into outfile '/var/lib/mysql-files/category.csv'
    fields terminated by ';' enclosed by '"' lines terminated by '\n';
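
To see what this join produces, the same query can be tried on a few toy rows. The sketch below uses Python's built-in sqlite3 in place of MySQL, with invented rows and only the columns used above; it is an illustration, not part of the original pipeline.

```python
import sqlite3

# Toy stand-ins for the three dumps; rows and ids are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER, page_title TEXT);
    CREATE TABLE category (cat_title TEXT);
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);

    INSERT INTO page VALUES (1, 'Algebra'), (2, 'Mathematics'), (3, 'Euler');
    INSERT INTO category VALUES ('Algebra'), ('Mathematics'), ('Exact_sciences');
    -- page 1 links to Mathematics, page 2 to Exact_sciences, page 3 to Mathematics
    INSERT INTO categorylinks VALUES
        (1, 'Mathematics'), (2, 'Exact_sciences'), (3, 'Mathematics');
""")

# Same shape as the MySQL query: keep only links whose source page title
# is itself a category name.
pairs = conn.execute("""
    SELECT page_title, cl_to
    FROM categorylinks JOIN page ON cl_from = page_id
    WHERE page_title IN (SELECT cat_title FROM category)
    ORDER BY page_title
""").fetchall()

print(pairs)
```

The page 'Euler' is dropped because its title is not a category name. Note that the match is by title only, so an ordinary article that shares its title with a category would also pass; restricting the join to page_namespace = 14 (the Category namespace) is one way to tighten this if it matters for your data.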

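In the resulting file each row is a pair: a category on the left and one of its parent categories on the right. Before rebuilding the graph I filter out parent categories that match a set of unwanted patterns, which cut the table from 1.6 to 1.1 (the unit is missing in the source). The code is below.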

import pandas as pd
import networkx as nx
from tqdm.auto import tqdm

# Filtering
# The exported file has no header row, so the column names are given here.
# on_bad_lines="skip" (pandas >= 1.3) replaces error_bad_lines=False,
# which was removed in pandas 2.0.
df = pd.read_csv("category.csv", sep=";", header=None,
                 names=["child", "parent"], on_bad_lines="skip")
df = df.dropna()

# Regular expressions for parent categories to drop. The original patterns
# were Cyrillic and did not survive in this copy of the article (only
# fragments such as "[...]+:", ",_" and "__" remain), so supply your own.
exclude_patterns = []

df_filtered = df
for pattern in exclude_patterns:
    df_filtered = df_filtered[~df_filtered["parent"].str.contains(pattern)]

# Graph recovering: one edge from each parent category to its subcategory.
# The "white" color marks a node as not yet visited by the search.
G = nx.DiGraph()
for _, group in tqdm(df_filtered.groupby("child")):
    for _, row in group.iterrows():
        G.add_node(row["parent"], color="white")
        G.add_node(row["child"], color="white")
        G.add_edge(row["parent"], row["child"])
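
If pandas and networkx are not at hand, the same parent-to-child structure can be held in a plain dictionary. This is a minimal sketch, assuming the CSV layout produced by the export above (child first, parent second, ';' separated, '"' quoted); build_adjacency and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

def build_adjacency(lines):
    """Map each parent category to the set of its direct subcategories."""
    children = defaultdict(set)
    for row in csv.reader(lines, delimiter=";", quotechar='"'):
        if len(row) != 2:  # skip malformed lines
            continue
        child, parent = row
        children[parent].add(child)
    return children

# Invented sample in the exported format.
sample = io.StringIO(
    '"Algebra";"Mathematics"\n'
    '"Mathematics";"Exact_sciences"\n'
    '"Geometry";"Mathematics"\n'
)
adjacency = build_adjacency(sample)
print(sorted(adjacency["Mathematics"]))
```

For the real file, pass an open file handle (opened with newline="" and encoding="utf-8") instead of the StringIO object.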

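To collect the categories reachable from a starting vertex, I use a depth-first search that colors visited nodes and stops at a given depth.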

nodes = []

def dfs(G, node, max_depth, depth=1):
    """Append to `nodes` every category reachable from `node`,
    going at most `max_depth` levels deep (the start node is level 1)."""
    global nodes
    G.nodes[node]['color'] = 'gray'  # mark as visited
    nodes.append(node)
    if depth == max_depth:
        return
    for v in G.successors(node):
        # gray nodes were already visited, which also breaks cycles
        if G.nodes[v]['color'] == 'white':
            dfs(G, v, max_depth, depth + 1)
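
Because nodes are marked on first visit, a depth-first search may reach a category through a long path and then not expand it, even though a shorter path would still leave room within the depth limit. A breadth-first traversal visits every category at its minimum depth and so avoids this. The sketch below is a swapped-in technique, not the code used in this article; reachable_categories and the toy graph are illustrative, and the graph is a plain dict rather than a networkx object.

```python
from collections import deque

def reachable_categories(children, start, max_depth):
    """Return categories within `max_depth` levels of `start` (start is level 1)."""
    depth = {start: 1}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the limit
        for child in children.get(node, ()):
            if child not in depth:  # already seen: also protects against cycles
                depth[child] = depth[node] + 1
                queue.append(child)
    return list(depth)

toy = {
    "Exact_sciences": ["Mathematics", "Physics"],
    "Mathematics": ["Algebra", "Exact_sciences"],  # cycle back to the root
    "Algebra": ["Group_theory"],
}
result = reachable_categories(toy, "Exact_sciences", 3)
print(result)
```

With a limit of 3, "Group_theory" (level 4) is left out, and the cycle back to "Exact_sciences" does not cause an infinite loop.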

After the call, nodes contains the list of categories. With a depth of 5 the search returned some 2500 categories.

[Sample of the resulting category names; the original Cyrillic list did not survive in this copy.]

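To apply these categories to Russian-language filtering, a few tweaks to the source code are needed. I used this version. There is a newer one now, so the fixes below may no longer be needed. In WikiExtractor.py, replace "Category" with "Категория" in two places. The corrected regions look like this: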


tagRE = re.compile(r'(.*?)<(/?\w+)[^>]*?>(?:([^<]*)(<.*?>)?)?')
#                    1     2               3      4
keyRE = re.compile(r'key="(\d*)"')
catRE = re.compile(r'\[\[Категория:([^\|]+).*\]\].*')  # capture the category name [[Категория:Category name|Sortkey]]

def load_templates(file, output_file=None):
...

if inText:
    page.append(line)
    # extract categories
    if line.lstrip().startswith('[[Категория:'):
        mCat = catRE.search(line)
        if mCat:
            catSet.add(mCat.group(1))
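
The effect of the change is easy to check in isolation. The sketch below wraps the same regular expression in a small helper and runs it on a few sample lines, assuming the namespace prefix is "Категория" as on Russian Wikipedia; extract_category is a hypothetical name and is not part of WikiExtractor.

```python
import re

# Assumption: category links use the Russian namespace prefix "Категория".
catRE = re.compile(r'\[\[Категория:([^\|]+).*\]\].*')

def extract_category(line):
    """Return the category name from a wikitext line, or None if there is none."""
    if line.lstrip().startswith('[[Категория:'):
        match = catRE.search(line)
        if match:
            return match.group(1)
    return None

with_sortkey = extract_category('[[Категория:Математика|Sortkey]]')
without_sortkey = extract_category('[[Категория:Точные науки]]')
english_prefix = extract_category('[[Category:Mathematics]]')
print(with_sortkey, without_sortkey, english_prefix)
```

Lines that still use the English prefix are ignored, which is why the unmodified script finds no categories in the Russian dump.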

After that, run the command

python WikiExtractor.py --filter_category categories --output wiki_filtered ruwiki-latest-pages-articles.xml

where categories is the file containing the list of categories. The filtered articles will be in wiki_filtered.
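
The list collected in nodes still has to be written to the file passed to --filter_category. The sketch below assumes the file holds one category name per line and that names should use spaces, as they normally do in article wikitext, rather than the underscores used in the SQL dumps. Both assumptions should be checked against the WikiExtractor version in use; write_categories is a hypothetical helper.

```python
import os
import tempfile

def write_categories(names, path):
    """Write unique category names to `path`, one per line, underscores as spaces."""
    seen = set()
    with open(path, "w", encoding="utf-8") as out:
        for name in names:
            title = name.replace("_", " ")
            if title not in seen:
                seen.add(title)
                out.write(title + "\n")

# Example with invented names; in this article the input would be `nodes`.
path = os.path.join(tempfile.mkdtemp(), "categories")
write_categories(["Exact_sciences", "Mathematics", "Exact_sciences"], path)
with open(path, encoding="utf-8") as f:
    written = f.read().splitlines()
print(written)
```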

That's all. Thank you for your attention.

Parsing Wikipedia: filtering for NLP tasks in 44 lines of code
