🥧 💒 🕟 分析DeepSpeech项目的代码或为什么不应该在命名空间std中编写代码 🆘 👈🏽 🧙

DeepSpeech是Mozilla开发的开源和免费语音识别引擎。该引擎具有相当高的性能和良好的用户评价，这使项目代码成为要检查的有趣目标。本文专门分析DeepSpeech项目的C ++代码中发现的错误。

介绍

我们反复使用机器学习来寻找项目中的错误，而DeepSpeech对我们也不例外。毫不奇怪，因为这个项目非常受欢迎：在撰写本文时，它在GitHub上已经有超过15,000个星星。

像往常一样，我将在本文中引用的错误搜索是使用PVS-Studio静态代码分析器进行的。

对于其工作，DeepSpeech使用TensorFlow库。我已经关闭了对该库代码的分析，因为我们已经写了一篇单独的文章。但是，我没有关闭对所用其余库的分析。这是什么原因呢？您包含在项目中的任何库中的错误也将成为项目中的错误。因此，不仅分析您自己，而且分析您使用的任何第三方代码也很有用。您可以在我们最近的文章中阅读有关此内容的详细意见。

简短的介绍到此结束-现在该继续进行错误分析了。顺便说一句，如果您来这里是为了找到我在文章标题中提出的问题的答案（为什么不应该在命名空间std中编写内容），则可以立即查看文章结尾。一个特别有趣的例子在那里等着您！

分析仪发出的10个有趣警告的概述

警告1

V773在不释放“数据”指针的情况下退出了该功能。可能发生内存泄漏。编辑fh.h 311

// EditFstData method implementations: just the Read method.
template <typename A, typename WrappedFstT, typename MutableFstT>
EditFstData<A, WrappedFstT, MutableFstT> *
EditFstData<A, WrappedFstT, MutableFstT>::Read(std::istream &strm,
                                               const FstReadOptions &opts)
{
  auto *data = new EditFstData<A, WrappedFstT, MutableFstT>();
  // next read in MutabelFstT machine that stores edits
  FstReadOptions edits_opts(opts);

  ....
  
  std::unique_ptr<MutableFstT> edits(MutableFstT::Read(strm, edits_opts));
  if (!edits) return nullptr; // <=

  ....
}

此代码段包含一个经典的内存泄漏示例：Read函数调用' return nullptr '而不释放用表达式' new EditFstData '分配的内存。通过退出函数（不调用delete data），仅指针本身将被删除，并且指向它的对象的析构函数将不会被调用。因此，对象将继续存储在内存中，并且将不再可能删除或使用它。

除了错误外，此代码还包含另一种不太好的做法：一个函数的代码同时使用智能指针和普通指针。例如，如果数据还是智能指针，则不会发生这样的错误：如果有必要，当超出范围时，智能指针会自动调用存储对象的析构函数。

警告2

V1062'DfsState '类定义了自定义的'new'运算符。还必须定义“删除”运算符。 dfs-visit.h 62

// An FST state's DFS stack state.
template <class FST>
struct DfsState {
public:
  ....
  void *operator new(size_t size, 
                     MemoryPool<DfsState<FST>> *pool) {
    return pool->Allocate();
  }
  ....
}

PVS-Studio不会停滞不前，并继续添加新的诊断程序。此代码段是一个很好的示例，显示了编号为V1062的最新诊断程序的工作。

该诊断程序强制执行的规则很简单：如果您定义自己的“ new”运算符，则还必须定义自己的“ delete”运算符。相反的方法是相同的：如果您定义自己的“删除”，则还必须定义自己的“新”。

在上面的示例中，违反了该规则：将使用我们定义的“新”创建对象，然后使用标准的“删除”删除该对象。让我们看看MemoryPool类的Allocate函数的作用，我们自己的“新”称呼为：

void *Allocate() {
  if (free_list_ == nullptr) {
    auto *link = static_cast<Link *>(mem_arena_.Allocate(1));
    link->next = nullptr;
    return link;
  } else {
    auto *link = free_list_;
    free_list_ = link->next;
    return link;
  }
}

此函数创建一个项目并将其添加到链接列表。逻辑上，这样的分配应该以自己的“新”形式编写。

但是请稍等！下面仅几行包含以下功能：

void Free(void *ptr) {
  if (ptr) {
    auto *link = static_cast<Link *>(ptr);
    link->next = free_list_;
    free_list_ = link;
  }
}

这意味着我们已经具有分配和发布的现成功能。程序员最有可能必须编写自己的“删除”运算符，然后使用Free（）函数释放它。

分析仪至少检测到三个以下此类错误：

V1062'VectorState '类定义了一个自定义的'new'运算符。还必须定义“删除”运算符。vector-fst.h 31
V1062 “ CacheState”类定义了一个自定义的“ new”运算符。还必须定义“删除”运算符。缓存.h 65

警告3

V703很奇怪，派生类“ ShortestPathOptions”中的“ first_path”字段将覆盖基类“ ShortestDistanceOptions”中的字段。检查行：shortest-path.h：35，shortest-distance.h：34。最短路径.h 35

// Base class
template <class Arc, class Queue, class ArcFilter>
struct ShortestDistanceOptions {
  Queue *state_queue;    // Queue discipline used; owned by caller.
  ArcFilter arc_filter;  // Arc filter (e.g., limit to only epsilon graph).
  StateId source;        // If kNoStateId, use the FST's initial state.
  float delta;           // Determines the degree of convergence required
  bool first_path;       // For a semiring with the path property (o.w.
                         // undefined), compute the shortest-distances along
                         // along the first path to a final state found
                         // by the algorithm. That path is the shortest-path
                         // only if the FST has a unique final state (or all
                         // the final states have the same final weight), the
                         // queue discipline is shortest-first and all the
                         // weights in the FST are between One() and Zero()
                         // according to NaturalLess.

  ShortestDistanceOptions(Queue *state_queue, ArcFilter arc_filter,
                          StateId source = kNoStateId,
                          float delta = kShortestDelta)
      : state_queue(state_queue),
        arc_filter(arc_filter),
        source(source),
        delta(delta),
        first_path(false) {}
};
// Derived class
template <class Arc, class Queue, class ArcFilter>
struct ShortestPathOptions
    : public ShortestDistanceOptions<Arc, Queue, ArcFilter> {
  using StateId = typename Arc::StateId;
  using Weight = typename Arc::Weight;

  int32 nshortest;    // Returns n-shortest paths.
  bool unique;        // Only returns paths with distinct input strings.
  bool has_distance;  // Distance vector already contains the
                      // shortest distance from the initial state.
  bool first_path;    // Single shortest path stops after finding the first
                      // path to a final state; that path is the shortest path
                      // only when:
                      // (1) using the ShortestFirstQueue with all the weights
                      // in the FST being between One() and Zero() according to
                      // NaturalLess or when
                      // (2) using the NaturalAStarQueue with an admissible
                      // and consistent estimate.
  Weight weight_threshold;  // Pruning weight threshold.
  StateId state_threshold;  // Pruning state threshold.

  ShortestPathOptions(Queue *queue, ArcFilter filter, int32 nshortest = 1,
                      bool unique = false, bool has_distance = false,
                      float delta = kShortestDelta, bool first_path = false,
                      Weight weight_threshold = Weight::Zero(),
                      StateId state_threshold = kNoStateId)
      : ShortestDistanceOptions<Arc, Queue, ArcFilter>(queue, filter,
                                                       kNoStateId, delta),
        nshortest(nshortest),
        unique(unique),
        has_distance(has_distance),
        first_path(first_path),
        weight_threshold(std::move(weight_threshold)),
        state_threshold(state_threshold) {}
};

同意，发现潜在错误并非易事，对吧？

问题在于基类和派生类都包含具有相同名称的字段：first_path。这将导致派生类具有其自己的不同字段，该字段将使用其名称覆盖基类中的字段。这样的错误可能导致严重的混乱。

为了更好地理解我的意思，我建议从我们的文档中考虑一个简短的综合示例。假设我们有以下代码：

class U {
public:
  int x;
};

class V : public U {
public:
  int x;  // <= V703 here
  int z;
};

在此，名称x在派生类中被覆盖。现在的问题是：以下代码将输出什么值？

int main() {
  V vClass;
  vClass.x = 1;
  U *uClassPtr = &vClass;
  std::cout << uClassPtr->x << std::endl;
  ....
}

如果您认为将输出未定义的值，那么您是对的。在此示例中，该单元将被写入派生类的字段，但是将从基类的字段进行读取，而基类在输出时仍未定义。

在类层次结构中重叠名称是一个潜在的错误，应避免:)

警告4

V1004在对nullptr进行验证之后，不安全地使用了“ aiter”指针。检查行：107、119。visit.h 119

template <....>
void Visit(....)
{
  ....
  // Deletes arc iterator if done.
  auto *aiter = arc_iterator[state];
  if ((aiter && aiter->Done()) || !visit) {
    Destroy(aiter, &aiter_pool);
    arc_iterator[state] = nullptr;
    state_status[state] |= kArcIterDone;
  }
  // Dequeues state and marks black if done.
  if (state_status[state] & kArcIterDone) {
    queue->Dequeue();
    visitor->FinishState(state);
    state_status[state] = kBlackState;
    continue;
  }
  const auto &arc = aiter->Value();       // <=
  ....
}

在检查过nullptr之后使用aiter 指针。分析器做出一个假设：如果检查指针是否为nullptr，则在检查期间它可以具有这样的值。在这种情况下，让我们看看如果aiter真正等于零，将会发生什么。首先，将在' if（（aiter && aiter-> Done（））||！Visit） '语句中检查此指针。这个条件将是错误的，如果if，那么我们将不会进入此分支。然后，根据所有经典错误的规范，将取消引用空指针：' aiter-> Value（）;

'。这种取消引用导致未定义的行为。

警告5

以下示例一次包含两个错误：

V595在针对nullptr进行验证之前，已使用“ istrm”指针。检查行：60，61。mapping-file.cc 60
V595在针对nullptr进行验证之前，已使用“ istrm”指针。检查行：39，61。mapping-file.cc 39

MappedFile *MappedFile::Map(std::istream *istrm, bool memorymap,
                            const string &source, size_t size) {
  const auto spos = istrm->tellg();        // <=
  ....
  istrm->seekg(pos + size, std::ios::beg); // <=
  if (istrm) {                             // <=
    VLOG(1) << "mmap'ed region of " << size
            << " at offset " << pos
            << " from " << source
            << " to addr " << map;
  return mmf.release();
  }
  ....
}

在此找到的错误比上一个示例的错误更清楚。首先取消引用istrm指针（两次），然后再进行零检查和错误记录。这清楚地表明：如果null指针以istrm的形式出现在此函数中，则将发生未定义的行为（或更可能是程序崩溃），而无需进行任何日志记录。疾病...像这样的错误不容忽视。

警告6

V730并非类的所有成员都在构造函数中初始化。考虑检查：stones_writing_。ersatz_progress.cc 14

ErsatzProgress::ErsatzProgress()
  : current_(0)
  , next_(std::numeric_limits<uint64_t>::max())
  , complete_(next_)
  , out_(NULL)
{}

分析器警告我们，构造函数不会初始化ErzatzProgress结构的所有字段。让我们将此构造函数与该结构中的字段列表进行比较：

class ErsatzProgress {
  ....
private:
    void Milestone();

    uint64_t current_, next_, complete_;
    unsigned char stones_written_;
    std::ostream *out_;
};

确实，您可以看到该构造函数初始化了除stones_writing_之外的所有字段。

注意：此示例可能不是错误。在未初始化的字段的值时才会发生真正的错误使用。

但是，V730诊断程序可帮助您提前调试这些用例。毕竟，出现了一个自然的问题：如果程序员决定专门初始化类的所有字段，那么为什么他有理由将一个字段留为无值呢？当我在下面的几行中看到另一个构造函数时，证实了

我的“ stones_writing_”字段未正确初始化的猜测：

ErsatzProgress::ErsatzProgress(uint64_t complete,
                               std::ostream *to,
                               const std::string &message)
  : current_(0)
  , next_(complete / kWidth)
  , complete_(complete)
  , stones_written_(0)
  , out_(to)
{
  ....
}

在这里，该类的所有字段都已初始化，这确认了：程序员确实计划初始化所有字段，但无意间忘了一件事。

警告7

V780无法使用memset函数初始化非被动（非PDS）类型的对象'＆params'。第261章

/* Not the best numbering system,
   but it grew this way for historical reasons
 * and I want to preserve existing binary files. */
typedef enum
{
  PROBING=0,
  REST_PROBING=1,
  TRIE=2,
  QUANT_TRIE=3,
  ARRAY_TRIE=4,
  QUANT_ARRAY_TRIE=5
}
ModelType;

....

struct FixedWidthParameters {
  unsigned char order;
  float probing_multiplier;
  // What type of model is this?
  ModelType model_type;
  // Does the end of the file 
  // have the actual strings in the vocabulary?
  bool has_vocabulary;
  unsigned int search_version;
};

....

// Parameters stored in the header of a binary file.
struct Parameters {
  FixedWidthParameters fixed;
  std::vector<uint64_t> counts;
};

....

void BinaryFormat::FinishFile(....)
{
  ....
  // header and vocab share the same mmap.
  Parameters params = Parameters();
  memset(&params, 0, sizeof(Parameters)); // <=
  ....
}

要理解此警告，建议您首先了解什么是PDS类型。 PDS代表被动数据结构，一种简单的数据结构。有时，他们说的不是“ PDS”而是“ POD”-“普通旧数据”。用简单的术语（我引用俄语Wikipedia），PDS-type是一种数据类型，在内存中具有严格定义的字段位置，不需要访问限制和自动控制。简而言之，它是仅包含内置类型的数据类型。

POD类型的显着特征是可以使用原始内存管理功能（memset，memcpy等）来更改和处理这些类型的变量。但是，对于``非PDS''类型不能这么说：如此低水平地处理其值可能会导致严重错误。例如，内存泄漏，相同资源的两次刷新或未定义的行为。

PVS-Studio对上面给出的代码发出警告：您不能以这种方式处理“参数”类型的结构。如果查看此结构的定义，可以看到其第二个成员的类型为std :: vector...该类型主动使用自动内存管理，除了内容数据之外，还存储其他服务变量。使用memset将此类字段清零会破坏类的逻辑，这是一个严重的错误。

警告8

V575潜在的空指针已传递到“ memcpy”函数。检查第一个参数。检查行：73，68.modelstate.cc 73

Metadata*
ModelState::decode_metadata(const DecoderState& state, 
                            size_t num_results)
{
  ....
  Metadata* ret = (Metadata*)malloc(sizeof(Metadata));
  ....
  memcpy(ret, &metadata, sizeof(Metadata));
  return ret;
}

下一条警告告诉我们，空指针正在传递给memcpy函数。是的，的确，如果malloc函数无法分配内存，它将返回NULL。在这种情况下，该指针将被传递到memset函数，在此将其取消引用-从而导致迷人的程序崩溃。

但是，我们的某些读者可能会愤慨：如果内存溢出/碎片过多，以致malloc无法分配内存，接下来发生的事情真的重要吗？该程序仍然会崩溃，因为由于内存不足，它将无法正常运行。

我们反复遇到这种观点，并认为它是错误的。我会详细告诉您为什么如此，但是此主题值得另作文章。这么多值得我们在几年前写它：)如果您想知道为什么应该始终检查malloc函数返回的指针，那么我请您阅读：检查malloc返回的原因为什么很重要。

警告9

以下警告是由与上一个警告相同的原因引起的。是的，它指出的错误略有不同。

V769“ middle_begin_ +（counts.size（）-2）”表达式中的“ middle_begin_”指针可以为nullptr。在这种情况下，结果值将毫无意义，因此不应使用。检查行：553、552。search_trie.cc 553

template <class Quant, class Bhiksha> class TrieSearch {
....
private:
  ....
  Middle *middle_begin_, *middle_end_;
  ....
};

template <class Quant, class Bhiksha>
uint8_t *TrieSearch<Quant, Bhiksha>::SetupMemory(....)
{
  ....
  middle_begin_
    = static_cast<Middle*>(malloc(sizeof(Middle) * (counts.size() - 2)));
  middle_end_ = middle_begin_ + (counts.size() - 2);
  ....
}

就像前面的示例一样，这里使用malloc函数分配内存。在算术表达式中使用返回的指针，无需对nullptr进行任何检查。 las，这种表达式的结果将毫无意义，并且一个完全无用的值将存储在middle_end_字段中。

警告10

最后，我认为，最有趣的示例是在DeepSpeech包含的kenlm库中找到的：

V1061扩展'std'名称空间可能会导致未定义的行为。 size_iterator.hh 210

// Dirty hack because g++ 4.6 at least wants
// to do a bunch of copy operations.
namespace std {
inline void iter_swap(util::SizedIterator first,
                      util::SizedIterator second)
{
  util::swap(*first, *second);
}
} // namespace std

评论中称为“肮脏的把戏”的把戏确实很脏。关键是std名称空间的这种扩展可能导致未定义的行为。

为什么？因为std名称空间的内容完全由标准委员会确定。这就是为什么C ++语言的国际标准明确禁止以这种方式扩展std。

g ++ 4.6支持的最后一个标准是C ++ 03。这是C ++ 03最终工作草案的翻译报价（请参阅第17.6.4.2.1项）：“除非另外指定，否则如果C ++程序的行为向std名称空间或std嵌套名称空间添加声明或定义，则该行为不确定。此引用适用于所有后续标准（C ++ 11，C ++ 14，C ++ 17和C ++ 20）。

我建议考虑如何从我们的示例中修复有问题的代码。第一个逻辑问题：这些“表明相反情况的情况”是什么？在几种情况下，std扩展不会导致未定义的行为。您可以在V1061诊断的文档页面上阅读有关所有这些情况的更多信息，但是现在对我们来说重要的是，其中一种情况是添加了功能模板的特殊化。

因为std已经具有一个称为iter_swap的函数（请注意：模板函数），可以合理地假设程序员希望扩展其功能，以便可以与util :: SizedIterator类型一起使用。但这是不幸的事情：程序员没有在函数模板中添加特殊化，而是简单地编写了一个普通的重载。它应该这样写：

namespace std {
template <>
inline void iter_swap(util::SizedIterator first,
                      util::SizedIterator second)
{
  util::swap(*first, *second);
}
} // namespace std

但是，此代码也不是那么简单。关键是该代码仅在C ++ 20标准下才有效。是的，它也注意到函数模板的专业化导致未定义的行为（请参见C ++ 20最终工作草案第16.5.4.2.1节）。并且由于此代码属于库，因此迟早可能会使用-std = C ++ 20标志来构建它。顺便说一句，PVS-Studio可以区分代码中使用的是哪个版本的标准，并根据此版本发出或不发出警告。见自己：例如对于C ++ 17，例如对于C ++ 20。

实际上，您可以做得容易得多。要解决该错误，您只需要传输自己的iter_swap定义即可进入定义SizedIterator类的相同名称空间。在这种情况下，在调用iter_swap的地方，您需要添加“ using std :: iter_swap;”。事实证明是这样的（为简单起见，更改了SizedIterator类的定义和util :: swap（）函数）：

namespace util
{
  class SizedIterator
  {
  public:
    SizedIterator(int i) : m_data(i) {}

    int& operator*()
    {
      return m_data;
    }
  private:
    int m_data;
  };

  ....

  inline void iter_swap(SizedIterator first,
                        SizedIterator second)
  {
    std::cout << "we are inside util::iter_swap" << std::endl;
    swap(*first, *second);
  }
}


int main()
{
  double d1 = 1.1, d2 = 2.2;
  double *pd1 = &d1, *pd2 = &d2;
  util::SizedIterator si1(42), si2(43);

  using std::iter_swap;

  iter_swap(pd1, pd2);
  iter_swap(si1, si2); // "we are inside util::iter_swap"

  return 0;
}

现在，编译器将基于参数搜索（ADL）独立选择iter_swap函数所需的重载。对于SizedIterator类，将调用命名空间util中的版本，对于其他类型，将调用命名空间std中的版本。证据在链接上。而且，不需要在库函数内部添加任何“使用”：由于它们的代码已经在std内部，因此编译器仍将选择正确的重载。

然后-瞧-定制的iter_swap函数将正常运行，而无需任何“肮脏的把戏”和其他巫术：)

结论

我的文章到此结束。我希望我发现的错误对您来说很有趣，并且您学到了一些对自己有用的新东西。如果您已经读到这一点，那么我衷心希望您编写的代码整洁，没有错误。让错误绕过您的项目！

PS：我们认为在命名空间std中编写自己的代码是一种不好的做法。你怎么看？我期待您在评论中的答复。

如果您使用C，C ++，C＃或Java开发，并且像我一样对静态分析感兴趣，那么我建议您自己尝试PVS-Studio。您可以从链接下载它。

如果您想与讲英语的读者分享这篇文章，请使用翻译链接：George Gribkov。检查DeepSpeech的代码，或为什么不应该在命名空间std中编写代码。

分析DeepSpeech项目的代码或为什么不应该在命名空间std中编写代码

介绍

分析仪发出的10个有趣警告的概述

结论

More articles: