A Tale of Two Standard C Libraries

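Today I received a bug report from a Debian user who had fed some garbage into the scdoc utility and got a SIGSEGV. Investigating the problem gave me a chance to make a pretty good comparison between musl libc and glibc. Let's start with the stack trace: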

==26267==ERROR: AddressSanitizer: SEGV on unknown address 0x7f9925764184 (pc 0x0000004c5d4d bp 0x000000000002 sp 0x7ffe7f8574d0 T0)
==26267==The signal is caused by a READ memory access.
    #0 0x4c5d4d in parse_text /scdoc/src/main.c:223:61
    #1 0x4c476c in parse_document /scdoc/src/main.c
    #2 0x4c3544 in main /scdoc/src/main.c:763:2
    #3 0x7f99252ab0b2 in __libc_start_main /build/glibc-YYA7BZ/glibc-2.31/csu/../csu/libc-start.c:308:16
    #4 0x41b3fd in _start (/scdoc/scdoc+0x41b3fd)

The source code at that line reads:

if (!isalnum(last) || ((p->flags & FORMAT_UNDERLINE) && !isalnum(next))) {

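Hint: p is a valid, non-NULL pointer. The variables last and next are of type uint32_t. The segfault happens on the second call to isalnum. And, most importantly, it is only reproducible with glibc, not with musl libc. If you had to re-read the code a few times, you are not alone: there is nothing here that ought to trigger a segfault.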

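Knowing that the whole thing came down to glibc, I grabbed its source code and went looking for the implementation of isalnum, bracing myself for some stupid nonsense. But before I get to the stupid nonsense, of which there is plenty, trust me, let's take a quick look at the good alternative. This is how isalnum is implemented in musl libc: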

int isalnum(int c)
{
	return isalpha(c) || isdigit(c);
}

int isalpha(int c)
{
	return ((unsigned)c|32)-'a' < 26;
}

int isdigit(int c)
{
	return (unsigned)c-'0' < 10;
}

As expected, this function runs without segfaulting for any value of c, because why on earth would isalnum ever segfault?
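To make that concrete, here is a small sketch of the same range trick under hypothetical names (the `sketch_` prefix is mine, chosen only to avoid clashing with `<ctype.h>`). It is meant to show that the checks are pure unsigned arithmetic with no memory access, so there is simply nothing to fault on, whatever `int` is passed in.

```c
#include <stdbool.h>

/* Hypothetical names; the bodies follow the musl excerpts quoted above. */
static bool sketch_isalpha(int c)
{
	/* OR-ing with 32 folds 'A'..'Z' onto 'a'..'z'. Subtracting 'a' in
	 * unsigned arithmetic wraps anything below 'a' around to a huge
	 * value, so one comparison checks both ends of the range. */
	return ((unsigned)c | 32) - 'a' < 26;
}

static bool sketch_isdigit(int c)
{
	/* Same idea: values below '0' wrap around and fail the comparison. */
	return (unsigned)c - '0' < 10;
}

static bool sketch_isalnum(int c)
{
	return sketch_isalpha(c) || sketch_isdigit(c);
}
```

Negative values and large code points alike just fall outside both ranges and yield false.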

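Okay, now let's compare that with the glibc implementation. Open the header and you are greeted by the typical GNU boilerplate, but let's skip past it and try to find isalnum.

The first result is: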

enum
{
  _ISupper = _ISbit (0),        /* UPPERCASE.  */
  _ISlower = _ISbit (1),        /* lowercase.  */
  // ...
  _ISalnum = _ISbit (11)        /* Alphanumeric.  */
};

That looks like an implementation detail, so let's move on.

__exctype (isalnum);

But what is __exctype? Let's go back a few lines...

#define __exctype(name) extern int name (int) __THROW

Okay, evidently that is just a prototype, though it is not clear why a macro is needed for it. Looking further...

#if !defined __NO_CTYPE
# ifdef __isctype_f
__isctype_f (alnum)
// ...

Now this looks like something useful. What is __isctype_f? Scrolling back up...

#ifndef __cplusplus
# define __isctype(c, type) \
  ((*__ctype_b_loc ())[(int) (c)] & (unsigned short int) type)
#elif defined __USE_EXTERN_INLINES
# define __isctype_f(type) \
  __extern_inline int                                                         \
  is##type (int __c) __THROW                                                  \
  {                                                                           \
    return (*__ctype_b_loc ())[(int) (__c)] & (unsigned short int) _IS##type; \
  }
#endif

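Well, here we go... okay, we will make sense of it somehow. Apparently __isctype_f is an inline function... wait, it all sits inside the else branch of the #ifndef __cplusplus preprocessor directive. Dead end. Where the hell is isalnum actually defined? Searching further... maybe this?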

#if !defined __NO_CTYPE
# ifdef __isctype_f
__isctype_f (alnum)
// ...
# elif defined __isctype
# define isalnum(c)     __isctype((c), _ISalnum) // <-

Hey, there is the "implementation detail" we saw earlier. Remember?

enum
{
  _ISupper = _ISbit (0),        /* UPPERCASE.  */
  _ISlower = _ISbit (1),        /* lowercase.  */
  // ...
  _ISalnum = _ISbit (11)        /* Alphanumeric.  */
};

Let's take a quick look at the _ISbit macro:

# include <bits/endian.h>
# if __BYTE_ORDER == __BIG_ENDIAN
#  define _ISbit(bit)   (1 << (bit))
# else /* __BYTE_ORDER == __LITTLE_ENDIAN */
#  define _ISbit(bit)   ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8))
# endif
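For readers who want to see what that macro actually produces, the following sketch evaluates both branches as ordinary functions. The helper names are hypothetical and not part of glibc. My reading, which should be treated as an assumption, is that the little-endian branch is simply the byte-swapped form of the big-endian one, so that the 16-bit table entries can share one byte layout across architectures.

```c
#include <stdint.h>

/* Hypothetical helpers mirroring the two branches of the _ISbit macro. */
static uint16_t isbit_big_endian(int bit)
{
	return (uint16_t)(1 << bit);
}

static uint16_t isbit_little_endian(int bit)
{
	return (uint16_t)(bit < 8 ? (1 << bit) << 8 : (1 << bit) >> 8);
}

/* Swap the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}
```

With these, `_ISbit (11)` comes out as 0x0800 on a big-endian machine and 0x0008 on a little-endian one, and `swap16` converts between the two.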

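What the hell is this? Okay, let's move on and treat it as a magic constant. The other macro is called __isctype, and it resembles the __isctype_f we saw a moment ago. Let's look at the #ifndef __cplusplus branch again: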

#ifndef __cplusplus
# define __isctype(c, type) \
  ((*__ctype_b_loc ())[(int) (c)] & (unsigned short int) type)
#elif defined __USE_EXTERN_INLINES
// ...
#endif

Hmm...

At least we have found a pointer dereference, which could explain the segfault. So what is __ctype_b_loc?

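/* These are defined in ctype-info.c.
   The declarations here must match those in localeinfo.h.

   In the thread-specific locale model (see `uselocale' in <locale.h>)
   we cannot use global variables for these as was done in the past.
   Instead, the following accessor functions return the address of
   each variable, which is local to the current thread if multithreaded.

   These point into arrays of 384, so they can be indexed by any `unsigned
   char' value [0,255]; by EOF (-1); or by any `signed char' value
   [-128,-1).  ISO C requires that the ctype functions work for `unsigned
   char' values and for EOF; we also support negative `signed char' values
   for broken old programs.  The case conversion arrays are of `int's
   rather than `unsigned char's because tolower (EOF) must be EOF, which
   doesn't fit into an `unsigned char'.  But today more important is that
   the arrays are also used for multi-byte character sets.  */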
extern const unsigned short int **__ctype_b_loc (void)
     __THROW __attribute__ ((__const__));
extern const __int32_t **__ctype_tolower_loc (void)
     __THROW __attribute__ ((__const__));
extern const __int32_t **__ctype_toupper_loc (void)
     __THROW __attribute__ ((__const__));
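The comment describes a 384-entry table that can be indexed from -128 through 255. As a rough model of why a large uint32_t index lands in unmapped memory, here is a sketch with a stand-in table. Everything in it is hypothetical: the names are mine, the 128-slot bias is an assumption inferred from the index range in the comment, and the real table contents come from locale data rather than a zeroed array.

```c
#include <stdbool.h>

#define TABLE_SIZE 384 /* size stated in the glibc comment */
#define TABLE_BIAS 128 /* assumed offset, so that index -128 maps to slot 0 */

static unsigned short table_storage[TABLE_SIZE];

/* Hypothetical stand-in for the pointer that *__ctype_b_loc() yields. */
static const unsigned short *table = table_storage + TABLE_BIAS;

/* True only for indexes that stay inside the 384-entry array. */
static bool index_is_in_table(long long c)
{
	return c >= -TABLE_BIAS && c < TABLE_SIZE - TABLE_BIAS;
}

/* A lookup that refuses out-of-range indexes instead of reading past
 * the array, which is the check the macro version does not perform. */
static int checked_lookup(long long c, unsigned short mask)
{
	if (!index_is_in_table(c))
		return 0;
	return table[c] & mask;
}
```

The macro form skips the range check entirely, so an index far beyond 255 reads memory far past the end of the array, and whether that faults depends on what happens to be mapped there.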

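glibc, you are so cool! I just love dealing with locales. Anyway, gdb was attached to my crashed application, so with everything I had learned in mind, I wrote the following: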

(gdb) print ((unsigned int **(*)(void))__ctype_b_loc)()[next]
Cannot access memory at address 0x11dfa68

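Segfault found. The comment has a line about this: "ISO C requires that the ctype functions work for `unsigned char' values and for EOF". If we look this up in the specification, we find: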

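In all cases [of the functions declared in ctype.h] the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF.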

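Now it is obvious how to fix the problem. My bad. It turns out I cannot hand isalnum an arbitrary UCS-4 (UTF-32) code point to check whether it falls in the ranges 0x30-0x39, 0x41-0x5A, and 0x61-0x7A.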
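One way to stay within what the specification allows is to check the range before calling isalnum at all. This is a minimal sketch, not necessarily the change that went into scdoc; the helper name `is_ascii_alnum` is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: only code points below 0x80 ever reach isalnum,
 * so the argument is always representable as an unsigned char. */
static bool is_ascii_alnum(uint32_t cp)
{
	return cp < 0x80 && isalnum((int)cp) != 0;
}
```

With a helper like this, the failing condition could be written as `!is_ascii_alnum(last) || ((p->flags & FORMAT_UNDERLINE) && !is_ascii_alnum(next))`, and the short-circuiting `&&` guarantees that the libc function never sees an out-of-range value.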

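But let me take the liberty of suggesting something: maybe isalnum should never segfault at all, no matter what it is given? Maybe, even if the specification allows it, that does not mean it should be done this way? Maybe, just as a crazy idea, the behavior of this function should not involve five macros, a check for a C++ compiler, a dependency on the byte order of the architecture, a lookup table, thread-local locale data, and two pointer dereferences?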

Let's look at the musl version one more time:

int isalnum(int c)
{
	return isalpha(c) || isdigit(c);
}

int isalpha(int c)
{
	return ((unsigned)c|32)-'a' < 26;
}

int isdigit(int c)
{
	return (unsigned)c-'0' < 10;
}

So it goes.

Translator's note: thanks to MaxGraey for the link to the original article.
