Reading the Tesseract Source Code: The STRING Class

2015-05-11

STRING is Tesseract's own string class, wrapping a number of string operations. It is defined in ccutil/strngs.h. Likewise, for reasons I can't work out, its design is baffling at first.

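In principle, a string class should have a data member that holds the string's contents. STRING does have a data member, but its type is a pointer to an internally defined struct, STRING_HEADER: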

class TESS_API STRING {
public:
    // ....

private:
    typedef struct STRING_HEADER {
        int capacity_;

        // used_ is how much of the capacity is currently being used,
        // including a '\0' terminator.
        // if used_ is 0 then string is NULL(not even the '\0')
        // else if used_ > 0 then it is strlen() + 1 (because it includes '\0')
        mutable int used_;
    } STRING_HEADER;

    STRING_HEADER *data_;

    // ....
};

As you can see, this data member has no field that holds the string's contents.

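STRING has two methods that return a const char *, named string() and c_str(). Given the return type and the names, it is reasonable to assume that both hand back the internally stored string as a const char *, so they are a good place to start uncovering how the class works.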

Both implementations are short:

The string() method

const char* STRING::string() const {
    const STRING_HEADER* header = GetHeader();
    if (header->used_ == 0)
        return NULL;

    // mark header length unreliable because tesseract might
    // cast away the const and mutate the string directly.
    header->used_ = -1;
    return GetCStr();
}
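The comment in string() says the cached length is marked unreliable by setting used_ to -1. How the rest of the class recovers from that is not shown in this post. One plausible approach, assumed here, is to recompute the length with strlen() the next time it is needed. The sketch below is a toy class of my own (TinyString and its fixed 32-byte buffer are hypothetical, not Tesseract code) and it leaves out the NULL return for the empty case:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical miniature of the idea; the names are illustrative only.
class TinyString {
public:
    explicit TinyString(const char* text) {
        // Copy at most 31 characters and always terminate the buffer.
        std::strncpy(buffer_, text, sizeof(buffer_) - 1);
        buffer_[sizeof(buffer_) - 1] = '\0';
        used_ = static_cast<int>(std::strlen(buffer_)) + 1;
    }

    // Hands out the raw buffer and marks the cached length as unreliable,
    // because the caller might cast away const and edit the characters.
    const char* string() const {
        used_ = -1;
        return buffer_;
    }

    // Recomputes the cached length if it was invalidated (assumed strategy).
    int length() const {
        if (used_ < 0) {
            used_ = static_cast<int>(std::strlen(buffer_)) + 1;
        }
        return used_ - 1;  // used_ counts the '\0', length does not
    }

private:
    char buffer_[32] = {};
    mutable int used_ = 0;  // mutable so const methods can update the cache
};
```

This also explains why used_ is declared mutable in STRING_HEADER: a const method such as string() has to be able to overwrite it.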

The c_str() method

const char* STRING::c_str() const {
    return string();
}

c_str() is just an alias for string(), and string() returns whatever GetCStr() returns. Following the trail to that method, its implementation turns out to be:

inline const char* GetCStr() const {
    return ((const char *)data_) + sizeof(STRING_HEADER);
};

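In other words, STRING stores the string's contents in the memory immediately after the header that data_ points to. That region has no name of its own and can only be reached through pointer arithmetic like the above.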
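To make the layout concrete, here is a minimal sketch of the same "header followed by characters" trick. It is not Tesseract's code: Header, MakeBlock, Chars and FreeBlock are hypothetical names, and plain malloc stands in for whatever alloc_string does.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for STRING_HEADER; not Tesseract's definition.
struct Header {
    int capacity;  // bytes reserved for characters, including the '\0'
    int used;      // bytes in use, including the '\0'
};

// Allocates ONE block that holds the header followed by the characters.
Header* MakeBlock(const char* text) {
    const int used = static_cast<int>(std::strlen(text)) + 1;
    void* raw = std::malloc(sizeof(Header) + used);
    assert(raw != nullptr);
    Header* header = static_cast<Header*>(raw);
    header->capacity = used;
    header->used = used;
    // The characters live right after the header, in the same block.
    std::memcpy(reinterpret_cast<char*>(header) + sizeof(Header), text, used);
    return header;
}

// Same pointer arithmetic as GetCStr(): skip past the header.
const char* Chars(const Header* header) {
    return reinterpret_cast<const char*>(header) + sizeof(Header);
}

// A single free releases both the header and the characters.
void FreeBlock(Header* header) { std::free(header); }
```

Calling MakeBlock("abc") yields a header whose used field is 4, and Chars() on it points at the bytes "abc\0" that sit directly behind the header.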

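I can't tell what considerations led to this design, and personally I'm not fond of it. If it has an advantage, it is that memory allocation takes one step fewer, since the header and the characters come from a single allocation. STRING::AllocData allocates memory like this: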

char *STRING::AllocData(int used, int capacity) {
    data_ = (STRING_HEADER *)alloc_string(capacity + sizeof(STRING_HEADER));

    STRING_HEADER *header = GetHeader();
    header->capacity_ = capacity;
    header->used_ = used;

    return GetCStr();
}

Also, the GetHeader() call in the method above simply returns data_, so I think writing it directly like this would work as well:

data_ = (STRING_HEADER *)alloc_string(capacity + sizeof(STRING_HEADER));

data_->capacity_ = capacity;
data_->used_ = used;

return GetCStr();

Most of the methods that STRING wraps already have counterparts among the standard C/C++ string operations.
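As a rough illustration of that point, the following sketch uses only std::string for the kind of everyday operations a string class usually wraps (copying, appending, searching, exposing a const char *). The helper names WithExtension and Contains are made up for this example, and no claim is made about which STRING methods they correspond to.

```cpp
#include <cassert>
#include <string>

// Builds "<name>.<extension>" using only std::string operations.
std::string WithExtension(const std::string& name,
                          const std::string& extension) {
    std::string result = name;  // copy construction
    result += '.';              // append a single character
    result += extension;        // append another string
    return result;
}

// Returns true if `text` contains the character `c`.
bool Contains(const std::string& text, char c) {
    return text.find(c) != std::string::npos;
}
```

For instance, WithExtension("page", "tif") gives a string equal to "page.tif", and its c_str() can be passed to any function expecting a const char *.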

ZMonster's Blog
