A summary of C++ techniques used in fastText

I have been reading the fastText source code recently. This post looks at some of the C++ techniques it uses, leaving the theory aside.

Compilation

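-pthread : needed because the code uses multiple threads.
-std=c++0x : C++11 was called C++0x before it was finalized, because the standard was originally expected to be published before 2010. Older compilers therefore enable C++11 with -std=c++0x, while newer ones use -std=c++11.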

Code

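What is the difference between reserve and resize?
- reserve sets aside capacity without creating any objects; the elements themselves are added afterwards, typically with push_back. Why bother? Without it, the code below reallocates several times. Whenever the vector runs out of space it allocates a larger buffer (most implementations grow the capacity by a constant factor, commonly 2x), copies or moves every existing element to the new buffer, and frees the old one. If it runs out again, the whole process repeats, which is wasteful when the final size is known in advance.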
  vector<int> v;
  for (int i = 0; i < 100; ++i) {
    v.push_back(i);
  }
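- resize changes the size of the container directly and creates objects. If new_size > old_size, it appends new_size - old_size value-initialized elements (default-constructed for class types, zero for built-in types); otherwise it destroys the excess elements at the end and leaves the capacity unchanged.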
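To make the difference concrete, here is a minimal sketch that counts reallocations by watching capacity() change. The helper names count_reallocations and size_after_resize are made up for this illustration, and the exact number of reallocations without reserve depends on the standard library's growth factor.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: pushes n ints and counts how many times the capacity
// changes, i.e. how many reallocations happen. If reserve_first is true,
// the capacity is requested up front.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) {
        v.reserve(n);  // allocates storage for n elements; size() stays 0
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}

// Hypothetical helper: resize creates elements, so size() changes at once.
std::size_t size_after_resize(std::size_t new_size) {
    std::vector<int> v;
    v.resize(new_size);  // new_size value-initialized ints (all 0)
    return v.size();
}
```

count_reallocations(100, true) returns 0, because reserve already provided room for all 100 elements, while count_reallocations(100, false) returns some positive number that varies between implementations.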
Lambda expressions
for (int32_t i = 0; i < args_->thread; i++) {
  // lambda expression
  threads.push_back(std::thread([=]() { trainThread(i); }));
}
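A lambda expression has the form: [capture list] (parameter list) -> return type { function body }
The capture list is usually described as follows: "capture list is an (often empty) list of local variables defined in the enclosing function", that is, local variables of the function in which the lambda appears.
parameter list is the list of function parameters, return type is the return type, and function body is the body of the function.
So the code above creates a lambda that takes no parameters and returns nothing. Its [=] capture copies every local variable of the enclosing function that the body actually uses (here, i). Because trainThread is a member function, [=] also captures this implicitly, which C++11 allows but C++20 deprecates. Here is another example: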
#include <iostream>
using namespace std;

void test_equal(int c) {
  int a = 5;
  int b = 3;
  // [=] copies a, b and c into the lambda
  auto f2 = [=]() { return a + b + c; };
  cout << f2() << endl;
}

int main() {
  test_equal(5);
  return 0;
}
This program prints 13 (5 + 3 + 5).
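Capturing by value matters in the thread loop shown earlier: each thread needs its own copy of the loop counter, since a reference to i would refer to a variable that keeps changing and then goes out of scope. The sketch below illustrates this with explicit captures. worker and run_workers are hypothetical names for this example, not fastText functions.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-thread routine: each worker writes a value
// derived from its id into the slot reserved for it.
void worker(int32_t thread_id, std::vector<int32_t>& out) {
    out[thread_id] = thread_id * 10;
}

// Starts num_threads workers and waits for all of them to finish.
std::vector<int32_t> run_workers(int32_t num_threads) {
    std::vector<int32_t> results(num_threads, -1);
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (int32_t i = 0; i < num_threads; i++) {
        // i is captured by value, so each thread keeps the value i had when
        // the lambda was created. results is captured by reference because
        // all workers write into the same vector, each into its own slot.
        threads.push_back(std::thread([i, &results]() { worker(i, results); }));
    }
    for (auto& t : threads) {
        t.join();
    }
    return results;
}
```

Calling run_workers(4) returns the vector {0, 10, 20, 30}. Listing the captures explicitly, as done here, is a stylistic choice; [=] as used in the original loop copies i in the same way.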
shared_ptr smart pointers
For details, see this blog post: https://www.cnblogs.com/heleifz/p/shared-principle-application.html
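As a quick illustration of the basic idea, the sketch below shares one settings object between its creator and a second object, and uses use_count() to show how many owners exist. The Args and Dictionary types here are simplified, hypothetical stand-ins and not the real fastText classes.

```cpp
#include <memory>
#include <utility>

// Hypothetical, simplified settings object.
struct Args {
    int thread = 4;
};

// Hypothetical class that keeps shared ownership of the settings.
class Dictionary {
public:
    explicit Dictionary(std::shared_ptr<Args> args) : args_(std::move(args)) {}
    int threads() const { return args_->thread; }

private:
    std::shared_ptr<Args> args_;
};

// Owner count while both the local pointer and a Dictionary hold the object.
long owners_while_dictionary_alive() {
    auto args = std::make_shared<Args>();
    Dictionary dict(args);  // copies the pointer: two owners now
    return args.use_count();
}

// Owner count after the Dictionary has been destroyed.
long owners_after_dictionary_destroyed() {
    auto args = std::make_shared<Args>();
    {
        Dictionary dict(args);
    }  // dict is destroyed here and releases its share
    return args.use_count();
}
```

The first function returns 2 and the second returns 1. The Args object itself is freed only when the last owner goes away.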

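Using int32_t / int64_t
Use these types when portability matters. As the typedefs in stdint.h (https://sites.uclouvain.be/SystInfo/usr/include/stdint.h.html) show, the size of long depends on the platform: it is 8 bytes on 64-bit Linux but 4 bytes (32 bits) on 32-bit machines. int32_t, by contrast, is guaranteed to be exactly 32 bits wherever it is defined.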

#ifndef __int8_t_defined
# define __int8_t_defined
typedef signed char                int8_t;
typedef short int                int16_t;
typedef int                        int32_t;
# if __WORDSIZE == 64
typedef long int                int64_t;
# else
__extension__
typedef long long int                int64_t;
# endif
#endif

/* Unsigned.  */
typedef unsigned char                uint8_t;
typedef unsigned short int        uint16_t;
#ifndef __uint32_t_defined
typedef unsigned int                uint32_t;
# define __uint32_t_defined
#endif
#if __WORDSIZE == 64
typedef unsigned long int        uint64_t;
#else
__extension__
typedef unsigned long long int        uint64_t;
#endif
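
The width guarantee can be checked at compile time. The following is a minimal sketch; long_bits is a hypothetical helper added only to show that the width of long is whatever the target platform chooses.

```cpp
#include <climits>
#include <cstdint>

// Checked at compile time. CHAR_BIT is the number of bits in a byte.
static_assert(sizeof(int32_t) * CHAR_BIT == 32, "int32_t is exactly 32 bits");
static_assert(sizeof(int64_t) * CHAR_BIT == 64, "int64_t is exactly 64 bits");

// Hypothetical helper: number of bits in long on the platform this is
// compiled for. The result differs between platforms.
constexpr int long_bits() {
    return static_cast<int>(sizeof(long) * CHAR_BIT);
}
```

If either static_assert failed, the file would not compile, so a successful build already confirms the fixed widths. No such assertion can be written portably for long.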

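Detecting UTF-8 character boundaries on the fly
See this post by Ruan Yifeng for background:
http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html
If the top two bits of a byte are 10, that byte is a continuation byte: it belongs to a multi-byte character that started earlier, so the character is not yet complete. The check (word[i] & 0xC0) == 0x80 in the code below tests exactly this.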

void Dictionary::computeSubwords(const std::string& word,
                std::vector<int32_t>& ngrams) const {
     for (size_t i = 0; i < word.size(); i++) {
          std::string ngram;
          if ((word[i] & 0xC0) == 0x80) continue;
          for (size_t j = i, n = 1; j < word.size() && n <= args_->maxn; n++) {
              ngram.push_back(word[j++]);
              while (j < word.size() && (word[j] & 0xC0) == 0x80) {
                  ngram.push_back(word[j++]);
              }
              if (n >= args_->minn && !(n == 1 && (i == 0 || j == word.size()))) {
                  int32_t h = hash(ngram) % args_->bucket;
                  pushHash(ngrams, h);
              }
          }
     }
}
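
The same bit test can be pulled out into a small standalone example. The sketch below splits a string into one piece per UTF-8 character; is_continuation and split_utf8 are hypothetical names for this illustration, and the code assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if byte c is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation(char c) {
    return (c & 0xC0) == 0x80;
}

// Splits a UTF-8 string into one std::string per character (code point).
// Assumes the input is valid UTF-8 and does not start with a continuation byte.
std::vector<std::string> split_utf8(const std::string& s) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < s.size()) {
        std::string ch;
        ch.push_back(s[i++]);  // lead byte of the character
        while (i < s.size() && is_continuation(s[i])) {
            ch.push_back(s[i++]);  // remaining bytes of the same character
        }
        chars.push_back(ch);
    }
    return chars;
}
```

For the 6-byte input "a\xC3\xA9\xE4\xB8\xAD" (the characters a, é and 中), split_utf8 returns three strings of 1, 2 and 3 bytes. computeSubwords uses the same inner loop, so its n counts characters and not bytes.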