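A First Look at Protobuf - Crazy Mark

By Mark

Software developers, and mobile (Android and iOS) developers in particular, are no strangers to XML and JSON as data interchange formats. JSON especially is used everywhere on both of those platforms, and a wide range of JSON parsing libraries has grown up around it. In this post I do not want to survey the many JSON parsers available on each platform. Instead, I want to share a data interchange format that was new to me: Protobuf.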

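What on Earth Is Protobuf?

That was my first reaction when I ran into Protobuf. So what exactly is it?

Protobuf, short for Protocol Buffers, is a flexible and efficient data interchange format for cross-platform communication, similar in purpose to XML and JSON. Let's start with a simple Protobuf example: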

// proto2 syntax: required/optional labels and explicit defaults are proto2 features.
syntax = "proto2";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

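Protobuf comes from a distinguished home: Google. It was originally implemented in C++ and has been used inside Google for years. It is platform-neutral and language-neutral, and it serializes data efficiently for transmission. At the time of writing, the latest Protobuf version is 3.0.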

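Current State

Google originally needed a uniform way to move data between its internal systems, so it defined its own efficient protocol for describing data interchange formats, namely Protobuf, and gradually rolled it out across internal projects. As more and more Google projects adopted it, its performance was steadily improved.
On July 7, 2008, Google released Protobuf to the open-source community. The source code is now available on GitHub.
Today Protobuf is widely used not only inside Google but also in RPC communication generally. Abroad, Facebook uses Protobuf as the encoding and decoding tool for communication in some of its projects. In China, Baidu, Tencent's TDW platform, and some Alibaba projects rely on it as a foundation library. A number of other companies also use Protobuf as the base library for their data communication.
In mobile development, however, it is not yet widely used.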

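Supported Languages

Google provides official Protobuf support for three languages: C++, Java, and Python. With proto3, Go, JavaNano, Ruby, and C# are supported as well. Beyond that, almost every programming language in current use has a Protobuf library of its own, so language coverage is quite thorough.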

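How It Works

To use Protobuf, you describe the format of the data you want to transmit, following the Protobuf syntax, in a file with the .proto extension. For example, a Person object with name and email fields is defined in a .proto file like this: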

message Person {
    required string name = 1;
    optional string email = 2;
}

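Once the data format is defined, we run the Protobuf compiler, protoc, on the .proto file with the output option for the language we need. It emits the source for the corresponding Model class in that language, with serialization methods, getters, setters, and so on generated automatically. For C++, the command to compile the .proto file above is: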

protoc --cpp_out=cpp_dir person.proto

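From the definition of message Person in the .proto file, the Protobuf compiler generates the corresponding source files: a header (person.pb.h) and an implementation file (person.pb.cc). The header is excerpted below, so let's look at what was actually generated.

First comes the basic skeleton of the Person class, including its constructor, destructor, assignment operator overload, descriptor function, and so on: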

  class Person;
  ...
  class Person : public ::google::protobuf::Message {
  public:
    Person();
    virtual ~Person();
    Person(const Person& from);

    inline Person& operator=(const Person& from) {
      CopyFrom(from);
      return *this;
    }

    inline const ::google::protobuf::UnknownFieldSet& unknown_fields() const {
      return _unknown_fields_;
    }

    inline ::google::protobuf::UnknownFieldSet* mutable_unknown_fields() {
      return &_unknown_fields_;
    }

    static const ::google::protobuf::Descriptor* descriptor();
    static const Person& default_instance();

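The rest of the header shows clearly that the Protobuf compiler has already generated the Model's construction methods, the getter and setter for each field, and the serialization methods. From here on we only need to call them: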

    // required string name = 1;
    inline bool has_name() const;
    inline void clear_name();
    static const int kNameFieldNumber = 1;
    inline const ::std::string& name() const;
    inline void set_name(const ::std::string& value);
    inline void set_name(const char* value);
    inline void set_name(const char* value, size_t size);
    inline ::std::string* mutable_name();
    inline ::std::string* release_name();
    inline void set_allocated_name(::std::string* name);

    // optional string email = 2;
    inline bool has_email() const;
    inline void clear_email();
    static const int kEmailFieldNumber = 2;
    inline const ::std::string& email() const;
    inline void set_email(const ::std::string& value);
    inline void set_email(const char* value);
    inline void set_email(const char* value, size_t size);
    inline ::std::string* mutable_email();
    inline ::std::string* release_email();
    inline void set_allocated_email(::std::string* email);

    // @@protoc_insertion_point(class_scope:Person)

  private:
    inline void set_has_name();
    inline void clear_has_name();
    inline void set_has_email();
    inline void clear_has_email();

So protoc has produced the Model class we need, in person.pb.h and person.pb.cc, and all that remains is to call it. Here is a simple program that uses the generated Person Model.
The calling code is as follows:

#include <fstream>
#include <iostream>
#include <string>
#include "person.pb.h"
using namespace std;

// Main function:  Builds a Person, prints it, loads any existing Person from
//   the file named on the command line, then writes the Person back out to
//   the same file.
int main(int argc, char* argv[]) {
  // Verify that the version of the library that we linked against is
  // compatible with the version of the headers we compiled against.
  GOOGLE_PROTOBUF_VERIFY_VERSION;

  // argv[1] is used below as the data file, so make sure it was supplied.
  if (argc != 2) {
    cerr << "Usage: " << argv[0] << " PERSON_FILE" << endl;
    return -1;
  }
  // …
  Person person;  // Stack object: no new/delete, members are accessed with '.'.
  person.set_name("Mark CJ");
  person.set_email("markcjemail@google.com");
  cout << "Person Info : name " << person.name() << ", email : " << person.email() << endl;
  {
    // Read the existing person object. A successful parse replaces the
    // values set above with the ones stored in the file.
    fstream input(argv[1], ios::in | ios::binary);
    if (!input) {
      cout << argv[1] << ": File not found.  Creating a new file." << endl;
    } else if (!person.ParseFromIstream(&input)) {
      cerr << "Failed to parse person object." << endl;
      return -1;
    }
  }
  // …
  {
    // Write the new person object back to disk.
    fstream output(argv[1], ios::out | ios::trunc | ios::binary);
    if (!person.SerializeToOstream(&output)) {
      cerr << "Failed to write person object." << endl;
      return -1;
    }
  }
  // Optional:  Delete all global objects allocated by libprotobuf.
  google::protobuf::ShutdownProtobufLibrary();
  return 0;
}

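Advantages of Protobuf

Protobuf is a flexible, efficient, automated mechanism for serializing structured data. Think of XML, but smaller, faster, and simpler. You define your own data structure, then use the code produced by the code generator to read and write it. You can even update the data structure without having to redeploy programs built against the old format.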

Smaller and Faster

Compared with XML, a Protobuf message is only about 1/10 to 1/3 of the size.
For example, to transmit a Person object with name and email fields, the XML looks like this:

<person>
    <name>John Doe</name>
    <email>jdoe@example.com</email>
</person>

With whitespace stripped, this XML is 69 bytes and takes roughly 5,000-10,000 ns to parse.
The same message in Protobuf, rendered in human-readable text form, looks like this:

# Textual representation of a protocol buffer.
# This is *not* the binary format used on the wire.
person {
    name: "John Doe"
    email: "jdoe@example.com"
}

Once this Person message is converted to binary form (Protobuf uses a binary format on the wire), it is about 28 bytes and takes roughly 100-200 ns to parse.
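To see where the 28 and 69 byte figures come from, here is a minimal sketch that encodes the two-field Person message by hand. It is an illustration only, not what the generated code does internally. It assumes the two-field definition from "How It Works" (name = 1, email = 2), and the helper names append_varint, append_string_field, encode_person, and person_as_compact_xml are made up for this example. Each string field is written as a one-byte tag (field number shifted left by 3, combined with wire type 2 for length-delimited data), a length, and the raw bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends an unsigned integer as a base-128 varint: 7 payload bits per byte,
// with the high bit set on every byte except the last.
void append_varint(std::string& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<char>(value));
}

// Appends one length-delimited field (wire type 2): tag, byte length, bytes.
void append_string_field(std::string& out, std::uint32_t field_number,
                         const std::string& value) {
    assert(field_number >= 1);  // Protobuf field numbers start at 1.
    const std::uint32_t kLengthDelimited = 2;
    append_varint(out, (field_number << 3) | kLengthDelimited);
    append_varint(out, value.size());
    out.append(value);
}

// Hand-encodes the two-field Person message (name = 1, email = 2).
// Hypothetical helper for illustration; real code would call the
// serialization methods that protoc generates.
std::string encode_person(const std::string& name, const std::string& email) {
    std::string out;
    append_string_field(out, 1, name);
    append_string_field(out, 2, email);
    return out;
}

// The same record as XML with all whitespace between tags removed.
std::string person_as_compact_xml(const std::string& name,
                                  const std::string& email) {
    return "<person><name>" + name + "</name><email>" + email +
           "</email></person>";
}
```

Calling encode_person("John Doe", "jdoe@example.com") yields 28 bytes: 2 bytes of tag and length for each field, plus 8 and 16 bytes of text. person_as_compact_xml with the same arguments yields 69 bytes, most of which are tag names repeated in the opening and closing elements.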

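Below is a performance test comparing Protobuf with competing formats.

[Figure: serialization and deserialization time comparison. Light blue is serialization time, dark blue is deserialization time.]

[Figure: memory usage comparison.]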

Simpler to Write, Less Ambiguous

Printing the contents of the Model in C++ with Protobuf takes this code:

cout << "Name: " << person.name() << endl;
cout << "E-mail: " << person.email() << endl;

With XML, parsing the same Person information out of the document looks like this:

cout << "Name: "
       << person.getElementsByTagName("name")->item(0)->innerText()
       << endl;
cout << "E-mail: "
       << person.getElementsByTagName("email")->item(0)->innerText()
       << endl;

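Compared with JSON, besides having simpler semantics, Protobuf identifies every field by number, which removes the need for explicit version checks and makes backward compatibility essentially free, as long as field numbers are never reused.
As the examples above show, the required, optional, and repeated keywords also let us state in the schema itself the constraints that would otherwise be informal conventions checked by hand during debugging. A field's rule can be revised as the schema evolves, for example by turning an optional field into a required one, although such a change must be made with care because older writers may not supply the field.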
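The compatibility that numbered fields provide can be sketched with a hand-written reader. This is only an illustration under stated assumptions, not the library's parser: PersonV1, read_varint, and parse_person_v1 are hypothetical names, the reader knows only fields 1 (name) and 2 (email), and it handles only the varint and length-delimited wire types. Because every field on the wire carries its number and wire type, the reader can step over a field it has never heard of.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// What an "old" program knows about: only fields 1 and 2.
struct PersonV1 {
    std::string name;   // field 1
    std::string email;  // field 2
};

// Reads a base-128 varint starting at pos and advances pos past it.
// Returns false if the input ends before the varint is complete.
bool read_varint(const std::string& in, std::size_t& pos, std::uint64_t& value) {
    value = 0;
    for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
        const std::uint8_t byte = static_cast<std::uint8_t>(in[pos++]);
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;
    }
    return false;
}

// Parses fields 1 and 2 and skips every other field by its wire type.
bool parse_person_v1(const std::string& in, PersonV1& person) {
    std::size_t pos = 0;
    while (pos < in.size()) {
        std::uint64_t tag = 0;
        if (!read_varint(in, pos, tag)) return false;
        const std::uint64_t field_number = tag >> 3;
        const std::uint64_t wire_type = tag & 0x7;
        if (wire_type == 0) {  // varint: not used by PersonV1, so skip it
            std::uint64_t ignored = 0;
            if (!read_varint(in, pos, ignored)) return false;
        } else if (wire_type == 2) {  // length-delimited
            std::uint64_t length = 0;
            if (!read_varint(in, pos, length)) return false;
            if (length > in.size() - pos) return false;  // truncated payload
            const std::string payload =
                in.substr(pos, static_cast<std::size_t>(length));
            pos += static_cast<std::size_t>(length);
            assert(pos <= in.size());
            if (field_number == 1) person.name = payload;
            else if (field_number == 2) person.email = payload;
            // Any other field number is unknown to this reader and is skipped.
        } else {
            return false;  // other wire types are not handled in this sketch
        }
    }
    return true;
}
```

If a newer writer adds, say, a varint field numbered 3 between name and email, parse_person_v1 still returns true and fills in both strings. No version flag is consulted at any point, which is the sense in which numbered fields replace version checks.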

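Disadvantages of Protobuf

Protobuf's efficiency and compactness are impressive, but everything has weaknesses as well as strengths, and Protobuf is no exception.

Compared with XML, Protobuf is functionally rather simple. It cannot express more complex conceptual definitions, so requirements that call for them cannot be met effectively.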

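XML has been used widely and for a long time across many industries, to the point where it is the standard tool in some of them. Protobuf has so far seen its heaviest use inside Google, so it still has a long way to go before other industries adopt it more broadly.
To shrink the transmitted data and speed up parsing, Protobuf stores data in a binary format. As a result, data at rest or in transit is hard for humans to read, which makes it awkward to inspect intermediate data while debugging.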
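One common workaround when debugging is to dump the serialized bytes as hex so they can at least be compared by eye. The helper below is a small sketch of that idea; to_hex is a name made up for this post and is not part of the Protobuf library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Renders raw bytes as lowercase hex pairs separated by single spaces.
std::string to_hex(const std::string& bytes) {
    static const char kDigits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out.push_back(' ');
        out.push_back(kDigits[b >> 4]);
        out.push_back(kDigits[b & 0x0F]);
    }
    // Two digits per byte plus one separator between neighbouring bytes.
    assert(bytes.empty() ? out.empty() : out.size() == bytes.size() * 3 - 1);
    return out;
}
```

For a serialized name field holding "Mark" (the bytes 0x0A, 0x04, then the four letters), to_hex returns "0a 04 4d 61 72 6b". That is still far less convenient to read than the equivalent XML or JSON, which is exactly the drawback described above.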

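Compared with XML, Protobuf is not suited to transmitting markup-based documents. It is better suited to describing data structures, whereas XML can serve both purposes.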

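Compared with JSON, Protobuf still trailed slightly in serialization and deserialization speed in the comparisons I looked at, and this is an area where it needs strengthening. In addition, JSON remains more widely used than Protobuf for communication between servers and web clients, thanks to solid JSON support in native front-end APIs and third-party libraries. Protobuf does not yet have that breadth of support on the web.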

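Summary

This has only been a first look at Protobuf. I set out hoping to find something that could replace JSON. I did not quite find it, but the effort was not wasted. Protobuf is not yet widely used on mobile, because JSON currently performs comparably in most respects, but some of Protobuf's design ideas are well worth borrowing. If you would like to dig deeper, visit Google's official site and read the source code. My writing is far from polished, so criticism, corrections, and discussion are all welcome.

While learning Protobuf I also came across a nice project called Protostuff. It makes several optimizations on top of Protobuf, including the option to skip the precompilation step, and at first glance it appears more efficient than both XML and JSON. If you are interested, it is worth a closer look as well.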