Counting the bytes in a bytes field from the DebugString output of a protobuf message.
https://en.cppreference.com/w/cpp/language/string_literal
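This page describes the syntax of string literals. The most unusual form is R"delimiter(raw_characters)delimiter", the raw string literal: the text between the parentheses becomes the string content verbatim, with no escape processing, so multi-line text can be embedded in source code as is.
For example: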
tsecer@harry: cat RawString.cpp
#include <stdio.h>
int main()
{
    printf("%s",R"(
a
b
)");
    return 0;
}
tsecer@harry: g++ RawString.cpp
tsecer@harry: ./a.out
a
b
tsecer@harry:
https://en.cppreference.com/w/cpp/string/basic_string/operator%22%22s
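This treats ""ID as a literal operator applied to a string literal. The s suffix is not special: the standard library simply defines that operator so we do not have to. We can define our own operators in the same way. One advantage of such an operator is that the compiler passes in the literal's real length, not the C-string length that strlen would report.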
tsecer@harry: cat c++.string.operator.cpp
#include <string>
#include <stdio.h>
#include <iostream>
using namespace std::string_literals;
std::string operator ""tsecer(const char *org, size_t len)
{
    printf("orig str %s len %zu\n", org, len);
    return std::string("harry test");
}
int main()
{
    return printf("%s \n s suffix %s\n", "org"tsecer.c_str(), "org"s.c_str());
}
tsecer@harry: g++ c++.string.operator.cpp
c++.string.operator.cpp:7:22: warning: literal operator suffixes not preceded by ‘_’ are reserved for future standardization [-Wliteral-suffix]
std::string operator ""tsecer(const char *org, size_t len)
^~~~~~~~
tsecer@harry: ./a.out
orig str org len 3
harry test
s suffix org
tsecer@harry:
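The protobuf documentation shows that the bytes type is also represented by std::string in C++. I had long been under the impression that a std::string is a C string, that is, that its content must end at a '\0'. Only now do I understand that it stores arbitrary content together with a length, so zero bytes in the middle of the data are perfectly acceptable.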
https://stackoverflow.com/a/164274
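In protobuf, a message can be rendered in human-readable form through the DebugString interface of Message. This rendering is where a message's reflection is actually put to use.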
protobuf-master\src\google\protobuf\text_format.cc
void TextFormat::Printer::Print(const Message& message,
                                TextGenerator* generator) const {
  const Reflection* reflection = message.GetReflection();
  if (!reflection) {
    // This message does not provide any way to describe its structure.
    // Parse it again in an UnknownFieldSet, and display this instead.
    UnknownFieldSet unknown_fields;
    {
      std::string serialized = message.SerializeAsString();
      io::ArrayInputStream input(serialized.data(), serialized.size());
      unknown_fields.ParseFromZeroCopyStream(&input);
    }
    PrintUnknownFields(unknown_fields, generator);
    return;
  }
  const Descriptor* descriptor = message.GetDescriptor();
  auto itr = custom_message_printers_.find(descriptor);
  if (itr != custom_message_printers_.end()) {
    itr->second->Print(message, single_line_mode_, generator);
    return;
  }
  if (descriptor->full_name() == internal::kAnyFullTypeName && expand_any_ &&
      PrintAny(message, generator)) {
    return;
  }
  std::vector<const FieldDescriptor*> fields;
  if (descriptor->options().map_entry()) {
    fields.push_back(descriptor->field(0));
    fields.push_back(descriptor->field(1));
  } else {
    reflection->ListFields(message, &fields);
  }
  if (print_message_fields_in_index_order_) {
    std::sort(fields.begin(), fields.end(), FieldIndexSorter());
  }
  for (int i = 0; i < fields.size(); i++) {
    PrintField(message, reflection, fields[i], generator);
  }
  if (!hide_unknown_fields_) {
    PrintUnknownFields(reflection->GetUnknownFields(message), generator);
  }
}
protobuf-master\src\google\protobuf\stubs\strutil.cc
// ----------------------------------------------------------------------
// Escapes 'src' using C-style escape sequences, and appends the escaped string
// to 'dest'. This version is faster than calling CEscapeInternal as it computes
// the required space using a lookup table, and also does not do any special
// handling for Hex or UTF-8 characters.
// ----------------------------------------------------------------------
void CEscapeAndAppend(StringPiece src, string* dest) {
  size_t escaped_len = CEscapedLength(src);
  if (escaped_len == src.size()) {
    dest->append(src.data(), src.size());
    return;
  }
  size_t cur_dest_len = dest->size();
  dest->resize(cur_dest_len + escaped_len);
  char* append_ptr = &(*dest)[cur_dest_len];
  for (int i = 0; i < src.size(); ++i) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    switch (c) {
      case '\n': *append_ptr++ = '\\'; *append_ptr++ = 'n'; break;
      case '\r': *append_ptr++ = '\\'; *append_ptr++ = 'r'; break;
      case '\t': *append_ptr++ = '\\'; *append_ptr++ = 't'; break;
      case '\"': *append_ptr++ = '\\'; *append_ptr++ = '\"'; break;
      case '\'': *append_ptr++ = '\\'; *append_ptr++ = '\''; break;
      case '\\': *append_ptr++ = '\\'; *append_ptr++ = '\\'; break;
      default:
        if (!isprint(c)) {
          *append_ptr++ = '\\';
          *append_ptr++ = '0' + c / 64;
          *append_ptr++ = '0' + (c % 64) / 8;
          *append_ptr++ = '0' + c % 8;
        } else {
          *append_ptr++ = c;
        }
        break;
    }
  }
}
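As the code shows, protobuf by default writes each non-printable character as a backslash followed by a three-digit octal number. To count bytes with a regular expression we therefore have to count matches rather than characters, with each match standing for one original byte. Fortunately, regular expressions have a repetition syntax in which braces give the number of repetitions.
grep's -o option prints only the matched parts, one per line: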
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
tsecer@harry: cat tsecer.proto
syntax = "proto3";
message tsecer {
bytes harry = 1;
};
tsecer@harry: cat main.cpp
#include <cstdio>
#include <iostream>
#include <fstream>
#include <string>
#include "tsecer.pb.h"
#include <google/protobuf/util/json_util.h>
using namespace std;
int main(int argc, char* argv[]) {
    char str[] = "\00\01\02\03HelloWorld";
    tsecer msg;
    // sizeof(str) counts the terminating '\0', hence the trailing \000 in the output.
    msg.set_harry(std::string(str, sizeof(str)));
    printf("%s\n", msg.DebugString().c_str());
    return 0;
}
tsecer@harry: ./a.out
harry: "\000\001\002\003HelloWorld\000"
tsecer@harry:
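The pattern '\\[[:digit:]]\{3\}\|\\[nr"'\''\t]\|[^\]' used below has three alternatives:
\\[[:digit:]]\{3\} matches a \ followed by exactly three digits
\\[nr"'\''\t] matches a \ followed by one element of the set [nr"'\t]; the '\'' is only shell quoting for a literal single quote, and because a backslash is literal inside a bracket expression, the set also contains \ itself, which is why \\ is matched
[^\] matches any single character other than \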
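I wrote the pattern assuming that the alternatives are tried in the order they are written. The result does not actually depend on that order: the first two alternatives can only match at a \, while [^\] never matches a \, so at any position at most one alternative applies. Even with [^\] placed first, the output is the same.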
tsecer@harry: echo '"\000\001\002\003\n\\\\\tHelloWorld\000"' | grep -o '\\[[:digit:]]\{3\}\|\\[nr"'\''\t]\|[^\]'
"
\000
\001
\002
\003
\n
\\
\\
\t
H
e
l
l
o
W
o
r
l
d
\000
"