Using BertServer¶
Installation¶
The best way to install the server is via pip. Note that the server can be installed separately from BertClient or even on a different machine:
pip install bert-serving-server
Warning
The server MUST be running on Python >= 3.5 with TensorFlow >= 1.10 (one-point-ten). Again, the server does not support Python 2!
Command Line Interface¶
Once installed, you can use the command-line interface to start a BERT server:
bert-serving-start -model_dir /uncased_bert_model -num_worker 4
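The server can also be started from Python. Below is a minimal sketch using the package's get_args_parser and BertServer helpers; the model path and worker count are placeholders:
from bert_serving.server import BertServer
from bert_serving.server.helper import get_args_parser

# build the same arguments the CLI would parse
args = get_args_parser().parse_args(['-model_dir', '/uncased_bert_model',
                                     '-num_worker', '4'])
server = BertServer(args)
server.start()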
Server-side API¶
The server side is a CLI program, bert-serving-start; you can get the latest usage via:
bert-serving-start --help
Start a BertServer for serving
usage: bert-serving-server [-h] -model_dir MODEL_DIR
[-tuned_model_dir TUNED_MODEL_DIR]
[-ckpt_name CKPT_NAME] [-config_name CONFIG_NAME]
[-graph_tmp_dir GRAPH_TMP_DIR]
[-max_seq_len MAX_SEQ_LEN] [-cased_tokenization]
[-pooling_layer POOLING_LAYER [POOLING_LAYER ...]]
[-pooling_strategy {NONE,REDUCE_MAX,REDUCE_MEAN,REDUCE_MEAN_MAX,FIRST_TOKEN,LAST_TOKEN,CLS_POOLED,CLASSIFICATION,REGRESSION}]
[-mask_cls_sep] [-no_special_token]
[-show_tokens_to_client] [-no_position_embeddings]
[-num_labels NUM_LABELS] [-port PORT]
[-port_out PORT_OUT] [-http_port HTTP_PORT]
[-http_max_connect HTTP_MAX_CONNECT] [-cors CORS]
[-num_worker NUM_WORKER]
[-max_batch_size MAX_BATCH_SIZE]
[-priority_batch_size PRIORITY_BATCH_SIZE] [-cpu]
[-xla] [-fp16]
[-gpu_memory_fraction GPU_MEMORY_FRACTION]
[-device_map DEVICE_MAP [DEVICE_MAP ...]]
[-prefetch_size PREFETCH_SIZE]
[-fixed_embed_length] [-verbose] [-version]
Named Arguments¶
-verbose | turn on TensorFlow logging for debugging. Default: False |
-version | show program’s version number and exit |
File Paths¶
configure the path, checkpoint, and filenames of a pretrained/fine-tuned BERT model
-model_dir | directory of a pretrained BERT model |
-tuned_model_dir | directory of a fine-tuned BERT model (see the example after this table) |
-ckpt_name | filename of the checkpoint file. By default it is “bert_model.ckpt”, but for a fine-tuned model the name could be different. Default: “bert_model.ckpt” |
-config_name | filename of the JSON config file for BERT model. Default: “bert_config.json” |
-graph_tmp_dir | path to graph temp file |
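For instance, to serve a fine-tuned model you would typically keep -model_dir pointing at the original pretrained model (for the vocab and config files) and add -tuned_model_dir plus -ckpt_name. A sketch from Python; the tuned directory and checkpoint name below are hypothetical:
from bert_serving.server import BertServer
from bert_serving.server.helper import get_args_parser

args = get_args_parser().parse_args(['-model_dir', '/uncased_bert_model',
                                     '-tuned_model_dir', '/my_fine_tuned_model',  # hypothetical path
                                     '-ckpt_name', 'model.ckpt-12345'])  # hypothetical checkpoint name
BertServer(args).start()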
BERT Parameters¶
configure how the BERT model and pooling work
-max_seq_len | maximum length of a sequence; longer sequences will be trimmed on the right side. Set it to NONE to dynamically use the longest sequence in a (mini)batch. Default: 25 |
-cased_tokenization | whether the tokenizer should skip the default lowercasing and accent removal. Should be used for, e.g., the multilingual cased pretrained BERT model. Default: True |
-pooling_layer | the encoder layer(s) that pooling operates on. Give a list to concatenate several layers into one. Default: [-2] |
-pooling_strategy | Possible choices: NONE, REDUCE_MAX, REDUCE_MEAN, REDUCE_MEAN_MAX, FIRST_TOKEN, LAST_TOKEN, CLS_POOLED, CLASSIFICATION, REGRESSION. The pooling strategy for generating encoding vectors (see the example after this table). Default: REDUCE_MEAN |
-mask_cls_sep | mask the embeddings of [CLS] and [SEP] with zero. When pooling_strategy is in {CLS_TOKEN, FIRST_TOKEN, SEP_TOKEN, LAST_TOKEN} the embedding is preserved; otherwise it is masked to zero before pooling. Default: False |
-no_special_token | by default, [CLS] and [SEP] are added to every sequence; when this is True and is_tokenized=True on the client, sequences are fed to the model without [CLS] and [SEP]. Default: False |
-show_tokens_to_client | send tokenization results to the client. Default: False |
-no_position_embeddings | skip adding position embeddings for the position of each token in the sequence. Default: False |
-num_labels | number of labels. Default: 2 |
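As a concrete illustration of the pooling options, the following sketch serves token-level (unpooled) embeddings by combining -pooling_strategy NONE with -max_seq_len NONE; the model path is a placeholder:
from bert_serving.server import BertServer
from bert_serving.server.helper import get_args_parser

args = get_args_parser().parse_args([
    '-model_dir', '/uncased_bert_model',
    '-pooling_strategy', 'NONE',  # one vector per token instead of one per sequence
    '-max_seq_len', 'NONE',  # use the longest sequence in each mini-batch
    '-pooling_layer', '-2',  # take embeddings from the second-to-last encoder layer
])
BertServer(args).start()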
Serving Configs¶
configure how the server utilizes GPU/CPU resources
-port, -port_in, -port_data | server port for receiving data from the client. Default: 5555 |
-port_out, -port_result | server port for sending results to the client. Default: 5556 |
-http_port | server port for receiving HTTP requests |
-http_max_connect | maximum number of concurrent HTTP connections. Default: 10 |
-cors | setting “Access-Control-Allow-Origin” for HTTP requests Default: “*” |
-num_worker | number of server instances Default: 1 |
-max_batch_size | maximum number of sequences handled by each worker. Default: 256 |
-priority_batch_size | batches smaller than this size will be labeled as high priority and jump forward in the job queue. Default: 16 |
-cpu | running on CPU (default on GPU) Default: False |
-xla | enable XLA compiler (experimental) Default: False |
-fp16 | use float16 precision (experimental) Default: False |
-gpu_memory_fraction | the fraction of each visible GPU's overall memory to allocate per worker. Should be in range [0.0, 1.0]. Default: 0.5 |
-device_map | specify the list of GPU device IDs to use (IDs start from 0). If num_worker > len(device_map), devices will be reused; if num_worker < len(device_map), only device_map[:num_worker] will be used (see the sketch after this table). Default: [] |
-prefetch_size | the number of batches to prefetch on each worker. When running on a CPU-only machine, this is set to 0 for compatibility. Default: 10 |
-fixed_embed_length | when max_seq_len is set to NONE, the server determines max_seq_len from the actual sequence lengths within each batch. When pooling_strategy=NONE, this may cause two .encode() calls from the same client to return results of different sizes [B, T, D]. Turn this on to fix T in [B, T, D] to max_position_embeddings from the BERT JSON config. Default: False |
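To make the resource options concrete, the sketch below runs four workers pinned to GPUs 0-3, each allowed at most half of its GPU's memory; the model path and device IDs are illustrative:
from bert_serving.server import BertServer
from bert_serving.server.helper import get_args_parser

args = get_args_parser().parse_args([
    '-model_dir', '/uncased_bert_model',
    '-num_worker', '4',  # one server instance per GPU
    '-device_map', '0', '1', '2', '3',  # pin workers to GPUs 0-3
    '-gpu_memory_fraction', '0.5',  # cap each worker at half of its GPU's memory
])
BertServer(args).start()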
Server-side Benchmark¶
If you want to benchmark the speed, you may use:
bert-serving-benchmark --help
Benchmark BertServer locally
usage: bert-serving-benchmark [-h] -model_dir MODEL_DIR
[-tuned_model_dir TUNED_MODEL_DIR]
[-ckpt_name CKPT_NAME]
[-config_name CONFIG_NAME]
[-graph_tmp_dir GRAPH_TMP_DIR]
[-max_seq_len MAX_SEQ_LEN] [-cased_tokenization]
[-pooling_layer POOLING_LAYER [POOLING_LAYER ...]]
[-pooling_strategy {NONE,REDUCE_MAX,REDUCE_MEAN,REDUCE_MEAN_MAX,FIRST_TOKEN,LAST_TOKEN,CLS_POOLED,CLASSIFICATION,REGRESSION}]
[-mask_cls_sep] [-no_special_token]
[-show_tokens_to_client]
[-no_position_embeddings]
[-num_labels NUM_LABELS] [-port PORT]
[-port_out PORT_OUT] [-http_port HTTP_PORT]
[-http_max_connect HTTP_MAX_CONNECT]
[-cors CORS] [-num_worker NUM_WORKER]
[-max_batch_size MAX_BATCH_SIZE]
[-priority_batch_size PRIORITY_BATCH_SIZE]
[-cpu] [-xla] [-fp16]
[-gpu_memory_fraction GPU_MEMORY_FRACTION]
[-device_map DEVICE_MAP [DEVICE_MAP ...]]
[-prefetch_size PREFETCH_SIZE]
[-fixed_embed_length] [-verbose] [-version]
[-test_client_batch_size [TEST_CLIENT_BATCH_SIZE [TEST_CLIENT_BATCH_SIZE ...]]]
[-test_max_batch_size [TEST_MAX_BATCH_SIZE [TEST_MAX_BATCH_SIZE ...]]]
[-test_max_seq_len [TEST_MAX_SEQ_LEN [TEST_MAX_SEQ_LEN ...]]]
[-test_num_client [TEST_NUM_CLIENT [TEST_NUM_CLIENT ...]]]
[-test_pooling_layer [TEST_POOLING_LAYER [TEST_POOLING_LAYER ...]]]
[-wait_till_ready WAIT_TILL_READY]
[-client_vocab_file CLIENT_VOCAB_FILE]
[-num_repeat NUM_REPEAT]
Named Arguments¶
-verbose | turn on TensorFlow logging for debugging. Default: False |
-version | show program’s version number and exit |
File Paths¶
configure the path, checkpoint, and filenames of a pretrained/fine-tuned BERT model
-model_dir | directory of a pretrained BERT model |
-tuned_model_dir | directory of a fine-tuned BERT model |
-ckpt_name | filename of the checkpoint file. By default it is “bert_model.ckpt”, but for a fine-tuned model the name could be different. Default: “bert_model.ckpt” |
-config_name | filename of the JSON config file for BERT model. Default: “bert_config.json” |
-graph_tmp_dir | path to graph temp file |
BERT Parameters¶
configure how the BERT model and pooling work
-max_seq_len | maximum length of a sequence; longer sequences will be trimmed on the right side. Set it to NONE to dynamically use the longest sequence in a (mini)batch. Default: 25 |
-cased_tokenization | whether the tokenizer should skip the default lowercasing and accent removal. Should be used for, e.g., the multilingual cased pretrained BERT model. Default: True |
-pooling_layer | the encoder layer(s) that pooling operates on. Give a list to concatenate several layers into one. Default: [-2] |
-pooling_strategy | Possible choices: NONE, REDUCE_MAX, REDUCE_MEAN, REDUCE_MEAN_MAX, FIRST_TOKEN, LAST_TOKEN, CLS_POOLED, CLASSIFICATION, REGRESSION. The pooling strategy for generating encoding vectors. Default: REDUCE_MEAN |
-mask_cls_sep | mask the embeddings of [CLS] and [SEP] with zero. When pooling_strategy is in {CLS_TOKEN, FIRST_TOKEN, SEP_TOKEN, LAST_TOKEN} the embedding is preserved; otherwise it is masked to zero before pooling. Default: False |
-no_special_token | by default, [CLS] and [SEP] are added to every sequence; when this is True and is_tokenized=True on the client, sequences are fed to the model without [CLS] and [SEP]. Default: False |
-show_tokens_to_client | send tokenization results to the client. Default: False |
-no_position_embeddings | skip adding position embeddings for the position of each token in the sequence. Default: False |
-num_labels | number of labels. Default: 2 |
Serving Configs¶
configure how the server utilizes GPU/CPU resources
-port, -port_in, -port_data | server port for receiving data from the client. Default: 5555 |
-port_out, -port_result | server port for sending results to the client. Default: 5556 |
-http_port | server port for receiving HTTP requests |
-http_max_connect | maximum number of concurrent HTTP connections. Default: 10 |
-cors | setting “Access-Control-Allow-Origin” for HTTP requests Default: “*” |
-num_worker | number of server instances Default: 1 |
-max_batch_size | maximum number of sequences handled by each worker. Default: 256 |
-priority_batch_size | batches smaller than this size will be labeled as high priority and jump forward in the job queue. Default: 16 |
-cpu | running on CPU (default on GPU) Default: False |
-xla | enable XLA compiler (experimental) Default: False |
-fp16 | use float16 precision (experimental) Default: False |
-gpu_memory_fraction | the fraction of each visible GPU's overall memory to allocate per worker. Should be in range [0.0, 1.0]. Default: 0.5 |
-device_map | specify the list of GPU device IDs to use (IDs start from 0). If num_worker > len(device_map), devices will be reused; if num_worker < len(device_map), only device_map[:num_worker] will be used. Default: [] |
-prefetch_size | the number of batches to prefetch on each worker. When running on a CPU-only machine, this is set to 0 for compatibility. Default: 10 |
-fixed_embed_length | when max_seq_len is set to NONE, the server determines max_seq_len from the actual sequence lengths within each batch. When pooling_strategy=NONE, this may cause two .encode() calls from the same client to return results of different sizes [B, T, D]. Turn this on to fix T in [B, T, D] to max_position_embeddings from the BERT JSON config. Default: False |
Benchmark parameters¶
configure the experiments of the benchmark
-test_client_batch_size | Default: [1, 16, 256, 4096] |
-test_max_batch_size | Default: [8, 32, 128, 512] |
-test_max_seq_len | Default: [32, 64, 128, 256] |
-test_num_client | Default: [1, 4, 16, 64] |
-test_pooling_layer | Default: [[-1], [-2], [-3], [-4], [-5], [-6], [-7], [-8], [-9], [-10], [-11], [-12]] |
-wait_till_ready | seconds to wait until the server is ready to serve. Default: 30 |
-client_vocab_file | file path for building the client vocabulary. Default: “README.md” |
-num_repeat | number of repeats per experiment (must be >2), as the first two results are omitted to account for warm-up effects. Default: 10 |
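For example, to sweep only over the number of concurrent clients while leaving the other grids at their defaults, an invocation might look like:
bert-serving-benchmark -model_dir /uncased_bert_model -test_num_client 1 4 16 64 -num_repeat 10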