Pre-train Albert with a custom corpus for domain-specific texts and further fine-tune the pre-trained model for application tasks
This post illustrates the simple steps to pre-train the state-of-the-art Albert NLP model on a custom corpus and to further fine-tune the pre-trained Albert model on specific downstream tasks. The custom corpus can be domain-specific text, a foreign language, or anything else for which no pre-trained Albert model exists. The downstream task is application-driven: it can be a document classification problem (e.g. sentiment analysis), a token tagging problem (e.g. NER), and so on. This post illustrates a custom entity recognition downstream task. The sample use case is to extract dish names from restaurant reviews, e.g. to tag “mala steamboat” as a dish name in the review “I like the mala steamboat a lot!”
However, the details of the Albert model and how it works are not covered in this post. The post assumes you have a rough understanding of transformers and Albert models, and that you are able to clone and run scripts from git repos.
My GitHub repo containing the illustration notebook can be found here: https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus
Toy Data Set
The toy data set used in this post includes:
1. Restaurant review corpus: two review sentences in this toy example, used to pre-train the Albert model. In a real application we would typically train on 100M+ sentences.
2. Dish names extracted from restaurant reviews, used to fine-tune Albert for dish name recognition. The training set consists of the same two review sentences, with the dish names extracted.
The evaluation set consists of two similar review sentences, but with the dishes swapped between the two reviews and a slightly different text context containing an out-of-vocabulary word (a word that is not in the restaurant review corpus), to validate the effectiveness of the Albert model.
The first step is to build the vocabulary for the restaurant review corpus.
Google did not open-source the WordPiece unsupervised tokenizer. However, we can modify the open-sourced SentencePiece or the t2t text encoder subword builder to generate a vocabulary that is compatible with Bert/Albert models.
An open-source implementation that modifies the t2t text encoder subword builder can be found at https://github.com/kwonmha/bert-vocab-builder
- Clone the repo (create an environment and install the requirements accordingly)
- Prepare the corpus files, in this case the restaurant review corpus
- Run the command below to create Albert-compatible vocabulary files
python subword_builder.py --corpus_filepattern "" --output_filename --min_count minimum_subtoken_counts
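To make the role of the vocabulary file a bit more concrete, here is a minimal sketch of how a WordPiece-style vocabulary is typically applied at tokenization time: each word is split greedily into the longest matching subwords, and continuation pieces carry a "##" prefix. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for illustration. This is not the code from the vocab builder or from the Albert repo.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No subword matched at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, as a vocab builder might produce it.
toy_vocab = {"[UNK]", "i", "like", "the", "mala", "steam", "##boat", "a", "lot", "!"}

tokens = []
for word in "i like the mala steamboat".split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(tokens)  # ['i', 'like', 'the', 'mala', 'steam', '##boat']
```

A word with no matching pieces, such as "pizza" here, falls back to `[UNK]`. That is why the vocabulary should be built from a corpus that resembles the text you plan to fine-tune on.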
The next step is to pre-train Albert with the custom corpus. The official Albert repo can be found at https://github.com/google-research/ALBERT
- Clone the repo
- Create an environment and install the requirements accordingly
Create Albert pre-training files
Run the command below using the custom corpus and the vocabulary file created in the Build Vocab step. This step first tokenizes and encodes the corpus based on the vocabulary file. It then creates the training input files for Albert, to be trained on the masked language model and sentence order prediction tasks.
python create_pretraining_data.py --input_file "corpus" --output_file output_file --vocab_file vocab_file
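For intuition about the two pre-training tasks, here is a heavily simplified sketch of what a training example looks like. For sentence order prediction, two consecutive segments are sometimes swapped and the label says whether they were. For the masked language model, some tokens are hidden and the model must recover them. The helper names and the 15% masking rate are assumptions for illustration. The real create_pretraining_data.py script is more involved and this sketch does not reproduce it.

```python
import random


def make_sop_example(segment_a, segment_b, rng):
    """Return (first, second, label); label 1 means the segments were swapped."""
    if rng.random() < 0.5:
        return segment_b, segment_a, 1
    return segment_a, segment_b, 0


def mask_tokens(tokens, rng, mask_prob=0.15, mask_token="[MASK]"):
    """Replace a random subset of tokens with [MASK] and remember the originals."""
    # Special tokens are never masked.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    num_to_mask = max(1, round(len(candidates) * mask_prob))
    positions = sorted(rng.sample(candidates, num_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model should predict at this position
        masked[pos] = mask_token
    return masked, targets


rng = random.Random(0)
first, second, sop_label = make_sop_example(
    ["i", "like", "the", "mala", "steam", "##boat"], ["it", "is", "tasty"], rng)
sequence = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
masked_sequence, mlm_targets = mask_tokens(sequence, rng)
print(sop_label, masked_sequence, mlm_targets)
```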
Pre-train Albert model
- Prepare the Albert config file. The vocab_size in the Albert config file should be the same as the total number of entries in the vocab file created in the Build Vocab step. A sample Albert config file can be found at https://tfhub.dev/google/albert_base/2, which is the config file used to build the Albert base v2 model.
- Run the command below to start pre-training. All the model parameters will be saved in the output_dir.
python run_pretraining.py --input_file=albert_pretrain_file --output_dir=models_dir --albert_config_file=
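Since a mismatch between vocab_size and the vocab file is an easy mistake to make, a quick sanity check before launching pre-training can save a failed run. The sketch below assumes the vocab file has one entry per line and the config is a JSON file with a `vocab_size` key. The file names and the helper `read_sizes` are hypothetical, and the files are written to a temporary directory only so the example is self-contained.

```python
import json
import os
import tempfile


def read_sizes(config_path, vocab_path):
    """Return (vocab_size from the config, number of non-empty lines in the vocab file)."""
    with open(config_path, encoding="utf-8") as f:
        config_size = json.load(f)["vocab_size"]
    with open(vocab_path, encoding="utf-8") as f:
        vocab_count = sum(1 for line in f if line.strip())
    return config_size, vocab_count


with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")              # hypothetical name
    config_path = os.path.join(tmp_dir, "albert_config.json")    # hypothetical name
    entries = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "mala", "steam", "##boat"]
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(entries) + "\n")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump({"vocab_size": len(entries)}, f)

    config_size, vocab_count = read_sizes(config_path, vocab_path)
    if config_size != vocab_count:
        raise ValueError(f"vocab_size={config_size} but vocab file has {vocab_count} entries")
    print("vocab_size matches:", config_size)
```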
The final step is to fine-tune the pre-trained Albert to perform the dish name recognition task. Refer to the notebook for this step. The code snippets in this post do not concatenate into the full notebook; they are for illustration purposes only. Please read the full notebook if anything is unclear.
Process input
To have input sequences of uniform length, we set a max length of 50 tokens, truncate a sequence if it exceeds that length, and pad it otherwise. We can use Keras’s pad_sequences for this purpose.
Albert models take input in the following format:
- input_ids [batch_size * max_seq_len]: the tokenized and encoded text sequence in this case
- input_mask [batch_size * max_seq_len]: 1 if the token is encoded from the original sequence, 0 if padded; the loss is only calculated from tokens of the original sequence
- segment_ids [batch_size * max_seq_len]: zero tensor arrays, not used in this case
- labels [batch_size * max_seq_len]: 1 if the token is part of a dish name, 0 otherwise
We will process the input into this format.
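As a plain-Python sketch of the format described above, the snippet below pads or truncates one example and builds the four fields. The helper names `pad_or_truncate` and `build_features` are made up for this illustration, and a pad id of 0 is an assumption.

```python
def pad_or_truncate(seq, max_len, pad_value=0):
    """Cut the sequence at max_len, or right-pad it with pad_value up to max_len."""
    return list(seq[:max_len]) + [pad_value] * max(0, max_len - len(seq))


def build_features(token_ids, token_labels, max_len):
    """Build input_ids, input_mask, segment_ids and labels for one example."""
    return {
        "input_ids": pad_or_truncate(token_ids, max_len),
        # 1 for real tokens, 0 for padding.
        "input_mask": pad_or_truncate([1] * len(token_ids), max_len),
        # Single-segment input, so all zeros.
        "segment_ids": [0] * max_len,
        "labels": pad_or_truncate(token_labels, max_len),
    }


features = build_features(token_ids=[5, 6, 7], token_labels=[0, 1, 1], max_len=5)
print(features["input_ids"])   # [5, 6, 7, 0, 0]
print(features["input_mask"])  # [1, 1, 1, 0, 0]
```

Here the mask is derived from the original sequence length. Deriving it from `input_ids > 0` also works, as long as no real token is encoded as 0.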
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenize each review, then label each token (1 if part of the dish name, 0 otherwise).
df_data['text_tokens'] = df_data.text.apply(tokenizer.tokenize)
df_data['text_labels'] = df_data.apply(
    lambda row: label_sent(row['name'].lower().split(), row['text_tokens']), axis=1)
# Keep only the reviews in which the dish name was found.
df_data_sampled = df_data[[np.sum(label) > 0 for label in df_data.text_labels]]
# Encode the tokens as ids, then pad or truncate everything to MAX_LEN.
input_ids = pad_sequences(
    [tokenizer.convert_tokens_to_ids(txt) for txt in df_data_sampled['text_tokens']],
    maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
labels = pad_sequences(df_data_sampled['text_labels'],
    maxlen=MAX_LEN, padding="post",
    dtype="long", truncating="post")
# Create the mask to ignore the padded elements in the sequences.
input_mask = [[int(i > 0) for i in ii] for ii in input_ids]
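The snippet above calls a `label_sent` helper that is defined in the notebook and not shown in this post. As a rough idea of what such a helper could do, here is a hypothetical version that marks every occurrence of the dish name tokens inside the sentence tokens. It is an assumption for illustration, not the notebook's implementation, and it only matches when the name is tokenized the same way as the sentence.

```python
def label_sent(name_tokens, sent_tokens):
    """Return one 0/1 label per sentence token; 1 marks tokens of the dish name."""
    labels = [0] * len(sent_tokens)
    n = len(name_tokens)
    if n == 0:
        return labels
    # Slide a window over the sentence and mark exact matches of the name.
    for start in range(len(sent_tokens) - n + 1):
        if sent_tokens[start:start + n] == name_tokens:
            labels[start:start + n] = [1] * n
    return labels


sentence = ["i", "like", "the", "mala", "steam", "##boat", "a", "lot", "!"]
print(label_sent(["mala", "steam", "##boat"], sentence))
# [0, 0, 0, 1, 1, 1, 0, 0, 0]
```

If the dish name never appears, all labels are 0. Such rows are the ones the filtering step above drops before training.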
Batch up the input tensors.
d = tf.data.Dataset.from_tensor_slices((
    {
        "input_ids": tf.constant(input_ids, shape=[num_examples, seq_length], dtype=tf.int32),
        "input_mask": tf.constant(input_mask, shape=[num_examples, seq_length], dtype=tf.int32),
        "segment_ids": tf.zeros(shape=[num_examples, seq_length], dtype=tf.int32),
    },
    tf.constant(labels, shape=[num_examples, seq_length], dtype=tf.int32),
))
d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
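The effect of drop_remainder is easy to miss, so here is a small pure-Python sketch of the batching behaviour. The function `batch_examples` is made up for illustration and only mimics what the dataset's batch call does with a final, smaller batch.

```python
def batch_examples(examples, batch_size, drop_remainder=False):
    """Group examples into consecutive batches of batch_size."""
    batches = [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]
    # Optionally discard a final batch that is smaller than batch_size.
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


print(batch_examples([0, 1, 2, 3, 4], batch_size=2))                       # [[0, 1], [2, 3], [4]]
print(batch_examples([0, 1, 2, 3, 4], batch_size=2, drop_remainder=True))  # [[0, 1], [2, 3]]
```

Dropping the remainder keeps every batch the same shape, which matters when training on hardware that expects fixed shapes, such as TPUs. For evaluation it is usually better to keep the remainder, so that no example is skipped.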
Create Albert fine-tune model
Add one more layer after the pre-trained Albert model to predict whether each token in an input sequence is part of a dish name or not.
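def create_model(albert_config, mode, input_ids, input_mask, segment_ids, labels, num_labels):
    """Creates a classification model."""
    is_training = mode == tf.estimator.ModeKeys.TRAIN
    # Arguments reconstructed from the function parameters; see the notebook for the exact call.
    model = modeling.AlbertModel(
        config=albert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids)
    # Per-token hidden states, shape [batch_size, seq_length, hidden_size].
    output_layer = model.get_sequence_output()
    hidden_size = output_layer.shape[-1].value
    # Initializer assumed here (a common choice for Bert-style heads); the post truncates this argument.
    output_weight = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())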
Log-loss is used on this case.
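# Flatten to [batch_size * seq_length, hidden_size] so that a 2-D matmul applies.
output_layer = tf.reshape(output_layer, [-1, hidden_size])
logits = tf.matmul(output_layer, output_weight, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
# Restore the per-token layout: [batch_size, MAX_LEN, num_labels].
logits = tf.reshape(logits, [-1, MAX_LEN, num_labels])
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_sum(per_example_loss)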
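To see what this loss computes without any TensorFlow machinery, here is the same calculation for a single sequence in plain Python: a log-softmax over each token's logits, then the negative log-probability of the true label, summed over tokens. The function names are made up for this sketch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def token_log_loss(logits_per_token, labels):
    """Sum of the negative log-probabilities of the true label at each token."""
    total = 0.0
    for logits, label in zip(logits_per_token, labels):
        total += -log_softmax(logits)[label]
    return total


# Two tokens, two classes (0 = not a dish name, 1 = dish name).
confident = token_log_loss([[4.0, -4.0], [-4.0, 4.0]], labels=[0, 1])
undecided = token_log_loss([[0.0, 0.0], [0.0, 0.0]], labels=[0, 1])
print(confident < undecided)  # True
```

With equal logits each token contributes log(2), so the undecided case above equals 2 * log(2). Confident and correct predictions push the loss towards 0.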
An Adam optimiser with weight decay is used for learning in this case.
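For readers who have not seen Adam with weight decay written out, below is a sketch of a single update for one scalar parameter. It assumes the decoupled form often used with Bert-family models, where the decay term is added to the update and is not folded into the gradient, and bias correction is left out. The hyperparameter values are placeholders, and this is not the optimizer code from the Albert repo.

```python
import math


def adam_weight_decay_step(param, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                           eps=1e-6, weight_decay=0.01):
    """One Adam update with decoupled weight decay for a single scalar parameter."""
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    update = m / (math.sqrt(v) + eps)
    update += weight_decay * param                # decay pulls the weight towards 0
    return param - lr * update, m, v


param, m, v = 1.0, 0.0, 0.0
for _ in range(3):
    param, m, v = adam_weight_decay_step(param, grad=0.5, m=m, v=v)
print(param)
```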
Train fine-tuning model
train_op = optimization.create_optimizer(
    total_loss, learning_rate, num_train_steps, num_warmup_steps,
    use_tpu, optimizer)

output_spec = tf.contrib.tpu.TPUEstimatorSpec(
    # ... (see the notebook for the remaining arguments)
    scaffold_fn=scaffold_fn)

train_input_fn = input_fn_builder(
    file="data_toy/dish_name_train.csv",
    tokenizer=tokenizer,
    # ... (see the notebook for the remaining arguments)
    )
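The optimizer call above takes both num_train_steps and num_warmup_steps, which points to a learning-rate schedule. The sketch below assumes the schedule commonly used for Bert-family fine-tuning: a linear warmup from 0 to the initial rate, followed by a linear decay to 0 at the last step. The exact schedule is not shown in this post, so treat this as an assumption and check optimization.py in the repo.

```python
def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warmup followed by linear decay to zero (assumed schedule)."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = max(0, num_train_steps - step)
    return init_lr * remaining / num_train_steps


# Example values only: 200 training steps with 20 warmup steps.
for step in (0, 10, 20, 100, 200):
    print(step, learning_rate_at(step, init_lr=1e-4, num_train_steps=200, num_warmup_steps=20))
```

Warmup keeps the first updates small while the freshly initialised output layer is still producing large, noisy gradients.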
Evaluate fine-tuning model
Loss and accuracy are reported for evaluation.
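def metric_fn(per_example_loss, label_ids, logits):
    # Predicted class per token is the index of the largest logit.
    predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
    accuracy = tf.metrics.mean(tf.math.equal(label_ids, predictions))
    loss = tf.metrics.mean(values=per_example_loss)
    # Key names are illustrative; the post does not show the returned dict.
    return {"eval_accuracy": accuracy, "eval_loss": loss}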
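The accuracy computed by this metric function is a per-token accuracy. A plain-Python equivalent, with a made-up function name, looks roughly like this:

```python
def token_accuracy(logits, label_ids):
    """Fraction of tokens whose highest-scoring class equals the label."""
    correct = 0
    total = 0
    for seq_logits, seq_labels in zip(logits, label_ids):
        for token_logits, label in zip(seq_logits, seq_labels):
            # argmax over the class dimension
            prediction = max(range(len(token_logits)), key=token_logits.__getitem__)
            correct += int(prediction == label)
            total += 1
    return correct / total if total else 0.0


toy_logits = [[[2.0, -1.0], [0.1, 0.9], [0.3, 0.2], [1.5, 0.5]]]
toy_labels = [[0, 1, 1, 0]]
print(token_accuracy(toy_logits, toy_labels))  # 0.75
```

Padded positions count as tokens in this calculation too. With short reviews and a max length of 50, most positions are padding labelled 0, so a high accuracy alone does not show that the dish names are being found.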
With 100 epochs of training (in the notebook, the batch size is set to 1 and the number of training steps to 200, which with two training sentences is effectively 100 epochs), the model is able to extract the correct dish name in the evaluation set, even though the context text is slightly different and even contains an out-of-vocabulary word.
In short, Albert is a great NLP breakthrough, and thanks to open source, it is quite straightforward for us to build custom applications on top of state-of-the-art models.
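The model outputs one 0/1 label per token, so a small post-processing step is needed to turn predictions back into readable dish names. Here is one possible sketch, assuming WordPiece-style tokens where "##" marks a continuation piece. The function `extract_dish_names` is hypothetical and is not taken from the notebook.

```python
def extract_dish_names(tokens, predicted_labels):
    """Join consecutive tokens labelled 1 into dish name strings."""
    names = []
    current = []
    for token, label in zip(tokens, predicted_labels):
        if label == 1:
            if token.startswith("##") and current:
                current[-1] += token[2:]   # glue the subword onto the previous piece
            else:
                current.append(token)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


tokens = ["i", "like", "the", "mala", "steam", "##boat", "a", "lot", "!"]
predicted = [0, 0, 0, 1, 1, 1, 0, 0, 0]
print(extract_dish_names(tokens, predicted))  # ['mala steamboat']
```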
M. H. Kwon, Bert-Vocab-Builder, GitHub repo