Advanced NLP models from R


Intro

The Transformers repository from "Hugging Face" contains a lot of ready-to-use, state-of-the-art models, which are straightforward to download and fine-tune with TensorFlow & Keras.

For this purpose, users typically need to obtain:

  • The model itself (e.g. BERT, ALBERT, RoBERTa, GPT-2, etc.)
  • The tokenizer object
  • The weights of the model

In this post, we will work on a classic binary classification task and train our dataset on 3 models: GPT-2, RoBERTa, and Electra.

However, readers should be aware that one can work with transformers on a variety of downstream tasks, such as the following (a quick pipeline sketch is shown after the list):

  1. feature extraction
  2. sentiment analysis
  3. text classification
  4. question answering
  5. summarization
  6. translation and many more
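
As a quick illustration of such a downstream task, the sketch below uses the library's high-level pipeline API through reticulate. This assumes transformers is already installed (as described in the Requirements section below); the model the pipeline downloads by default is chosen by the library, not by this post.

 # a minimal sketch: sentiment analysis via the ready-made transformers pipeline
 transformer = reticulate::import('transformers')
 classifier  = transformer$pipeline('sentiment-analysis')

 classifier("This movie was surprisingly good!")
 # returns the predicted label (e.g. POSITIVE) together with a confidence score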

Requirements

Our first job is to install the transformers package via reticulate.

 reticulate::py_install('transformers', pip = TRUE)

Then, as usual, load the standard 'Keras', 'TensorFlow' >= 2.0 and some classic libraries from R.
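
The exact list of libraries is not shown above; a minimal sketch of what the remaining code relies on (the keras and tensorflow bindings, the tfdatasets helpers such as tensor_slices_dataset(), the pipe operator, and a handle to the Python transformers module named transformer) could look like this:

 # R packages used throughout the post
 library(keras)
 library(tensorflow)
 library(tfdatasets)   # tensor_slices_dataset(), dataset_batch(), ...
 library(dplyr)        # provides the %>% pipe

 # import the Python transformers package; the handle name `transformer`
 # is the one used in the rest of the post
 transformer = reticulate::import('transformers')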

Note that if running TensorFlow on a GPU, one could specify the following parameters in order to avoid memory issues.

 physical_devices = tf$config$list_physical_devices('GPU')
 tf$config$experimental$set_memory_growth(physical_devices[[1]], TRUE)

 tf$keras$backend$set_floatx('float32')

Template

We already mentioned that in order to train data on a particular model, users should download the model, its tokenizer object and weights. For example, to get a RoBERTa model one has to do the following:

 # get Tokenizer
 transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case = TRUE)

 # get Model with weights
 transformer$TFRobertaModel$from_pretrained('roberta-base')

Data preparation

A dataset for binary classification is provided in the text2vec package. Let's load the dataset and take a sample for fast model training.
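
The loading step is sketched here rather than prescribed: assuming the movie_review dataset that ships with text2vec (review text plus a 0/1 sentiment label) and renaming its columns to the comment_text / target names used further below, it could look like this:

 # load a sample of the movie_review dataset shipped with text2vec
 library(text2vec)

 df = movie_review %>%
   dplyr::rename(comment_text = review, target = sentiment) %>%  # column names assumed by the code below
   dplyr::sample_n(2000) %>%                                     # small sample for fast training
   data.table::as.data.table()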

Split our data into 2 parts:

 idx_train = sample.int(nrow(df) * 0.8)

 train = df[idx_train, ]
 test  = df[!idx_train, ]   # "not these rows"; requires a data.table (use -idx_train for a plain data.frame)

Data input for Keras

So far, we have only covered data import and the train-test split. To feed input to the network, we have to turn our raw text into indices via the imported tokenizer, and then adapt the model to do binary classification by adding a dense layer with a single unit at the end.
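
To make the first step concrete, here is a small sketch of what the tokenizer produces for a single string (reusing the RoBERTa tokenizer from the template section; the example sentence is made up):

 # the tokenizer turns raw text into a sequence of integer token ids
 tokenizer = transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case = TRUE)

 ids = tokenizer$encode("This movie was surprisingly good!",
                        max_length = 50L, truncation = TRUE)
 ids  # token ids (at most max_length of them), including the special start/end tokens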

However, we want to train our data on 3 models: GPT-2, RoBERTa, and Electra. We need to write a loop for that.

Note: one model in general requires 500-700 MB.

# list of 3 models
ai_m = list(
  c('TFGPT2Model',    'GPT2Tokenizer',    'gpt2'),
  c('TFRobertaModel', 'RobertaTokenizer', 'roberta-base'),
  c('TFElectraModel', 'ElectraTokenizer', 'google/electra-small-generator')
)

# parameters
max_len = 50L
epochs = 2
batch_size = 10

# create a list to collect model results
gather_history = list()

for (i in 1:length(ai_m)) {

  # tokenizer
  tokenizer = glue::glue("transformer${ai_m[[i]][2]}$from_pretrained('{ai_m[[i]][3]}', do_lower_case=TRUE)") %>%
    rlang::parse_expr() %>% eval()

  # model
  model_ = glue::glue("transformer${ai_m[[i]][1]}$from_pretrained('{ai_m[[i]][3]}')") %>%
    rlang::parse_expr() %>% eval()

  # inputs
  text = list()
  # outputs
  label = list()

  data_prep = function(data) {
    for (i in 1:nrow(data)) {

      txt = tokenizer$encode(data[['comment_text']][i], max_length = max_len,
                             truncation = TRUE) %>%
        t() %>%
        as.matrix() %>% list()
      lbl = data[['target']][i] %>% t()

      text = text %>% append(txt)
      label = label %>% append(lbl)
    }
    list(do.call(plyr::rbind.fill.matrix, text), do.call(plyr::rbind.fill.matrix, label))
  }

  train_ = data_prep(train)
  test_ = data_prep(test)

  # slice the dataset
  tf_train = tensor_slices_dataset(list(train_[[1]], train_[[2]])) %>%
    dataset_batch(batch_size = batch_size, drop_remainder = TRUE) %>%
    dataset_shuffle(128) %>% dataset_repeat(epochs) %>%
    dataset_prefetch(tf$data$experimental$AUTOTUNE)

  tf_test = tensor_slices_dataset(list(test_[[1]], test_[[2]])) %>%
    dataset_batch(batch_size = batch_size)

  # create an input layer
  input = layer_input(shape = c(max_len), dtype = 'int32')
  hidden_mean = tf$reduce_mean(model_(input)[[1]], axis = 1L) %>%
    layer_dense(64, activation = 'relu')
  # create an output layer for binary classification
  output = hidden_mean %>% layer_dense(units = 1, activation = 'sigmoid')
  model = keras_model(inputs = input, outputs = output)

  # compile with AUC metric
  model %>% compile(optimizer = tf$keras$optimizers$Adam(learning_rate = 3e-5, epsilon = 1e-08, clipnorm = 1.0),
                    loss = tf$losses$BinaryCrossentropy(from_logits = FALSE),
                    metrics = tf$metrics$AUC())

  print(glue::glue('{ai_m[[i]][1]}'))
  # train the model
  history = model %>% keras::fit(tf_train, epochs = epochs, # steps_per_epoch = len/batch_size,
                                 validation_data = tf_test)
  gather_history[[i]] <- history
}
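
Once the loop finishes, gather_history holds one Keras training history per model. The metric names depend on how TensorFlow numbers the AUC metric across the three runs (e.g. val_auc, val_auc_1), so the sketch below simply picks the non-loss validation metric of each history to compare the final-epoch validation AUC:

 # label each history with its model class
 names(gather_history) = sapply(ai_m, function(x) x[1])

 # final-epoch validation AUC per model (metric name may vary across runs)
 res = sapply(gather_history, function(h) {
   val_name = setdiff(grep('^val_', names(h$metrics), value = TRUE), 'val_loss')[1]
   tail(h$metrics[[val_name]], 1)
 })
 print(res)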
