Advanced NLP models from R


Intro

The Transformers repository from "Hugging Face" contains a lot of ready-to-use, state-of-the-art models, which are straightforward to download and fine-tune with TensorFlow & Keras.

For this purpose, users typically need to obtain:

  • The model itself (e.g. BERT, ALBERT, RoBERTa, GPT-2, etc.)
  • The tokenizer object
  • The weights of the model

In this post, we will work on a classic binary classification task and train our dataset on 3 models: GPT-2, RoBERTa, and Electra.

However, readers should be aware that one can work with transformers on a variety of downstream tasks, such as the following (a quick pipeline sketch is shown after the list):

  1. feature extraction
  2. sentiment analysis
  3. text classification
  4. question answering
  5. summarization
  6. translation and many more
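
As a quick illustration of such a downstream task, the sketch below uses the library's high-level pipeline API through reticulate. This assumes transformers is already installed (as described in the Requirements section below); the model the pipeline downloads by default is chosen by the library, not by this post.

 # a minimal sketch: sentiment analysis via the ready-made transformers pipeline
 transformer = reticulate::import('transformers')
 classifier  = transformer$pipeline('sentiment-analysis')

 classifier("This movie was surprisingly good!")
 # returns the predicted label (e.g. POSITIVE) together with a confidence score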

Requirements

Our first job is to install the transformers package via reticulate.

 reticulate::py_install('transformers', pip = TRUE)

Then, as usual, load the standard 'Keras', 'TensorFlow' >= 2.0 and some classic libraries from R.
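
The exact list of libraries is not shown above; a minimal sketch of what the remaining code relies on (the keras and tensorflow bindings, the tfdatasets helpers such as tensor_slices_dataset(), the pipe operator, and a handle to the Python transformers module named transformer) could look like this:

 # R packages used throughout the post
 library(keras)
 library(tensorflow)
 library(tfdatasets)   # tensor_slices_dataset(), dataset_batch(), ...
 library(dplyr)        # provides the %>% pipe

 # import the Python transformers package; the handle name `transformer`
 # is the one used in the rest of the post
 transformer = reticulate::import('transformers')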

Note that if running TensorFlow on a GPU, one could specify the following parameters in order to avoid memory issues.

 physical_devices = tf$config$list_physical_devices('GPU')
 tf$config$experimental$set_memory_growth(physical_devices[[1]], TRUE)

 tf$keras$backend$set_floatx('float32')

Template

We already mentioned that in order to train data on a particular model, users should download the model, its tokenizer object and weights. For example, to get a RoBERTa model one has to do the following:

 # get Tokenizer
 transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case = TRUE)

 # get Model with weights
 transformer$TFRobertaModel$from_pretrained('roberta-base')

Data preparation

A dataset for binary classification is provided in the text2vec package. Let's load the dataset and take a sample for fast model training.
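
The loading step is sketched here rather than prescribed: assuming the movie_review dataset that ships with text2vec (review text plus a 0/1 sentiment label) and renaming its columns to the comment_text / target names used further below, it could look like this:

 # load a sample of the movie_review dataset shipped with text2vec
 library(text2vec)

 df = movie_review %>%
   dplyr::rename(comment_text = review, target = sentiment) %>%  # column names assumed by the code below
   dplyr::sample_n(2000) %>%                                     # small sample for fast training
   data.table::as.data.table()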

Split our data into 2 parts:

 idx_train = sample.int(nrow(df) * 0.8)

 train = df[idx_train, ]
 test  = df[!idx_train, ]   # "not these rows"; requires a data.table (use -idx_train for a plain data.frame)

Data input for Keras

So far, we have only covered data import and the train-test split. To feed input to the network, we have to turn our raw text into indices via the imported tokenizer, and then adapt the model to do binary classification by adding a dense layer with a single unit at the end.
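
To make the first step concrete, here is a small sketch of what the tokenizer produces for a single string (reusing the RoBERTa tokenizer from the template section; the example sentence is made up):

 # the tokenizer turns raw text into a sequence of integer token ids
 tokenizer = transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case = TRUE)

 ids = tokenizer$encode("This movie was surprisingly good!",
                        max_length = 50L, truncation = TRUE)
 ids  # token ids (at most max_length of them), including the special start/end tokens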

However, we want to train our data on 3 models: GPT-2, RoBERTa, and Electra. We need to write a loop for that.

Note: one model in general requires 500-700 MB.

# list of 3 models
ai_m = list(
  c('TFGPT2Model',    'GPT2Tokenizer',    'gpt2'),
  c('TFRobertaModel', 'RobertaTokenizer', 'roberta-base'),
  c('TFElectraModel', 'ElectraTokenizer', 'google/electra-small-generator')
)

# parameters
max_len = 50L
epochs = 2
batch_size = 10

# create a list to collect model results
gather_history = list()

for (i in 1:length(ai_m)) {

  # tokenizer
  tokenizer = glue::glue("transformer${ai_m[[i]][2]}$from_pretrained('{ai_m[[i]][3]}', do_lower_case=TRUE)") %>%
    rlang::parse_expr() %>% eval()

  # model
  model_ = glue::glue("transformer${ai_m[[i]][1]}$from_pretrained('{ai_m[[i]][3]}')") %>%
    rlang::parse_expr() %>% eval()

  # inputs
  text = list()
  # outputs
  label = list()

  data_prep = function(data) {
    for (i in 1:nrow(data)) {

      txt = tokenizer$encode(data[['comment_text']][i], max_length = max_len,
                             truncation = TRUE) %>%
        t() %>%
        as.matrix() %>% list()
      lbl = data[['target']][i] %>% t()

      text = text %>% append(txt)
      label = label %>% append(lbl)
    }
    list(do.call(plyr::rbind.fill.matrix, text), do.call(plyr::rbind.fill.matrix, label))
  }

  train_ = data_prep(train)
  test_ = data_prep(test)

  # slice the dataset
  tf_train = tensor_slices_dataset(list(train_[[1]], train_[[2]])) %>%
    dataset_batch(batch_size = batch_size, drop_remainder = TRUE) %>%
    dataset_shuffle(128) %>% dataset_repeat(epochs) %>%
    dataset_prefetch(tf$data$experimental$AUTOTUNE)

  tf_test = tensor_slices_dataset(list(test_[[1]], test_[[2]])) %>%
    dataset_batch(batch_size = batch_size)

  # create an input layer
  input = layer_input(shape = c(max_len), dtype = 'int32')
  hidden_mean = tf$reduce_mean(model_(input)[[1]], axis = 1L) %>%
    layer_dense(64, activation = 'relu')
  # create an output layer for binary classification
  output = hidden_mean %>% layer_dense(units = 1, activation = 'sigmoid')
  model = keras_model(inputs = input, outputs = output)

  # compile with AUC metric
  model %>% compile(optimizer = tf$keras$optimizers$Adam(learning_rate = 3e-5, epsilon = 1e-08, clipnorm = 1.0),
                    loss = tf$losses$BinaryCrossentropy(from_logits = FALSE),
                    metrics = tf$metrics$AUC())

  print(glue::glue('{ai_m[[i]][1]}'))
  # train the model
  history = model %>% keras::fit(tf_train, epochs = epochs, # steps_per_epoch = len/batch_size,
                                 validation_data = tf_test)
  gather_history[[i]] <- history
}
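
Once the loop finishes, gather_history holds one Keras training history per model. The metric names depend on how TensorFlow numbers the AUC metric across the three runs (e.g. val_auc, val_auc_1), so the sketch below simply picks the non-loss validation metric of each history to compare the final-epoch validation AUC:

 # label each history with its model class
 names(gather_history) = sapply(ai_m, function(x) x[1])

 # final-epoch validation AUC per model (metric name may vary across runs)
 res = sapply(gather_history, function(h) {
   val_name = setdiff(grep('^val_', names(h$metrics), value = TRUE), 'val_loss')[1]
   tail(h$metrics[[val_name]], 1)
 })
 print(res)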
