Sentiment Classification Using BERT
BERT (Bidirectional Encoder Representations from Transformers) was proposed by researchers at Google AI Language in 2018. Although its primary goal was to improve the understanding of the meaning of queries in Google Search, BERT has become one of the most important and versatile architectures for solving a variety of natural-language tasks, setting state-of-the-art results on sentence-pair classification, question answering, and other tasks. For more details on the architecture, please refer to this article.
Architecture:
One of the most important features of BERT is its adaptability to many different NLP tasks with state-of-the-art accuracy (similar to the transfer learning we used in computer vision). For that, the paper also proposes architectures for the different tasks. In this post, we will use the BERT architecture for single-sentence classification, specifically the architecture used for the CoLA (Corpus of Linguistic Acceptability) binary classification task. In a previous post on BERT, we discussed its architecture in detail, but let's recall some of its important points:
BERT comes in two versions:
- BERT (BASE): a 12-layer encoder stack with 12 bidirectional self-attention heads and 768 hidden units.
- BERT (LARGE): a 24-layer encoder stack with 16 bidirectional self-attention heads and 1024 hidden units.
For its TensorFlow implementation, Google has released two variants of both BERT BASE and BERT LARGE: Uncased and Cased. In the uncased version, the text is lowercased before WordPiece tokenization.
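For intuition about the single-sentence classification setup used here (the CoLA-style task mentioned above): the classifier is essentially a small head on top of BERT, where the pooled [CLS] representation is passed through a linear layer followed by a softmax over the two sentiment classes. Below is a minimal, illustrative sketch of that idea in TensorFlow 1.x; it is not the exact code from run_classifier.py, and the function and variable names are ours.
Code:
import tensorflow as tf

def classification_head(pooled_cls_output, num_labels=2):
    # pooled_cls_output: [batch_size, hidden_size] vector BERT produces for the [CLS] token
    hidden_size = pooled_cls_output.shape[-1].value
    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())
    # A single linear layer plus softmax yields the class probabilities
    logits = tf.matmul(pooled_cls_output, output_weights, transpose_b=True) + output_bias
    return tf.nn.softmax(logits, axis=-1)  # [batch_size, 2]: P(negative), P(positive)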
Implementation:
- First, we need to clone the BERT GitHub repo to make the setup easier.
Code:
!git clone https://github.com/google-research/bert.git
Cloning into 'bert'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 317.20 KiB | 584.00 KiB/s, done.
Resolving deltas: 100% (185/185), done.
- Now, we need to download the BERT BASE model using the following link and unzip it into the working directory (or the desired location).
Code:
# Download the BERT BASE model and unzip it
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip uncased_L-12_H-768_A-12.zip
Archive:  uncased_L-12_H-768_A-12.zip
   creating: uncased_L-12_H-768_A-12/
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001
  inflating: uncased_L-12_H-768_A-12/vocab.txt
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index
  inflating: uncased_L-12_H-768_A-12/bert_config.json
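The unzipped folder also contains bert_config.json, which records the hyperparameters of the architecture described above (12 layers, 12 attention heads, 768 hidden units for BERT BASE). As an optional sanity check, assuming the paths from the unzip output above, we can read it directly:
Code:
import json

# bert_config.json stores the architecture hyperparameters of the downloaded checkpoint
with open("uncased_L-12_H-768_A-12/bert_config.json") as f:
    bert_config = json.load(f)

print(bert_config["num_hidden_layers"],
      bert_config["num_attention_heads"],
      bert_config["hidden_size"])
# Expected for BERT BASE: 12 12 768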
- We will be using the TensorFlow 1.x version. Google Colab provides a magic command, tensorflow_version, that can switch between versions.
Code:
%tensorflow_version 1.x
TensorFlow 1.x selected.
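As a quick, optional illustration of the uncased WordPiece tokenization mentioned earlier, we can use the tokenization module from the cloned repo together with the downloaded vocab file. This is a small sketch, assuming the repo and vocab paths from the previous steps.
Code:
import sys
sys.path.append("bert")  # the repo cloned earlier
import tokenization

# The uncased model lowercases the text before splitting it into WordPiece sub-words
tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)
print(tokenizer.tokenize("This Movie was surprisingly GOOD!"))
# Tokens come out lowercased; rarer words are split into '##'-prefixed sub-word pieces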
- Now we will import the modules necessary for running this project. We will be using NumPy, pandas, scikit-learn, and Keras from TensorFlow's built-in modules. These come preinstalled in Colab; make sure to install them in your own environment.
Code:
import os
import re
import csv

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras

from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
- Now we will load the IMDB sentiment dataset and do some preprocessing before training. For loading the IMDB dataset, we will follow this TensorFlow Hub tutorial.
Code:
# Load reviews from the positive and negative directories and add a column
# with the sentiment score parsed from each file name
def load_directory_data(directory):
    data = {}
    data["sentence"] = []
    data["sentiment"] = []
    for file_path in os.listdir(directory):
        with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
            data["sentence"].append(f.read())
            data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
    return pd.DataFrame.from_dict(data)

# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
    pos_df = load_directory_data(os.path.join(directory, "pos"))
    neg_df = load_directory_data(os.path.join(directory, "neg"))
    pos_df["polarity"] = 1
    neg_df["polarity"] = 0
    return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
    dataset = tf.keras.utils.get_file(
        fname="aclImdb.tar.gz",
        origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
        extract=True)
    train_df = load_dataset(os.path.join(os.path.dirname(dataset), "aclImdb", "train"))
    test_df = load_dataset(os.path.join(os.path.dirname(dataset), "aclImdb", "test"))
    return train_df, test_df

train, test = download_and_load_datasets()
train.shape, test.shape
Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84131840/84125825 [==============================] - 8s 0us/step
((25000, 3), (25000, 3))
- This dataset contains 50k reviews, 25k each for training and test. We will sample 5k reviews from each of the train and test splits. Both splits contain 3 columns, which are listed below.
Code:
# Sample 5k data points each for train and test
train = train.sample(5000)
test = test.sample(5000)

# List the columns of the train and test data
train.columns, test.columns
(Index(["sentence", "sentiment", "polarity"], dtype="object"), Index(["sentence", "sentiment", "polarity"], dtype="object"))
- Now, we need to convert the data into the specific format required by the BERT model for training and prediction; for that we will use pandas DataFrames. Below are the columns required in the BERT train and test format (the test file only needs the guid and text columns):
- GUID: An ID for the row. Required for both train and test data.
- Class label: A value of 0 or 1 depending on positive or negative sentiment.
- alpha: A dummy column for text classification, but expected by BERT during training.
- text: The review text of the data point that needs to be classified. Obviously required for both training and test.
Code:
# Convert the training data into BERT format
train_bert = pd.DataFrame({
    "guid": range(len(train)),
    "label": train["polarity"],
    "alpha": ["a"] * train.shape[0],
    "text": train["sentence"].replace(r"\n", "", regex=True)
})
train_bert.head()
print("-----")

# Convert the test data into BERT format
bert_test = pd.DataFrame({
    "guid": range(len(test)),
    "text": test["sentence"].replace(r"\n", " ", regex=True)
})
bert_test.head()
       guid  label alpha                                               text
14930     0      1     a  William Hurt may not be an American matinee id...
1445      1      1     a  Rock solid giallo from a master filmmaker of t...
16943     2      1     a  This movie surprised me. Some things were "cli...
6391      3      1     a  This film may seem dated today, but remember t...
4526      4      0     a  The Twilight Zone has achieved a certain mytho...
-----
       guid                                               text
20010     0  One of Alfred Hitchcock's three greatest films...
16132     1  Hitchcock once gave an interview where he said...
24947     2  I had nothing to do before going out one night...
5471      3  tell you what that was excellent. Dylan Moran ...
21075     4  I watched this show until my puberty but still...
- Now, we split the data into three parts: train, dev, and test, and save them as TSV files in a folder (here "IMDB_dataset"). This is because the run_classifier script requires the dataset in TSV format.
Code:
# Split the data into train and validation sets
bert_train, bert_val = train_test_split(train_bert, test_size=0.1)

# Create the dataset folder and save the train, validation and test files into it
os.makedirs("bert/IMDB_dataset", exist_ok=True)
bert_train.to_csv("bert/IMDB_dataset/train.tsv", sep="\t", index=False, header=False)
bert_val.to_csv("bert/IMDB_dataset/dev.tsv", sep="\t", index=False, header=False)
bert_test.to_csv("bert/IMDB_dataset/test.tsv", sep="\t", index=False, header=True)
- In this step, we train the model using the following command. For executing bash commands in Colab, we put a ! sign in front of the command. The run_classifier.py script trains the model with the given arguments. Due to time and resource constraints, we will train for only 3 epochs.
Code:
# Most of the arguments here are self-explanatory, but a few need an explanation:
# task_name: discussed above; we need binary classification, which is why we use cola
# vocab_file: a vocab file (vocab.txt) mapping WordPiece tokens to word ids
# init_checkpoint: a TensorFlow checkpoint; here we use the downloaded BERT checkpoint
# max_seq_length: caps the maximum number of tokens per review
# bert_config_file: file containing the hyperparameter settings
!python bert/run_classifier.py --task_name=cola --do_train=true --do_eval=true --data_dir=/content/bert/IMDB_dataset --vocab_file=/content/uncased_L-12_H-768_A-12/vocab.txt --bert_config_file=/content/uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=/content/uncased_L-12_H-768_A-12/bert_model.ckpt --max_seq_length=64 --train_batch_size=8 --learning_rate=2e-5 --num_train_epochs=3.0 --output_dir=/content/bert_output/ --do_lower_case=True --save_checkpoints_steps=10000
# Last few lines
INFO:tensorflow:***** Eval results *****
I0713 06:06:28.966619 139722620139392 run_classifier.py:923] ***** Eval results *****
INFO:tensorflow:  eval_accuracy = 0.796
I0713 06:06:28.966814 139722620139392 run_classifier.py:925]   eval_accuracy = 0.796
INFO:tensorflow:  eval_loss = 0.95403963
I0713 06:06:28.967138 139722620139392 run_classifier.py:925]   eval_loss = 0.95403963
INFO:tensorflow:  global_step = 1687
I0713 06:06:28.967317 139722620139392 run_classifier.py:925]   global_step = 1687
INFO:tensorflow:  loss = 0.95741796
I0713 06:06:28.967507 139722620139392 run_classifier.py:925]   loss = 0.95741796
- Now we will use test data to evaluate our model with the following bash script. This script saves the predictions into a tsv file.
Code:
# Run BERT prediction on test.tsv
# Here we use the saved training checkpoint as the initial model
!python bert/run_classifier.py --task_name=cola --do_predict=true --data_dir=/content/bert/IMDB_dataset --vocab_file=/content/uncased_L-12_H-768_A-12/vocab.txt --bert_config_file=/content/uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=/content/bert_output/model.ckpt-0 --max_seq_length=128 --output_dir=/content/bert_output/
INFO:tensorflow:Restoring parameters from /content/bert_output/model.ckpt-1687
I0713 06:08:22.372014 140390020667264 saver.py:1284] Restoring parameters from /content/bert_output/model.ckpt-1687
INFO:tensorflow:Running local_init_op.
I0713 06:08:23.801442 140390020667264 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0713 06:08:23.859703 140390020667264 session_manager.py:502] Done running local_init_op.
2020-07-13 06:08:24.453814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
INFO:tensorflow:prediction_loop marked as finished
I0713 06:10:02.280455 140390020667264 error_handling.py:101] prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
I0713 06:10:02.280870 140390020667264 error_handling.py:101] prediction_loop marked as finished
- The code below takes the class with the maximum predicted probability for each row of the test data and stores it in a list.
Code:
# Read the predicted probabilities from test_results.tsv and take the index
# of the maximum value in each row as the predicted label
import csv

label_results = []
with open("/content/bert_output/test_results.tsv") as file:
    rows = csv.reader(file, delimiter="\t")
    for row in rows:
        probs = [float(i) for i in row]
        label_results.append(probs.index(max(probs)))
- The code below calculates accuracy and F1-score.
Code:
print ( "Accuracy" , metrics.accuracy_score(test[ "polarity" ], label_results)) print ( "F1-Score" , metrics.f1_score(test[ "polarity" ], label_results)) |
Accuracy 0.8548 F1-Score 0.8496894409937888
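As a side note, the F1-score printed above is the harmonic mean of precision and recall (F1 = 2 * P * R / (P + R)). If you want to inspect them separately, here is a small optional snippet using the same metrics module:
Code:
# Precision and recall for the positive class; F1 is their harmonic mean
precision = metrics.precision_score(test["polarity"], label_results)
recall = metrics.recall_score(test["polarity"], label_results)
print("Precision", precision)
print("Recall", recall)
print("F1 (recomputed)", 2 * precision * recall / (precision + recall))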
- We have achieved 85% accuracy and F1-score on the IMDB reviews dataset while training BERT (BASE) for just 3 epochs, which is quite a good result. Training for more epochs should improve the accuracy further.
References:
- BERT paper
- Google BERT repo
- MC.ai BERT text classification