Introduction
Vietnamese NLP presents unique challenges: the language is tonal, and its orthography places spaces between syllables rather than words, so word boundaries must be inferred. However, with the advent of pre-trained language models like PhoBERT, BARTpho, and ViT5, Vietnamese NLP has made significant progress. This guide will walk you through fine-tuning these models for various Vietnamese NLP tasks.
Understanding Vietnamese Language Characteristics
Before diving into fine-tuning, it's essential to understand what makes Vietnamese unique:
- Tonal Language: Vietnamese has six tones that change word meanings
- Syllable-based Writing: Spaces separate syllables, not words, so a single word may span several syllables (e.g., the two-syllable word "trí tuệ", "intellect")
- Isolating Language: Minimal inflection, relying on word order and particles
- Diacritics: Critical for correct interpretation (e.g., "ma" means "ghost", "mà" means "but", and "má" means "mother" or "cheek"); see the normalization sketch below
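One low-level detail worth handling before anything else: the same accented character can be stored in composed (NFC) or decomposed (NFD) form, and mixed encodings silently break tokenizer vocabulary lookups. A minimal sketch using Python's standard unicodedata module:

import unicodedata

def normalize_diacritics(text):
    # Normalize to NFC so each accented character is a single code point
    return unicodedata.normalize("NFC", text)

decomposed = "ma\u0300"  # 'm' + 'a' + combining grave accent (NFD-style)
print(normalize_diacritics(decomposed))  # "mà" as one precomposed character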
Available Vietnamese Pre-trained Models
1. PhoBERT
PhoBERT is the first large-scale pre-trained language model for Vietnamese, trained with the RoBERTa pre-training procedure. It comes in two variants: PhoBERT-base and PhoBERT-large.
from transformers import AutoModel, AutoTokenizer
phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
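As a quick sanity check, encode a word-segmented sentence and extract contextual features from the encoder ("Tôi học trí_tuệ nhân_tạo" means "I study artificial intelligence"):

import torch

# PhoBERT expects word-segmented input (underscores join multi-syllable words)
sentence = "Tôi học trí_tuệ nhân_tạo"
input_ids = torch.tensor([tokenizer.encode(sentence)])
with torch.no_grad():
    features = phobert(input_ids)  # features.last_hidden_state holds contextual embeddings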
2. BARTpho
BARTpho is a BART-based sequence-to-sequence model, well suited to generative tasks such as text generation, summarization, and translation.
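Loading follows the usual transformers pattern; vinai/bartpho-syllable is one published checkpoint (a word-level variant, vinai/bartpho-word, is also available):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

bartpho = AutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-syllable")
bartpho_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")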
3. ViT5
ViT5 is a Vietnamese T5 model, excellent for text-to-text tasks.
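Loading works the same way; VietAI/vit5-base is a commonly used checkpoint:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

vit5 = AutoModelForSeq2SeqLM.from_pretrained("VietAI/vit5-base")
vit5_tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")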
Step-by-Step Fine-tuning Guide
1. Data Preparation
First, ensure your Vietnamese text is properly preprocessed:
import py_vncorenlp

# Download the VnCoreNLP model files once, then point the segmenter at them
# (the save path is an example; use any absolute path on your machine)
py_vncorenlp.download_model(save_dir="/absolute/path/to/vncorenlp")
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir="/absolute/path/to/vncorenlp")

def preprocess_vietnamese(text):
    # word_segment returns a list of segmented sentences, one string per sentence
    segmented = rdrsegmenter.word_segment(text)
    return " ".join(segmented)

text = "Tôi học trí tuệ nhân tạo"
processed = preprocess_vietnamese(text)
print(processed)  # "Tôi học trí_tuệ nhân_tạo"
2. Load Dataset
from datasets import load_dataset

# Example: Vietnamese students' feedback dataset for sentiment analysis
# (its text column may be named "sentence" rather than "text"; check
# dataset.column_names and adjust the tokenization step to match)
dataset = load_dataset("uitnlp/vietnamese_students_feedback")

# Or load custom data
import pandas as pd

df = pd.read_csv("vietnamese_data.csv")
train_texts = df['text'].tolist()
train_labels = df['label'].tolist()
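The tokenization step below expects a datasets.Dataset with named splits, so wrap custom lists accordingly. A minimal sketch (the 90/10 split is arbitrary):

from datasets import Dataset

custom = Dataset.from_dict({"text": train_texts, "label": train_labels})
# or directly: Dataset.from_pandas(df)
dataset = custom.train_test_split(test_size=0.1, seed=42)  # yields "train" and "test" splits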
3. Tokenization
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

def tokenize_function(examples):
    return tokenizer(
        examples["text"],  # use your dataset's text column name here
        padding="max_length",
        truncation=True,
        max_length=256  # PhoBERT accepts at most 256 tokens per input
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)
4. Model Configuration
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/phobert-base",
    num_labels=3  # Adjust based on your task
)
5. Training Arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
)
6. Training
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)

trainer.train()
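Once training finishes, save the model and tokenizer side by side so inference code can reload them together (the output path is arbitrary):

trainer.save_model("./phobert-finetuned")         # weights and config
tokenizer.save_pretrained("./phobert-finetuned")  # tokenizer files alongside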
Common Vietnamese NLP Tasks
Named Entity Recognition (NER)
For NER tasks, use PhoBERT with a token classification head:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "vinai/phobert-base",
    num_labels=len(label_list)
)
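Here label_list is task-specific. An illustrative BIO tag set covering persons, organizations, and locations (hypothetical; use whatever labels your dataset actually defines):

# hypothetical tag set for illustration; replace with your NER dataset's labels
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]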
Text Classification
For sentiment analysis, topic classification, and similar tasks, use the sequence classification setup shown in the step-by-step guide above.
Question Answering
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("vinai/phobert-base")
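Note that the question-answering head is randomly initialized until the model is fine-tuned on a Vietnamese QA dataset such as UIT-ViQuAD. After fine-tuning, a minimal extractive-QA sketch (the example strings are illustrative):

import torch

question = "PhoBERT được huấn luyện cho ngôn ngữ nào?"  # "Which language was PhoBERT trained for?"
context = "PhoBERT là mô hình ngôn ngữ được huấn luyện trước cho tiếng Việt."  # "PhoBERT is a language model pre-trained for Vietnamese."
# remember to word-segment real inputs first (see Data Preparation)
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
# take the highest-scoring start and end positions as the answer span
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))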
Best Practices
- Word Segmentation: Always segment Vietnamese text before tokenization
- Diacritic Handling: Preserve diacritics - they're crucial for meaning
- Data Augmentation: Use techniques like back-translation for low-resource scenarios (see the sketch after this list)
- Learning Rate: Start with 2e-5 or 3e-5 for fine-tuning
- Batch Size: Adjust based on GPU memory (16-32 typically works well)
- Evaluation: Evaluate on established Vietnamese benchmarks where available, and report the metrics standard for each task (e.g., weighted F1 for classification, entity-level F1 for NER)
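Back-translation generates paraphrases by translating Vietnamese to a pivot language and back. A sketch assuming the Helsinki-NLP OPUS-MT Vietnamese-English checkpoints (any vi-en / en-vi translation pair would work the same way):

from transformers import pipeline

# OPUS-MT checkpoint names assumed; substitute any vi->en / en->vi pair
vi2en = pipeline("translation", model="Helsinki-NLP/opus-mt-vi-en")
en2vi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-vi")

def back_translate(text):
    # translate out to English and back to obtain a Vietnamese paraphrase
    english = vi2en(text)[0]["translation_text"]
    return en2vi(english)[0]["translation_text"]

print(back_translate("Giảng viên dạy rất nhiệt tình"))  # "The lecturer teaches very enthusiastically"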
Common Pitfalls to Avoid
- Not performing word segmentation properly
- Removing diacritics (this changes meanings!)
- Using too high learning rates (causes instability)
- Insufficient training data (consider data augmentation)
- Not handling class imbalance in classification tasks (a weighted-loss sketch follows this list)
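One common remedy for class imbalance is a class-weighted loss. A sketch that subclasses Trainer (the weights here are illustrative; derive real ones from your label counts):

import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    # override compute_loss to use cross-entropy with per-class weights
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # illustrative weights for 3 classes; compute real ones from label frequencies
        weights = torch.tensor([1.0, 2.0, 1.5], device=outputs.logits.device)
        loss_fct = torch.nn.CrossEntropyLoss(weight=weights)
        loss = loss_fct(outputs.logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss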
Evaluation Metrics
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='weighted')
    return {
        'accuracy': accuracy,
        'f1': f1
    }
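To put this to use, pass compute_metrics=compute_metrics when constructing the Trainer in step 6; accuracy and weighted F1 will then be reported at every evaluation epoch, and load_best_model_at_end can select the best checkpoint by one of these metrics via metric_for_best_model.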
Conclusion
Fine-tuning pre-trained models for Vietnamese NLP has become increasingly accessible thanks to models like PhoBERT and BARTpho. By following the steps outlined in this guide and adhering to best practices, you can achieve strong performance on various Vietnamese NLP tasks. Remember that proper preprocessing, especially word segmentation and diacritic handling, is crucial for success.