Introduction
Vietnamese NLP presents unique challenges: the language is tonal, and its orthography places spaces between syllables rather than words, so word boundaries must be inferred. However, with the advent of pre-trained language models like PhoBERT, BARTpho, and ViT5, Vietnamese NLP has made significant progress. This guide will walk you through fine-tuning these models for various Vietnamese NLP tasks.
Understanding Vietnamese Language Characteristics
Before diving into fine-tuning, it's essential to understand what makes Vietnamese unique:
- Tonal Language: Vietnamese has six tones that change word meanings
- Syllable-based Writing: Spaces separate syllables, not words, so a single word may span several syllables (e.g., the two-syllable word "trí tuệ", "intellect")
- Isolating Language: Minimal inflection, relying on word order and particles
- Diacritics: Critical for correct interpretation (e.g., "ma" means "ghost", "mà" means "but", and "má" means "mother" or "cheek"); see the normalization sketch below
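One low-level detail worth handling before anything else: the same accented character can be stored in composed (NFC) or decomposed (NFD) form, and mixed encodings silently break tokenizer vocabulary lookups. A minimal sketch using Python's standard unicodedata module:

import unicodedata

def normalize_diacritics(text):
    # Normalize to NFC so each accented character is a single code point
    return unicodedata.normalize("NFC", text)

decomposed = "ma\u0300"  # 'm' + 'a' + combining grave accent (NFD-style)
print(normalize_diacritics(decomposed))  # "mà" as one precomposed character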
Available Vietnamese Pre-trained Models
1. PhoBERT
PhoBERT is the first large-scale pre-trained language model for Vietnamese, trained with the RoBERTa pre-training procedure. It comes in two variants: PhoBERT-base and PhoBERT-large.
from transformers import AutoModel, AutoTokenizer
phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
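As a quick sanity check, encode a word-segmented sentence and extract contextual features from the encoder ("Tôi học trí_tuệ nhân_tạo" means "I study artificial intelligence"):

import torch

# PhoBERT expects word-segmented input (underscores join multi-syllable words)
sentence = "Tôi học trí_tuệ nhân_tạo"
input_ids = torch.tensor([tokenizer.encode(sentence)])
with torch.no_grad():
    features = phobert(input_ids)  # features.last_hidden_state holds contextual embeddings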
2. BARTpho
BARTpho is a BART-based sequence-to-sequence model, well suited to generative tasks such as text generation, summarization, and translation.
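Loading follows the usual transformers pattern; vinai/bartpho-syllable is one published checkpoint (a word-level variant, vinai/bartpho-word, is also available):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

bartpho = AutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-syllable")
bartpho_tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")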
3. ViT5
ViT5 is a Vietnamese T5 model, excellent for text-to-text tasks.
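Loading works the same way; VietAI/vit5-base is a commonly used checkpoint:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

vit5 = AutoModelForSeq2SeqLM.from_pretrained("VietAI/vit5-base")
vit5_tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")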
Step-by-Step Fine-tuning Guide
1. Data Preparation
First, ensure your Vietnamese text is properly preprocessed:
import py_vncorenlp

# Download the VnCoreNLP model files once, then point the segmenter at them
# (the save path is an example; use any absolute path on your machine)
py_vncorenlp.download_model(save_dir="/absolute/path/to/vncorenlp")
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir="/absolute/path/to/vncorenlp")

def preprocess_vietnamese(text):
    # word_segment returns a list of segmented sentences, one string per sentence
    segmented = rdrsegmenter.word_segment(text)
    return " ".join(segmented)

text = "Tôi học trí tuệ nhân tạo"
processed = preprocess_vietnamese(text)
print(processed)  # "Tôi học trí_tuệ nhân_tạo"
2. Load Dataset
from datasets import load_dataset

# Example: Vietnamese students' feedback dataset for sentiment analysis
# (its text column may be named "sentence" rather than "text"; check
# dataset.column_names and adjust the tokenization step to match)
dataset = load_dataset("uitnlp/vietnamese_students_feedback")

# Or load custom data
import pandas as pd

df = pd.read_csv("vietnamese_data.csv")
train_texts = df['text'].tolist()
train_labels = df['label'].tolist()
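The tokenization step below expects a datasets.Dataset with named splits, so wrap custom lists accordingly. A minimal sketch (the 90/10 split is arbitrary):

from datasets import Dataset

custom = Dataset.from_dict({"text": train_texts, "label": train_labels})
# or directly: Dataset.from_pandas(df)
dataset = custom.train_test_split(test_size=0.1, seed=42)  # yields "train" and "test" splits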
3. Tokenization
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

def tokenize_function(examples):
    return tokenizer(
        examples["text"],  # use your dataset's text column name here
        padding="max_length",
        truncation=True,
        max_length=256  # PhoBERT accepts at most 256 tokens per input
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)
4. Model Configuration
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/phobert-base",
    num_labels=3  # Adjust based on your task
)
5. Training Arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
)
6. Training
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)

trainer.train()
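Once training finishes, save the model and tokenizer side by side so inference code can reload them together (the output path is arbitrary):

trainer.save_model("./phobert-finetuned")         # weights and config
tokenizer.save_pretrained("./phobert-finetuned")  # tokenizer files alongside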
Common Vietnamese NLP Tasks
Named Entity Recognition (NER)
For NER tasks, use PhoBERT with a token classification head:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "vinai/phobert-base",
    num_labels=len(label_list)
)
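Here label_list is task-specific. An illustrative BIO tag set covering persons, organizations, and locations (hypothetical; use whatever labels your dataset actually defines):

# hypothetical tag set for illustration; replace with your NER dataset's labels
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]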
Text Classification
For sentiment analysis, topic classification, and similar tasks, use the sequence classification setup shown in the step-by-step guide above.
Question Answering
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("vinai/phobert-base")
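Note that the question-answering head is randomly initialized until the model is fine-tuned on a Vietnamese QA dataset such as UIT-ViQuAD. After fine-tuning, a minimal extractive-QA sketch (the example strings are illustrative):

import torch

question = "PhoBERT được huấn luyện cho ngôn ngữ nào?"  # "Which language was PhoBERT trained for?"
context = "PhoBERT là mô hình ngôn ngữ được huấn luyện trước cho tiếng Việt."  # "PhoBERT is a language model pre-trained for Vietnamese."
# remember to word-segment real inputs first (see Data Preparation)
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
# take the highest-scoring start and end positions as the answer span
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))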
Best Practices
- Word Segmentation: Always segment Vietnamese text before tokenization
- Diacritic Handling: Preserve diacritics - they're crucial for meaning
- Data Augmentation: Use techniques like back-translation for low-resource scenarios (see the sketch after this list)
- Learning Rate: Start with 2e-5 or 3e-5 for fine-tuning
- Batch Size: Adjust based on GPU memory (16-32 typically works well)
- Evaluation: Evaluate on established Vietnamese benchmarks where available, and report the metrics standard for each task (e.g., weighted F1 for classification, entity-level F1 for NER)
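Back-translation generates paraphrases by translating Vietnamese to a pivot language and back. A sketch assuming the Helsinki-NLP OPUS-MT Vietnamese-English checkpoints (any vi-en / en-vi translation pair would work the same way):

from transformers import pipeline

# OPUS-MT checkpoint names assumed; substitute any vi->en / en->vi pair
vi2en = pipeline("translation", model="Helsinki-NLP/opus-mt-vi-en")
en2vi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-vi")

def back_translate(text):
    # translate out to English and back to obtain a Vietnamese paraphrase
    english = vi2en(text)[0]["translation_text"]
    return en2vi(english)[0]["translation_text"]

print(back_translate("Giảng viên dạy rất nhiệt tình"))  # "The lecturer teaches very enthusiastically"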
Common Pitfalls to Avoid
- Not performing word segmentation properly
- Removing diacritics (this changes meanings!)
- Using too high learning rates (causes instability)
- Insufficient training data (consider data augmentation)
- Not handling class imbalance in classification tasks (a weighted-loss sketch follows this list)
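One common remedy for class imbalance is a class-weighted loss. A sketch that subclasses Trainer (the weights here are illustrative; derive real ones from your label counts):

import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    # override compute_loss to use cross-entropy with per-class weights
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # illustrative weights for 3 classes; compute real ones from label frequencies
        weights = torch.tensor([1.0, 2.0, 1.5], device=outputs.logits.device)
        loss_fct = torch.nn.CrossEntropyLoss(weight=weights)
        loss = loss_fct(outputs.logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss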
Evaluation Metrics
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='weighted')
    return {
        'accuracy': accuracy,
        'f1': f1
    }
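To put this to use, pass compute_metrics=compute_metrics when constructing the Trainer in step 6; accuracy and weighted F1 will then be reported at every evaluation epoch, and load_best_model_at_end can select the best checkpoint by one of these metrics via metric_for_best_model.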
Conclusion
Fine-tuning pre-trained models for Vietnamese NLP has become increasingly accessible thanks to models like PhoBERT and BARTpho. By following the steps outlined in this guide and adhering to best practices, you can achieve strong performance on various Vietnamese NLP tasks. Remember that proper preprocessing, especially word segmentation and diacritic handling, is crucial for success.