---
title: Vietnamese Legal Doc Retrieval
emoji: π
colorFrom: indigo
colorTo: pink
sdk: docker
pinned: false
short_description: Fine-tuned Retrieval System for Vietnamese Legal Documents
models:
- YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs
datasets:
- YuITC/Vietnamese-Legal-Doc-Retrieval-Data
---
# Vietnamese Legal Document Retrieval System
[Hugging Face Space](https://huggingface.co/spaces/YuITC/Vietnamese-Legal-Doc-Retrieval)
[Fine-tuned model](https://huggingface.co/YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs)
[Dataset](https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data)
A retrieval system designed specifically for Vietnamese legal documents, built on a fine-tuned SBERT (Sentence-BERT) model.
## Overview
This project implements a system that retrieves relevant Vietnamese legal documents for a user query. It uses a fine-tuned multilingual BERT model to encode legal queries and documents into a shared semantic vector space, so results are matched by meaning rather than by keyword overlap alone.
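
To make the idea of retrieval in a vector space concrete, here is a minimal sketch that ranks documents by cosine similarity to a query. The three-dimensional vectors and the document names are hypothetical stand-ins for real embeddings; this is an illustration of the principle, not the project's actual code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional embeddings; a real encoder outputs far more dimensions.
doc_vecs = {
    "labor_law": [0.9, 0.1, 0.0],
    "marriage_law": [0.1, 0.9, 0.1],
    "criminal_law": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for an encoded query about workers' rights

ranking = rank_documents(query_vec, doc_vecs, top_k=2)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

The query vector points in nearly the same direction as `labor_law`, so that document ranks first even though no keywords are compared.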

## Key features
- Step-by-step notebooks that walk through each stage of the pipeline.
- Fine-tuned SBERT model specialized for Vietnamese legal document retrieval.
- FAISS indexing for efficient vector search.
- Evaluation with the MTEB benchmark.
- Interactive web interface for quick legal document search.
- High-performance retrieval of relevant legal passages.
## 🛠️ Installation & Usage
```bash
# Install dependencies
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install faiss-gpu=1.9.0 -c pytorch -c nvidia
pip install -r requirements.txt
# Running the Application
python main.py
```
The application will start a local web server with the Gradio interface, allowing you to enter legal queries and retrieve relevant documents.
## Project Structure
```
Vietnamese-Legal-Doc-Retrieval/
├── assets/                      # Visual assets for documentation
│   └── gradio_demo.png          # Screenshot of the Gradio demo interface
├── cache/                       # Cached model files
│   └── VN-legalDocs-SBERT/      # Cached BERT model files
├── data/                        # Dataset files
│   ├── original/                # Original downloaded dataset
│   │   ├── corpus.csv           # Raw corpus documents
│   │   ├── train_split.csv      # Training data
│   │   ├── val_split.csv        # Validation data
│   │   └── ...
│   ├── processed/               # Processed dataset files
│   │   ├── corpus_data.parquet  # Processed corpus for embedding
│   │   ├── train_data.parquet   # Processed training data
│   │   └── test_data.parquet    # Processed test data
│   └── retrieval/               # Files for retrieval system
│       └── legal_faiss.index    # FAISS index for fast vector search
├── models/                      # Trained model files
│   └── VN-legalDocs-SBERT/      # Fine-tuned BERT model for legal documents
│       ├── model.safetensors    # Model weights
│       ├── config.json          # Model configuration
│       └── checkpoint-*/        # Training checkpoints
├── results/                     # Evaluation results
├── Dockerfile                   # Docker configuration for deployment
├── main.py                      # Main application entry point
├── requirements.txt             # Python dependencies
├── settings.py                  # Configuration settings
└── step_*_*.ipynb               # Jupyter notebooks for each step of the process
```
## 💾 Dataset
The system is trained on a Vietnamese legal document corpus that contains:
- Legal texts from various domains
- Query-document pairs for training and evaluation

The data has been processed and structured for semantic search training. It is available on [Hugging Face](https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data). This is my modified version; the base dataset is credited in the Acknowledgments section.
## Model Training Process
The project builds the retrieval system in four steps:
1. **Data Preparation** (`step_01_Prepare_Data.ipynb`):
   - Processes raw legal documents
   - Creates query-document pairs for training
   - Formats data for the embedding model
2. **SBERT Fine-tuning** (`step_02_Finetune_SBERT.ipynb`):
   - Fine-tunes a multilingual BERT model with legal document pairs
   - Uses `CachedMultipleNegativesRankingLoss` for training
   - Optimizes for semantic similarity in legal context
3. **Evaluation** (`step_03_Eval_with_MTEB.ipynb`):
   - Evaluates model performance using retrieval metrics
   - Compares with baseline models
4. **Retrieval System Setup** (`step_04_Retrieval.ipynb`):
   - Creates FAISS index from document embeddings
   - Implements efficient search functionality
   - Prepares for deployment
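
The objective used in step 2 can be illustrated with a small sketch. A multiple-negatives ranking loss treats each query's paired document as the positive and every other document in the batch as a negative, then applies cross-entropy over scaled cosine similarities. The code below is a simplified pure-Python illustration under that assumption, with hypothetical 2-D vectors and an assumed scale of 20; it is not the Sentence Transformers implementation, and it leaves out the gradient caching that the `Cached` variant adds to allow larger batches.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where document i is the positive for query i
    and every other document in the batch serves as a negative."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine_similarity(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Negative log-probability assigned to the correct document.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]  # each query is closest to its own document
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]  # each query is closest to the wrong document

loss_aligned = in_batch_negatives_loss(queries, aligned_docs)
loss_swapped = in_batch_negatives_loss(queries, swapped_docs)
print(f"aligned: {loss_aligned:.6f}  swapped: {loss_swapped:.3f}")
```

The loss is near zero when each query already sits closest to its own document and large when the pairs are swapped, which is the signal that pulls matching query-document pairs together during fine-tuning.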
## Usage Examples
The system accepts natural language queries in Vietnamese related to legal topics. Example queries:
- "Tội xúc phạm danh dự?" (Crimes against honor?)
- "Quyền lợi của người lao động?" (Rights of workers?)
- "Thủ tục đăng ký kết hôn?" (Marriage registration procedures?)
## 🧪 Performance
The fine-tuned model was evaluated using the [MTEB benchmark](https://github.com/embeddings-benchmark/mteb) on the BKAILegalDocRetrieval dataset. Key results:
| Metric | @k | Pre-trained model score (%) | Fine-tuned model score (%) |
|--------------|-----|-----------------------------|-----------------------------|
| **NDCG** | 1 | 0.007 | 42.425 |
| | 5 | 0.011 | 57.387 |
| | 10 | 0.023 | 60.389 |
| | 20 | 0.049 | 62.160 |
| | 100 | 0.147 | 63.894 |
| **MAP** | 1 | 0.007 | 40.328 |
| | 5 | 0.009 | 52.297 |
| | 10 | 0.014 | 53.608 |
| | 20 | 0.021 | 54.136 |
| | 100 | 0.033 | 54.418 |
| **Recall** | 1 | 0.007 | 40.328 |
| | 5 | 0.017 | 70.466 |
| | 10 | 0.054 | 79.407 |
| | 20 | 0.157 | 86.112 |
| | 100 | 0.713 | 94.805 |
| **Precision**| 1 | 0.007 | 42.425 |
| | 5 | 0.003 | 15.119 |
| | 10 | 0.005 | 8.587 |
| | 20 | 0.008 | 4.687 |
| | 100 | 0.007 | 1.045 |
| **MRR** | 1 | 0.007 | 42.418 |
| | 5 | 0.010 | 54.337 |
| | 10 | 0.014 | 55.510 |
| | 20 | 0.021 | 55.956 |
| | 100 | 0.033 | 56.172 |
- **NDCG@k (Normalized Discounted Cumulative Gain)**
  Measures ranking quality by summing the relevance of the top k results, discounted logarithmically by rank position and normalized by the score of an ideal ranking.
- **MAP@k (Mean Average Precision)**
  Computes the average precision for each query up to rank k (the mean of the precision values at each rank where a relevant document is retrieved), then averages across all queries.
- **Recall@k**
  The proportion of all relevant documents that are retrieved in the top k results.
- **Precision@k**
  The proportion of the top k retrieved documents that are relevant.
- **MRR@k (Mean Reciprocal Rank)**
  The average, across all queries, of the reciprocal rank of the first relevant document within the top k results.
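
The definitions above can be sketched for a single query as follows. This is an illustrative pure-Python version assuming binary relevance; the document IDs are hypothetical, and the average precision here is normalized by the total number of relevant documents, which is one common convention (evaluation libraries may differ in such details). The reported scores are these per-query values averaged over all queries.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k results."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant result in the top k, or 0 if none."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision_at_k(ranked, relevant, k):
    """Sum of precision values at each relevant hit, divided by the number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking (binary relevance)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg

# Hypothetical ranking for one query; d1 and d2 are the relevant documents.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}

print(precision_at_k(ranked, relevant, 5))        # 2 of 5 results are relevant
print(recall_at_k(ranked, relevant, 3))           # 1 of 2 relevant documents in the top 3
print(reciprocal_rank_at_k(ranked, relevant, 5))  # first relevant result is at rank 2
print(average_precision_at_k(ranked, relevant, 5))
print(round(ndcg_at_k(ranked, relevant, 5), 4))
```

Note how Precision@k necessarily falls as k grows when each query has only a few relevant documents, which matches the trend in the table.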

Fine-tuning raises the main evaluation score (NDCG@10) from 0.023% for the pre-trained model to 60.389%, and Recall@100 reaches 94.805%, which indicates strong performance on Vietnamese legal document retrieval.
## 🐳 Docker Deployment
The project includes a Docker configuration for easy deployment. The Docker image is built on `continuumio/miniconda3` and includes GPU support via PyTorch CUDA and FAISS-GPU.
```bash
# Build the Docker image
docker build -t vietnamese-legal-retrieval .
# Run the container (add --gpus all to expose host GPUs; this requires the NVIDIA Container Toolkit)
docker run -p 7860:7860 vietnamese-legal-retrieval
```
The container:
- Uses Python 3.10 with CUDA 12.1 support
- Installs required dependencies from requirements.txt
- Exposes port 7860 for the Gradio web interface
- Sets proper environment variables for security and performance
- Runs as a non-root user for enhanced security
You can access the web interface by navigating to `http://localhost:7860` after starting the container.
## License
This project is licensed under the MIT License. Feel free to modify and distribute it as needed.
## 🤝 Acknowledgments
Thanks to:
- [BKAI Legal Retrieval Dataset](https://huggingface.co/datasets/tmnam20/BKAI-Legal-Retrieval) for the original data
- [Sentence Transformers](https://www.sbert.net/) library for the embedding model architecture
- [Hugging Face](https://huggingface.co/) for hosting the model and dataset

If you find this project useful, consider ⭐ starring the repository or contributing to further improvements!
## Contact
For any questions or collaboration opportunities, feel free to reach out:
📧 Email: tainguyenphu2502@gmail.com