A comprehensive toolkit for Vietnamese Natural Language to SQL conversion with advanced prompting strategies, intelligent example selection, and enhanced evaluation metrics.

ViPERSQL

Vietnamese/English Text-to-SQL System with ViR2 Example Selection Method

A research system for converting natural language questions to SQL queries. It features ViR2, a novel two-stage example selection method that combines semantic retrieval with syntactic matching and diversity optimization.


🎯 Overview

ViPERSQL addresses the challenge of selecting optimal few-shot examples for Text-to-SQL tasks through:

  • ViR2 Method: Two-stage selection (PhoBERT retrieval → POS-based re-ranking with diversity)
  • Multi-Language Support: Vietnamese (PhoBERT + underthesea) and English (BERT + spaCy)
  • Enhanced Evaluation: Component-wise F1 metrics beyond Exact Match
  • Modular Architecture: Extensible framework with multiple strategies and selectors

Key Innovation:

$$\text{Score}(E, q) = \text{POS}_{\text{Score}}(E, q) + \lambda \cdot \text{Diversity}(E)$$

where $E$ is a candidate set of examples, $q$ is the input question, and $\lambda$ (default $0.3$) weights example diversity against syntactic similarity.
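As a worked illustration of the trade-off, take two hypothetical candidate sets $E_A$ and $E_B$ with made-up component values (these numbers are assumptions for the example, not results from the system). With $\lambda = 0.3$:

$$\text{Score}(E_A, q) = 0.80 + 0.3 \cdot 0.50 = 0.95$$

$$\text{Score}(E_B, q) = 0.85 + 0.3 \cdot 0.20 = 0.91$$

Here $E_A$ is preferred even though its POS score is lower, because its examples are more diverse. With $\lambda = 0$ the ranking would flip and $E_B$ would win on POS score alone.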


📚 Documentation

Core Concepts

Usage Guides

Advanced


⚡ Quick Start

Installation

git clone https://github.com/hoadm-net/ViPERSQL.git
cd ViPERSQL
pip install -r requirements.txt

Configuration

cp .env.example .env
# Edit .env with your API keys:
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...

Basic Usage

# Zero-shot (baseline)
python vipersql.py --samples 10

# Few-shot with ViR2 (recommended)
python vipersql.py --strategy few-shot --example-selection-strategy vir2 --samples 10

# Chain-of-thought reasoning
python vipersql.py --strategy cot --samples 10

See Usage Examples for more scenarios.


🏗️ System Architecture

Input Question
      ↓
┌─────────────┐
│  Strategy   │  Zero-shot / Few-shot / CoT
└─────┬───────┘
      ↓
┌─────────────┐
│  Selector   │  Random / DICL / ASTRES / Skill-KNN / ViR2
└─────┬───────┘  (if Few-shot)
      ↓
┌─────────────┐
│LLM Interface│  OpenAI GPT / Anthropic Claude
└─────┬───────┘
      ↓
┌─────────────┐
│  Evaluator  │  Component F1 + Error Analysis
└─────┬───────┘
      ↓
  SQL Query + Metrics

See Architecture for details.
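The flow in the diagram can be sketched as a single function that wires the four stages together. This is a minimal illustration only: `run_pipeline` and the stub callables below are hypothetical names, not the interfaces of the `mint` package, and the stub LLM returns a fixed string instead of calling a real model.

```python
def run_pipeline(question, gold_sql, *, pool, select, build_prompt, llm,
                 evaluate, few_shot=True):
    """Strategy -> Selector -> LLM -> Evaluator, as in the diagram above."""
    # The selector only runs for the few-shot strategy.
    examples = select(question, pool) if few_shot else []
    prompt = build_prompt(question, examples)
    predicted_sql = llm(prompt)
    return predicted_sql, evaluate(predicted_sql, gold_sql)


# Hypothetical stand-ins for each stage, for illustration only.
def first_example(question, pool):
    return pool[:1]


def simple_prompt(question, examples):
    shots = "\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in examples)
    return f"{shots}\nQ: {question}\nSQL:".lstrip()


def stub_llm(prompt):
    return "SELECT COUNT(*) FROM orders"


def exact_match(predicted, gold):
    return {"exact_match": predicted.strip() == gold.strip()}


pool = [("How many users are there?", "SELECT COUNT(*) FROM users")]
sql, metrics = run_pipeline(
    "How many orders are there?",
    "SELECT COUNT(*) FROM orders",
    pool=pool,
    select=first_example,
    build_prompt=simple_prompt,
    llm=stub_llm,
    evaluate=exact_match,
)
print(sql, metrics)
```

Passing each stage as a callable mirrors the modular layout described above: swapping the selector or the evaluator does not touch the rest of the flow.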


🎓 Research Contributions

  1. ViR2 Method: Novel two-stage example selection combining semantic retrieval, syntactic (POS) matching, and diversity optimization
  2. Multi-Language Framework: Unified architecture for Vietnamese and English
  3. Enhanced Metrics: Component-wise evaluation beyond Exact Match
  4. Ablation Framework: Systematic testing of individual components
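To illustrate what component-wise evaluation adds over Exact Match, the sketch below computes a set-based F1 score per SQL clause. It assumes the predicted and gold queries have already been split into clause-level item sets; the function names and the dictionary layout are hypothetical and are not taken from the project's evaluator.

```python
def f1(pred_items, gold_items):
    """Set-based F1 between predicted and gold items of one component."""
    pred, gold = set(pred_items), set(gold_items)
    if not pred and not gold:
        return 1.0  # component absent from both queries counts as a match
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def component_f1(pred_parts, gold_parts):
    """F1 per component (e.g. select, where) over pre-parsed clause items."""
    keys = sorted(set(pred_parts) | set(gold_parts))
    return {k: f1(pred_parts.get(k, ()), gold_parts.get(k, ())) for k in keys}


# Assumed clause-level decomposition of a predicted and a gold query.
pred = {"select": {"name"}, "where": {"age > 30"}}
gold = {"select": {"name", "city"}, "where": {"age > 30"}}
print(component_f1(pred, gold))
```

In this example Exact Match would simply report a failure, while the per-component scores show that the WHERE clause is fully correct and only the SELECT clause is missing a column.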

📊 Supported Methods

| Method | Type | Speed | Complexity | Notes |
|---|---|---|---|---|
| Zero-shot | Baseline | ⚡⚡⚡ | Low | No training examples |
| Random | Few-shot | ⚡⚡⚡ | Low | Random selection baseline |
| DICL | Few-shot | ⚡⚡ | Medium | Semantic similarity only |
| ASTRES | Few-shot | | High | AST-based structural matching |
| Skill-KNN | Few-shot | ⚡⚡ | Medium | SQL skill extraction + matching |
| ViR2 | Few-shot | ⚡⚡ | Medium | Two-stage: Semantic + POS + Diversity |
| CoT | Reasoning | | High | Step-by-step reasoning |

📁 Project Structure

ViPERSQL/
├── vipersql.py              # Main CLI entry point
├── requirements.txt         # Dependencies
├── .env.example            # Environment template
├── docs/                   # Documentation
│   ├── ARCHITECTURE.md
│   ├── VIR2_METHOD.md
│   ├── STRATEGIES.md
│   └── ...
├── mint/                   # Core package
│   ├── core/              # Evaluator, LLM, Templates
│   ├── strategies/        # Zero-shot, Few-shot, CoT
│   ├── selectors/         # Random, DICL, ASTRES, ViR2
│   ├── metrics/           # Enhanced metrics
│   └── utils/             # Utilities
├── dataset/               # ViText2SQL dataset
├── templates/             # Prompt templates
├── scripts/               # Preprocessing scripts
└── results/               # Evaluation outputs

🛠️ Configuration

All settings are configurable via .env or command-line flags:

# Model selection
--model gpt-4o              # or claude-3-5-sonnet-20241022

# Strategy selection  
--strategy few-shot         # or zero-shot, cot

# Selector for few-shot
--example-selection-strategy vir2  # or random, dicl, astres, skill_knn

# ViR2 parameters
--vir2-candidate-pool-size 50      # Stage 1 pool size (M)
--vir2-beam-size 5                 # Beam search width (B)
--vir2-diversity-weight 0.3        # Diversity weight (λ)

# Dataset options
--level std                 # or syllable, word
--split dev                 # or test
--samples 100               # Number of samples

See Configuration Guide for all options.
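When scripting several runs, it can help to build the command line from keyword options. The helper below is a hypothetical convenience, not part of ViPERSQL; it only assembles the flags listed above and assumes `vipersql.py` is in the current directory.

```python
import shlex


def build_command(script="vipersql.py", **options):
    """Turn keyword options into an argument list for the CLI.

    Underscores in option names become hyphens, so
    vir2_beam_size=5 is emitted as --vir2-beam-size 5.
    Option values are passed through unchanged.
    """
    cmd = ["python", script]
    for name, value in options.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


# Print one command per diversity weight; pass the list to
# subprocess.run(cmd, check=True) to actually execute a run.
for lam in (0.1, 0.3, 0.5):
    cmd = build_command(
        strategy="few-shot",
        example_selection_strategy="vir2",
        vir2_diversity_weight=lam,
        samples=100,
    )
    print(shlex.join(cmd))
```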


🔬 Example: Running ViR2

# Basic ViR2 with default parameters (M=50, B=5, λ=0.3)
python vipersql.py \
  --strategy few-shot \
  --example-selection-strategy vir2 \
  --samples 100

# Custom ViR2 parameters
python vipersql.py \
  --strategy few-shot \
  --example-selection-strategy vir2 \
  --vir2-candidate-pool-size 100 \
  --vir2-beam-size 10 \
  --vir2-diversity-weight 0.5 \
  --samples 100

# Ablation: ViR2 without POS matching
python vipersql.py \
  --strategy few-shot \
  --example-selection-strategy vir2-no-pos \
  --samples 100
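To make the roles of M, B, and λ concrete, here is a minimal sketch of a two-stage selection: stage 1 keeps the M candidates closest to the query embedding, and stage 2 runs a beam search of width B over example sets, scoring each set as POS score plus λ times diversity. The definitions used here (Jaccard overlap of POS tag sets, mean pairwise distance as diversity), the data layout, and all function names are assumptions for illustration; see the ViR2 method documentation for the actual formulation.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def set_score(indices, pool, query_tags, lam):
    """Assumed scoring: mean POS overlap + lam * mean pairwise distance."""
    tags = [pool[i]["pos"] for i in indices]
    pos = sum(jaccard(t, query_tags) for t in tags) / len(tags)
    pairs = list(combinations(tags, 2))
    div = sum(1 - jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return pos + lam * div


def select_examples(query_vec, query_tags, pool, k=3, m=50, beam=5, lam=0.3):
    # Stage 1: keep the m candidates most similar to the query embedding.
    ranked = sorted(range(len(pool)),
                    key=lambda i: cosine(query_vec, pool[i]["vec"]),
                    reverse=True)
    candidates = ranked[:m]
    # Stage 2: grow example sets one item at a time, keeping the best
    # `beam` partial sets after each step (ties broken by index order).
    beams = [()]
    for _ in range(min(k, len(candidates))):
        expanded = {tuple(sorted(b + (c,)))
                    for b in beams for c in candidates if c not in b}
        beams = sorted(
            expanded,
            key=lambda s: (-set_score(s, pool, query_tags, lam), s),
        )[:beam]
    return list(beams[0])


# Toy pool: 2-d vectors stand in for sentence embeddings.
pool = [
    {"vec": [1.0, 0.0], "pos": ["N", "V"]},
    {"vec": [0.9, 0.1], "pos": ["N", "V"]},
    {"vec": [0.8, 0.2], "pos": ["N", "A"]},
    {"vec": [0.0, 1.0], "pos": ["P"]},
]
print(select_examples([1.0, 0.0], ["N", "V"], pool, k=2, m=3, beam=2, lam=0.3))
print(select_examples([1.0, 0.0], ["N", "V"], pool, k=2, m=3, beam=2, lam=1.0))
```

With the small weight the two examples that best match the query's tags are chosen; with the larger weight one of them is replaced by a syntactically different example, which is the effect `--vir2-diversity-weight` is meant to control.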

📄 License

MIT License. See the LICENSE file for details.
