9 changes: 9 additions & 0 deletions .idea/CodeFuse-Embeddings.iml


6 changes: 6 additions & 0 deletions .idea/misc.xml


8 changes: 8 additions & 0 deletions .idea/modules.xml


57 changes: 57 additions & 0 deletions .idea/workspace.xml


53 changes: 53 additions & 0 deletions F2LLM/GRADIENT_ACCUMULATION_README.md
@@ -0,0 +1,53 @@
# Gradient Accumulation in F2LLM

## How Gradient Accumulation Works in This Codebase

1. Set `gradient_accumulation_steps` in `config.json` (the default defined in `arguments.py` is 1, meaning no accumulation)
   - e.g. `"gradient_accumulation_steps": 4` accumulates gradients over 4 micro-batches


2. `utils.py`:
```python
# Scale loss by gradient accumulation steps to maintain same effective learning rate
loss_total = loss_total / args.gradient_accumulation_steps

# Update step only after gradient_accumulation_steps
if (completed_steps + 1) % args.gradient_accumulation_steps == 0:
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
```
- Without accumulation: Process 1 batch of size N → compute loss → update parameters
- With accumulation: Process 4 micro-batches of size N/4 → accumulate gradients → update parameters

Both yield approximately the same parameter update, because scaling the loss by `gradient_accumulation_steps` makes the accumulated gradient match the average over the full effective batch.
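Below is a minimal, self-contained sketch of this accumulation pattern in plain PyTorch. It is an illustration only, not the actual `utils.py` training loop; the model, data, optimizer, and scheduler are placeholders.

```python
import torch

def train_one_epoch(model, dataloader, optimizer, lr_scheduler, accumulation_steps):
    """Accumulate gradients over `accumulation_steps` micro-batches per optimizer step."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        # Scale so the summed micro-batch gradients approximate the mean
        # gradient over the full effective batch.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
```

With `accumulation_steps=1` this reduces to the standard loop, matching the default in `arguments.py`.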


## Example

Let's say you have:
- Desired effective batch size: 32
- GPU memory only allows: 8 samples per batch

**Without Gradient Accumulation**:
- You're limited to batch size 8
- Effective batch size = 8
- May result in suboptimal training dynamics

**With Gradient Accumulation (steps=4)**:
- Process 4 micro-batches of size 8 each
- Effective batch size = 32 (4 × 8)
- Same training dynamics as a batch size of 32
- Better gradient estimates due to the larger effective batch size (see the arithmetic sketch below)
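To make this arithmetic concrete, here is a small helper (hypothetical, for illustration only; it ignores multi-GPU data parallelism, which would further multiply the effective batch size by the number of processes):

```python
import math

def effective_batch_stats(num_samples, micro_batch_size, accumulation_steps):
    """Return (effective batch size, optimizer updates per epoch)."""
    effective_batch = micro_batch_size * accumulation_steps
    updates_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, updates_per_epoch

# 10,000 samples, micro-batches of 8, accumulate over 4 steps
print(effective_batch_stats(10_000, 8, 4))  # (32, 313)
```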

## Configuration Example

To use gradient accumulation, modify your config file (JSON does not allow comments, so the explanation follows the snippet):
```json
{
    "train_batch_size": 8,
    "gradient_accumulation_steps": 4
}
```
This gives an effective batch size of 32 (8 × 4) while only using memory for 8 samples at a time.
39 changes: 39 additions & 0 deletions F2LLM/RAY_TRAINING.md
@@ -0,0 +1,39 @@
## Ray Distributed Training

This directory contains the Ray-based distributed training implementation for F2LLM embedding models. It provides scalable, fault-tolerant training with automatic resource management and scales from a single node to multi-node clusters.

### Usage

#### Single-Node Training
```bash
python ray_distributed_run.py --config configs/ray_config.json --num_workers 4 --num_gpus_per_worker 1.0
```

#### Multi-Node Training

1. On the head node:
```bash
ray start --head --port=6379
python ray_distributed_run.py --config configs/ray_config.json --num_workers 8 --num_gpus_per_worker 1.0 --ray_head_address HEAD_NODE_IP
```

2. On worker nodes:
```bash
ray start --address=HEAD_NODE_IP:6379
```

### Configuration

The Ray-specific configuration extends the original config with these additional parameters (a minimal launcher sketch follows the list):

- `num_workers`: Number of Ray workers (processes) to use
- `num_gpus_per_worker`: Number of GPUs per worker
- `num_cpus_per_worker`: Number of CPUs per worker
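
For reference, here is a minimal sketch of how a launcher can map these parameters onto Ray Train's `TorchTrainer`. This illustrates the pattern only; it is not the actual `ray_distributed_run.py` implementation, and the worker function body is a placeholder:

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(worker_config):
    # Each Ray worker executes this function. Ray Train sets up the
    # torch.distributed process group, so the existing training loop
    # can run under DDP inside this body.
    ...


ray.init()  # or ray.init(address="auto") to join an existing cluster

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"config_path": "configs/ray_config.json"},
    scaling_config=ScalingConfig(
        num_workers=4,                    # num_workers
        use_gpu=True,                     # one GPU per worker (num_gpus_per_worker=1.0)
        resources_per_worker={"CPU": 2},  # num_cpus_per_worker
    ),
)
result = trainer.fit()
```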

### Requirements

Install Ray-specific dependencies:

```bash
pip install -r ray_requirements.txt
```
27 changes: 24 additions & 3 deletions F2LLM/README.md
@@ -27,21 +27,42 @@ In this repo we provide a streamlined and efficient script for training embedding
- Set up the environment following `requirements.txt`. Note that transformers>=4.51.0 is required for training Qwen3 models.
- Download data and backbone models from Hugging Face (we use Qwen3 models).
- Run `tokenize_data_qwen.py` to tokenize the downloaded data.
- Modify model path, data path, and other arguments in `configs/config.json`.
- Modify model path, data path, and other arguments in `configs/config.json`. You can also set `gradient_accumulation_steps` here to train with a larger effective batch size on resource-constrained hardware (see the Gradient Accumulation section below).
- Start training with `accelerate launch --config_file configs/accelerate_config.yaml run.py --config configs/config.json`.

Note: we recommend setting `num_processes` to 1 in `configs/accelerate_config.yaml` and launching the training code once to generate the cache for the training data before starting the actual training.

For multi-node training, run on the main node:
### Gradient Accumulation

The training script supports gradient accumulation, which enables larger effective batch sizes on resource-constrained hardware: gradients are accumulated over multiple smaller batches before each optimization step. Configure it by setting the `gradient_accumulation_steps` parameter in your config file (the default is 1, i.e. no accumulation). For example, with `train_batch_size=8` and `gradient_accumulation_steps=4`, the effective batch size becomes 32, as shown in the snippet below.
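
A minimal config excerpt illustrating just these two settings (all other required fields in `configs/config.json` are omitted here):

```json
{
    "train_batch_size": 8,
    "gradient_accumulation_steps": 4
}
```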

### Distributed Training Options

We support multiple distributed training frameworks:

#### Hugging Face Accelerate
```bash
accelerate launch --config_file configs/accelerate_config.yaml run.py --config configs/config.json
```

For multi-node training with Accelerate, run on the main node:
```
accelerate launch --config_file configs/accelerate_config.yaml --num_machines N_NODE --num_processes N_PROCESSES --machine_rank 0 --main_process_ip MASTER_IP --main_process_port MASTER_PORT run.py --config configs/config.json
```

where N_NODE is the number of machines; N_PROCESSES is N_NODE\*8; MASTER_IP is the IP address of your master node, and MASTER_PORT is a port available on your machine (e.g. 6379).
where N_NODE is the number of machines; N_PROCESSES is N_NODE*8; MASTER_IP is the IP address of your master node, and MASTER_PORT is a port available on your machine (e.g. 6379).

On worker nodes, also run the above command but modify `machine_rank` accordingly.

#### Ray Distributed Training (NEW!)
For scalable, fault-tolerant training across multiple nodes and GPUs, use our new Ray integration:

```bash
python ray_distributed_run.py --config configs/ray_config.json --num_workers 4 --num_gpus_per_worker 1.0
```

See [RAY_TRAINING.md](RAY_TRAINING.md) for detailed Ray training documentation.

### Citation

If you use the F2LLM models, data, or code, please cite the following technical report.
Binary file added F2LLM/__pycache__/arguments.cpython-313.pyc
Binary file not shown.
Binary file added F2LLM/__pycache__/model.cpython-313.pyc
Binary file not shown.
Binary file not shown.
Binary file added F2LLM/__pycache__/run.cpython-313.pyc
Binary file not shown.
Binary file added F2LLM/__pycache__/utils.cpython-313.pyc
Binary file not shown.
2 changes: 2 additions & 0 deletions F2LLM/arguments.py
@@ -27,6 +27,8 @@ class Args:
log_interval: int = 20
checkpointing_steps: int = 100
validation_steps: int = 100
# gradient accumulation
gradient_accumulation_steps: int = 1
# just placeholder, for logging purpose
num_processes: int=0

3 changes: 2 additions & 1 deletion F2LLM/configs/config.json
@@ -15,5 +15,6 @@
"warmup_steps": 500,
"train_epochs": 2,
"log_interval": 100,
"num_hard_neg": 7
"num_hard_neg": 7,
"gradient_accumulation_steps": 1
}
23 changes: 23 additions & 0 deletions F2LLM/configs/ray_config.json
@@ -0,0 +1,23 @@
{
"model_path": "models/qwen3-4b",
"experiment_id": "ray_distributed_4b+lr.8e-6+bs.16x32+context.1024+2epochs",
"train_data_path": "training_data/data_tokenized_qwen",
"output_dir": "output",
"tb_dir": "output/tb",
"cache_dir": "cache",
"train_batch_size": 16,
"checkpointing_steps": 5000,
"validation_steps": 5000,
"max_seq_length": 1024,
"learning_rate": 8e-6,
"min_lr": 1e-7,
"weight_decay": 0.01,
"warmup_steps": 500,
"train_epochs": 2,
"log_interval": 100,
"num_hard_neg": 7,
"gradient_accumulation_steps": 1,
"num_workers": 4,
"num_gpus_per_worker": 1.0,
"num_cpus_per_worker": 2
}