9 changes: 9 additions & 0 deletions .idea/CodeFuse-Embeddings.iml


6 changes: 6 additions & 0 deletions .idea/misc.xml


8 changes: 8 additions & 0 deletions .idea/modules.xml


57 changes: 57 additions & 0 deletions .idea/workspace.xml


53 changes: 53 additions & 0 deletions F2LLM/GRADIENT_ACCUMULATION_README.md
@@ -0,0 +1,53 @@
# Gradient Accumulation in F2LLM

## How Gradient Accumulation Works in This Codebase

1. Set `gradient_accumulation_steps` in `config.json` (the default defined in `arguments.py` is 1, meaning no accumulation)
   - e.g. `"gradient_accumulation_steps": 4` accumulates gradients over 4 micro-batches


2. `utils.py`:
```python
# Scale loss by gradient accumulation steps to maintain same effective learning rate
loss_total = loss_total / args.gradient_accumulation_steps

# Update step only after gradient_accumulation_steps
if (completed_steps + 1) % args.gradient_accumulation_steps == 0:
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
```
- Without accumulation: Process 1 batch of size N → compute loss → update parameters
- With accumulation: Process 4 micro-batches of size N/4 → accumulate gradients → update parameters

Both yield approximately the same parameter update, because scaling the loss by `gradient_accumulation_steps` makes the accumulated gradient match the average over the full effective batch.
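Below is a minimal, self-contained sketch of this accumulation pattern in plain PyTorch. It is an illustration only, not the actual `utils.py` training loop; the model, data, optimizer, and scheduler are placeholders.

```python
import torch

def train_one_epoch(model, dataloader, optimizer, lr_scheduler, accumulation_steps):
    """Accumulate gradients over `accumulation_steps` micro-batches per optimizer step."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        # Scale so the summed micro-batch gradients approximate the mean
        # gradient over the full effective batch.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
```

With `accumulation_steps=1` this reduces to the standard loop, matching the default in `arguments.py`.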


## Example

Let's say you have:
- Desired effective batch size: 32
- GPU memory only allows: 8 samples per batch

**Without Gradient Accumulation**:
- You're limited to batch size 8
- Effective batch size = 8
- May result in suboptimal training dynamics

**With Gradient Accumulation (steps=4)**:
- Process 4 micro-batches of size 8 each
- Effective batch size = 32 (4 × 8)
- Same training dynamics as a batch size of 32
- Better gradient estimates due to the larger effective batch size (see the arithmetic sketch below)
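To make this arithmetic concrete, here is a small helper (hypothetical, for illustration only; it ignores multi-GPU data parallelism, which would further multiply the effective batch size by the number of processes):

```python
import math

def effective_batch_stats(num_samples, micro_batch_size, accumulation_steps):
    """Return (effective batch size, optimizer updates per epoch)."""
    effective_batch = micro_batch_size * accumulation_steps
    updates_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, updates_per_epoch

# 10,000 samples, micro-batches of 8, accumulate over 4 steps
print(effective_batch_stats(10_000, 8, 4))  # (32, 313)
```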

## Configuration Example

To use gradient accumulation, modify your config file (JSON does not allow comments, so the explanation follows the snippet):
```json
{
    "train_batch_size": 8,
    "gradient_accumulation_steps": 4
}
```
This gives an effective batch size of 32 (8 × 4) while only using memory for 8 samples at a time.
39 changes: 39 additions & 0 deletions F2LLM/RAY_TRAINING.md
@@ -0,0 +1,39 @@
## Ray Distributed Training

This directory contains the Ray-based distributed training implementation for F2LLM embedding models. It provides scalable, fault-tolerant training with automatic resource management and scales from a single node to multi-node clusters.

### Usage

#### Single-Node Training
```bash
python ray_distributed_run.py --config configs/ray_config.json --num_workers 4 --num_gpus_per_worker 1.0
```

#### Multi-Node Training

1. On the head node:
```bash
ray start --head --port=6379
python ray_distributed_run.py --config configs/ray_config.json --num_workers 8 --num_gpus_per_worker 1.0 --ray_head_address HEAD_NODE_IP
```

2. On worker nodes:
```bash
ray start --address=HEAD_NODE_IP:6379
```

### Configuration

The Ray-specific configuration extends the original config with these additional parameters (a minimal launcher sketch follows the list):

- `num_workers`: Number of Ray workers (processes) to use
- `num_gpus_per_worker`: Number of GPUs per worker
- `num_cpus_per_worker`: Number of CPUs per worker
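
For reference, here is a minimal sketch of how a launcher can map these parameters onto Ray Train's `TorchTrainer`. This illustrates the pattern only; it is not the actual `ray_distributed_run.py` implementation, and the worker function body is a placeholder:

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(worker_config):
    # Each Ray worker executes this function. Ray Train sets up the
    # torch.distributed process group, so the existing training loop
    # can run under DDP inside this body.
    ...


ray.init()  # or ray.init(address="auto") to join an existing cluster

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"config_path": "configs/ray_config.json"},
    scaling_config=ScalingConfig(
        num_workers=4,                    # num_workers
        use_gpu=True,                     # one GPU per worker (num_gpus_per_worker=1.0)
        resources_per_worker={"CPU": 2},  # num_cpus_per_worker
    ),
)
result = trainer.fit()
```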

### Requirements

Install Ray-specific dependencies:

```bash
pip install -r ray_requirements.txt
```
27 changes: 24 additions & 3 deletions F2LLM/README.md
@@ -27,21 +27,42 @@ In this repo we provide a streamlined and efficient script for training embedding
- Set up the environment following `requirements.txt`. Note that transformers>=4.51.0 is required for training Qwen3 models.
- Download data and backbone models from Hugging Face (we use Qwen3 models).
- Run `tokenize_data_qwen.py` to tokenize the downloaded data.
- Modify model path, data path, and other arguments in `configs/config.json`.
- Modify model path, data path, and other arguments in `configs/config.json`. You can also set `gradient_accumulation_steps` here to train with a larger effective batch size on resource-constrained hardware (see the Gradient Accumulation section below).
- Start training with `accelerate launch --config_file configs/accelerate_config.yaml run.py --config configs/config.json`.

Note: we recommend setting `num_processes` to 1 in `configs/accelerate_config.yaml` and launching the training code once to generate the cache for the training data before starting the actual training.

For multi-node training, run on the main node:
### Gradient Accumulation

The training script supports gradient accumulation, which enables larger effective batch sizes on resource-constrained hardware: gradients are accumulated over multiple smaller batches before each optimization step. Configure it by setting the `gradient_accumulation_steps` parameter in your config file (the default is 1, i.e. no accumulation). For example, with `train_batch_size=8` and `gradient_accumulation_steps=4`, the effective batch size becomes 32, as shown in the snippet below.
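
A minimal config excerpt illustrating just these two settings (all other required fields in `configs/config.json` are omitted here):

```json
{
    "train_batch_size": 8,
    "gradient_accumulation_steps": 4
}
```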

### Distributed Training Options

We support multiple distributed training frameworks:

#### Hugging Face Accelerate
```bash
accelerate launch --config_file configs/accelerate_config.yaml run.py --config configs/config.json
```

For multi-node training with Accelerate, run on the main node:
```
accelerate launch --config_file configs/accelerate_config.yaml --num_machines N_NODE --num_processes N_PROCESSES --machine_rank 0 --main_process_ip MASTER_IP --main_process_port MASTER_PORT run.py --config configs/config.json
```

where N_NODE is the number of machines; N_PROCESSES is N_NODE\*8; MASTER_IP is the IP address of your master node, and MASTER_PORT is a port available on your machine (e.g. 6379).
where N_NODE is the number of machines; N_PROCESSES is N_NODE*8; MASTER_IP is the IP address of your master node, and MASTER_PORT is a port available on your machine (e.g. 6379).

On worker nodes, also run the above command but modify `machine_rank` accordingly.

#### Ray Distributed Training (NEW!)
For scalable, fault-tolerant training across multiple nodes and GPUs, use our new Ray integration:

```bash
python ray_distributed_run.py --config configs/ray_config.json --num_workers 4 --num_gpus_per_worker 1.0
```

See [RAY_TRAINING.md](RAY_TRAINING.md) for detailed Ray training documentation.

### Citation

If you use the F2LLM models, data, or code, please cite the following technical report.
Binary file added F2LLM/__pycache__/arguments.cpython-313.pyc
Binary file not shown.
Binary file added F2LLM/__pycache__/model.cpython-313.pyc
Binary file not shown.
Binary file not shown.
Binary file added F2LLM/__pycache__/run.cpython-313.pyc
Binary file not shown.
Binary file added F2LLM/__pycache__/utils.cpython-313.pyc
Binary file not shown.
2 changes: 2 additions & 0 deletions F2LLM/arguments.py
@@ -27,6 +27,8 @@ class Args:
log_interval: int = 20
checkpointing_steps: int = 100
validation_steps: int = 100
# gradient accumulation
gradient_accumulation_steps: int = 1
# just placeholder, for logging purpose
num_processes: int=0

3 changes: 2 additions & 1 deletion F2LLM/configs/config.json
@@ -15,5 +15,6 @@
"warmup_steps": 500,
"train_epochs": 2,
"log_interval": 100,
"num_hard_neg": 7
"num_hard_neg": 7,
"gradient_accumulation_steps": 1
}
23 changes: 23 additions & 0 deletions F2LLM/configs/ray_config.json
@@ -0,0 +1,23 @@
{
"model_path": "models/qwen3-4b",
"experiment_id": "ray_distributed_4b+lr.8e-6+bs.16x32+context.1024+2epochs",
"train_data_path": "training_data/data_tokenized_qwen",
"output_dir": "output",
"tb_dir": "output/tb",
"cache_dir": "cache",
"train_batch_size": 16,
"checkpointing_steps": 5000,
"validation_steps": 5000,
"max_seq_length": 1024,
"learning_rate": 8e-6,
"min_lr": 1e-7,
"weight_decay": 0.01,
"warmup_steps": 500,
"train_epochs": 2,
"log_interval": 100,
"num_hard_neg": 7,
"gradient_accumulation_steps": 1,
"num_workers": 4,
"num_gpus_per_worker": 1.0,
"num_cpus_per_worker": 2
}