diff --git a/auto_bagit_transfer/DOCUMENTATION.md b/auto_bagit_transfer/DOCUMENTATION.md
new file mode 100644
index 0000000..fe8d254
--- /dev/null
+++ b/auto_bagit_transfer/DOCUMENTATION.md
@@ -0,0 +1,368 @@
+# Automated BagIt Transfer Tool - Documentation
+
+## What is this tool?
+
+The Automated BagIt Transfer Tool is a cross-platform Python script that helps you safely move folders from one location to another while verifying that your files were not corrupted during the transfer. It works on **Windows**, **Mac**, and Linux. It packages each folder according to the BagIt specification, developed by the **Library of Congress**, which adds checksums and metadata so that damaged or missing files can be detected.
+
+Think of it like putting your files in a secure envelope with a seal - if anything goes wrong during transfer, you'll know about it.
+
+This tool is built on top of the [bagit-python](https://github.com/LibraryOfCongress/bagit-python) library, which is the official Python implementation of the BagIt specification maintained by the Library of Congress.
+
+### 🖥️ **Cross-Platform Compatibility**
+
+- **Windows:** Use `python` command
+- **Mac/Linux:** Use `python3` command
+- **Same script works everywhere** - no need for different versions!
+
+## What does "BagIt" mean?
+
+BagIt is a hierarchical file packaging format developed by the **Library of Congress** for digital preservation and data transfer. It's like a digital safety wrapper for your files that ensures long-term accessibility and integrity.
+
+The BagIt specification is widely used by libraries, archives, and institutions worldwide for digital preservation. When you "bag" a folder, the tool:
+
+- Creates checksums (digital fingerprints) for every file using industry-standard algorithms
+- Adds metadata (information about when and how the bag was created)
+- Organizes everything in a standardized format recognized internationally
+- Verifies that nothing was lost or corrupted during transfer
+- Follows Library of Congress best practices for digital preservation
+
+## Key Features
+
+### 🎯 **Smart Folder Selection**
+
+- Transfer ALL folders from a directory
+- Transfer only SPECIFIC folders you choose
+- EXCLUDE certain folders you don't want
+- Skip empty folders automatically
+- **Handles existing BagIt bags** by re-bagging them with fresh checksums and validation
+
+### 🔒 **Data Protection**
+
+- Creates SHA256 and SHA512 checksums for every file
+- Validates all files after transfer
+- Detects any corruption or missing files
+- Keeps original files safe (never modifies them)
+- **Automatically excludes system/hidden files** (macOS .DS_Store, ._files, Windows Thumbs.db, desktop.ini, etc.)
+- **Cross-platform hidden file detection** for clean, portable bags
+
+### 📊 **Detailed Logging**
+
+- Records every action with timestamps
+- Shows success/failure for each folder
+- Provides detailed validation error messages for troubleshooting
+- Provides transfer statistics
+- Saves logs for future reference
+
+### 🧪 **Safe Testing**
+
+- Dry-run mode to preview what will happen
+- No actual transfers until you're ready
+- See exactly which folders will be processed
+
+### ⚙️ **Flexible Configuration**
+
+- Use config files for repeated tasks
+- Override settings with command-line options
+- Customize metadata and processing options
+- **Configurable batch processing** for memory-efficient handling of large datasets
+- **Space-efficient processing** with automatic temporary file cleanup after each batch
+
+### 🚀 **Performance & Efficiency**
+
+- **Batch processing** to manage memory usage and disk space efficiently
+- **Default batch size of 1** for maximum space efficiency (configurable)
+- Automatic cleanup of temporary files after each batch
+- **Handles existing bags intelligently** by re-bagging with fresh validation
+- Processes regular folders and existing bags separately for optimal handling
+
+## Usage
+
+### Basic Commands
+
+```bash
+# Windows: use 'python' | Mac/Linux: use 'python3'
+
+# Transfer all folders
+python auto_bagit_transfer.py --source "C:\Photos" --destination "D:\Backup"
+
+# Transfer specific folders only
+python auto_bagit_transfer.py --source "C:\Photos" --destination "D:\Backup" --include-folders "Vacation" "Family"
+
+# Exclude certain folders
+python auto_bagit_transfer.py --source "C:\Photos" --destination "D:\Backup" --exclude-folders "Temp" "Screenshots"
+
+# Test first (dry run)
+python auto_bagit_transfer.py --source "C:\Photos" --destination "D:\Backup" --dry-run
+
+# Use config file
+python auto_bagit_transfer.py
+
+# Process in batches (default is 1 for space efficiency)
+python auto_bagit_transfer.py --source "C:\Photos" --destination "D:\Backup" --batch-size 5
+```
+
+### Configuration File
+
+The `config.ini` file allows you to set default paths and options so you don't have to type them every time. This is especially useful for repeated transfers or when you have long file paths.
+
+**What it does:**
+
+- Stores your frequently used source and destination paths
+- Sets default BagIt options (checksum algorithms, worker processes, batch size)
+- Saves metadata information for bag creation
+- Eliminates need to type long command-line arguments
+
+**How to use it:**
+Edit `config.ini` with your preferred settings:
+
+```ini
+[PATHS]
+source_path = C:\Users\Dell\OneDrive\Pictures
+destination_path = D:\1_USA\AICIC\bagit\auto_transfer
+
+[BAGIT_OPTIONS]
+checksums = sha256,sha512
+processes = 4
+batch_size = 1
+
+[METADATA]
+source_organization = My Organization
+contact_name = Your Name
+```
+
+Once configured, simply run `python auto_bagit_transfer.py` without any arguments to use these settings.
+
+## Command Options
+
+| Option                          | Description                                                  |
+| ------------------------------- | ------------------------------------------------------------ |
+| `--source PATH`                 | Source directory                                             |
+| `--destination PATH`            | Destination directory                                        |
+| `--include-folders NAME1 NAME2` | Only transfer these folders                                  |
+| `--exclude-folders NAME1 NAME2` | Skip these folders                                           |
+| `--dry-run`                     | Preview without transferring                                 |
+| `--include-empty`               | Include empty folders                                        |
+| `--config FILE`                 | Use different config file                                    |
+| `--batch-size NUMBER`           | Process folders in batches (default: 1 for space efficiency) |
+
+## What Happens When You Run the Tool?
+
+### Step 1: Preparation
+
+- Checks that the source path exists and is a directory
+- Creates destination folder if needed
+- Sets up logging
+
+### Step 2: Folder Discovery
+
+- Scans source directory for folders
+- **Identifies existing BagIt bags** and regular folders separately
+- Applies your include/exclude filters
+- Skips empty folders (unless you say otherwise)
+- Shows you what will be processed (regular folders and existing bags)
+
+### Step 3: Batch Processing
+
+- **Processes folders in configurable batches** (default: 1 for space efficiency)
+- Creates temporary directory for each batch
+- **For regular folders:** Creates clean copy excluding hidden/system files, then bags
+- **For existing bags:** Extracts data directory, excludes hidden files, creates fresh bag
+- Calculates checksums for all files
+- Adds BagIt metadata with processing information
+
+### Step 4: Transfer
+
+- Copies the bag to destination
+- Validates the transferred bag
+- Confirms all files arrived safely
+- Records success or failure
+
+### Step 5: Cleanup & Summary
+
+- **Automatically removes temporary files after each batch**
+- Shows detailed transfer summary with separate counts for regular folders and re-bagged items
+- Provides success rate statistics
+- Saves comprehensive log file with batch processing details
+
+## Output Structure
+
+### BagIt Format
+
+```
+MyFolder/
+├── bagit.txt              # Format info
+├── bag-info.txt           # Metadata
+├── manifest-sha256.txt    # File checksums
+├── manifest-sha512.txt    # File checksums
+├── tagmanifest-*.txt      # Metadata checksums
+└── data/                  # Your original files
+```
+
+### Log Files
+
+Creates timestamped logs (e.g., `auto_bagit_transfer_20250904_114137.log`) showing:
+
+- Processing steps and results
+- Success/failure for each folder
+- Final transfer statistics
+
+## Batch Processing & Space Efficiency
+
+### Why Batch Processing?
+
+The tool uses **batch processing by default** to ensure efficient use of temporary disk space and system resources:
+
+- **Prevents temporary space exhaustion** - Only processes one folder at a time by default
+- **Memory efficient** - Doesn't load all folders into memory simultaneously
+- **Automatic cleanup** - Removes temporary files after each batch
+- **Handles large datasets** - Can process thousands of folders without running out of space
+
+### How Batch Processing Works
+
+1. **Creates temporary directory** for current batch (its name starts with a prefix such as `bagit_batch_1_`)
+2. **Processes folders in batch:**
+   - Regular folders: Creates clean copy (excluding hidden files) → bags → transfers
+   - Existing bags: Extracts data directory → excludes hidden files → creates fresh bag → transfers
+3. **Validates each transfer** with detailed error reporting
+4. **Cleans up temporary directory** completely before next batch
+5. **Moves to next batch** until all folders processed
+
+### Default Space-Efficient Settings
+
+```bash
+# Default behavior (batch size = 1, maximum space efficiency)
+python auto_bagit_transfer.py --source "SOURCE" --destination "DEST"
+
+# Process multiple folders per batch (if you have more temp space)
+python auto_bagit_transfer.py --batch-size 5 --source "SOURCE" --destination "DEST"
+
+# Set default in config.ini:
+#   [BAGIT_OPTIONS]
+#   batch_size = 1
+```
+
+### Hidden File Exclusion
+
+The tool automatically excludes problematic system/hidden files that can cause validation issues:
+
+**macOS files excluded:**
+- `.DS_Store` (Finder metadata)
+- `._filename` (resource forks)
+- Files starting with `._`
+
+**Windows files excluded:**
+- `Thumbs.db` / `thumbs.db` (thumbnail cache)
+- `desktop.ini` (folder customization)
+- `folder.jpg`, `albumartsmall.jpg` (media metadata)
+
+**General exclusions:**
+- Hidden files starting with `.` (except in existing bag structure)
+
+## Troubleshooting
+
+| Problem                      | Solution                                                                  |
+| ---------------------------- | ------------------------------------------------------------------------- |
+| "Python not found"           | Install Python 3.x, ensure it's in PATH                                   |
+| "Source path does not exist" | Check path spelling and existence                                         |
+| "Permission denied"          | Run as Administrator or check permissions                                 |
+| "Out of space"               | Default batch size of 1 should prevent this; check available temp space   |
+| "Bag validation failed"      | Check detailed error in logs; tool now excludes problematic hidden files  |
+
+### Cross-Platform Compatibility
+
+**Hidden File Handling:**
+The tool now **automatically excludes** problematic system files that previously caused validation issues:
+
+- **macOS:** `.DS_Store`, `._files`, resource forks automatically excluded
+- **Windows:** `Thumbs.db`, `desktop.ini`, media cache files automatically excluded
+- **All platforms:** Generic hidden files (starting with `.`) excluded from data
+
+**Benefits:**
+- **Clean, portable bags** that work across different operating systems
+- **No more validation failures** due to hidden system files
+- **Consistent behavior** whether source is Windows, Mac, or Linux
+
+## Safety & Best Practices
+
+**Safety Features:**
+
+- Never modifies original files
+- Automatic validation of all transfers
+- Complete audit trail in logs
+- Dry-run testing mode
+
+**Best Practices:**
+
+1. Always test with `--dry-run` first
+2. Ensure adequate disk space (bags are ~10% larger)
+3. Don't interrupt transfers in progress
+4. Review log files for any failures
+
+## FAQ
+
+**Q: Can I stop and resume a transfer?**
+A: There is no automatic resume, but you can continue manually. Check the destination directory to see which folders transferred successfully, then use `--exclude-folders` to skip completed ones or `--include-folders` to process only the remaining ones.
+
+**Q: What about duplicate folder names?**
+A: If a folder with the same name already exists in the destination, the tool appends a number (folder_1, folder_2, etc.)
+
+**Q: Can I transfer to network drives?**
+A: Yes, with proper write permissions.
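The manual resume described in the FAQ can be sketched in a few lines. This is an illustration only and not part of the tool: `remaining_folders` is a hypothetical helper, and it assumes that a same-named folder in the destination means the transfer finished. It does not validate the bag, and it ignores the `_1`, `_2` suffixes the tool adds for duplicate names.

```python
from pathlib import Path
from typing import List


def remaining_folders(source: Path, destination: Path) -> List[str]:
    """Return names of source subfolders with no same-named folder in destination.

    Hypothetical helper for illustration; presence in the destination is only
    a hint that the transfer completed, so check the log as well.
    """
    # Names of folders that already exist in the destination
    done = {p.name for p in destination.iterdir() if p.is_dir()}
    # Keep visible source subfolders that have not been transferred yet
    return sorted(
        p.name
        for p in source.iterdir()
        if p.is_dir() and not p.name.startswith(".") and p.name not in done
    )
```

The returned names can then be passed to `--include-folders` on the next run.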
+
+**Q: How to verify successful transfer?**
+A: Check log for "Success rate: 100.0%" and detailed batch summaries showing successful transfers.
+
+**Q: What happens to existing BagIt bags in my source?**
+A: The tool detects existing bags and re-bags them with fresh checksums and validation, excluding any hidden files that may have been added.
+
+## Quick Reference
+
+```bash
+# Get help
+python auto_bagit_transfer.py --help
+
+# Most common usage
+python auto_bagit_transfer.py --source "SOURCE_PATH" --destination "DEST_PATH"
+
+# With specific folders
+python auto_bagit_transfer.py --include-folders "Folder1" "Folder2"
+
+# Test first
+python auto_bagit_transfer.py --dry-run
+```
+
+**Files needed:** `auto_bagit_transfer.py`, `config.ini`
+**Dependencies:** `bagit` (the bagit-python library; the script tries `pip install bagit` if it is missing)
+**Creates:** BagIt-compliant bags, timestamped log files with batch processing details
+
+### For Large Datasets
+
+```bash
+# Default space-efficient processing (recommended)
+python auto_bagit_transfer.py --source "SOURCE" --destination "DEST"
+
+# Process multiple folders per batch (if you have temp space)
+python auto_bagit_transfer.py --batch-size 5 --source "SOURCE" --destination "DEST"
+```
+
+## About BagIt and Library of Congress
+
+This tool implements the **BagIt File Packaging Format** (RFC 8493), a specification developed by the Library of Congress and the California Digital Library. BagIt is designed to support storage and transfer of arbitrary digital content in a manner that is both simple and robust.
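To make the checksum idea concrete, here is a minimal sketch of how a per-file SHA-256 listing can reveal a changed or missing file. It is an illustration under stated assumptions, not the tool's implementation: the tool delegates manifest creation and validation to bagit-python, and `sha256_manifest` and `find_changes` are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_manifest(payload_dir: Path) -> Dict[str, str]:
    """Map each file under payload_dir (relative POSIX path) to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(payload_dir.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            # Read in chunks so large files are not loaded into memory at once
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                    digest.update(chunk)
            manifest[path.relative_to(payload_dir).as_posix()] = digest.hexdigest()
    return manifest


def find_changes(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return paths whose digest differs or that exist in only one manifest."""
    return sorted(p for p in set(before) | set(after) if before.get(p) != after.get(p))
```

Computing the manifest before a copy and again afterwards, then comparing the two, is the same idea that bag validation applies with the stored `manifest-sha256.txt`.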
+
+### Key Benefits of the BagIt Standard:
+
+- **Widely adopted** by libraries, archives, and digital preservation communities
+- **Platform independent** - works across different operating systems and storage systems
+- **Self-describing** - bags contain all necessary metadata and validation information
+- **Tamper evident** - any change to a file is detected when the bag is validated
+- **Future-proof** - based on open standards and simple file formats
+
+### Learn More:
+
+- **BagIt Specification:** [RFC 8493](https://tools.ietf.org/rfc/rfc8493.txt)
+- **Library of Congress BagIt:** [https://www.loc.gov/preservation/digital/formats/fdd/fdd000531.shtml](https://www.loc.gov/preservation/digital/formats/fdd/fdd000531.shtml)
+- **bagit-python Library:** [https://github.com/LibraryOfCongress/bagit-python](https://github.com/LibraryOfCongress/bagit-python)
+
+---
+
+_This tool is released under CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, making it freely available for any use._
diff --git a/auto_bagit_transfer/auto_bagit_transfer.py b/auto_bagit_transfer/auto_bagit_transfer.py
new file mode 100644
index 0000000..1d812ad
--- /dev/null
+++ b/auto_bagit_transfer/auto_bagit_transfer.py
@@ -0,0 +1,576 @@
+#!/usr/bin/env python3
+r"""
+Automated BagIt Transfer Tool
+
+This script automatically transfers all folders from a source directory to a destination
+directory using the BagIt format. Each folder is converted to a bag before transfer.
+
+This work is released under CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
+See: https://creativecommons.org/publicdomain/zero/1.0/
+
+Usage:
+    python auto_bagit_transfer.py --source "D:\source\path" --destination "D:\destination\path"
+"""
+
+import os
+import sys
+import argparse
+import shutil
+import logging
+import configparser
+import tempfile
+from pathlib import Path
+from datetime import datetime
+
+try:
+    import bagit
+except ImportError:
+    print("Error: bagit module not found. Installing...")
+    import subprocess
+    try:
+        subprocess.check_call([sys.executable, "-m", "pip", "install", "bagit"])
+        import bagit
+        print("Successfully installed bagit module.")
+    except subprocess.CalledProcessError:
+        print("Error: Failed to install bagit module. Please install manually with: pip install bagit")
+        sys.exit(1)
+
+def check_dependencies():
+    """Check if all required dependencies are available"""
+    missing_deps = []
+
+    try:
+        import bagit  # noqa: F401
+    except ImportError:
+        missing_deps.append("bagit")
+
+    if missing_deps:
+        print("Missing required dependencies:")
+        for dep in missing_deps:
+            print(f"  - {dep}")
+        print("\nTo install missing dependencies, run:")
+        print("  pip install bagit")
+        print("  or")
+        print("  pip install -r requirements.txt")
+        return False
+
+    return True
+
+def load_config(config_file='config.ini'):
+    """Load configuration from config file"""
+    config = configparser.ConfigParser()
+
+    if os.path.exists(config_file):
+        config.read(config_file)
+        return config
+    else:
+        # Return default configuration if file doesn't exist
+        config['PATHS'] = {
+            'source_path': '',
+            'destination_path': ''
+        }
+        config['BAGIT_OPTIONS'] = {
+            'checksums': 'sha256,sha512',
+            'processes': '4',
+            'batch_size': '1'
+        }
+        config['METADATA'] = {
+            'source_organization': 'Auto BagIt Transfer Tool',
+            'contact_name': 'Automated Process',
+            'bag_software_agent': 'auto_bagit_transfer.py'
+        }
+        return config
+
+def should_exclude_file(file_path):
+    """Check if a file should be excluded from bagging (cross-platform hidden/system files)"""
+    filename = file_path.name.lower()  # compare case-insensitively below
+
+    # macOS hidden files
+    if file_path.name.startswith('._'):
+        return True
+    if filename == '.ds_store':
+        return True
+
+    # Windows system files
+    if filename == 'thumbs.db':
+        return True
+    if filename == 'desktop.ini':
+        return True
+    if filename == 'folder.jpg':
+        return True
+    if filename == 'albumartsmall.jpg':
+        return True
+
+    # General hidden files (starting with dot)
+    if file_path.name.startswith('.') and len(file_path.name) > 1:
+        return True
+
+    return False
+
+def setup_logging():
+    """Set up logging configuration"""
+    log_filename = f"auto_bagit_transfer_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
+
+    logging.basicConfig(
+        level=logging.INFO,
+        format='%(asctime)s - %(levelname)s - %(message)s',
+        handlers=[
+            logging.FileHandler(log_filename),
+            logging.StreamHandler(sys.stdout)
+        ]
+    )
+
+    return logging.getLogger(__name__)
+
+def validate_paths(source_path, destination_path):
+    """Validate source and destination paths"""
+    source = Path(source_path)
+    destination = Path(destination_path)
+
+    if not source.exists():
+        raise ValueError(f"Source path does not exist: {source_path}")
+
+    if not source.is_dir():
+        raise ValueError(f"Source path is not a directory: {source_path}")
+
+    # Create destination directory if it doesn't exist
+    destination.mkdir(parents=True, exist_ok=True)
+
+    return source, destination
+
+def get_folders_to_transfer(source_path, include_folders=None, exclude_folders=None, skip_empty=True):
+    """Get list of folders to transfer from source directory"""
+    folders = []
+    existing_bags = []
+
+    for item in source_path.iterdir():
+        if item.is_dir() and not item.name.startswith('.'):
+            # Check include filter
+            if include_folders and item.name not in include_folders:
+                continue
+
+            # Check exclude filter
+            if exclude_folders and item.name in exclude_folders:
+                continue
+
+            # Check if folder is empty (skip if requested)
+            if skip_empty and is_folder_empty(item):
+                continue
+
+            # Check if it's already a bag
+            if is_existing_bag(item) or has_bag_structure(item):
+                existing_bags.append(item)
+            else:
+                folders.append(item)
+
+    return folders, existing_bags
+
+def is_folder_empty(folder_path):
+    """Check if a folder is empty (no files, only empty subdirectories allowed)"""
+    try:
+        for item in folder_path.rglob('*'):
+            if item.is_file():
+                return False
+        return True
+    except (PermissionError, OSError):
+        # If we can't read the folder, assume it's not empty to be safe
+        return False
+
+def is_existing_bag(folder_path):
+    """Check if a folder is already a valid BagIt bag"""
+    try:
+        bag = bagit.Bag(str(folder_path))
+        return bag.is_valid()
+    except Exception:
+        return False
+
+def has_bag_structure(folder_path):
+    """Check if folder has BagIt structure (even if not valid)"""
+    required_files = ['bagit.txt', 'bag-info.txt']
+    has_data_dir = (folder_path / 'data').is_dir()
+    has_required_files = all((folder_path / f).exists() for f in required_files)
+    return has_data_dir and has_required_files
+
+def copy_folder_excluding_hidden(src, dst, logger):
+    """Copy folder contents excluding hidden/system files from both Windows and macOS"""
+    dst.mkdir(parents=True, exist_ok=True)
+
+    for item in src.rglob('*'):
+        if item.is_file() and not should_exclude_file(item):
+            # Calculate relative path from source
+            rel_path = item.relative_to(src)
+            dest_file = dst / rel_path
+
+            # Create parent directories if needed
+            dest_file.parent.mkdir(parents=True, exist_ok=True)
+
+            # Copy the file
+            shutil.copy2(item, dest_file)
+            logger.debug(f"Copied: {rel_path}")
+        elif item.is_file():
+            logger.debug(f"Excluded: {item.relative_to(src)} (hidden/system file)")
+
+def create_bag_from_folder(folder_path, temp_dir, config, logger):
+    """Create a bag from a folder"""
+    try:
+        logger.info(f"Creating bag for folder: {folder_path.name}")
+
+        # Create a temporary copy of the folder for bagging, excluding hidden files
+        temp_folder = temp_dir / folder_path.name
+        copy_folder_excluding_hidden(folder_path, temp_folder, logger)
+
+        # Get configuration values (fallbacks cover config files that omit a key)
+        checksums = [alg.strip() for alg in config.get('BAGIT_OPTIONS', 'checksums', fallback='sha256,sha512').split(',')]
+        processes = config.getint('BAGIT_OPTIONS', 'processes', fallback=4)
+
+        # Create bag metadata
+        bag_info = {
+            'Source-Organization': config.get('METADATA', 'source_organization', fallback='Auto BagIt Transfer Tool'),
+            'Contact-Name': config.get('METADATA', 'contact_name', fallback='Automated Process'),
+            'External-Description': f'Bag created from folder: {folder_path.name}',
+            'Bagging-Date': datetime.now().strftime('%Y-%m-%d'),
+            'Bag-Software-Agent': config.get('METADATA', 'bag_software_agent', fallback='auto_bagit_transfer.py')
+        }
+
+        # Create the bag
+        bagit.make_bag(
+            str(temp_folder),
+            bag_info=bag_info,
+            checksums=checksums,
+            processes=processes
+        )
+
+        logger.info(f"Successfully created bag for: {folder_path.name}")
+        return temp_folder
+
+    except Exception as e:
+        logger.error(f"Failed to create bag for {folder_path.name}: {str(e)}")
+        return None
+
+def rebag_existing_bag(bag_path, temp_dir, config, logger):
+    """Re-bag an existing bag by extracting data and creating fresh bag"""
+    try:
+        logger.info(f"Re-bagging existing bag: {bag_path.name}")
+
+        # Validate the existing bag first
+        try:
+            existing_bag = bagit.Bag(str(bag_path))
+            try:
+                existing_bag.validate()
+                logger.info(f"Existing bag {bag_path.name} is valid")
+            except bagit.BagValidationError as e:
+                logger.warning(f"Existing bag {bag_path.name} validation failed: {str(e)} - proceeding to re-bag anyway...")
+
+        except Exception as e:
+            logger.warning(f"Could not validate existing bag {bag_path.name}: {e}")
+
+        # Create temp folder for the new bag
+        temp_folder = temp_dir / bag_path.name
+        temp_folder.mkdir()
+
+        # Copy only the data directory contents (not the bag structure), excluding hidden files
+        data_dir = bag_path / 'data'
+        if data_dir.exists():
+            copy_folder_excluding_hidden(data_dir, temp_folder, logger)
+        else:
+            logger.warning(f"No data directory found in bag {bag_path.name}")
+            return None
+
+        # Get configuration values (fallbacks cover config files that omit a key)
+        checksums = [alg.strip() for alg in config.get('BAGIT_OPTIONS', 'checksums', fallback='sha256,sha512').split(',')]
+        processes = config.getint('BAGIT_OPTIONS', 'processes', fallback=4)
+
+        # Create new bag metadata (preserve some original info if available)
+        bag_info = {
+            'Source-Organization': config.get('METADATA', 'source_organization', fallback='Auto BagIt Transfer Tool'),
+            'Contact-Name': config.get('METADATA', 'contact_name', fallback='Automated Process'),
+            'External-Description': f'Re-bagged from existing bag: {bag_path.name}',
+            'Bagging-Date': datetime.now().strftime('%Y-%m-%d'),
+            'Bag-Software-Agent': config.get('METADATA', 'bag_software_agent', fallback='auto_bagit_transfer.py'),
+            'Original-Bag-Name': bag_path.name
+        }
+
+        # Try to preserve some original metadata
+        try:
+            original_bag_info_file = bag_path / 'bag-info.txt'
+            if original_bag_info_file.exists():
+                with open(original_bag_info_file, 'r', encoding='utf-8') as f:
+                    for line in f:
+                        if ':' in line:
+                            key, value = line.strip().split(':', 1)
+                            key = key.strip()
+                            value = value.strip()
+                            if key in ['External-Description', 'Source-Organization']:
+                                bag_info[f'Original-{key}'] = value
+        except Exception as e:
+            logger.debug(f"Could not read original bag-info.txt: {e}")
+
+        # Create the new bag
+        bagit.make_bag(
+            str(temp_folder),
+            bag_info=bag_info,
+            checksums=checksums,
+            processes=processes
+        )
+
+        logger.info(f"Successfully re-bagged: {bag_path.name}")
+        return temp_folder
+
+    except Exception as e:
+        logger.error(f"Failed to re-bag {bag_path.name}: {str(e)}")
+        return None
+
+def process_batch(folders_batch, temp_dir, destination_path, config, logger, batch_num, total_batches):
+    """Process a batch of folders"""
+    logger.info(f"\n=== Processing Batch {batch_num}/{total_batches} ({len(folders_batch)} folders) ===")
+
+    batch_successful = 0
+    batch_failed = 0
+
+    for folder in folders_batch:
+        logger.info(f"\n--- Processing folder: {folder.name} ---")
+
+        # Create bag
+        bag_path = create_bag_from_folder(folder, temp_dir, config, logger)
+
+        if bag_path:
+            # Transfer bag
+            if transfer_bag(bag_path, destination_path, logger):
+                batch_successful += 1
+            else:
+                batch_failed += 1
+        else:
+            batch_failed += 1
+
+        logger.info(f"--- Completed processing: {folder.name} ---")
+
+    logger.info(f"=== Batch {batch_num} Summary: {batch_successful} successful, {batch_failed} failed ===\n")
+    return batch_successful, batch_failed
+
+def process_existing_bags_batch(bags_batch, temp_dir, destination_path, config, logger, batch_num, total_batches):
+    """Process a batch of existing bags"""
+    logger.info(f"\n=== Re-bagging Batch {batch_num}/{total_batches} ({len(bags_batch)} existing bags) ===")
+
+    batch_successful = 0
+    batch_failed = 0
+
+    for bag_folder in bags_batch:
+        logger.info(f"\n--- Re-bagging existing bag: {bag_folder.name} ---")
+
+        # Re-bag the existing bag
+        bag_path = rebag_existing_bag(bag_folder, temp_dir, config, logger)
+
+        if bag_path:
+            # Transfer bag
+            if transfer_bag(bag_path, destination_path, logger):
+                batch_successful += 1
+            else:
+                batch_failed += 1
+        else:
+            batch_failed += 1
+
+        logger.info(f"--- Completed re-bagging: {bag_folder.name} ---")
+
+    logger.info(f"=== Re-bagging Batch {batch_num} Summary: {batch_successful} successful, {batch_failed} failed ===\n")
+    return batch_successful, batch_failed
+
+def transfer_bag(bag_path, destination_path, logger):
+    """Transfer the bagged folder to destination"""
+    try:
+        dest_bag_path = destination_path / bag_path.name
+
+        # If destination already exists, create a unique name
+        counter = 1
+        original_dest = dest_bag_path
+        while dest_bag_path.exists():
+            dest_bag_path = destination_path / f"{original_dest.name}_{counter}"
+            counter += 1
+
+        logger.info(f"Transferring bag to: {dest_bag_path}")
+        shutil.copytree(bag_path, dest_bag_path)
+
+        # Validate the transferred bag with detailed error reporting
+        try:
+            transferred_bag = bagit.Bag(str(dest_bag_path))
+            transferred_bag.validate()  # This will raise if invalid
+            logger.info(f"Successfully transferred and validated bag: {dest_bag_path.name}")
+            return True
+        except bagit.BagValidationError as e:
+            logger.error(f"Transferred bag failed validation: {dest_bag_path.name} - Details: {str(e)}")
+
+            # Additional debugging for macOS hidden file issues
+            data_dir = dest_bag_path / 'data'
+            if data_dir.exists():
+                logger.debug(f"Files found in data directory of {dest_bag_path.name}:")
+                try:
+                    for item in data_dir.rglob('*'):
+                        if item.is_file():
+                            logger.debug(f"  - {item.relative_to(data_dir)}")
+                except Exception as debug_e:
+                    logger.debug(f"Could not list data directory contents: {debug_e}")
+
+            return False
+        except Exception as e:
+            logger.error(f"Error validating transferred bag {dest_bag_path.name}: {str(e)}")
+            return False
+
+    except Exception as e:
+        logger.error(f"Failed to transfer bag {bag_path.name}: {str(e)}")
+        return False
+
+def main():
+    """Main function to orchestrate the transfer process"""
+    parser = argparse.ArgumentParser(description='Automated BagIt Transfer Tool')
+    parser.add_argument('--source',
+                        help='Source directory path (overrides config file)')
+    parser.add_argument('--destination',
+                        help='Destination directory path (overrides config file)')
+    parser.add_argument('--config', default='config.ini',
+                        help='Configuration file path (default: config.ini)')
+    parser.add_argument('--dry-run', action='store_true',
+                        help='Show what would be transferred without actually doing it')
+    parser.add_argument('--include-folders', nargs='+',
+                        help='Only transfer these specific folders (space-separated list)')
+    parser.add_argument('--exclude-folders', nargs='+',
+                        help='Exclude these folders from transfer (space-separated list)')
+    parser.add_argument('--include-empty', action='store_true',
+                        help='Include empty folders in transfer (default: skip empty folders)')
+    parser.add_argument('--batch-size', type=int, default=None,
+                        help='Number of folders to process in each batch (default: batch_size from config file, or 1)')
+
+    args = parser.parse_args()
+
+    # Load configuration
+    config = load_config(args.config)
+
+    # Use command line arguments if provided, otherwise use config file
+    source_path_str = args.source or config.get('PATHS', 'source_path', fallback='')
+    destination_path_str = args.destination or config.get('PATHS', 'destination_path', fallback='')
+
+    if not source_path_str or not destination_path_str:
+        print("Error: Source and destination paths must be provided either via command line or config file")
+        sys.exit(1)
+
+    # Check dependencies first
+    if not check_dependencies():
+        sys.exit(1)
+
+    # Set up logging
+    logger = setup_logging()
+    logger.info("Starting automated BagIt transfer process")
+    logger.info(f"Source: {source_path_str}")
+    logger.info(f"Destination: {destination_path_str}")
+    logger.info(f"Batch size: {args.batch_size or config.getint('BAGIT_OPTIONS', 'batch_size', fallback=1)} (small batches save temp space)")
+
+    try:
+        # Validate paths
+        source_path, destination_path = validate_paths(source_path_str, destination_path_str)
+
+        # Get folders to transfer
+        skip_empty = not args.include_empty
+        folders_to_transfer, existing_bags = get_folders_to_transfer(source_path, args.include_folders, args.exclude_folders, skip_empty)
+
+        # Log empty folders that were skipped
+        if skip_empty:
+            empty_folders = []
+            for item in source_path.iterdir():
+                if item.is_dir() and not item.name.startswith('.') and is_folder_empty(item):
+                    empty_folders.append(item.name)
+
+            if empty_folders:
+                logger.info(f"Skipped {len(empty_folders)} empty folders:")
+                for folder_name in empty_folders:
+                    logger.info(f"  - {folder_name} (empty)")
+
+        # Report what was found
+        total_items = len(folders_to_transfer) + len(existing_bags)
+        if total_items == 0:
+            logger.info("No folders found to transfer")
+            return
+
+        if folders_to_transfer:
+            logger.info(f"Found {len(folders_to_transfer)} regular folders to bag:")
+            for folder in folders_to_transfer:
+                logger.info(f"  - {folder.name}")
+
+        if existing_bags:
+            logger.info(f"Found {len(existing_bags)} existing bags to re-bag:")
+            for bag in existing_bags:
+                logger.info(f"  - {bag.name} (existing bag)")
+
+        if args.dry_run:
+            logger.info("Dry run mode - no actual transfers will be performed")
+            return
+
+        successful_transfers = 0
+        failed_transfers = 0
+
+        # Process regular folders in batches
+        if folders_to_transfer:
+            batch_size = max(1, args.batch_size or config.getint('BAGIT_OPTIONS', 'batch_size', fallback=1))
+            total_batches = (len(folders_to_transfer) + batch_size - 1) // batch_size
+            logger.info(f"\nProcessing {len(folders_to_transfer)} regular folders in {total_batches} batches of {batch_size}")
+
+            for i in range(0, len(folders_to_transfer), batch_size):
+                batch = folders_to_transfer[i:i + batch_size]
+                batch_num = (i // batch_size) + 1
+
+                # Create temporary directory for this batch
+                temp_dir = Path(tempfile.mkdtemp(prefix=f"bagit_batch_{batch_num}_"))
+                logger.info(f"Using temporary directory for batch {batch_num}: {temp_dir}")
+
+                try:
+                    batch_successful, batch_failed = process_batch(
+                        batch, temp_dir, destination_path, config, logger, batch_num, total_batches
+                    )
+                    successful_transfers += batch_successful
+                    failed_transfers += batch_failed
+
+                finally:
+                    # Clean up temporary directory after each batch
+                    logger.info(f"Cleaning up temporary directory for batch {batch_num}: {temp_dir}")
+                    shutil.rmtree(temp_dir, ignore_errors=True)
+
+        # Process existing bags in batches
+        if existing_bags:
+            batch_size = max(1, args.batch_size or config.getint('BAGIT_OPTIONS', 'batch_size', fallback=1))
+            total_batches = (len(existing_bags) + batch_size - 1) // batch_size
+            logger.info(f"\nRe-bagging {len(existing_bags)} existing bags in {total_batches} batches of {batch_size}")
+
+            for i in range(0, len(existing_bags), batch_size):
+                batch = existing_bags[i:i + batch_size]
+                batch_num = (i // batch_size) + 1
+
+                # Create temporary directory for this batch
+                temp_dir = Path(tempfile.mkdtemp(prefix=f"bagit_rebag_batch_{batch_num}_"))
+                logger.info(f"Using temporary directory for re-bagging batch {batch_num}: {temp_dir}")
+
+                try:
+                    batch_successful, batch_failed = process_existing_bags_batch(
+                        batch, temp_dir, destination_path, config, logger, batch_num, total_batches
+                    )
+                    successful_transfers += batch_successful
+                    failed_transfers += batch_failed
+
+                finally:
+                    # Clean up temporary directory after each batch
+                    logger.info(f"Cleaning up temporary directory for re-bagging batch {batch_num}: {temp_dir}")
+                    shutil.rmtree(temp_dir, ignore_errors=True)
+
+        # Summary
+        total_processed = len(folders_to_transfer) + len(existing_bags)
+        logger.info("=== Transfer Summary ===")
+        logger.info(f"Regular folders processed: {len(folders_to_transfer)}")
+        logger.info(f"Existing bags re-bagged: {len(existing_bags)}")
+        logger.info(f"Total items processed: {total_processed}")
+        logger.info(f"Successful transfers: {successful_transfers}")
+        logger.info(f"Failed transfers: {failed_transfers}")
+        if total_processed > 0:
+            logger.info(f"Success rate: {(successful_transfers/total_processed*100):.1f}%")
+
+    except Exception as e:
+        logger.error(f"Fatal error: {str(e)}")
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
diff --git a/auto_bagit_transfer/config.ini b/auto_bagit_transfer/config.ini
new file mode 100644
index 0000000..a6559c6
--- /dev/null
+++ b/auto_bagit_transfer/config.ini
@@ -0,0 +1,24 @@
+[PATHS]
+# Source directory containing folders to transfer, without quotes (example: C:\Users\Dell\Downloads\test_data)
+source_path =
+
+# Destination directory for the bagged folders, without quotes (example: C:\Users\Dell\Downloads\destination)
+destination_path =
+
+[BAGIT_OPTIONS]
+# Checksum algorithms to use (comma-separated)
+# Available options: md5, sha1, sha256, sha512
+checksums = sha256,sha512
+
+# Number of worker processes to use for checksumming (set to 1 for a single process)
+processes = 4
+
+# Number of folders to process in each batch (helps prevent temp space exhaustion)
+# Smaller batches use less temp space but may be slower overall
+batch_size = 1
+
+[METADATA]
+# Default bag metadata (values are read verbatim, so do not wrap them in quotes)
+source_organization = Auto BagIt Transfer Tool
+contact_name = Automated Process
+bag_software_agent = auto_bagit_transfer.py
\ No newline at end of file
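A note on quoting in `config.ini`: Python's `configparser` returns values verbatim, so quotes written around a value become part of the string that ends up in paths and in `bag-info.txt`. The sketch below demonstrates this with an inline sample; `SAMPLE`, `read_metadata`, and `unquote` are hypothetical names used only for illustration and are not part of the script.

```python
import configparser

# Inline sample standing in for a config.ini file (illustrative values)
SAMPLE = """
[METADATA]
source_organization = "My Organization"
contact_name = Your Name
"""


def read_metadata(text: str) -> dict:
    """Parse INI text and return the METADATA section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["METADATA"])


def unquote(value: str) -> str:
    """Strip one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return value[1:-1]
    return value


if __name__ == "__main__":
    for key, value in read_metadata(SAMPLE).items():
        print(f"{key}: raw={value!r} unquoted={unquote(value)!r}")
```

Leaving values unquoted in `config.ini` avoids the issue without any extra handling.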