Intelligent Excel Transformation System

An AI-powered system that automatically transforms messy, unstructured Excel files into clean, database-ready formats using LLM-based analysis and dynamic code generation.

Overview

This system uses Large Language Models (LLMs) to:

Analyze any messy Excel format (no hardcoded logic)
Understand business semantics and data structure
Generate custom transformation Python code
Validate output against ground truth samples
Self-optimize through iterative improvement
Learn from successful patterns for future reuse

Key Features

Zero-Shot Excel Processing: Works with unknown Excel formats without manual coding
LLM-Powered Analysis: Intelligently detects patterns, headers, blocks, and structure
Dynamic Code Generation: Generates custom transformation scripts for each file
Self-Validating: Compares output with ground truth (≥99% accuracy required)
Iterative Optimization: Automatically debugs and improves code if validation fails
Pattern Learning: Builds a library of successful transformations for reuse
Complete Audit Trail: Saves analysis reports, generated code, and validation results

System Architecture

Input (Messy Excel + Ground Truth Sample)
    ↓
Phase 1: LLM-Powered Analysis
    ├─ Structure Discovery (layout, headers, blocks, patterns)
    ├─ Semantic Understanding (entities, metrics, dimensions)
    └─ Transformation Planning (compare source vs target)
    ↓
Phase 2: Dynamic Code Generation
    └─ LLM generates custom Python transformation script
    ↓
Phase 3: Execution & Validation
    ├─ Run generated code in sandbox
    └─ Validate against ground truth (schema + values)
    ↓
Phase 4: Iterative Optimization (if accuracy < 99%)
    ├─ Error analysis
    ├─ LLM-powered debugging
    └─ Re-generate improved code
    ↓
Phase 5: Output & Learning
    ├─ Save cleaned data (CSV, Excel, SQLite, JSON)
    ├─ Save artifacts (code, analysis, reports)
    └─ Update pattern library

Directory Structure

intelligent-excel-cleaner/
│
├── main.py                          # CLI orchestrator
├── config.yaml                      # System configuration
├── requirements.txt                 # Dependencies
│
├── core/
│   ├── orchestrator.py              # Main pipeline coordinator
│   └── llm_client.py                # LLM API client (OpenAI/Anthropic)
│
├── analyzer/
│   ├── structure_analyzer.py        # Detect Excel patterns (LLM-assisted)
│   ├── semantic_analyzer.py         # Understand business meaning (LLM)
│   ├── ground_truth_comparator.py   # Compare messy vs ground truth
│   ├── prompts/                     # LLM prompt templates
│   │   ├── structure_analysis.txt
│   │   ├── semantic_analysis.txt
│   │   └── transformation_planning.txt
│   └── analysis_report.py           # Generate analysis JSON
│
├── generator/
│   ├── code_generator.py            # Generate transformation code (LLM)
│   ├── prompt_builder.py            # Build code generation prompts
│   ├── code_templates/              # Reusable code snippets
│   │   ├── base_template.py
│   │   ├── unpivot_template.py
│   │   └── block_extraction_template.py
│   └── generated/                   # Generated transform scripts
│
├── executor/
│   ├── sandbox.py                   # Safe code execution environment
│   ├── runner.py                    # Execute generated transformation
│   └── error_handler.py             # Capture and parse errors
│
├── validator/
│   ├── schema_validator.py          # Compare schemas
│   ├── data_validator.py            # Compare data values
│   ├── metrics.py                   # Calculate accuracy metrics
│   └── mismatch_reporter.py         # Report differences
│
├── optimizer/
│   ├── error_analyzer.py            # Analyze validation failures
│   ├── code_optimizer.py            # LLM-powered code improvement
│   ├── prompts/
│   │   └── debug_and_fix.txt        # Debugging prompt template
│   └── iteration_manager.py         # Manage optimization loops
│
├── output/
│   ├── exporter.py                  # Export cleaned data
│   ├── artifact_saver.py            # Save analysis, code, reports
│   └── metadata_generator.py        # Generate lineage metadata
│
├── learning/
│   ├── pattern_library.py           # Store successful patterns
│   ├── similarity_matcher.py        # Match new files to patterns
│   └── patterns/                    # Saved patterns (JSON)
│
├── utils/
│   ├── logger.py                    # Logging configuration
│   ├── excel_reader.py              # Excel utilities
│   ├── helpers.py                   # Common utilities
│   └── config_loader.py             # Load config.yaml
│
├── tests/
│   ├── test_analyzer.py
│   ├── test_generator.py
│   ├── test_validator.py
│   └── fixtures/
│
├── data/                            # Input files
│   ├── messy/
│   └── ground_truth/
│
└── results/                         # Output directory
    ├── cleaned_data/
    ├── artifacts/
    └── reports/

Installation

Prerequisites

Python 3.8+
API key for LLM provider (Anthropic Claude or OpenAI)

Setup

# Clone repository
cd ExcelSheetAnalysis

# Install dependencies
pip install -r requirements.txt

# Set up API key (choose one)
export ANTHROPIC_API_KEY="your-api-key"
# OR
export OPENAI_API_KEY="your-api-key"

# Configure system (optional - edit config.yaml)
# Set LLM provider, model, thresholds, etc.

Usage

Basic Transformation

python main.py \
    --messy data/messy/messy_sales_report_only.xlsx \
    --ground-truth data/ground_truth/sales_data_ground_truth_only.xlsx \
    --output results/cleaned_data/

With Optimization

python main.py \
    --messy data/messy/messy_sales_report_only.xlsx \
    --ground-truth data/ground_truth/sales_data_ground_truth_only.xlsx \
    --output results/cleaned_data/ \
    --max-iterations 5 \
    --verbose

Reuse Pattern Library

python main.py \
    --messy data/messy/new_file.xlsx \
    --ground-truth data/ground_truth/new_target.xlsx \
    --output results/cleaned_data/ \
    --check-library

Command Line Options

--messy PATH              Path to messy Excel file (required)
--ground-truth PATH       Path to ground truth sample (required)
--output PATH             Output directory for cleaned data
--max-iterations N        Max optimization iterations (default: 5)
--accuracy-threshold N    Required accuracy 0-1 (default: 0.99)
--check-library           Check pattern library for similar transformations
--format FORMAT           Output format: csv, excel, sqlite, json (default: csv,excel)
--verbose                 Enable verbose logging
--save-artifacts          Save analysis and generated code (default: true)

How It Works

1. Structure Analysis

The system uses an LLM to analyze the messy Excel file and identify:

Layout patterns (wide format, blocks, merged cells, etc.)
Header structure and hierarchy
Data row boundaries
Repeating patterns
Noise elements (totals, footers, empty rows)

2. Semantic Understanding

LLM extracts business meaning:

Entities (Region, Product, Customer, etc.)
Metrics (Revenue, Units Sold, etc.)
Dimensions (Time, Geography, Category)
Relationships and grain

3. Transformation Planning

By comparing the messy file with ground truth:

Determines required transformations (pivot, unpivot, merge, etc.)
Maps source fields to target schema
Plans calculation and filtering logic

4. Code Generation

LLM generates a complete Python script with:

Data extraction functions
Transformation logic
Validation checks
Error handling

5. Execution & Validation

Runs generated code in a safe sandbox
Validates output against ground truth
Checks schema match and value accuracy
Generates detailed comparison report

6. Optimization Loop

If validation fails (accuracy < 99%):

Analyzes error patterns
LLM debugs the code
Generates improved version
Re-executes and validates
Repeats up to max iterations

7. Learning

Successful transformations are saved as patterns:

Future similar files can reuse the pattern
Skips analysis and generation phases
Faster processing for known formats

Output

Cleaned Data

results/cleaned_data/
├── cleaned_20241214_1945.csv
├── cleaned_20241214_1945.xlsx
└── cleaned_20241214_1945.db

Artifacts

results/artifacts/
├── analysis_20241214_1945.json          # Structure analysis
├── transform_20241214_1945.py           # Generated code
├── validation_20241214_1945.json        # Validation report
└── execution_20241214_1945.log          # Execution logs

Reports

results/reports/
└── summary_20241214_1945.html           # Human-readable summary

Configuration

Edit config.yaml to customize:

llm:
  provider: "anthropic"              # or "openai"
  model: "claude-sonnet-4"          # or "gpt-4"
  api_key_env: "ANTHROPIC_API_KEY"
  temperature: 0.1

validation:
  accuracy_threshold: 0.99           # 99% match required

optimization:
  max_iterations: 5
  enable_learning: true

output:
  formats: ["csv", "excel", "sqlite"]
  save_artifacts: true

Examples

Example 1: Quarterly Sales Report

Input: Wide-format quarterly blocks with merged headers Output: Normalized long format (3,200 rows) Accuracy: 100% match with ground truth

Example 2: Inventory Report (Future)

Input: Pivot table with nested categories Output: Flat table ready for database Processing: Reuses similar pattern from library

Success Criteria

✅ Works with ANY Excel format (not hardcoded)
✅ Achieves ≥99% accuracy vs ground truth
✅ Minimal human intervention required
✅ Complete audit trail maintained
✅ Learns and improves over time

Testing

# Run unit tests
pytest tests/

# Run integration tests
pytest tests/test_end_to_end.py

# Test with sample data
python main.py --messy tests/fixtures/sample_messy.xlsx \
               --ground-truth tests/fixtures/sample_truth.xlsx

Troubleshooting

Low Validation Accuracy

Check ground truth sample quality (ensure it's representative)
Increase max iterations
Review validation report for patterns
Manually inspect generated code

LLM API Errors

Verify API key is set correctly
Check API rate limits
Reduce sample_rows in config if hitting token limits

Missing Dependencies

pip install -r requirements.txt --upgrade

Requirements

Python 3.8+
pandas >= 2.0.0
openpyxl >= 3.1.0
anthropic >= 0.8.0 (if using Claude)
openai >= 1.0.0 (if using OpenAI)
pyyaml >= 6.0
See requirements.txt for full list

License

MIT

Contributing

Contributions welcome! Please:

Fork the repository
Create a feature branch
Add tests for new functionality
Submit a pull request

Roadmap

Support for multiple ground truth samples
Web UI for monitoring transformations
Support for additional LLM providers
Batch processing mode
Real-time transformation suggestions
Excel template generation from target schema

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
analyzer		analyzer
core		core
data		data
executor		executor
generator		generator
learning		learning
optimizer		optimizer
output		output
utils		utils
validator		validator
.gitignore		.gitignore
README.md		README.md
TEST_RESULTS_AND_LEARNINGS.md		TEST_RESULTS_AND_LEARNINGS.md
VERIFICATION_REPORT.md		VERIFICATION_REPORT.md
comprehensive_test_results.txt		comprehensive_test_results.txt
config.yaml		config.yaml
main.py		main.py
requirements.txt		requirements.txt

EmergenceAI/excel-sheet-analysis

Folders and files

Latest commit

History

Repository files navigation