Advanced machine learning system for detecting whether code was written by artificial intelligence or humans. Features intelligent ensemble models, GitHub repository analysis, and comprehensive explainability with smart contradiction detection.
- Single Code Analysis: Analyze individual code snippets with detailed breakdown
- GitHub Repository Scanning: Complete repository analysis with file-by-file insights
- Batch File Processing: Upload and analyze multiple files simultaneously
- Ensemble ML Models: 4 classical models (LogisticRegression, RandomForest, GradientBoosting, XGBoost)
- Smart Voting System: Advanced consensus mechanism with confidence weighting
- Contradiction Detection: Automatically corrects predictions when line-level analysis conflicts with file-level results
- Multi-Language Support: Python, Java, and JavaScript code detection
- Line-by-Line Breakdown: Detailed analysis of individual code lines with pattern detection
- Confidence Scoring: Precision confidence metrics for all predictions
- Model Agreement Tracking: Shows which models agree/disagree and why
- Pattern Recognition: Detects coding patterns like functions, loops, imports, etc.
- Consistency Validation: Cross-validates file-level vs line-level predictions
- Repository Scanning: Analyzes entire GitHub repositories automatically
- Progress Tracking: Real-time analysis progress with status updates
- Comprehensive Reports: Downloadable analysis reports with detailed insights
- API Integration: Direct GitHub API integration for seamless repository access
Code_Detector/
├── 📱 Web Application
│ └── app.py # Main Streamlit application with 3 analysis modes
├── 🤖 Machine Learning Pipeline
│ ├── ml_train.py # Classical ML model training (4 algorithms)
│ └── dl_train.py # Deep learning model training (Transformers)
├── 📊 Data & Models
│ ├── Dataset/ # Training data organized by language
│ │ ├── Python/ # Python samples (AI vs HUMAN)
│ │ ├── Java/ # Java samples (AI vs HUMAN)
│ │ └── JS/ # JavaScript samples (AI vs HUMAN)
│ ├── model/ # Trained classical ML models
│ │ ├── logisticregression.pkl
│ │ ├── randomforest.pkl
│ │ ├── gradientboosting.pkl
│ │ ├── xgboost.pkl
│ │ ├── vectorizer.pkl
│ │ └── labelencoder.pkl
│ └── output/ # Trained transformer models
│ ├── CodeBERT/ # Microsoft CodeBERT model
│ ├── CodeT5/ # Salesforce CodeT5 model
│ └── GraphCodeBERT/ # Microsoft GraphCodeBERT model
├── 📋 Documentation
│ ├── README.md # This file
│ └── requirements.txt # Python dependencies
└── 🗂️ Cache & Temp Files
└── __pycache__/ # Python bytecode cache
- Python 3.8 or higher
- 4GB+ RAM (8GB+ recommended for transformer models)
- Internet connection (for GitHub repository analysis)
# Clone the repository
git clone https://github.com/muhammadnavas/Code_Detector.git
cd Code_Detector
# Install required dependencies
pip install -r requirements.txt

# Start the Streamlit web interface
streamlit run app.py

🌐 Access the app at: http://localhost:8501
If you want to retrain models with custom data:
# Train classical ML models (faster, CPU-friendly)
python ml_train.py
# Train transformer models (requires GPU for optimal performance)
python dl_train.py

Perfect for analyzing individual code snippets:
- Input Methods:
  - Paste code directly into the text area
  - Upload single Python/Java/JavaScript files
- Analysis Output:
  - 🎯 Overall Prediction: AI vs Human with confidence score
  - 🔧 Model Breakdown: Individual model predictions and confidence
  - 📋 Line-by-Line Analysis: Detailed analysis of each code line
  - 🏷️ Pattern Detection: Identified coding patterns and structures
Comprehensive analysis of entire GitHub repositories:
- Repository Input:
  https://github.com/username/repository
- Analysis Process:
  - 🔍 Auto-Discovery: Finds all Python files in the repository
  - ⚡ Progress Tracking: Real-time analysis with progress indicators
  - 📊 Summary Statistics: Repository-wide AI vs Human breakdown
  - 📁 File-by-File Results: Detailed analysis for each file
- Advanced Features:
  - 🎯 Smart Corrections: Automatically corrects contradictory predictions
  - ⚠️ Warning System: Flags suspicious patterns or inconsistencies
  - 📄 Report Generation: Download comprehensive analysis reports
Upload and analyze multiple files simultaneously:
- Multi-File Upload: Support for .py, .java, and .js files
- Batch Processing: Analyze all files with progress tracking
- Consolidated Results: Summary statistics across all uploaded files
Our intelligent ensemble combines multiple approaches for maximum accuracy:
- 🔗 Logistic Regression
  - Linear classification with TF-IDF features
  - Fast prediction, good baseline performance
  - Confidence: Probability scores from the sigmoid function
- 🌲 Random Forest
  - Ensemble of decision trees with voting
  - Handles feature interactions well
  - Confidence: Vote proportion across trees
- 📈 Gradient Boosting
  - Sequential ensemble with error correction
  - Strong performance on structured data
  - Confidence: Probability estimates from the boosted ensemble
- ⚡ XGBoost
  - Optimized gradient boosting framework
  - State-of-the-art classical ML performance
  - Confidence: Native probability estimation
- Majority Voting: 3+ models must agree for high confidence
- Confidence Weighting: Uses model-specific confidence scores
- Contradiction Detection: Compares file-level vs line-level predictions
- Smart Corrections: Automatically adjusts predictions when inconsistencies detected
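A minimal sketch of confidence-weighted majority voting under these rules; the `ModelResult` structure, the tie-breaking rule, and the averaging step are illustrative assumptions rather than the exact logic in app.py:

```python
# Illustrative confidence-weighted majority voting (assumed, simplified).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ModelResult:
    name: str
    prediction: str    # "AI" or "HUMAN"
    confidence: float  # probability of the predicted class

def ensemble_vote(results: List[ModelResult]) -> Tuple[str, float]:
    """Majority label; fall back to total confidence on a 2/2 split."""
    counts = {"AI": 0, "HUMAN": 0}
    weights = {"AI": 0.0, "HUMAN": 0.0}
    for r in results:
        counts[r.prediction] += 1
        weights[r.prediction] += r.confidence
    if counts["AI"] != counts["HUMAN"]:
        label = "AI" if counts["AI"] > counts["HUMAN"] else "HUMAN"
    else:  # split decision: side with the more confident pair
        label = "AI" if weights["AI"] >= weights["HUMAN"] else "HUMAN"
    agreeing = [r.confidence for r in results if r.prediction == label]
    return label, sum(agreeing) / len(agreeing)

# e.g. ensemble_vote([ModelResult("xgb", "AI", 0.91), ModelResult("rf", "AI", 0.78),
#                     ModelResult("gb", "AI", 0.83), ModelResult("logistic", "HUMAN", 0.55)])
```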
- Smart Filtering: Skips comments, imports, and trivial lines
- Pattern Detection: Identifies functions, loops, conditionals, etc.
- Confidence Thresholding: Only includes high-confidence line predictions (>60%)
- Context Preservation: Maintains code structure understanding
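A rough sketch of this filtering, assuming a regex-based notion of "trivial" lines and a generic `classify` callback; the helper names are illustrative, and the 0.6 cut-off mirrors the confidence-thresholding bullet above:

```python
# Sketch of line-level filtering; the regex and helper names are assumptions.
import re

TRIVIAL_LINE = re.compile(r"^\s*(#.*|import\s+\w+.*|from\s+\w+.*|[\)\]\}:,]*)\s*$")

def analyzable_lines(code: str):
    """Yield (line_number, line) pairs worth classifying, skipping trivial lines."""
    for i, line in enumerate(code.splitlines(), start=1):
        if TRIVIAL_LINE.match(line):
            continue  # skip comments, imports, blank and bracket-only lines
        yield i, line

def confident_line_predictions(lines, classify, threshold=0.6):
    """Keep only line predictions above the confidence threshold."""
    kept = []
    for i, line in lines:
        label, confidence = classify(line)  # e.g. run the ensemble on a single line
        if confidence > threshold:
            kept.append((i, label, confidence))
    return kept
```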
Our system automatically detects and corrects contradictory predictions:
# Example: File predicted as AI, but 73% of lines are Human
Original Prediction: AI (confidence: 0.86)
Line Analysis: 73% Human lines
Smart Correction: → HUMAN (adjusted confidence: 0.72)
Status: [PREDICTION CORRECTED: AI → HUMAN]
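A minimal sketch of the correction logic implied by this example; the 50% majority cut-off and the confidence-blending formula are assumptions for illustration, not the exact rule used in app.py:

```python
# Sketch of reconciling file-level and line-level results (assumed thresholds).
from typing import List, Tuple

def reconcile(file_label: str, file_conf: float,
              line_labels: List[str]) -> Tuple[str, float, str]:
    """Flip the file-level prediction when most analysed lines disagree with it."""
    if not line_labels:
        return file_label, file_conf, "NO LINE DATA"
    human_ratio = line_labels.count("HUMAN") / len(line_labels)
    line_label = "HUMAN" if human_ratio >= 0.5 else "AI"
    if line_label == file_label:
        return file_label, file_conf, "CONSISTENT"
    # Contradiction: side with the line-level majority, at reduced confidence
    majority = max(human_ratio, 1 - human_ratio)
    adjusted = (file_conf + majority) / 2
    return line_label, adjusted, f"PREDICTION CORRECTED: {file_label} → {line_label}"

# reconcile("AI", 0.86, ["HUMAN"] * 73 + ["AI"] * 27)
# flips the prediction to HUMAN with a blended confidence of about 0.8
```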
Detects various coding patterns (see the sketch after this list):
- Structural: Functions, classes, imports
- Control Flow: Loops, conditionals, exception handling
- Modern Python: F-strings, list comprehensions, lambda functions
- Style Indicators: Docstrings, comments, naming conventions
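An illustrative, regex-based take on this kind of pattern detection for Python lines; the pattern set below is an assumption modelled on the categories above:

```python
# Illustrative pattern detection; the regex set is an assumption.
import re

PATTERNS = {
    "function definition": re.compile(r"^\s*def\s+\w+\s*\("),
    "class definition":    re.compile(r"^\s*class\s+\w+"),
    "import":              re.compile(r"^\s*(import|from)\s+\w+"),
    "loop":                re.compile(r"^\s*(for|while)\b"),
    "conditional":         re.compile(r"^\s*(if|elif|else)\b"),
    "exception handling":  re.compile(r"^\s*(try|except|finally|raise)\b"),
    "f-string":            re.compile(r"f['\"]"),
    "list comprehension":  re.compile(r"\[[^\]]+\bfor\b[^\]]+\]"),
    "lambda":              re.compile(r"\blambda\b"),
    "docstring/comment":   re.compile(r"^\s*(#|\"\"\"|''')"),
}

def detect_patterns(line: str):
    """Return the names of all patterns found on a single code line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]

# detect_patterns('result = [f"{x}" for x in items if x]')
# → ['f-string', 'list comprehension']
```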
- 🔵 High Confidence (>0.8): Very reliable prediction
- 🟡 Medium Confidence (0.6-0.8): Generally reliable with some uncertainty
- 🔴 Low Confidence (<0.6): Results may be unreliable, manual review recommended
- ✅ Unanimous: All models agree (highest confidence)
- 📊 Majority: 3/4 models agree (good confidence)
- ⚠️ Split Decision: 2/2 split (requires careful interpretation)
- ✅ Consistent: File and line predictions align
- 📊 Mixed Signals: Some disagreement between levels
- 🔄 Auto-Corrected: System detected and fixed contradiction
- ❌ Major Contradiction: Significant disagreement requiring manual review
# Core Framework
streamlit>=1.28.0 # Web application framework
# Machine Learning
scikit-learn>=1.3.0 # Classical ML algorithms
xgboost>=1.7.0 # Gradient boosting framework
numpy>=1.24.0 # Numerical computing
pandas>=2.0.0 # Data manipulation
# Deep Learning (Optional)
torch>=2.0.0 # PyTorch framework
transformers>=4.30.0 # Hugging Face transformers
# Web & API
requests>=2.31.0 # HTTP requests for GitHub API
joblib>=1.3.0 # Model serialization
# Utilities
pathlib # Path handling (built-in)
re # Regular expressions (built-in)
typing # Type hints (built-in)

- Code Cleaning: Remove excess whitespace, normalize line endings
- TF-IDF Vectorization: Character-level n-grams (3-5) for classical models
- Feature Extraction: Syntactic patterns, complexity metrics
- Tokenization: Language-specific tokenization for transformers
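A sketch of the character-level TF-IDF setup described above; the exact vectorizer parameters in ml_train.py may differ:

```python
# Character-level TF-IDF sketch; max_features and lowercase are assumed values.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    analyzer="char",      # character n-grams capture style, not just tokens
    ngram_range=(3, 5),   # 3- to 5-character sequences
    max_features=10000,   # cap the vocabulary size (assumed)
    lowercase=False,      # preserve case, which carries naming-style signal
)

samples = ["def add(a, b):\n    return a + b", "x=lambda a,b:a+b"]
X = vectorizer.fit_transform(samples)  # sparse matrix: samples × n-gram features
```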
- Syntactic Patterns: Language constructs (functions, classes, loops)
- Stylistic Features: Naming conventions, spacing patterns
- Complexity Metrics: Code depth, nesting levels, line lengths
- AI Indicators: Patterns typical in AI-generated code
# Smart model loading with caching
@st.cache_resource
def load_models():
    models = {
        'logistic': joblib.load('model/logisticregression.pkl'),
        'rf': joblib.load('model/randomforest.pkl'),
        'gb': joblib.load('model/gradientboosting.pkl'),
        'xgb': joblib.load('model/xgboost.pkl')
    }
    vectorizer = joblib.load('model/vectorizer.pkl')
    return models, vectorizer

- Rate Limiting: Respects GitHub API limits
- Error Handling: Robust error handling for network issues
- Recursive Scanning: Deep repository traversal for Python files
- Content Processing: Handles various file encodings
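A simplified sketch of recursive Python-file discovery through the public GitHub contents API; rate-limit backoff and detailed error handling are omitted, and the function name is illustrative:

```python
# Recursive repository scan via the GitHub contents API (simplified sketch).
from typing import List, Optional
import requests

def find_python_files(owner: str, repo: str, path: str = "",
                      token: Optional[str] = None) -> List[str]:
    """Recursively collect download URLs for .py files in a repository."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    headers = {"Authorization": f"token {token}"} if token else {}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # surfaces rate-limit (403) and not-found (404) errors
    files = []
    for entry in response.json():
        if entry["type"] == "dir":
            files.extend(find_python_files(owner, repo, entry["path"], token))
        elif entry["name"].endswith(".py"):
            files.append(entry["download_url"])
    return files

# e.g. find_python_files("muhammadnavas", "Code_Detector")
```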
- Streamlit Caching: Models loaded once and cached
- Batch Processing: Efficient handling of multiple files
- Memory Management: Optimized for large repositories
- Progress Tracking: Real-time user feedback
# Example: Analyzing code with the system
from app import CodeAnalyzer
# Initialize analyzer
analyzer = CodeAnalyzer()
# Analyze code snippet
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""
results, prediction, confidence = analyzer.analyze_code(code)
print(f"Prediction: {prediction}")
print(f"Confidence: {confidence:.3f}")
# Get individual model results
for result in results:
print(f"{result.name}: {result.prediction} ({result.confidence:.3f})")# Example: Analyzing multiple files
files = ['file1.py', 'file2.py', 'file3.py']
results = []
for file_path in files:
    with open(file_path, 'r') as f:
        code = f.read()
    file_result = analyzer.analyze_file(file_path, code)
    results.append(file_result)
# Generate summary
summary = SummarizationEngine.summarize_file_analysis(results)
print(f"AI Files: {summary['ai_files']}/{summary['total_files']}")You can customize which models to use:
# In app.py, modify the models dictionary
model_config = {
    'logistic': True,        # Enable/disable Logistic Regression
    'random_forest': True,   # Enable/disable Random Forest
    'gradient_boost': True,  # Enable/disable Gradient Boosting
    'xgboost': True          # Enable/disable XGBoost
}

Modify the Streamlit interface:
# Custom page configuration
st.set_page_config(
    page_title="Custom AI Detector",
    page_icon="🤖",
    layout="wide",
    initial_sidebar_state="expanded"
)
# Custom styling
st.markdown("""
<style>
.main-header { color: #1e88e5; }
.prediction-ai { background-color: #ffebee; }
.prediction-human { background-color: #e8f5e8; }
</style>
""", unsafe_allow_html=True)# Adjust these parameters in app.py
MAX_FILES_ANALYZE = 100 # Limit files to analyze
LINE_CONFIDENCE_THRESHOLD = 0.7 # Higher threshold for line analysis
ENABLE_LINE_ANALYSIS = False     # Disable for faster processing

# Process files in batches
BATCH_SIZE = 10
for i in range(0, len(files), BATCH_SIZE):
    batch = files[i:i+BATCH_SIZE]
    process_batch(batch)

Organize your training data in this structure:
Dataset/
├── Python/
│ ├── AI/ # AI-generated Python code samples
│ │ ├── A1.py, A2.py, ...
│ └── HUMAN/ # Human-written Python code samples
│ ├── H1.py, H2.py, ...
├── Java/
│ ├── AI/ # AI-generated Java code samples
│ └── HUMAN/ # Human-written Java code samples
└── JS/
├── AI/ # AI-generated JavaScript code samples
└── HUMAN/ # Human-written JavaScript code samples
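An illustrative loader for this layout; ml_train.py may organize the loading differently:

```python
# Walks Dataset/<language>/<AI|HUMAN>/ and returns samples with labels (sketch).
from pathlib import Path

EXTENSIONS = {"Python": "*.py", "Java": "*.java", "JS": "*.js"}

def load_dataset(root: str = "Dataset"):
    """Return parallel lists of code samples and their AI/HUMAN labels."""
    codes, labels = [], []
    for language, pattern in EXTENSIONS.items():
        for label in ("AI", "HUMAN"):
            for path in sorted((Path(root) / language / label).glob(pattern)):
                codes.append(path.read_text(encoding="utf-8", errors="ignore"))
                labels.append(label)
    return codes, labels
```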
# Train all classical models with cross-validation
python ml_train.py

Training Process:
- Data Loading: Loads code samples from Dataset/ directories
- Preprocessing: TF-IDF vectorization with character n-grams
- Class Balancing: Handles imbalanced datasets with class weights
- Model Training: Trains 4 different algorithms with hyperparameter tuning
- Validation: Stratified cross-validation for robust evaluation
- Model Saving: Saves trained models to the model/ directory
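A condensed sketch of this flow for a single algorithm; the real ml_train.py adds hyperparameter tuning and trains all four models (`load_dataset` is the loader sketched in the dataset section):

```python
# Condensed classical training flow (sketch; one of the four algorithms shown).
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

codes, labels = load_dataset()               # loader sketched in the dataset section
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
X = vectorizer.fit_transform(codes)          # character n-gram TF-IDF features
encoder = LabelEncoder()
y = encoder.fit_transform(labels)            # AI/HUMAN → 0/1

model = LogisticRegression(max_iter=1000, class_weight="balanced")
scores = cross_val_score(model, X, y, cv=5)  # stratified k-fold for classifiers
print(f"Cross-validated accuracy: {scores.mean():.2f}")

model.fit(X, y)
joblib.dump(model, "model/logisticregression.pkl")
joblib.dump(vectorizer, "model/vectorizer.pkl")
joblib.dump(encoder, "model/labelencoder.pkl")
```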
Expected Output:
Loading dataset...
Found 1000 Python samples (500 AI, 500 Human)
Training Logistic Regression... Accuracy: 0.85
Training Random Forest... Accuracy: 0.88
Training Gradient Boosting... Accuracy: 0.87
Training XGBoost... Accuracy: 0.89
Models saved to model/ directory
# Train transformer models (requires GPU for optimal speed)
python dl_train.py

Supported Models:
- CodeBERT: Microsoft's code understanding model
- CodeT5: Salesforce's code generation model
- GraphCodeBERT: Enhanced with data flow understanding
Training Features:
- Custom Trainer: Weighted loss for class imbalance
- Early Stopping: Prevents overfitting
- Learning Rate Scheduling: Optimizes training convergence
- Evaluation Metrics: F1-macro score for balanced evaluation
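A sketch of this fine-tuning setup using the Hugging Face Trainer (CodeBERT shown); the class weights, hyperparameters, and toy samples are placeholders, and dl_train.py's actual arguments may differ:

```python
# Transformer fine-tuning sketch: weighted loss, early stopping, F1-macro.
import numpy as np
import torch
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

MODEL_NAME = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

class CodeDataset(torch.utils.data.Dataset):
    """Wraps tokenized code samples and 0/1 labels for the Trainer."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

class WeightedTrainer(Trainer):
    """Class-weighted cross-entropy to offset AI/Human imbalance (weights assumed)."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        weights = torch.tensor([1.0, 1.5], device=outputs.logits.device)  # illustrative
        loss = torch.nn.functional.cross_entropy(outputs.logits, labels, weight=weights)
        return (loss, outputs) if return_outputs else loss

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"f1_macro": f1_score(labels, np.argmax(logits, axis=-1), average="macro")}

# Toy samples; the real script loads the Dataset/ directories instead.
train_data = CodeDataset(["def add(a, b):\n    return a + b"], [1])
eval_data = CodeDataset(["x = eval(input())"], [0])

args = TrainingArguments(
    output_dir="output/CodeBERT",
    evaluation_strategy="epoch",   # evaluate (and possibly stop early) every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    learning_rate=2e-5,
    num_train_epochs=5,
)

trainer = WeightedTrainer(
    model=model, args=args,
    train_dataset=train_data, eval_dataset=eval_data,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```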
- Disable Line Analysis: For quick file-level predictions only
- Use Fewer Models: Enable only fast models (Logistic, Random Forest)
- Batch Processing: Analyze multiple files together
- GPU Acceleration: Use CUDA for transformer models
- Enable All Models: Use full ensemble for best results
- Line Analysis: Enable for detailed insights
- Large Training Data: More diverse training samples improve accuracy
- Regular Retraining: Update models with new AI-generated code patterns
- Different Feature Focus: Each model looks at different code aspects
- Training Data Variance: Models trained on slightly different samples
- Algorithm Differences: Linear vs tree-based vs ensemble approaches
- Overfitting: Some models may overfit to specific patterns
- High Agreement: All 4 models agree → High confidence
- High Confidence: Individual confidence scores > 0.8
- Line Consistency: File prediction matches line analysis
- Pattern Recognition: Clear AI/Human coding patterns detected
We welcome contributions to improve the AI detection system!
- New Programming Languages
  - Add support for C++, Go, Rust, etc.
  - Language-specific pattern detection
  - Training data collection
- Model Improvements
  - Advanced ensemble techniques
  - New feature engineering approaches
  - Deep learning architecture improvements
- User Interface Enhancements
  - Better visualization components
  - Real-time analysis features
  - API endpoint development
- Dataset Expansion
  - More diverse AI-generated code samples
  - Different AI model outputs (GPT, Claude, etc.)
  - Domain-specific code samples
# 1. Fork and clone the repository
git clone https://github.com/your-username/Code_Detector.git
cd Code_Detector
# 2. Create development environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install development dependencies
pip install -r requirements.txt
pip install pytest black flake8 # Additional dev tools
# 4. Run tests
pytest tests/
# 5. Format code
black .
flake8 .

- Create Issue: Describe the feature/bug
- Fork Repository: Create your own copy
- Create Branch: git checkout -b feature/your-feature
- Make Changes: Implement your improvements
- Add Tests: Ensure functionality works
- Submit PR: Create pull request with description
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Multi-language expansion (C++, Go, Rust)
- Real-time API endpoints for integration
- Advanced visualizations for pattern analysis
- Cloud deployment options
- Mobile app for on-the-go analysis
- Plugin development for popular IDEs
If you find this project useful, please ⭐ star it on GitHub to help others discover it!
Built with ❤️ for the developer community
Empowering developers with intelligent AI detection capabilities