The goal of this project is to develop a machine learning model that predicts IAB categories for articles based on their text content. We use a supervised learning approach: a training dataset with known text and categories is used to train the model, and a test dataset with unknown categories is used to generate predictions for evaluation.
- **Data Preprocessing**:
- We load large CSV datasets (up to 1.24 GB) in manageable chunks to prevent memory overflow.
- The text data is vectorized using `HashingVectorizer`, which converts the raw text into a numerical format suitable for machine learning.
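A minimal sketch of the chunked loading step, assuming the CSVs are read with `pandas` and that the relevant columns are named `text` and `target` (both the column names and the chunk size are illustrative):

```python
import pandas as pd

parts = []
# Read the large training CSV in 100,000-row chunks so the full file
# (up to ~1.24 GB) never has to be parsed into memory in one pass.
for chunk in pd.read_csv("train.csv", chunksize=100_000):
    # Keep only the columns the model needs before accumulating;
    # in the actual script each chunk could also be vectorized immediately.
    parts.append(chunk[["text", "target"]])

train_df = pd.concat(parts, ignore_index=True)
```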
- **Feature Engineering**:
- `HashingVectorizer`: This vectorizer is a fast and memory-efficient alternative to `TfidfVectorizer`. It hashes text into a fixed number of features (set to 5000 in our case), which allows us to handle large datasets and high-dimensional text data without requiring all terms to be stored in memory.
- Stop Words Removal: We removed common English stop words to improve the model's performance by eliminating frequently occurring but unimportant words.
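A sketch of the vectorizer configuration described above (5000 hashed features, English stop words removed). The `alternate_sign=False` setting is an assumption: it keeps the hashed counts non-negative, which is needed if they are later fed to `MultinomialNB`.

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    n_features=5000,        # fixed-size hashed feature space
    stop_words="english",   # drop common English stop words
    alternate_sign=False,   # keep counts non-negative (assumption, needed for Naive Bayes)
)

X = vectorizer.transform(["Example article text about sports and finance."])
print(X.shape)  # (1, 5000) sparse matrix
```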
- **Model Selection**:
- Support Vector Machine (`LinearSVC`): We initially chose a linear support vector classifier because of its strong performance in text classification tasks. However, SVM training is computationally expensive, particularly for large datasets.
- Naive Bayes (`MultinomialNB`): To improve speed, we also implemented Naive Bayes, which is faster for large datasets and often performs well for text classification.
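A sketch of how the two candidate models might be instantiated; the `build_model` helper is hypothetical, and `X_train` / `y_train` are assumed to come from the preprocessing step:

```python
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

def build_model(name: str):
    # Choose between the two classifiers discussed above.
    if name == "svm":
        return LinearSVC()      # strong text-classification baseline, slower on very large data
    if name == "nb":
        return MultinomialNB()  # much faster to train on non-negative count features
    raise ValueError(f"unknown model: {name}")

model = build_model("svm")
model.fit(X_train, y_train)
```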
- **Parallel Processing**:
- The model training process is parallelized using `joblib` and the `parallel_backend` function, enabling multi-core training to take full advantage of the available hardware (12th Gen Intel Core i5-12300F, 6 cores, 12 logical processors).
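A sketch of wrapping training in a `joblib` backend with one worker per logical processor; this is an assumption about how the script is structured, and only routines that use `joblib` internally will actually fan out across cores:

```python
from joblib import parallel_backend

# Run any joblib-parallelized work inside fit() across the 12 logical processors.
with parallel_backend("loky", n_jobs=12):
    model.fit(X_train, y_train)
```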
- **Model Evaluation**:
- Predictions are made on a test dataset using the trained model.
- The predicted categories are saved in a CSV file along with the corresponding `Index` from the test data for easy evaluation.
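A sketch of the prediction-and-export step, assuming the test set has already been vectorized as `X_test` and its index column is available as `test_df["Index"]`:

```python
import pandas as pd

# Predict categories for the test articles and pair them with their original Index.
predictions = model.predict(X_test)
output = pd.DataFrame({"target": predictions, "Index": test_df["Index"]})
output.to_csv("predicted_test_data.csv", index=False)
```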
- **Text Vectorization**: We used the `HashingVectorizer` from `scikit-learn` to transform text data into a high-dimensional sparse matrix. This matrix represents the frequency of hashed terms in the text and is efficient for handling large datasets. The vectorizer transforms the text into 5000 features (hashed values), which is a reasonable dimensionality for a large text dataset.
- Pros of `HashingVectorizer`:
- Fast and memory-efficient.
- No need to store a vocabulary, making it ideal for large datasets.
- Handles unseen words gracefully by hashing them into the same space.
- **Stop Words Removal**: Stop words (like "the", "is", "in") are common words that do not provide meaningful information in most text classification tasks. By removing these stop words, we reduced noise in the input text and improved model performance.
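To illustrate the "no vocabulary" property mentioned above: the vectorizer never needs a `fit` step, so previously unseen words are simply hashed into the same fixed-width space.

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=5000, stop_words="english")

# No fit() call; hashing is stateless, so brand-new words land in the same 5000 columns.
X = vectorizer.transform([
    "the quick brown fox",                  # "the" is dropped as a stop word
    "a completely unseen neologism xyzzy",  # unseen tokens are still hashed
])
print(X.shape)  # (2, 5000) sparse matrix
```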
- `text-classification-model.py`: The Python script that loads the training and test data, vectorizes the text, trains the model, makes predictions, and saves the results.
- `train.csv`: The large training dataset containing ~697,528 rows of text data and target IAB categories.
- `test.csv`: The test dataset used for making predictions, containing text and a unique index for each entry.
- `predicted_test_data.csv`: The output CSV file containing the predicted categories for each article along with the corresponding index.
- Install the necessary libraries:
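For example (this assumes the script only needs `scikit-learn`, `pandas`, and `joblib`; adjust to the project's actual requirements):

```bash
pip install scikit-learn pandas joblib
```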
- Run the model with the following command:
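Based on the model flag described below, the invocation is likely of this form (any additional flags for data paths are not shown here):

```bash
python text-classification-model.py --model svm
```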
- Replace `--model svm` with `--model nb` to use Naive Bayes instead of SVM.
- The results will be saved in `predicted_test_data.csv` with two columns: `target` and `Index`.
- Data Augmentation: To further improve accuracy, data augmentation techniques such as synonym replacement, paraphrasing, or adding noisy examples could be explored.
- Hyperparameter Tuning: Grid search or random search could be used to tune hyperparameters and improve model performance (a rough sketch follows this list).
- Ensemble Models: Combining multiple models (e.g., SVM + Naive Bayes) using an ensemble approach might yield better predictions.
- Cloud-Based Solution: For even faster processing of large datasets, a cloud-based service with GPU acceleration could be used.
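As an illustration of the hyperparameter-tuning idea above, a grid search over the `LinearSVC` regularization strength might look like the following; the parameter values are placeholders, and `X_train` / `y_train` are the hashed features and IAB labels from training:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Candidate values for the regularization parameter C are illustrative only.
param_grid = {"C": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    LinearSVC(),
    param_grid,
    cv=3,         # 3-fold cross-validation on the training data
    n_jobs=-1,    # use all available cores
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```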