The goal of this project is to develop a machine learning model that predicts IAB categories for articles based on their text content. We use a supervised learning approach: a training dataset with known text and categories is used to train the model, and a test dataset with unknown categories is used to generate predictions for evaluation.
- **Data Preprocessing**:
- We load large CSV datasets (up to 1.24 GB) in manageable chunks to prevent memory overflow.
- The text data is vectorized using `HashingVectorizer`, which converts the raw text into a numerical format suitable for machine learning.
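A minimal sketch of the chunked loading step, assuming the CSVs are read with `pandas` and that the relevant columns are named `text` and `target` (both the column names and the chunk size are illustrative):

```python
import pandas as pd

parts = []
# Read the large training CSV in 100,000-row chunks so the full file
# (up to ~1.24 GB) never has to be parsed into memory in one pass.
for chunk in pd.read_csv("train.csv", chunksize=100_000):
    # Keep only the columns the model needs before accumulating;
    # in the actual script each chunk could also be vectorized immediately.
    parts.append(chunk[["text", "target"]])

train_df = pd.concat(parts, ignore_index=True)
```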
- **Feature Engineering**:
- `HashingVectorizer`: This vectorizer is a fast and memory-efficient alternative to `TfidfVectorizer`. It hashes text into a fixed number of features (set to 5000 in our case), which allows us to handle large datasets and high-dimensional text data without requiring all terms to be stored in memory.
- Stop Words Removal: We removed common English stop words to improve the model's performance by eliminating frequently occurring but unimportant words.
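A sketch of the vectorizer configuration described above (5000 hashed features, English stop words removed). The `alternate_sign=False` setting is an assumption: it keeps the hashed counts non-negative, which is needed if they are later fed to `MultinomialNB`.

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    n_features=5000,        # fixed-size hashed feature space
    stop_words="english",   # drop common English stop words
    alternate_sign=False,   # keep counts non-negative (assumption, needed for Naive Bayes)
)

X = vectorizer.transform(["Example article text about sports and finance."])
print(X.shape)  # (1, 5000) sparse matrix
```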
- **Model Selection**:
- Support Vector Machine (`LinearSVC`): We initially chose a linear support vector classifier because of its strong performance in text classification tasks. However, SVM training is computationally expensive, particularly for large datasets.
- Naive Bayes (`MultinomialNB`): To improve speed, we also implemented Naive Bayes, which is faster for large datasets and often performs well for text classification.
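A sketch of how the two candidate models might be instantiated; the `build_model` helper is hypothetical, and `X_train` / `y_train` are assumed to come from the preprocessing step:

```python
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

def build_model(name: str):
    # Choose between the two classifiers discussed above.
    if name == "svm":
        return LinearSVC()      # strong text-classification baseline, slower on very large data
    if name == "nb":
        return MultinomialNB()  # much faster to train on non-negative count features
    raise ValueError(f"unknown model: {name}")

model = build_model("svm")
model.fit(X_train, y_train)
```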
- **Parallel Processing**:
- The model training process is parallelized using `joblib` and the `parallel_backend` function, enabling multi-core training to take full advantage of the available hardware (12th Gen Intel Core i5-12300F, 6 cores, 12 logical processors).
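A sketch of wrapping training in a `joblib` backend with one worker per logical processor; this is an assumption about how the script is structured, and only routines that use `joblib` internally will actually fan out across cores:

```python
from joblib import parallel_backend

# Run any joblib-parallelized work inside fit() across the 12 logical processors.
with parallel_backend("loky", n_jobs=12):
    model.fit(X_train, y_train)
```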
- **Model Evaluation**:
- Predictions are made on a test dataset using the trained model.
- The predicted categories are saved in a CSV file along with the corresponding `Index` from the test data for easy evaluation.
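A sketch of the prediction-and-export step, assuming the test set has already been vectorized as `X_test` and its index column is available as `test_df["Index"]`:

```python
import pandas as pd

# Predict categories for the test articles and pair them with their original Index.
predictions = model.predict(X_test)
output = pd.DataFrame({"target": predictions, "Index": test_df["Index"]})
output.to_csv("predicted_test_data.csv", index=False)
```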
- **Text Vectorization**: We used the `HashingVectorizer` from `scikit-learn` to transform text data into a high-dimensional sparse matrix. This matrix represents the frequency of hashed terms in the text and is efficient for handling large datasets. The vectorizer transforms the text into 5000 features (hashed values), which is a reasonable dimensionality for a large text dataset.
- Pros of `HashingVectorizer`:
- Fast and memory-efficient.
- No need to store a vocabulary, making it ideal for large datasets.
- Handles unseen words gracefully by hashing them into the same space.
- **Stop Words Removal**: Stop words (like "the", "is", "in") are common words that do not provide meaningful information in most text classification tasks. By removing these stop words, we reduced noise in the input text and improved model performance.
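To illustrate the "no vocabulary" property mentioned above: the vectorizer never needs a `fit` step, so previously unseen words are simply hashed into the same fixed-width space.

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=5000, stop_words="english")

# No fit() call; hashing is stateless, so brand-new words land in the same 5000 columns.
X = vectorizer.transform([
    "the quick brown fox",                  # "the" is dropped as a stop word
    "a completely unseen neologism xyzzy",  # unseen tokens are still hashed
])
print(X.shape)  # (2, 5000) sparse matrix
```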
- `text-classification-model.py`: The Python script that loads the training and test data, vectorizes the text, trains the model, makes predictions, and saves the results.
- `train.csv`: The large training dataset containing ~697,528 rows of text data and target IAB categories.
- `test.csv`: The test dataset used for making predictions, containing text and a unique index for each entry.
- `predicted_test_data.csv`: The output CSV file containing the predicted categories for each article along with the corresponding index.
- Install the necessary libraries:
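For example (this assumes the script only needs `scikit-learn`, `pandas`, and `joblib`; adjust to the project's actual requirements):

```bash
pip install scikit-learn pandas joblib
```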
- Run the model with the following command:
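Based on the model flag described below, the invocation is likely of this form (any additional flags for data paths are not shown here):

```bash
python text-classification-model.py --model svm
```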
- Replace `--model svm` with `--model nb` to use Naive Bayes instead of SVM.
- The results will be saved in `predicted_test_data.csv` with two columns: `target` and `Index`.
- Data Augmentation: To further improve accuracy, data augmentation techniques such as synonym replacement, paraphrasing, or adding noisy examples could be explored.
- Hyperparameter Tuning: Grid search or random search could be used to tune hyperparameters and improve model performance (a rough sketch follows this list).
- Ensemble Models: Combining multiple models (e.g., SVM + Naive Bayes) using an ensemble approach might yield better predictions.
- Cloud-Based Solution: For even faster processing of large datasets, a cloud-based service with GPU acceleration could be used.
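As an illustration of the hyperparameter-tuning idea above, a grid search over the `LinearSVC` regularization strength might look like the following; the parameter values are placeholders, and `X_train` / `y_train` are the hashed features and IAB labels from training:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Candidate values for the regularization parameter C are illustrative only.
param_grid = {"C": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    LinearSVC(),
    param_grid,
    cv=3,         # 3-fold cross-validation on the training data
    n_jobs=-1,    # use all available cores
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```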