diff --git a/UseCases/Banking_Customer_Churn_IVSM/IVSM_Banking_Customer_Churn.ipynb b/UseCases/Banking_Customer_Churn_IVSM/IVSM_Banking_Customer_Churn.ipynb deleted file mode 100644 index c226190e..00000000 --- a/UseCases/Banking_Customer_Churn_IVSM/IVSM_Banking_Customer_Churn.ipynb +++ /dev/null @@ -1,1076 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "967a4811-b85a-4a65-b2c0-2f40876b6fff", - "metadata": {}, - "source": [ - "
\n", - "

\n", - " IVSM Banking Customer Churn\n", - "
\n", - " \"Teradata\"\n", - "

\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "59ebf5e7-4d95-4077-8dcd-15d4ee322d37", - "metadata": {}, - "source": [ - "

Introduction

\n", - "\n", - "
\n", - "\n", - "\n", - "

Customer churn is a critical metric in banking because it can directly impact a bank's revenue and profitability. When customers leave, banks lose the income they would have earned from those customers' transactions, investments, and account fees. Additionally, attracting new customers to replace those who have left can be expensive and time-consuming, so reducing customer churn is often more cost-effective than acquiring new customers.

\n", - "\n", - "

Customer churn can also be an indicator of customer satisfaction and loyalty. If customers leave at a high rate, they may be dissatisfied with the bank's products or services, customer service, or overall experience.

\n", - "\n", - "

          Banks can use various strategies to reduce customer churn, such as improving customer service, offering more competitive rates and fees, providing personalized recommendations and offers, and enhancing digital channels and mobile apps. By tracking and analyzing customer churn rates, banks can identify areas for improvement and make strategic decisions to retain customers and improve overall customer satisfaction.
          

\n", - "\n", - "

          In this demo, we demonstrate how to implement the entire lifecycle of churn prediction using Vantage technologies, specifically the combination of the Bring Your Own Model (BYOM) feature, the Vantage Analytics Library (VAL), and the teradataml Python client library.
          

" - ] - }, - { - "cell_type": "markdown", - "id": "01f76290-5b46-4977-aa87-d20f26577ceb", - "metadata": {}, - "source": [ - "
\n", - "

Import the required libraries

\n", - "\n", - "

Here, we import the required libraries, set environment variables and environment paths (if required).

" - ] - }, - { - "cell_type": "markdown", - "id": "83b4c041-8bc1-43cf-93d6-4162c245ea14", - "metadata": {}, - "source": [ - "
\n", - "

Note: Please execute notebooks Step1 through Step3 before executing this use case.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "61bdabd9-d9af-4008-81fa-44bf639ca35d", - "metadata": {}, - "outputs": [], - "source": [ - "import warnings\n", - "warnings.filterwarnings('ignore')\n", - "\n", - "import os\n", - "import pandas as pd\n", - "\n", - "import teradataml as tdml\n", - "import getpass\n", - "from teradataml import in_schema\n", - "from teradataml import DecisionForest, XGBoost, TrainTestSplit, DecisionForestPredict, XGBoostPredict, SentimentExtractor, ColumnTransformer, ScaleFit, OneHotEncodingFit\n", - "from teradataml import ColumnSummary, AutoML, AutoClassifier\n", - "from teradataml import RoundColumns, ClassificationEvaluator, ROC\n", - "from teradataml import (\n", - " DataFrame\n", - ")\n", - "from teradataml import KMeans\n", - "from teradataml import create_context\n", - "from teradataml import SVM, SVMPredict\n", - "from teradataml import GridSearch, RandomSearch\n", - "from teradatasqlalchemy import BYTEINT" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6de29e63-cb44-4864-8a15-653c0838a64d", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tdml.configure.val_install_location = \"val\"" - ] - }, - { - "cell_type": "markdown", - "id": "8419ac9b-e15e-4780-a60f-dba3da785232", - "metadata": {}, - "source": [ - "
\n", - "1. Initiate a connection to Vantage\n", - "

You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8248b487-0b67-4039-905f-5901c6348d05", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Change host and/or username as needed\n", - "%run -i ../startup.ipynb\n", - "eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)\n", - "print(eng)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bff62fbc-9013-4854-9a02-35d95acd5d69", - "metadata": {}, - "outputs": [], - "source": [ - "%%capture\n", - "execute_sql('''SET query_band='DEMO=IVSM_Banking_Customer_Churn.ipynb;' UPDATE FOR SESSION; ''')" - ] - }, - { - "cell_type": "markdown", - "id": "5630371a-648a-42fe-8d90-bc9c23d6d422", - "metadata": {}, - "source": [ - "

Getting Data for This Demo

\n", - "

          We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage. The following cell contains two statements, one of which is commented out; you can switch modes by moving the comment.
          

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8eb5ed50-5295-462f-8003-c414efbbd90e", - "metadata": {}, - "outputs": [], - "source": [ - "# %run -i ../run_procedure.py \"call get_data('DEMO_BankChurnIVSM_cloud');\" \n", - "%run -i ../run_procedure.py \"call get_data('DEMO_BankChurnIVSM_local');\"" - ] - }, - { - "cell_type": "markdown", - "id": "17d614ea-2481-4b3e-8174-3e900da25dd2", - "metadata": {}, - "source": [ - "

          The next step is optional: run it if you want to see the status of the databases/tables created and the space used.
          

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cde7dd9b-6dbc-4e19-8ebe-3c3153fdb969", - "metadata": {}, - "outputs": [], - "source": [ - "%run -i ../run_procedure.py \"call space_report();\" # Takes 10 seconds" - ] - }, - { - "cell_type": "markdown", - "id": "860d53bf-9684-4534-84c8-77c52f17c9d7", - "metadata": {}, - "source": [ - "

1.1 Confirmation for functions\n", - "

Now we can confirm that the required functions are installed.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bc34ce47-36e6-4014-b2b5-9c618a876caf", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display, Markdown\n", - "\n", - "df_check= DataFrame.from_query('''select count(*) as cnt from dbc.tablesV where databasename = 'ivsm';''')\n", - "if df_check.get_values()[0][0] >= 10:\n", - " print('Functions are installed, please continue.')\n", - "else:\n", - " print('Functions are not installed, please go to Instalization notebook before proceeding further')\n", - " display(Markdown(\"[Initialization Notebook](./1.IVSM_Banking_Customer_Churn_Model_Install.ipynb)\"))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e8ab03ea-7dd5-428f-ab04-8ca6fa155850", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "df = tdml.DataFrame('complaint_embeddings_store')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d2a8e69b-a40c-4014-831e-3a9c569aab88", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "c56e010d-a05f-4d32-807c-c0cc79862371", - "metadata": {}, - "source": [ - "
\n", - "2. Run K-Means on the Embeddings Store and then build final table with Cluster ID assignments to rows" - ] - }, - { - "cell_type": "markdown", - "id": "9849e495-9b05-41ae-80ec-ff5986059f0e", - "metadata": {}, - "source": [ - "

          The K-means() function groups a set of observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid). The algorithm minimizes an objective function: the total squared Euclidean distance of all data points from the centers of their assigned clusters.
          

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7e867a83-9b0b-4dba-97c6-1c1d22115071", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "cols = list(df.columns)[2:]\n", - "\n", - "KMeans_out = KMeans(id_column=\"id\",\n", - " target_columns=cols,\n", - " data=df,\n", - " num_clusters=10,\n", - " output_cluster_assignment=True\n", - " )" - ] - }, - { - "cell_type": "markdown", - "id": "1c693bbd-db8e-424c-84a6-50fe7792e8d4", - "metadata": {}, - "source": [ - "

The output below shows cluster assignment for each row.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8745054e-2dd6-4c6f-8842-e144c92c263c", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "clusters = KMeans_out.result" - ] - }, - { - "cell_type": "markdown", - "id": "40048047-c26d-4434-b98d-b14a68ef22b5", - "metadata": {}, - "source": [ - "

Let's check how many data points each cluster has.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b48dea35-6b0b-4d29-ace7-a541151c2a92", - "metadata": {}, - "outputs": [], - "source": [ - "clusters" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3485db2f-7469-4e09-9367-8b7c642cdd37", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "merged_df = clusters.merge(df[['id','txt']], on='id', how='inner', lsuffix='_left', rsuffix='_right')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9af42600-e4bb-4c9f-b9e6-b8b5dc53dc54", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "merged_df=merged_df.drop('id__left', axis=1)" - ] - }, - { - "cell_type": "markdown", - "id": "1932167b-921e-49fb-b7f3-39953a7ebe2f", - "metadata": { - "tags": [] - }, - "source": [ - "

Create a \"Virtual DataFrame\" that points to the data set in Vantage.

\n", - "

*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4755aaa1-269b-4003-b2e3-942a41ed1299", - "metadata": {}, - "outputs": [], - "source": [ - "customer_churn = DataFrame(in_schema('DEMO_BankChurnIVSM', 'Bank_Churn'))\n", - "customer_churn" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c777aa80-b025-407d-a960-a05ffda40c66", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "new_df = customer_churn.merge(merged_df[['id__right','td_clusterid_kmeans']],\n", - " on='customerid = id__right',\n", - " how='inner')\n", - "new_df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dc0233da-f64a-4790-80dc-c5d516a113d4", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "new_df = new_df.drop('id__right',axis=1)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "379b386c-22aa-4044-b52c-584b165f3c3f", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "new_df" - ] - }, - { - "cell_type": "markdown", - "id": "f5bd271b-8787-4ac5-b568-95c562151ba7", - "metadata": {}, - "source": [ - "
\n", - "3. Data Transformation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ff20075d-ece4-49a5-bf7d-46e275622d63", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "target_variable = \"Exited\"\n", - "numeric_columns = [\"Age\", \"Balance\", \"CreditScore\", \"EstimatedSalary\", \"Tenure\"]\n", - "categorical_columns = [\"Gender\", \"Geography\", \"td_clusterid_kmeans\", \"NumOfProducts\"]\n", - "binary_columns = [\"HasCrCard\", \"IsActiveMember\"]\n", - "id_column = [\"CustomerId\"]" - ] - }, - { - "cell_type": "markdown", - "id": "63223a27-eac0-4949-b720-d9c3d5365678", - "metadata": {}, - "source": [ - "

          The ScaleFit() function outputs statistics that are input to the ScaleTransform() function, which scales specified input DataFrame columns.</p>
          
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "95e23b28-6858-42c7-8224-7efecce93ef8", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "fit1 = ScaleFit(data=new_df,\n", - " target_columns=numeric_columns,\n", - " scale_method=\"USTD\",\n", - " miss_value=\"KEEP\",\n", - " global_scale=False,\n", - " multiplier=\"1\")" - ] - }, - { - "cell_type": "markdown", - "id": "70b3745f-db85-45a9-a2c1-f9dc470619b6", - "metadata": {}, - "source": [ - "

          The OneHotEncodingFit function outputs a table of attributes and categorical values that is input to OneHotEncodingTransform, which encodes them as one-hot numeric vectors.</p>
          

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e40acda6-2856-4745-a7f7-29000f128b6b", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "fit2 = OneHotEncodingFit(data=new_df,\n", - " is_input_dense=True,\n", - " approach=\"auto\",\n", - " target_column=categorical_columns[0:2],\n", - " category_counts=[2,3])" - ] - }, - { - "cell_type": "markdown", - "id": "7200fd44-cb39-42fb-b8c5-79fe6f9668fd", - "metadata": {}, - "source": [ - "

          The ColumnTransformer function transforms the entire dataset in a single operation. You only need to provide the fit tables to the function, and it runs all the transformations that you require in a single operation. Running all the fit-table transformations together in one go gives an approximately 30% performance improvement over running each transformation sequentially.</p>
          

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "43706ada-d4a0-4c55-8250-7433fd4b5b95", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "new_table = ColumnTransformer(input_data=new_df,\n", - " onehotencoding_fit_data=fit2.result,\n", - " scale_fit_data=fit1.output).result" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b177b817-df41-4845-8972-b91224ef76b2", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "new_table=new_table[['CustomerId', 'Age', 'Balance', 'CreditScore', 'EstimatedSalary', 'Exited', 'Gender', 'Geography', 'HasCrCard',\n", - " 'IsActiveMember', 'NumOfProducts', 'Tenure', 'td_clusterid_kmeans', 'Gender_0', 'Gender_1', 'Geography_0',\n", - " 'Geography_1', 'Geography_2']]" - ] - }, - { - "cell_type": "markdown", - "id": "2e44cfa5-0074-4393-ba3b-53502f630496", - "metadata": {}, - "source": [ - "

3.1 Train-Test Split" - ] - }, - { - "cell_type": "markdown", - "id": "7a85bc76-b901-433b-a419-c147f8c8cf2a", - "metadata": {}, - "source": [ - "

          The TrainTestSplit() function divides the dataset into train and test subsets to be used for training machine learning models and validating them.<br>80% of the data is used for training and 20% for testing.</p>
          

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9cbe56b7-5320-439e-a605-7b5d22a2d84c", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "TrainTestSplit_out = TrainTestSplit(data = new_table,\n", - " id_column='CustomerId',\n", - " train_size=0.80,\n", - " test_size=0.20,\n", - " seed=3432)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cc2ee841-8b9d-40b0-88f4-7f08793a0ea5", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "TrainTestSplit_out.result.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ec9c9d04-6f2f-488b-b317-05aac9a0cf57", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "df_train = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 1].drop(['TD_IsTrainRow'], axis = 1)\n", - "df_test = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 0].drop(['TD_IsTrainRow'], axis = 1)\n", - "\n", - "print(\"Training Set = \" + str(df_train.shape[0]) + \". Testing Set = \" + str(df_test.shape[0]))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c3398193-36de-43ed-8af4-3369f4465ed1", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tdml.copy_to_sql(df_train, table_name = 'clean_data_train1', if_exists = 'replace')\n", - "tdml.copy_to_sql(df_test, table_name = 'clean_data_test1', if_exists = 'replace')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e4ee0b66-e0f5-4856-908a-1126542b8021", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "df_train = tdml.DataFrame(in_schema('demo_user','clean_data_train1'))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bfbeea0e-0d41-4b2d-809c-89314ce711df", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "df_test = tdml.DataFrame(in_schema('demo_user','clean_data_test1'))" - ] - }, - { - "cell_type": "markdown", - "id": "b87ce3ac-9ac9-462b-9b5b-8f0e5f9a7ef5", - "metadata": {}, - "source": [ - "
\n", - "

4. Modelling

" - ] - }, - { - "cell_type": "markdown", - "id": "75446770-a32e-4e93-acf5-8ac95924bf6d", - "metadata": {}, - "source": [ - "

4.1 Train an XGBoost Model\n", - "

The XGBoost() function is an efficient implementation of gradient boosting for classification and regression tasks. It builds an ensemble of decision trees in a sequential manner to minimize prediction error.

\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "81064960-5f1b-43ef-8aea-110e7d6ad066", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "formula_str = \"Exited ~ CreditScore + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary + Gender_0 + Gender_1 + Geography_0 + Geography_1 + Geography_2 + td_clusterid_kmeans\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "239a6894-16ca-4079-9436-33a620997e4e", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "XGBoost_out2 = XGBoost(data=df_train,\n", - " id_column='CustomerId',\n", - " loss_function='logistic',\n", - " formula = formula_str,\n", - " iter_num=5,\n", - " min_node_size=1,\n", - " #num_boosted_trees=50, \n", - " num_boosted_trees=80,\n", - " lambda1 = 500,\n", - " shrinkage_factor=0.5,\n", - " max_depth=10)" - ] - }, - { - "cell_type": "markdown", - "id": "5c2ef7f0-b8ea-4331-a75a-f4920ceef2e0", - "metadata": {}, - "source": [ - "

4.2 Predict Labels using the XGBoost Model

" - ] - }, - { - "cell_type": "markdown", - "id": "bccb95be-6a91-4319-ba67-42c793ffe519", - "metadata": {}, - "source": [ - "

The XGBoostPredict() function is used to predict the target labels for the test dataset (df_test) based on the trained XGBoost model.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "27caf437-c3c3-4fa4-baeb-f37fa7d377ae", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "XGBoostPredict_out_1 = XGBoostPredict(newdata=df_test,\n", - " object=XGBoost_out2.result,\n", - " id_column='CustomerId',\n", - " accumulate='Exited')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a8ae2e7f-136f-4cfc-ba7e-17bba21cb619", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "XGBoostPredict_out_1.result" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3aca8957-cdac-4507-bef7-ea9491458f7e", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "predict_df = XGBoostPredict_out_1.result\n", - "predict_df = predict_df.assign(Prediction = predict_df.Prediction.cast(type_ = BYTEINT))\n", - "predict_df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "3f6169eb-b18d-4e75-b5b7-9e897b4926e6", - "metadata": {}, - "source": [ - "

5. Evaluate the Model

\n", - "

          The ClassificationEvaluator() function evaluates and emits various metrics of a classification model based on its predictions on the data. Apart from accuracy, the secondary output returns micro, macro, and weighted-averaged precision, recall, and F1-score values.<br>
          
\n", - "This is a powerful function, and doesn't move data outside Vantage." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2559ce7b-871e-4728-8cc6-185b41889a9e", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "ClassificationEvaluator_obj = ClassificationEvaluator(data=predict_df,\n", - " observation_column='Exited',\n", - " prediction_column='Prediction',\n", - " labels=['0', '1'])\n", - "classeval_decisiondf = ClassificationEvaluator_obj.output_data\n", - "classeval_decisiondf" - ] - }, - { - "cell_type": "markdown", - "id": "2de01e8c-7e52-439f-8d11-ebdf0a37110d", - "metadata": {}, - "source": [ - "
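          As a quick cross-check of the reported accuracy, the same number can be derived directly from the predictions (a sketch; the comparison and counts are pushed down to Vantage):

          ```python
          # Sketch: accuracy = fraction of rows where the predicted label
          # matches the observed Exited label.
          matches = predict_df[predict_df.Prediction == predict_df.Exited]
          print("Accuracy:", matches.shape[0] / predict_df.shape[0])
          ```
          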

5.1 Compute ROC Curve

\n", - "

The ROC() function calculates the Receiver Operating Characteristic (ROC) curve to evaluate the performance of the model, using the predicted probabilities and the actual class labels." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b5250cb3-8538-4925-aab5-f6bd78732a07", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "roc_df = ROC(data = predict_df, \n", - " probability_column = \"Prediction\",\n", - " observation_column = \"Exited\",\n", - " positive_class=\"1\"\n", - " )\n", - "roc_df.output_data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ccfb7c90-7b53-4ffb-9ae5-e1da312842ec", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "auc = roc_df.result.get_values()[0][0]\n", - "auc" - ] - }, - { - "cell_type": "markdown", - "id": "659a0cf9-7c51-4f03-9d28-9b1a09f31a5e", - "metadata": {}, - "source": [ - "

5.2 Plot ROC Curve

\n", - "

Plots the ROC curve using fpr (False Positive Rate) and tpr (True Positive Rate) from the ROC data, and displays the Area Under the Curve (AUC) for model evaluation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "17f4a9f0-65ec-4e62-8e02-d3086777a788", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "plot_roc_df = roc_df.output_data\n", - "plot = plot_roc_df.plot(x=plot_roc_df.fpr, y=plot_roc_df.tpr,\n", - " title=\"Receiver Operating Characteristic (ROC) Curve\",\n", - " xlabel='False Positive Rate', \n", - " ylabel='True Positive Rate', \n", - " color=\"blue\",\n", - " legend=f'AUC = {round(auc, 4)}',\n", - " legend_style='lower right',\n", - " grid_linestyle='--',\n", - " grid_linewidth=0.5)\n", - " \n", - "# Display the plot.\n", - "plot.show()" - ] - }, - { - "cell_type": "markdown", - "id": "a2c2b009-c04f-4a0c-a979-1aed0ad92d10", - "metadata": {}, - "source": [ - "

5.3 Hyperparameter Tuning

\n", - "

Sets the parameters for the classification model, including input columns, response column, hyperparameters (e.g., max_depth, lambda1), and other settings such as shrinkage_factor, seed, and iter_num." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d9a40b30-afa9-433d-84ea-ca6ab8d7dd75", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "model_params = {\"input_columns\":['CreditScore','Age', 'Tenure','Balance','NumOfProducts','HasCrCard','IsActiveMember','EstimatedSalary','Gender_0','Gender_1','Geography_0','Geography_1','Geography_2','td_clusterid_kmeans'],\n", - " \"response_column\" :'Exited',\n", - " \"max_depth\":(5,10,15),\n", - " \"lambda1\" :(1000.0,0.001),\n", - " \"model_type\" :\"Classification\",\n", - " \"seed\":32,\n", - " \"shrinkage_factor\":0.1,\n", - " \"iter_num\":(5, 50)}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "facf3bbe-d0f4-45b2-beff-384bd0dd110d", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "eval_params = {\"id_column\": \"CustomerId\",\n", - " \"accumulate\":\"Exited\",\n", - " \"model_type\":'Classification',\n", - " \"object_order_column\":['task_index', 'tree_num', 'iter','class_num', 'tree_order']}" - ] - }, - { - "cell_type": "markdown", - "id": "b3006745-5a9d-4f53-b760-cabcc6feca08", - "metadata": {}, - "source": [ - "

          GridSearch is an exhaustive search algorithm that covers all possible parameter values to identify optimal hyperparameters. It works for teradataml analytic functions from the SQLE, BYOM, VAL, and UAF features. teradataml GridSearch allows users to perform hyperparameter tuning for all model-trainer and non-model-trainer functions." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6bbe0764-f708-41e8-b934-749be5e6da68", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "gs_obj = GridSearch(func=XGBoost, params=model_params)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e6b5fa12-e409-4a4e-b070-b69e162fbbf4", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "gs_obj.fit(data=df_train, verbose=2, run_parallel=True, evaluation_metric='Accuracy', **eval_params)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a6bcac47-5d5f-4028-a29e-893ae97de214", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "gs_obj.models" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c04c587b-7188-41b2-ba0a-1ec6c9725938", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "gs_obj.model_stats" - ] - }, - { - "cell_type": "markdown", - "id": "149342fc-9ae8-403f-bc55-4ed2e9cf7c39", - "metadata": {}, - "source": [ - "<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
          
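          When the grid is large, RandomSearch (already imported above) is a cheaper alternative that samples the same parameter space; a sketch, with n_iter choosing how many random parameter combinations to evaluate:

          ```python
          # Sketch: evaluate 5 randomly sampled hyperparameter combinations
          # instead of the exhaustive grid.
          rs_obj = RandomSearch(func=XGBoost, params=model_params, n_iter=5)
          rs_obj.fit(data=df_train, evaluation_metric='Accuracy', **eval_params)
          rs_obj.best_params_
          ```
          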

          The predict function uses the models generated by the model-trainer functions (SQLE, VAL, and UAF features) for predictions. Predictions are made using the best trained model. The predict function is not supported for non-model-trainer functions." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4502772b-da18-4011-b4d1-c2e010645141", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "gs_pred = gs_obj.predict(newdata=df_test, **eval_params)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "deed48c7-8d62-49f5-9cd9-0686dd208e9e", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "print(\"Prediction Result: \\n\", gs_pred.result)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3cca24c3-ade4-4c12-ae90-165caa014ac5", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "gs_obj.best_params_" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "96ece2ee-e6a6-43cc-a7ea-7f997d24d49b", - "metadata": {}, - "outputs": [], - "source": [ - "roc_df = ROC(data = gs_pred.result, \n", - "             probability_column = \"Prediction\",\n", - "             observation_column = \"Exited\",\n", - "             positive_class=\"1\"\n", - "            )\n", - "auc = roc_df.result.get_values()[0][0]\n", - "print('AUC: ', auc)\n", - "\n", - "plot_roc_df = roc_df.output_data\n", - "plot = plot_roc_df.plot(x=plot_roc_df.fpr, y=plot_roc_df.tpr,\n", - "                        title=\"Receiver Operating Characteristic (ROC) Curve\",\n", - "                        xlabel='False Positive Rate', \n", - "                        ylabel='True Positive Rate', \n", - "                        color=\"blue\",\n", - "                        legend=f'AUC = {round(auc, 4)}',\n", - "                        legend_style='lower right',\n", - "                        grid_linestyle='--',\n", - "                        grid_linewidth=0.5)\n", - "    \n", - "# Display the plot.\n", - "plot.show()" - ] - }, - { - "cell_type": "markdown", - "id": "b3afc61d-27bd-4f5e-8977-dc20603f7247", - "metadata": {}, - "source": [ - "<hr style=\"height:2px;border:none;background-color:#00233C;\">
          


\n", - "6. Cleanup\n", - "

The following code will remove the context.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "36bfae31-60de-4ff9-8a64-8e52c44efd14", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tdml.remove_context()" - ] - }, - { - "cell_type": "markdown", - "id": "a7d5c20d-a4fd-439b-8ba3-70c4932b7fcb", - "metadata": {}, - "source": [ - "
\n", - "Dataset:\n", - "\n", - "- `Unnamed`: Unnamed\n", - "- `CustomerId`: Customer ID\n", - "- `Surname`: Surname\n", - "- `CreditScore`: Credit score\n", - "- `Geography`: Country (Germany / France / Spain)\n", - "- `Gender`: Gender (Female / Male)\n", - "- `Age`: Age\n", - "- `Tenure`: No of years the customer has been associated with the bank\n", - "- `Balance`: Balance\n", - "- `NumOfProducts`: No of bank products used\n", - "- `HasCrCard`: Credit card status (0 = No, 1 = Yes)\n", - "- `IsActiveMember`: Active membership status (0 = No, 1 = Yes)\n", - "- `EstimatedSalary`: Estimated salary\n", - "- `Exited`: Abandoned or not? (0 = No, 1 = Yes)\n", - "\n", - "

Links:

\n", - "" - ] - }, - { - "cell_type": "markdown", - "id": "450c8c1b-86a7-435b-b0bc-e50bddf4e9b7", - "metadata": { - "tags": [] - }, - "source": [ - "" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/UseCases/Banking_Customer_Churn_IVSM/Step1.IVSM_Banking_Customer_Churn_Model_Install.ipynb b/UseCases/Banking_Customer_Churn_IVSM/Step1.IVSM_Banking_Customer_Churn_Model_Install.ipynb deleted file mode 100644 index ac84c45e..00000000 --- a/UseCases/Banking_Customer_Churn_IVSM/Step1.IVSM_Banking_Customer_Churn_Model_Install.ipynb +++ /dev/null @@ -1,508 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "26632eab-8349-4ce6-82f2-3e34a35887c2", - "metadata": {}, - "source": [ - "
\n", - "

\n", - " IVSM Banking Customer Churn Model Install\n", - "
\n", - " \"Teradata\"\n", - "

\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "77679c3b-4db5-427e-b78d-640d05275800", - "metadata": {}, - "source": [ - "

Introduction

\n", - "

          \nHugging Face is a French-American company based in New York City that develops tools for building applications using machine learning. It is known for its Transformers library, which provides open-source implementations of transformer models for text, image, video, and audio tasks, including time series. These models include well-known architectures like BERT and GPT. The library is compatible with the PyTorch, TensorFlow, and JAX deep learning frameworks.<br>
          \nDeep learning models on Hugging Face are pre-trained by users, open-source groups, and companies on various types of data – NLP, audio, images, videos, etc. The most popular tool of choice is PyTorch (an open-source Python library), which can be used to create a deep learning model from scratch or to take an existing model and retrain/fine-tune it on a new dataset (transfer learning) before publishing it to Hugging Face. Models can be inferenced on CPUs or GPUs, with only a slight performance improvement from GPUs for smaller models.<br>
          
\n", - "

\n", - "

Why Vantage?

\n", - "

          As many Hugging Face models are available in ONNX format, we can load them using the BYOM feature of Vantage and run them in Vantage. Because of graph optimizations in ONNX Runtime, published benchmarks show that inference with ONNX Runtime can be about 20% faster than a native PyTorch model on a CPU.</p>
          

\n", - " \n", - "

          Vantage parallelism on top of boosted ONNX Runtime inference can make a Vantage system as effective as inference on GPUs. If we have a Vantage box with 72 AMPs and the table is perfectly distributed, it will closely match the performance of a dedicated GPU, and the data never moves across the network, saving time and I/O operations. As parallelism increases with the number of AMPs, model inference completes faster in Teradata Vantage on the same amount of text data versus a GPU. We can of course quantize the model (change float weights to int8/int4) for even faster inference on a CPU, with some tradeoff in accuracy. However, as model size goes up the GPU advantage widens – for example with LLMs like Llama 3 – and costs become disproportionate with GPUs, but for smaller models we can get comparable performance.
          

\n", - "\n", - "

Overall flow:

" - ] - }, - { - "cell_type": "markdown", - "id": "b9c95843-85c5-402e-911e-d1f02698ee10", - "metadata": {}, - "source": [ - "
\"Design
" - ] - }, - { - "cell_type": "markdown", - "id": "3ee0dba0-93e7-4b5a-be19-820c0c46e4ce", - "metadata": {}, - "source": [ - "
\n", - "1. Configuring the environment\n", - "\n", - "

1.1 Install the required libraries

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "27a55daf-1401-4253-93b2-36ba13756146", - "metadata": {}, - "outputs": [], - "source": [ - "%%capture\n", - "\n", - "!pip install optimum sentence_transformers==4.0.2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eaecd47b-ca12-44ea-bfc2-021a6cd48ec8", - "metadata": {}, - "outputs": [], - "source": [ - "%%capture\n", - "\n", - "!pip install --upgrade torch" - ] - }, - { - "cell_type": "markdown", - "id": "9061373e-60b6-488e-989f-ba421dcb025a", - "metadata": {}, - "source": [ - "
\n", - "

Note: Please restart the kernel after executing these two lines. The simplest way to restart the Kernel is by typing zero zero: 0 0\n", - "\n", - "
          You can remove or comment out the %%capture if you want to observe what !pip install is doing.</p>
          

" - ] - }, - { - "cell_type": "markdown", - "id": "fb451e8f-55c0-49dd-90b3-35453d4a9e56", - "metadata": {}, - "source": [ - "
\n", - "

1.2 Import the required libraries

\n", - "\n", - "

Here, we import the required libraries, set environment variables and environment paths (if required).

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "704ddfb5-2c6b-4df7-93f7-1135461dc442", - "metadata": {}, - "outputs": [], - "source": [ - "# Suppress warnings\n", - "import warnings\n", - "warnings.filterwarnings('ignore')\n", - "\n", - "# Standard libraries\n", - "import os\n", - "import getpass\n", - "import json\n", - "\n", - "# Data handling\n", - "import pandas as pd\n", - "import teradataml as tdml\n", - "\n", - "# ONNX runtime and tools\n", - "import onnx\n", - "import onnxruntime as rt\n", - "from onnxruntime.tools.onnx_model_utils import *\n", - "\n", - "# Transformers and sentence similarity\n", - "import transformers\n", - "from sentence_transformers import SentenceTransformer\n", - "from sentence_transformers.util import cos_sim\n", - "\n", - "from teradataml import (\n", - " create_context,\n", - " delete_byom,\n", - " display,\n", - " execute_sql,\n", - " save_byom,\n", - " remove_context,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6de29e63-cb44-4864-8a15-653c0838a64d", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tdml.configure.val_install_location = \"val\"" - ] - }, - { - "cell_type": "markdown", - "id": "872e2efc-fe5a-4f09-88f1-b3c530600c8e", - "metadata": {}, - "source": [ - "
\n", - "

2. Connect to Vantage

\n", - "\n", - "

We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4ccd0aa4-27a3-4a1a-b3e4-7d2870345106", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "%run -i ../startup.ipynb\n", - "eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)\n", - "print(eng)" - ] - }, - { - "cell_type": "markdown", - "id": "c4bbd587-8db8-4a22-8d6d-dfe944a9cf65", - "metadata": {}, - "source": [ - "
\n", - "

3. Creation of functions

\n", - "

          The command below will create the database and the functions required for text summarization and embedding with Hugging Face PyTorch models in Vantage.</p>
          

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c5ef4906-78ef-4cca-9fe8-8466be36960e", - "metadata": {}, - "outputs": [], - "source": [ - "with open(\"commands.json\", \"r\") as file:\n", - " data = json.load(file)\n", - "\n", - "for item in data[\"queries\"]:\n", - " try:\n", - " execute_sql(item[\"query\"])\n", - " except Exception as e:\n", - " print(\n", - " f\"The initialization steps have already been executed for this environment!\"\n", - " )\n", - " #print(f\"Error: {e}\")\n", - " pass" - ] - }, - { - "cell_type": "markdown", - "id": "e6af1548-d03d-4f33-ae1b-14b1c992ce09", - "metadata": {}, - "source": [ - "3.1 Drop Tables (if exist)\n", - "

Attempts to drop embeddings_models and embeddings_tokenizers tables, ignoring errors if they don't exist.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9bcae44f-50f9-4fa3-aec8-973776454999", - "metadata": {}, - "outputs": [], - "source": [ - "# Drop embeddings-related tables if they exist\n", - "SQL = [\n", - " \"DROP TABLE embeddings_models;\",\n", - " \"DROP TABLE embeddings_tokenizers;\"\n", - "]\n", - "\n", - "for query in SQL:\n", - " try:\n", - " tdml.execute_sql(query)\n", - " except:\n", - " pass # Suppress any errors if the tables do not exist" - ] - }, - { - "cell_type": "markdown", - "id": "94bd1fe9-81f0-4acb-9d75-a98f496b2f4a", - "metadata": {}, - "source": [ - "
\n", - "

4. HuggingFace Model installation

\n", - "

          In the steps below, we will download and install the Hugging Face model in Vantage.</p>
          

" - ] - }, - { - "cell_type": "markdown", - "id": "a681ec89-7a9b-4fa3-8c36-ecbd61262f51", - "metadata": {}, - "source": [ - "
\n", - "

          4.1 Download the Model using the Optimum utility</b></p>
          

\n", - "\n", - "

We will be using BAAI/bge-small-en-v1.5
The bge-small-en model is a small-scale English text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) as part of their FlagEmbedding project.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3913a608-c215-4819-bd71-29d6a69dda68", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "!optimum-cli export onnx --opset 16 --trust-remote-code -m BAAI/bge-small-en-v1.5 bge-small-en-v1.5-onnx" - ] - }, - { - "cell_type": "markdown", - "id": "e1d50061-e830-4db3-bc89-2f63d7ef034e", - "metadata": {}, - "source": [ - "
\n", - "

4.2 Model Preparation

\n", - "

          In the steps below, we will fix the dynamic dimensions, pin the opset/IR versions for compatibility, and prepare the model to be loaded into Vantage.</p>
          

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "88541fa5-ebc3-446d-b77f-d81a86fada11", - "metadata": {}, - "outputs": [], - "source": [ - "# Set the operator set version\n", - "op = onnx.OperatorSetIdProto()\n", - "op.version = 16\n", - "\n", - "# Load the original ONNX model\n", - "model = onnx.load('bge-small-en-v1.5-onnx/model.onnx')\n", - "\n", - "# Create a new model with a specified IR version and opset\n", - "model_ir8 = onnx.helper.make_model(\n", - " model.graph,\n", - " ir_version=8,\n", - " opset_imports=[op]\n", - ")\n", - "\n", - "# Fix variable dimension sizes\n", - "rt.tools.onnx_model_utils.make_dim_param_fixed(model_ir8.graph, \"batch_size\", 1)\n", - "rt.tools.onnx_model_utils.make_dim_param_fixed(model_ir8.graph, \"sequence_length\", 512)\n", - "rt.tools.onnx_model_utils.make_dim_param_fixed(model_ir8.graph, \"Divsentence_embedding_dim_1\", 384)\n", - "\n", - "# Remove the unnecessary \"token_embeddings\" output\n", - "for node in model_ir8.graph.output:\n", - " if node.name == \"token_embeddings\":\n", - " model_ir8.graph.output.remove(node)\n", - "\n", - "# Save the updated model\n", - "onnx.save(model_ir8, 'bge-small-en-v1.5-onnx/model_fixed.onnx')" - ] - }, - { - "cell_type": "markdown", - "id": "27878c2c-b929-4306-8d5c-03ed97f81ad5", - "metadata": {}, - "source": [ - "
\n", - "

4.3 Model Results validation

\n", - "

Checking that everything works with ONNX format locally.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9bd45b72-f1d1-4733-930f-d37615bccfad", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "sentences_1 = u'How is the weather today?'\n", - "sentences_2 = u'What is the current weather like today?'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3a243399-aa88-4ca6-aa9b-86b9de42e5e2", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Load the tokenizer and ONNX model session\n", - "tokenizer = transformers.AutoTokenizer.from_pretrained(\"./bge-small-en-v1.5-onnx\")\n", - "predef_sess = rt.InferenceSession(\"bge-small-en-v1.5-onnx/model_fixed.onnx\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "96b9c5a6-1a60-4703-a52e-307e556192bd", - "metadata": {}, - "outputs": [], - "source": [ - "# Tokenize the first sentence\n", - "enc = tokenizer(sentences_1, max_length=512, padding='max_length')\n", - "\n", - "# Run inference to get embeddings for the first sentence\n", - "result = predef_sess.run(\n", - " None,\n", - " {\n", - " \"input_ids\": [enc.input_ids],\n", - " \"attention_mask\": [enc.attention_mask]\n", - " }\n", - ")\n", - "\n", - "# Tokenize the second sentence\n", - "enc2 = tokenizer(sentences_2, max_length=512, padding='max_length')\n", - "\n", - "# Run inference to get embeddings for the second sentence\n", - "result2 = predef_sess.run(\n", - " None,\n", - " {\n", - " \"input_ids\": [enc2.input_ids],\n", - " \"attention_mask\": [enc2.attention_mask]\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "61193709-b4e2-4810-a72a-350f04c035da", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "print(cos_sim(result[0][0], result2[0][0]))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d93b921f-2496-4358-ba16-7ab835a01b92", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Load the SentenceTransformer model\n", - "model = SentenceTransformer('BAAI/bge-small-en-v1.5')\n", - "\n", - "# Generate normalized embeddings for both sentences\n", - "embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)\n", - "embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)\n", - "\n", - "# Print the cosine similarity between the two embeddings\n", - "print(cos_sim(embeddings_1, embeddings_2))" - ] - }, - { - "cell_type": "markdown", - "id": "27927e93-7794-4ad3-8c63-12e73f50c8a4", - "metadata": {}, - "source": [ - "
\n", - "

4.4 Save the Model

\n", - "

          In the steps above, we verified that the model works correctly in ONNX format. Now we will save the model file.</p>
          

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c2c8a458-138f-418a-8379-aeb7afd95322", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "try:\n", - " tdml.save_byom('bge-small-en-v1.5','bge-small-en-v1.5-onnx/model_fixed.onnx','embeddings_models')\n", - "except Exception as e:\n", - " print(f\"The model embeddings_models already exist.\")\n", - " pass\n", - " \n", - "try:\n", - " tdml.save_byom('bge-small-en-v1.5','bge-small-en-v1.5-onnx/tokenizer.json','embeddings_tokenizers')\n", - "except Exception as e:\n", - " print(f\"The model embeddings_tokenizers already exist.\")\n", - " pass" - ] - }, - { - "cell_type": "markdown", - "id": "93f8169c-8827-44da-8d4e-39963cd2f91d", - "metadata": {}, - "source": [ - "
\n", - "5. Cleanup\n", - "

The following code will remove the context.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "36bfae31-60de-4ff9-8a64-8e52c44efd14", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tdml.remove_context()" - ] - }, - { - "cell_type": "markdown", - "id": "47dfbb00-2743-437d-8c0e-4c0d8506214c", - "metadata": {}, - "source": [ - "" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/UseCases/Banking_Customer_Churn_IVSM/Step2.IVSM_Banking_Customer_Churn_Embeddings_Setup.ipynb b/UseCases/Banking_Customer_Churn_IVSM/Step1_Banking_Customer_Churn_Sentiment_Analysis.ipynb similarity index 53% rename from UseCases/Banking_Customer_Churn_IVSM/Step2.IVSM_Banking_Customer_Churn_Embeddings_Setup.ipynb rename to UseCases/Banking_Customer_Churn_IVSM/Step1_Banking_Customer_Churn_Sentiment_Analysis.ipynb index 8f17a0f5..0a9551ff 100644 --- a/UseCases/Banking_Customer_Churn_IVSM/Step2.IVSM_Banking_Customer_Churn_Embeddings_Setup.ipynb +++ b/UseCases/Banking_Customer_Churn_IVSM/Step1_Banking_Customer_Churn_Sentiment_Analysis.ipynb @@ -7,13 +7,43 @@ "source": [ "
\n", "

\n", - " IVSM Banking Customer Churn Embeddings Setup\n", + " Sentiment Analysis for Banking Customer Complaints\n", "
\n", " \"Teradata\"\n", "

\n", "
" ] }, + { + "cell_type": "markdown", + "id": "85553047-2fab-4f42-9061-9116923f1e97", + "metadata": {}, + "source": [ + "

Introduction

\n", + "\n", + "
\n", + "

Source: Blog

\n", + "\n", + "

\n", + " In today’s digital era, banks receive thousands of customer complaints through multiple channels such as emails, social media, and online portals. Understanding the sentiment behind these complaints is crucial for improving customer experience, prioritizing issues, and maintaining trust. Traditional sentiment analysis methods often rely on keyword-based approaches, which can struggle with nuanced language, sarcasm, or domain-specific terminology common in banking.

\n", + "\n", + "

\n", + " To overcome these limitations, embedding-based sentiment analysis has emerged as a powerful technique. By converting textual complaints into high-dimensional vector representations, embeddings capture semantic meaning beyond simple word matching. These vectors allow us to measure similarity and sentiment polarity using vector distance metrics such as cosine similarity or Euclidean distance. This approach enables more accurate classification of complaints into positive, negative, or neutral sentiments, even when the language is complex or indirect.

\n", + "\n", + "

          \nLeveraging embeddings and vector distance for sentiment analysis not only enhances precision but also opens the door to advanced applications like clustering similar complaints, detecting emerging issues, and automating response prioritization. This demo explores how these techniques can be applied effectively in the banking domain to transform raw customer feedback into actionable insights.</p>
          

\n", + " \n", + "

Steps in the analysis:

\n", + "
    \n", + "
            1. Initiate a connection to Vantage<br>
            2. Load HuggingFace Model<br>
            3. Generate embeddings for Complaints<br>
            4. Generate embeddings for Sentiments<br>
            5. Semantic search using Teradata Vantage's in-DB function VectorDistance<br>
            6. Cleanup
          
" + ] + }, { "cell_type": "markdown", "id": "30752619-09a6-4a8e-8ec5-94532a27d630", @@ -53,7 +83,13 @@ "from teradataml import (\n", " DataFrame,\n", " in_schema,\n", - " create_context\n", + " create_context,\n", + " ONNXEmbeddings,\n", + " delete_byom, \n", + " display,\n", + " execute_sql,\n", + " save_byom,\n", + " configure,\n", ")" ] }, @@ -66,7 +102,8 @@ }, "outputs": [], "source": [ - " tdml.configure.val_install_location = \"val\"" + "tdml.configure.val_install_location = \"val\"\n", + "tdml.byom_install_location = \"mldb\"" ] }, { @@ -93,39 +130,14 @@ "print(eng)" ] }, - { - "cell_type": "markdown", - "id": "ce6e386d-fda2-4d56-a8c7-11e6fc31e08a", - "metadata": {}, - "source": [ - "
\n", - "

2. Confirmation for functions\n", - "

Before starting we'll confirm that the required functions are installed.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b1080427-025e-499b-ac85-f2977e4add8e", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display, Markdown\n", - "\n", - "df_check= DataFrame.from_query('''select count(*) as cnt from dbc.tablesV where databasename = 'ivsm';''')\n", - "if df_check.get_values()[0][0] >= 10:\n", - " print('Functions are installed, please continue.')\n", - "else:\n", - " print('Functions are not installed, please go to Instalization notebook before proceeding further')\n", - " display(Markdown(\"[Initialization Notebook](./1.IVSM_Banking_Customer_Churn_Model_Install.ipynb)\"))" - ] - }, { "cell_type": "markdown", "id": "800d3acd-0621-4625-a019-d1cdbae6bc89", "metadata": {}, "source": [ - "2.1 Drop Tables (if exist)\n", + "
\n", + "\n", + "1.1 Drop Tables (if exist)\n", "

Now attempt to drop the complaint_embeddings_store and complaints tables, ignoring errors if they don't exist.

" ] }, @@ -192,367 +204,340 @@ ] }, { - "cell_type": "markdown", - "id": "cbedeeed-91e5-4be2-ab50-57411411db00", + "cell_type": "code", + "execution_count": null, + "id": "1f32dbdd-1282-4986-a432-c34d98dfe1a4", "metadata": {}, + "outputs": [], "source": [ - "
\n", - "3. Creation of the view with tokenized original texts" + "tdf = tdf.assign(txt=tdf.Customer_Complaint)" ] }, { "cell_type": "markdown", - "id": "f8142eed-5643-4976-84cb-192618106c7d", + "id": "cbedeeed-91e5-4be2-ab50-57411411db00", "metadata": {}, "source": [ - "

This code creates a view named v_pdf_tokenized_for_embeddings that contains tokenized consumer complaint data for embedding purposes. It selects the id, txt (complaint text), input_ids (tokenized representations), and attention_mask from a tokenization function ivsm.tokenizer_encode." + "


\n", + "

2. Load HuggingFace Model\n", + "

          To generate embeddings, we need an ONNX model capable of transforming text into vector representations. We use a pretrained model from [Teradata's Hugging Face repository](https://huggingface.co/Teradata/bge-base-en-v1.5), such as bge-base-en-v1.5. The model and its tokenizer are downloaded and stored in Vantage tables as BLOBs using the save_byom function.</p>
          

" ] }, { "cell_type": "code", "execution_count": null, - "id": "5203ad1f-ebfd-48ec-bd8f-7c018a8f9691", - "metadata": { - "tags": [] - }, + "id": "d85fd7ac-b627-4ea1-b8d6-4c28a4a31ab8", + "metadata": {}, "outputs": [], "source": [ - "tdml.execute_sql(\"\"\"\n", - "\n", - "Replace view v_pdf_tokenized_for_embeddings as (\n", - " select\n", - " top 1000 id,\n", - " txt,\n", - " IDS as input_ids,\n", - " attention_mask\n", - " from ivsm.tokenizer_encode(\n", - " on (select CustomerId as id,\n", - " Customer_Complaint as txt from DEMO_BankChurnIVSM.Complaints)\n", - " on (select model as tokenizer \n", - " from embeddings_tokenizers where model_id = 'bge-small-en-v1.5')\n", - " DIMENSION\n", - " USING\n", - " ColumnsToPreserve('id', 'txt')\n", - " OutputFields('IDS', 'ATTENTION_MASK')\n", - " MaxLength(1024)\n", - " PadToMaxLength('True')\n", - " TokenDataType('INT64')\n", - " ) a\n", - ")\n", - "\"\"\")" + "import os\n", + "os.environ[\"HF_HUB_DISABLE_PROGRESS_BARS\"] = \"1\"\n", + "os.environ[\"HF_HUB_DISABLE_SYMLINKS_WARNING\"] = \"1\"\n", + "os.environ[\"HF_HUB_ENABLE_HF_TRANSFER\"] = \"0\"" ] }, { "cell_type": "code", "execution_count": null, - "id": "6701bb33-9395-4a82-a00b-cf98a2dc297f", + "id": "354e3c25-9fe4-4378-bd43-118d2917e192", "metadata": {}, "outputs": [], "source": [ - "tdml.DataFrame('v_pdf_tokenized_for_embeddings').head()" + "from huggingface_hub import hf_hub_download\n", + "\n", + "model_name = \"bge-base-en-v1.5\"\n", + "number_dimensions_output = 768\n", + "model_file_name = \"model.onnx\"" ] }, { - "cell_type": "markdown", - "id": "95203cc0-0444-493e-b839-9169493cbf10", + "cell_type": "code", + "execution_count": null, + "id": "427362c0-a4ac-4174-90e1-535ee8f01725", "metadata": {}, + "outputs": [], "source": [ - "
\n", - "3.1 Creation of the view with calculated binary embeddings" + "# Step 1: Download Model from Teradata HuggingFace Page\n", + "\n", + "hf_hub_download(repo_id=f\"Teradata/{model_name}\", filename=f\"onnx/{model_file_name}\", local_dir=\"./\")\n", + "hf_hub_download(repo_id=f\"Teradata/{model_name}\", filename=f\"tokenizer.json\", local_dir=\"./\")" ] }, { "cell_type": "markdown", - "id": "2aef1f84-08cf-4ad1-814b-6d4c4abc6030", + "id": "d8a08d89-b21b-4d0f-ac96-3ee5cf0565a7", "metadata": {}, "source": [ - "

          This code creates a view named complaint_embeddings that stores the computed embeddings (vector representations) of the consumer complaint texts. The embeddings are generated using the ivsm.IVSM_score function, which scores input data with the specified model.</p>
          

" + "
\n", + "

2.1 Save the Model

\n", + "

          In the steps above, we downloaded the model and tokenizer in ONNX format. Now we will save them to Vantage tables.</p>
          

" ] }, { "cell_type": "code", "execution_count": null, - "id": "eff9d52d-0571-4fdd-8c77-aa99b0c119d4", - "metadata": { - "tags": [] - }, + "id": "cde3d2ff-3497-4a2c-80f6-35ba3d9a6f70", + "metadata": {}, "outputs": [], "source": [ - "tdml.execute_sql(\"\"\"\n", - "Replace view complaint_embeddings as (\n", - " select \n", - " *\n", - " from ivsm.IVSM_score(\n", - " on v_pdf_tokenized_for_embeddings -- table with data to be scored\n", - " on (select * from embeddings_models where model_id = 'bge-small-en-v1.5') dimension\n", - " using\n", - " ColumnsToPreserve('id', 'txt') -- columns to be copied from input table\n", - " ModelType('ONNX') -- model format\n", - " BinaryInputFields('input_ids', 'attention_mask') -- enables binary input vectors\n", - " BinaryOutputFields('sentence_embedding')\n", - " Caching('inquery') -- tun on model caching within the query\n", - " ) a \n", - ")\n", - "\n", - "\"\"\")" + "try:\n", + " tdml.db_drop_table(\"embeddings_models\")\n", + "except Exception as e:\n", + " pass\n", + "try:\n", + " tdml.db_drop_table(\"embeddings_tokenizers\")\n", + "except:\n", + " pass" ] }, { "cell_type": "code", "execution_count": null, - "id": "cd38b02d-cc91-4cba-9086-0a1502366c84", + "id": "53af356f-3e78-4af5-9b59-c5e66833e81d", "metadata": {}, "outputs": [], "source": [ - "tdml.DataFrame('complaint_embeddings').head(2)" + "# Step 2: Load Models into Vantage\n", + "# a) Embedding model\n", + "save_byom(model_id = model_name, # must be unique in the models table\n", + " model_file = f\"onnx/{model_file_name}\",\n", + " table_name = 'embeddings_models' )\n", + "# b) Tokenizer\n", + "save_byom(model_id = model_name, # must be unique in the models table\n", + " model_file = 'tokenizer.json',\n", + " table_name = 'embeddings_tokenizers') " ] }, { "cell_type": "markdown", - "id": "f77fc0ed-d38d-400d-9351-e645ffbfb665", + "id": "85f3c8d3-4ec1-4cee-8634-de482788998e", "metadata": {}, "source": [ - "
\n", - "

3.2 Creating Final Embeddings table

\n", - "

          In this step, we create the embeddings table with a separate column for each embedding dimension, essentially converting an array into individual columns.</p>
          

" + "

Recheck the installed model and tokenizer" ] }, { "cell_type": "code", "execution_count": null, - "id": "4d5c4351-d673-4691-8a19-4a694e784035", - "metadata": { - "tags": [] - }, + "id": "19dde44f-1aca-4f1e-be90-4c7a8703b6ff", + "metadata": {}, "outputs": [], "source": [ - "tdml.execute_sql(\"\"\"\n", - "create table complaint_embeddings_store as (\n", - " select \n", - " *\n", - " from ivsm.vector_to_columns(\n", - " on complaint_embeddings\n", - " using\n", - " ColumnsToPreserve('id', 'txt') \n", - " VectorDataType('FLOAT32')\n", - " VectorLength(384)\n", - " OutputColumnPrefix('emb_')\n", - " InputColumnName('sentence_embedding')\n", - " ) a \n", - ") with data\n", - "\n", - "\"\"\")" + "df_model = DataFrame('embeddings_models')\n", + "df_model" ] }, { "cell_type": "code", "execution_count": null, - "id": "b128ce33-27a0-45ef-ba24-5ee583fa97c3", - "metadata": { - "tags": [] - }, + "id": "f827989c-736f-462a-864d-763679415240", + "metadata": {}, "outputs": [], "source": [ - "tdml.DataFrame('complaint_embeddings_store').head()" + "df_token = DataFrame('embeddings_tokenizers')\n", + "df_token" + ] + }, + { + "cell_type": "markdown", + "id": "0134c4f6-61c4-4c7c-9f6a-19978f4be3ae", + "metadata": {}, + "source": [ + "

Load the mode that we have save to DB in previous notebook by passing Model ID.

" ] }, { "cell_type": "code", "execution_count": null, - "id": "cf3628eb-73a0-4c6c-a98d-0794dd1ab2e7", - "metadata": { - "tags": [] - }, + "id": "bc277932-60cf-4f14-92db-546f33c2dafc", + "metadata": {}, "outputs": [], "source": [ - "sent_df = pd.DataFrame({'id': [1,2],\n", - " 'txt': ['Positive and Upbeat comment',\n", - " 'Negative or Abusive comment',\n", - " ]})\n", - "\n", - "tdml.copy_to_sql(sent_df,table_name='sentiment_topics', if_exists='replace', index=False)" + "my_model = DataFrame.from_query(f\"select * from embeddings_models where model_id = '{model_name}'\")\n", + "my_tokenizer = DataFrame.from_query(f\"select model as tokenizer from embeddings_tokenizers where model_id = '{model_name}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "f2595b11-f4cc-4210-bc74-db9a7bf6cbba", + "metadata": {}, + "source": [ + "
\n", + "3. Generate embeddings for Complaints" + ] + }, + { + "cell_type": "markdown", + "id": "d0aeab53-ff75-49c0-a893-636dc8ed0c64", + "metadata": {}, + "source": [ + "

This code generate the embeddings for complaints using ONNXEmbeddings in-db function." ] }, { "cell_type": "code", "execution_count": null, - "id": "c1c5fed2-b0b3-4414-8558-f475832e65e4", + "id": "8d77a3e2-ddfc-4e12-8639-77ddae6ba106", "metadata": {}, "outputs": [], "source": [ - "tdml.DataFrame('sentiment_topics').head()" + "tdml.configure.byom_install_location = \"mldb\"" ] }, { - "cell_type": "markdown", - "id": "7e974511-8c44-4d6d-b326-222c42abd531", + "cell_type": "code", + "execution_count": null, + "id": "70eacd9a-80cd-4168-beb9-22b95c574aec", "metadata": {}, + "outputs": [], "source": [ - "


\n", - "3.3 Create Tokenized View" + "DF_embeddings_complaints = ONNXEmbeddings(\n", + " newdata = tdf.iloc[:100],\n", + " modeldata = my_model, \n", + " tokenizerdata = my_tokenizer, \n", + " accumulate = [\"CustomerId\", \"Customer_Complaint\"],\n", + " model_output_tensor = \"sentence_embedding\",\n", + " output_format = f'FLOAT32({number_dimensions_output})',\n", + " enable_memory_check = False\n", + ").result" ] }, { "cell_type": "markdown", - "id": "9e36747f-4f91-4557-b3e3-b4d6f4814cfd", + "id": "eadf5248-e802-4c45-9d73-9b6f7c47f099", "metadata": {}, "source": [ - "

Creates a view v_sentiment_tokenized_for_embeddings by applying a tokenizer to the sentiment_topics table using the specified model.

" + "

Now that the embeddings are generated, let's copy them to the database for further use.

" ] }, { "cell_type": "code", "execution_count": null, - "id": "159d3482-151a-4f62-9db0-a7d9fc3944dd", - "metadata": { - "tags": [] - }, + "id": "ce91db5a-0062-4881-8298-55067d43160c", + "metadata": {}, "outputs": [], "source": [ - "tdml.execute_sql(\"\"\"\n", - "replace view v_sentiment_tokenized_for_embeddings as (\n", - " select\n", - " id,\n", - " txt,\n", - " IDS as input_ids,\n", - " attention_mask\n", - " from ivsm.tokenizer_encode(\n", - " on (select * from sentiment_topics)\n", - " on (select model as tokenizer from embeddings_tokenizers where model_id = 'bge-small-en-v1.5') DIMENSION\n", - " USING\n", - " ColumnsToPreserve('id', 'txt')\n", - " OutputFields('IDS', 'ATTENTION_MASK')\n", - " MaxLength(1024)\n", - " PadToMaxLength('True')\n", - " TokenDataType('INT64')\n", - " ) a\n", - ")\n", - "\"\"\")" + "tdml.copy_to_sql(DF_embeddings_complaints,table_name='complaint_embeddings_store', if_exists='replace', index=False)\n", + "\n", + "tdf_complaint_embeddings_store = tdml.DataFrame('complaint_embeddings_store')" ] }, { "cell_type": "code", "execution_count": null, - "id": "61304418-a91a-4732-a435-fb33450f8238", + "id": "abe7b0dc-97c5-4ce6-8130-dafe5ce310e1", "metadata": {}, "outputs": [], "source": [ - "tdml.DataFrame('v_sentiment_tokenized_for_embeddings').head()" + "tdf_complaint_embeddings_store" ] }, { "cell_type": "markdown", - "id": "eb5d18fe-5047-4c38-b1e9-8eb31698e4ee", + "id": "1ecb9142-4323-4ae5-bb46-e288a0f92c68", "metadata": {}, "source": [ - "

Defines sentiment_topics_embeddings view by generating sentence embeddings using the IVSM_score function and a specified ONNX model.

" + "
\n", + "4. Generate embeddings for Sentiments\n", + "\n", + "

For sentiment analysis, we will create a table containing the sentiment topics and then generate embeddings for them.

" ] }, { "cell_type": "code", "execution_count": null, - "id": "7017175a-99e8-4b2d-b839-af509c1f9a61", - "metadata": { - "tags": [] - }, + "id": "56f08b78-7d48-4625-8438-7fb360887ca8", + "metadata": {}, "outputs": [], "source": [ - "tdml.execute_sql(\"\"\"\n", - "replace view sentiment_topics_embeddings as (\n", - " select \n", - " *\n", - " from ivsm.IVSM_score(\n", - " on v_sentiment_tokenized_for_embeddings -- table with data to be scored\n", - " on (select * from embeddings_models where model_id = 'bge-small-en-v1.5') dimension\n", - " using\n", - " ColumnsToPreserve('id', 'txt') -- columns to be copied from input table\n", - " ModelType('ONNX') -- model format\n", - " BinaryInputFields('input_ids', 'attention_mask') -- enables binary input vectors\n", - " BinaryOutputFields('sentence_embedding')\n", - " Caching('inquery') -- tun on model caching within the query\n", - " ) a \n", - ")\n", - "\"\"\")" + "sent_df = pd.DataFrame({'id': [1,2],\n", + " 'txt': ['Positive and Upbeat comment',\n", + " 'Negative or Abusive comment',\n", + " ]})\n", + "\n", + "tdml.copy_to_sql(sent_df,table_name='sentiment_topics', if_exists='replace', index=False)" ] }, { "cell_type": "code", "execution_count": null, - "id": "784d790e-c979-4fbf-ad27-512f7a592024", + "id": "00925cb6-8fd0-4523-a414-a4c8136215f0", "metadata": {}, "outputs": [], "source": [ - "tdml.DataFrame('sentiment_topics_embeddings').head()" + "tdf_sent = tdml.DataFrame('sentiment_topics')" ] }, { "cell_type": "code", "execution_count": null, - "id": "0abcc8c3-f426-42c7-af9f-115a3602bc7b", - "metadata": { - "tags": [] - }, + "id": "afc47b6b-663c-4453-9dc6-0e2c80283bd8", + "metadata": {}, "outputs": [], "source": [ - "try:\n", - " tdml.db_drop_table(\"sentiment_topics_embeddings_store\")\n", - "except:\n", - " True" + "tdf_sent" ] }, { - "cell_type": "markdown", - "id": "078dc3ee-6b9b-48b9-b999-d1ed2a64b018", + "cell_type": "code", + "execution_count": null, + "id": "fc18ada2-912c-4035-b6c2-7aa26f9a5144", "metadata": {}, + "outputs": [], "source": [ - "
\n", - "3.4 Store Embeddings as Columns" + "DF_embeddings_sent = ONNXEmbeddings(\n", + " newdata = tdf_sent,\n", + " modeldata = my_model, \n", + " tokenizerdata = my_tokenizer, \n", + " accumulate = [\"id\", \"txt\"],\n", + " model_output_tensor = \"sentence_embedding\",\n", + " output_format = f'FLOAT32({number_dimensions_output})',\n", + " enable_memory_check = False\n", + ").result" ] }, { - "cell_type": "markdown", - "id": "a81609a4-b95a-4a70-a24c-2d6a67fc863b", + "cell_type": "code", + "execution_count": null, + "id": "2f4adab4-4414-4e67-9554-bb25d2fb7b3b", "metadata": {}, + "outputs": [], "source": [ - "

\n", - "Creates a table sentiment_topics_embeddings_store by converting the sentence embeddings into individual float columns using vector_to_columns.\n", - "

" + "DF_embeddings_sent" ] }, { "cell_type": "code", "execution_count": null, - "id": "9f0bc3db-3f34-4c0e-a479-7042d94c76ff", + "id": "0abcc8c3-f426-42c7-af9f-115a3602bc7b", "metadata": { "tags": [] }, "outputs": [], "source": [ - "tdml.execute_sql(\"\"\"\n", - "create table sentiment_topics_embeddings_store as (\n", - " select \n", - " *\n", - " from ivsm.vector_to_columns(\n", - " on sentiment_topics_embeddings\n", - " using\n", - " ColumnsToPreserve('id', 'txt') \n", - " VectorDataType('FLOAT32')\n", - " VectorLength(384)\n", - " OutputColumnPrefix('emb_')\n", - " InputColumnName('sentence_embedding')\n", - " ) a \n", - ") with data\n", - "\"\"\")" + "try:\n", + " tdml.db_drop_table(\"sentiment_topics_embeddings_store\")\n", + "except:\n", + " True" + ] + }, + { + "cell_type": "markdown", + "id": "412aec1e-1f37-4f1a-84cf-1ea3bf384b1e", + "metadata": {}, + "source": [ + "

Now that the sentiment embeddings are generated, let's copy them to the database for further use.

" ] }, { "cell_type": "code", "execution_count": null, - "id": "3019b147-8c97-4909-82e0-ba9377b2108d", + "id": "4f78c21d-72b3-4003-95e0-117a813b72da", "metadata": {}, "outputs": [], "source": [ - "tdml.DataFrame('sentiment_topics_embeddings_store').head()" + "tdml.copy_to_sql(DF_embeddings_sent,table_name='sentiment_topics_embeddings_store', if_exists='replace', index=False)\n", + "tdf_sentiment_topics_embeddings = tdml.DataFrame('sentiment_topics_embeddings_store')" ] }, { @@ -575,8 +560,8 @@ "id": "faeabbc6-954e-4579-b132-73e51f04b782", "metadata": {}, "source": [ - "
\n", - "3.5 Semantic Search Results Table" + "
\n", + "5 Find the Semantic Search using Teradata's Vantage in-DB function - VectorDistance" ] }, { @@ -586,7 +571,11 @@ "source": [ "

\n", "Creates semantic_search_results table by finding the most similar sentiment topic for each complaint using cosine similarity on embeddings.\n", - "

\n" + "

\n", + "\n", + "

The TD_VectorDistance function accepts a table of target vectors and a table of reference vectors and returns a table that contains the distance between target-reference pairs.

\n", + "\n", + "

If you provide only one table as the input, the function computes the distance between target and reference vector pairs drawn from that same table.

" ] }, { @@ -598,13 +587,15 @@ }, "outputs": [], "source": [ - "tdml.execute_sql(\"\"\"\n", + "emb_col_names = tdf_sentiment_topics_embeddings.columns[2:]\n", + "\n", + "tdml.execute_sql(f\"\"\"\n", "create multiset table semantic_search_results\n", "as (\n", "SELECT \n", " dt.target_id,\n", " dt.reference_id,\n", - " e_tgt.txt as target_txt,\n", + " e_tgt.Customer_Complaint as target_txt,\n", " e_ref.txt as reference_txt,\n", " (1.0 - dt.distance) as similarity \n", "FROM\n", @@ -612,29 +603,19 @@ " ON complaint_embeddings_store AS TargetTable\n", " ON sentiment_topics_embeddings_store AS ReferenceTable DIMENSION\n", " USING\n", - " TargetIDColumn('id')\n", - " TargetFeatureColumns('[emb_0:emb_383]')\n", + " TargetIDColumn('CustomerId')\n", + " TargetFeatureColumns('[emb_0:emb_767]')\n", " RefIDColumn('id')\n", - " RefFeatureColumns('[emb_0:emb_383]')\n", + " RefFeatureColumns('[emb_0:emb_767]')\n", " DistanceMeasure('cosine')\n", " topk(1) -- Only want the best match per complaint. If you want multi-label/multi-class - you can increase it\n", " ) AS dt\n", - "JOIN complaint_embeddings_store e_tgt on e_tgt.id = dt.target_id\n", + "JOIN complaint_embeddings_store e_tgt on e_tgt.CustomerId = dt.target_id\n", "JOIN sentiment_topics_embeddings_store e_ref on e_ref.id = dt.reference_id\n", ") with data\n", "\"\"\")" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "e60cf7a8-312b-47ec-a5ba-382ec23f587e", - "metadata": {}, - "outputs": [], - "source": [ - "tdml.DataFrame('semantic_search_results').head()" - ] - }, { "cell_type": "code", "execution_count": null, @@ -657,7 +638,7 @@ }, "outputs": [], "source": [ - "df[df['reference_txt'] == 'Positive and Upbeat comment']" + "df[df['reference_txt'] == 'Negative or Abusive comment']" ] }, { @@ -666,7 +647,7 @@ "metadata": {}, "source": [ "
\n", - "4. Cleanup\n", + "6. Cleanup\n", "

The following code will remove the context.

" ] }, @@ -731,7 +712,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.10" + "version": "3.11.14" } }, "nbformat": 4, diff --git a/UseCases/Banking_Customer_Churn_IVSM/3.IVSM_Banking_Customer_Churn_embed_BYOM.ipynb b/UseCases/Banking_Customer_Churn_IVSM/Step2_Train_Churn_Model_using_BYOM.ipynb similarity index 80% rename from UseCases/Banking_Customer_Churn_IVSM/3.IVSM_Banking_Customer_Churn_embed_BYOM.ipynb rename to UseCases/Banking_Customer_Churn_IVSM/Step2_Train_Churn_Model_using_BYOM.ipynb index 4c3b660d..1a34fade 100644 --- a/UseCases/Banking_Customer_Churn_IVSM/3.IVSM_Banking_Customer_Churn_embed_BYOM.ipynb +++ b/UseCases/Banking_Customer_Churn_IVSM/Step2_Train_Churn_Model_using_BYOM.ipynb @@ -7,19 +7,79 @@ "source": [ "
\n", "

\n", - " IVSM Banking Customer Churn Embed BYOM\n", + " Train and Export Banking Customer Churn Model using BYOM\n", "
\n", " \"Teradata\"\n", "

\n", "
" ] }, + { + "cell_type": "markdown", + "id": "9fe96b47-414b-40de-a0d3-272246c72223", + "metadata": {}, + "source": [ + "

Introduction

\n", + "\n", + "
\n", + "

Source: Medium

\n", + "\n", + "

Customer churn is a critical metric in banking because it can directly impact a bank's revenue and profitability. When customers leave, banks lose the income they would have earned from those customers' transactions, investments, and account fees. Additionally, attracting new customers to replace those who have left can be expensive and time-consuming, so reducing customer churn is often more cost-effective than acquiring new customers.

\n", + "\n", + "

Customer churn can also be an indicator of customer satisfaction and loyalty. If customers leave at a high rate, they may be dissatisfied with the bank's products or services, customer service, or overall experience.

\n", + "\n", + "

Banks can use various strategies to reduce customer churn, such as improving customer service, offering more competitive rates and fees, providing personalized recommendations and offers, and enhancing digital channels and mobile apps. By tracking and analyzing customer churn rates, banks can identify areas for improvement and make strategic decisions to retain customers and improve overall customer satisfaction.

\n", + "\n", + "

In this demo, we demonstrate how to implement the entire lifecycle of churn prediction using Vantage technologies, specifically the combination of Bring Your Own Model (BYOM), the Vantage Analytics Library (VAL), and the teradataml Python client library.

\n", + "\n", + "

\n", + " Business Value of Customer Churn in Banking\n", + "

\n", + "

\n", + " Customer churn refers to the loss of clients who stop using a bank’s products or services. In the banking industry, churn has significant financial and strategic implications:\n", + "

\n", + "

\n", + " 1. Revenue Loss: Each lost customer reduces recurring revenue from deposits, loans, credit cards, and investment products. High churn directly impacts profitability.\n", + "

\n", + "

\n", + " 2. High Acquisition Costs: Acquiring a new customer is often 5–7 times more expensive than retaining an existing one. Churn increases marketing and onboarding expenses.\n", + "

\n", + "

\n", + " 3. Reduced Lifetime Value (CLV): Banks rely on long-term relationships to cross-sell products. Churn shortens customer lifecycles, reducing opportunities for upselling and cross-selling.\n", + "

\n", + "

\n", + " 4. Brand Reputation & Trust: Frequent churn signals dissatisfaction, which can harm brand reputation and lead to negative word-of-mouth, further accelerating customer loss.\n", + "

\n", + "

\n", + " 5. Regulatory & Competitive Pressure: In highly regulated markets, churn can indicate compliance or service gaps. Competitors can capitalize on these weaknesses, eroding market share.\n", + "

\n", + "

\n", + " 6. Predictive Insights for Growth: Understanding churn drivers helps banks improve customer experience, personalize offerings, and design retention strategies—turning risk into opportunity.\n", + "

\n", + "

Why Vantage?

\n", + "

The ML and AI industry continues to innovate at an unprecedented rate. Tools, technologies, and algorithms are being developed and improved in both the open-source and commercial communities.

\n", + "\n", + "

Unfortunately, many of these techniques haven’t matured to the point where they are readily deployable to a stable, mature operational environment. Furthermore, many open-source techniques rely on fragile, manual enabling technologies.

\n", + "\n", + "

ClearScape Analytics Bring Your Own Model capabilities allow organizations to leverage third-party and open-source models for scoring inside the Vantage Platform, providing enterprise-class scalability and operational stability for any number of users, applications, or volumes of data.

\n", + "\n", + "

Steps in the analysis:

\n", + "
    \n", + "
1. Initiate a connection to Vantage\n",
+    "2. Data Transformation\n",
+    "3. Train-Test Split\n",
+    "4. Grant Access to ModelOps\n",
+    "5. Convert the model to PMML\n",
+    "6. Cleanup\n",
+    "
" + ] + }, { "cell_type": "markdown", "id": "8c065558-5fb5-40fb-9893-a6c222cf3604", "metadata": {}, "source": [ - "
\n", + "##
\n", "

Import the required libraries

\n", "\n", "

Here, we import the required libraries, set environment variables and environment paths (if required).

" @@ -31,7 +91,7 @@ "metadata": {}, "source": [ "
\n", - "

Note: Please ensure that 2.IVSM_Banking_Customer_Churn_Embeddings_Setup is executed before running this file.

\n", + "

Note: Please execute the Step1 notebook before executing this notebook.

\n", "
" ] }, @@ -111,32 +171,6 @@ "%run -i ../run_procedure.py \"call get_data('DEMO_BankChurnIVSM_local');\"" ] }, - { - "cell_type": "markdown", - "id": "7a44f6bc-d5e8-4e19-8625-e05342564ddc", - "metadata": {}, - "source": [ - "

1.1 Confirmation for functions\n", - "

Confirm that the required functions are installed.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "46c8527e-a47e-461a-b44c-9fc748970d23", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display, Markdown\n", - "\n", - "df_check= DataFrame.from_query('''select count(*) as cnt from dbc.tablesV where databasename = 'ivsm';''')\n", - "if df_check.get_values()[0][0] >= 10:\n", - " print('Functions are installed, please continue.')\n", - "else:\n", - " print('Functions are not installed, please go to Instalization notebook before proceeding further')\n", - " display(Markdown(\"[Initialization Notebook](./1.IVSM_Banking_Customer_Churn_Model_Install.ipynb)\"))" - ] - }, { "cell_type": "code", "execution_count": null, @@ -716,7 +750,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.10" + "version": "3.11.14" } }, "nbformat": 4, diff --git a/UseCases/Banking_Customer_Churn_IVSM/Step3.IVSM_Banking_Customer_Churn_embed_BYOM.ipynb b/UseCases/Banking_Customer_Churn_IVSM/Step3.IVSM_Banking_Customer_Churn_embed_BYOM.ipynb deleted file mode 100644 index 2504e40f..00000000 --- a/UseCases/Banking_Customer_Churn_IVSM/Step3.IVSM_Banking_Customer_Churn_embed_BYOM.ipynb +++ /dev/null @@ -1,724 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "cedbd4af-6f37-4049-8aaa-863f77ce94f4", - "metadata": {}, - "source": [ - "
\n", - "

\n", - " IVSM Banking Customer Churn Embed BYOM\n", - "
\n", - " \"Teradata\"\n", - "

\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "8c065558-5fb5-40fb-9893-a6c222cf3604", - "metadata": {}, - "source": [ - "
\n", - "

Import the required libraries

\n", - "\n", - "

Here, we import the required libraries, set environment variables and environment paths (if required).

" - ] - }, - { - "cell_type": "markdown", - "id": "cf5a95ec-4d8b-46f6-826a-38703a38567d", - "metadata": {}, - "source": [ - "
\n", - "

Note: Please execute the Step1 and Step2 notebooks before executing this notebook.

\n", - "
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "61bdabd9-d9af-4008-81fa-44bf639ca35d", - "metadata": {}, - "outputs": [], - "source": [ - "import warnings\n", - "warnings.filterwarnings('ignore')\n", - "\n", - "import os\n", - "import pandas as pd\n", - "\n", - "import teradataml as tdml\n", - "import getpass\n", - "from teradataml import in_schema\n", - "from teradataml import DecisionForest, XGBoost, TrainTestSplit, DecisionForestPredict, XGBoostPredict, SentimentExtractor, ColumnTransformer, ScaleFit, OneHotEncodingFit\n", - "from teradataml import ColumnSummary, AutoML, AutoClassifier\n", - "from teradataml import RoundColumns, ClassificationEvaluator, ROC\n", - "from teradataml import (\n", - " DataFrame,\n", - " create_context\n", - ")\n", - "from xgboost import XGBClassifier\n", - "from sklearn.pipeline import Pipeline\n", - "from nyoka import xgboost_to_pmml\n", - "from teradataml import save_byom,list_byom,retrieve_byom,delete_byom,PMMLPredict" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6de29e63-cb44-4864-8a15-653c0838a64d", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tdml.configure.val_install_location = \"val\"" - ] - }, - { - "cell_type": "markdown", - "id": "30662c01-e0ab-42df-ad03-cd12be7469db", - "metadata": {}, - "source": [ - "
\n", - "1. Initiate a connection to Vantage\n", - "

You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4ccd0aa4-27a3-4a1a-b3e4-7d2870345106", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "%run -i ../startup.ipynb\n", - "eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)\n", - "print(eng)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "115a55eb-bf63-4cbc-9e9f-e1060ee38423", - "metadata": {}, - "outputs": [], - "source": [ - "# %run -i ../run_procedure.py \"call get_data('DEMO_BankChurnIVSM_cloud');\" \n", - "%run -i ../run_procedure.py \"call get_data('DEMO_BankChurnIVSM_local');\"" - ] - }, - { - "cell_type": "markdown", - "id": "7a44f6bc-d5e8-4e19-8625-e05342564ddc", - "metadata": {}, - "source": [ - "

1.1 Confirmation for functions\n", - "

Confirm that the required functions are installed.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "46c8527e-a47e-461a-b44c-9fc748970d23", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display, Markdown\n", - "\n", - "df_check= DataFrame.from_query('''select count(*) as cnt from dbc.tablesV where databasename = 'ivsm';''')\n", - "if df_check.get_values()[0][0] >= 10:\n", - " print('Functions are installed, please continue.')\n", - "else:\n", - " print('Functions are not installed, please go to Instalization notebook before proceeding further')\n", - " display(Markdown(\"[Initialization Notebook](./1.IVSM_Banking_Customer_Churn_Model_Install.ipynb)\"))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6b296443-5871-4efd-a43f-78d6e9405779", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "df = tdml.DataFrame('semantic_search_results')\n", - "df[df['reference_txt'] == 'Negative or Abusive comment']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cc71fe1f-d0f5-4811-be0e-72d1a141a68d", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "df[df['reference_txt'] == 'Positive and Upbeat comment']" - ] - }, - { - "cell_type": "markdown", - "id": "6b726116-67b1-4e51-a74a-74377497b77b", - "metadata": {}, - "source": [ - "

Create a \"Virtual DataFrame\" that points to the data set in Vantage.

\n", - "

*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "36b36049-7737-4a50-8b73-e0bf8fdd72d2", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "customer_churn = DataFrame(in_schema('DEMO_BankChurnIVSM', 'Bank_Churn'))\n", - "customer_churn" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c777aa80-b025-407d-a960-a05ffda40c66", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "new_df = customer_churn.merge(df[['target_id','reference_txt']], on='customerid = target_id', how='inner')\n", - "new_df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dc0233da-f64a-4790-80dc-c5d516a113d4", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "new_df = new_df.drop('target_id',axis=1)" - ] - }, - { - "cell_type": "markdown", - "id": "5631b307-eeb2-477b-880c-810c30486095", - "metadata": {}, - "source": [ - "
\n", - "2. Data Transformation" - ] - }, - { - "cell_type": "markdown", - "id": "eefe14e2-2987-4dc0-9417-401b49ae5887", - "metadata": {}, - "source": [ - "

Define Column Categories

\n", - "

Specifies the target variable and categorizes input columns into numeric, categorical, binary, and identifier groups for preprocessing and modeling.
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ff20075d-ece4-49a5-bf7d-46e275622d63", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "target_variable = \"Exited\"\n", - "numeric_columns = [\"Age\", \"Balance\", \"CreditScore\", \"EstimatedSalary\", \"Tenure\"]\n", - "categorical_columns = [\"Gender\", \"Geography\", \"reference_txt\", \"NumOfProducts\"]\n", - "binary_columns = [\"HasCrCard\", \"IsActiveMember\"]\n", - "id_column = [\"CustomerId\"]" - ] - }, - { - "cell_type": "markdown", - "id": "eaf8d5a8-0348-4f87-9ef6-e588ae8a8f32", - "metadata": {}, - "source": [ - "

The ScaleFit() function outputs the statistics that are input to the ScaleTransform() function, which scales the specified input DataFrame columns.
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "95e23b28-6858-42c7-8224-7efecce93ef8", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "fit1 = ScaleFit(data=new_df,\n", - " target_columns=numeric_columns,\n", - " scale_method=\"USTD\",\n", - " miss_value=\"KEEP\",\n", - " global_scale=False,\n", - " multiplier=\"1\")" - ] - }, - { - "cell_type": "markdown", - "id": "09ad2461-4d3b-4980-b7f2-42f399858861", - "metadata": {}, - "source": [ - "

OneHotEncodingFit outputs a table of attributes and categorical values that is input to OneHotEncodingTransform, which encodes them as one-hot numeric vectors.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e40acda6-2856-4745-a7f7-29000f128b6b", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "fit2 = OneHotEncodingFit(data=new_df,\n", - " is_input_dense=True,\n", - " approach=\"auto\",\n", - " target_column=categorical_columns[0:3],\n", - " category_counts=[2,3,2])" - ] - }, - { - "cell_type": "markdown", - "id": "c3c07d02-0758-496f-bc6d-0b126ce3a423", - "metadata": {}, - "source": [ - "

The ColumnTransformer function transforms the entire dataset in a single operation. You only need\n",
-    "to provide the FIT tables to the function, and the function runs all the transformations that you require in a\n",
-    "single operation. Running all the fit table transformations together in one go gives an approximately 30% performance improvement over running each transformation sequentially.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "43706ada-d4a0-4c55-8250-7433fd4b5b95", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "new_table = ColumnTransformer(input_data=new_df,\n", - " onehotencoding_fit_data=fit2.result,\n", - " scale_fit_data=fit1.output).result" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b177b817-df41-4845-8972-b91224ef76b2", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "new_table=new_table[['CustomerId', 'Age', 'Balance', 'CreditScore', 'EstimatedSalary', 'Exited', 'HasCrCard', 'IsActiveMember',\n", - " 'NumOfProducts', 'Tenure', 'Gender_0', 'Gender_1', 'Geography_0', 'Geography_1', 'Geography_2',\n", - " 'reference_txt_0', 'reference_txt_1']]" - ] - }, - { - "cell_type": "markdown", - "id": "1ec2e735-4c62-4f56-a773-0844c3c49144", - "metadata": {}, - "source": [ - "
\n", - "\n", - "

3. Train-Test Split" - ] - }, - { - "cell_type": "markdown", - "id": "118c32d1-8091-4506-99fe-61e0e3c96594", - "metadata": {}, - "source": [ - "

The TrainTestSplit() function divides the dataset into train and test subsets to be used for evaluating machine learning models and validation processes.
\n", - "80% is used for Training and 20% for validation.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9cbe56b7-5320-439e-a605-7b5d22a2d84c", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "TrainTestSplit_out = TrainTestSplit(data = new_table,\n", - " id_column='CustomerId',\n", - " train_size=0.80,\n", - " test_size=0.20,\n", - " seed=3432)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cc2ee841-8b9d-40b0-88f4-7f08793a0ea5", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "TrainTestSplit_out.result.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ec9c9d04-6f2f-488b-b317-05aac9a0cf57", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "df_train = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 1].drop(['TD_IsTrainRow'], axis = 1)\n", - "df_test = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 0].drop(['TD_IsTrainRow'], axis = 1)\n", - "\n", - "print(\"Training Set = \" + str(df_train.shape[0]) + \". Testing Set = \" + str(df_test.shape[0]))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "abf49441-fd83-4c63-adf7-d6d868f2e8bc", - "metadata": {}, - "outputs": [], - "source": [ - "df_test.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c3398193-36de-43ed-8af4-3369f4465ed1", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tdml.copy_to_sql(df_train, table_name = 'clean_data_train', if_exists = 'replace')\n", - "tdml.copy_to_sql(df_test, table_name = 'clean_data_test', if_exists = 'replace')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e4ee0b66-e0f5-4856-908a-1126542b8021", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "df_train = tdml.DataFrame(in_schema('demo_user','clean_data_train'))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bfbeea0e-0d41-4b2d-809c-89314ce711df", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "df_test = tdml.DataFrame(in_schema('demo_user','clean_data_test'))" - ] - }, - { - "cell_type": "markdown", - "id": "15bbeffa-8d4b-4132-8a62-2c410d64717f", - "metadata": {}, - "source": [ - "

3.1 Split Features and Target

\n", - "

Separates feature columns and target labels for both training and test datasets, keeping CustomerId for reference and including encoded categorical and semantic features.

\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e16fa9bf-61b2-4ec6-8722-0db6a42e4bb6", - "metadata": {}, - "outputs": [], - "source": [ - "df_train_features = df_train[['CustomerId', 'Age', 'Balance', 'CreditScore', 'EstimatedSalary', \n", - " 'HasCrCard', 'IsActiveMember', 'NumOfProducts','Tenure', \n", - " 'Gender_0', 'Gender_1', 'Geography_0', 'Geography_1', \n", - " 'Geography_2', 'reference_txt_0','reference_txt_1']]\n", - "\n", - "df_train_target = df_train[['CustomerId', 'Exited']]\n", - "df_test_features = df_test[['CustomerId', 'Age', 'Balance', 'CreditScore', 'EstimatedSalary', \n", - " 'HasCrCard', 'IsActiveMember', 'NumOfProducts','Tenure', \n", - " 'Gender_0', 'Gender_1', 'Geography_0', 'Geography_1', \n", - " 'Geography_2', 'reference_txt_0','reference_txt_1']]\n", - "\n", - "df_test_target = df_test[['CustomerId', 'Exited']]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "649cfba2-8e03-480f-ad70-d1aa09fdd4f6", - "metadata": {}, - "outputs": [], - "source": [ - "tdml.copy_to_sql(df_train_features, table_name = 'xgb_train_features', if_exists = 'replace', schema_name = 'demo_user')\n", - "tdml.copy_to_sql(df_train_target, table_name = 'xgb_train_target', if_exists = 'replace', schema_name = 'demo_user')\n", - "tdml.copy_to_sql(df_test_features, table_name = 'xgb_test_features', if_exists = 'replace', schema_name = 'demo_user')\n", - "tdml.copy_to_sql(df_test_target, table_name = 'xgb_test_target', if_exists = 'replace', schema_name = 'demo_user')" - ] - }, - { - "cell_type": "markdown", - "id": "68a9cb62-01e3-4632-8d96-04e7def2e544", - "metadata": {}, - "source": [ - "
\n", - "

4. Grant Access to ModelOps

\n", - "

Grants SELECT permissions on training, test, and clean data tables to the modelops role, allowing model deployment processes to access the data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8fafff11-6f72-49ac-8416-9d85c165f29e", - "metadata": {}, - "outputs": [], - "source": [ - "SQL = ['''grant select on demo_user.xgb_train_features to modelops with grant option;''',\n", - " '''grant select on demo_user.xgb_train_target to modelops with grant option;''',\n", - " '''grant select on demo_user.xgb_test_features to modelops with grant option;''',\n", - " '''grant select on demo_user.xgb_test_target to modelops with grant option;''',\n", - " '''grant select on demo_user.clean_data_train to modelops with grant option;''',\n", - " '''grant select on demo_user.clean_data_test to modelops with grant option;''' \n", - " ]\n", - "\n", - "for i in SQL:\n", - " try:\n", - " tdml.execute_sql(i)\n", - " except:\n", - " True" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e308f588-f124-40b2-87a2-7fc41cd94228", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "train_pdf = df_train.to_pandas(all_rows=True)\n", - "\n", - "features = cols = ['Age', 'Balance', 'CreditScore', 'EstimatedSalary', 'HasCrCard', 'IsActiveMember', 'NumOfProducts',\n", - " 'Tenure', 'Gender_0', 'Gender_1', 'Geography_0', 'Geography_1', 'Geography_2', 'reference_txt_0',\n", - " 'reference_txt_1']\n", - "target = \"Exited\"\n", - "\n", - "# split data into X and y\n", - "X_train = train_pdf[features]\n", - "y_train = train_pdf[target]\n", - "\n", - "model = Pipeline([('xgb', XGBClassifier(n_estimators=5, max_depth=10))])\n", - "\n", - "model.fit(X_train, y_train)\n", - "#database = 'modelops'" - ] - }, - { - "cell_type": "markdown", - "id": "8098d66d-8528-4750-bf76-e3f1c14a7de9", - "metadata": {}, - "source": [ - "


\n", - "

5. Convert the model to PMML

\n", - "

You can use the sklearn2pmml or nyoka Python libraries to convert the model to PMML. nyoka is a pure-Python package, so it is preferable.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bcad683b-4839-4495-a2a6-631fbd2def81", - "metadata": {}, - "outputs": [], - "source": [ - "xgboost_to_pmml(\n", - " pipeline=model, \n", - " col_names=cols, \n", - " target_name='Exited', \n", - " pmml_f_name=\"xgb_model.pmml\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4729d988-1d06-403c-9503-64e482345593", - "metadata": {}, - "outputs": [], - "source": [ - "tdml.configure.byom_install_location = \"mldb\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "64accc06-6aa6-4ed7-a48a-ca3247780632", - "metadata": {}, - "outputs": [], - "source": [ - "try:\n", - " save_byom(\"xgb_model\",\n", - " \"xgb_model.pmml\",\n", - " \"byom_models\",\n", - " additional_columns={},\n", - " schema_name='modelops'\n", - " )\n", - "except:\n", - " delete_byom(model_id=\"xgb_model\", table_name=\"byom_models\", schema_name = 'modelops')\n", - " save_byom(\"xgb_model\",\n", - " \"xgb_model.pmml\",\n", - " \"byom_models\",\n", - " additional_columns={},\n", - " schema_name='modelops'\n", - " )" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "248aca94-4cfb-4709-a240-674c7dae4a8c", - "metadata": {}, - "outputs": [], - "source": [ - "list_byom(table_name=\"byom_models\", schema_name=\"modelops\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "50281181-d8fd-496d-a153-a94c5561f6ed", - "metadata": {}, - "outputs": [], - "source": [ - "result = PMMLPredict(\n", - " modeldata = retrieve_byom(\"xgb_model\", \"byom_models\", schema_name=\"modelops\"),\n", - " newdata = df_test,\n", - " accumulate = ['CustomerId'],\n", - " overwrite_cached_models = '*',\n", - ")\n", - "\n", - "print(result.show_query())\n", - "\n", - "result.result" - ] - }, - { - "cell_type": "markdown", - "id": "ad9b3e2d-9118-46ce-afbd-af8a0fe98e8c", - "metadata": {}, - "source": [ - "
\n", - "

6. Clean up

\n", - "

The following code will remove the context.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "36bfae31-60de-4ff9-8a64-8e52c44efd14", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tdml.remove_context()" - ] - }, - { - "cell_type": "markdown", - "id": "3b29170c-d8c9-454f-b428-3da6472f3a8a", - "metadata": {}, - "source": [ - "
\n", - "Dataset:\n", - "\n", - "- `Unnamed`: Unnamed\n", - "- `CustomerId`: Customer ID\n", - "- `Surname`: Surname\n", - "- `CreditScore`: Credit score\n", - "- `Geography`: Country (Germany / France / Spain)\n", - "- `Gender`: Gender (Female / Male)\n", - "- `Age`: Age\n", - "- `Tenure`: No of years the customer has been associated with the bank\n", - "- `Balance`: Balance\n", - "- `NumOfProducts`: No of bank products used\n", - "- `HasCrCard`: Credit card status (0 = No, 1 = Yes)\n", - "- `IsActiveMember`: Active membership status (0 = No, 1 = Yes)\n", - "- `EstimatedSalary`: Estimated salary\n", - "- `Exited`: Abandoned or not? (0 = No, 1 = Yes)\n", - "\n", - "

Links:

\n", - "" - ] - }, - { - "cell_type": "markdown", - "id": "b87a659d-d9f3-433c-b197-6b826c2b5f1f", - "metadata": {}, - "source": [ - "" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/UseCases/Banking_Customer_Churn_IVSM/4.IVSM_Banking_Customer_Churn.ipynb b/UseCases/Banking_Customer_Churn_IVSM/Step3_Train_Churn_Model_with_Clustering.ipynb similarity index 95% rename from UseCases/Banking_Customer_Churn_IVSM/4.IVSM_Banking_Customer_Churn.ipynb rename to UseCases/Banking_Customer_Churn_IVSM/Step3_Train_Churn_Model_with_Clustering.ipynb index 5da82329..d80fbb2e 100644 --- a/UseCases/Banking_Customer_Churn_IVSM/4.IVSM_Banking_Customer_Churn.ipynb +++ b/UseCases/Banking_Customer_Churn_IVSM/Step3_Train_Churn_Model_with_Clustering.ipynb @@ -7,7 +7,7 @@ "source": [ "
\n", "

\n", - " IVSM Banking Customer Churn\n", + " Train Churn Model for Banking Customer data using Clustering and in-DB Functions\n", "
\n", " \"Teradata\"\n", "

\n", @@ -21,8 +21,8 @@ "source": [ "

Introduction

\n", "\n", - "
\n", - "\n", + "
\n", + "

Source: Medium

\n", "\n", "

Customer churn is a critical metric in banking because it can directly impact a bank's revenue and profitability. When customers leave, banks lose the income they would have earned from those customers' transactions, investments, and account fees. Additionally, attracting new customers to replace those who have left can be expensive and time-consuming, so reducing customer churn is often more cost-effective than acquiring new customers.

\n", "\n", @@ -30,7 +30,17 @@ "\n", "

Banks can use various strategies to reduce customer churn, such as improving customer service, offering more competitive rates and fees, providing personalized recommendations and offers, and enhancing digital channels and mobile apps. By tracking and analyzing customer churn rates, banks can identify areas for improvement and make strategic decisions to retain customers and improve overall customer satisfaction.

\n", "\n", - "

In this demo, we demonstrate how to implement the entire lifecycle of churn prediction can using Vantage technologies and, specifically, the combination of Bring Your Own Model (BYOM), Vantage Analytics Library (VAL) and teradataml python client library solution.

" + "

In this demo, we demonstrate how to implement the entire lifecycle of churn prediction using Vantage technologies, specifically the combination of Bring Your Own Model (BYOM), the Vantage Analytics Library (VAL), and the teradataml Python client library.

\n", + "\n", + "

Steps in the analysis:

\n", + "
    \n", + "
1. Initiate a connection to Vantage\n",
+    "2. Run K-Means on the Embeddings Store and then build the final table with Cluster ID assignments to rows\n",
+    "3. Data Transformation\n",
+    "4. Modelling\n",
+    "5. Evaluate the Model\n",
+    "6. Cleanup\n",
+    "
" ] }, { @@ -50,7 +60,7 @@ "metadata": {}, "source": [ "
\n", - "

Note: Please ensure that 3.IVSM_Banking_Customer_Churn_embed_BYOM is executed before running this file.

\n", + "

Note: Please ensure that Step1_Banking_Customer_Churn_Sentiment_Analysis is executed before running this file.

\n", "
" ] }, @@ -80,7 +90,8 @@ "from teradataml import create_context\n", "from teradataml import SVM, SVMPredict\n", "from teradataml import GridSearch, RandomSearch\n", - "from teradatasqlalchemy import BYTEINT" + "from teradatasqlalchemy import BYTEINT\n", + "display.max_rows = 5" ] }, { @@ -158,32 +169,6 @@ "%run -i ../run_procedure.py \"call space_report();\" # Takes 10 seconds" ] }, - { - "cell_type": "markdown", - "id": "860d53bf-9684-4534-84c8-77c52f17c9d7", - "metadata": {}, - "source": [ - "

1.1 Confirmation for functions\n", - "

Now we can confirm that the required functions are installed.

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bc34ce47-36e6-4014-b2b5-9c618a876caf", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display, Markdown\n", - "\n", - "df_check= DataFrame.from_query('''select count(*) as cnt from dbc.tablesV where databasename = 'ivsm';''')\n", - "if df_check.get_values()[0][0] >= 10:\n", - " print('Functions are installed, please continue.')\n", - "else:\n", - " print('Functions are not installed, please go to Instalization notebook before proceeding further')\n", - " display(Markdown(\"[Initialization Notebook](./1.IVSM_Banking_Customer_Churn_Model_Install.ipynb)\"))" - ] - }, { "cell_type": "code", "execution_count": null, @@ -236,7 +221,7 @@ "source": [ "cols = list(df.columns)[2:]\n", "\n", - "KMeans_out = KMeans(id_column=\"id\",\n", + "KMeans_out = KMeans(id_column=\"CustomerId\",\n", " target_columns=cols,\n", " data=df,\n", " num_clusters=10,\n", @@ -291,7 +276,7 @@ }, "outputs": [], "source": [ - "merged_df = clusters.merge(df[['id','txt']], on='id', how='inner', lsuffix='_left', rsuffix='_right')" + "merged_df = clusters.merge(df[['CustomerId','Customer_Complaint']], on='CustomerId', how='inner', lsuffix='_left', rsuffix='_right')" ] }, { @@ -303,7 +288,7 @@ }, "outputs": [], "source": [ - "merged_df=merged_df.drop('id__left', axis=1)" + "merged_df=merged_df.drop('CustomerId__left', axis=1)" ] }, { @@ -337,8 +322,8 @@ }, "outputs": [], "source": [ - "new_df = customer_churn.merge(merged_df[['id__right','td_clusterid_kmeans']],\n", - " on='customerid = id__right',\n", + "new_df = customer_churn.merge(merged_df[['CustomerId__right','td_clusterid_kmeans']],\n", + " on='customerid = CustomerId__right',\n", " how='inner')\n", "new_df" ] @@ -352,7 +337,7 @@ }, "outputs": [], "source": [ - "new_df = new_df.drop('id__right',axis=1)" + "new_df = new_df.drop('CustomerId__right',axis=1)" ] }, { @@ -1057,7 +1042,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.10" + "version": "3.11.14" } }, "nbformat": 4, diff --git a/UseCases/Banking_Customer_Churn_IVSM/commands.json b/UseCases/Banking_Customer_Churn_IVSM/commands.json deleted file mode 100644 index a8303d31..00000000 --- a/UseCases/Banking_Customer_Churn_IVSM/commands.json +++ /dev/null @@ -1,64 +0,0 @@ -{ - "queries": [ - { - "query": "CREATE DATABASE ivsm as perm=150000000*4;" - }, - { - "query": "DATABASE ivsm;" - }, - { - "query": "GRANT CREATE EXTERNAL PROCEDURE ON ivsm TO demo_user;" - }, - { - "query": "GRANT CREATE FUNCTION ON ivsm TO demo_user;" - }, - { - "query": "GRANT DROP FUNCTION ON ivsm TO demo_user;" - }, - { - "query": "CALL SQLJ.INSTALL_JAR('cj!./tokenizer-0.0.1-BETA.jar','TOKENIZER',0);" - }, - { - "query": "REPLACE FUNCTION ivsm.tokenizer_encode() RETURNS TABLE VARYING USING FUNCTION Encoder_contract SPECIFIC tokenizer_encode LANGUAGE JAVA NO SQL NO EXTERNAL DATA PARAMETER STYLE SQLTable NOT DETERMINISTIC CALLED ON NULL INPUT EXTERNAL NAME 'TOKENIZER:com.teradata.tokenizer.to.Encoder.execute';" - }, - { - "query": "REPLACE FUNCTION ivsm.tokenizer_decode() RETURNS TABLE VARYING USING FUNCTION Decoder_contract SPECIFIC tokenizer_decode LANGUAGE JAVA NO SQL NO EXTERNAL DATA PARAMETER STYLE SQLTable NOT DETERMINISTIC CALLED ON NULL INPUT EXTERNAL NAME 'TOKENIZER:com.teradata.tokenizer.to.Decoder.execute';" - }, - { - "query": "REPLACE FUNCTION ivsm.vector_to_columns() RETURNS TABLE VARYING USING FUNCTION VectorToColumns_contract SPECIFIC vector_to_columns LANGUAGE JAVA NO SQL 
NO EXTERNAL DATA PARAMETER STYLE SQLTable NOT DETERMINISTIC CALLED ON NULL INPUT EXTERNAL NAME 'TOKENIZER:com.teradata.tokenizer.to.VectorToColumns.execute';" - }, - { - "query": "DATABASE ivsm;" - }, - { - "query": "CALL SQLJ.INSTALL_JAR('CJ!./engine-8.0.0.jar','IVSM',0);" - }, - { - "query": "REPLACE FUNCTION ivsm.IVSM_score() RETURNS TABLE VARYING USING FUNCTION SMO_contract SPECIFIC IVSM_score LANGUAGE JAVA NO SQL NO EXTERNAL DATA PARAMETER STYLE SQLTable NOT DETERMINISTIC CALLED ON NULL INPUT EXTERNAL NAME 'IVSM:com.teradata.ivsm.engine.SMO.execute()';" - }, - { - "query": "CALL SQLJ.ServerControl('JAVA', 'disable', a);" - }, - { - "query": "CALL SQLJ.ServerControl('JAVA', 'shutdown', a);" - }, - { - "query": "CALL SQLJ.ServerControl('JAVA', 'enable', a);" - }, - { - "query": "CALL SQLJ.ServerControl('JAVA', 'status', a);" - }, - { - "query": "GRANT EXECUTE FUNCTION ON ivsm.tokenizer_encode TO demo_user;" - }, - { - "query": "GRANT EXECUTE FUNCTION ON ivsm.tokenizer_decode TO demo_user;" - }, - { - "query": "GRANT EXECUTE FUNCTION ON ivsm.vector_to_columns TO demo_user;" - }, - { - "query": "GRANT EXECUTE FUNCTION ON ivsm.IVSM_score TO demo_user;" - } - ] -} diff --git a/UseCases/Banking_Customer_Churn_IVSM/engine-8.0.0.jar b/UseCases/Banking_Customer_Churn_IVSM/engine-8.0.0.jar deleted file mode 100644 index 1b3a92ff..00000000 Binary files a/UseCases/Banking_Customer_Churn_IVSM/engine-8.0.0.jar and /dev/null differ diff --git a/UseCases/Banking_Customer_Churn_IVSM/images/sentiment-analysis-applications.jpeg b/UseCases/Banking_Customer_Churn_IVSM/images/sentiment-analysis-applications.jpeg new file mode 100644 index 00000000..d82ed050 Binary files /dev/null and b/UseCases/Banking_Customer_Churn_IVSM/images/sentiment-analysis-applications.jpeg differ diff --git a/UseCases/Banking_Customer_Churn_IVSM/tokenizer-0.0.1-BETA.jar b/UseCases/Banking_Customer_Churn_IVSM/tokenizer-0.0.1-BETA.jar deleted file mode 100644 index 1199e123..00000000 Binary files a/UseCases/Banking_Customer_Churn_IVSM/tokenizer-0.0.1-BETA.jar and /dev/null differ