From 90e66ba6766a124fc5c90223d51b47c2cc84bcc3 Mon Sep 17 00:00:00 2001
From: Mohd-Hassan \n",
+ " Customer Reviews Analysis using Generative AI with Vantage\n",
+ " Introduction: Customer review analysis is a crucial part of understanding customer sentiment and preferences. By leveraging the power of OpenAIEmbeddings and the Vantage InDB Analytic Function, we can gain valuable insights from customer reviews. Why Vantage? In this demo, we will build a customer review analysis system using the TDApiClient InDB Analytic Function. Customer review analysis involves processing and analyzing customer feedback to identify patterns, sentiment, and trends. This analysis can help businesses improve their products, services, and overall customer experience. By integrating OpenAIEmbeddings and the Vantage InDB Analytic Function, we can apply advanced natural language processing (NLP) and machine learning techniques to extract meaningful insights from customer reviews. The following diagram illustrates the architecture. Before going any further, let's get a better understanding of embeddings. We believe that embeddings are the AI-native way to represent any kind of data, making them a natural fit for all kinds of AI-powered tools and algorithms. They can represent text, images, and soon audio and video. We have many options for creating embeddings, whether locally using an installed library or by calling an API. Embedding models, like Word2Vec or GloVe, learn vector representations for words based on co-occurrence statistics. For instance, in Word2Vec, a word's vector is optimized to predict surrounding words in a context window. Each word's vector captures semantic relationships, with similar words having closer vectors. In essence, the model learns to represent words in a multi-dimensional space where similar words are close together. For example, \"king\" and \"queen\" might have similar vectors due to their contextual similarity in many sentences. Imagine we have a bunch of words, and we want to find a way to represent them in a way that captures their meaning. 
One way we can do this is by creating a word embedding. A word embedding is a vector of numbers that represents the meaning of a word. We choose the numbers in the vector so that words that are similar in meaning have similar vectors. For example, we might have vectors for words like \"cheese,\" \"butter,\" \"chocolate,\" and \"sauce\" that look like the following: In such a vector, the numbers don't have any special meaning by themselves. They just represent the way that the word \"cheese\" is related to other words in our vocabulary. We can use word embeddings to find the similarity between words. For example, we can calculate the cosine similarity between the vector for \"cheese\" and the vector for \"butter\". The cosine similarity is a measure of how similar two vectors are, and it ranges from -1 to 1. A cosine similarity of 1 means that the two vectors point in the same direction, 0 means that they are orthogonal (unrelated), and -1 means that they point in opposite directions. We can also use word embeddings to find related words. For example, we can find all of the words that are similar in meaning to \"cheese\". This would include words like \"milk\", \"cream\", \"yogurt\", and \"feta\". We find word embeddings to be a powerful tool for natural language processing. We can utilize them for a variety of tasks, such as sentiment analysis, machine translation, and question answering. Above is a visual representation of how word embeddings work. Imagine we have a bunch of points in a high-dimensional space. Each point represents a word, and its position in space represents the meaning of the word. Words that are similar in meaning will be close together in space, and words that are different in meaning will be far apart. Now, imagine that we take a slice through our high-dimensional space. This slice will be a two-dimensional space, and the points in our two-dimensional space will represent our word embeddings. 
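As a minimal sketch of the cosine-similarity comparison mentioned above, the snippet below computes the measure in plain Python. The three-dimensional vectors and the helper name `cosine_similarity` are made up for illustration only; real embeddings, such as the ones used in this demo, have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings", invented for this example.
toy_vectors = {
    "cheese": [0.9, 0.8, 0.1],
    "butter": [0.8, 0.9, 0.2],
    "sauce": [0.1, 0.3, 0.9],
}

sim_butter = cosine_similarity(toy_vectors["cheese"], toy_vectors["butter"])
sim_sauce = cosine_similarity(toy_vectors["cheese"], toy_vectors["sauce"])

# With these toy values, "butter" scores higher against "cheese" than "sauce" does.
print(f"cheese vs butter: {sim_butter:.3f}")
print(f"cheese vs sauce:  {sim_sauce:.3f}")
```

The same function applies unchanged to longer vectors, since it only relies on the two inputs having the same length.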
The distance between two points in our two-dimensional space will be a measure of the similarity between the two words. In this way, we can use word embeddings to represent the meaning of words in a way that is both compact and informative. Steps in the analysis: \n",
+ " The above statements will install the libraries required to run this demo. To gain access to the installed libraries after running them, we should restart the kernel. Note: The above statements may need to be uncommented if we run the notebooks on a platform other than ClearScape Analytics Experience that does not have the libraries installed. If we uncomment those installs, we should be sure to restart the kernel after executing those lines to bring the installed libraries into memory. The simplest way to restart the kernel is by typing zero zero: 0 0 1.1 Import the required libraries Here, we import the required libraries, set environment variables and environment paths (if required). 2.1 Connect to Vantage We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys. 2.2 Getting Data for This Demo We have provided data for this demo on cloud storage. We can either run the demo using foreign tables to access the data without using any storage in our environment, or download the data to local storage, which may yield faster execution but requires available storage. Two statements are in the following cell, and one is commented out. We can switch modes by changing which statement is commented out. Next is an optional step, in case we want to see the status of the databases/tables created and the space used. The data for this demo comes from the reviews table of TPCx-AI. There are also a few other tables, such as orders, line-items, order_returns, products, etc. However, for this demo, we will only use the reviews table. The reviews table contains information about all of the customer reviews that are available in TPCx-AI. This includes the customer id, review, and spam. Each row is a snapshot of data taken from the reviews table. Below is the list of columns in the reviews table: \n",
+ "
\n",
+ " \n",
+ "
TDApiClient Vantage function to generate reviews embeddings in parallel. This approach can significantly enhance the performance of generating the embeddings at scale.
\n",
+ "
\n",
+ "\n",
+ "\n",
+ "
\n",
+ "\n",
+ "
\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a833b5bf-74af-42be-8543-3782e1da95dc",
+ "metadata": {},
+ "source": [
+ "
\n",
+ "1. Configuring the environment"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2f6027a7-888d-441f-abc7-a6ea1c45f0a6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%capture\n",
+ "# '%%capture' suppresses the display of installation steps of the following packages\n",
+ "\n",
+ "!pip install --upgrade -r requirements.txt --quiet"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ca97cdce-0d5e-4e54-b404-da7bae24ef51",
+ "metadata": {},
+ "source": [
+ "
\n",
+ "\n",
+ "
\n",
+ "2. Connect to Vantage"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c60e4d23-3f17-4d43-b9b9-7e125f59b8c7",
+ "metadata": {},
+ "source": [
+ "
\n",
+ "
\n",
+ "\n",
+ "
\n",
+ "3. Data Exploration\n",
+ "\n",
+ "\n",
+ "
\n",
+ "
The source data from TPCx-AI is loaded in Vantage with table named marketplace.
\n", + "\n", + "*Please click here scroll down to the end of the notebook for detailed column descriptions of the dataset.
" + ] + }, + { + "cell_type": "markdown", + "id": "eda8e565-b58b-4cc6-8d9c-50790ca3dcab", + "metadata": {}, + "source": [ + "3.1 Examine the Customer review table
\n", + "Let's look at the sample data in the customer review table.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a9951cf-e2b0-40d7-9a8f-a01c7f48ef09", + "metadata": {}, + "outputs": [], + "source": [ + "# tdf = DataFrame(\"customer_reviews\")\n", + "tdf = DataFrame(in_schema(\"DEMO_CustomerReviews\", \"customer_review\"))\n", + "print(\"Data information: \\n\", tdf.shape)\n", + "tdf.sort(\"id\")" + ] + }, + { + "cell_type": "markdown", + "id": "b0d6d954-b72f-41c3-9f8c-bc15cf74d672", + "metadata": {}, + "source": [ + "There are approx 1k sample records in all, and there are 3 variables. Reviews are listed from different customers. We shall analyse reviews for sentiments and major topics.
" + ] + }, + { + "cell_type": "markdown", + "id": "1584ba35-f053-43c4-a802-1e5683aff5c1", + "metadata": {}, + "source": [ + "3.2 Do we want to generate the embeddings?
\n", + "We have already generated embeddings for the customer review and stored them in files.
\n", + "\n", + "Note: If we would like to skip the embedding generation step and move on to the next section, please click here to skip.
\n", + "To save time, we can move to the already generated embeddings section. However, if we would like to see how we generate the embeddings, or if we need to generate the embeddings for a different dataset, then continue to the following section.
" + ] + }, + { + "cell_type": "markdown", + "id": "e5eb99e9-049f-4550-8f66-c9c755f04b54", + "metadata": {}, + "source": [ + "In this section, we are creating the OpenAI embeddings for 1000 customer reviews. It will cost us a few dollars on our OpenAI account.
\n", + "4.1 Get the OpenAI API key
" + ] + }, + { + "cell_type": "markdown", + "id": "c28ab4f5-28df-41eb-914c-49d5108f3ce9", + "metadata": {}, + "source": [ + "In order to utilize this demo, you will need an OpenAI API key. If you do not have one, please refer to the instructions provided in this guide to obtain your OpenAI API key:
\n", + "\n", + "\n", + "\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fcaa2d69-ffc8-4298-88df-afa04c035547", + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "\n", + "# enter your openai api key\n", + "api_key = getpass.getpass(\"\\n Please Enter OpenAI API key: \")" + ] + }, + { + "cell_type": "markdown", + "id": "89367d27-22ca-4a0e-b299-0f8e1a669157", + "metadata": {}, + "source": [ + "4.2 Generate the embeddings for customer review via API_Request In-database Function
\n", + "\n", + "OpenAI and Azure OpenAI provide multiple APIs for their hosted models. We introduce integration with the embedding API, which can be used in various types of applications: Classification, Search, Recommendations, and Anomaly detection. For more information on our Teradata API Integration, click here.
\n", + "\n", + "Under the hood, we will use the OpenAI embeddings method to generate the embeddings. OpenAI embeddings are a type of embedding that we can use to represent reviews in a way that captures their semantic meaning. To generate embeddings for the customer review table, we will use the review field and call the OpenAI Embeddings API for each customer review. Please refer to the Embeddings documentation for more information about embeddings and the types of models available.
\n", + "\n", + "The OpenAI Embeddings API takes a text string as input and returns a vector of numbers that represent the embedding. The length of the vector depends on the model that we are using. For example, the text-embedding-3-small model returns a vector of 1536 numbers.
\n", + "\n", + "In this demo, we will use text-embedding-3-small as the model and set num_embeddings to 1536.
" + ] + }, + { + "cell_type": "markdown", + "id": "7bffd2a8-532e-40f1-bf39-66cd2d7cf21c", + "metadata": {}, + "source": [ + "To generate the embeddings, we will call the get_embeddings() function. This function will convert the Teradata DataFrame to a Pandas DataFrame and generate the embeddings. Once the embeddings are generated, we will store them in separate columns so that we can pass them to the K-Means function later on.
\n", + "\n", + "Please be patient: Generating embeddings for 1000 reviews may take from 60 to 100 seconds, depending on the number of AMPs in the database. Since the volume of data is large and the machine is small, running the code below could take up to 2 minutes. If we prefer to skip this step and proceed to the next section, we can click here to skip.
\n", + "4.3 Display the customer review embeddings
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f493589f-75bc-463c-a8c7-e2b1f4231dd7", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Data information: \\n\", tdf_review_embeddings.shape)\n", + "\n", + "display.suppress_vantage_runtime_warnings = True\n", + "tdf_review_embeddings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a1cd6525-f65e-44c4-b9b6-efe9a86cb533", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from teradataml import DataFrame as tdDF, get_connection\n", + " \n", + "# Assume is a teradataml DataFrame\n", + "# Collect tdf_review_embeddings locally as pandas\n", + "pdf = tdf_review_embeddings.to_pandas()\n", + " \n", + "# Convert embeddings from BLOB to float32 arrays\n", + "embeddings_array = np.array([np.frombuffer(b, dtype=np.float32) for b in pdf['embedding']])\n", + " \n", + "# Create a final DataFrame with embeddings split into columns\n", + "df_final = pd.concat([pdf[['id', 'review', 'spam']], pd.DataFrame(embeddings_array)], axis=1)\n", + "\n", + "# rename embedding columns\n", + "df_final.columns = (\n", + " ['id', 'review', 'spam']\n", + " + [f'emb_{i}' for i in range(embeddings_array.shape[1])]\n", + ")\n", + "\n", + "# Convert pandas DataFrame to teradataml DataFrame\n", + "tdf_review_embeddings_float = DataFrame(df_final, index=False)\n", + " \n", + "tdf_review_embeddings_float.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "1932e4e0-d66d-40ae-815d-c5213e67065d", + "metadata": {}, + "source": [ + "We can see that generated embeddings for all of the review are in vector of 1536 columns.
\n", + "\n", + "For example: The generated embeddings for review name: Nike SALE Nike UltraBoost These are women's men's\t consists of 1536 numbers and looks like:
\n",
+ "-0.0197175870\t0.005220166\t-0.017851508\t0.00217428\t-0.00335274\t-0.0050890
Now that we have generated the embeddings from the reviews, we save the review embeddings DataFrame into a Vantage table named review_embeddings for further use.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ca1d072f-b855-4ab4-b755-03362692d117", + "metadata": {}, + "outputs": [], + "source": [ + "delete_and_copy_embeddings(\n", + " table_name=\"review_embeddings\", tdf=tdf_review_embeddings_float, eng=eng\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "63c2239a-53df-4e50-91c4-5a6008696da8", + "metadata": {}, + "source": [ + "Note: If you're generating embeddings for a new document and plan to store it as a file, consider uncommenting the code below. Doing so will significantly speed up the process in future runs by skipping section 4 altogether.
\n", + "5.1 Load the reviews embeddings
\n", + "\n", + "In this demo, we will load existing embeddings from files to a database. This will allow us to perform further processing on the embeddings.
\n", + "\n", + "Note: If you have already executed the Generate the embeddings section, then below code will be skipped automatically.
\n", + "The code above first reads the data from the files, which contain the review embeddings. It then loads the data into a permanent table in SQL. Once the data is loaded, we will use the Vantage InDB Analytic Function KMeans to cluster the review embeddings. Each review embedding is a list of numerical values, that is, a vector.
The embeddings file contains over 1000 records, each with 1536 numerical features. This means that the file is quite large and it may take some time to load it into SQL.
\n", + "Note: Please be patient. The code above is loading data from files and copying it to SQL. This process may take 30-50 seconds.
\n", + "5.2 Display the review embeddings
\n", + "To give you a better idea of what the embeddings look like, here are the first five rows of the review embeddings:
\"\"\"\n", + "\n", + "\n", + "def get_section5_desc_sample():\n", + " return \"\"\"We can see that generated embeddings for all of the review are in vector of 1536 columns.
\n", + "For example: The generated embeddings for review name: Nike running shoes for man consists of 1536 numbers and looks like:
\n",
+ " -0.038744\t-0.016937\t-0.017475\t0.003624\t0.00744\t-0.00275\t0.02374
Section 4: Generate the embeddings is already executed! So, skipping the execution of above code.
The code below will not run if Section 5 has already been skipped.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3f0972f-6796-48a1-ac6e-0bdedd0fb991", + "metadata": {}, + "outputs": [], + "source": [ + "display(Markdown(get_section5_desc_sample())) if flag else None" + ] + }, + { + "cell_type": "markdown", + "id": "6d79ecde-2796-4745-baeb-586af0d60847", + "metadata": {}, + "source": [ + "The k-means algorithm groups a set of observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid). This algorithm minimizes the objective function, that is, the total squared Euclidean distance of all data points from the center of their assigned cluster, as follows:
\n", + "6.1 Filter columns from the embeddings
\n", + "In the steps above, we took a sample of the dataset consisting of 1000 reviews from e-commerce customers. We now need to find clusters in these reviews. To run K-Means clustering, we only need the embeddings, so we discard the remaining columns from the DataFrame.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d53e172e-aa59-482b-bac9-6d55b91fc57a", + "metadata": {}, + "outputs": [], + "source": [ + "review_embeddings = sample_embeddings if flag else tdf_review_embeddings_float\n", + "embedding_column_list = review_embeddings.drop(columns=[\"id\", \"review\", \"spam\"])\n", + "embedding_column_names = review_embeddings.columns[3:]" + ] + }, + { + "cell_type": "markdown", + "id": "c4d35195-c6da-47dd-a0eb-eb6421ae9cd8", + "metadata": { + "tags": [] + }, + "source": [ + "The next question we face is How many clusters?. To find this number we use a technique called Elbow method. The Elbow Method is a technique used in data science to help determine the optimal number of clusters in a dataset. In the code snippet below we try cluster values ranging from 1 to 40 and record the distortion value. The visualizer shows us where the elbow lies in the graph.
\n", + "\n", + "Vantage's ClearScape Analytics can easily integrate with 3rd party visualization python modules available like plotly, seaborn etc. We can do all the calculations and pre-processing on Vantage and pass only the necessary information to visualization tools, this will not only make the calculation faster but also reduce the time due to less data movement between tools. We do the data transfer for this and the subsequent visualizations wherever necessary.
" + ] + }, + { + "cell_type": "markdown", + "id": "67854451-5d13-4f09-8be6-c450719f68f0", + "metadata": {}, + "source": [ + "6.2 Find the optimal number of clusters
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a9d7b8e-718b-40c3-b11d-f4854700cb47", + "metadata": {}, + "outputs": [], + "source": [ + "from yellowbrick.cluster import KElbowVisualizer\n", + "from sklearn.cluster import KMeans\n", + "\n", + "model = KMeans()\n", + "visualizer = KElbowVisualizer(model, k=(1, 40))" + ] + }, + { + "cell_type": "markdown", + "id": "28a4161d-4ca1-483a-8255-6414665c1d2f", + "metadata": {}, + "source": [ + "Note: Please be patient. We are currently performing some mathematical calculations. This process may take 3-5 minutes.
\n", + "We observe that the elbow lies at {no_clusters}, so the optimum number of clusters is {no_clusters}. With that information established, we now begin the process of clustering.
\"\"\"\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "9cb99632-b4a8-4e1b-b2b7-d5459edbfeb5", + "metadata": {}, + "source": [ + "6.3 Apply K-Means Clusters using Teradata Vantage InDB Analytic Function
\n", + "\n", + "The function performs a fast K-Means clustering algorithm and returns cluster means\n", + " and averages. We also need to find which cluster each of the 1000 customer reviews in the sample belongs to.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec199b25-0eea-4bdd-9db5-27b5bba1ce81", + "metadata": {}, + "outputs": [], + "source": [ + "from teradataml import KMeans\n", + "\n", + "kmeans_model = KMeans(\n", + " id_column=\"id\",\n", + " data=review_embeddings,\n", + " target_columns=embedding_column_names,\n", + " num_clusters=int(no_clusters),\n", + " output_cluster_assignment=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e493c069-bc12-45e1-ae6a-76fb3533ca5f", + "metadata": {}, + "source": [ + "To view the number of reviews per cluster, we inspect the cluster information from the KMeans model, which shows each cluster and the number of entries in it.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a33a8e3-d0e6-472e-bc95-50d3d8fe8416", + "metadata": {}, + "outputs": [], + "source": [ + "selected_result = kmeans_model.result\n", + "selected_model_n = selected_result[selected_result.td_clusterid_kmeans.ge(0)]\n", + "selected_model_n.groupby(\"td_clusterid_kmeans\").count()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47ab67f8-39cc-4347-abbb-c1926636fed2", + "metadata": {}, + "outputs": [], + "source": [ + "df_final = selected_model_n.join(\n", + " other=review_embeddings, on=[\"id\"], how=\"inner\", lsuffix=\"t1\", rsuffix=\"t2\"\n", + ")\n", + "df_final = df_final.drop([\"id_t2\"], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1ccbb05-0b37-4495-b3d6-991ec85d73de", + "metadata": {}, + "outputs": [], + "source": [ + "df_final" + ] + }, + { + "cell_type": "markdown", + "id": "c9592e79-b280-4567-97b3-c9db4f4ed34d", + "metadata": {}, + "source": [ + "Now, let's copy clustered data to SQL.
\n", + "\n", + "Note: Please be patient. The following code optimizes performance by temporarily transferring data between a dataframe and SQL. This process may take 30-50 seconds.
\n", + "The code below facilitates the visualization of all similar customer reviews in a single cluster, effectively grouping similar reviews together.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cbe2ce52-fa73-48ad-af21-8d827e337d3b", + "metadata": {}, + "outputs": [], + "source": [ + "df_cluster1 = df_review_clustered.loc[df_review_clustered.td_clusterid_kmeans == 1][\n", + " [\"id_t1\", \"td_clusterid_kmeans\", \"review\"]\n", + "]\n", + "\n", + "\n", + "df_cluster1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b15db606-eff1-4120-9f8a-f3adc31b7487", + "metadata": {}, + "outputs": [], + "source": [ + "df_cluster1.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0ae0aae-5d04-4a6c-9fd5-f34429fb4617", + "metadata": {}, + "outputs": [], + "source": [ + "unique_clusters = df_review_clustered.select(['td_clusterid_kmeans'])\n", + "unique_clusters.assign(drop_columns=True, distinct_cluster=df_cluster1.td_clusterid_kmeans.distinct())\n", + "print(unique_clusters)" + ] + }, + { + "cell_type": "markdown", + "id": "a58e8b53-32fa-486e-b25b-81eb99083216", + "metadata": {}, + "source": [ + "7. Visualization of Clusters with Customer reviews
\n", + "\n", + "The graph displays the clustering of reviews into distinct groups. Based on the analysis, the data has been divided into n optimal clusters, each representing a unique pattern or category of reviews. This clustering approach helps to identify the key areas or types of reviews that are most prevalent, allowing for more targeted investigation and resolution efforts.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b5fe1e12-078a-4aaf-92aa-c50855bca57c", + "metadata": {}, + "outputs": [], + "source": [ + "clus = df_review_clustered.to_pandas().reset_index()\n", + "\n", + "from sklearn.manifold import TSNE\n", + "\n", + "\n", + "tsne = TSNE(n_components=2, random_state=42)\n", + "tsne_result = tsne.fit_transform(clus.iloc[:, 20:-1])\n", + "\n", + "tsne_df = pd.DataFrame(tsne_result, columns=[\"tsne_1\", \"tsne_2\"])\n", + "tsne_df[\"cluster_id\"] = clus[\"td_clusterid_kmeans\"]\n", + "tsne_df[\"id\"] = clus[\"review\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "23f81bec-a175-4a00-b2e9-7cd0e78c1d8f", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import plotly.express as px\n", + "\n", + "# Assuming you have already computed tsne_df as per the previous example\n", + "\n", + "# Create a new DataFrame combining t-SNE results with complaint information\n", + "tsne_complaint_df = pd.DataFrame(tsne_result, columns=[\"tsne_1\", \"tsne_2\"])\n", + "tsne_complaint_df[\"cluster_id\"] = clus[\"td_clusterid_kmeans\"]\n", + "tsne_complaint_df[\"id\"] = clus[\"id_t1\"]\n", + "tsne_complaint_df[\"review\"] = clus[\"review\"]\n", + "\n", + "# Truncate text for hover data\n", + "max_chars = 50 # Maximum characters to display\n", + "tsne_complaint_df[\"truncted_reviews\"] = clus[\"review\"].apply(\n", + " lambda x: x[:max_chars] + \"...\" if len(x) > max_chars else x\n", + ")\n", + "\n", + "# Plot using Plotly Express\n", + "fig = px.scatter(\n", + " tsne_complaint_df,\n", + " x=\"tsne_1\",\n", + " y=\"tsne_2\",\n", + " color=\"cluster_id\",\n", + " hover_data=[\"id\", \"truncted_reviews\", \"cluster_id\"],\n", + ")\n", + "\n", + "fig.update_traces(marker=dict(size=15))\n", + "fig.update_layout(\n", + " title=\"t-SNE Visualization of Clusters with Review\",\n", + " xaxis_title=\"dimension-1\",\n", + " yaxis_title=\"dimension-2\",\n", + " 
xaxis=dict(tickangle=45),\n", + " width=1000,\n", + " height=800,\n", + " hoverlabel=dict(bgcolor=\"white\", font_size=16, font_family=\"Rockwell\"),\n", + " autosize=False,\n", + ")\n", + "\n", + "# Customize the hovertemplate\n", + "fig.update_traces(\n", + " hovertemplate=\"ID: %{customdata[0]}8. Sentiment Analysis
\n", + "\n", + "Sentiment analysis using Generative AI involves leveraging cutting-edge technologies to extract insights from unstructured data. This process empowers businesses to swiftly identify and address customer concerns, enhancing overall customer satisfaction and loyalty.
\n", + "\n", + "In this demo, we'll use an LLM hosted on AWS Bedrock, us.anthropic.claude-3-7-sonnet-20250219-v1:0, which performs well on text generation tasks.
us.anthropic.claude-3-7-sonnet-20250219-v1:0We will now verify the configuration and the provided credentials by performing a test call.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9746b445-3db0-4d64-87a7-7ac729e4bbbb", + "metadata": {}, + "outputs": [], + "source": [ + "from botocore.exceptions import ClientError\n", + "\n", + "try:\n", + " response = client.converse(\n", + " modelId=\"us.anthropic.claude-3-7-sonnet-20250219-v1:0\",\n", + " messages=[{\n", + " \"role\": \"user\",\n", + " \"content\": [{\"text\": \"Hello\"}]\n", + " }],\n", + " inferenceConfig={\"maxTokens\": 10}\n", + " )\n", + " print(\"Test call successful! Response:\")\n", + " print(response)\n", + "except ClientError as e:\n", + " print(f\"Error testing client: {e}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f17b302f-5666-4fd8-bc5e-e8d94375ca6d", + "metadata": {}, + "outputs": [], + "source": [ + "def get_conversation(txt):\n", + " # Start a conversation with the user message.\n", + " user_message = (\n", + " f\"Analyze the sentiment of the following review text.\\n\\n\"\n", + " f\"Review: {txt}\\n\\n\"\n", + " \"Respond with exactly one word — either 'Positive', 'Negative', or 'Neutral'. 
\"\n", + " \"Do not include any explanation or punctuation.\"\n", + " )\n", + "\n", + " return [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": [{\"text\": user_message}],\n", + " }\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9d33bbd-9de9-4e02-a57a-a19186f2cbca", + "metadata": {}, + "outputs": [], + "source": [ + "from tqdm import tqdm\n", + "from botocore.exceptions import ClientError\n", + "\n", + "def sentiment_analysis(df, text_col=\"review\"):\n", + " df = df.copy()\n", + " df[\"Sentiment\"] = None\n", + "\n", + " for i in tqdm(range(len(df))):\n", + " try:\n", + " review_text = df.loc[i, text_col]\n", + " if not isinstance(review_text, str) or not review_text.strip():\n", + " df.loc[i, \"Sentiment\"] = None\n", + " continue\n", + "\n", + " response = client.converse(\n", + " modelId=\"us.anthropic.claude-3-7-sonnet-20250219-v1:0\",\n", + " messages=get_conversation(review_text),\n", + " inferenceConfig={\"maxTokens\": 10, \"temperature\": 0.2, \"topP\": 0.9},\n", + " )\n", + "\n", + " content = response.get(\"output\", {}).get(\"message\", {}).get(\"content\", [])\n", + " if content and \"text\" in content[0]:\n", + " sentiment = content[0][\"text\"].strip()\n", + " df.loc[i, \"Sentiment\"] = sentiment\n", + " else:\n", + " df.loc[i, \"Sentiment\"] = None\n", + "\n", + " except ClientError as e:\n", + " print(f\"ClientError at row {i}: {e}\")\n", + " df.loc[i, \"Sentiment\"] = None\n", + " except Exception as e:\n", + " print(f\"Error at row {i}: {e}\")\n", + " df.loc[i, \"Sentiment\"] = None\n", + "\n", + " return df\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a334799c-b39e-4882-b2a4-a6d931828e4c", + "metadata": {}, + "outputs": [], + "source": [ + "pdf_cluster1 = df_cluster1.to_pandas().reset_index()\n", + "pdf_cluster1[\"Sentiment\"] = None" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d26bc46e-54fc-4eb4-982b-2a19077acee5", + "metadata": {}, + 
"outputs": [], + "source": [ + "response_df = sentiment_analysis(pdf_cluster1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41c46db9-b814-49af-bb41-3e791e5d97db", + "metadata": {}, + "outputs": [], + "source": [ + "response_df.dropna()[[\"id_t1\", \"review\", \"Sentiment\"]].head(30)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d575b0a-dfbb-4445-b9b7-e23ad1a84415", + "metadata": {}, + "outputs": [], + "source": [ + "response_df.shape" + ] + }, + { + "cell_type": "markdown", + "id": "963a6f90-6c95-440c-b45e-6759b9621b78", + "metadata": {}, + "source": [ + "This bar graph visualizes the distribution of sentiment predictions from a dataset, showing the total count for each sentiment category (Positive, Negative, Neutral). Created using Plotly Express, the graph includes text labels on each bar for clarity, highlighting the number of occurrences for each sentiment. This visualization provides a concise overview of sentiment trends in the dataset.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9bd823dc-4617-4919-adaf-5ef0fdad0c6e", + "metadata": {}, + "outputs": [], + "source": [ + "from collections import Counter\n", + "\n", + "data = Counter(response_df[\"Sentiment\"])\n", + "\n", + "# Convert Counter data to DataFrame\n", + "viz_df = pd.DataFrame.from_dict(data, orient=\"index\", columns=[\"Count\"]).reset_index()\n", + "\n", + "\n", + "# Rename columns\n", + "viz_df.columns = [\"Prediction\", \"Count\"]\n", + "\n", + "\n", + "# Create bar graph using Plotly Express\n", + "fig = px.bar(\n", + " viz_df,\n", + " x=\"Prediction\",\n", + " y=\"Count\",\n", + " color=\"Prediction\",\n", + " labels={\"Count\": \"Number of Occurrences\", \"Prediction\": \"Prediction\"},\n", + " text=\"Count\",\n", + ")\n", + "\n", + "# Update layout to show text labels on bars\n", + "fig.update_traces(texttemplate=\"%{text}\", textposition=\"inside\")\n", + "\n", + "# Show the plot\n", + "fig.show(renderer=\"notebook\")" + ] + }, + { + "cell_type": "markdown", + "id": "a20dc2ef-43dd-4593-bf00-fa154e7d09ca", + "metadata": {}, + "source": [ + "Unlock the power of customer feedback with intuitive word cloud visualization, which provides a comprehensive snapshot of negative reviews sentiment. This innovative tool highlights the most frequently occurring words and pain points in customer feedback, empowering businesses to:
\n", + "\n", + "By leveraging this word cloud, businesses can proactively address customer concerns, refine their products and services, and ultimately drive growth through a deeper understanding of their customers' needs and preferences.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eae633e7-e59f-423e-95ee-d288924ed106", + "metadata": {}, + "outputs": [], + "source": [ + "from wordcloud import WordCloud\n", + "import matplotlib.pyplot as plt\n", + "\n", + "\n", + "text = response_df[response_df[\"Sentiment\"] == \"Negative\"]\n", + "combine_text = \" \".join(text[\"review\"])\n", + "\n", + "# Remove 'X' characters and the word 'Discover' so they do not dominate the cloud\n", + "modified_string = combine_text.replace(\"X\", \"\").replace(\"Discover\", \"\")\n", + "\n", + "wordcloud = WordCloud(collocations=False, background_color=\"white\").generate(\n", + " modified_string\n", + ")\n", + "\n", + "# Display the word cloud with matplotlib\n", + "plt.imshow(wordcloud, interpolation=\"bilinear\")\n", + "plt.axis(\"off\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "e8e61cf4-be63-40fe-8f72-92c242571098", + "metadata": {}, + "source": [ + "9. Topic Modelling
\n", + "\n", + "Here, topic modeling means assigning each review to exactly one of three categories: Functionality, Comparisons, or Spam. Grouping reviews this way organizes them by content and relevance, which makes them easier to analyze. Counting the reviews in each category shows which customer concerns come up most often and how much of the feedback is likely spam.
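As a rough illustration of the counting step, the sketch below maps raw model replies onto the three labels and tallies them. The helper names `normalize_topic` and `count_topics`, the `ALLOWED_TOPICS` tuple, and the `Unknown` fallback are assumptions made for this example; they are not part of the notebook.

```python
import string
from collections import Counter

# Hypothetical label set, mirroring the three categories named above.
ALLOWED_TOPICS = ('Comparisons', 'Functionality', 'Spam')


def normalize_topic(raw):
    # Map a raw model reply onto one allowed label, or 'Unknown' if none matches.
    if not raw or not raw.strip():
        return 'Unknown'
    # Keep only the first word and drop surrounding punctuation such as '.' or '**'.
    first_word = raw.strip().split()[0].strip(string.punctuation).lower()
    for label in ALLOWED_TOPICS:
        if first_word == label.lower():
            return label
    return 'Unknown'


def count_topics(raw_replies):
    # Tally how many replies fall into each normalized category.
    return Counter(normalize_topic(reply) for reply in raw_replies)
```

Calling `count_topics` on a list of reply strings returns a `Counter` keyed by label. Replies that match none of the three labels are grouped under `Unknown` rather than silently creating extra categories.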
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c7d203d-d450-468e-852c-356cd02a7fad", + "metadata": {}, + "outputs": [], + "source": [ + "def get_conversation(txt):\n", + " # Start a conversation with the user message.\n", + " user_message = (\n", + " f\"Analyze the following customer review text:\\n\\n\"\n", + " f\"Review: {txt}\\n\\n\"\n", + " \"Instruction: Respond with **exactly one word** — either 'Comparisons', 'Functionality', or 'Spam'. \"\n", + " \"Do NOT include any explanation, reasoning, or punctuation.\"\n", + " )\n", + "\n", + " return [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": [{\"text\": user_message}],\n", + " }\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ae69bb83-136b-40a2-811d-73e7d422d8f7", + "metadata": {}, + "outputs": [], + "source": [ + "def topic_modelling(df):\n", + " for i in tqdm(range(len(df))):\n", + "\n", + " try:\n", + " # Send the message to the model, using a basic inference configuration.\n", + " response = client.converse(\n", + " modelId=model_id,\n", + " messages=get_conversation(df.iloc[i, 1]),\n", + " inferenceConfig={\"maxTokens\": 512, \"temperature\": 0.2, \"topP\": 0.9},\n", + " )\n", + "\n", + " # Extract and print the response text.\n", + " response_text = response[\"output\"][\"message\"][\"content\"][0][\"text\"].strip()\n", + " topic = response_text.split()[0] # just the first word\n", + " df.loc[i, \"Predicted_Topic\"] = topic\n", + "\n", + "\n", + " except (ClientError, Exception) as e:\n", + " exit(1)\n", + "\n", + " return df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "94874efc-6ffc-45ca-b509-8640005e5f73", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# pdf_cluster1[\"Predicted_Topic\"] = None\n", + "\n", + "# response_df = topic_modelling(pdf_cluster1)\n", + "response_df[\"Predicted_Topic\"] = None\n", + "response_df = topic_modelling(response_df)" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "id": "62bb868b-ba87-4276-9277-ec23a9dd863a", + "metadata": {}, + "outputs": [], + "source": [ + "response_df = response_df.dropna()\n", + "response_df[\"Predicted_Topic\"] = (\n", + " response_df[\"Predicted_Topic\"].str.strip().str.replace(r\"^,+|,+$\", \"\", regex=True)\n", + ")\n", + "response_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "9b5ef509-514f-41cc-8987-2e61d4c3f477", + "metadata": {}, + "source": [ + "This bar graph shows the number of reviews assigned to each predicted topic (Comparisons, Functionality, Spam). It makes it easy to see which kinds of feedback dominate and where improvement effort should be focused, supporting data-driven decision making.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2616514a-1b54-4cfd-8d7d-628819d95bb3", + "metadata": {}, + "outputs": [], + "source": [ + "from collections import Counter\n", + "\n", + "data = Counter(response_df[\"Predicted_Topic\"])\n", + "\n", + "# Convert Counter data to DataFrame\n", + "viz_df = pd.DataFrame.from_dict(data, orient=\"index\", columns=[\"Count\"]).reset_index()\n", + "\n", + "# Rename columns\n", + "viz_df.columns = [\"Prediction\", \"Count\"]\n", + "\n", + "# Create bar graph using Plotly Express\n", + "fig = px.bar(\n", + " viz_df,\n", + " x=\"Prediction\",\n", + " y=\"Count\",\n", + " color=\"Prediction\",\n", + " labels={\"Count\": \"Number of Occurrences\", \"Prediction\": \"Prediction\"},\n", + " text=\"Count\",\n", + ")\n", + "\n", + "\n", + "# Update layout to show text labels on bars\n", + "fig.update_traces(texttemplate=\"%{text}\", textposition=\"inside\")\n", + "\n", + "# Show the plot\n", + "fig.show(renderer=\"notebook\")" + ] + }, + { + "cell_type": "markdown", + "id": "50b8b817-c0ff-42b3-a651-c8268ac40942", + "metadata": {}, + "source": [ + "10.1 Work Tables
\n", + "Clean up the work tables to prevent errors on the next run.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "99ccbcfb-8ae5-4a13-80af-262195c6a0fd", + "metadata": {}, + "outputs": [], + "source": [ + "# Loop through the list of tables and execute the drop table command for each table\n", + "for table in db_list_tables()['TableName'].tolist():\n", + " try:\n", + " db_drop_table(table_name=table, schema_name=\"demo_user\")\n", + " except Exception:\n", + " # Best-effort cleanup: skip tables that cannot be dropped\n", + " pass" + ] + }, + { + "cell_type": "markdown", + "id": "0aacb87c-0d81-40f0-8754-608d47e602cc", + "metadata": {}, + "source": [ + "10.2 Databases and Tables
\n", + "The following code will clean up tables and databases created above.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7b398148-9486-4535-8c06-d3f77669bbdc", + "metadata": {}, + "outputs": [], + "source": [ + "%run -i ../run_procedure.py \"call remove_data('DEMO_CustomerReviews');\" # Takes 5 seconds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "73ef4cbe-1beb-4c65-ac60-ffd6250668a4", + "metadata": {}, + "outputs": [], + "source": [ + "remove_context()" + ] + }, + { + "cell_type": "markdown", + "id": "ae971f90-bef1-4228-bcbf-c333e9843284", + "metadata": {}, + "source": [ + "\n", + " \n", + "Let’s look at the elements we have available for reference for this notebook:
\n", + "\n", + "Filters:
\n", + "Related Resources:
\n", + "\n", + "Links:
\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "e5463848-592f-4321-b852-287e133872dd", + "metadata": {}, + "source": [ + "" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.14" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}