diff --git a/UseCases/Graph_Analysis/Graph_Analysis_PY_SQL.ipynb b/UseCases/Graph_Analysis/Graph_Analysis_PY_SQL.ipynb index d9a59a58..245bc87b 100644 --- a/UseCases/Graph_Analysis/Graph_Analysis_PY_SQL.ipynb +++ b/UseCases/Graph_Analysis/Graph_Analysis_PY_SQL.ipynb @@ -19,21 +19,21 @@ "id": "05bc242a-7ee9-4b33-b837-ff52a9cd76d2", "metadata": {}, "source": [ - "

Introduction

\n", + "

Introduction

\n", "\n", - "

Call Detail Records (CDRs) contain valuable information about communication patterns and interactions between users. By leveraging community detection algorithms on CDR data, businesses can gain insights into the underlying network structure and uncover meaningful communities within their user base.

\n", + "

Call Detail Records (CDRs) contain valuable information about communication patterns and interactions between users. By leveraging community detection algorithms on CDR data, businesses can gain insights into the underlying network structure and uncover meaningful communities within their user base.

\n", "\n", - "

The objective of this analysis is to identify distinct communities or groups of users within the CDR network. Communities are like smaller social circles or friend groups within the larger group of friends. It helps us understand how people naturally form different clusters based on their interactions and relationships. This analysis also identifies influential people in the graph.\n", + "

The objective of this analysis is to identify distinct communities, or groups of users, within the CDR network. Communities are like smaller social circles or friend groups within a larger group of friends. Identifying them helps us understand how people naturally form clusters based on their interactions and relationships. This analysis also identifies influential people in the graph.\n", "
\n", "
\n", "By grouping users into communities based on their calling patterns, the business can better understand the dynamics and relationships among users, leading to several potential applications and benefits like Customer Segmentation, Fraud Detection, Network Optimization, Cross-Selling and Upselling Opportunities, Customer Support and Retention, etc.\n", "

\n", "\n", - "

In this demo, we'll be using Script Table Operator(STO) to execute custom python scripts on Vantage. The STO operates by executing R and Python scripts from the command line of the Advanced SQL Engine underlying operating system, according to\n", + "

In this demo, we'll use the Script Table Operator (STO) to execute custom Python scripts on Vantage. The STO executes R and Python scripts from the command line of the Advanced SQL Engine's underlying operating system, according to\n", "the following sequence:\n", "

\n", "\n", - "
    \n", + "
      \n", "
    1. The language script is installed on the Advanced SQL Engine of the target Teradata Vantage system via a call to\n", "an External Stored Procedure (XSP)
    2. \n", "
    3. The script is invoked by executing a SQL query that calls the STO
    4. \n", @@ -51,8 +51,8 @@ "id": "92c9fc4a-42bb-454a-8f7e-5b0983004501", "metadata": {}, "source": [ - "
      \n", - "

      Downloading and installing additional software needed" + "


      \n", + "

      Downloading and installing additional software needed" ] }, { @@ -64,8 +64,7 @@ "source": [ "%%capture\n", "# '%%capture' suppresses the display of installation steps of the following packages\n", - "!pip install python-louvain\n", - "!pip install mplcursors" + "!pip install python-louvain" ] }, { @@ -74,10 +73,10 @@ "metadata": {}, "source": [ "

      \n", - "

      Note: The above statements may need to be uncommented if you run the notebooks on a platform other than ClearScape Analytics Experience that does not have the libraries installed. If you uncomment those installs, be sure to restart the kernel after executing those lines to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing zero zero: 0 0

      \n", + "

      Note: The above statements may need to be uncommented if you run the notebooks on a platform other than ClearScape Analytics Experience that does not have the libraries installed. If you uncomment those installs, be sure to restart the kernel after executing those lines to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing zero zero: 0 0

      \n", "
      \n", "\n", - "

      Here, we import the required libraries, set environment variables and environment paths (if required).

      " + "

      Here, we import the required libraries, set environment variables and environment paths (if required).

      " ] }, { @@ -94,9 +93,7 @@ "import teradataml\n", "\n", "import community\n", - "import matplotlib.pyplot as plt\n", "import warnings\n", - "import mplcursors\n", "\n", "# Suppress warnings\n", "warnings.filterwarnings('ignore')" @@ -107,9 +104,9 @@ "id": "ced12adf-b2c4-49aa-b5c8-67dd090ff080", "metadata": {}, "source": [ - "
      \n", - "1. Connect to Vantage\n", - "

      You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.

      " + "
      \n", + "1. Connect to Vantage\n", + "

      You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.

      " ] }, { @@ -140,7 +137,7 @@ "id": "3732eef0-96bd-417b-95e3-403c1999afa8", "metadata": {}, "source": [ - "

      Begin running steps with Shift + Enter keys.

      " + "

      Begin running steps with Shift + Enter keys.

      " ] }, { @@ -148,8 +145,8 @@ "id": "1a96f1c6-cb43-440f-9682-a851fe09165f", "metadata": {}, "source": [ - "

      Getting Data for This Demo

      \n", - "

      We have provided data for this demo on cloud storage. You can either run the demo using foreign tables to access the data without any storage on your environment or download the data to local storage, which may yield faster execution. Still, there could be considerations of available storage. Two statements are in the following cell, and one is commented out. You may switch which mode you choose by changing the comment string.

      " + "

      Getting Data for This Demo

      \n", + "

      We have provided data for this demo on cloud storage. You can either run the demo using foreign tables to access the data without any storage on your environment or download the data to local storage, which may yield faster execution. Still, there could be considerations of available storage. Two statements are in the following cell, and one is commented out. You may switch which mode you choose by changing the comment string.

      " ] }, { @@ -168,7 +165,7 @@ "id": "658a3111-ab72-451c-85c2-de0a5f7a0188", "metadata": {}, "source": [ - "

      Next is an optional step – if you want to see the status of databases/tables created and space used.

      " + "

      Next is an optional step – if you want to see the status of databases/tables created and space used.

      " ] }, { @@ -186,9 +183,9 @@ "id": "e601e5e6-c6b3-466a-9a7b-f5ff16d2610a", "metadata": {}, "source": [ - "
      \n", - "2. Using Script Table Operator\n", - "

      Below is a sample of the data provided, where 'fromuserid' represents the caller and 'touserid' represents the callee.

      " + "
      \n", + "2. Using Script Table Operator\n", + "

      Below is a sample of the data provided, where 'fromuserid' represents the caller and 'touserid' represents the callee.

      " ] }, { @@ -206,10 +203,10 @@ "id": "97340221-0690-4a0f-814b-5a25c14a5c8a", "metadata": {}, "source": [ - "
      \n", - "

      Community Detection using Louvain Algorithm

      \n", - "

      The below cell will perform the following steps:

      \n", - "
        \n", + "
        \n", + "

        Community Detection using Louvain Algorithm

        \n", + "

        The below cell will perform the following steps:

        \n", + "
          \n", "
        1. Set SEARCHUIFDBPATH to demo_user
        2. \n", "
        3. Install the communities.py file on Vantage
        4. \n", "
        5. If the file is already installed, it will remove the file and install it again. This ensures we always have the latest script in Vantage.
        6. \n", @@ -258,7 +255,7 @@ "id": "5916690f-42d6-40b9-a663-669cd0349c1c", "metadata": {}, "source": [ - "

          The below cell will run the script installed in the above step and store the result in the communities variable.

          " + "

          The below cell will run the script installed in the above step and store the result in the communities variable.

          " ] }, { @@ -279,7 +276,7 @@ "id": "f15a9780-7fb3-4994-9d77-0e4d7a6be492", "metadata": {}, "source": [ - "

          \n", + "

          \n", "We have a big group of customers, and they all like to talk to each other on the phone. When we talk about communities, we are interested in finding groups of users who are closely connected to each other and interact more frequently among themselves.\n", "
          \n", "
          \n", @@ -291,13 +288,13 @@ "id": "b95dfc16-8e0d-4729-80d7-58d582ab8a75", "metadata": {}, "source": [ - "


          \n", - "

          Eigenvector Centrality

          \n", - "

          Eigenvector Centrality is an algorithm that measures the transitive influence of nodes. Relationships originating from high-scoring nodes contribute more to the score of a node than connections from low-scoring nodes. A high eigenvector score means that a node is connected to many nodes who themselves have high scores.\n", + "


          \n", + "

          Eigenvector Centrality

          \n", + "

          Eigenvector Centrality is an algorithm that measures the transitive influence of nodes. Relationships originating from high-scoring nodes contribute more to the score of a node than connections from low-scoring nodes. A high eigenvector score means that a node is connected to many nodes that themselves have high scores.\n", "
          \n", "
          \n", "The below cell will perform the following steps:

          \n", - "
            \n", + "
              \n", "
            1. Set SEARCHUIFDBPATH to demo_user
            2. \n", "
            3. Install the centralities.py file on Vantage
            4. \n", "
            5. If the file is already installed, it will remove the file and install it again. This ensures we always have the latest script in Vantage.
            6. \n", @@ -344,7 +341,7 @@ "id": "07ae94f0-8368-4a44-98d9-c8e26c45679c", "metadata": {}, "source": [ - "

              The below cell will run the script installed in the above step and store the result in the centralities variable.

              " + "

              The below cell will run the script installed in the above step and store the result in the centralities variable.

              " ] }, { @@ -363,7 +360,7 @@ "id": "b51bfbc5-66a4-4eda-b7f8-b5a7bd93abb0", "metadata": {}, "source": [ - "

              \n", + "

              \n", "We have a large group of customers, and they like to talk to each other on the phone. Some of the customers are very popular and talk to lots of other customers, while others talk to only a few customers. Eigenvector centrality is a way to measure how important or popular each person is in this group based on the phone calls they make. So, in our graph of phone calls, eigenvector centrality helps us identify the people who are most connected to others and who have important connections. These people are considered more influential or popular in the group. This information can be used to efficiently target the influential users and, in turn, their respective communities.

              " ] }, @@ -372,13 +369,13 @@ "id": "4e691b4f-0599-40cc-992d-2d12b02544a1", "metadata": {}, "source": [ - "
              \n", - "

              Betweenness Centrality

              \n", - "

              Betweenness centrality is a way of detecting the amount of influence a node has over the flow of information in a graph. It is often used to find nodes that serve as a bridge from one part of a graph to another. The algorithm calculates shortest paths between all pairs of nodes in a graph.\n", + "


              \n", + "

              Betweenness Centrality

              \n", + "

              Betweenness centrality is a way of detecting the amount of influence a node has over the flow of information in a graph. It is often used to find nodes that serve as a bridge from one part of a graph to another. The algorithm calculates shortest paths between all pairs of nodes in a graph.\n", "
              \n", "
              \n", "The below cell will perform the following steps:

              \n", - "
                \n", + "
                  \n", "
                1. Set SEARCHUIFDBPATH to demo_user
                2. \n", "
                3. Install the betweenness.py file on Vantage
                4. \n", "
                5. If the file is already installed, it will remove the file and install it again. This ensures we always have the latest script in Vantage.
                6. \n", @@ -425,7 +422,7 @@ "id": "446805c5-c139-4e11-ab36-0df3cb5c5bb5", "metadata": {}, "source": [ - "

                  The below cell will run the script installed in the above step and store the result in the betweenness variable.

                  " + "

                  The below cell will run the script installed in the above step and store the result in the betweenness variable.

                  " ] }, { @@ -444,7 +441,7 @@ "id": "7ae8c526-04af-441f-aa8f-2a8180b49c60", "metadata": {}, "source": [ - "

                  \n", + "

                  \n", "We have a group of customers, and they all like to talk to each other on the phone. Betweenness centrality is a way to measure how important or influential you are in this group based on the phone calls made by everyone. So, if you have a lot of customers who rely on you to connect with each other, it means you have high betweenness centrality. You're like a central hub in the group, helping users communicate and making sure everyone stays connected.\n", "
                  \n", "
                  \n", @@ -456,13 +453,13 @@ "id": "9bf951e4-016b-4fbf-b82f-29e3aac86ca0", "metadata": {}, "source": [ - "


                  \n", - "

                  Closeness Centrality

                  \n", - "

                  Closeness centrality is a way of detecting nodes that are able to spread information very efficiently through a graph. The closeness centrality of a node measures its average farness (inverse distance) to all other nodes. Nodes with a high closeness score have the shortest distances to all other nodes.\n", + "


                  \n", + "

                  Closeness Centrality

                  \n", + "

                  Closeness centrality is a way of detecting nodes that are able to spread information very efficiently through a graph. The closeness centrality of a node is the inverse of its average distance (farness) to all other nodes. Nodes with a high closeness score have the shortest distances to all other nodes.\n", "
                  \n", "
                  \n", "The below cell will perform the following steps:

                  \n", - "
                    \n", + "
                      \n", "
                    1. Set SEARCHUIFDBPATH to demo_user
                    2. \n", "
                    3. Install the closeness.py file on Vantage
                    4. \n", "
                    5. If the file is already installed, it will remove the file and install it again. This ensures we always have the latest script in Vantage.
                    6. \n", @@ -509,7 +506,7 @@ "id": "56692b56-b859-4091-9152-c6b49420fd5d", "metadata": {}, "source": [ - "

                      The below cell will run the script installed in the above step and store the result in the closeness variable.

                      " + "

                      The below cell will run the script installed in the above step and store the result in the closeness variable.

                      " ] }, { @@ -528,7 +525,7 @@ "id": "418b16e6-6090-4169-ad45-066df91b56d3", "metadata": {}, "source": [ - "

                      \n", + "

                      \n", "We have a group of customers, and you all enjoy talking to each other on the phone. Closeness centrality is a way to measure how close or connected you are to all users in the group. When we talk about closeness centrality, we are interested in figuring out how quickly you can reach all the users when you make a phone call. If you can reach the users easily and quickly, then you have high closeness centrality.\n", "
                      \n", "
                      \n", @@ -541,91 +538,114 @@ "id": "2b5b1d8b-18b5-4236-b05e-83d265030882", "metadata": {}, "source": [ - "


                      \n", - "3. Visualization" + "
                      \n", + "3. Visualization" ] }, { "cell_type": "code", "execution_count": null, - "id": "a9b00e2a-6c38-4fb5-8424-b591dc0c5ec6", + "id": "e49df417-7d32-43ab-84ac-21d9acee4e50", "metadata": {}, "outputs": [], "source": [ - "%matplotlib widget" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8f00c200-60d3-473e-9a5f-30613f946a62", - "metadata": {}, - "outputs": [], - "source": [ - "G = nx.from_pandas_edgelist(\n", - " DataFrame(in_schema('DEMO_GraphAnalysis', 'graph_data')).to_pandas().reset_index(),\n", - " source = 'fromuserid',\n", - " target = 'touserid',\n", - " create_using = nx.Graph()\n", - ")\n", + "import plotly.graph_objects as go\n", + "\n", + "# --- Load data from Teradata ---\n", + "graph_df = graph_data.to_pandas().reset_index()\n", "centrality = centralities.to_pandas().reset_index()\n", "cdf = communities.to_pandas().reset_index()\n", "\n", - "# Define the leader nodes\n", - "df_sorted = communities.merge(\n", - " centralities,\n", - " left_on = 'graph_node',\n", - " right_on = 'graph_node',\n", - " how = 'inner',\n", - " lsuffix = 'community',\n", - " rsuffix = 'centrality'\n", - ").to_pandas().sort_values('centrality', ascending = False)\n", - "\n", - "leader_nodes = df_sorted.groupby('community_id')['graph_node_community'].first().tolist()\n", - "\n", - "# Draw the graph with nodes colored based on communities\n", - "pos = nx.spring_layout(G)\n", - "cmap = plt.get_cmap('tab10') # Color map for communities\n", - "\n", - "plt.figure(figsize=(10, 6))\n", - "\n", - "# Create a dictionary to store community colors\n", - "community_colors = {}\n", - "\n", - "init_nodes = nx.draw_networkx_nodes(G, pos, node_size=200)\n", - "\n", - "for community_id in set(cdf.community_id):\n", - " nodes = list(cdf[cdf['community_id'] == community_id].graph_node)\n", - "\n", - " # Check if a node is a leader node\n", - " node_sizes = [800 if node in leader_nodes else 200 for node in nodes]\n", - " node_colors = 
[cmap(community_id) for node in nodes] \n", - " scatter = nx.draw_networkx_nodes(G, pos, nodelist = nodes, node_color = node_colors, node_size = node_sizes)\n", - "\n", - " # Store community color in the dictionary\n", - " community_colors[community_id] = scatter.get_facecolor()[0]\n", - "\n", - "nx.draw_networkx_edges(G, pos, width=0.5, alpha=0.5)\n", - "\n", - "# Create a custom legend with community colors\n", - "legend_labels = [f'Community {community_id}' for community_id in set(cdf.community_id)]\n", - "custom_legend = [plt.Line2D([], [], marker='o', markersize=10, color=community_colors[community_id], linestyle='', label=label) for community_id, label in zip(set(cdf.community_id), legend_labels)]\n", - "plt.legend(handles=custom_legend)\n", + "# --- Build NetworkX graph ---\n", + "G = nx.from_pandas_edgelist(\n", + " graph_df,\n", + " source='fromuserid',\n", + " target='touserid',\n", + " create_using=nx.Graph()\n", + ")\n", "\n", - "plt.title('Graph with Communities')\n", - "plt.axis('off')\n", + "# --- Build position layout ---\n", + "pos = nx.spring_layout(G, seed=42)\n", + "\n", + "# --- Create community and centrality maps for quick lookup ---\n", + "node_community_map = dict(zip(cdf.graph_node, cdf.community_id))\n", + "node_centrality_map = dict(zip(centrality.graph_node, centrality.centrality))\n", + "\n", + "# --- Build edge traces ---\n", + "edge_x, edge_y = [], []\n", + "for u, v in G.edges():\n", + " if u not in pos or v not in pos:\n", + " continue\n", + " x0, y0 = pos[u]\n", + " x1, y1 = pos[v]\n", + " edge_x += [x0, x1, None]\n", + " edge_y += [y0, y1, None]\n", + "\n", + "edge_trace = go.Scatter(\n", + " x=edge_x,\n", + " y=edge_y,\n", + " line=dict(width=0.5, color='#999'),\n", + " hoverinfo='none',\n", + " mode='lines'\n", + ")\n", "\n", - "# Add hover functionality to nodes\n", - "def update_annot(sel):\n", - " node_index = sel.target.index\n", - " node_name = list(G.nodes)[node_index]\n", - " text = 'Customer ID: ' + str(node_name) + 
'\\n EigenVector Centrality Score: ' + str(max(centrality[centrality['graph_node'] == node_name]['centrality']))\n", - " sel.annotation.set_text(text)\n", + "# --- Build node traces ---\n", + "node_x, node_y, hover_text, node_color, node_size = [], [], [], [], []\n", + "\n", + "for node in G.nodes():\n", + " if node not in pos:\n", + " continue\n", + " x, y = pos[node]\n", + " node_x.append(x)\n", + " node_y.append(y)\n", + "\n", + " comm = node_community_map.get(node, -1) # -1 marks a node with no community; keeps marker colors numeric for the colorscale\n", + " cent = node_centrality_map.get(node, 0)\n", + "\n", + " node_color.append(comm)\n", + " node_size.append(10) # All nodes same size, no leader emphasis\n", + " hover_text.append(\n", + " f\"Customer ID: {node}\"\n", + " f\"<br>Community ID: {comm}\"\n", + " f\"<br>Eigenvector Centrality: {cent:.4f}\"\n", + " )\n", + "\n", + "node_trace = go.Scatter(\n", + " x=node_x,\n", + " y=node_y,\n", + " mode='markers',\n", + " hoverinfo='text',\n", + " text=hover_text,\n", + " marker=dict(\n", + " size=node_size,\n", + " color=node_color,\n", + " colorscale='Turbo',\n", + " colorbar=dict(title='Community ID'),\n", + " line_width=1\n", + " )\n", + ")\n", "\n", - "cursor = mplcursors.cursor(init_nodes, hover=True)\n", - "cursor.connect('add', update_annot)\n", + "# --- Combine and show ---\n", + "fig = go.Figure(\n", + " data=[edge_trace, node_trace],\n", + " layout=go.Layout(\n", + " title='Graph with Communities',\n", + " showlegend=False,\n", + " hovermode='closest',\n", + " margin=dict(b=20, l=5, r=5, t=40),\n", + " height=500, # Maintain taller plot\n", + " annotations=[\n", + " dict(\n", + " text='Interactive visualization of communities',\n", + " showarrow=False,\n", + " xref='paper', yref='paper',\n", + " x=0, y=-0.1\n", + " )\n", + " ]\n", + " )\n", + ")\n", "\n", - "plt.show()" + "fig.show()\n" ] }, { @@ -634,7 +654,7 @@ "metadata": {}, "source": [ "
                      \n", - "

                      Note: Please hover over the nodes to see additional information.

                      \n", + "

                      Note: Please hover over the nodes to see additional information.

                      \n", "
                      " ] }, @@ -643,11 +663,11 @@ "id": "5db12928-d747-4f04-bcd3-671092f1aab7", "metadata": {}, "source": [ - "

                      The above graph displays the data in graph format. On hovering on the node, you might see Customer ID and the EigenVector Centrality Score i.e., the influence score. The larger nodes are influential and are connected to other influential nodes. These are the leader nodes of the respective communities.\n", + "

                      The above graph displays the data in graph format. Hovering over a node shows its Customer ID, its Community ID, and its Eigenvector Centrality score (the influence score). Nodes are colored by community and drawn at the same size. Within each community, the nodes with the highest Eigenvector Centrality scores are influential and are connected to other influential nodes. These are the leader nodes of the respective communities.\n", "
                      \n", "
                      Targeting the leader of the communities in a telecom dataset can provide several benefits to a telecom company. Here are some ways it can help:

                      \n", "\n", - "
                        \n", + "
                          \n", "
                        1. Influencing the community: Leaders of communities often hold significant influence over their members. By targeting and engaging with these leaders, a telecom company can leverage their influence to promote their products or services within the community. This can lead to increased brand awareness, customer acquisition, and loyalty.
                        2. \n", " \n", "
                        3. Word-of-mouth marketing: Community leaders are typically respected and trusted individuals within their communities. When they endorse or recommend a telecom company's offerings, it can have a powerful impact on the community members' decisions. This word-of-mouth marketing can result in positive brand perception, organic referrals, and a higher likelihood of community members becoming customers.
                        4. \n", @@ -659,7 +679,7 @@ "
                        5. Customer retention and satisfaction: Targeting community leaders allows a telecom company to prioritize customer satisfaction and address any issues or concerns promptly. By providing personalized support and assistance to these influential individuals, the company can improve overall customer experience, potentially leading to higher customer retention rates within the community.
                        6. \n", "
                        \n", "\n", - "

                        In summary, targeting leaders of telecom communities enables a company to tap into their influence, leverage word-of-mouth marketing, gain valuable insights, foster partnerships, and enhance customer satisfaction. These efforts can result in increased brand visibility, customer acquisition, and long-term success for the telecom company.

                        \n" + "

                        In summary, targeting leaders of telecom communities enables a company to tap into their influence, leverage word-of-mouth marketing, gain valuable insights, foster partnerships, and enhance customer satisfaction. These efforts can result in increased brand visibility, customer acquisition, and long-term success for the telecom company.

                        \n" ] }, { @@ -667,8 +687,8 @@ "id": "4a811fed-0b58-4770-ae1f-aae0e6b9be23", "metadata": {}, "source": [ - "
                        \n", - "4. Cleanup" + "
                        \n", + "4. Cleanup" ] }, { @@ -676,8 +696,8 @@ "id": "23c6332a-22e7-422b-a121-0b68540ca5e8", "metadata": {}, "source": [ - "

                        Databases and Tables

                        \n", - "

                        The following code will clean up tables and databases created above.

                        " + "

                        Databases and Tables

                        \n", + "

                        The following code will clean up tables and databases created above.

                        " ] }, { @@ -732,7 +752,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.10" + "version": "3.11.14" } }, "nbformat": 4,