Welcome to Stambia MDM Pulse.
This guide explains how to use Stambia MDM Pulse Profiling to perform source data profiling and analyze data when starting your Master Data Management initiative.

Preface

Audience

This document is intended for Data Architects, Data Stewards and Business Users interested in using Stambia MDM Pulse Profiling to build insight on their data.

If you want to learn about MDM or discover Stambia MDM, you can watch our tutorials.
The Stambia MDM Documentation Library, including the development, administration and installation guides is available online.

Document Conventions

This document uses the following formatting conventions:

Convention Meaning

boldface

Boldface type indicates graphical user interface elements associated with an action, or a product specific term or concept.

italic

Italic type indicates special emphasis or placeholder variable that you need to provide.

monospace

Monospace type indicates code example, text or commands that you enter.

Other Stambia Resources

In addition to the product manuals, Stambia provides other resources available on its web site: http://www.stambia.com.

Obtaining Help

There are many ways to access the Stambia Technical Support. You can call or email our global Technical Support Center ([email protected]). For more information, see http://www.stambia.com.

Feedback

We welcome your comments and suggestions on the quality and usefulness of this documentation.
If you find any error or have any suggestion for improvement, please mail [email protected] and indicate the title of the documentation along with the chapter, section, and page number, if available. Please let us know if you want a reply.

Overview

Using this guide, you will learn:

  • The architecture of Pulse and its various components.

  • To install and configure Pulse.

  • To execute and schedule Pulse.

  • To access dashboards powered by Pulse.

Introduction to Stambia MDM Pulse

What is Stambia MDM Pulse?

Stambia MDM Pulse enables business users to:

  • Gather metrics and profile any source data to prepare a Master Data Management Initiative, using the Pulse Profiling component.

  • Create dashboards and KPIs to measure the health of their Master Data Stambia MDM Hub using the Pulse Metrics component.

This guide focuses on the Pulse Profiling. Pulse Metrics is covered in the Stambia MDM Installation, Administration and Developer’s guides.

About Pulse Profiling

Pulse Profiling scans data tables and gathers the following profiling statistics in a Pulse Profiling Warehouse:

  • Table metrics:

    • Number of records

  • Column metrics:

    • Lowest/highest values

    • Most frequent value

    • Uniqueness

    • Distinct value number and distribution

    • Pattern distribution

    • Null count

    • Minimum, Maximum and Average value length

Using these metrics, data architects have a clear assessment of their existing data quality. They can infer the structure and data rules to apply on the master data entities. Pulse Profiling helps in the design phase of the Master Data model, before and while implementing this model in Stambia MDM.

Architecture Overview

Pulse Profiling uses the following components:

  • Source Data Schemas are Oracle schemas containing the source applications’ data to profile. This data must be loaded in this schema using a data integration layer.

  • The Pulse Profiling Warehouse database schema contains the profiling statistics. You can profile several source data schemas and store their profiling statistics in a single Pulse Profiling Warehouse. This schema must be in the same data instance as the source data schemas.

  • Pulse Profiling is an executable component that gathers the statistics from the source data schemas and loads them into the Pulse Profiling Warehouse. This component is started as a command-line script and can be scheduled by an external scheduler.

  • The MS Excel Profiling Dashboard is a Microsoft Excel workbook containing built-in sample dashboards that connect to the Pulse Profiling Warehouse and allow business users and data stewards to analyze the profiling statistics. Alternately, you can connect a third party BI Platform to create your own dashboards and visualizations.

Make sure to co-locate the Pulse Profiling Warehouse and Source Data schemas within the same database instance.
Stambia MDM Pulse includes pre-packaged sample dashboards using Microsoft Excel. Any BI, dashboarding or visualization software able to connect an Oracle Schema can be used on top of Stambia MDM Pulse.

Installing Stambia MDM Pulse

This section explains how to install Stambia MDM Pulse.

Preparing to Install

Review the information in this section before you begin your installation.

System Requirements and Certification

Before installing Stambia MDM Pulse, you should read the system requirements and certification documents to ensure that your environment meets the minimum installation requirements.

The supported Java Runtime Environment (JRE) or Development Kit (JDK) versions for running the Stambia MDM Pulse Profiling executable are:

  • Java 7 - version 1.7.0 or above.

  • Java 8 - version 1.8.0 or above.

The JAVA_HOME (for a JDK) or JRE_HOME (for a JRE) environment variable must be configured to point to this installation of Java.

Type the java -version command from a command line interface to check the default Java version available on your machine. This Java version will be used by Stambia MDM Pulse.

The supported database versions for Stambia MDM Pulse are:

  • Oracle Database 10g Release 2 - 10.2.0.1-10.2.0.5

  • Oracle Database 11g Release 1: 11.1.0.6–11.1.0.7

  • Oracle Database 11g Release 2: 11.2.0.1–11.2.0.3

The Stambia MDM versions supported for Stambia MDM Pulse are:

  • Stambia MDM 2.2.x (2.2.1 and above) and 3.0.x.

Configuring Stambia MDM

Stambia MDM Pulse requires a working installation of Stambia MDM, configured with a license that supports Stambia MDM Pulse.

Stambia MDM Pulse requires a Pulse enabled license in Stambia MDM. Make sure that a Pulse enabled license is configured in the Stambia MDM instance.

Configuring the Schemas

Creating the Pulse Profiling Warehouse Schema

To use Stambia MDM Pulse Profiling, you must create an Oracle schema storing the Pulse Profiling Warehouse. You can create it manually or use your database administration interface for this purpose. In this section, we provide a sample script for creating this schema. Make sure to adapt this script to your database configuration.

CREATE USER <profiling_user_name> IDENTIFIED BY <profiling_user_password> DEFAULT TABLESPACE USERS TEMPORARY TABLESPACE TEMP;
GRANT CONNECT, RESOURCE, SELECT ANY TABLE, CREATE ANY VIEW TO <profiling_user_name>;

-- The following command should be used for Oracle 12c and above
GRANT UNLIMITED TABLESPACE TO <profiling_user_name>;
The Pulse Profiling user must have SELECT privileges on the tables that will be profiled. In the example, we grant SELECT ANY TABLE. It is recommended to GRANT SELECT specifically for each table that requires profiling.
Store the values of the <profiling_user_name> and <profiling_user_password> as you will need them later for creating the datasource to access the Pulse Profiling Warehouse.
Example 1. Using PROFILING_USER for the Pulse Profiling Warehouse user name
CREATE USER PROFILING_USER IDENTIFIED BY MyPassword DEFAULT TABLESPACE USERS TEMPORARY TABLESPACE TEMP;
GRANT CONNECT, RESOURCE, SELECT ANY TABLE, CREATE ANY VIEW TO PROFILING_USER;

-- The following command should be used for Oracle 12c and above
GRANT UNLIMITED TABLESPACE TO PROFILING_USER;

Installing Pulse

Pulse Profiling is available in the /pulse/profiling/ folder in your installation package of Stambia MDM.

Pulse is a component that is typically installed on a server machine, as it may need to run at regular intervals.

To install Stambia MDM Pulse:

  1. Copy the /pulse/profiling/ content in the Pulse installation directory, for example in /stambiamdm/pulse_profiling.

  2. If you require a specific Oracle JDBC driver for your Oracle database version, copy this driver in the /lib folder of the Pulse installation directory.

Package Content

The archive file contains the following files and folders:

File/folder Description

/bin

Scripts to run Pulse Profiling.

/conf

Pulse configuration files.

/dashboards

Sample Microsoft Excel Dashboards.

/doc

Product documentation.

/lib

Binaries and libraries required for Pulse.

/log

Directory storing the execution.

Configuring Pulse

Environment Variables

Pulse uses one environment variable called PULSE_HOME that contains the path to the Pulse installation folder.
If this variable is not set, then the scripts must be launched from the /bin/ sub-directory of the Pulse installation folder.

Pulse Configuration File

An important phase of the configuration consists in defining the connections Pulse Profiling Warehouse.
This connection is configured as a datasource in the /conf/pulse_configuration.xml configuration file, located in the Pulse installation directory.

The format of the datasource definition is provided below:

<profilingDatasource url="jdbc:oracle:thin:@<oracle_instance_hostname>:<oracle_listener_port>:<oracle_SID_name>"
                     userName="<profiling_user_name>"
                     password="<profiling_user_password>"/>
Example 2. Datasource definition example using the PROFILING_USER user on a local oracle instance.
<profilingDatasource url="jdbc:oracle:thin:@localhost:1521:XE"
                     userName="PROFILING_USER" password="xxx"/>

Creating the Pulse Structures

The Pulse Profiling Warehouse structure is created using a script available from the /bin directory.

Creating the Pulse Profiling Warehouse

To create the table structures for Pulse Profiling:

  1. Open a Windows Command or UNIX/Linux Shell.

  2. Go to the /bin/ sub-directory of the Pulse installation folder.

  3. Run the profiling.bat or profiling.sh command, using the following syntax:

    • For Windows:

      profiling.bat action=create [conf=<configuration_file>] [log=<log_directory>]
    • For UNIX/Linux:

      ./profiling.sh action=create [conf=<configuration_file>] [log=<log_directory>]

The following parameters are optional:

  • conf=<configuration_file>: the full path to Pulse configuration file. If omitted Pulse will try to use the default configuration file in ../conf/pulse_configuration.xml

  • log=<log_directory>: Logging directory for the XML log files produced by the jobs executed by Pulse. If omitted Pulse will not log anything.

Use the same command with action=drop to drop the Pulse Profiling table structures.

Running and Scheduling Stambia MDM Pulse

The Pulse executable component can be launched from the command line using the profile script.

Running Stambia MDM Pulse Profiling Executable

To run Stambia MDM Pulse Profiling Executable:

  1. Open a Windows Command or UNIX/Linux Shell.

  2. Go to the Pulse installation folder.

  3. Run the profiling.bat or profiling.sh command, using the following syntax:

    profiling [action=create|drop|profile]
              [schema=<schema_name>] [table=<table_name>] [sample=<sample_size>]
              [conf=<configuration_file>] [log=<log_directory>]

The available options are listed below:

  • action: The action to complete. If omitted, this option defaults to profile. Possible actions are:

    • create: Creates the Pulse Profiling tables and views. This action is used in the installation process.

    • drop: Drops the Pulse Profiling tables and views.

    • profile (default): Profiles a set of tables stored in <schema_name>. This set of tables may be filtered by a pattern provided in the table parameter.

  • schema: Name of the schema containing the table(s) to profile. Note that this name should be provided in uppercase. It is a mandatory parameter for the profile action.

  • table: Name of the table to profile or pattern corresponding to the list of tables to profile (using %). Note that depending on the Oracle instance configuration, the table name may be case-sensitive or not. This name should be provided in uppercase. If this parameter is not specified all tables in the schema are profiled.

  • sample: Number of distinct values or patterns to analyze per column. Defaults to 1000.

  • conf: The full path to Pulse configuration file. If omitted, the ../conf/pulse_configuration.xml configuration file is used.

  • log: The logging directory for the XML log files produced for the executed jobs. If omitted nothing is logged.

Scheduling the Pulse Executables

Pulse Profiling should be executed when you need to profile a new or updated source dataset. It takes a snapshot of the source data status and is not made to keep a data profiling history.

You may schedule the Pulse Profiling executable through its script using your enterprise scheduler.

Connecting to Stambia MDM Pulse

Once Pulse Profiling is loaded or refreshed, you can visualize the metrics and statistics using dashboards in Microsoft Excel: The PROFILING_DASHBOARD.xls file contains default dashboards to analyze the content of the Pulse Profiling Warehouse.

The PROFILING_DASHBOARD.xlsx dashboard file is available in the /pulse/dashboards/ folder in your installation package of Stambia MDM.

Configuring the Excel Reports

To configure a dashboard to connect Pulse:

  1. If you want to use ODBC Connections to Oracle, configure the ODBC connections to the Pulse Profiling Warehouse schema.

  2. Make a copy of the PROFILING_DASHBOARD.xlsx dashboard file that you want to use, and open it with Microsoft Excel.

  3. Enable content and editing mode on this file.

  4. Select Data > Connections. There are several workbook connections by default.

  5. Open each workbook connection to connect to your Oracle instance.

    1. Double-click a connection. The Connection Properties dialog opens.

    2. Select the Definition tab.

    3. Edit the Connection String to connect to your Oracle instance with the profiling user. Tip: You can create a new connection and copy paste its connection string there. See below some connection string examples.

  6. Once all the connection strings are reset, the dashboard and content should be refreshed. You may force the refresh using the Data > Refresh All action.

Example 3. Oracle XE sample connection string, using the Oracle Driver, without a datasource
DRIVER={Oracle in XE};SERVER=XE;UID=PROFILING_USER;;DBQ=XE;DBA=W;APA=T;EXC=F;XSM=Default;FEN=T;QTO=T;FRC=10;FDL=10;LOB=T;RST=T;BTD=F;BNF=F;BAM=IfAllSuccessful;NUM=NLS;DPM=F;MTS=T;MDI=Me;CSR=F;FWC=F;FBS=60000;TLO=O;MLD=0;ODA=F;

In this example:

  • DRIVER contains the Oracle Driver name

  • SERVER is the XE server name (TNS Alias).

  • UID is the user name of the profiling user.

  • DBQ is the name of the instance.

Example 4. Oracle sample connection string, using a Data Source
DSN=ORACLE_ORA11;UID=PROFILING_USER;DBQ=ORA11;DBA=W;APA=T;EXC=F;FEN=T;QTO=T;FRC=10;FDL=10;LOB=T;RST=T;BTD=F;BNF=F;BAM=IfAllSuccessful;NUM=NLS;DPM=F;MTS=T;MDI=F;CSR=F;FWC=F;FBS=64000;TLO=O;MLD=0;ODA=F;

In this example:

  • DSN contains the name of the ODBC datasource to the Oracle schema

  • UID is the user name of the profiling user.

  • DBQ is the name of the instance.

Using the Pulse Profiling Dashboard

The Pulse Profiling Dashboard contains the following sections:

  • Tables Profiling - Profiling - List of Tables provides a list of tables with their record count.

  • Column Profiling - Profiling - Columns Analysis provides for the profiled columns the lowest, highest, most frequent values and other value metrics. Use this dashboard to infer the correct sizing for a column, the value range and mandatory nature of a column.

  • Column Value Analysis - Profiling - Column Values Analysis provides the value distribution for the column. Use this dashboard to infer whether the values for a column should be restricted to a list of values.

  • Pattern Analysis - Profiling - Column Patterns Analysis provides a list of value patterns appearing in the columns with their frequencies. Use this dashboard to infer the nature of the column (Phone Number, SSID, etc.) and rules on the format of the value.

Using Your BI Tools with Pulse

You can use your own BI Tool to connect and perform analysis on Stambia MDM Pulse.

The Pulse Profiling Warehouse is an Oracle schema, accessible via JDBC, ODBC or other connectivity.
The Appendix A is a schema reference to help you build dashboards from this schema using your Business Intelligence tools.

Refer to the built-in dashboards in Microsoft Excel format for sample dashboards based on these schemas.
Make sure to access these schemas for SELECT purposes only. Do not attempt to insert or update data in these schemas.

Appendices

Appendix A: Pulse Profiling Warehouse Reference

The Pulse Profiling Warehouse contains a set of table that provide the Table and Column statistics, and two tables for the values and pattern distributions.

SEM_TABLE Table

This table contains the list of profiled tables.

Column Name Description

OWNER

PK, Not Null

Name of the schema containing the profiled table.

TABLENAME

PK, Not Null

Name of the profiled table.

NUM_ROWS

Not Null

Count of records for the profiled table

SEM_COLUMN Table

This table contains the list of profiled columns.

Column Name Description

OWNER

PK, Not Null

Name of the schema containing the profiled table.

TABLENAME

PK, Not Null

Name of the profiled table.

COLUMNNAME

PK, Not Null

Name of the profiled column.

LOW_VAL

Lowest Value.

HIGH_VAL

Highest Value.

MOST_FREQ_VAL

Most frequent value.

NUM_DIST_VLS

Number of distinct values.

NUM_NULLS

Number of null values.

MIN_LENGTH

Minimum length.

MAX_LENGTH

Maximum length.

AVG_LENGTH

Average length.

SEM_DIST_VALS Table

This table contains the list of distinct value for each profiled column.

Column Name Description

OWNER

PK, Not Null

Name of the schema containing the profiled table.

TABLENAME

PK, Not Null

Name of the profiled table.

COLUMNNAME

PK, Not Null

Name of the profiled column.

DIST_VAL

PK, Not Null

Column Value.

NUM_ROWS

Number of occurrences of the value.

SEM_DIST_PATTERNS Table

This table contains the list of distinct patterns for each profiled column.

Column Name Description

OWNER

PK, Not Null

Name of the schema containing the profiled table.

TABLENAME

PK, Not Null

Name of the profiled table.

COLUMNNAME

PK, Not Null

Name of the profiled column.

DIST_PATTERN

PK, Not Null

Pattern.

NUM_ROWS

Number of occurrences of the pattern in the column.

HIGH_VAL

Highest value of this pattern in the column.

LOW_VAL

Lowest value of this pattern in the column.