Welcome to Stambia MDM Pulse.
This guide explains how to use Stambia MDM Pulse Profiling to perform source data profiling and analyze data when starting your Master Data Management initiative.
Preface
Audience
If you want to learn about MDM or discover Stambia MDM, you can watch our tutorials. |
The Stambia MDM Documentation Library, including the development, administration and installation guides is available online. |
Document Conventions
This document uses the following formatting conventions:
Convention | Meaning |
---|---|
boldface |
Boldface type indicates graphical user interface elements associated with an action, or a product specific term or concept. |
italic |
Italic type indicates special emphasis or placeholder variable that you need to provide. |
|
Monospace type indicates code example, text or commands that you enter. |
Other Stambia Resources
In addition to the product manuals, Stambia provides other resources available on its web site: http://www.stambia.com.
Obtaining Help
There are many ways to access the Stambia Technical Support. You can call or email our global Technical Support Center ([email protected]). For more information, see http://www.stambia.com.
Feedback
We welcome your comments and suggestions on the quality and usefulness
of this documentation.
If you find any error or have any suggestion for improvement, please
mail [email protected] and indicate the title of the documentation
along with the chapter, section, and page number, if available. Please
let us know if you want a reply.
Overview
Using this guide, you will learn:
-
The architecture of Pulse and its various components.
-
To install and configure Pulse.
-
To execute and schedule Pulse.
-
To access dashboards powered by Pulse.
Introduction to Stambia MDM Pulse
What is Stambia MDM Pulse?
Stambia MDM Pulse enables business users to:
-
Gather metrics and profile any source data to prepare a Master Data Management Initiative, using the Pulse Profiling component.
-
Create dashboards and KPIs to measure the health of their Master Data Stambia MDM Hub using the Pulse Metrics component.
This guide focuses on the Pulse Profiling. Pulse Metrics is covered in the Stambia MDM Installation, Administration and Developer’s guides.
About Pulse Profiling
Pulse Profiling scans data tables and gathers the following profiling statistics in a Pulse Profiling Warehouse:
-
Table metrics:
-
Number of records
-
-
Column metrics:
-
Lowest/highest values
-
Most frequent value
-
Uniqueness
-
Distinct value number and distribution
-
Pattern distribution
-
Null count
-
Minimum, Maximum and Average value length
-
Using these metrics, data architects have a clear assessment of their existing data quality. They can infer the structure and data rules to apply on the master data entities. Pulse Profiling helps in the design phase of the Master Data model, before and while implementing this model in Stambia MDM.
Architecture Overview
Pulse Profiling uses the following components:
-
Source Data Schemas are Oracle schemas containing the source applications’ data to profile. This data must be loaded in this schema using a data integration layer.
-
The Pulse Profiling Warehouse database schema contains the profiling statistics. You can profile several source data schemas and store their profiling statistics in a single Pulse Profiling Warehouse. This schema must be in the same data instance as the source data schemas.
-
Pulse Profiling is an executable component that gathers the statistics from the source data schemas and loads them into the Pulse Profiling Warehouse. This component is started as a command-line script and can be scheduled by an external scheduler.
-
The MS Excel Profiling Dashboard is a Microsoft Excel workbook containing built-in sample dashboards that connect to the Pulse Profiling Warehouse and allow business users and data stewards to analyze the profiling statistics. Alternately, you can connect a third party BI Platform to create your own dashboards and visualizations.
Make sure to co-locate the Pulse Profiling Warehouse and Source Data schemas within the same database instance. |
Stambia MDM Pulse includes pre-packaged sample dashboards using Microsoft Excel. Any BI, dashboarding or visualization software able to connect an Oracle Schema can be used on top of Stambia MDM Pulse. |
Installing Stambia MDM Pulse
This section explains how to install Stambia MDM Pulse.
Preparing to Install
Review the information in this section before you begin your installation.
System Requirements and Certification
Before installing Stambia MDM Pulse, you should read the system requirements and certification documents to ensure that your environment meets the minimum installation requirements.
The supported Java Runtime Environment (JRE) or Development Kit (JDK) versions for running the Stambia MDM Pulse Profiling executable are:
-
Java 7 - version 1.7.0 or above.
-
Java 8 - version 1.8.0 or above.
The JAVA_HOME (for a JDK) or JRE_HOME (for a JRE) environment variable must be configured to point to this installation of Java.
Type the java -version command from a command line interface
to check the default Java version available on your machine. This Java
version will be used by Stambia MDM Pulse.
|
The supported database versions for Stambia MDM Pulse are:
-
Oracle Database 10g Release 2 - 10.2.0.1-10.2.0.5
-
Oracle Database 11g Release 1: 11.1.0.6–11.1.0.7
-
Oracle Database 11g Release 2: 11.2.0.1–11.2.0.3
The Stambia MDM versions supported for Stambia MDM Pulse are:
-
Stambia MDM 2.2.x (2.2.1 and above) and 3.0.x.
Configuring Stambia MDM
Stambia MDM Pulse requires a working installation of Stambia MDM, configured with a license
that supports Stambia MDM Pulse.
Stambia MDM Pulse requires a Pulse enabled license in Stambia MDM. Make sure that a Pulse enabled license is configured in the Stambia MDM instance. |
Configuring the Schemas
Creating the Pulse Profiling Warehouse Schema
To use Stambia MDM Pulse Profiling, you must create an Oracle schema storing the Pulse Profiling Warehouse. You can create it manually or use your database administration interface for this purpose. In this section, we provide a sample script for creating this schema. Make sure to adapt this script to your database configuration.
CREATE USER <profiling_user_name> IDENTIFIED BY <profiling_user_password> DEFAULT TABLESPACE USERS TEMPORARY TABLESPACE TEMP;
GRANT CONNECT, RESOURCE, SELECT ANY TABLE, CREATE ANY VIEW TO <profiling_user_name>;
-- The following command should be used for Oracle 12c and above
GRANT UNLIMITED TABLESPACE TO <profiling_user_name>;
The Pulse Profiling user must have SELECT privileges on the
tables that will be profiled. In the example, we grant SELECT ANY TABLE .
It is recommended to GRANT SELECT specifically for each table that
requires profiling.
|
Store the values of the <profiling_user_name> and
<profiling_user_password> as you will need them later for creating the
datasource to access the Pulse Profiling Warehouse.
|
PROFILING_USER
for the Pulse Profiling Warehouse user nameCREATE USER PROFILING_USER IDENTIFIED BY MyPassword DEFAULT TABLESPACE USERS TEMPORARY TABLESPACE TEMP;
GRANT CONNECT, RESOURCE, SELECT ANY TABLE, CREATE ANY VIEW TO PROFILING_USER;
-- The following command should be used for Oracle 12c and above
GRANT UNLIMITED TABLESPACE TO PROFILING_USER;
Installing Pulse
Pulse Profiling is available in the /pulse/profiling/
folder in your installation package of Stambia MDM.
Pulse is a component that is typically installed on a server machine, as it may need to run at regular intervals.
To install Stambia MDM Pulse:
-
Copy the
/pulse/profiling/
content in the Pulse installation directory, for example in/stambiamdm/pulse_profiling
. -
If you require a specific Oracle JDBC driver for your Oracle database version, copy this driver in the
/lib
folder of the Pulse installation directory.
Package Content
The archive file contains the following files and folders:
File/folder | Description |
---|---|
|
Scripts to run Pulse Profiling. |
|
Pulse configuration files. |
|
Sample Microsoft Excel Dashboards. |
|
Product documentation. |
|
Binaries and libraries required for Pulse. |
|
Directory storing the execution. |
Configuring Pulse
Environment Variables
Pulse uses one environment variable called PULSE_HOME
that contains
the path to the Pulse installation folder.
If this variable is not set, then the scripts must be launched from the
/bin/
sub-directory of the Pulse installation folder.
Pulse Configuration File
An important phase of the configuration consists in defining the connections Pulse Profiling Warehouse.
This connection is configured as a datasource in the /conf/pulse_configuration.xml
configuration file, located in the Pulse installation directory.
The format of the datasource definition is provided below:
<profilingDatasource url="jdbc:oracle:thin:@<oracle_instance_hostname>:<oracle_listener_port>:<oracle_SID_name>"
userName="<profiling_user_name>"
password="<profiling_user_password>"/>
PROFILING_USER
user on a local oracle instance.<profilingDatasource url="jdbc:oracle:thin:@localhost:1521:XE"
userName="PROFILING_USER" password="xxx"/>
Creating the Pulse Structures
The Pulse Profiling Warehouse structure
is created using a script available from the /bin
directory.
Creating the Pulse Profiling Warehouse
To create the table structures for Pulse Profiling:
-
Open a Windows Command or UNIX/Linux Shell.
-
Go to the
/bin/
sub-directory of the Pulse installation folder. -
Run the
profiling.bat
orprofiling.sh
command, using the following syntax:-
For Windows:
profiling.bat action=create [conf=<configuration_file>] [log=<log_directory>]
-
For UNIX/Linux:
./profiling.sh action=create [conf=<configuration_file>] [log=<log_directory>]
-
The following parameters are optional:
-
conf=<configuration_file>
: the full path to Pulse configuration file. If omitted Pulse will try to use the default configuration file in../conf/pulse_configuration.xml
-
log=<log_directory>
: Logging directory for the XML log files produced by the jobs executed by Pulse. If omitted Pulse will not log anything.
Use the same command with action=drop to drop the Pulse Profiling table structures.
|
Running and Scheduling Stambia MDM Pulse
The Pulse executable component can be launched from the command line using the profile script.
Running Stambia MDM Pulse Profiling Executable
To run Stambia MDM Pulse Profiling Executable:
-
Open a Windows Command or UNIX/Linux Shell.
-
Go to the Pulse installation folder.
-
Run the
profiling.bat
orprofiling.sh
command, using the following syntax:profiling [action=create|drop|profile] [schema=<schema_name>] [table=<table_name>] [sample=<sample_size>] [conf=<configuration_file>] [log=<log_directory>]
The available options are listed below:
-
action
: The action to complete. If omitted, this option defaults to profile. Possible actions are:-
create
: Creates the Pulse Profiling tables and views. This action is used in the installation process. -
drop
: Drops the Pulse Profiling tables and views. -
profile
(default): Profiles a set of tables stored in<schema_name>
. This set of tables may be filtered by a pattern provided in thetable
parameter.
-
-
schema
: Name of the schema containing the table(s) to profile. Note that this name should be provided in uppercase. It is a mandatory parameter for the profile action. -
table
: Name of the table to profile or pattern corresponding to the list of tables to profile (using %). Note that depending on the Oracle instance configuration, the table name may be case-sensitive or not. This name should be provided in uppercase. If this parameter is not specified all tables in the schema are profiled. -
sample
: Number of distinct values or patterns to analyze per column. Defaults to 1000. -
conf
: The full path to Pulse configuration file. If omitted, the../conf/pulse_configuration.xml
configuration file is used. -
log
: The logging directory for the XML log files produced for the executed jobs. If omitted nothing is logged.
Scheduling the Pulse Executables
Pulse Profiling should be executed when you need to profile a new or updated source dataset. It takes a snapshot of the source data status and is not made to keep a data profiling history.
You may schedule the Pulse Profiling executable through its script using your enterprise scheduler.
Connecting to Stambia MDM Pulse
Once Pulse Profiling is loaded or refreshed, you can visualize the metrics and statistics using dashboards in Microsoft Excel: The PROFILING_DASHBOARD.xls file contains default dashboards to analyze the content of the Pulse Profiling Warehouse.
The PROFILING_DASHBOARD.xlsx
dashboard file is available in the /pulse/dashboards/
folder in your installation package of Stambia MDM.
Configuring the Excel Reports
To configure a dashboard to connect Pulse:
-
If you want to use ODBC Connections to Oracle, configure the ODBC connections to the Pulse Profiling Warehouse schema.
-
Make a copy of the
PROFILING_DASHBOARD.xlsx
dashboard file that you want to use, and open it with Microsoft Excel. -
Enable content and editing mode on this file.
-
Select Data > Connections. There are several workbook connections by default.
-
Open each workbook connection to connect to your Oracle instance.
-
Double-click a connection. The Connection Properties dialog opens.
-
Select the Definition tab.
-
Edit the Connection String to connect to your Oracle instance with the profiling user. Tip: You can create a new connection and copy paste its connection string there. See below some connection string examples.
-
-
Once all the connection strings are reset, the dashboard and content should be refreshed. You may force the refresh using the Data > Refresh All action.
DRIVER={Oracle in XE};SERVER=XE;UID=PROFILING_USER;;DBQ=XE;DBA=W;APA=T;EXC=F;XSM=Default;FEN=T;QTO=T;FRC=10;FDL=10;LOB=T;RST=T;BTD=F;BNF=F;BAM=IfAllSuccessful;NUM=NLS;DPM=F;MTS=T;MDI=Me;CSR=F;FWC=F;FBS=60000;TLO=O;MLD=0;ODA=F;
In this example:
-
DRIVER
contains the Oracle Driver name -
SERVER
is the XE server name (TNS Alias). -
UID
is the user name of the profiling user. -
DBQ
is the name of the instance.
DSN=ORACLE_ORA11;UID=PROFILING_USER;DBQ=ORA11;DBA=W;APA=T;EXC=F;FEN=T;QTO=T;FRC=10;FDL=10;LOB=T;RST=T;BTD=F;BNF=F;BAM=IfAllSuccessful;NUM=NLS;DPM=F;MTS=T;MDI=F;CSR=F;FWC=F;FBS=64000;TLO=O;MLD=0;ODA=F;
In this example:
-
DSN
contains the name of the ODBC datasource to the Oracle schema -
UID
is the user name of the profiling user. -
DBQ
is the name of the instance.
Using the Pulse Profiling Dashboard
The Pulse Profiling Dashboard contains the following sections:
-
Tables Profiling - Profiling - List of Tables provides a list of tables with their record count.
-
Column Profiling - Profiling - Columns Analysis provides for the profiled columns the lowest, highest, most frequent values and other value metrics. Use this dashboard to infer the correct sizing for a column, the value range and mandatory nature of a column.
-
Column Value Analysis - Profiling - Column Values Analysis provides the value distribution for the column. Use this dashboard to infer whether the values for a column should be restricted to a list of values.
-
Pattern Analysis - Profiling - Column Patterns Analysis provides a list of value patterns appearing in the columns with their frequencies. Use this dashboard to infer the nature of the column (Phone Number, SSID, etc.) and rules on the format of the value.
Using Your BI Tools with Pulse
You can use your own BI Tool to connect and perform analysis on Stambia MDM Pulse.
The Pulse Profiling Warehouse is an Oracle
schema, accessible via JDBC, ODBC or other connectivity.
The Appendix A is a schema reference to help you build dashboards from this schema using your Business Intelligence tools.
Refer to the built-in dashboards in Microsoft Excel format for sample dashboards based on these schemas. |
Make sure to access these schemas for SELECT purposes only. Do not attempt to insert or update data in these schemas. |
Appendices
Appendix A: Pulse Profiling Warehouse Reference
The Pulse Profiling Warehouse contains a set of table that provide the Table and Column statistics, and two tables for the values and pattern distributions.
SEM_TABLE Table
This table contains the list of profiled tables.
Column Name | Description | |
---|---|---|
|
PK, Not Null |
Name of the schema containing the profiled table. |
|
PK, Not Null |
Name of the profiled table. |
|
Not Null |
Count of records for the profiled table |
SEM_COLUMN Table
This table contains the list of profiled columns.
Column Name | Description | |
---|---|---|
|
PK, Not Null |
Name of the schema containing the profiled table. |
|
PK, Not Null |
Name of the profiled table. |
|
PK, Not Null |
Name of the profiled column. |
|
Lowest Value. |
|
|
Highest Value. |
|
|
Most frequent value. |
|
|
Number of distinct values. |
|
|
Number of null values. |
|
|
Minimum length. |
|
|
Maximum length. |
|
|
Average length. |
SEM_DIST_VALS Table
This table contains the list of distinct value for each profiled column.
Column Name | Description | |
---|---|---|
|
PK, Not Null |
Name of the schema containing the profiled table. |
|
PK, Not Null |
Name of the profiled table. |
|
PK, Not Null |
Name of the profiled column. |
|
PK, Not Null |
Column Value. |
|
Number of occurrences of the value. |
SEM_DIST_PATTERNS Table
This table contains the list of distinct patterns for each profiled column.
Column Name | Description | |
---|---|---|
|
PK, Not Null |
Name of the schema containing the profiled table. |
|
PK, Not Null |
Name of the profiled table. |
|
PK, Not Null |
Name of the profiled column. |
|
PK, Not Null |
Pattern. |
|
Number of occurrences of the pattern in the column. |
|
|
Highest value of this pattern in the column. |
|
|
Lowest value of this pattern in the column. |