Data profiling

Start database service from the XAMPP control panel

Start the services from the XAMPP Control Panel.

Perform database queries

Select a database

Select the database on which the processes described below should be performed.

Perform simple queries

SQL query box
- Select SQL to get to the query box from the horizontal menu.
- Enter the query in the SQL query box.
- Select Go to submit the query to the server.

Run routines

Import a routine
- Select SQL query box from the horizontal menu.
- Copy the content of the text file containing the procedure you want to import and Paste it into the SQL query box.
- Select Go to submit the query to the server.
Run routines
- Select Routines from the horizontal menu.
- Select Execute for the routine that you want to run.
- Follow the instructions issued by the procedure.

Structural discovery

Structure

Procedure STRUCTURE returns information on the structure of a given table (tbl) in a database into a generic table of the name tbl_structure in database db_profile and a file of the general name dir\tbl_structure_YYYY-MM-DD hhmmss.csv. Target database and table are created if necessary.

Input
- name of the database to be analysed: db (VARCHAR(64) CHARSET utf8)
- name of the table to be analysed: tbl (VARCHAR(64) CHARSET utf8)
- target directory of the output file: dir (VARCHAR(256) CHARSET utf8)
Output
- The output is written to:
  - file: dir\tblstructureYYYY-MM-DD hhmmss.csv
  - table: tbl_structure in database db_profile
- It contains the following columns:
  - Id: Number of the row.
  - Name: Name of the column obtained from column COLUMN_NAME in table TABLE_SCHEMA of dataset information_schema.
  - Format: Data format as defined in MariaDB, obtained from column COLUMN_TYPE in table TABLE_SCHEMA of dataset information_schema.
  - Length: Minimum and Maximum length of characters, measured in code points.
  - IsNull: Field may contain NULL values (YES, NO), obtained from column COLUMN_IS_NULLABLE in table TABLE_SCHEMA of dataset information_schema.
  - Description: Content of column, obtained from column COLUMN_COMMENT in table TABLE_SCHEMA of dataset information_schema.
Procedure
- Run the query STRUCTURE in file structure.txt

Content

Procedure CONTENT returns information on the content of a given table (tbl) in a database into a generic table of the name tbl_content in database db_profile and a file of the general name dir\tbl_content_YYYY-MM-DD hhmmss.csv. Target database and table are created if necessary.

Input
- name of the database to be analysed: db (VARCHAR(64) CHARSET utf8)
- name of the table to be analysed: tbl (VARCHAR(64) CHARSET utf8)
- target directory of the output file: dir (VARCHAR(255) CHARSET utf8)
Output
- The output is written to:
  - file: dir\tbl_content_YYYY-MM-DD hhmmss.csv
  - table: tbl_content in database db_profile
- It contains the following columns:
  - Id: Number of the row.
  - Name: Name of the column.
  - Average: Arithmetic mean of all values in the column.
  - Minimum: Minimum of all values in the column.
  - Maximum: Maximum of all values in the column.
  - Ranges: Range of all values in the column.
  - VoVar: Coefficient of variation based on the standard deviation of all values in the column.
  - IsNull: Number and percentage of values in the column containing NULL.
  - IsZero: Number and percentage of values in the column containing zero.
  - IsBlank: Number and percentage of values in the column containing blank.
  - IsDistinct: Number and percentage of values in the column containing blank.
Procedure
- Run the query CONTENT in file content.txt.

Duplicates

Procedure DUPLICATES returns duplicate records in a given table (tbl) in a database into a generic table of the name tbl_duplicates in database db_profile and a file of the general name dir\tbl_duplicates_YYYY-MM-DD hhmmss.csv. Target database and table are created if necessary.

Input
- name of the database to be analysed: db (VARCHAR(64) CHARSET utf8)
- name of the table to be analysed: tbl (VARCHAR(64) CHARSET utf8)
- target directory of the output file: dir (VARCHAR(255) CHARSET utf8)
Output
- The output is written to:
  - file: dir\tbl_duplicates_YYYY-MM-DD hhmmss.csv
  - table: tbl_duplicates in database db_profile
- It contains the following columns:
  - $Id: Number of the row.
  - $Duplicates: Number of duplicates of the record.
  - Fields of the original table
Procedure
- Run the query DUPLICATES in file duplicates.txt.

Profile

Procedure PROFILE runs procedures STRUCTURE, CONTENT and DUPLICATES.

Input
- name of the database to be analysed: db_profile (VARCHAR(64) CHARSET utf8)
- target directory of the output file: dir_profile (VARCHAR(255) CHARSET utf8)
Output
- See procedures STRUCTURE, CONTENT and DUPLICATES.
Procedure
- Run the query PROFILE in file profile.txt.

Distinct instances overall

Procedure DISTINCTINSTANCEOVERALL returns information on distinct data occurrences (case sensitive) in a given column (col) in a given table (tbl).

Input
- table name: tbl (VARCHAR(64) CHARSET utf8)
- column name: col (VARCHAR(64) CHARSET utf8)
Output
- The results are displayed in the execution results of the procedure.
- It contains the following fields:
  - Number: Number of instances.
  - Percent: Percentage share of instances in total number of rows.
  - col: Name of the column to be analysed
Procedure
- Run the query DISTINCTINSTANCEOVERALL in file distinctinstancesoverall.txt.

Distinct instances per group

Procedure DISTINCTINSTANCEGROUP returns information on distinct data occurrences (case sensitive) in a given column (colprobe) per occcurrence of another column (colgroup) in a given table (tbl).

Input
- table name: tbl (VARCHAR(64) CHARSET utf8)
- column name of the probe: colprobe (VARCHAR(64) CHARSET utf8)
- column name of the group: colgroup (VARCHAR(64) CHARSET utf8)
Output
- The results are displayed in the execution results of the procedure.
- It contains the following fields:
  - colgroup: Name of the group
  - Instances: Number of distinct occurrences per group.
  - I_Percent: Percentage share of distinct occurrences per group in total number of distinct occurrences.
  - Records: Number of total records per group.
  - R_Percent: Percentage share of records per group in total number of records.
Procedure
- Run the query DISTINCTINSTANCEGROUP in file distinctinstancesgroup.txt.