Data profiling
Start database service from the XAMPP control panel
- Start the services from the XAMPP Control Panel.
Perform database queries
Select a database
- Select the database on which the processes described below should be performed.
Perform simple queries
- SQL query box
- Select SQL to get to the query box from the horizontal menu.
- Enter the query in the SQL query box.
- Select Go to submit the query to the server.
Run routines
-
Import a routine
- Select SQL query box from the horizontal menu.
- Copy the content of the text file containing the procedure you want to import and Paste it into the SQL query box.
- Select Go to submit the query to the server.
-
Run routines
- Select Routines from the horizontal menu.
- Select Execute for the routine that you want to run.
- Follow the instructions issued by the procedure.
Structural discovery
Structure
Procedure STRUCTURE returns information on the structure of a given table (tbl) in a database into a generic table of the name tbl_structure in database db_profile and a file of the general name dir\tbl_structure_YYYY-MM-DD hhmmss.csv. Target database and table are created if necessary.
-
Input
- name of the database to be analysed: db (VARCHAR(64) CHARSET utf8)
- name of the table to be analysed: tbl (VARCHAR(64) CHARSET utf8)
- target directory of the output file: dir (VARCHAR(256) CHARSET utf8)
-
Output
-
The output is written to:
- file: dir\tblstructureYYYY-MM-DD hhmmss.csv
- table: tbl_structure in database db_profile
-
It contains the following columns:
- Id: Number of the row.
- Name: Name of the column obtained from column COLUMN_NAME in table TABLE_SCHEMA of dataset information_schema.
- Format: Data format as defined in MariaDB, obtained from column COLUMN_TYPE in table TABLE_SCHEMA of dataset information_schema.
- Length: Minimum and Maximum length of characters, measured in code points.
- IsNull: Field may contain NULL values (YES, NO), obtained from column COLUMN_IS_NULLABLE in table TABLE_SCHEMA of dataset information_schema.
- Description: Content of column, obtained from column COLUMN_COMMENT in table TABLE_SCHEMA of dataset information_schema.
-
-
Procedure
- Run the query STRUCTURE in file structure.txt
Content
Procedure CONTENT returns information on the content of a given table (tbl) in a database into a generic table of the name tbl_content in database db_profile and a file of the general name dir\tbl_content_YYYY-MM-DD hhmmss.csv. Target database and table are created if necessary.
-
Input
- name of the database to be analysed: db (VARCHAR(64) CHARSET utf8)
- name of the table to be analysed: tbl (VARCHAR(64) CHARSET utf8)
- target directory of the output file: dir (VARCHAR(255) CHARSET utf8)
-
Output
-
The output is written to:
- file: dir\tbl_content_YYYY-MM-DD hhmmss.csv
- table: tbl_content in database db_profile
-
It contains the following columns:
- Id: Number of the row.
- Name: Name of the column.
- Average: Arithmetic mean of all values in the column.
- Minimum: Minimum of all values in the column.
- Maximum: Maximum of all values in the column.
- Ranges: Range of all values in the column.
- VoVar: Coefficient of variation based on the standard deviation of all values in the column.
- IsNull: Number and percentage of values in the column containing NULL.
- IsZero: Number and percentage of values in the column containing zero.
- IsBlank: Number and percentage of values in the column containing blank.
- IsDistinct: Number and percentage of values in the column containing blank.
-
-
Procedure
- Run the query CONTENT in file content.txt.
Duplicates
Procedure DUPLICATES returns duplicate records in a given table (tbl) in a database into a generic table of the name tbl_duplicates in database db_profile and a file of the general name dir\tbl_duplicates_YYYY-MM-DD hhmmss.csv. Target database and table are created if necessary.
-
Input
- name of the database to be analysed: db (VARCHAR(64) CHARSET utf8)
- name of the table to be analysed: tbl (VARCHAR(64) CHARSET utf8)
- target directory of the output file: dir (VARCHAR(255) CHARSET utf8)
-
Output
- The output is written to:
- file: dir\tbl_duplicates_YYYY-MM-DD hhmmss.csv
- table: tbl_duplicates in database db_profile
- It contains the following columns:
- $Id: Number of the row.
- $Duplicates: Number of duplicates of the record.
- Fields of the original table
- The output is written to:
-
Procedure
- Run the query DUPLICATES in file duplicates.txt.
Profile
Procedure PROFILE runs procedures STRUCTURE, CONTENT and DUPLICATES.
-
Input
- name of the database to be analysed: db_profile (VARCHAR(64) CHARSET utf8)
- target directory of the output file: dir_profile (VARCHAR(255) CHARSET utf8)
-
Output
- See procedures STRUCTURE, CONTENT and DUPLICATES.
-
Procedure
- Run the query PROFILE in file profile.txt.
Distinct instances overall
Procedure DISTINCTINSTANCEOVERALL returns information on distinct data occurrences (case sensitive) in a given column (col) in a given table (tbl).
-
Input
- table name: tbl (VARCHAR(64) CHARSET utf8)
- column name: col (VARCHAR(64) CHARSET utf8)
-
Output
- The results are displayed in the execution results of the procedure.
- It contains the following fields:
- Number: Number of instances.
- Percent: Percentage share of instances in total number of rows.
- col: Name of the column to be analysed
-
Procedure
- Run the query DISTINCTINSTANCEOVERALL in file distinctinstancesoverall.txt.
Distinct instances per group
Procedure DISTINCTINSTANCEGROUP returns information on distinct data occurrences (case sensitive) in a given column (colprobe) per occcurrence of another column (colgroup) in a given table (tbl).
-
Input
- table name: tbl (VARCHAR(64) CHARSET utf8)
- column name of the probe: colprobe (VARCHAR(64) CHARSET utf8)
- column name of the group: colgroup (VARCHAR(64) CHARSET utf8)
-
Output
- The results are displayed in the execution results of the procedure.
- It contains the following fields:
- colgroup: Name of the group
- Instances: Number of distinct occurrences per group.
- I_Percent: Percentage share of distinct occurrences per group in total number of distinct occurrences.
- Records: Number of total records per group.
- R_Percent: Percentage share of records per group in total number of records.
-
Procedure
- Run the query DISTINCTINSTANCEGROUP in file distinctinstancesgroup.txt.