Data profiling

Data profiling

Start database service from the XAMPP control panel

  1. Start the services from the XAMPP Control Panel.

Perform database queries

Select a database

  1. Select the database on which the processes described below should be performed.

Perform simple queries

  1. SQL query box
    • Select SQL to get to the query box from the horizontal menu.
    • Enter the query in the SQL query box.
    • Select Go to submit the query to the server.

Run routines

  1. Import a routine

    • Select SQL query box from the horizontal menu.
    • Copy the content of the text file containing the procedure you want to import and Paste it into the SQL query box.
    • Select Go to submit the query to the server.
  2. Run routines

    • Select Routines from the horizontal menu.
    • Select Execute for the routine that you want to run.
    • Follow the instructions issued by the procedure.

Structural discovery

Structure

Procedure STRUCTURE returns information on the structure of a given table (tbl) in a database into a generic table of the name tbl_structure in database db_profile and a file of the general name dir\tbl_structure_YYYY-MM-DD hhmmss.csv. Target database and table are created if necessary.

  1. Input

    • name of the database to be analysed: db (VARCHAR(64) CHARSET utf8)
    • name of the table to be analysed: tbl (VARCHAR(64) CHARSET utf8)
    • target directory of the output file: dir (VARCHAR(256) CHARSET utf8)
  2. Output

    • The output is written to:

      • file: dir\tblstructureYYYY-MM-DD hhmmss.csv
      • table: tbl_structure in database db_profile
    • It contains the following columns:

      • Id: Number of the row.
      • Name: Name of the column obtained from column COLUMN_NAME in table TABLE_SCHEMA of dataset information_schema.
      • Format: Data format as defined in MariaDB, obtained from column COLUMN_TYPE in table TABLE_SCHEMA of dataset information_schema.
      • Length: Minimum and Maximum length of characters, measured in code points.
      • IsNull: Field may contain NULL values (YES, NO), obtained from column COLUMN_IS_NULLABLE in table TABLE_SCHEMA of dataset information_schema.
      • Description: Content of column, obtained from column COLUMN_COMMENT in table TABLE_SCHEMA of dataset information_schema.
  3. Procedure

Content

Procedure CONTENT returns information on the content of a given table (tbl) in a database into a generic table of the name tbl_content in database db_profile and a file of the general name dir\tbl_content_YYYY-MM-DD hhmmss.csv. Target database and table are created if necessary.

  1. Input

    • name of the database to be analysed: db (VARCHAR(64) CHARSET utf8)
    • name of the table to be analysed: tbl (VARCHAR(64) CHARSET utf8)
    • target directory of the output file: dir (VARCHAR(255) CHARSET utf8)
  2. Output

    • The output is written to:

      • file: dir\tbl_content_YYYY-MM-DD hhmmss.csv
      • table: tbl_content in database db_profile
    • It contains the following columns:

      • Id: Number of the row.
      • Name: Name of the column.
      • Average: Arithmetic mean of all values in the column.
      • Minimum: Minimum of all values in the column.
      • Maximum: Maximum of all values in the column.
      • Ranges: Range of all values in the column.
      • VoVar: Coefficient of variation based on the standard deviation of all values in the column.
      • IsNull: Number and percentage of values in the column containing NULL.
      • IsZero: Number and percentage of values in the column containing zero.
      • IsBlank: Number and percentage of values in the column containing blank.
      • IsDistinct: Number and percentage of values in the column containing blank.
  3. Procedure

Duplicates

Procedure DUPLICATES returns duplicate records in a given table (tbl) in a database into a generic table of the name tbl_duplicates in database db_profile and a file of the general name dir\tbl_duplicates_YYYY-MM-DD hhmmss.csv. Target database and table are created if necessary.

  1. Input

    • name of the database to be analysed: db (VARCHAR(64) CHARSET utf8)
    • name of the table to be analysed: tbl (VARCHAR(64) CHARSET utf8)
    • target directory of the output file: dir (VARCHAR(255) CHARSET utf8)
  2. Output

    • The output is written to:
      • file: dir\tbl_duplicates_YYYY-MM-DD hhmmss.csv
      • table: tbl_duplicates in database db_profile
    • It contains the following columns:
      • $Id: Number of the row.
      • $Duplicates: Number of duplicates of the record.
      • Fields of the original table
  3. Procedure

Profile

Procedure PROFILE runs procedures STRUCTURE, CONTENT and DUPLICATES.

  1. Input

    • name of the database to be analysed: db_profile (VARCHAR(64) CHARSET utf8)
    • target directory of the output file: dir_profile (VARCHAR(255) CHARSET utf8)
  2. Output

    • See procedures STRUCTURE, CONTENT and DUPLICATES.
  3. Procedure

Distinct instances overall

Procedure DISTINCTINSTANCEOVERALL returns information on distinct data occurrences (case sensitive) in a given column (col) in a given table (tbl).

  1. Input

    • table name: tbl (VARCHAR(64) CHARSET utf8)
    • column name: col (VARCHAR(64) CHARSET utf8)
  2. Output

    • The results are displayed in the execution results of the procedure.
    • It contains the following fields:
      • Number: Number of instances.
      • Percent: Percentage share of instances in total number of rows.
      • col: Name of the column to be analysed
  3. Procedure

Distinct instances per group

Procedure DISTINCTINSTANCEGROUP returns information on distinct data occurrences (case sensitive) in a given column (colprobe) per occcurrence of another column (colgroup) in a given table (tbl).

  1. Input

    • table name: tbl (VARCHAR(64) CHARSET utf8)
    • column name of the probe: colprobe (VARCHAR(64) CHARSET utf8)
    • column name of the group: colgroup (VARCHAR(64) CHARSET utf8)
  2. Output

    • The results are displayed in the execution results of the procedure.
    • It contains the following fields:
      • colgroup: Name of the group
      • Instances: Number of distinct occurrences per group.
      • I_Percent: Percentage share of distinct occurrences per group in total number of distinct occurrences.
      • Records: Number of total records per group.
      • R_Percent: Percentage share of records per group in total number of records.
  3. Procedure

See also