Skip to content

Heuristic JSON "column" type & size analysis w/ enumeration detection.

License

Notifications You must be signed in to change notification settings

justsml/schema-analyzer

Repository files navigation

Build Status Codacy Badge GitHub package version GitHub stars Node.js CI npm bundle size npm downloads codecov

Schema Analyzer

An Open Source joint by Dan Levy

Analyze column type & size summary from any input JSON array

Schema Analyzer is the core library behind Dan's Schema Generator.

Features

The primary goal is to support any input JSON/CSV and infer as much as possible. More data will generally yield better results.

  • Heuristic type analysis for arrays of objects.
  • Browser-based (local, no server necessary)
  • Automatic type detection:
    • ID - Identifier column, by name and unique Integer check (detects BigInteger)
    • ObjectId (MongoDB's 96 bit/12 Byte ID. 32bit timestamp + 24bit MachineID + 16bit ProcessID + 24bit Counter)
    • UUID/GUID (Common 128 bit/16 Byte ID. Stored as a hex string, dash delimited in parts: 8, 4, 4, 4, 12)
    • Boolean (detects obvious strings true, false, Y, N)
    • Date (Smart detection via comprehensive regex pattern)
    • Timestamp (integer, number of milliseconds since unix epoch)
    • Currency (62 currency symbols supported)
    • Float (w/ scale & precision measurements)
    • Number (Integers)
    • Null (sparse column data helps w/ certain inferences)
    • Email (falls back to string)
    • String (big text and varchar awareness)
    • Array (includes min/max/avg length)
    • Object
  • Detects column size minimum, maximum and average
  • Includes data points at the 30th, 60th and 90th percentiles (for detecting outliers and enum types!)
  • Handles some error/outliers
  • Quantify # of unique values per column
  • Identify enum Fields w/ Values
  • Identify Not Null fields
  • Nested data structure & multi-table relational output.

Getting Started

npm install schema-analyzer
import { schemaBuilder } from 'schema-builder'

schemaBuilder(schemaName: String, data: Array<Object>): TypeSummary

Preview Analysis Results

What does this library's analysis look like?

It consists of 3 key top-level properties:

  • totalRows - # of rows analyzed.
  • fields: FieldTypeSummary - a map of field names with all detected types (includes meta-data for each type detected, with possible overlaps. e.g. an Email is also a String, "42" is a String and Number)

Review the raw results below

Details about each field can be found below.

{
  "totalRows": 5,
  "fields": {
    "id": {
      "types": {
        "Number": {
          "rank": 8,
          "count": 5,
          "value": { "min": 1, "mean": 3, "max": 5, "p25": 2, "p33": 2, "p50": 3, "p66": 4, "p75": 4, "p99": 5 }
        },
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 1, "mean": 1, "max": 1, "p25": 1, "p33": 1, "p50": 1, "p66": 1, "p75": 1, "p99": 1 }
        }
      }
    },
    "name": {
      "types": {
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 3, "mean": 7.2, "max": 15, "p25": 3, "p33": 3, "p50": 5, "p66": 10, "p75": 10, "p99": 15 }
        }
      }
    },
    "role": {
      "types": {
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 4, "mean": 5.4, "max": 9, "p25": 4, "p33": 4, "p50": 5, "p66": 5, "p75": 5, "p99": 9 }
        }
      }
    },
    "email": {
      "types": {
        "Email": {
          "rank": 11,
          "count": 5,
          "length": { "min": 15, "mean": 19.4, "max": 26, "p25": 15, "p33": 15, "p50": 18, "p66": 23, "p75": 23, "p99": 26 }
        }
      }
    },
    "createdAt": {
      "types": {
        "Date": {
          "rank": 4,
          "count": 4,
          "value": { "min": "2001-01-01T00:00:00.000Z", "mean": "2015-04-14T18:00:00.000Z", "max": "2020-02-02T00:00:00.000Z", "p25": "2020-02-02T00:00:00.000Z", "p33": "2020-02-02T00:00:00.000Z", "p50": "2019-12-31T00:00:00.000Z", "p66": "2019-12-31T00:00:00.000Z", "p75": "2001-01-01T00:00:00.000Z", "p99": "2001-01-01T00:00:00.000Z" }
        },
        "String": {
          "rank": 12,
          "count": 1,
          "length": { "min": 6, "mean": 6, "max": 6, "p25": 6, "p33": 6, "p50": 6, "p66": 6, "p75": 6, "p99": 6 }
        }
      }
    },
    "accountConfirmed": {
      "types": {
        "Unknown": {
          "rank": -1,
          "count": 1
        },
        "String": {
          "rank": 12,
          "count": 1,
          "length": { "min": 9, "mean": 9, "max": 9, "p25": 9, "p33": 9, "p50": 9, "p66": 9, "p75": 9, "p99": 9 }
        },
        "Boolean": {
          "rank": 3,
          "count": 4
        }
      }
    }
  }
}

Sample input dataset for the example results above

id name role email createdAt accountConfirmed
1 Eve poweruser eve@example.com 01/20/2020 undefined
2 Alice user ali@example.com 02/02/2020 true
3 Bob user robert@example.com 12/31/2019 true
4 Elliot Alderson admin falkensmaze@protonmail.com 01/01/2001 false
5 Sam Sepiol admin falkensmaze@hotmail.com 9/9/99 true

AggregateSummary

Numeric and String types include a summary of the observed field sizes:

Number & String Range Object Details

Properties
  • min the minimum number or string length
  • max the maximum number or string length
  • mean the average number or string length
  • percentiles[25th, 33th, 50th, 66th, 75th, 99th] values from the Nth percentile number or string length

Percentile is based on input data, as-is with out sorting.

Length Range Data

Range data for the length of a String field type:

{
  "rank": 11,
  "count": 5,
  "length": { "min": 15, "mean": 19.4, "max": 26, "p25": 15, "p33": 15, "p50": 18, "p66": 23, "p75": 23, "p99": 26 }
}

This is useful for defining strict length limits or minimums, for example as SQL servers often require..

Range data for a Date fields value:

{
  "rank": 4,
  "count": 4,
  "value": { "min": "2001-01-01T00:00:00.000Z", "mean": "2015-04-14T18:00:00.000Z", "max": "2020-02-02T00:00:00.000Z", "p25": "2020-02-02T00:00:00.000Z", "p33": "2020-02-02T00:00:00.000Z", "p50": "2019-12-31T00:00:00.000Z", "p66": "2019-12-31T00:00:00.000Z", "p75": "2001-01-01T00:00:00.000Z", "p99": "2001-01-01T00:00:00.000Z" }
}

Notes

We recommend you provide at least 100+ rows. Accuracy increases greatly with 1,000 rows.

The following features require a certain minimum # of records:

  • Enumeration detection.
    • 100+ Rows Required.
    • Number of unique values must not exceed 20 or 5% of the total number of records. (100 records will identify as Enum w/ 5 values. Up to 20 are possible given 400 or 1,000+.)
  • Not Null detection.
    • where rowCount === field count

Full List of Detected Types

  • Unknown
  • ObjectId
  • UUID
  • Boolean
  • Date
  • Timestamp
  • Currency
  • Float
  • Number
  • Email
  • String
  • Array
  • Object
  • Null