Infer Clojure specs from sample data. Inspired by F#'s type providers.
This is a library that will produce a best-guess Clojure spec based on multiple examples of in-memory data. The inferred spec is not meant to be used as is, without human intervention; rather, it is a starting point that can (and should) be refined.
The idea is analogous to F# type providers -- specifically the JSON type provider, but the input in the case of spec-provider is any in-memory Clojure data structure.
Since Clojure spec is still in alpha, this library should also be considered to be in alpha -- so, highly experimental, very likely to change, possibly flawed.
This library works in both Clojure and ClojureScript.
Maturity level: mature and useful. Has not reached full potential as some ideas are still unexplored.
To use this library, add this dependency to your Leiningen project.clj file:
[spec-provider "0.4.14"]
There are three main use cases for spec-provider:
See a summary of what shape the data is. You can use spec-provider as a way to explore new datasets.
You already know what shape your data is, and you just want some help getting started writing a spec for it because your data is deeply nested, has a lot of corner cases, you're lazy etc.
You think you know what shape your data is, but because it's neither type checked nor contract checked, some exceptions have sneaked into it. Instead of eyeballing 100,000 maps, you run spec-provider on them and to your surprise you find that one of the fields is (s/or :integer integer? :string string?) instead of just string as you expected. You can use spec-provider as a data debugging tool.
To infer a spec of a bunch of data just pass the data to the infer-specs function:
> (require '[spec-provider.provider :as sp])
> (def inferred-specs
    (sp/infer-specs
     [{:a 8 :b "foo" :c :k}
      {:a 10 :b "bar" :c "k"}
      {:a 1 :b "baz" :c "k"}]
     :toy/small-map))
> inferred-specs
((clojure.spec.alpha/def :toy/c (clojure.spec.alpha/or :keyword keyword? :string string?))
 (clojure.spec.alpha/def :toy/b string?)
 (clojure.spec.alpha/def :toy/a integer?)
 (clojure.spec.alpha/def :toy/small-map (clojure.spec.alpha/keys :req-un [:toy/a :toy/b :toy/c])))
The sequence of specs that you get out of infer-specs is technically correct, but not very useful for pasting into your code. Luckily, you can do:
> (sp/pprint-specs inferred-specs 'toy 's)
(s/def ::c (s/or :keyword keyword? :string string?))
(s/def ::b string?)
(s/def ::a integer?)
(s/def ::small-map (s/keys :req-un [::a ::b ::c]))
Passing 'toy to pprint-specs signals that we intend to paste this code into the toy namespace, so spec names are printed using the :: syntax.
Passing 's signals that we are going to require clojure.spec.alpha as s, so the calls to clojure.spec.alpha/def become s/def etc.
spec-provider will walk nested data structures in your sample data and attempt to infer specs for everything.
Let's use clojure.spec to generate a larger sample of data with nested structures.
;; Aliases assumed by the rest of this section.
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

(s/def ::id (s/or :numeric pos-int? :string string?))
(s/def ::codes (s/coll-of keyword? :max-gen 5))
(s/def ::first-name string?)
(s/def ::surname string?)
(s/def ::k (s/nilable keyword?))
(s/def ::age (s/with-gen (s/and integer? pos? #(<= % 130)) #(gen/int 130)))
(s/def :person/role #{:programmer :designer})
(s/def ::phone-number string?)

(s/def ::street string?)
(s/def ::city string?)
(s/def ::country string?)
(s/def ::street-number pos-int?)
(s/def ::address (s/keys :req-un [::street ::city ::country] :opt-un [::street-number]))
(s/def ::person (s/keys :req-un [::id ::first-name ::surname ::k ::age ::address] :opt-un [::phone-number ::codes] :req [:person/role]))
This spec can be used to generate a reasonably large random sample of persons:
(def persons (gen/sample (s/gen ::person) 100))
Which generates structures like:
{:id "d7FMcH52", :first-name "6", :surname "haFsA", :k :a-*?DZ/a, :age 5, :person/role :designer, :address {:street "Yrx963uDy", :city "b", :country "51w5NQ6", :street-number 53}, :codes [:*.?m_o-9_j?b.N?_!a+IgUE._coE.S4l4_8_.MhN!5_!x.axztfh.x-/?* :*-DA?+zU-.T0u5R.evD8._r_D!*K0Q.WY-F4--.O*/**O+_Qg+ :Bh8-A?t-f]}
Now watch what happens when we infer the spec of persons:
> (sp/pprint-specs (sp/infer-specs persons :person/person) 'person 's)
(s/def ::codes (s/coll-of keyword?))
(s/def ::phone-number string?)
(s/def ::street-number integer?)
(s/def ::country string?)
(s/def ::city string?)
(s/def ::street string?)
(s/def ::address (s/keys :req-un [::street ::city ::country] :opt-un [::street-number]))
(s/def ::age integer?)
(s/def ::k (s/nilable keyword?))
(s/def ::surname string?)
(s/def ::first-name string?)
(s/def ::id (s/or :string string? :integer integer?))
(s/def ::role #{:programmer :designer})
(s/def ::person (s/keys :req [::role] :req-un [::id ::first-name ::surname ::k ::age ::address] :opt-un [::phone-number ::codes]))
Which is very close to the original spec. We are going to break down this result to bring attention to specific features in the following sections.
If the sample data contain any nil values, this is detected and reflected in the inferred spec:
(s/def ::k (s/nilable keyword?))
Things like ::street-number, ::codes and ::phone-number did not appear consistently in the sampled data, so they are correctly identified as optional in the inferred spec.
(s/def ::address (s/keys :req-un [::street ::city ::country] :opt-un [::street-number]))
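One plausible way to separate required from optional keys is to compare how often each key appears with the number of maps sampled. The sketch below uses only clojure.core; the function name req-and-opt is hypothetical and this is not necessarily how spec-provider does it internally.

```clojure
;; Hypothetical helper, not part of spec-provider.
;; A key seen in every sampled map is treated as required,
;; a key seen in only some of them as optional.
(defn req-and-opt [maps]
  (let [n     (count maps)
        freqs (frequencies (mapcat keys maps))]
    {:req (set (for [[k c] freqs :when (= c n)] k))
     :opt (set (for [[k c] freqs :when (< c n)] k))}))

(def addresses
  [{:street "A" :city "B" :country "C" :street-number 53}
   {:street "D" :city "E" :country "F"}])

(req-and-opt addresses)
;; => {:req #{:street :city :country}, :opt #{:street-number}}
```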
Most of the keys in the sample data are not qualified, and they are detected as such in the inferred spec. The :person/role key is identified as fully qualified.
(s/def ::person (s/keys :req [::role] :req-un [::id ::first-name ::surname ::k ::age ::address] :opt-un [::phone-number ::codes]))
Note that the s/def for role is pretty printed as ::role because when calling pprint-specs we indicated that we are going to paste this into the person namespace.
> (sp/pprint-specs (sp/infer-specs persons :person/person) 'person 's)
...
(s/def ::role #{:programmer :designer})
You may have also noticed that role has been identified as an enumeration of :programmer and :designer. To see how it's decided whether a field is an enumeration or not, we have to look under the hood. Let's generate a small sample of roles:
> (gen/sample (s/gen ::role) 5)
(:designer :designer :designer :designer :programmer)
spec-provider collects statistics about all the sample data before deciding on the spec:
> (require '[spec-provider.stats :as stats])
> (stats/collect-stats (gen/sample (s/gen ::role) 5) {})
#:spec-provider.stats{:distinct-values #{:programmer :designer},
                      :sample-count 5,
                      :pred-map {#function[clojure.core/keyword?] #:spec-provider.stats{:sample-count 5}}}
The stats include a set of distinct values observed (up to a certain limit), the sample count for each field, and counts on each of the predicates that the field matches -- in this case just keyword?. Based on these statistics, the spec is inferred and a decision is made on whether the value is an enumeration or not.
If the following statement is true, then the value is considered an enumeration:
(>= 0.1 (/ (count distinct-values) sample-count))
In other words, if the number of distinct values found is no more than 10% of the total recorded values, then the value is an enumeration. This threshold is configurable.
Looking at the actual numbers can make this logic easier to understand. For the small sample above:
> (sp/infer-specs (gen/sample (s/gen ::role) 5) ::role)
((clojure.spec/def :spec-provider.person-spec/role keyword?))
We have 2 distinct values in a sample of 5, which is 40% of the values being distinct. Imagine this percentage in a larger sample, say 800 distinct values in a sample of size 2000. That doesn't sound likely to be an enumeration, so it's interpreted as a normal value.
If you increase the sample:
> (sp/infer-specs (gen/sample (s/gen ::role) 100) ::role)
((clojure.spec/def :spec-provider.person-spec/role #{:programmer :designer}))
We have 2 distinct values in a sample of 100, which is 2%, which means that the same values appear again and again in the sample, so it must be an enumeration.
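The arithmetic behind both outcomes can be checked with a small sketch of the threshold test quoted above. The function name enumeration? and its optional threshold argument are hypothetical, chosen for illustration; they are not spec-provider's API.

```clojure
;; Hypothetical helper mirroring the quoted expression:
;; (>= 0.1 (/ (count distinct-values) sample-count))
(defn enumeration?
  ([distinct-values sample-count]
   (enumeration? distinct-values sample-count 0.1))
  ([distinct-values sample-count threshold]
   (>= threshold (/ (count distinct-values) sample-count))))

(def roles #{:programmer :designer})

(enumeration? roles 5)   ; ratio 2/5  = 40% -> false, plain keyword?
(enumeration? roles 100) ; ratio 1/50 = 2%  -> true, enumeration
```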
spec-provider makes the same assumption as clojure.spec that keys that have the same name also have the same data shape as their value, even when they appear in different maps. This means that the specs from different maps are merged by key.
To demonstrate this we need to "spike" the generated persons with an id field that's inconsistent with the existing (s/or :numeric pos-int? :string string?):
(defn add-inconsistent-id [person]
  (if (:address person)
    (assoc-in person [:address :id] (gen/generate (gen/keyword)))
    person))

(def persons-spiked (map add-inconsistent-id (gen/sample (s/gen ::person) 100)))
Inferring the spec of persons-spiked yields a different result for ids:
> (sp/pprint-specs (sp/infer-specs persons-spiked :person/person) 'person 's)
...
(s/def ::id (s/or :string string? :integer integer? :keyword keyword?))
...
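The effect of merging by key can be imitated with a few lines of plain Clojure. This is only a sketch of the idea, assuming a simplified notion of "type label"; the names type-label and labels-by-key are hypothetical and do not come from spec-provider.

```clojure
;; Hypothetical, simplified classification of a value.
(defn type-label [x]
  (cond (string? x)  :string
        (integer? x) :integer
        (keyword? x) :keyword
        (map? x)     :map
        :else        :other))

;; Walk a (possibly nested) map and record, per key name, every
;; type label seen for it, regardless of which map it appeared in.
(defn labels-by-key [acc m]
  (reduce-kv (fn [acc k v]
               (let [acc (update acc k (fnil conj #{}) (type-label v))]
                 (if (map? v) (labels-by-key acc v) acc)))
             acc
             m))

(reduce labels-by-key {}
        [{:id 1   :address {:id :k}}
         {:id "x" :address {:id :z}}])
;; => {:id #{:integer :keyword :string}, :address #{:map}}
```

The top-level :id and the nested :id under :address end up in the same bucket, which is why the inferred ::id spec gains a :keyword branch.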
This feature is not illustrated by the person example, but before returning them, spec-provider will walk the inferred specs and look for forms that already occur elsewhere and replace them with the name of the known spec. For example:
> (sp/pprint-specs
   (sp/infer-specs [{:a [{:zz 1}] :b {:zz 2}}
                    {:a [{:zz 1} {:zz 4} nil] :b nil}]
                   ::foo)
   *ns* 's)
(s/def ::zz integer?)
(s/def ::b (s/nilable (s/keys :req-un [::zz])))
(s/def ::a (s/coll-of ::b))
(s/def ::foo (s/keys :req-un [::a ::b]))
In this case, because maps like {:zz 2} appear under the key :b, spec-provider knows what to call them, so it uses that name for (s/def ::a (s/coll-of ::b)). This replacement is not performed if the spec definition is a predicate from the clojure.core namespace.
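The general idea of replacing a known form with its name can be sketched with clojure.walk. This is an illustration under simplifying assumptions, not spec-provider's implementation: the lookup table known-specs and the function replace-known are hypothetical, and the clojure.core exclusion mentioned above is not modelled.

```clojure
(require '[clojure.walk :as walk])

;; Hypothetical lookup table: spec form -> name it was already given.
(def known-specs
  {'(s/nilable (s/keys :req-un [:toy/zz])) :toy/b})

;; Replace every sub-form that matches a known spec form with its name.
(defn replace-known [form]
  (walk/postwalk-replace known-specs form))

(replace-known '(s/coll-of (s/nilable (s/keys :req-un [:toy/zz]))))
;; => (s/coll-of :toy/b)
```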
spec-provider collects stats about the min/max values of numerical fields, but will not output them in the inferred spec by default. To get range predicates in your specs you have to pass the :spec-provider.provider/range option:
> (require '[spec-provider.provider :refer :all :as sp])
> (pprint-specs
   (infer-specs [{:foo 3, :bar -400} {:foo 3, :bar 4} {:foo 10, :bar 400}]
                ::stuff
                {::sp/range true})
   *ns* 's)
(s/def ::bar (s/and integer? (fn [x] (<= -400 x 400))))
(s/def ::foo (s/and integer? (fn [x] (<= 3 x 10))))
(s/def ::stuff (s/keys :req-un [::bar ::foo]))
You can also restrict range predicates to specific keys by passing a set of qualified keys that are the names of the specs that should get a range predicate:
> (sp/pprint-specs
   (sp/infer-specs [{:foo 3, :bar -400} {:foo 3, :bar 4} {:foo 10, :bar 400}]
                   ::stuff
                   {::sp/range #{::foo}})
   *ns* 's)
(s/def ::bar integer?)
(s/def ::foo (s/and integer? (fn [x] (<= 3 x 10))))
(s/def ::stuff (s/keys :req-un [::bar ::foo]))
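To see what a range predicate changes in practice, the inferred forms can be pasted into a REPL and checked with s/valid?. The sketch below assumes Clojure 1.9+ (which bundles clojure.spec.alpha) and uses the :example namespace for the spec names purely for illustration.

```clojure
(require '[clojure.spec.alpha :as s])

;; Same shape as the inferred specs above, under illustrative names.
(s/def :example/bar integer?)
(s/def :example/foo (s/and integer? (fn [x] (<= 3 x 10))))

(s/valid? :example/foo 5)   ; true, inside the observed range
(s/valid? :example/foo 11)  ; false, above the observed maximum
(s/valid? :example/bar 11)  ; true, no range predicate on this spec
```

A range inferred from a sample only reflects the values that happened to be observed, so it is worth reviewing before keeping it in a hand-maintained spec.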
Inferring a spec from raw data is a two step process: Stats collection and then summarization of the stats into specs.
First each data structure is visited recursively and statistics are collected at each level about the types of values that appear, the distinct values for each field (up to a limit), min and max values for numbers, lengths for sequences etc.
Two important points about stats collection:
Spec-provider will not run out of memory even if you throw a lot of data at it because it updates the same statistics data structure with every new example datum it receives.
Collecting stats will (at least partly) realize lazy sequences.
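The constant-memory behaviour follows from folding every sample into one accumulator. The following is a minimal sketch of that pattern in plain Clojure; the names update-sketch, collect-sketch and the map layout are hypothetical stand-ins, much simpler than spec-provider's real stats structure.

```clojure
;; Assumed cap on how many distinct values are remembered.
(def distinct-limit 10)

(defn type-label [x]
  (cond (keyword? x) :keyword
        (string? x)  :string
        (integer? x) :integer
        :else        :other))

;; Fold one sample into the accumulated stats.
(defn update-sketch [stats x]
  (-> (or stats {:sample-count 0, :distinct-values #{}, :pred-counts {}})
      (update :sample-count inc)
      (update :distinct-values
              #(if (< (count %) distinct-limit) (conj % x) %))
      (update-in [:pred-counts (type-label x)] (fnil inc 0))))

;; Works on lazy sequences: samples are consumed one at a time.
(defn collect-sketch [samples]
  (reduce update-sketch nil samples))

(collect-sketch (take 1000 (cycle [1 "a"])))
;; => {:sample-count 1000,
;;     :distinct-values #{1 "a"},
;;     :pred-counts {:integer 500, :string 500}}
```

Because the set of distinct values is capped and the other entries are counters, the accumulator stops growing after the first few samples.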
After stats collection, code from the spec-provider.provider namespace goes through the stats and summarizes them as a collection of specs.
As mentioned in the previous section, spec-provider first collects statistics about the data that you pass to it and then it uses them to infer specs for this data. The entry point for collecting stats is the spec-provider.stats/collect function. This can be used to explore your data and give you insight into its structure, as was very nicely explained in this blog post by Dan Lebrero.
Assume this:
(require '[spec-provider.provider :as sp]
         '[spec-provider.stats :as stats])
There is only one option that affects how the specs are inferred and it can be passed as a map in an extra parameter to sp/infer-specs:
::sp/range: If true, all numerical specs include a range predicate. If it's a set of spec names (qualified keywords), only these specs will include range predicates. See section Inferring specs with numerical ranges for an example (default false).
There are a number of options that can affect how the sample stats are collected (and consequently also affect what spec is inferred). These options are passed to stats/collect, or as part of the options map passed to sp/infer-specs.
::stats/distinct-limit: How many distinct values are collected for collections (default 10).
::stats/coll-limit: How many elements of the collection are used to infer/collect data about the type of the contained element (default 101). This means that lazy sequences are at least partly realized.
::stats/positional: Results in positional stats being collected for sequences, so that s/cat can be inferred instead of s/coll-of (default false).
::stats/positional-limit: Bounds the positional stats length (default 100).
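The practical difference between the two forms that ::stats/positional chooses between can be seen with hand-written specs. This sketch assumes Clojure 1.9+ and uses illustrative spec names; it does not show spec-provider's actual output for any particular sample.

```clojure
(require '[clojure.spec.alpha :as s])

;; Homogeneous view: each element may be a string or an integer.
(s/def :example/loose
  (s/coll-of (s/or :string string? :integer integer?)))

;; Positional view: a string first, then an integer.
(s/def :example/pair
  (s/cat :name string? :age integer?))

(s/valid? :example/loose ["Ada" 36]) ; true
(s/valid? :example/loose [36 "Ada"]) ; true, order is not checked
(s/valid? :example/pair  ["Ada" 36]) ; true
(s/valid? :example/pair  [36 "Ada"]) ; false, wrong positions
```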
Undocumented/under development: there is experimental support for instrumenting functions for the purpose of inferring the spec of args and return values.
multi-spec.
Only the :args and :ret parts of the spec are generated; the :fn part is up to you.
infer-specs?
No, stats collection works by updating the same data structure with every example of data received. The data structure will initially grow a bit and then maintain a constant size. That means that you can use a lazy sequence to stream your huge table through it if you feel that's necessary (not tested!).
The hard part of inferring a spec is collecting the statistics. Summarizing the stats as specs was relatively easy, so plugging in a different "summarizer" that will output schemas from the same stats should be possible. Look at the provider namespace, write the schema equivalent and send me a pull request!
Run Clojure unit tests with:
lein test
Run ClojureScript unit tests with (default setup uses node):
lein doo
Run self-hosted ClojureScript unit tests with:
lein tach lumo
and
lein tach planck
Copyright © 2016-2018 Stathis Sideris
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.