A machine learning package built for humans.
Machine learning for humans.
A machine learning library designed from the ground up to be human friendly. It is different from other machine learning libraries in the following ways:
This library is meant to be used with sparse, interpretable features such as those that commonly occur in search (search keywords, filters) or pricing (number of rooms, location, price). It is not as interpretable with problems with very dense non-human interpretable features such as raw pixels or audio samples.
There are a few reasons to focus on interpretability:
The artifacts for aerosolve are hosted on bintray. If you use Maven, SBT or Gradle you can just point to bintray as a repository and automatically fetch the artifacts.
Check out the image impression demo where you can learn how to teach the algorithm to paint in the pointillism style of painting. Image Impressionism Demo.
There is also an income prediction demo based on a popular machine learning benchmark. Income Prediction Demo.
This section dives into the thrift based feature representation.
Features are grouped into logical groups called families of features. The reason for this is so we can express transformations on an entire feature family at once or interact two different families of features together to create a new feature family.
There are three kinds of features per FeatureVector:
Examples are the basic unit of creating training data and scoring. A single example is composed of:
The reasons for having this structure are:
This section dives into the feature transform language.
Feature transforms are applied with a separate transformer module that is decoupled from the model. This allows the user to break apart transforms or transform data ahead of time of scoring for example. e.g. in an application the items in a corpus may be transformed ahead of time and stored, while the context is not known until runtime. Then at runtime, one can transform the context and combined them with each transformed item to get the final feature vector that is then fed to the models.
Feature transforms allow us to modify FeatureVectors on the fly. This allows engineers to rapidly iterate on feature engineering quickly and in a controlled way.
Here are some examples of feature transforms that are commonly used:
Please see the corresponding unit tests as to what these transforms do, what kind of features they operate on and what kind of config they expect.
This section covers debuggable models.
Although there are several models in the model directory only two are the main debuggable models. The rest are experimental or sub-models that create transforms for the interpretable models.
Linear model. Supports hinge, logistic, epsilon insensitive regression, ranking loss functions. Only operates on stringFeatures. The label for the task is stored in a special feature family and specified by rank_key in the config. See the linear model unit tests on how to set up the models. Note that in conjunction with quantization and crosses you can get incredible amounts of complexity from the "linear" model, so it is not actually your regular linear model but something more complex and can be thought of as a bushy, very wide decision tree with millions of branches.
Spline model. A general additive linear piecewise spline model. The training is done at a higher resolution specified by num_buckets between the min and max of a feature's range. At the end of each iteration we attempt to project the linear piecewise spline into a lower dimensional function such as a polynomial spline with Dirac delta endpoints. If the RMSE of the projection is above threshold, we leave the spline alone in the high resolution piecewise linear mode. This allows us to debug the spline model for features that are buggy or unexpectedly complex (e.g. jumping up and down when we expect some kind of smoothness)
If you use intellij, try build first, so that thrift classes is available and to fix the spark compiling error inside intellij, type
command+;and click dependency and change related files from test to compile, such as org.apache.spark and org.apache.hadoop:hadoop-common. We keep gradle config as testCompile so that to reduce jar file size.
Organizations and projects using
aerosolvecan list themselves here.