New usecase

To create a new usecase, go to the usecase tab and click the « new usecase » button:

_images/new_usecase_button.png

You can also create a usecase directly by clicking the … icon next to a given Data Set on the data screen:

_images/new_usecase_from_ds.png

When creating a new usecase, you should first specify a DATA TYPE among:

  • Tabular (including textual)
  • Time series
  • Images

Then, you can specify a usecase name linked to a previously created Data Set.

_images/new_usecase.png

Depending on the TRAINING TYPE, some options are displayed:

  • Hold out: only for Tabular usecases. It is a Data Set that will be predicted with each trained model, and the performance will be computed on it
  • Image folder: only for Images usecases. It is a Data Set labelled as a folder containing the images linked to a tabular Data Set

We offer 4 different TRAINING TYPES:

TYPE                  TABULAR  TIMESERIES  IMAGE  DEFINITION                                                     EXAMPLE
Regression            OK       OK          OK     Prediction of a quantitative feature                           2.39 / 3.98 / 18.39
Classification        OK                   OK     Prediction of a binary qualitative feature                     « Yes » / « No » or 0 / 1
Multi classification  OK                   OK     Prediction of a qualitative feature whose cardinality is > 2   « Victory » / « Defeat » / « Tie game »
Object detection                           OK     Detection of 1 to n objects per image + their location         Is there a train in this image? If so, where?

Tabular

The screens for these 3 types of usecases are extremely similar. Only metrics, detailed below, change according to the type of project. First, you should give your usecase a name and attach a previously created Data Set:

_images/train_1_ds.png

Note that only tabular Data Sets with an OK parsed status (✓ icon in the PARSED column of the Data Set screen) can be selected.

It is also possible, but not mandatory, to add a Data Set for comparison (hold out):

_images/train_ho.png

Typically, adding such a Data Set is useful in a study context where you want to compare the quality of the actual predictions (and not only the performance estimators) across a set of models. This Data Set must have the same structure as the original one (same column names).
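
As an illustration, a minimal pandas sketch (the file names are hypothetical) to check that a hold out shares the structure of the training Data Set before uploading it:

    import pandas as pd

    # Hypothetical file names, for illustration only
    train = pd.read_csv("train.csv")
    holdout = pd.read_csv("holdout.csv")

    # The hold out must expose the same column names as the training Data Set
    missing = set(train.columns) - set(holdout.columns)
    extra = set(holdout.columns) - set(train.columns)
    if missing or extra:
        raise ValueError(f"Structure mismatch - missing: {missing}, extra: {extra}")
    print("Hold out structure matches the training Data Set")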

Once this step is done, you can click the « configure dataset » button, located on the top right of the screen:

_images/configure_dataset.png

Data Set configuration

_images/train_sup.png

On the left part of the screen, you will be able to fill in:

  • The target column (mandatory). This is the column we want to predict.
  • The id column (optional). This column typically has no predictive power and is used to join with other Data Sets later on.
  • The fold column (optional). Typically, this column contains values 1, 2, … n (n being the maximum number of folds). If provided, the CV stratification will be based on this column instead of being stratified on the target, which is Prevision.io's default behavior.
  • The weight column (optional). Typically, this column contains a numerical feature indicating the importance of a given row. The higher the weight, the more important the row. If not provided, all rows are considered equally important (which is the case in most usecases). A sketch of how fold and weight columns can be built is given after the note below.

Note: If your Data Set contains a column named ID or TARGET, they will automatically be detected and selected in the corresponding menus.
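
For illustration, here is a minimal pandas sketch (column and file names are hypothetical) that builds a fold column and a weight column before uploading the Data Set:

    import pandas as pd

    df = pd.read_csv("train.csv")  # hypothetical training Data Set

    # Fold column: assign each row to one of n folds, here based on a business
    # key so that all rows of a given customer end up in the same fold
    n_folds = 5
    df["fold"] = df["customer_id"].astype("category").cat.codes % n_folds + 1

    # Weight column: e.g. give more importance to the most recent rows
    dates = pd.to_datetime(df["date"])
    df["weight"] = 1 + (dates - dates.min()).dt.days / 365

    df.to_csv("train_with_fold_and_weight.csv", index=False)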

On the right part of the screen, you will be able to:

  • Filter columns by name
  • Show only dropped (removed) columns
  • Drop (remove) columns for the training phase. This means that every dropped column won't be used in the learning process

Once done, you can launch the training by clicking on the « create and train » button, located on the top right of the screen:

_images/create_train.png

Optionally, advanced options are reachable by clicking the corresponding tab in the top bar:

_images/train_advanced_options.png

Advanced options

_images/train_advanced_options_screen.png

Training options

_images/training_options.png

In this part of the screen, you can tune the following:

Metric (differs depending on the training type):

TYPE                  METRIC       DEFINITION                                    DEFAULT?
Regression            RMSE         Root mean squared error                       YES
Regression            MSE          Mean squared error
Regression            RMSLE        Root mean squared logarithmic error
Regression            RMSPE        Root mean squared percentage error
Regression            MAE          Mean absolute error
Regression            MAPE         Mean absolute percentage error
Regression            MER          Median absolute error
Regression            R2           Coefficient of determination
Regression            SMAPE        Symmetric mean absolute percentage error
Classification        AUC          Area under ROC curve                          YES
Classification        ERROR RATE   Error rate
Classification        LOGLOSS      Logarithmic loss
Classification        ACCURACY     Accuracy
Classification        F05          F-0.5 Score
Classification        F1           F-1 Score
Classification        F2           F-2 Score
Classification        F3           F-3 Score
Classification        F4           F-4 Score
Classification        MCC          Matthews' correlation coefficient
Classification        GINI         Gini's coefficient
Classification        AUPCR        Area under precision-recall curve
Classification        LIFT_AT_0.1  Lift @ 10%
Classification        LIFT_AT_0.2  Lift @ 20%
Classification        LIFT_AT_0.3  Lift @ 30%
Classification        LIFT_AT_0.4  Lift @ 40%
Classification        LIFT_AT_0.5  Lift @ 50%
Classification        LIFT_AT_0.6  Lift @ 60%
Classification        LIFT_AT_0.7  Lift @ 70%
Classification        LIFT_AT_0.8  Lift @ 80%
Classification        LIFT_AT_0.9  Lift @ 90%
Multi classification  LOGLOSS      Logarithmic loss                              YES
Multi classification  ERROR_RATE   Error rate
Multi classification  AUC          Area under ROC curve (mean of AUC by class)
Multi classification  MACROF1      Macro F1-Score (mean of F1 by class)
Multi classification  ACCURACY     Accuracy
Multi classification  QKAPPA       Quadratic weighted Kappa
Multi classification  MAP_AT_3     Mean average precision @ 3
Multi classification  MAP_AT_5     Mean average precision @ 5
Multi classification  MAP_AT_10    Mean average precision @ 10

All technical formulas are available here: https://previsionio.readthedocs.io/fr/latest/_static/ressources/formula.pdf
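
For reference, the standard definitions of a few of the metrics above (the linked PDF remains the authoritative source), where y_i is the true value, \hat{y}_i the predicted value and \hat{p}_i the predicted probability of the positive class:

    \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
    \qquad
    \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|

    \mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
    \qquad
    \mathrm{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1 - y_i)\log\left(1 - \hat{p}_i\right)\right]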

Performance:

  • QUICK: training is done faster, but performance may be slightly lower. Ideal in the iterative phase.
  • NORMAL: intermediate value, suitable for most usecases at a later stage.
  • ADVANCED: training is done in an optimal way. The performance will be more stable, but the computations will take longer. Ideal when the model is put into production and performance is the deciding factor.

Model Selection

_images/model_selection.png

In this part of the screen you can enable or disable the model types used during training.

Note: The more model types you add to the training, the longer it will take.

Feature Engineering

_images/feature_engineering.png

In this part of the screen you can enable or disable feature engineering, such as:

  • Date features: dates are detected, and operations such as information extraction (day, month, year, day of the week, etc.) and differences (if at least 2 dates are present) are automatically performed
  • Textual features: textual features are detected and automatically converted into numbers using 3 techniques (TF-IDF, word embedding and sentence embedding).

By default, only the TF-IDF approach is used.

Note: For better performance, it is advisable to check the word embedding and sentence embedding options. Checking these additional options will increase the time required for feature engineering, modeling and prediction.

  • Categorical features:

    • Frequency encoding: modalities are converted to their respective frequencies
    • Target encoding: modalities are replaced by the average of the TARGET (grouped by modality) for a regression, and by the proportion of each target modality within the feature's modality for a classification (a sketch of both encodings is given below)
  • Advanced features:

    • Polynomial features: features based on products of existing features are created. This can greatly help linear models, since they do not naturally take interactions into account, but is less useful for tree-based models
    • PCA: the principal components of a PCA are added as new features
    • K-means: cluster numbers coming from a K-means method are added as new features
    • Row statistics: features based on row-by-row counts are added as new features (number of 0s, number of missing values, …)

Note: The more feature engineering you add to the training, the longer it will take.
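
As an illustration of the two categorical encodings above, here is a minimal pandas sketch (the column names are hypothetical and the platform's own implementation may differ in its details):

    import pandas as pd

    df = pd.DataFrame({
        "city":   ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Paris"],
        "TARGET": [1, 0, 1, 0, 1, 0],
    })

    # Frequency encoding: each modality is replaced by its frequency
    frequencies = df["city"].value_counts(normalize=True)
    df["city_freq"] = df["city"].map(frequencies)

    # Target encoding: each modality is replaced by the mean of the TARGET
    # (i.e. the proportion of the positive class for a classification)
    target_means = df.groupby("city")["TARGET"].mean()
    df["city_target"] = df["city"].map(target_means)

    print(df)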

Feature Selection

_images/feature_selection.png

In this part of the screen you can choose to enable feature selection (off by default).

This operation is important when you have a high number of features (a few hundred) and can be critical when the number of features is above 1000, since the full Data Set won't fit in RAM.

You can choose to keep a percentage or a count of features, and you can give Prevision.io a time budget to search for the optimal features given the TARGET and all other parameters. Within this time, Prevision.io will subset the features of the Data Set and then start the classical training process. The sketch below illustrates the general idea.
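
To give an intuition of what keeping a count of features means, here is a minimal sketch based on a simple importance ranking; this only illustrates the concept and is not Prevision.io's actual selection algorithm (file and column names are hypothetical):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("train.csv")  # hypothetical Data Set
    X = df.drop(columns=["TARGET"]).select_dtypes("number").fillna(0)
    y = df["TARGET"]

    # Rank features with a quick model and keep the top k
    k = 50
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    kept = importances.sort_values(ascending=False).head(k).index.tolist()

    df[kept + ["TARGET"]].to_csv("train_selected.csv", index=False)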

Time series

Time series usecases are very similar to tabular usecases, except that:

  • There is no hold out
  • There is no weight
  • There is no fold (in this case, Prevision.io uses a temporal stratification)

However, you will find some new notions:

  • Temporal column: the feature that contains the time reference of the time series. Since date formats can be complex, Prevision.io supports ISO 8601 (https://fr.wikipedia.org/wiki/ISO_8601) as well as standard formats (e.g. DD/MM/YYYY or DD-MM-YYYY hh:mm).

  • Time step: period between 2 events (within the same group) from the temporal column (automatically detected)

  • Observation window: the period in the past that you have data for, for each prediction

    • Start of the observation window: the maximum time step multiple in the past that you'll have data from for each prediction (inclusive, 30 by default)
    • End of the observation window: the last time step multiple in the past that you'll have data from for each prediction (inclusive, 0 by default, which means that the value immediately before the prediction time step is known)
  • Prediction window: the period in the future that you want to predict

    • Start of the prediction window: the first time step multiple you want to predict (inclusive, 1 by default, which means we will predict starting at the next value)
    • End of the prediction window: the last time step multiple you want to predict (inclusive, 10 by default, which means we will predict up to the 10th next value)
  • A priori features: features whose value is known in the future (customer number, calendar, public holidays, weather…)

  • Group features: features that identify a unique time series (e.g. you want to predict your sales by store and by product. If you have 2 stores selling 3 products, there are 6 time series in your file. Selecting the « store » and « product » features as group columns allows Prevision.io to take these multiple series into account). A sketch of such a grouped Data Set is given after the screenshot below.

_images/train_ts.png
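
For illustration, here is a hypothetical grouped time series Data Set with 2 stores and 3 products (hence 6 series), a daily time step, « date » as the temporal column, « store » / « product » as group features and « sales » as the TARGET:

    import pandas as pd

    # Toy Data Set: 6 series = 2 stores x 3 products, one row per day
    dates = pd.date_range("2023-01-01", periods=60, freq="D")
    rows = []
    for store in ["store_A", "store_B"]:
        for product in ["p1", "p2", "p3"]:
            for i, date in enumerate(dates):
                rows.append({"date": date, "store": store,
                             "product": product, "sales": 100 + (i % 7) * 10})

    pd.DataFrame(rows).to_csv("sales_timeseries.csv", index=False)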

Once everything is set up, you can launch the training by clicking on the « create and train » button, located on the top right of the screen:

_images/create_train.png

Optionally, advanced options are reachable by clicking the corresponding tab in the top bar:

_images/train_advanced_options_ts.png

Example 1: you want to predict day-ahead values at an hourly granularity, and you have all data available 1 week in the past for each value.

Time step = 1 hour

Start of observation window = 7 (days) * 24 (hours / day) - 1 (because this value is inclusive) = 167

End of observation window = 0 (we have the last known value before each prediction)

Start of prediction window = 1 (we predict the next immediate value)

End of prediction window = 1 (day) * 24 (hours / day) = 24 (we predict the whole next day, at an hourly level)

Example 2: you want to predict from day+2 to day+7 (i.e. the week ahead minus the first day) at a daily granularity, and you have all data available 4 weeks in the past for each value with a 1-week delay (which means you don't know the last week's values).

Time step = 1 day

Start of observation window = 4 (weeks) * 7 (days / week) - 1 (because this value is inclusive) = 27

End of observation window = 1 (week) * 7 (days / week) = 7 (the last week of values is not known)

Start of prediction window = 2 (we predict starting at the second next value)

End of prediction window = 7 (we predict up to the next 7th day)
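
The arithmetic of both examples can be written as a small helper; this is only an illustration of how the values above are derived, not a feature of the platform (the function name and its arguments are hypothetical):

    def window_parameters(history_steps, delay_steps, first_horizon, last_horizon):
        """Derive the four window values from the desired history and horizon.

        history_steps : number of past time steps available for each prediction
        delay_steps   : number of most recent time steps that are not known yet
        first_horizon : first future time step to predict (1 = next value)
        last_horizon  : last future time step to predict
        """
        start_observation = history_steps - 1  # inclusive, hence the -1
        end_observation = delay_steps
        return start_observation, end_observation, first_horizon, last_horizon

    # Example 1: hourly data, one full week of history, predict the next 24 hours
    print(window_parameters(7 * 24, 0, 1, 24))   # -> (167, 0, 1, 24)

    # Example 2: daily data, 4 weeks of history with a 1-week delay,
    # predict from day+2 to day+7
    print(window_parameters(4 * 7, 7, 2, 7))     # -> (27, 7, 2, 7)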

Notes: The wider the window is, the longer the compute time will be. Also, please make sure to provide an observation window of reasonable size. In most usecases, it should be a reasonable multiple of the prediction window (e.g. if you predict day ahead, don't use more than a couple of weeks in the observation window).

Images

_images/new_usecase_images.png

Regression / classification / multi classification

To launch a regression / classification / multiclass classification project, the method is identical to tabular usecases, except that you need to:

  • Add in the tabular Data Set a relative path to the image, which will be specified in the interface.
  • Provide an image type Data Set whose paths correspond to those indicated in the previous Data Set.

It should be noted that the tabular Data Set may or may not contain exogenous features (e.g. geographical position of the camera, temperature, weather, etc.).
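
For illustration, here is a hypothetical tabular Data Set for an image classification usecase, with a relative image path column and a couple of exogenous features:

    import pandas as pd

    # Each row references an image by its relative path inside the images folder
    # Data Set, plus optional exogenous features and the TARGET to predict
    df = pd.DataFrame({
        "image_path":  ["img/cam1_0001.jpg", "img/cam1_0002.jpg", "img/cam2_0001.jpg"],
        "temperature": [12.5, 13.1, 9.8],
        "weather":     ["cloudy", "sunny", "rain"],
        "TARGET":      ["train", "no_train", "train"],
    })
    df.to_csv("images_tabular.csv", index=False)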

Once this step is done, you can click the « configure dataset » button, located on the top right of the screen:

_images/configure_dataset.png

Data Set configuration

_images/train_images.png

On the left part of the screen, you will be able to fill in the same columns as for a tabular usecase, but you will also need to provide the « image path » feature, which links the tabular Data Set to the images folder.

Once done, you can launch the training by clicking on the « create and train » button, located on the top right of the screen:

_images/create_train.png

Optionally, advanced options are reachable by clicking the corresponding tab in the top bar:

_images/train_advanced_options_images.png

Advanced options

Advanced options work exactly as for tabular usecases.

Object detection

_images/train_object_detector.png

Like any other images usecase, you need to specify 2 Data Sets (one tabular and one images).

There is a « quick » button that allows a model to be trained faster (typically by a factor of 5-10) at the cost of slightly lower performance.

Note: While object detection usecases can run on CPUs, the training time will be very long. That's why we recommend using an instance that has a GPU.

Once this step is done, you can click the « configure dataset » button, located on the top right of the screen:

_images/configure_dataset.png

Data Set configuration

_images/train_object_detector_configuration.png

In this usecase type, you'll need to provide:

  • image path: the feature that links the tabular Data Set to the image folder
  • object class column: the feature that indicates the category of the object to detect
  • top: the ordinate (in pixels) of the top of the bounding box in which the object is
  • right: the abscissa (in pixels) of the right of the bounding box in which the object is
  • bottom: the ordinate (in pixels) of the bottom of the bounding box in which the object is
  • left: the abscissa (in pixels) of the left of the bounding box in which the object is

Note: The Data Set shouldn't contain any other columns than the ones required to launch the training (a sketch of such a Data Set is given below).
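
For illustration, here is a hypothetical tabular Data Set for an object detection usecase, restricted to the required columns (assuming one row per object, so an image containing several objects appears on several rows):

    import pandas as pd

    # Bounding box coordinates are given in pixels
    df = pd.DataFrame({
        "image_path": ["img/0001.jpg", "img/0001.jpg", "img/0002.jpg"],
        "class":      ["train", "car", "train"],
        "top":        [34, 120, 58],
        "right":      [410, 300, 388],
        "bottom":     [210, 215, 240],
        "left":       [55, 180, 12],
    })
    df.to_csv("object_detection.csv", index=False)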

Once done, you can launch the training by clicking on the « create and train » button, located on the top right of the screen:

_images/create_train.png