NLP

NLP treatments can be applied to textual features. To do so, your dataset must contain at least one textual feature. Then, during the advanced configuration of your use case, you can apply feature engineering through the “textual features” options.

Textual features are automatically detected and converted into numbers using three techniques:

https://storage.cloud.google.com/prevision-doc/textual%20feature.png

Statistical analysis using term frequency–inverse document frequency (TF-IDF). Words are mapped to numeric values computed with the TF-IDF metric. The platform integrates fast algorithms that make it possible to keep the TF-IDF encoding of all uni-grams and bi-grams without applying dimensionality reduction. More information about TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
Word embedding using Word2Vec/GloVe. Words are projected into a dense vector space in which the semantic distance between words is preserved. Prevision trains a Word2Vec model on the actual input corpus to generate the corresponding word vectors. More information about word embedding: https://en.wikipedia.org/wiki/Word_embedding
Sentence embedding using Transformers. Prevision has integrated BERT-based transformers as pre-trained contextual models that capture the relationships between words bidirectionally. Because BERT has a linguistic “representation” of its own, it can generate more efficient vectors than word embedding algorithms. For text classification, these vector representations can be used as input to basic classifiers. BERT (base/uncased) is used for English text and Multilingual (base/cased) is used for French text. More information about Transformers: https://en.wikipedia.org/wiki/Transformer_(machine_learning_model). The Python package used is Sentence Transformers (https://www.sbert.net/docs/pretrained_models.html).
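The idea of feeding vector representations to a basic classifier can be sketched as follows. This is a minimal illustration, not the platform's implementation: the 3-dimensional vectors are hypothetical stand-ins for real sentence embeddings, and the nearest-centroid classifier and the names `fit` and `predict` are assumptions chosen to keep the example dependency-free.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit(embeddings, labels):
    """Compute one centroid per class from the labelled embeddings."""
    grouped = {}
    for vector, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in grouped.items()}


def predict(centroids, vector):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(centroids[label], vector))


# Hypothetical embeddings: a real encoder outputs far more dimensions.
train_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
train_labels = ["positive", "positive", "negative", "negative"]

centroids = fit(train_embeddings, train_labels)
prediction = predict(centroids, [0.7, 0.3, 0.1])
```

In practice the vectors would come from the embedding step, and any standard classifier could take the place of the nearest-centroid rule used here.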

By default, only the TF-IDF approach is used.
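To make the default encoding more concrete, here is a minimal sketch of a TF-IDF encoding over uni-grams and bi-grams. It uses the textbook formulation (relative term frequency multiplied by the logarithm of the inverse document frequency), which is an assumption: the exact weighting and tokenisation used by the platform are not specified here, and the function names `ngrams` and `tfidf` are hypothetical.

```python
import math
from collections import Counter


def ngrams(text):
    """Return the lowercase uni-grams and bi-grams of a whitespace-tokenised text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(corpus):
    """Encode each document as a {term: TF-IDF weight} dictionary."""
    docs = [Counter(ngrams(text)) for text in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in doc)
    encoded = []
    for doc in docs:
        total = sum(doc.values())
        encoded.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in doc.items()
        })
    return encoded


corpus = [
    "the delivery was fast",
    "the delivery was late",
    "great product",
]
vectors = tfidf(corpus)
```

Terms that appear in fewer documents, such as "fast", receive a higher weight than terms shared across documents, such as "the". A term present in every document would receive a weight of zero under this formulation.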

Advice:

For better performance, it is advisable to check the word embedding and sentence embedding options.
Checking these additional options will increase the time required for feature engineering, modeling, and prediction.

You will find more information about NLP features and applications on the prevision.io platform at the following links:

https://www.youtube.com/watch?v=8Zu7mpdk528

https://prevision.io/fr/nouvelle-version-v10-13-performances-sur-le-textes-ameliorees-et-reconnaissance-dimage-en-temps-reel/

https://medium.com/prevision-io/automated-nlp-with-prevision-io-part1-naive-bayes-classifier-475fa8bd73de