Clone the rasa_nlu repository and check out the lean-crf branch. This is only required until PR #1095 gets merged. Once it is merged, a plain clone will do the trick.
$ git clone https://github.com/RasaHQ/rasa_nlu.git
$ cd rasa_nlu
$ git fetch
$ git checkout lean-crf
Create NLU training data in the language of your choice, with your entities and intents predefined, using this online trainer.
Create an NLU config file with the following pipeline:
pipeline:
- name: "tokenizer_whitespace"
- name: "ner_crf"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
Finally, train your bot using the training data and the config file created in steps 2 and 3 respectively.
$ python -m rasa_nlu.train -d <training file> -c <config file> --path <output path> --debug
You can use the following command to run the bot with the trained model:
$ python -m rasa_nlu.run -m <path/to/trained/model>
I tried it with a small Arabic dataset and it worked perfectly (identifying correct intents and entities). This is the training data I used:
{
"rasa_nlu_data": {
"common_examples": [
{
"text": "مرحبا",
"intent": "greet",
"entities": []
},
{
"text": "سلام",
"intent": "greet",
"entities": []
},
{
"text": "هل هناك أي مطاعم في شمال المدينة",
"intent": "restaurant_search",
"entities": [
{
"start": 20,
"end": 24,
"value": "شمال",
"entity": "location"
}
]
},
{
"text": "أنا أريد أن آكل البيتزا",
"intent": "restaurant_search",
"entities": [
{
"start": 16,
"end": 23,
"value": "البيتزا",
"entity": "cuisine"
}
]
},
{
"text": "وداعا",
"intent": "bye",
"entities": []
},
{
"text": "هل يوجد مطعم مكسيكي في كاليفورنيا؟",
"intent": "restaurant_search",
"entities": [
{
"start": 13,
"end": 19,
"value": "مكسيكي",
"entity": "cuisine"
},
{
"start": 23,
"end": 33,
"value": "كاليفورنيا",
"entity": "location"
}
]
},
{
"text": "أعطني بعض الطعام التايلاندية في وسط المدينة.",
"intent": "restaurant_search",
"entities": [
{
"start": 17,
"end": 28,
"value": "التايلاندية",
"entity": "cuisine"
},
{
"start": 32,
"end": 43,
"value": "وسط المدينة",
"entity": "location"
}
]
},
{
"text": "أراك لاحقاً",
"intent": "bye",
"entities": []
}
]
}
}
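Character offsets are the easiest thing to get wrong when annotating by hand, especially with right-to-left text where the order on screen differs from the stored order. Below is a minimal sketch of a sanity check, assuming the data follows the JSON layout above and that `start`/`end` count characters with an exclusive end (which matches the examples). `find_misaligned_entities` is a hypothetical helper name, not part of rasa_nlu.

```python
def find_misaligned_entities(training_data):
    """Return (text, entity) pairs where text[start:end] differs from the value."""
    problems = []
    for example in training_data["rasa_nlu_data"]["common_examples"]:
        text = example["text"]
        for entity in example.get("entities", []):
            # Python 3 strings are indexed by code point, like the offsets above.
            if text[entity["start"]:entity["end"]] != entity["value"]:
                problems.append((text, entity))
    return problems


# One example taken from the data above. The "location" offsets are shifted
# by one on purpose (24-34 instead of 23-33) to show what a failure looks like.
sample = {
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "هل يوجد مطعم مكسيكي في كاليفورنيا؟",
                "intent": "restaurant_search",
                "entities": [
                    {"start": 13, "end": 19, "value": "مكسيكي", "entity": "cuisine"},
                    {"start": 24, "end": 34, "value": "كاليفورنيا", "entity": "location"},
                ],
            }
        ]
    }
}

problems = find_misaligned_entities(sample)
for text, entity in problems:
    print("misaligned:", entity["entity"], entity["start"], entity["end"])
```

Running a check like this before training catches offset mistakes that would otherwise only show up as poorly extracted entities.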
After training on this data with the above-mentioned pipeline, I got the following output from rasa_nlu.run:
2018-05-26 19:02:03 INFO __main__ - NLU model loaded. Type a message and press enter to parse it.
أعطني بعض الطعام التايلاندية في وسط المدينة
{
"intent": {
"name": "restaurant_search",
"confidence": 0.9609272480010986
},
"entities": [
{
"start": 17,
"end": 28,
"value": "\u0627\u0644\u062a\u0627\u064a\u0644\u0627\u0646\u062f\u064a\u0629",
"entity": "cuisine",
"confidence": 0.8295018967592525,
"extractor": "ner_crf"
}
],
"intent_ranking": [
{
"name": "restaurant_search",
"confidence": 0.9609272480010986
},
{
"name": "greet",
"confidence": 0.01041296124458313
},
{
"name": "bye",
"confidence": -0.04000069573521614
}
],
"text": "\u0623\u0639\u0637\u0646\u064a \u0628\u0639\u0636 \u0627\u0644\u0637\u0639\u0627\u0645 \u0627\u0644\u062a\u0627\u064a\u0644\u0627\u0646\u062f\u064a\u0629 \u0641\u064a \u0648\u0633\u0637 \u0627\u0644\u0645\u062f\u064a\u0646\u0629"
}
2018-05-26 19:02:06 INFO __main__ - Next message:
هل هناك أي مطاعم في شمال المدينة
{
"intent": {
"name": "restaurant_search",
"confidence": 0.95904541015625
},
"entities": [
{
"start": 20,
"end": 24,
"value": "\u0634\u0645\u0627\u0644",
"entity": "location",
"confidence": 0.7714837162104833,
"extractor": "ner_crf"
}
],
"intent_ranking": [
{
"name": "restaurant_search",
"confidence": 0.95904541015625
},
{
"name": "bye",
"confidence": 0.016841422766447067
},
{
"name": "greet",
"confidence": -0.01291605830192566
}
],
"text": "\u0647\u0644 \u0647\u0646\u0627\u0643 \u0623\u064a \u0645\u0637\u0627\u0639\u0645 \u0641\u064a \u0634\u0645\u0627\u0644 \u0627\u0644\u0645\u062f\u064a\u0646\u0629"
}
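To use a response like this in code, it is usually enough to pull out the top intent and the extracted entities. The following is a rough sketch that assumes the response is available as a JSON string shaped like the output above (the `intent_ranking` list is left out for brevity). `summarize` is a hypothetical helper, not a rasa_nlu function.

```python
import json

# Second response from above, kept as raw text so the \uXXXX escapes
# reach the JSON parser unchanged.
raw = r'''
{
  "intent": {"name": "restaurant_search", "confidence": 0.95904541015625},
  "entities": [
    {
      "start": 20,
      "end": 24,
      "value": "\u0634\u0645\u0627\u0644",
      "entity": "location",
      "confidence": 0.7714837162104833,
      "extractor": "ner_crf"
    }
  ],
  "text": "\u0647\u0644 \u0647\u0646\u0627\u0643 \u0623\u064a \u0645\u0637\u0627\u0639\u0645 \u0641\u064a \u0634\u0645\u0627\u0644 \u0627\u0644\u0645\u062f\u064a\u0646\u0629"
}
'''


def summarize(result):
    """Return the intent name and a list of (entity, value) pairs."""
    pairs = [(e["entity"], e["value"]) for e in result["entities"]]
    return result["intent"]["name"], pairs


result = json.loads(raw)  # the parser turns the escapes into real characters
intent, entities = summarize(result)
print(intent, entities)

# The reported offsets index into the decoded text, not the escaped form.
offsets_ok = all(
    result["text"][e["start"]:e["end"]] == e["value"] for e in result["entities"]
)
print("offsets match:", offsets_ok)
```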
The output shows the Arabic text as Unicode escape sequences (\uXXXX). I cross-checked to confirm that these escapes decode to exactly the desired entity texts. It isn't a problem to convert these escaped strings back to readable Arabic:
>>> print("\u0627\u0644\u062a\u0627\u064a\u0644\u0627\u0646\u062f\u064a\u0629")
التايلاندية
>>> print("\u0634\u0645\u0627\u0644")
شمال
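If the whole response should be readable rather than individual strings, the JSON can be re-serialized without ASCII escaping. This is a small sketch using only Python's standard `json` module; the escaped fragment is cut down from the output above, and nothing here depends on rasa_nlu.

```python
import json

# A fragment of the escaped output, kept raw so the backslashes survive.
escaped = r'{"value": "\u0634\u0645\u0627\u0644", "entity": "location"}'

entity = json.loads(escaped)

# ensure_ascii=False keeps non-ASCII characters as they are instead of
# writing them as \uXXXX escapes (the default behaviour of json.dumps).
readable = json.dumps(entity, ensure_ascii=False)
print(readable)
```

When writing the result to a file, open the file with `encoding="utf-8"` so the Arabic characters are stored correctly.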