nuxeo-importer-core
contains sample code that can be adapted to run imports leveraging:
- thread-pooling
- batching (import several documents inside a given transaction)
- event processing filtering (enable bulk mode or skip some events)
This is the most efficient solution to run very fast imports.
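As an illustration of how thread-pooling and batching combine, here is a minimal sketch in plain Java. It does not use the nuxeo-importer-core API: the class `BatchedImportSketch` and the methods `partition`, `importBatch` and `runImport` are hypothetical names, and `importBatch` only stands in for "create all documents of a batch inside one transaction".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: the names below are illustrative, not nuxeo-importer-core API.
class BatchedImportSketch {

    /** Splits the source entries into consecutive batches of at most batchSize items. */
    static <T> List<List<T>> partition(List<T> entries, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < entries.size(); i += batchSize) {
            int end = Math.min(i + batchSize, entries.size());
            batches.add(new ArrayList<>(entries.subList(i, end)));
        }
        return batches;
    }

    /** Stand-in for "create all documents of the batch inside one transaction". */
    static int importBatch(List<String> batch) {
        // A real importer would start a transaction, create the documents, then commit.
        return batch.size();
    }

    /** Imports all entries with a fixed thread pool and returns the number of imported entries. */
    static int runImport(List<String> entries, int batchSize, int nbThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nbThreads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> batch : partition(entries, batchSize)) {
                Callable<Integer> task = () -> importBatch(batch);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // waits for the batch to complete
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("Import failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            entries.add("doc-" + i);
        }
        System.out.println(runImport(entries, 4, 2)); // prints 10
    }
}
```

Note that the batches here are independent of each other. With a real source, how entries are grouped and ordered depends on the source layout, which is exactly where the threading policy becomes complex.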
However, the default implementation used to come with some limitations and constraints :
- extending the importer is done in Java
- this can be an issue for non Java developers
- multi-threading policy can be complex
- multi-threading policy depends on the source layout and dependencies between entries
- if the import fails midway, it must be restarted
The work on the queue-based importer and Kafka aims at addressing these limitations.
We want the importer infrastructure to promote a clear separation between the 2 sides of the import process :
- Reader / Producer: the one reading the input data (from files, DB ...)
- Writer / Consumer: the one writing the data into the Nuxeo Repository
By decoupling the Reader and Writer, we have several gains :
- we can get the Writer/Consumer part very generic
- we can have a highly optimized importer engine
- we can run the producer and the consumer separately
- this means we can more easily re-run the import without being forced to re-run all the pre-processing
- developers "working on the import process" mainly have to work on the Reader/Producer part
- this part being mainly decoupled from Nuxeo, they do not have to be Nuxeo developers
In order to have this decoupling, the idea is to add a queue between the 2 parts of the importer:
Source data => Producer => Queue(s) => Consumer => Import Data in Nuxeo
This is a new implementation of the importer: nuxeo-importer-queues. We clearly split the importer flow into 2 sub-parts and make the queue system externalizable.
- Import part 1
- read the data from the source
- build an import message (can include some transformation)
- en-queue the message
- Import part 2
- read the message from the queue
- create a document inside the repository based on the message
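The two parts can be sketched with a plain `java.util.concurrent.BlockingQueue` standing in for the queue system. This is only an illustration under assumptions: `QueueImportSketch`, `produce` and `consume` are hypothetical names, messages are simple strings, and the "repository" is an in-memory list rather than a Nuxeo repository.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: an in-memory queue stands in for the real queue backend.
class QueueImportSketch {

    /** Import part 1: read the source, build a message, en-queue it. */
    static void produce(List<String> sourceEntries, BlockingQueue<String> queue) {
        for (String entry : sourceEntries) {
            // Building the message can include some transformation (here: trimming).
            String message = "title=" + entry.trim();
            queue.add(message);
        }
    }

    /** Import part 2: read messages from the queue and create one document per message. */
    static List<String> consume(BlockingQueue<String> queue) {
        List<String> repository = new ArrayList<>();
        String message;
        while ((message = queue.poll()) != null) {
            // A real consumer would create a document inside the Nuxeo repository.
            repository.add("Document[" + message + "]");
        }
        return repository;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        produce(Arrays.asList(" invoice-1 ", "invoice-2"), queue);
        // Part 2 only depends on the queue content, so it can run later or be re-run alone.
        System.out.println(consume(queue));
    }
}
```

Because `consume` never touches the source data, the pre-processing done in `produce` does not have to be repeated when only the writing side needs to run again, provided the queue content is kept.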
The queue in the middle also allows us to completely decouple the threading model between the 2 parts:
- part 1 can be mono-threaded if this is simpler (since this is usually not the bottleneck)
- part 2 is by default multi-threaded and batched to increase performance
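A minimal sketch of this threading model, again in plain Java and not based on the nuxeo-importer-queues API: a single thread fills the queue while several consumer threads drain it in batches. The class `ThreadedConsumerSketch`, the `END` marker and the `commit` method are assumptions made for the example; `commit` stands in for writing one batch in one transaction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one producer thread, several batching consumer threads.
class ThreadedConsumerSketch {

    /** Marker message telling a consumer that no more messages will come. */
    static final String END = "__END__";

    /** Stand-in for "write the whole batch in one transaction"; returns the batch size. */
    static int commit(List<String> batch) {
        return batch.size();
    }

    /** Part 2: takes messages until END, committing every batchSize messages. */
    static void consume(BlockingQueue<String> queue, int batchSize, AtomicInteger imported) {
        List<String> batch = new ArrayList<>();
        try {
            while (true) {
                String message = queue.take();
                if (END.equals(message)) {
                    break;
                }
                batch.add(message);
                if (batch.size() >= batchSize) {
                    imported.addAndGet(commit(batch));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                imported.addAndGet(commit(batch)); // commit the last, partial batch
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Runs part 1 in the calling thread and part 2 in nbConsumers threads. */
    static int runImport(List<String> entries, int nbConsumers, int batchSize) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger imported = new AtomicInteger();
        List<Thread> consumers = new ArrayList<>();
        for (int i = 0; i < nbConsumers; i++) {
            Thread consumer = new Thread(() -> consume(queue, batchSize, imported));
            consumer.start();
            consumers.add(consumer);
        }
        // Part 1: mono-threaded producer.
        for (String entry : entries) {
            queue.add(entry);
        }
        // One END marker per consumer, en-queued after all real messages.
        for (int i = 0; i < nbConsumers; i++) {
            queue.add(END);
        }
        try {
            for (Thread consumer : consumers) {
                consumer.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for consumers", e);
        }
        return imported.get();
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            entries.add("doc-" + i);
        }
        System.out.println(runImport(entries, 3, 4)); // prints 50
    }
}
```

The producer code does not change when the number of consumers or the batch size changes, which is the point of putting the queue between the two parts.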
The queuing system can be provided by different backends. nuxeo-importer-queues
currently supports 2 backends:
- Chronicle Queue
- in-JVM, but easy to set up
- Apache Kafka
- distributed MOM
Kafka may be a little bit more complex to deploy, but in exchange for the additional setup effort it provides some additional benefits:
- you can scale the queue between several servers
- this means that you can run import on different Nuxeo Server nodes
- you are not limited by available memory
- Kafka stores queues on disk
- you can write the client part in any language supported by Kafka
- Java, JavaScript, Python, .Net, Ruby ...
More than the Java API, the real interface you need to implement on the producer side is the message format.
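To make the idea of a message format concrete, here is a hedged sketch of a message serialized as JSON, a format that producers in any language can emit. The fields `type`, `parentPath`, `name` and `properties` are assumptions chosen for illustration; this section does not specify the actual message format used by nuxeo-importer-queues, and `ImportMessageSketch` is a hypothetical class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical message layout: the field names are assumptions, not the real format.
class ImportMessageSketch {

    final String type;
    final String parentPath;
    final String name;
    final Map<String, String> properties = new LinkedHashMap<>();

    ImportMessageSketch(String type, String parentPath, String name) {
        this.type = type;
        this.parentPath = parentPath;
        this.name = name;
    }

    /** Adds a property and returns this message to allow chaining. */
    ImportMessageSketch with(String key, String value) {
        properties.put(key, value);
        return this;
    }

    /** Wraps a string in double quotes, escaping quotes, backslashes and control characters. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Serializes the message as a single-line JSON object. */
    String toJson() {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"type\":").append(quote(type));
        sb.append(",\"parentPath\":").append(quote(parentPath));
        sb.append(",\"name\":").append(quote(name));
        sb.append(",\"properties\":{");
        String separator = "";
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            sb.append(separator).append(quote(entry.getKey()))
              .append(':').append(quote(entry.getValue()));
            separator = ",";
        }
        return sb.append("}}").toString();
    }

    public static void main(String[] args) {
        ImportMessageSketch message = new ImportMessageSketch("File", "/default-domain", "report")
                .with("dc:title", "Q1 report");
        System.out.println(message.toJson());
    }
}
```

A producer written in Python or JavaScript would only need to emit the same structure; it would not need any Java class from the importer.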
XXX
XXX describe principles.
- add marketplace package
- set up Kafka
- choose client
XXX
XXX
- threads
- document factory