These scripts were written to prepare the FTS database for use in Otacon507's Chinese Example sentences plugin: https://github.com/otacon507/zh-examples
The database importer will hopefully be rewritten in Python and integrated into the above repo so that end users can easily create their own databases. I originally used Dr. Xian Qian's 2MParallelCorpus (https://github.com/qxred/2MParallelCorpus). Please note that although this data is generally of high quality, it does contain some non-English/non-Chinese sentences (an estimated 3,000+, as detected with langid), 900+ HTML fragments, and an unknown number of nonsense sentence fragments (though probably not many).
The plugin partially works around the nonsense sentence fragments by sorting results by how far each sentence's length is from the average sentence length of the result set, so unusually short or long sentences sink to the bottom. This takes two queries.
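As a rough illustration of the two-query idea, here is a minimal Python sketch using the standard-library `sqlite3` module. It is not the plugin's actual code: the `sentences` table, its `zh` and `en` columns, and the helper names are hypothetical, and a plain table with `LIKE` stands in for the FTS table so the example runs on any SQLite build.

```python
import sqlite3

# Hypothetical sample data: four sentences containing "你好" with lengths
# 2, 8, 10 and 30 characters, plus one sentence that does not match.
SAMPLE = [
    ("你好", "Hello"),
    ("你好,我叫小明。", "Hello, my name is Xiao Ming."),
    ("你好,很高兴认识你。", "Hello, nice to meet you."),
    ("你好,请问去火车站怎么走?我第一次来这个城市,完全不认识路。",
     "Hello, how do I get to the train station? It is my first time in "
     "this city and I do not know the way at all."),
    ("谢谢你的帮助。", "Thank you for your help."),
]


def build_demo_db():
    """Create an in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sentences (zh TEXT, en TEXT)")
    conn.executemany("INSERT INTO sentences VALUES (?, ?)", SAMPLE)
    return conn


def rank_by_typical_length(conn, term, limit=10):
    """Return (zh, en) rows containing `term`, closest to average length first."""
    pattern = f"%{term}%"
    # Query 1: average sentence length (in characters) of the result set.
    (avg_len,) = conn.execute(
        "SELECT AVG(LENGTH(zh)) FROM sentences WHERE zh LIKE ?",
        (pattern,),
    ).fetchone()
    if avg_len is None:  # AVG over zero rows is NULL, i.e. nothing matched
        return []
    # Query 2: the same result set, ordered by distance from that average.
    return conn.execute(
        "SELECT zh, en FROM sentences WHERE zh LIKE ? "
        "ORDER BY ABS(LENGTH(zh) - ?) LIMIT ?",
        (pattern, avg_len, limit),
    ).fetchall()


if __name__ == "__main__":
    for zh, en in rank_by_typical_length(build_demo_db(), "你好"):
        print(len(zh), zh, "|", en)
```

With this sample data the average length is 12.5 characters, so the 10-character sentence ranks first and the 30-character one last. Against a real FTS table the `LIKE` filter would presumably be replaced by a `MATCH` query, which is an assumption about the schema rather than something shown here.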
// pseudocode: