These scripts were written to prepare the FTS database for use in Otacon507's Chinese Example sentences plugin: https://github.com/otacon507/zh-examples
The database importer will hopefully be rewritten in Python and integrated into the above repo so end users can easily create their own databases. I originally used Dr. Xian Qian's 2MParallelCorpus (https://github.com/qxred/2MParallelCorpus). Note that although this data is generally of high quality, it does contain some non-English/non-Chinese sentences (an estimated 3,000+, found using langid), roughly 900+ HTML fragments, and an unknown (but probably small) number of nonsense sentence fragments.
Regarding the nonsense sentence fragments: we partially avoided this issue in the plugin by sorting results by the distance of each sentence's length from the average sentence length of the result set. This is accomplished with two queries:
# Ruby sketch using the sqlite3 gem (the database path is a placeholder):
require 'sqlite3'

db  = SQLite3::Database.new('examples.db')
str = '你好'

# First query: average length of the sentences matching the search term.
avg_len = db.get_first_value(
  'SELECT AVG(LENGTH(zh)) FROM example_sentences WHERE zh MATCH ?', str)

# Second query: the matches themselves, ordered by distance from that average.
results = db.execute(
  'SELECT * FROM example_sentences WHERE zh MATCH ?
   ORDER BY ABS(? - LENGTH(zh))', [str, avg_len])
In the plugin, these results are then further processed into HTML lists and returned to the user's flashcard.
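The exact markup is handled inside the plugin, but the idea is roughly the following (an illustrative sketch only, assuming each result row comes back as an (english, chinese) pair; the plugin's actual field order and formatting may differ):

# Illustrative only: wrap each matched pair in a simple HTML list item.
items = results.map { |en, zh| "<li>#{zh}<br>#{en}</li>" }
html  = "<ul>#{items.join}</ul>"   # real code should also HTML-escape the text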
When Otacon507 first approached me with the idea, I wrote a simple prototype for the plugin that scraped data from dict.cn: https://gist.github.com/Takemikazuchi/9907875
There were several issues with that approach: (1) dependency on an external data source with no guarantee of continued existence; (2) slow response times (an entire HTML page has to be downloaded to get a couple of sentences); and (3) many characters/words did not have any sentences at all.
At the end of the day, relying on HTML scraping is too brittle to distribute in a plugin for end users. After extensive research, and after learning of the bilingual corpora used in linguistics research, we shifted focus to acquiring high-quality corpora with permissive licenses to distribute with the plugin. Although this comes with the space penalty of large corpus files that need to be downloaded, we believe the speed and flexibility it affords us in processing the data is well worth it. We also considered the Glosbe API, but unfortunately most of its Chinese data comes from sources like the UN/EU corpora or from extremely technical documents not suitable for educational use.
The corpora approach has issues of its own, though: many of the high-quality corpora are behind paywalls or restricted to academic use, and conversely, finding high-quality free corpora with conversational sentences is extremely difficult.
Step 1:
Prepare a CSV file with the following two-column format: english,chinese
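For example (illustrative rows; English sentences containing commas must be quoted per normal CSV rules):

How are you?,你好吗？
"Yes, I understand.",是的，我明白。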
Step 2:
Create an FTS4 virtual table in SQLite (http://www.sqlite.org/fts3.html)
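A minimal sketch of the table definition, assuming the example_sentences table and zh column referenced by the plugin's queries above (en is an assumed name for the English column, and the database path is a placeholder):

require 'sqlite3'

db = SQLite3::Database.new('examples.db')
# FTS4 virtual table: both columns are full-text indexed; the plugin searches zh.
db.execute('CREATE VIRTUAL TABLE IF NOT EXISTS example_sentences USING fts4(en, zh)')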
Step 3:
Import the data from the CSV file into the virtual table. The import is done in batches of 500, is multithreaded, and is wrapped in a single transaction due to the size of the data. (The threads are Ruby's green threads, but the import still benefits from context switching during wait-heavy operations.) A minimal sketch of this step follows.
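This sketch assumes the table from Step 2 and a reasonably recent Ruby; the file names and the reader-thread/writer split are illustrative rather than the exact original script:

require 'csv'
require 'sqlite3'

BATCH_SIZE = 500
queue = SizedQueue.new(10)                   # bounded hand-off between threads

# Reader thread: parse the CSV and enqueue rows in batches of BATCH_SIZE.
reader = Thread.new do
  batch = []
  CSV.foreach('sentences.csv') do |en, zh|   # placeholder file name
    batch << [en, zh]
    if batch.size == BATCH_SIZE
      queue << batch
      batch = []
    end
  end
  queue << batch unless batch.empty?
  queue << :done                             # sentinel: no more batches
end

# Main thread: insert every batch inside a single enclosing transaction.
db = SQLite3::Database.new('examples.db')    # placeholder path
db.transaction do
  insert = db.prepare('INSERT INTO example_sentences (en, zh) VALUES (?, ?)')
  while (batch = queue.pop) != :done
    batch.each { |en, zh| insert.execute(en, zh) }
  end
  insert.close
end
reader.join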