@ezra100
Last active January 21, 2023 23:03
How to create CLD-3 libraries to use in your own project

How to create an independent shared library of cld-3

If you want to use cld-3 in your project without depending on the Ninja/gn build system, here is how I did it on Linux (Arch 4.16.13-1) with g++.

Disclaimer:

I'm not an expert on this topic. This is just how I (finally) managed to do it, and it is not necessarily the best way (especially the STD compatibility part). If you have any suggestions for doing it better, let me know.

Static vs Shared Library

The benefit of a shared library is that you can copy it wherever you want and use it from there. The static library, I found out, can't be moved unless you move the whole out/Debug folder (or maybe just the out/Debug/obj folder, I'm not quite sure). There might be some difference in performance, but reportedly it isn't significant.

Steps to create a shared/static library

  1. Check out the Chromium repository.
  2. Add the file language_identifier_lib.cc to /PATH/TO/chromium/src/third_party/cld_3/src/src/
  3. For a shared library add the following to /PATH/TO/chromium/src/third_party/cld_3/src/src/BUILD.gn:
shared_library("lang_identifier_so"){
  sources = [
    "language_identifier_lib.cc"
  ]
  deps = [
    ":cld_3"
  ]
}
  • For a static library add:
static_library("static_lang_identifier"){
  sources = [
    "language_identifier_lib.cc"
  ]
  deps = [
    ":cld_3",
  ]
}
  4. Run gn args out/Debug && ninja -C out/Debug third_party/cld_3/src/src:lang_identifier_so. gn args opens an editor; add the argument use_custom_libcxx=false there, then save and close, after which the build starts.
    • For Release (no debug symbols) also add the argument is_debug = false.
    • For a static library replace :lang_identifier_so with :static_lang_identifier.
  5. Create a folder for your own project and put main.cpp and lIdentifier.h in it (main.cpp includes the header as lIdentifier.hpp, so either rename the file or adjust the include).
  6. From /chromium/src/third_party/cld_3/src/src/out/Debug copy libprotobuf_lite.so and liblang_identifier_so.so (for a shared library) to your project folder.
  7. For a shared library, from your project folder run clang++ -g -std=c++17 -o a.out main.cpp -L . -l lang_identifier_so -l protobuf_lite
    • For Release remove the -g option.
    • For a static library run clang++ -g -std=c++17 -o a.out main.cpp -L /PATH/TO/chromium/src/third_party/cld_3/src/src/out/Debug/obj/third_party/cld_3/src/src/ -L . -l static_lang_identifier -l cld_3 -l protos -l protobuf_lite
  8. Then run export LD_LIBRARY_PATH=`pwd` (this tells the dynamic loader to look in the current folder for the shared library).
  9. Run ./a.out "This piece of text is in English. Този текст е на Български." 3
  10. The output:
  language code: bg
  language name: Bulgarian
  probability: 0.917387
  reliable: 1
  proportion: 0.585366

  language code: en
  language name: English
  probability: 0.999979
  reliable: 1
  proportion: 0.414634

  language code: und
  language name: 
  probability: 0
  reliable: 0
  proportion: 0
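
The last entry in the output above is a placeholder: three languages were requested, only two were detected, and the remaining slot came back as und with zeroed fields. Below is a minimal sketch of counting the real detections before printing. It assumes a stand-in Result struct with the same fields as the values printed above, and it assumes placeholders come last, as they do in this output. The helper name countDetected is hypothetical and not part of CLD3.

```cpp
#include <cstddef>
#include <string>

// Stand-in for the Result struct declared in lIdentifier.h (assumption:
// same field names as the values printed in the output above).
struct Result {
  std::string language;
  float probability = 0.0f;
  bool is_reliable = false;
  float proportion = 0.0f;
};

// Hypothetical helper: returns how many leading entries are real detections,
// i.e. the entries that come before the first "und" placeholder.
std::size_t countDetected(const Result *results, std::size_t n) {
  std::size_t count = 0;
  while (count < n && results[count].language != "und") {
    ++count;
  }
  return count;
}
```

A print loop could then run up to countDetected(results, numOfLangs) instead of numOfLangs, which would skip the empty block at the end.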

STD compatibility

There were two issues with std. The first was with the symbol names: without use_custom_libcxx=false the library's symbols were in std::__1 rather than std::__cxx11, which caused undefined reference errors. Adding use_custom_libcxx=false solved that, as noted in the steps. The second issue was the vector, which came back empty; I solved that by copying the vector into an array and returning the array. There's a similar issue on Stack Overflow, with an answer that requires loading a bunch of headers and configuring a class or something. I decided it was simpler for now to copy into an array and return that; if you come up with a better solution, let me know. It's worth mentioning that Chromium's build system uses its own clang compiler (at /PATH/TO/chromium/src/third_party/llvm-build/Release+Asserts/bin/clang++), which was one version ahead of mine (7.0 vs 6.0), but using their compiler didn't solve the std::vector compatibility problem.
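
The std::__1 versus std::__cxx11 mismatch can be checked from code. libc++ defines the macro _LIBCPP_VERSION and places its symbols in an inline namespace such as std::__1, while libstdc++ defines __GLIBCXX__ and uses std::__cxx11 for its newer string ABI. Below is a minimal sketch that reports which library a translation unit was compiled against. The function name standardLibraryName is hypothetical.

```cpp
#include <string>  // any standard header defines the library's version macro

// Hypothetical helper: names the C++ standard library this file was built with.
std::string standardLibraryName() {
#if defined(_LIBCPP_VERSION)
  return "libc++";     // symbols live in an inline namespace such as std::__1
#elif defined(__GLIBCXX__)
  return "libstdc++";  // the newer string ABI lives in std::__cxx11
#else
  return "unknown";
#endif
}
```

Compiling this once with the compiler and flags used for the library, and once with those used for your own program, would show whether both sides agree. In the build described here, use_custom_libcxx=false is what made the library use the system standard library instead of Chromium's bundled one.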

language_identifier_lib.cc

/* Copyright 2016 Google Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include "base.h"
#include "nnet_language_identifier.h"

using chrome_lang_id::NNetLanguageIdentifier;
using namespace std;

// Make the functions below visible outside the shared library.
#if defined(_MSC_VER)
#define EXPORT __declspec(dllexport)
#else
#define EXPORT __attribute__((visibility("default")))
#endif

// Shared identifier instance, created with (min bytes, max bytes).
NNetLanguageIdentifier *lang_id = new NNetLanguageIdentifier(0, 1000);

// Min:
// Minimum number of bytes needed to make a prediction. If the default
// constructor is called, this variable is equal to kMinNumBytesToConsider.
// Max:
// Maximum number of bytes to use to make a prediction. If the default
// constructor is called, this variable is equal to kMaxNumBytesToConsider.
EXPORT void setMinMaxBytes(int min, int max) {
  delete lang_id;
  lang_id = new NNetLanguageIdentifier(min, max);
}

// @arr is assumed to point to an array of at least @numberOfLangs elements.
EXPORT void findTopNMostFreqLangs(const string &text, int numberOfLangs,
                                  NNetLanguageIdentifier::Result *arr) {
  auto results = lang_id->FindTopNMostFreqLangs(text, numberOfLangs);
  std::copy(results.begin(), results.end(), arr);
}

// Returns a new[]-allocated array of @numberOfLangs results; the caller must
// release it with delete[].
EXPORT NNetLanguageIdentifier::Result *
findTopNMostFreqLangs(const string &text, int numberOfLangs) {
  NNetLanguageIdentifier::Result *arr =
      new NNetLanguageIdentifier::Result[numberOfLangs];
  findTopNMostFreqLangs(text, numberOfLangs, arr);
  return arr;
}

EXPORT NNetLanguageIdentifier::Result findLanguage(const string &text) {
  return lang_id->FindLanguage(text);
}
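
The Result struct copied by findTopNMostFreqLangs still holds a std::string, so a standard-library type continues to cross the library boundary. One possible alternative is a plain struct with no standard-library members, sketched here on the assumption that language codes fit in a small fixed buffer. The names PlainResult and toPlain are hypothetical and are not part of CLD3 or of the files in this gist.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for chrome_lang_id::NNetLanguageIdentifier::Result (assumption:
// the same four fields that the library code above copies).
struct Result {
  std::string language;
  float probability = 0.0f;
  bool is_reliable = false;
  float proportion = 0.0f;
};

// Hypothetical plain struct: only built-in types, so its layout does not
// depend on which C++ standard library either side was compiled against.
struct PlainResult {
  char language[16];  // assumed large enough for codes such as "bg" or "und"
  float probability;
  bool is_reliable;
  float proportion;
};

// Copies one Result into a PlainResult, truncating over-long language codes.
PlainResult toPlain(const Result &r) {
  PlainResult p{};  // zero-initialised, so language stays NUL-terminated
  std::strncpy(p.language, r.language.c_str(), sizeof(p.language) - 1);
  p.probability = r.probability;
  p.is_reliable = r.is_reliable;
  p.proportion = r.proportion;
  return p;
}
```

With this approach the exported function would fill a caller-provided PlainResult array, and the caller would rebuild a std::string from the char buffer on its own side.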
lIdentifier.h

#include <map>
#include <string>
#include <vector>

#ifdef __linux__
#define IMPORT
#define WMAIN main
#define WSTRING string
#define WCHAR char
#else
#define IMPORT __declspec(dllimport)
#define WMAIN main
#define WSTRING string
#define WCHAR char
#endif

using namespace std;

// NNetLanguageIdentifier is a class in the library; it is declared as a
// namespace here, which gives Result the same qualified name.
namespace chrome_lang_id {
namespace NNetLanguageIdentifier {
struct Result {
  string language;
  float probability = 0.0;   // Language probability.
  bool is_reliable = false;  // Whether the prediction is reliable.
  // Proportion of bytes associated with the language. If FindLanguage
  // is called, this variable is set to 1.
  float proportion = 0.0;
  // Result(PointerResult pResult){
  //   this->language = std::string(pResult.language);
  //   this->probability = pResult.probability;
  //   this->proportion = pResult.proportion;
  //   this->is_reliable = pResult.is_reliable;
  // }
};
} // namespace NNetLanguageIdentifier
} // namespace chrome_lang_id

using namespace chrome_lang_id::NNetLanguageIdentifier;
using namespace chrome_lang_id;

// Min:
// Minimum number of bytes needed to make a prediction. If the default
// constructor is called, this variable is equal to kMinNumBytesToConsider.
// Max:
// Maximum number of bytes to use to make a prediction. If the default
// constructor is called, this variable is equal to kMaxNumBytesToConsider.
IMPORT void setMinMaxBytes(int min, int max);
IMPORT NNetLanguageIdentifier::Result *findTopNMostFreqLangs(const string &text,
                                                             int numberOfLangs);
IMPORT void findTopNMostFreqLangs(const string &text, int numberOfLangs,
                                  NNetLanguageIdentifier::Result *);
IMPORT Result findLanguage(const string &text);

map<std::string, std::string> codeToLangName{
{"ab", "Abkhazian"},
{"aa", "Afar"},
{"af", "Afrikaans"},
{"sq", "Albanian"},
{"am", "Amharic"},
{"ar", "Arabic"},
{"an", "Aragonese"},
{"hy", "Armenian"},
{"as", "Assamese"},
{"ae", "Avestan"},
{"ay", "Aymara"},
{"az", "Azerbaijani"},
{"ba", "Bashkir"},
{"eu", "Basque"},
{"be", "Belarusian"},
{"bn", "Bengali"},
{"bh", "Bihari"},
{"bi", "Bislama"},
{"bs", "Bosnian"},
{"br", "Breton"},
{"bg", "Bulgarian"},
{"my", "Burmese"},
{"ca", "Catalan"},
{"ch", "Chamorro"},
{"ce", "Chechen"},
{"zh", "Chinese"},
{"cu", "Church Slavic; Slavonic; Old Bulgarian"},
{"cv", "Chuvash"},
{"kw", "Cornish"},
{"co", "Corsican"},
{"hr", "Croatian"},
{"cs", "Czech"},
{"da", "Danish"},
{"dv", "Divehi; Dhivehi; Maldivian"},
{"nl", "Dutch"},
{"dz", "Dzongkha"},
{"en", "English"},
{"eo", "Esperanto"},
{"et", "Estonian"},
{"fo", "Faroese"},
{"fj", "Fijian"},
{"fi", "Finnish"},
{"fr", "French"},
{"gd", "Gaelic; Scottish Gaelic"},
{"gl", "Galician"},
{"ka", "Georgian"},
{"de", "German"},
{"el", "Greek, Modern (1453-)"},
{"gn", "Guarani"},
{"gu", "Gujarati"},
{"ht", "Haitian; Haitian Creole"},
{"ha", "Hausa"},
{"he", "Hebrew"},
{"hz", "Herero"},
{"hi", "Hindi"},
{"ho", "Hiri Motu"},
{"hu", "Hungarian"},
{"is", "Icelandic"},
{"io", "Ido"},
{"id", "Indonesian"},
{"ia", "Interlingua (International Auxiliary Language Association)"},
{"ie", "Interlingue"},
{"iu", "Inuktitut"},
{"ik", "Inupiaq"},
{"ga", "Irish"},
{"it", "Italian"},
{"ja", "Japanese"},
{"jv", "Javanese"},
{"kl", "Kalaallisut"},
{"kn", "Kannada"},
{"ks", "Kashmiri"},
{"kk", "Kazakh"},
{"km", "Khmer"},
{"ki", "Kikuyu; Gikuyu"},
{"rw", "Kinyarwanda"},
{"ky", "Kirghiz"},
{"kv", "Komi"},
{"ko", "Korean"},
{"kj", "Kuanyama; Kwanyama"},
{"ku", "Kurdish"},
{"lo", "Lao"},
{"la", "Latin"},
{"lv", "Latvian"},
{"li", "Limburgan; Limburger; Limburgish"},
{"ln", "Lingala"},
{"lt", "Lithuanian"},
{"lb", "Luxembourgish; Letzeburgesch"},
{"mk", "Macedonian"},
{"mg", "Malagasy"},
{"ms", "Malay"},
{"ml", "Malayalam"},
{"mt", "Maltese"},
{"gv", "Manx"},
{"mi", "Maori"},
{"mr", "Marathi"},
{"mh", "Marshallese"},
{"mo", "Moldavian"},
{"mn", "Mongolian"},
{"na", "Nauru"},
{"nv", "Navaho, Navajo"},
{"nd", "Ndebele, North"},
{"nr", "Ndebele, South"},
{"ng", "Ndonga"},
{"ne", "Nepali"},
{"se", "Northern Sami"},
{"no", "Norwegian"},
{"nb", "Norwegian Bokmal"},
{"nn", "Norwegian Nynorsk"},
{"ny", "Nyanja; Chichewa; Chewa"},
{"oc", "Occitan (post 1500); Provencal"},
{"or", "Oriya"},
{"om", "Oromo"},
{"os", "Ossetian; Ossetic"},
{"pi", "Pali"},
{"pa", "Panjabi"},
{"fa", "Persian"},
{"pl", "Polish"},
{"pt", "Portuguese"},
{"ps", "Pushto"},
{"qu", "Quechua"},
{"rm", "Raeto-Romance"},
{"ro", "Romanian"},
{"rn", "Rundi"},
{"ru", "Russian"},
{"sm", "Samoan"},
{"sg", "Sango"},
{"sa", "Sanskrit"},
{"sc", "Sardinian"},
{"sr", "Serbian"},
{"sn", "Shona"},
{"ii", "Sichuan Yi"},
{"sd", "Sindhi"},
{"si", "Sinhala; Sinhalese"},
{"sk", "Slovak"},
{"sl", "Slovenian"},
{"so", "Somali"},
{"st", "Sotho, Southern"},
{"es", "Spanish; Castilian"},
{"su", "Sundanese"},
{"sw", "Swahili"},
{"ss", "Swati"},
{"sv", "Swedish"},
{"tl", "Tagalog"},
{"ty", "Tahitian"},
{"tg", "Tajik"},
{"ta", "Tamil"},
{"tt", "Tatar"},
{"te", "Telugu"},
{"th", "Thai"},
{"bo", "Tibetan"},
{"ti", "Tigrinya"},
{"to", "Tonga (Tonga Islands)"},
{"ts", "Tsonga"},
{"tn", "Tswana"},
{"tr", "Turkish"},
{"tk", "Turkmen"},
{"tw", "Twi"},
{"ug", "Uighur"},
{"uk", "Ukrainian"},
{"ur", "Urdu"},
{"uz", "Uzbek"},
{"vi", "Vietnamese"},
{"vo", "Volapuk"},
{"wa", "Walloon"},
{"cy", "Welsh"},
{"fy", "Western Frisian"},
{"wo", "Wolof"},
{"xh", "Xhosa"},
{"yi", "Yiddish"},
{"yo", "Yoruba"},
{"za", "Zhuang; Chuang"},
{"zu", "Zulu"},
{"und", "Undefined Language"}
};

main.cpp

#include "lIdentifier.hpp"
#include <algorithm>
#include <codecvt>
#include <iostream>
#include <locale>
#include <map>
#include <string>
#include <vector>

#ifdef __linux__
#define WMAIN main
#define WSTRING string
#define WCHAR char
#else
#define WMAIN wmain
#define WSTRING wstring
#define WCHAR wchar_t
#endif

using namespace std;

void runTopFrequent(const string &text, int numOfLangs);

// working with the already defined WMAIN WCHAR gave me errors
#ifdef __linux__
int main(int argc, char *argv[])
#else
int wmain(int argc, wchar_t *argv[])
#endif
{
  if (argc < 2) {
    std::cout << "usage: " << argv[0] << " <text> [number_of_languages]"
              << endl;
    return 1;
  }
#ifdef __linux__
  std::string text(argv[1]);
#else
  std::wstring wText(argv[1]);
  // see https://stackoverflow.com/a/18374698/4483033
  // use converter (.to_bytes: wstr->str, .from_bytes: str->wstr)
  std::string text =
      std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(wText);
#endif // __linux__

  // With only a text argument, report the single most likely language.
  if (argc < 3) {
    Result result = findLanguage(text);
    std::cout << "text: " << text << std::endl
              << " language: " << result.language << std::endl
              << " language name: " << codeToLangName[result.language]
              << std::endl
              << " probability: " << result.probability << std::endl
              << " reliable: " << result.is_reliable << std::endl
              << " proportion: " << result.proportion << std::endl
              << std::endl;
    return 0;
  }

  // Otherwise the second argument is the number of languages to report.
  int numOfLangs = stoi(argv[2]);
  runTopFrequent(text, numOfLangs);
  return 0;
}

void runTopFrequent(const string &text, int numOfLangs) {
  // Caller-owned buffer that the library fills with numOfLangs results.
  Result *results = new Result[numOfLangs];
  findTopNMostFreqLangs(text, numOfLangs, results);
  for (int i = 0; i < numOfLangs; i++) {
    const auto &result = results[i];
    std::cout << " language code: " << result.language << std::endl
              << " language name: " << codeToLangName[result.language]
              << std::endl
              << " probability: " << result.probability << std::endl
              << " reliable: " << result.is_reliable << std::endl
              << " proportion: " << result.proportion << std::endl
              << std::endl;
  }
  delete[] results;
}
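
runTopFrequent manages its buffer with new[] and delete[]. Below is a minimal sketch of the same call pattern with a std::vector as the caller-owned buffer, which frees itself on every exit path. The vector never crosses the library boundary; only its data() pointer does. fillTopN is a hypothetical stand-in for the exported findTopNMostFreqLangs(text, numberOfLangs, arr), written so that the sketch compiles on its own, and the values it writes are placeholders rather than real detections.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in with the same fields as the Result struct in lIdentifier.h.
struct Result {
  std::string language = "und";
  float probability = 0.0f;
  bool is_reliable = false;
  float proportion = 0.0f;
};

// Hypothetical stand-in for the library call: writes n entries into arr.
void fillTopN(const std::string &text, int n, Result *arr) {
  for (int i = 0; i < n; ++i) {
    arr[i] = Result{};
  }
  if (n > 0 && !text.empty()) {
    arr[0].language = "en";  // placeholder value, not a real detection
    arr[0].probability = 1.0f;
    arr[0].is_reliable = true;
    arr[0].proportion = 1.0f;
  }
}

// Caller-side wrapper: the vector owns the buffer, the callee sees a pointer.
std::vector<Result> topLanguages(const std::string &text, int n) {
  std::vector<Result> results(static_cast<std::size_t>(n > 0 ? n : 0));
  if (!results.empty()) {
    fillTopN(text, n, results.data());
  }
  return results;
}
```

Because the vector is created and destroyed entirely inside the calling program, this should not run into the empty-vector problem described under STD compatibility, though I have not tested it against the real library.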