Docusaurus intro for Crowdin

Context

I work as a contractor for Facebook on Docusaurus, a React-based static site generator (Jamstack) for building documentation websites.

It is a popular project (20k GitHub stars) with many users (not all of them need i18n), including other popular open-source tools: basically all tools published by Facebook, and many more.

We have all kinds of users, from highly skilled open-source developers to barely technical users who should be able to translate docs through an interface like Crowdin.

There are 2 versions:

v1 (doc): already supports i18n/Crowdin (https://docusaurus.io/docs/en/translation#crowdin) (I think all v1 sites use Crowdin CLI v2, not v3)
v2 (doc): in alpha for a long time. The RC will be published once the i18n/Crowdin integration is ready and we have a migration plan for v1 users.

Examples of popular tools using Docusaurus v1 with i18n (that we should be able to migrate to v2):

Versioning particularities

The docs can be versioned, and we don't use a git-based workflow for that. One git branch can contain multiple docs versions.

We actually build a single site containing multiple versions, as a single-page application, and allow users to seamlessly switch versions through a dropdown, as can be seen on http://v2.docusaurus.io/.

Docs folders:

"main/upstream" docs are at https://github.com/facebook/docusaurus/tree/slorber/i18n/website/docs
versioned docs: https://github.com/facebook/docusaurus/tree/slorber/i18n/website/versioned_docs

To create a new version, we use a CLI command that copies the upstream website/docs folder into a new folder under website/versioned_docs. For example, yarn docusaurus docs:version 5.0.0 roughly amounts to a command like cp -R website/docs website/versioned_docs/version-5.0.0.

There is also something worth mentioning: we support multiple docs instances on the same site. For example, a site can maintain iOS SDK docs in v1/v2 and, in parallel, Android SDK docs in v2/v3.

v2 i18n goal

There is an RFC here: facebook/docusaurus#3317

Some goals are:

Make it easy to work with Crowdin by default
Provide clear migration path for v1 sites already using Crowdin
No SaaS lock-in: not everybody wants to use Crowdin; some interesting reasons are in the RFC and its many links

v2 i18n work

We translate the markdown files of the site by using a FS-based convention to avoid any SaaS lock-in and also enable git-based translation workflows.

If there is a doc at website/docs/myDoc and we build the site in French, we look for the same doc at website/i18n/fr/docs/current/myDoc. If it exists, it takes priority; otherwise we fall back to the original doc.
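A minimal sketch of this lookup, assuming the website/i18n/<locale>/docs/current/ layout used in the Crowdin config further down. The resolve_doc function is a hypothetical helper for illustration, not actual Docusaurus code:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the doc to render for a locale.
# Usage: resolve_doc <website_dir> <locale> <doc_relative_path>
resolve_doc() {
  local website="$1" locale="$2" doc="$3"
  local localized="$website/i18n/$locale/docs/current/$doc"
  if [ -f "$localized" ]; then
    echo "$localized"          # the translated doc takes priority
  else
    echo "$website/docs/$doc"  # otherwise fall back to the original doc
  fi
}
```

For example, resolve_doc website fr myDoc.md prints the French path when that file exists, and website/docs/myDoc.md otherwise.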

The way Crowdin works (source/translations mapping) fits this FS-based convention well, as I just have to tell Crowdin where to put the translated markdown files.

The ongoing v2 i18n work is being done here:

Git branch: https://github.com/facebook/docusaurus/tree/slorber/i18n
PR: facebook/docusaurus#3325
Crowdin config: https://github.com/facebook/docusaurus/blob/slorber/i18n/crowdin-v2.yaml
CI deploy script: https://github.com/facebook/docusaurus/blob/slorber/i18n/website/package.json#L22
Netlify deploy preview in English (default/source): https://deploy-preview-3325--docusaurus-2.netlify.app/classic/
Netlify deploy preview in French: https://deploy-preview-3325--docusaurus-2.netlify.app/classic/fr/

Here are working examples displaying some French translations from Crowdin:

Questions for Crowdin

Read-only access-token?

I am wondering if it is possible to create a project access token with read-only permission, so that it's possible to download the Crowdin translations (to run the translated site locally) without being able to upload sources.

Such a token could safely be made public so that all open-source contributors to the project can work with local translations if they want to.

Note: if we set up a Crowdin token in our CI env and enable deploy previews, users can obtain that env variable by crafting a malicious PR, so a read-only token is likely better for such use cases.

CI integration

It is not clear to me what the best way is to integrate with modern Jamstack CIs like Netlify and Vercel while ensuring a 100% fixed/reproducible env across builds. Many of our users will use these very convenient deployment platforms, on which you can't sudo or run Docker, but which have a few things preinstalled (including Java).

So far I have been able to run Crowdin CLI v3 by self-hosting the jar and running it:

curl https://hardcore-ride-8fbb5a.netlify.app/crowdin-cli.jar --output crowdin-cli.jar
java -jar crowdin-cli.jar download --config ./crowdin-v2.yaml

It works, but if you have a better alternative to share, that would be great to know. I tried adding the jar to git, but that led to a "corrupt jar" error when running it, which is weird...
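One way to make a self-hosted jar download more reproducible is to pin its checksum and refuse to run it on mismatch. This is only a sketch: verify_sha256 is a hypothetical helper, it assumes sha256sum is available on the build image (on macOS, shasum -a 256 would be the equivalent), and the digest in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if <file> has the expected SHA-256 digest.
# Usage: verify_sha256 <file> <expected_hex_digest>
verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256sum "$file" | cut -d' ' -f1)"
  [ "$actual" = "$expected" ]
}

# Usage sketch (placeholder digest, to be replaced by the pinned one):
#   curl https://hardcore-ride-8fbb5a.netlify.app/crowdin-cli.jar --output crowdin-cli.jar
#   verify_sha256 crowdin-cli.jar "<pinned digest>" || { echo "checksum mismatch" >&2; exit 1; }
#   java -jar crowdin-cli.jar download --config ./crowdin-v2.yaml
```

A mismatch then fails the build early instead of surfacing later as a "corrupt jar" error.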

Crowdin branches

It is not totally clear to me how to use Crowdin branches for Docusaurus.

What I see is that our v1 recommended integration does not use branches, according to the existing configs I was able to find:

Our old/legacy v1 doc also seems to use the master branch: https://docusaurus.io/docs/en/translation#setup-the-crowdin-scripts

Do you think we should leverage Crowdin branches for our use case and versioning patterns?

Deploy previews integration

It is not clear to me how we should integrate Crowdin with deploy previews (provided on each branch/PR by Netlify/Vercel).

I guess not all branches/PRs should lead to an upload of Crowdin sources, as all pending PRs would overwrite each other's sources.
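One possible approach on Netlify is to gate the source upload on the build context, so that deploy previews only download translations. This sketch relies on Netlify's CONTEXT build variable (production, deploy-preview or branch-deploy). The should_upload_sources function is a hypothetical helper, and other platforms would need their own equivalent check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only for production builds.
# Assumes Netlify's CONTEXT environment variable; unset means "do not upload".
should_upload_sources() {
  [ "${CONTEXT:-}" = "production" ]
}

# Usage sketch in a deploy script:
#   if should_upload_sources; then
#     java -jar crowdin-cli.jar upload sources --config ./crowdin-v2.yaml
#   fi
#   java -jar crowdin-cli.jar download --config ./crowdin-v2.yaml
```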

Git

Should we download the Crowdin translations in the repo and add them to git?

Are translations supposed to be integrated in Git as part of our release process, or should they just be pulled lazily by the prod deployment pipeline, and Crowdin would just trigger deployment builds with webhooks?

I personally like the idea of integrating translations in Git since, for me, markdown docs and i18n are part of a release. For example, bad MessageFormat/ICU patterns can lead to mobile app crashes. As a website we are less subject to crashes, and the deploy CI/CD would just fail, but to me there is always a security risk that a malicious translator does weird things such as adding bad HTML tags and front matter to translated docs...

I guess we don't necessarily have to enforce a specific Crowdin integration on our users, but we should probably recommend at least one full translation workflow that makes sense for our project by default.

Github contribution graph

If we integrate with Git, keep in mind that many open-source projects use Docusaurus. One concern we have seen discussed is that developers are incentivized to contribute to open source by their GitHub contribution graph, and are less willing to contribute translations if those do not show up in their GitHub history. This was one of the reasons the ReactJS website does not use Crowdin.

I was wondering whether Crowdin could author commit messages with the Co-authored-by trailer that GitHub understands?

https://docs.github.com/en/enterprise/2.13/user/articles/creating-a-commit-with-multiple-authors#creating-co-authored-commits-on-github-enterprise
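For reference, GitHub expects one "Co-authored-by: name <email>" line per co-author, placed after a blank line at the end of the commit message. A small sketch of what such a message could look like; commit_message is a hypothetical helper and the translator name and email are made up:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a commit message crediting translators as co-authors.
# Usage: commit_message <subject> <"Name <email>">...
commit_message() {
  local subject="$1" author
  shift
  printf '%s\n\n' "$subject"   # subject, then the blank line GitHub requires
  for author in "$@"; do
    printf 'Co-authored-by: %s\n' "$author"
  done
}

# Usage sketch:
#   git commit -m "$(commit_message "Update French translations" "Jane Doe <jane@example.com>")"
```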

Source of truth?

As far as I understand, it is both possible to upload and download translations.

If we have the default language/upstream docs translations on Crowdin, what is the source of truth for the default language? Git or Crowdin?

Where should users modify the default language? On Crowdin? On Git? Both at the same time? How are conflicts resolved?

Creating a version

As I explained, when we want to create v5, we run a command like:

cp -R website/docs website/versioned_docs/version-5.0.0

This means that we have a whole new folder of source markdown files.

I wonder how we can make sure these files do not end up 0% translated on Crowdin?

As a reminder, the existing config is:

    {
      'source': '/website/docs/**/*',
      'translation': '/website/i18n/%two_letters_code%/docs/current/**/%original_file_name%',
    },
    {
      'source': '/website/versioned_docs/**/*',
      'translation': '/website/i18n/%two_letters_code%/docs/**/%original_file_name%',
    },

I was thinking of updating the version creation process like this:

Download translations
Copy the source docs as v5: cp -R website/docs website/versioned_docs/version-5.0.0
Copy the translated docs as v5 (for each lang): cp -R website/i18n/<lang>/docs/current website/i18n/<lang>/docs/version-5.0.0
Upload sources
Upload translations
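The two copy steps above could be sketched as follows. The copy_version function is a hypothetical helper, not an existing Docusaurus command; the Crowdin download/upload steps are left as comments because they need a configured project and token.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the source docs and each locale's "current"
# translations into a new version folder.
# Usage: copy_version <website_dir> <version>
copy_version() {
  local website="$1" version="$2" locale_dir
  local target="$website/versioned_docs/version-$version"
  if [ -e "$target" ]; then
    echo "version $version already exists" >&2
    return 1
  fi
  mkdir -p "$website/versioned_docs"
  cp -R "$website/docs" "$target"
  for locale_dir in "$website"/i18n/*/; do
    # Skip locales (or an unmatched glob) without translated "current" docs.
    [ -d "${locale_dir}docs/current" ] || continue
    cp -R "${locale_dir}docs/current" "${locale_dir}docs/version-$version"
  done
}

# Full flow sketch:
#   crowdin download --config ./crowdin-v2.yaml
#   copy_version website 5.0.0
#   crowdin upload sources --config ./crowdin-v2.yaml
#   crowdin upload translations --config ./crowdin-v2.yaml
```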

This probably works, but it puts significantly more burden/risk on the user, and I'm not sure it could easily be automated (eventually we could create a Docusaurus CLI command that calls the Crowdin CLI).

Do you see any other option so that v5 translations do not end up being untranslated?

There's also another problem: if users start to translate v5, and there is a work in progress on v5.1 in the "upstream docs", how can we make sure v5 translations will also be available in v5.1?

Migration

As explained, we have to migrate from v1 to v2, and to avoid being too disruptive to users, a v1 site using Crowdin should also be able to use Crowdin in v2.

v2 markdown parser (MDX) is a little bit different from v1 parser, so in the migration process, some markdown files will likely be edited a bit (mostly html blocks).

Also, document paths (sources/translations) are likely to be different.

I don't know yet how we can migrate v1 translated sites, and have a few questions:

should we create a new Crowdin project for the site in v2 (seems like a good idea to avoid messing things up by mistake)
should we edit the md docs locally, and re-upload them with the crowdin upload translations cli command?
can we keep the translation "history" in this process, ie keep contributions granted to the existing v1 translators?
any advice?

Upload files with many extensions

I tried regexp/glob patterns like this without success:

  "source" : "/docs/**/*.(md|mdx)",

  "source" : "/docs/**/*.{md,mdx}",

Is there a syntax for all files with a list of extensions?

For now I'm just using 'source': '/website/docs/**/*', and using ignore when needed.

Fail-fast

There are many things that seem to lead to annoying warnings, and the translation not being downloaded. Basically, whenever I change the config translations path, if I don't re-upload with the new config, I get things like:

⚠️  Downloaded translations don't match the current project configuration. The translations for the following sources will be omitted (use --verbose to get the list of the omitted translations):

⚠️  Due to missing respective sources, the following translations will be omitted:

(--ignore-match is not what I'm looking for)

Wouldn't it be better to have a fail-fast option in the CLI? These warnings will end up unnoticed in the CI, and the site would silently be published untranslated.

Even such errors do not seem to lead to a failure or a non-zero exit code:

Upload
❌ File 'website/docs/styling-layout.md'
❌ null: Operation timed out (Read failed)

sept. 08, 2020 3:17:29 PM org.apache.http.impl.execchain.RetryExec execute
INFOS: I/O exception (java.net.SocketException) caught when processing request to {s}->https://api.crowdin.com:443: Operation timed out (Read failed)
sept. 08, 2020 3:17:29 PM org.apache.http.impl.execchain.RetryExec execute
INFOS: Retrying request to {s}->https://api.crowdin.com:443
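Until such an option exists, a crude workaround could be to scan the CLI output for the markers seen above and fail the build ourselves. This is a sketch only: fail_on_crowdin_problems is a hypothetical helper, and the matched strings ("will be omitted", "❌") are copied from the output above, not from a documented, stable interface.

```shell
#!/usr/bin/env bash
# Hypothetical helper: echo the Crowdin CLI output read from stdin, and fail
# if it contains one of the warning/error markers observed above.
fail_on_crowdin_problems() {
  local log
  log="$(cat)"
  printf '%s\n' "$log"
  if printf '%s\n' "$log" | grep -qF -e 'will be omitted' -e '❌'; then
    echo "Crowdin CLI reported warnings or errors, failing the build" >&2
    return 1
  fi
}

# Usage sketch:
#   java -jar crowdin-cli.jar download --config ./crowdin-v2.yaml 2>&1 | fail_on_crowdin_problems
```

Since the function is the last command of the pipeline, its exit status becomes the status of the CI step.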

Base path

It is not totally clear to me how 'base_path': '.' behaves.

Is . the folder of the yaml file, or is it the folder in which we run the cli? I expected it to be resolved against the config file but that does not seem to be the case.

crowdin download --config ./crowdin-v2.yaml
=> works

cd website && crowdin download --config ../crowdin-v2.yaml
=> fails
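If . is in fact resolved against the working directory (which the behaviour above suggests, but this is an unconfirmed assumption), a possible workaround is to always run the CLI from the folder containing the config file. The run_from_config_dir function is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command from the directory of the config file,
# appending "--config ./<file>" to it.
# Usage: run_from_config_dir <config_path> <command> [args...]
run_from_config_dir() {
  local config="$1" dir file
  shift
  dir="$(cd "$(dirname "$config")" && pwd)" || return 1
  file="$(basename "$config")"
  # The subshell keeps the caller's working directory untouched.
  ( cd "$dir" && "$@" --config "./$file" )
}

# Usage sketch:
#   cd website && run_from_config_dir ../crowdin-v2.yaml crowdin download
```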

Slow uploads?

In Docusaurus, users are able to co-locate assets (images, heavier zip files) in the same folder as the markdown files.

So I think the following makes sense to capture such assets: 'source': '/website/docs/**/*', (I hope Crowdin is able to infer the file type properly, because this can catch any kind of file the user puts there)

These assets can indeed be versioned, and uploaded/downloaded from Crowdin for localization. So an asset may end up being duplicated quite a bit on the site (× nVersions for source upload, × nVersions × nLocales for translation download).

I found uploads quite slow. Can we do something about it, such as skipping files whose local hash matches the remote file hash, instead of re-uploading everything every time? (S3 and CDN CLIs usually do that kind of optimization.)
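To illustrate the kind of check meant here, this sketch detects files that changed since the last upload by comparing against a locally stored hash manifest. It does not talk to Crowdin at all: comparing against remote hashes would need support on the Crowdin side. hash_manifest and changed_files are hypothetical helpers, and sha256sum is assumed to be available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one "hash  ./path" line per file under <dir>.
hash_manifest() {
  ( cd "$1" && find . -type f -exec sha256sum {} + )
}

# Hypothetical helper: print the paths under <dir> that are new or whose
# content differs from what <old_manifest> recorded.
# Usage: changed_files <dir> <old_manifest>
changed_files() {
  local dir="$1" old_manifest="$2"
  hash_manifest "$dir" \
    | { grep -vxFf "$old_manifest" || true; } \
    | sed 's/^[0-9a-f]*  //'
}

# Usage sketch (keep the manifest outside the docs folder):
#   changed_files website/docs .crowdin-manifest   # candidates for upload
#   hash_manifest website/docs > .crowdin-manifest # record after a successful upload
```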
