starrocks/docs/translation
Dan Roscigno dcfbdf588e
[Doc] Update Japanese translation file (#57829)
Signed-off-by: DanRoscigno <dan@roscigno.com>
2025-04-09 10:05:33 -07:00
..
en-to-ja-configs [Doc] Update Japanese translation file (#57829) 2025-04-09 10:05:33 -07:00
en-to-zh-configs [Doc] Translate from Chinese to English (#52259) 2024-10-29 20:40:22 +08:00
scripts [Doc] Japanese translation (#55725) 2025-02-26 08:37:49 +08:00
zh-to-en-configs [Doc] Translate from Chinese to English (#52259) 2024-10-29 20:40:22 +08:00
.env.sample [Doc] update translation process (#52481) 2024-10-31 07:28:51 +08:00
.gitignore [Doc] Translate from Chinese to English (#52259) 2024-10-29 20:40:22 +08:00
README.md [Doc] Japanese translation (#55725) 2025-02-26 08:37:49 +08:00
translation.Dockerfile [Doc] Translation container (#49935) 2024-08-18 21:07:48 -07:00

README.md

Translating docs with GPT

This README describes using GPT-4o to translate from Chinese to English, or from English to Chinese or Japanese. The system used is specific to Docusaurus Markdown and MDX. We are using code provided by Weights and Biases, as they also use Docusaurus and have expertise with LLMs.

1. Build the Docker image

See the README in the DanRoscigno/gpt-translate repo. Build the Docker image from there.

2. Set up the environment

There are three environment variables that need to be set in the file starrocks/docs/translation/.env:

Tip

Copy the file starrocks/docs/translation/.env.sample to starrocks/docs/translation/.env and edit with your API keys, or contact the Doc team leader for keys.

  • OPENAI_API_KEY
  • WANDB_API_KEY
  • GIT_PYTHON_REFRESH

GIT_PYTHON_REFRESH should be set to quiet because we are not interacting with Git within the container. The other two environment variables will be provided by the Documentation Team leader.

These should be set in the file in starrocks/docs/translation/.env

This is the format:

OPENAI_API_KEY=sk-proj-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
WANDB_API_KEY=bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
GIT_PYTHON_REFRESH=quiet

3. Identify the files to translate

Provide the paths of the files to translate in the file starrocks/docs/translation/files.txt

The entries in the file should be relative to the starrocks/docs/translation/ folder, for example:

Tip

The only difference in the files.txt content below is the name of the directory, zh or en. Add the path to the files that need to be translated.

From English to Chinese or Japanese

starrocks/docs/translation/files.txt

../en/quick_start/routine-load.md
../en/deployment/helm.md

From Chinese to English

starrocks/docs/translation/files.txt

../zh/quick_start/routine-load.md
../zh/deployment/helm.md

Tip:

Use find to get all of the Markdown files under a directory. For example, to translate alll of the include files under docs/en/_assets/:

find ../en/_assets -type f -name "*\.md*" > files.txt

Also:

Note:

Known issues:

Pydantic error about serialization: There is an error coming up after the translation is complete. I have not debugged this yet, I will see if the author of the tranlation package can help, he knows way more about Python than I do.

4. Translate the docs

Change dir back up to the starrocks folder so that you can mount the docs/ folder in the container.

cd ../../

Translate the files:

From English to Chinese

docker run -v ./docs:/docs \
  --env-file ./docs/translation/.env \
  translate \
  bash /docs/translation/scripts/en-to-zh.translate.sh

From English to Japanese

docker run -v ./docs:/docs \
  --env-file ./docs/translation/.env \
  translate \
  bash /docs/translation/scripts/en-to-ja.translate.sh

From Chinese to English

docker run -v ./docs:/docs \
  --env-file ./docs/translation/.env \
  translate \
  bash /docs/translation/scripts/zh-to-en.translate.sh

Sample output

You should see output similar to this if you are translating a single file:

Logged in as Weights & Biases user: danroscigno.
View Weave data at https://wandb.ai/danroscigno-starrocks/gpt-translate/weave
[14:56:11] INFO     config_folder: ./configs                           cli.py:70
                    debug: false
                    do_translate_header_description: true
                    input_file: ./files.txt
                    input_folder: ../en
                    language: zh
                    limit: null
                    max_openai_concurrent_calls: 7
                    max_tokens: 4096
                    model: gpt-4o
                    out_file: ' intro_ja.md'
                    out_folder: ../zh
                    remove_comments: true
                    replace: true
                    silence_openai: true
                    temperature: 0.2
                    weave_project: gpt-translate

           INFO     Reading ./files.txt                         translate.py:195
           INFO     Translating 1 file                          translate.py:202
Translating files:   0%|          | 0/1 [00:00<?, ?it/s][14:56:34] INFO     ✅ Translated file saved to                 translate.py:169
                    ../zh/developers/trace-tools/Trace.md
Translating files: 100%|██████████| 1/1 [00:22<00:00, 22.63s/it]
📦 Published to https://wandb.ai/danroscigno-starrocks/gpt-translate/weave/objects/Translation-zh/versions/UaI7t2Vtn8iI2Zcw7bQVHV1BMPUHYDY8cqBfcTW3QYQ
🍩 https://wandb.ai/danroscigno-starrocks/gpt-translate/r/call/0192dded-553f-7492-a82b-11539ad42bfa

Check the files

Once the translation is complete the container will exit. Check the status with git status and check the translated file(s).

Sample git status

My starrocks/docs/translation/files.txt contained a single path to the English Trace.md file, so a translated file was produced at docs/zh/developers/trace-tools/Trace.md:

$ git status

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   docs/zh/developers/trace-tools/Trace.md