Lord Ajax
i write software and shitty poetry

MobTranslate: A Technical Tutorial on Preserving Aboriginal Languages

text: AI code: AI

Preserving endangered languages is both a cultural imperative and a technical challenge. MobTranslate is an open-source project that builds digital dictionaries for Aboriginal languages, combining curated linguistic data with AI-powered translation. This post covers the technical details behind MobTranslate: its architecture, API design, integration with OpenAI’s models, and dictionary data format. For the full source code, please visit the GitHub repository.


1. The Importance of Preserving Aboriginal Languages

Aboriginal languages carry thousands of years of history, tradition, and cultural wisdom. Digitizing them does more than make them accessible: it lays a foundation for revitalization and education, which MobTranslate aims to support by publishing these languages as digital dictionaries.

According to UNESCO, approximately 40% of the world’s languages are in danger of disappearing. Digital preservation projects like MobTranslate play a critical role in language documentation efforts worldwide.


2. Project Overview and Repository Structure

MobTranslate is organized as a monorepo: a Next.js application (using the App Router), PNPM workspaces, and Turborepo, with dictionary data stored as YAML.

Repository Layout

mobtranslate.com/
    ├── apps/
    │   └── web/                # Main Next.js application
    │       ├── app/            # Next.js App Router (dictionary pages & API endpoints)
    │       └── public/         # Static assets (images, fonts, etc.)
    ├── ui/                     # Shared UI components and utilities
    │   ├── components/         # Reusable UI elements (cards, inputs, etc.)
    │   └── lib/                # UI helper functions
    ├── dictionaries/           # Dictionary data files and models (formatted in YAML)
    ├── package.json            # Project configuration and scripts
    ├── pnpm-workspace.yaml     # Workspace definitions for PNPM
    └── turbo.json              # Turborepo configuration
    

This structure cleanly separates the core web application from shared UI components and dictionary data, making the project easier to manage and extend. It follows modern monorepo best practices for maintaining complex JavaScript applications.


3. Public Dictionary Browsing Structure

MobTranslate uses Next.js to provide a browsing experience for Aboriginal language dictionaries.

The implementation uses the Next.js App Router, whose file-system-based routing, dynamic route segments, and nested layouts give fine-grained control over how dictionary pages are organized.
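
To illustrate the kind of routing the App Router makes possible, here is a minimal sketch of helpers that build page paths for a language and a word. The /dictionaries/... route shape and the helper names dictionaryPath and wordPath are assumptions made for this example; they are not taken from the MobTranslate repository.

```javascript
// Hypothetical route shape, assumed for illustration only:
//   /dictionaries/[language]               -> word list for one language
//   /dictionaries/[language]/words/[word]  -> a single dictionary entry

// Build the path for a language's dictionary page.
function dictionaryPath(language) {
  return `/dictionaries/${encodeURIComponent(language)}`;
}

// Build the path for a single word. encodeURIComponent keeps words that
// contain spaces or non-ASCII characters safe inside a URL segment.
function wordPath(language, word) {
  return `${dictionaryPath(language)}/words/${encodeURIComponent(word)}`;
}

console.log(wordPath("kuku_yalanji", "babaji"));
// -> /dictionaries/kuku_yalanji/words/babaji
```

Encoding each segment separately matters for dictionary headwords, which may contain characters that are not URL-safe.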


4. RESTful API for Dictionary Data

The project exposes a comprehensive RESTful API to serve dictionary data and support translation services. Key endpoints include:

Dictionary Endpoints

Translation Endpoint

Example: Streaming Translation Request

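    // Request a streamed translation into Kuku Yalanji.
    const response = await fetch("/api/translate/kuku_yalanji", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        text: "Hello, how are you today?",
        stream: true,
      }),
    });

    // Fail early on HTTP errors instead of reading an error body as a translation.
    if (!response.ok || !response.body) {
      throw new Error(`Translation request failed: ${response.status}`);
    }

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let translation = "";

    // Append each chunk as it arrives. { stream: true } handles multi-byte
    // characters that are split across chunk boundaries.
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      translation += decoder.decode(value, { stream: true });
    }
    translation += decoder.decode(); // flush any bytes still buffered

    console.log("Final Translation:", translation);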

This endpoint is implemented as a Next.js API route, so the OpenAI API key stays on the server and is never sent to the browser. For more on API security best practices, see the OWASP API Security Project.
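
The server side of this exchange might look roughly like the following. This is a sketch under assumptions, not the project's actual handler: handleTranslate and fakeTranslate are hypothetical names, and the model call is replaced by a stand-in generator that simply echoes the input word by word, so the example runs without an API key. It relies only on the web-standard Request, Response, and ReadableStream objects, which Node.js 18+ provides as globals.

```javascript
// Hypothetical stand-in for the model call: yields the input back in
// word-sized chunks. A real handler would yield model output instead.
async function* fakeTranslate(text) {
  for (const word of text.split(" ")) {
    yield word + " ";
  }
}

// Sketch of a streaming handler: parse the JSON body, validate it, and
// stream chunks back to the client as they are produced.
async function handleTranslate(request, translate = fakeTranslate) {
  const { text } = await request.json();
  if (typeof text !== "string" || text.trim() === "") {
    return new Response(JSON.stringify({ error: "text is required" }), {
      status: 400,
      headers: { "Content-Type": "application/json" },
    });
  }

  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of translate(text)) {
          controller.enqueue(encoder.encode(chunk));
        }
        controller.close();
      } catch (err) {
        controller.error(err); // surface upstream failures to the reader
      }
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}

// Usage: call the handler directly with a web-standard Request.
const demoRequest = new Request("http://localhost/api/translate/kuku_yalanji", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "Hello, how are you today?", stream: true }),
});
handleTranslate(demoRequest)
  .then((res) => res.text())
  .then((body) => console.log("Streamed body:", body));
```

In a Next.js App Router project, a function like this would typically be exported as POST from a route.js file, which keeps the API key and the model call on the server.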

5. Integrating Dictionary Data with OpenAI

A standout feature of MobTranslate is its ability to generate context-aware translations by integrating dictionary data into the translation process.

How It Works

Fetching Dictionary Context: When a translation request is received, the system retrieves relevant dictionary entries (definitions, usage examples, etc.) from the API.

Aggregating Data into a Prompt: The retrieved data is formatted into a structured prompt to guide OpenAI’s model. For example:

    Using the following dictionary context:
    Word: "babaji" — Definition: "ask. 'Ngayu nyungundu babajin, Wanju nyulu?' means 'I asked him, Who is he?'"
    Translate the sentence: "Hello, how are you today?"
    

Supplying dictionary context in the prompt helps steer the model toward more accurate and culturally sensitive translations.

Server-Side Translation Processing: The aggregated prompt is sent to OpenAI’s API, and the response is streamed back in real time, providing an interactive translation experience.

Token Management: All prompt and response token usage is logged and managed server-side, ensuring efficient resource utilization and cost monitoring in line with OpenAI’s usage guidelines.
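
The first two steps above (fetching dictionary context and aggregating it into a prompt) can be sketched as follows. The function names findRelevantEntries and buildPrompt are hypothetical, and the whole-word matching against English translations is a deliberately naive assumption; the project's actual retrieval logic may differ.

```javascript
// Sample entries using the same fields as the dictionary data in this post.
const entries = [
  {
    word: "ba",
    type: "intransitive-verb",
    definitions: ["come. Baby talk, usually used with very small children only."],
    translations: ["come"],
  },
  {
    word: "babaji",
    type: "transitive-verb",
    definitions: ["ask. 'Ngayu nyungundu babajin, Wanju nyulu?' means 'I asked him, Who is he?'"],
    translations: ["ask", "asked"],
  },
];

// Step 1 (assumed strategy): keep entries whose English translations
// appear as whole words in the input sentence.
function findRelevantEntries(allEntries, sentence) {
  const tokens = new Set(sentence.toLowerCase().match(/[a-z']+/g) ?? []);
  return allEntries.filter((entry) =>
    entry.translations.some((t) => tokens.has(t.toLowerCase()))
  );
}

// Step 2: format the selected entries into a structured prompt.
function buildPrompt(relevant, sentence) {
  const context =
    relevant.length > 0
      ? relevant
          .map((e) => `Word: "${e.word}"; Definition: "${e.definitions.join(" ")}"`)
          .join("\n")
      : "(no matching dictionary entries)";
  return [
    "Using the following dictionary context:",
    context,
    `Translate the sentence: "${sentence}"`,
  ].join("\n");
}

const sentence = "I asked him to come";
console.log(buildPrompt(findRelevantEntries(entries, sentence), sentence));
```

Keeping retrieval and prompt formatting in separate functions makes it easier to swap in a better matching strategy later without touching the prompt layout.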

For more on prompt engineering, see OpenAI’s documentation.

6. Dictionary Format and Supported Languages

MobTranslate uses YAML files to store dictionary data. Each dictionary is maintained in its own folder within the dictionaries/ directory. For instance, the Kuku Yalanji dictionary is defined in the dictionaries/kuku_yalanji/dictionary.yaml file.

Example YAML Structure

The YAML file for Kuku Yalanji is structured as follows:

meta: Contains metadata about the dictionary, such as the language name.

    meta:
      name: Kuku Yalanji
    

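words: A list of word entries. Each entry includes the word itself (word), its part of speech (type), a list of definitions with usage notes or examples (definitions), and a list of English translations (translations):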

    words:
      - word: ba
        type: intransitive-verb
        definitions:
          - come. Baby talk, usually used with very small children only. Used only as a command.
        translations:
          - come
      - word: babaji
        type: transitive-verb
        definitions:
          - ask. "Ngayu nyungundu babajin, Wanju nyulu?" "I asked him, Who is he?"
        translations:
          - ask
          - asked
    

This structure is inspired by lexicographical best practices from projects like Lexonomy and the Open Dictionary Format.
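
Because every dictionary shares this shape, a small validation step can catch malformed entries before they reach the site. The sketch below checks an already-parsed dictionary object; parsing the YAML file itself would need a YAML library and is omitted. The function name validateDictionary and the rule that all four entry fields are required are assumptions for illustration, not rules confirmed by the repository.

```javascript
// A dictionary object shaped like the YAML above, as it might look after parsing.
const dictionary = {
  meta: { name: "Kuku Yalanji" },
  words: [
    {
      word: "ba",
      type: "intransitive-verb",
      definitions: ["come. Baby talk, usually used with very small children only. Used only as a command."],
      translations: ["come"],
    },
    {
      word: "babaji",
      type: "transitive-verb",
      definitions: ["ask. \"Ngayu nyungundu babajin, Wanju nyulu?\" \"I asked him, Who is he?\""],
      translations: ["ask", "asked"],
    },
  ],
};

// Return a list of human-readable problems; an empty list means the
// dictionary passed every check.
function validateDictionary(dict) {
  const problems = [];
  if (typeof dict?.meta?.name !== "string" || dict.meta.name === "") {
    problems.push("meta.name must be a non-empty string");
  }
  if (!Array.isArray(dict?.words)) {
    problems.push("words must be a list");
    return problems;
  }
  dict.words.forEach((entry, i) => {
    if (typeof entry?.word !== "string" || entry.word === "") {
      problems.push(`words[${i}].word must be a non-empty string`);
    }
    if (typeof entry?.type !== "string" || entry.type === "") {
      problems.push(`words[${i}].type must be a non-empty string`);
    }
    for (const field of ["definitions", "translations"]) {
      const value = entry?.[field];
      const ok = Array.isArray(value) && value.every((v) => typeof v === "string");
      if (!ok) problems.push(`words[${i}].${field} must be a list of strings`);
    }
  });
  return problems;
}

console.log(validateDictionary(dictionary)); // -> []
```

Returning a list of problems rather than throwing on the first one lets a contributor fix every issue in a dictionary file in a single pass.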

Supported Languages

So far, the repository includes dictionaries for:

Each language’s dictionary follows a similar YAML structure, ensuring consistency across the project while respecting the unique linguistic features of each language.

7. Development Workflow

Prerequisites

Ensure you have Git, Node.js, and PNPM installed; the commands below rely on all three.

Getting Started

Clone the Repository:

    git clone https://github.com/australia/mobtranslate.com.git
    cd mobtranslate.com
    

Install Dependencies:

    pnpm install
    

Start the Development Server:

    pnpm dev
    

Build the Project for Production:

    pnpm build
    

This workflow uses PNPM workspaces for dependency management and Turborepo to run and cache build tasks in parallel across workspaces. For more on modern JavaScript build workflows, see the Web Performance Working Group resources.

8. Contributing to the Project

MobTranslate welcomes contributions from developers, linguists, and community members. Here are ways to get involved:

For contribution guidelines, please refer to our CONTRIBUTING.md file.

9. Future Roadmap

The MobTranslate project has several exciting developments planned:

These initiatives align with global efforts in computational linguistics such as the ELDP (Endangered Languages Documentation Programme).

10. Conclusion

MobTranslate exemplifies how modern web technologies and AI can be combined to support the preservation of endangered languages. By merging curated dictionary data (stored in a consistent YAML format) with OpenAI’s translation capabilities, MobTranslate delivers context-aware translations that honor the cultural richness of Aboriginal languages.

If you’re interested in contributing or exploring the code further, please visit our GitHub repository. Together, we can ensure these languages continue to thrive in the digital age.

For more information on Aboriginal language preservation efforts, please visit:

Happy coding!