MobTranslate: A Technical Tutorial on Preserving Aboriginal Languages
In today’s digital era, preserving endangered languages is both a cultural imperative and a technical challenge. MobTranslate is an open-source project that builds digital dictionaries for Aboriginal languages—integrating curated linguistic data with AI-powered translations. In this post, we’ll explore the technical details behind MobTranslate, including its architecture, API design, integration with OpenAI’s models, and the format of our dictionary data. For the full source code, please visit the GitHub repository.
1. The Importance of Preserving Aboriginal Languages
Aboriginal languages carry thousands of years of history, tradition, and cultural wisdom. Digitizing these languages does more than make them accessible—it lays the foundation for revitalization and education. By converting these linguistic treasures into digital dictionaries, MobTranslate provides:
- Accessibility: Language resources available across devices and networks.
- Contextual Depth: Rich metadata including definitions, usage examples, and cultural context.
- Future-Proofing: A permanent record to support language revitalization initiatives.
According to UNESCO, approximately 40% of the world’s languages are in danger of disappearing. Digital preservation projects like MobTranslate play a critical role in language documentation efforts worldwide.
2. Project Overview and Repository Structure
MobTranslate is built with modern technologies to ensure scalability and maintainability:
- Next.js 14: Utilized for its robust server-side rendering (SSR) capabilities.
- TypeScript: Enhances code quality and maintainability.
- Turborepo with PNPM Workspaces: Organizes the project into a monorepo for parallel builds and efficient dependency management.
Repository Layout
mobtranslate.com/
├── apps/
│ └── web/ # Main Next.js application
│ ├── app/ # Next.js App Router (dictionary pages & API endpoints)
│ └── public/ # Static assets (images, fonts, etc.)
├── ui/ # Shared UI components and utilities
│ ├── components/ # Reusable UI elements (cards, inputs, etc.)
│ └── lib/ # UI helper functions
├── dictionaries/ # Dictionary data files and models (formatted in YAML)
├── package.json # Project configuration and scripts
├── pnpm-workspace.yaml # Workspace definitions for PNPM
└── turbo.json # Turborepo configuration
This structure cleanly separates the core web application from shared UI components and dictionary data, making the project easier to manage and extend. It follows modern monorepo best practices for maintaining complex JavaScript applications.
3. Public Dictionary Browsing Structure
MobTranslate uses Next.js to create a comprehensive browsing experience for Aboriginal language dictionaries. The site architecture offers several benefits:
- Faster Load Times: Immediate content delivery, especially on mobile devices and slow networks, improving Core Web Vitals metrics.
- Improved Accessibility: Users see content even before client-side JavaScript has fully loaded, adhering to WCAG guidelines.
- Comprehensive Dictionary Structure: All dictionaries can be browsed directly at mobtranslate.com, with dedicated pages for each language and individual word. We hope search engines will index these pages and that new LLMs will train on them, improving how these Aboriginal languages are represented.
The implementation leverages Next.js App Router architecture, which provides enhanced routing capabilities and more granular control over the browsing experience.
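To make the "one page per language and per word" idea concrete, here is a minimal sketch of how the route parameters for word pages might be enumerated. The `DictionarySummary` shape, the function names, and the `/dictionaries/.../words/...` path are assumptions made for illustration, not the project's actual routes; in the App Router, a list shaped like this is what a `generateStaticParams` function would typically return.

```typescript
// Hypothetical shapes: neither type is taken from the MobTranslate codebase.
interface DictionarySummary {
  language: string; // e.g. "kuku_yalanji"
  words: string[];
}

interface WordPageParams {
  language: string;
  word: string;
}

// Produce one parameter object per (language, word) pair.
function listWordPageParams(dictionaries: DictionarySummary[]): WordPageParams[] {
  const params: WordPageParams[] = [];
  for (const dictionary of dictionaries) {
    for (const word of dictionary.words) {
      params.push({ language: dictionary.language, word });
    }
  }
  return params;
}

// Build a URL path for a word page (the path layout is an assumption).
function wordPagePath(params: WordPageParams): string {
  const language = encodeURIComponent(params.language);
  const word = encodeURIComponent(params.word);
  return `/dictionaries/${language}/words/${word}`;
}
```

Encoding each segment keeps words containing spaces or punctuation from producing broken URLs.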
4. RESTful API for Dictionary Data
The project exposes a comprehensive RESTful API to serve dictionary data and support translation services. Key endpoints include:
Dictionary Endpoints
- GET /api/dictionaries
  Retrieves a list of available dictionaries with metadata (name, description, region).
- GET /api/dictionaries/[language]
  Returns detailed data for a specific language, including a paginated list of words. Query parameters allow:
  - Filtering: Search for words.
  - Sorting: Specify sort fields and order.
  - Pagination: Navigate through large datasets using methods aligned with JSON:API specifications.
- GET /api/dictionaries/[language]/words
  Provides a paginated list of words in the selected dictionary.
- GET /api/dictionaries/[language]/words/[word]
  Offers detailed information on a specific word, such as definitions, usage examples, and related terms.
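As an illustration of the filtering, sorting, and pagination options, the sketch below builds a request URL for the word-list endpoint. The query parameter names (`search`, `sortBy`, `sortOrder`, `page`, `limit`) are assumptions chosen for the example; check the repository for the names the API actually accepts.

```typescript
// Hypothetical query options: the parameter names are assumptions,
// not the documented MobTranslate API.
interface WordListQuery {
  search?: string;
  sortBy?: string;
  sortOrder?: "asc" | "desc";
  page?: number;
  limit?: number;
}

// Build the URL for GET /api/dictionaries/[language]/words.
function buildWordsUrl(language: string, query: WordListQuery = {}): string {
  const params = new URLSearchParams();
  if (query.search !== undefined) params.set("search", query.search);
  if (query.sortBy !== undefined) params.set("sortBy", query.sortBy);
  if (query.sortOrder !== undefined) params.set("sortOrder", query.sortOrder);
  if (query.page !== undefined) params.set("page", String(query.page));
  if (query.limit !== undefined) params.set("limit", String(query.limit));

  const path = `/api/dictionaries/${encodeURIComponent(language)}/words`;
  const queryString = params.toString();
  return queryString ? `${path}?${queryString}` : path;
}
```

Calling `buildWordsUrl("kuku_yalanji", { search: "ask", page: 2, limit: 20 })` yields a path that can be passed straight to `fetch`.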
Translation Endpoint
- POST /api/translate/[language]
  Accepts text input and returns a translation in the target Aboriginal language. It supports both streaming and non-streaming responses; a streamed response can be read incrementally on the client with the Streams API.
Example: Streaming Translation Request
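// Request a streamed translation from the API.
const response = await fetch("/api/translate/kuku_yalanji", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    text: "Hello, how are you today?",
    stream: true,
  }),
});
// Fail early on an HTTP error or a missing response body.
if (!response.ok || !response.body) {
  throw new Error(`Translation request failed with status ${response.status}`);
}
const reader = response.body.getReader();
const decoder = new TextDecoder();
let translation = "";
// Read chunks as they arrive and decode them incrementally.
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  translation += decoder.decode(value, { stream: true });
}
// Flush any bytes still buffered in the decoder.
translation += decoder.decode();
console.log("Final Translation:", translation);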
This endpoint is implemented as a Next.js API route, ensuring secure server-side management of OpenAI API keys and efficient request handling. For more on API security best practices, see the OWASP API Security Project.
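On the server side, a streamed reply can be produced by returning a `Response` whose body is a `ReadableStream`. The helper below is a minimal sketch of that pattern using only web-standard APIs; `toStreamingResponse` is a hypothetical name, and the fixed array of chunks stands in for the pieces of text that would arrive from the model in a real handler.

```typescript
// Hypothetical helper: wraps text chunks in a streaming HTTP response.
// In a real route handler the chunks would come from the model's output,
// not from a fixed array.
function toStreamingResponse(chunks: string[]): Response {
  const encoder = new TextEncoder();
  let index = 0;

  const body = new ReadableStream<Uint8Array>({
    // pull() is called whenever the consumer is ready for more data.
    pull(controller) {
      if (index < chunks.length) {
        controller.enqueue(encoder.encode(chunks[index]));
        index += 1;
      } else {
        controller.close();
      }
    },
  });

  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

A response built this way can be consumed by the client loop shown earlier, because each enqueued chunk surfaces as one `reader.read()` result.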
5. Integrating Dictionary Data with OpenAI
A standout feature of MobTranslate is its ability to generate context-aware translations by integrating dictionary data into the translation process.
How It Works
Fetching Dictionary Context: When a translation request is received, the system retrieves relevant dictionary entries (definitions, usage examples, etc.) from the API.
Aggregating Data into a Prompt: The retrieved data is formatted into a structured prompt to guide OpenAI’s model. For example:
Using the following dictionary context:
Word: "babaji" — Definition: "ask. 'Ngayu nyungundu babajin, Wanju nyulu?' means 'I asked him, Who is he?'"
Translate the sentence: "Hello, how are you today?"
This helps steer the model to produce culturally sensitive and accurate translations using techniques from prompt engineering research.
Server-Side Translation Processing: The aggregated prompt is sent to OpenAI’s API, and the response is streamed back in real time, providing an interactive translation experience.
Token Management: All prompt and response token usage is logged and managed server-side, ensuring efficient resource utilization and cost monitoring in line with OpenAI’s usage guidelines.
For more on prompt engineering, see OpenAI’s documentation.
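The context-fetching and prompt-aggregation steps described above can be sketched as two small functions. This is an illustration under stated assumptions, not the project's implementation: the relevance rule (match input words against each entry's English translations) is a deliberately naive stand-in, the function names are hypothetical, and the prompt layout only loosely follows the example shown earlier.

```typescript
// Subset of a dictionary entry needed for prompting (hypothetical type).
interface DictionaryEntry {
  word: string;
  definitions: string[];
  translations: string[];
}

// Naive relevance rule (an assumption): keep entries whose English
// translations appear as whole words in the input text.
function selectRelevantEntries(text: string, entries: DictionaryEntry[]): DictionaryEntry[] {
  const tokens = text
    .toLowerCase()
    .split(/[^a-z']+/)
    .filter((token) => token.length > 0);
  return entries.filter((entry) =>
    entry.translations.some((t) => tokens.indexOf(t.toLowerCase()) !== -1)
  );
}

// Format the selected entries and the sentence into a single prompt.
function buildPrompt(text: string, entries: DictionaryEntry[]): string {
  const lines: string[] = [];
  if (entries.length > 0) {
    lines.push("Using the following dictionary context:");
    for (const entry of entries) {
      lines.push(`Word: "${entry.word}"; Definition: "${entry.definitions.join(" ")}"`);
    }
  }
  lines.push(`Translate the sentence: "${text}"`);
  return lines.join("\n");
}
```

Keeping selection separate from formatting means the relevance rule can later be replaced (for example, with stemming or fuzzy matching) without touching the prompt layout.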
6. Dictionary Format and Supported Languages
MobTranslate uses YAML files to store dictionary data. Each dictionary is maintained in its own folder within the dictionaries/ directory. For instance, the Kuku Yalanji dictionary is defined in the dictionaries/kuku_yalanji/dictionary.yaml file.
Example YAML Structure
The YAML file for Kuku Yalanji is structured as follows:
meta: Contains metadata about the dictionary, such as the language name.
meta:
  name: Kuku Yalanji
words: A list of word entries. Each entry includes:
- word: The term in the language.
- type: The part of speech (e.g., noun, transitive-verb, intransitive-verb, adjective).
- definitions: A list of definitions, sometimes accompanied by example sentences.
- translations: A list of translations or English equivalents.
- Optional: synonyms may also be provided.
words:
  - word: ba
    type: intransitive-verb
    definitions:
      - come. Baby talk, usually used with very small children only. Used only as a command.
    translations:
      - come
  - word: babaji
    type: transitive-verb
    definitions:
      - ask. "Ngayu nyungundu babajin, Wanju nyulu?" "I asked him, Who is he?"
    translations:
      - ask
      - asked
This structure is inspired by lexicographical best practices from projects like Lexonomy and the Open Dictionary Format.
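Once a dictionary file has been parsed into a plain object, the structure described above maps naturally onto TypeScript types. The following sketch shows one way to model and validate it; the type and function names are hypothetical, and the YAML parsing step itself is left out because it would depend on a third-party library.

```typescript
// Types mirroring the YAML structure (names are illustrative).
interface WordEntry {
  word: string;
  type: string; // part of speech, e.g. "noun" or "transitive-verb"
  definitions: string[];
  translations: string[];
  synonyms?: string[];
}

interface DictionaryFile {
  meta: { name: string };
  words: WordEntry[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Runtime check that an unknown value has the shape of a word entry.
function isWordEntry(value: unknown): value is WordEntry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.word === "string" &&
    typeof v.type === "string" &&
    isStringArray(v.definitions) &&
    isStringArray(v.translations) &&
    (v.synonyms === undefined || isStringArray(v.synonyms))
  );
}

// Runtime check for a whole parsed dictionary file.
function isDictionaryFile(value: unknown): value is DictionaryFile {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const meta = v.meta as Record<string, unknown> | null | undefined;
  return (
    typeof meta === "object" &&
    meta !== null &&
    typeof meta.name === "string" &&
    Array.isArray(v.words) &&
    v.words.every(isWordEntry)
  );
}
```

Validating contributed files this way catches a missing `translations` list or a mistyped field before the data reaches the site or the translation prompt.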
Supported Languages
So far, the repository includes dictionaries for:
- Kuku Yalanji (as detailed above)
- Mi’gmaq
- Anindilyakwa
Each language’s dictionary follows a similar YAML structure, ensuring consistency across the project while respecting the unique linguistic features of each language.
7. Development Workflow
Prerequisites
Ensure you have Node.js, pnpm, and Git installed.
Getting Started
Clone the Repository:
git clone https://github.com/australia/mobtranslate.com.git
cd mobtranslate.com
Install Dependencies:
pnpm install
Start the Development Server:
pnpm dev
Build the Project for Production:
pnpm build
This workflow leverages Turborepo for parallel builds and efficient dependency management, streamlining development across all workspaces. For more on modern JavaScript build workflows, see the Web Performance Working Group resources.
8. Contributing to the Project
MobTranslate welcomes contributions from developers, linguists, and community members. Here are ways to get involved:
- Code Contributions: Submit pull requests for bug fixes or new features.
- Language Contributions: Help expand our dictionary coverage by contributing YAML files for additional Aboriginal languages.
- Documentation: Improve our documentation or write tutorials.
- Community Support: Join our discussions to help answer questions.
For contribution guidelines, please refer to our CONTRIBUTING.md file.
9. Future Roadmap
The MobTranslate project has several exciting developments planned:
- Audio Integration: Adding native speaker recordings for pronunciation guidance.
- Mobile Applications: Developing offline-capable apps for use in remote areas.
- Expanded Language Coverage: Adding support for more Aboriginal languages.
- Enhanced Learning Tools: Building interactive exercises for language learning.
- Community Editing: Enabling community-driven dictionary updates with approval workflows.
These initiatives align with global efforts in computational linguistics such as the ELDP (Endangered Languages Documentation Programme).
10. Conclusion
MobTranslate exemplifies how modern web technologies and AI can be combined to support the preservation of endangered languages. By merging curated dictionary data (stored in a consistent YAML format) with OpenAI’s translation capabilities, MobTranslate delivers context-aware translations that honor the cultural richness of Aboriginal languages.
If you’re interested in contributing or exploring the code further, please visit our GitHub repository. Together, we can ensure these languages continue to thrive in the digital age.
For more information on Aboriginal language preservation efforts, please visit:
Happy coding!