MotteRL enables you to fine-tune language models using reinforcement learning with custom reward functions.
MotteRL simplifies the process of training AI models using reinforcement learning from human feedback (RLHF) and custom reward functions. Instead of requiring complex ML infrastructure, MotteRL leverages cloud APIs to make model training accessible to developers without deep ML expertise.
Define your own reward logic in Python to train models for specific tasks and behaviors.
No local GPU required - training runs on cloud infrastructure with automatic scaling.
Train various models including Qwen, Phi, Llama, and custom architectures.
Simple data format for training examples with expected outputs and metadata.
Create a JSONL file with your training examples. Each line should contain a JSON object with the prompt and expected result:
{"prompt": "What is the capital of France?", "expected_result": "Paris"} {"prompt": "Explain photosynthesis briefly", "expected_result": "Plants convert sunlight into energy"} {"prompt": "Write a haiku about coding", "expected_result": "Code flows like water\nBugs hide in silent shadows\nDebug brings the light"}
Create a Python function that evaluates model responses and returns a score (higher is better):
    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        expected = kwargs.get('expected_result', '')
        # Exact match gets full reward
        if expected.lower() in response.lower():
            return 1.0
        # Partial match gets partial reward
        words_match = len(set(expected.lower().split()) & set(response.lower().split()))
        total_words = len(expected.split())
        return words_match / total_words if total_words > 0 else 0.0
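Before starting a training run, it is worth calling the reward function locally against a hand-written completion to confirm it returns sensible scores. A minimal check, assuming completions arrive as a list of message dicts; only the 'content' key is read, and the 'role' field here is illustrative.

    # Local sanity check for the reward function above.
    sample_completion = [{"role": "assistant", "content": "The capital of France is Paris."}]

    score = reward_fn(sample_completion, expected_result="Paris")
    print(score)  # 1.0, because the expected answer appears verbatim in the response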
Enable multi-step reinforcement for agents that need to perform multiple actions or navigate complex decision trees.
Use cases include conversational agents, multi-turn problem solving, sequential decision making, and game-playing agents.
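The exact completion format passed to a multi-step reward function is not shown here, so the sketch below is illustrative only: it assumes the completion arrives as a list of per-turn message dicts and scores the final turn while applying a small penalty for extra steps.

    def reward_fn(completion, **kwargs):
        # Illustrative multi-step reward. Assumes `completion` is a list of
        # per-turn message dicts; verify the actual format your job receives.
        expected = kwargs.get('expected_result', '')

        final_response = completion[-1].get('content', '') if completion else ''
        solved = 1.0 if expected.lower() in final_response.lower() else 0.0

        # Small penalty per extra turn, so shorter successful trajectories score higher.
        step_penalty = 0.05 * max(0, len(completion) - 1)
        return max(0.0, solved - step_penalty)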
    # Exact match: full reward only when the expected answer appears verbatim.
    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        expected = kwargs.get('expected_result', '')
        return 1.0 if expected in response else 0.0
    # Length control: reward falls off linearly as the response length drifts from the target.
    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        target_length = kwargs.get('target_length', 100)
        length_diff = abs(len(response) - target_length)
        return max(0, 1 - length_diff / target_length)
    # Sentiment: crude keyword-based check that favors the requested sentiment.
    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        target_sentiment = kwargs.get('sentiment', 'positive')
        # Simple sentiment analysis
        positive_words = ['good', 'great', 'excellent', 'amazing']
        negative_words = ['bad', 'terrible', 'awful', 'horrible']
        pos_count = sum(1 for word in positive_words if word in response.lower())
        neg_count = sum(1 for word in negative_words if word in response.lower())
        if target_sentiment == 'positive':
            return pos_count / (pos_count + neg_count + 1)
        else:
            return neg_count / (pos_count + neg_count + 1)
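Reward signals can also be blended. The sketch below combines the exact-match and length checks from the examples above into one weighted score; the 0.7/0.3 weights are arbitrary and worth tuning for your task.

    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        expected = kwargs.get('expected_result', '')
        target_length = kwargs.get('target_length', 100)

        # Correctness: full credit when the expected answer appears in the response.
        correctness = 1.0 if expected.lower() in response.lower() else 0.0

        # Length: decays linearly as the response drifts from the target length.
        length_score = max(0, 1 - abs(len(response) - target_length) / target_length)

        # Weighted blend, still bounded to [0, 1]; the weights are illustrative.
        return 0.7 * correctness + 0.3 * length_score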
Begin with 50-100 training examples and simple reward functions before scaling up.
Ensure your reward function returns values between 0 and 1 for consistent training; a small clamping helper is sketched after these tips.
Monitor training progress and adjust the learning rate if the model isn't improving.
Always test your trained model on held-out data to ensure it generalizes well.
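If a raw heuristic can fall outside the 0-1 range, clamp (or rescale) it before returning. A minimal sketch; the clamp helper and the 'keywords' metadata field are hypothetical, not part of MotteRL:

    def clamp(value, low=0.0, high=1.0):
        # Keep the reward inside the range the trainer expects.
        return max(low, min(high, value))

    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        # 'keywords' is a hypothetical metadata field; the raw count can exceed 1.0.
        keywords = kwargs.get('keywords', [])
        raw_score = sum(1 for kw in keywords if kw.lower() in response.lower()) / 3.0
        return clamp(raw_score)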
    POST /api/training/start

    {
      "model": "qwen-2.5-7b",
      "learningRate": 0.01,
      "iterations": "100",
      "rewardFunction": "def reward_fn(completion, **kwargs): ..."
    }
Starts a new training job with the specified parameters.
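For reference, a training job could be started from Python with a plain HTTP request. This is a sketch only: the base URL and Authorization header are assumptions, and the body mirrors the parameters shown above.

    import textwrap
    import requests

    # Assumed base URL and API-key header; substitute your actual deployment details.
    BASE_URL = "https://api.example.com"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}

    reward_source = """
    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        expected = kwargs.get('expected_result', '')
        return 1.0 if expected in response else 0.0
    """

    payload = {
        "model": "qwen-2.5-7b",
        "learningRate": 0.01,
        "iterations": "100",
        "rewardFunction": textwrap.dedent(reward_source).strip(),
    }

    resp = requests.post(f"{BASE_URL}/api/training/start", json=payload, headers=headers)
    resp.raise_for_status()
    print(resp.json())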