MotteRL enables fine-tuning of language models using reinforcement learning with custom reward functions, making advanced AI training accessible without deep ML expertise.
Define custom reward logic in Python to train models for specific tasks and behaviors.
```python
def reward_fn(completion, **kwargs):
    # Binary reward: 1.0 if the expected string appears in the model's response.
    response = completion[0].get('content', '')
    expected = kwargs.get('expected_result', '')
    return 1.0 if expected in response else 0.0
```
Use JSONL format for training examples with prompts and expected results.
{"prompt": "What is AI?", "expected_result": "Artificial Intelligence"} {"prompt": "Define ML", "expected_result": "Machine Learning"}
No local GPU required - training runs on cloud infrastructure with automatic scaling and monitoring.
Train various models including Qwen, Phi, Llama, and custom architectures with consistent APIs.
| Model | Parameters | Best For | Training Time |
|---|---|---|---|
| Qwen 2.5 7B | 7 billion | General purpose, multilingual | 2-4 hours |
| Phi 3.5 Mini | 3.8 billion | Fast inference, mobile | 1-2 hours |
| Llama 3.1 8B | 8 billion | Code generation, reasoning | 3-5 hours |
| Custom Models | Variable | Specialized tasks | Variable |
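Because the training API is consistent across the models in the table, switching architectures usually comes down to changing the `model` field in the request. The sketch below is illustrative only; `qwen-2.5-7b` appears in the API example further down, while the other identifiers are assumptions and should be checked against the model catalog.

```python
# Illustrative only: the same payload, varying just the "model" field.
base_config = {
    "learningRate": 0.01,
    "iterations": 100,
    "batchSize": 8,
}

qwen_job = {**base_config, "model": "qwen-2.5-7b"}
phi_job = {**base_config, "model": "phi-3.5-mini"}    # assumed identifier
llama_job = {**base_config, "model": "llama-3.1-8b"}  # assumed identifier
```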
Learning Rate (`learningRate`) | Range: 0.001 - 0.1 | Default: 0.01 | Recommendation: Start with 0.01, decrease if training is unstable
Iterations (`iterations`) | Range: 50-500 | Default: 100 | Recommendation: Scale with dataset size (roughly 1 iteration per 10 examples; see the sketch below)
Multi-Step Mode (`multiStep`) | Use for: Conversational agents, multi-turn problem solving, sequential decision making
Batch Size (`batchSize`) | Range: 1-32 | Default: 8 | Recommendation: Increase for larger datasets, decrease for memory constraints
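As a quick way to apply the iteration guidance above, the helper below (our own naming, not a MotteRL API) scales iterations with dataset size and clamps the result to the documented 50-500 range.

```python
def suggested_iterations(num_examples, minimum=50, maximum=500):
    """Rough heuristic from the guidance above: ~1 iteration per 10 examples,
    clamped to the documented 50-500 range."""
    return max(minimum, min(maximum, num_examples // 10))

print(suggested_iterations(800))   # 80
print(suggested_iterations(200))   # 50  (clamped up to the minimum)
print(suggested_iterations(9000))  # 500 (clamped down to the maximum)
```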
```python
def semantic_reward(completion, **kwargs):
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    # Embed both texts and score by cosine similarity, clamped at zero.
    model = SentenceTransformer('all-MiniLM-L6-v2')
    response = completion[0].get('content', '')
    expected = kwargs.get('expected_result', '')
    embeddings = model.encode([response, expected])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return max(0, similarity)
```
Uses sentence transformers to measure semantic similarity between the model's response and the expected result.
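One practical note: constructing the SentenceTransformer inside the function reloads the model on every call. Assuming the reward function runs in a persistent Python process, a cached variant like the sketch below avoids that cost; the caching approach is ours, not part of MotteRL.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def _embedding_model():
    # Loaded once and reused across reward calls.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer('all-MiniLM-L6-v2')

def semantic_reward_cached(completion, **kwargs):
    from sklearn.metrics.pairwise import cosine_similarity
    response = completion[0].get('content', '')
    expected = kwargs.get('expected_result', '')
    embeddings = _embedding_model().encode([response, expected])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return max(0.0, float(similarity))
```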
```python
def multi_criteria_reward(completion, **kwargs):
    response = completion[0].get('content', '')

    # Accuracy score
    expected = kwargs.get('expected_result', '')
    accuracy = 1.0 if expected.lower() in response.lower() else 0.0

    # Length score (prefer concise responses)
    target_length = kwargs.get('target_length', 100)
    length_score = max(0, 1 - abs(len(response) - target_length) / target_length)

    # Sentiment score
    positive_words = ['good', 'great', 'excellent', 'helpful']
    sentiment_score = sum(1 for word in positive_words if word in response.lower()) / len(positive_words)

    # Weighted combination
    return 0.5 * accuracy + 0.3 * length_score + 0.2 * sentiment_score
```
Combines multiple criteria with weighted scoring for comprehensive evaluation.
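To see how the components combine, here is a hand-worked call; the completion structure (a list of message dicts with a `content` field) matches the earlier examples, and the numbers in the comments are approximate.

```python
completion = [{"role": "assistant", "content": "Machine Learning is a great, helpful field."}]

score = multi_criteria_reward(
    completion,
    expected_result="Machine Learning",  # accuracy -> 1.0 (substring match)
    target_length=50,                    # length score compares len(response) to 50
)
# accuracy 1.0, length_score ~0.86 (43 chars vs. target 50),
# sentiment 0.5 ("great" and "helpful" out of 4 positive words)
print(round(score, 2))  # ~0.86 = 0.5*1.0 + 0.3*0.86 + 0.2*0.5
```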
```http
POST /api/training/start
Content-Type: application/json

{
  "model": "qwen-2.5-7b",
  "learningRate": 0.01,
  "iterations": 100,
  "batchSize": 8,
  "multiStep": false,
  "rewardFunction": "def reward_fn(completion, **kwargs): ...",
  "trainingData": [
    {"prompt": "...", "expected_result": "..."},
    ...
  ]
}
```
Starts a new training job with specified parameters and data.
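A minimal client-side sketch for submitting a job, assuming an HTTP base URL and bearer-token authentication (neither is specified here, so treat both as placeholders):

```python
import requests  # assumed HTTP client; any equivalent works

BASE_URL = "https://api.example.com"  # placeholder, not the real host
API_KEY = "YOUR_API_KEY"              # placeholder

reward_source = """
def reward_fn(completion, **kwargs):
    response = completion[0].get('content', '')
    expected = kwargs.get('expected_result', '')
    return 1.0 if expected in response else 0.0
"""

payload = {
    "model": "qwen-2.5-7b",
    "learningRate": 0.01,
    "iterations": 100,
    "batchSize": 8,
    "multiStep": False,
    "rewardFunction": reward_source,
    "trainingData": [
        {"prompt": "What is AI?", "expected_result": "Artificial Intelligence"},
        {"prompt": "Define ML", "expected_result": "Machine Learning"},
    ],
}

resp = requests.post(
    f"{BASE_URL}/api/training/start",
    headers={"Authorization": f"Bearer {API_KEY}"},  # auth scheme is an assumption
    json=payload,
    timeout=30,
)
resp.raise_for_status()
job_id = resp.json().get("job_id")  # response field name is an assumption
print("started job:", job_id)
```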
```http
GET /api/training/status/{job_id}
```

Response:

```json
{
  "status": "training|completed|failed",
  "progress": 0.75,
  "currentIteration": 75,
  "totalIterations": 100,
  "metrics": {
    "averageReward": 0.82,
    "loss": 0.15
  }
}
```
Monitor training progress and performance metrics.
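To follow a run to completion, a simple polling loop over the status endpoint is enough. The sketch below reuses the placeholder base URL and key from the previous example; the polling interval is an assumption, while the response fields follow the example above.

```python
import time
import requests

def wait_for_training(job_id, poll_seconds=30):
    """Poll the status endpoint until the job reports completed or failed."""
    while True:
        resp = requests.get(
            f"{BASE_URL}/api/training/status/{job_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        print(f"{body['status']}: {body['progress']:.0%} "
              f"(iteration {body['currentIteration']}/{body['totalIterations']}, "
              f"avg reward {body['metrics']['averageReward']})")
        if body["status"] in ("completed", "failed"):
            return body
        time.sleep(poll_seconds)
```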