MotteRL enables fine-tuning of language models using reinforcement learning with custom reward functions, making advanced AI training accessible without deep ML expertise.
Define custom reward logic in Python to train models for specific tasks and behaviors.
```python
def reward_fn(completion, **kwargs):
    # Binary reward: 1.0 if the expected string appears in the model's response.
    response = completion[0].get('content', '')
    expected = kwargs.get('expected_result', '')
    return 1.0 if expected in response else 0.0
```
Use JSONL format for training examples with prompts and expected results.
{"prompt": "What is AI?", "expected_result": "Artificial Intelligence"} {"prompt": "Define ML", "expected_result": "Machine Learning"}
No local GPU required - training runs on cloud infrastructure with automatic scaling and monitoring.
Train various models including Qwen, Phi, Llama, and custom architectures with consistent APIs.
| Model | Parameters | Best For | Training Time |
|---|---|---|---|
| Qwen 2.5 7B | 7 billion | General purpose, multilingual | 2-4 hours |
| Phi 3.5 Mini | 3.8 billion | Fast inference, mobile | 1-2 hours |
| Llama 3.1 8B | 8 billion | Code generation, reasoning | 3-5 hours |
| Custom Models | Variable | Specialized tasks | Variable |
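Because the training API is consistent across the models in the table, switching architectures usually comes down to changing the `model` field in the request. The sketch below is illustrative only; `qwen-2.5-7b` appears in the API example further down, while the other identifiers are assumptions and should be checked against the model catalog.

```python
# Illustrative only: the same payload, varying just the "model" field.
base_config = {
    "learningRate": 0.01,
    "iterations": 100,
    "batchSize": 8,
}

qwen_job = {**base_config, "model": "qwen-2.5-7b"}
phi_job = {**base_config, "model": "phi-3.5-mini"}    # assumed identifier
llama_job = {**base_config, "model": "llama-3.1-8b"}  # assumed identifier
```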
Learning Rate (`learningRate`) | Range: 0.001 - 0.1 | Default: 0.01 | Recommendation: Start with 0.01, decrease if training is unstable
Iterations (`iterations`) | Range: 50-500 | Default: 100 | Recommendation: Scale with dataset size (roughly 1 iteration per 10 examples; see the sketch below)
Multi-Step Mode (`multiStep`) | Use for: Conversational agents, multi-turn problem solving, sequential decision making
Batch Size (`batchSize`) | Range: 1-32 | Default: 8 | Recommendation: Increase for larger datasets, decrease for memory constraints
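As a quick way to apply the iteration guidance above, the helper below (our own naming, not a MotteRL API) scales iterations with dataset size and clamps the result to the documented 50-500 range.

```python
def suggested_iterations(num_examples, minimum=50, maximum=500):
    """Rough heuristic from the guidance above: ~1 iteration per 10 examples,
    clamped to the documented 50-500 range."""
    return max(minimum, min(maximum, num_examples // 10))

print(suggested_iterations(800))   # 80
print(suggested_iterations(200))   # 50  (clamped up to the minimum)
print(suggested_iterations(9000))  # 500 (clamped down to the maximum)
```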
```python
def semantic_reward(completion, **kwargs):
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    # Embed both texts and score by cosine similarity, clamped at zero.
    model = SentenceTransformer('all-MiniLM-L6-v2')
    response = completion[0].get('content', '')
    expected = kwargs.get('expected_result', '')
    embeddings = model.encode([response, expected])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return max(0, similarity)
```
Uses sentence transformers to measure semantic similarity between the model's response and the expected result.
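One practical note: constructing the SentenceTransformer inside the function reloads the model on every call. Assuming the reward function runs in a persistent Python process, a cached variant like the sketch below avoids that cost; the caching approach is ours, not part of MotteRL.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def _embedding_model():
    # Loaded once and reused across reward calls.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer('all-MiniLM-L6-v2')

def semantic_reward_cached(completion, **kwargs):
    from sklearn.metrics.pairwise import cosine_similarity
    response = completion[0].get('content', '')
    expected = kwargs.get('expected_result', '')
    embeddings = _embedding_model().encode([response, expected])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return max(0.0, float(similarity))
```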
```python
def multi_criteria_reward(completion, **kwargs):
    response = completion[0].get('content', '')

    # Accuracy score
    expected = kwargs.get('expected_result', '')
    accuracy = 1.0 if expected.lower() in response.lower() else 0.0

    # Length score (prefer concise responses)
    target_length = kwargs.get('target_length', 100)
    length_score = max(0, 1 - abs(len(response) - target_length) / target_length)

    # Sentiment score
    positive_words = ['good', 'great', 'excellent', 'helpful']
    sentiment_score = sum(1 for word in positive_words if word in response.lower()) / len(positive_words)

    # Weighted combination
    return 0.5 * accuracy + 0.3 * length_score + 0.2 * sentiment_score
```
Combines multiple criteria with weighted scoring for comprehensive evaluation.
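To see how the components combine, here is a hand-worked call; the completion structure (a list of message dicts with a `content` field) matches the earlier examples, and the numbers in the comments are approximate.

```python
completion = [{"role": "assistant", "content": "Machine Learning is a great, helpful field."}]

score = multi_criteria_reward(
    completion,
    expected_result="Machine Learning",  # accuracy -> 1.0 (substring match)
    target_length=50,                    # length score compares len(response) to 50
)
# accuracy 1.0, length_score ~0.86 (43 chars vs. target 50),
# sentiment 0.5 ("great" and "helpful" out of 4 positive words)
print(round(score, 2))  # ~0.86 = 0.5*1.0 + 0.3*0.86 + 0.2*0.5
```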
```http
POST /api/training/start
Content-Type: application/json

{
  "model": "qwen-2.5-7b",
  "learningRate": 0.01,
  "iterations": 100,
  "batchSize": 8,
  "multiStep": false,
  "rewardFunction": "def reward_fn(completion, **kwargs): ...",
  "trainingData": [
    {"prompt": "...", "expected_result": "..."},
    ...
  ]
}
```
Starts a new training job with specified parameters and data.
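A minimal client-side sketch for submitting a job, assuming an HTTP base URL and bearer-token authentication (neither is specified here, so treat both as placeholders):

```python
import requests  # assumed HTTP client; any equivalent works

BASE_URL = "https://api.example.com"  # placeholder, not the real host
API_KEY = "YOUR_API_KEY"              # placeholder

reward_source = """
def reward_fn(completion, **kwargs):
    response = completion[0].get('content', '')
    expected = kwargs.get('expected_result', '')
    return 1.0 if expected in response else 0.0
"""

payload = {
    "model": "qwen-2.5-7b",
    "learningRate": 0.01,
    "iterations": 100,
    "batchSize": 8,
    "multiStep": False,
    "rewardFunction": reward_source,
    "trainingData": [
        {"prompt": "What is AI?", "expected_result": "Artificial Intelligence"},
        {"prompt": "Define ML", "expected_result": "Machine Learning"},
    ],
}

resp = requests.post(
    f"{BASE_URL}/api/training/start",
    headers={"Authorization": f"Bearer {API_KEY}"},  # auth scheme is an assumption
    json=payload,
    timeout=30,
)
resp.raise_for_status()
job_id = resp.json().get("job_id")  # response field name is an assumption
print("started job:", job_id)
```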
```http
GET /api/training/status/{job_id}
```

Response:

```json
{
  "status": "training|completed|failed",
  "progress": 0.75,
  "currentIteration": 75,
  "totalIterations": 100,
  "metrics": {
    "averageReward": 0.82,
    "loss": 0.15
  }
}
```
Monitor training progress and performance metrics.
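To follow a run to completion, a simple polling loop over the status endpoint is enough. The sketch below reuses the placeholder base URL and key from the previous example; the polling interval is an assumption, while the response fields follow the example above.

```python
import time
import requests

def wait_for_training(job_id, poll_seconds=30):
    """Poll the status endpoint until the job reports completed or failed."""
    while True:
        resp = requests.get(
            f"{BASE_URL}/api/training/status/{job_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        print(f"{body['status']}: {body['progress']:.0%} "
              f"(iteration {body['currentIteration']}/{body['totalIterations']}, "
              f"avg reward {body['metrics']['averageReward']})")
        if body["status"] in ("completed", "failed"):
            return body
        time.sleep(poll_seconds)
```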