MotteRL enables you to fine-tune language models using reinforcement learning with custom reward functions.
MotteRL simplifies the process of training AI models using reinforcement learning from human feedback (RLHF) and custom reward functions. Instead of requiring complex ML infrastructure, MotteRL leverages cloud APIs to make model training accessible to developers without deep ML expertise.
Define your own reward logic in Python to train models for specific tasks and behaviors.
No local GPU required - training runs on cloud infrastructure with automatic scaling.
Train various models including Qwen, Phi, Llama, and custom architectures.
Simple data format for training examples with expected outputs and metadata.
Create a JSONL file with your training examples. Each line should contain a JSON object with the prompt and expected result:
{"prompt": "What is the capital of France?", "expected_result": "Paris"} {"prompt": "Explain photosynthesis briefly", "expected_result": "Plants convert sunlight into energy"} {"prompt": "Write a haiku about coding", "expected_result": "Code flows like water\nBugs hide in silent shadows\nDebug brings the light"}
Create a Python function that evaluates model responses and returns a score (higher is better):
    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        expected = kwargs.get('expected_result', '')
        # Exact match gets full reward
        if expected.lower() in response.lower():
            return 1.0
        # Partial match gets partial reward
        words_match = len(set(expected.lower().split()) & set(response.lower().split()))
        total_words = len(expected.split())
        return words_match / total_words if total_words > 0 else 0.0
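Before starting a training run, it is worth calling the reward function locally against a hand-written completion to confirm it returns sensible scores. A minimal check, assuming completions arrive as a list of message dicts; only the 'content' key is read, and the 'role' field here is illustrative.

    # Local sanity check for the reward function above.
    sample_completion = [{"role": "assistant", "content": "The capital of France is Paris."}]

    score = reward_fn(sample_completion, expected_result="Paris")
    print(score)  # 1.0, because the expected answer appears verbatim in the response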
Enable multi-step reinforcement for agents that need to perform multiple actions or navigate complex decision trees.
Use cases include conversational agents, multi-turn problem solving, sequential decision making, and game-playing agents.
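The exact completion format passed to a multi-step reward function is not shown here, so the sketch below is illustrative only: it assumes the completion arrives as a list of per-turn message dicts and scores the final turn while applying a small penalty for extra steps.

    def reward_fn(completion, **kwargs):
        # Illustrative multi-step reward. Assumes `completion` is a list of
        # per-turn message dicts; verify the actual format your job receives.
        expected = kwargs.get('expected_result', '')

        final_response = completion[-1].get('content', '') if completion else ''
        solved = 1.0 if expected.lower() in final_response.lower() else 0.0

        # Small penalty per extra turn, so shorter successful trajectories score higher.
        step_penalty = 0.05 * max(0, len(completion) - 1)
        return max(0.0, solved - step_penalty)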
    # Exact match: full reward only when the expected answer appears verbatim.
    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        expected = kwargs.get('expected_result', '')
        return 1.0 if expected in response else 0.0
    # Length control: reward falls off linearly as the response length drifts from the target.
    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        target_length = kwargs.get('target_length', 100)
        length_diff = abs(len(response) - target_length)
        return max(0, 1 - length_diff / target_length)
    # Sentiment: crude keyword-based check that favors the requested sentiment.
    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        target_sentiment = kwargs.get('sentiment', 'positive')
        # Simple sentiment analysis
        positive_words = ['good', 'great', 'excellent', 'amazing']
        negative_words = ['bad', 'terrible', 'awful', 'horrible']
        pos_count = sum(1 for word in positive_words if word in response.lower())
        neg_count = sum(1 for word in negative_words if word in response.lower())
        if target_sentiment == 'positive':
            return pos_count / (pos_count + neg_count + 1)
        else:
            return neg_count / (pos_count + neg_count + 1)
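Reward signals can also be blended. The sketch below combines the exact-match and length checks from the examples above into one weighted score; the 0.7/0.3 weights are arbitrary and worth tuning for your task.

    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        expected = kwargs.get('expected_result', '')
        target_length = kwargs.get('target_length', 100)

        # Correctness: full credit when the expected answer appears in the response.
        correctness = 1.0 if expected.lower() in response.lower() else 0.0

        # Length: decays linearly as the response drifts from the target length.
        length_score = max(0, 1 - abs(len(response) - target_length) / target_length)

        # Weighted blend, still bounded to [0, 1]; the weights are illustrative.
        return 0.7 * correctness + 0.3 * length_score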
Begin with 50-100 training examples and simple reward functions before scaling up.
Ensure your reward function returns values between 0 and 1 for consistent training; a small clamping helper is sketched after these tips.
Monitor training progress and adjust the learning rate if the model isn't improving.
Always test your trained model on held-out data to ensure it generalizes well.
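If a raw heuristic can fall outside the 0-1 range, clamp (or rescale) it before returning. A minimal sketch; the clamp helper and the 'keywords' metadata field are hypothetical, not part of MotteRL:

    def clamp(value, low=0.0, high=1.0):
        # Keep the reward inside the range the trainer expects.
        return max(low, min(high, value))

    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        # 'keywords' is a hypothetical metadata field; the raw count can exceed 1.0.
        keywords = kwargs.get('keywords', [])
        raw_score = sum(1 for kw in keywords if kw.lower() in response.lower()) / 3.0
        return clamp(raw_score)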
    POST /api/training/start

    {
      "model": "qwen-2.5-7b",
      "learningRate": 0.01,
      "iterations": "100",
      "rewardFunction": "def reward_fn(completion, **kwargs): ..."
    }
Starts a new training job with the specified parameters.
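For reference, a training job could be started from Python with a plain HTTP request. This is a sketch only: the base URL and Authorization header are assumptions, and the body mirrors the parameters shown above.

    import textwrap
    import requests

    # Assumed base URL and API-key header; substitute your actual deployment details.
    BASE_URL = "https://api.example.com"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}

    reward_source = """
    def reward_fn(completion, **kwargs):
        response = completion[0].get('content', '')
        expected = kwargs.get('expected_result', '')
        return 1.0 if expected in response else 0.0
    """

    payload = {
        "model": "qwen-2.5-7b",
        "learningRate": 0.01,
        "iterations": "100",
        "rewardFunction": textwrap.dedent(reward_source).strip(),
    }

    resp = requests.post(f"{BASE_URL}/api/training/start", json=payload, headers=headers)
    resp.raise_for_status()
    print(resp.json())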