Generative AI
The Generative AI API provides an OpenAI-compatible interface for running large language models (LLMs) on-device. This guide covers chat completions, embeddings, and model cache management.
Prerequisites
Before making inference requests:
- The mimOE runtime is running (Quick Start)
- At least one model is provisioned and ready in the Model Registry (upload guide)
Base URL
http://localhost:8083/mimik-ai/openai/v1
All endpoints require authentication:
Authorization: Bearer 1234
Chat Completions
Generate text responses from an on-device LLM. The API follows the OpenAI chat completions format.
Basic Chat Completion
- cURL
- JavaScript
- Python
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "user", "content": "Complete this sentence: AI is like a"}
]
}'
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234'
});
const response = await client.chat.completions.create({
model: 'smollm2-360m',
messages: [
{ role: 'user', content: 'Complete this sentence: AI is like a' }
]
});
console.log(response.choices[0].message.content);
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
response = client.chat.completions.create(
model="smollm2-360m",
messages=[
{"role": "user", "content": "Complete this sentence: AI is like a"}
]
)
print(response.choices[0].message.content)
Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1702742400,
"model": "smollm2-360m",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Running AI locally offers privacy (your data stays on-device), lower latency (no network round-trips), offline capability, and reduced cloud costs."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
}
}
System Messages
Guide the model's behavior with a system message:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "system", "content": "You are a helpful coding assistant. Provide concise, accurate code examples."},
{"role": "user", "content": "Write a Python function to reverse a string"}
]
}'
Multi-Turn Conversation
Include conversation history in the messages array:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "What is its population?"}
]
}'
The API is stateless: the model does not remember previous requests, so you must include the full conversation history in each request.
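The sketch below (JavaScript, same client setup as the earlier examples) shows one way to manage this: keep a messages array in application code, append the assistant's reply after each response, and send the whole array again on the next turn. The prompts reuse the multi-turn example above.
import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});
// The server stores nothing between requests, so the history lives here.
const messages = [
  { role: 'user', content: 'What is the capital of France?' }
];
const first = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages
});
// Append the assistant reply and the follow-up question, then send the
// full history again so the model can resolve "its" to "Paris".
messages.push({ role: 'assistant', content: first.choices[0].message.content });
messages.push({ role: 'user', content: 'What is its population?' });
const second = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages
});
console.log(second.choices[0].message.content);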
Streaming Responses
Get responses token-by-token for real-time UX:
- JavaScript
- Python
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234'
});
const stream = await client.chat.completions.create({
model: 'smollm2-360m',
messages: [{ role: 'user', content: 'Complete this sentence: Programming is like a' }],
stream: true
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
process.stdout.write(content);
}
}
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
stream = client.chat.completions.create(
model="smollm2-360m",
messages=[{"role": "user", "content": "Complete this sentence: Programming is like a"}],
stream=True
)
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
Generation Parameters
Control the output with generation parameters:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [{"role": "user", "content": "Tell me a creative story"}],
"max_tokens": 200,
"temperature": 0.8,
"top_p": 0.95
}'
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_tokens | integer | model limit | Maximum tokens to generate |
| temperature | float (0-2) | 1.0 | Randomness (lower = more focused) |
| top_p | float (0-1) | 1.0 | Nucleus sampling threshold |
Temperature Guidelines:
- 0.0-0.3: Factual, deterministic responses
- 0.4-0.7: Balanced creativity and coherence
- 0.8-1.5: Creative, diverse outputs
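For example, a low temperature suits short factual answers while a higher one suits open-ended generation. The sketch below is a minimal illustration (JavaScript, same client setup as earlier; the low-temperature prompt is an example, not from this guide):
import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});
// Low temperature: focused, near-deterministic output.
const factual = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: [{ role: 'user', content: 'List the three primary colors.' }],
  temperature: 0.2,
  max_tokens: 50
});
// Higher temperature: more varied, creative output.
const creative = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: [{ role: 'user', content: 'Tell me a creative story' }],
  temperature: 1.2,
  max_tokens: 200
});
console.log(factual.choices[0].message.content);
console.log(creative.choices[0].message.content);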
Embeddings
Generate vector embeddings from text using embedding models (kind: embed).
- cURL
- JavaScript
- Python
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/embeddings" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "nomic-embed-text",
"input": "The quick brown fox jumps over the lazy dog."
}'
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234'
});
// Single text
const response = await client.embeddings.create({
model: 'nomic-embed-text',
input: 'The quick brown fox jumps over the lazy dog.'
});
console.log(response.data[0].embedding.slice(0, 5)); // First 5 dimensions
// Batch embeddings
const batchResponse = await client.embeddings.create({
model: 'nomic-embed-text',
input: [
'First document to embed',
'Second document to embed',
'Third document to embed'
]
});
batchResponse.data.forEach(item => {
console.log(`Document ${item.index}: ${item.embedding.length} dimensions`);
});
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
# Single text
response = client.embeddings.create(
model="nomic-embed-text",
input="The quick brown fox jumps over the lazy dog."
)
print(response.data[0].embedding[:5]) # First 5 dimensions
# Batch embeddings
response = client.embeddings.create(
model="nomic-embed-text",
input=[
"First document to embed",
"Second document to embed",
"Third document to embed"
]
)
for item in response.data:
print(f"Document {item.index}: {len(item.embedding)} dimensions")
Model Cache Management
The Inference API maintains a runtime cache of loaded models. Models are automatically loaded on first inference, or you can pre-load them.
List Loaded Models
curl "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Authorization: Bearer 1234"
Response:
{
"object": "list",
"data": [
{
"id": "smollm2-360m",
"object": "model",
"created": 1729591200,
"owned_by": "local"
}
]
}
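Because the endpoint follows the OpenAI format, the SDK's models.list() call should also work against it; a minimal sketch in JavaScript:
import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});
// List the models currently loaded in the runtime cache.
const models = await client.models.list();
for (const model of models.data) {
  console.log(model.id);
}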
Pre-Load a Model
Load a model into cache before first inference:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{"model": "smollm2-360m"}'
This is useful for warming up the cache to avoid first-inference latency.
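Pre-loading is not a standard OpenAI SDK operation, so a plain HTTP call works well here. The sketch below uses fetch with the example model ID from this guide and could run once at application startup:
// Warm up the cache at startup so the first user request is fast.
const response = await fetch('http://localhost:8083/mimik-ai/openai/v1/models', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer 1234'
  },
  body: JSON.stringify({ model: 'smollm2-360m' })
});
if (!response.ok) {
  console.error(`Pre-load failed with status ${response.status}`);
}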
Unload a Model
Free memory by unloading a model from cache:
curl -X DELETE "http://localhost:8083/mimik-ai/openai/v1/models/smollm2-360m" \
-H "Authorization: Bearer 1234"
The model file remains in the Model Registry and can be reloaded later.
Error Handling
Common error responses:
| Status | Cause | Solution |
|---|---|---|
| 400 | Model not ready | Complete model provisioning in Model Registry |
| 401 | Invalid API key | Check Authorization header |
| 404 | Model not found | Provision model in Model Registry first |
Error Response Format:
{
"error": {
"code": 404,
"message": "Model 'unknown-model' not found in store"
}
}
Handling Errors with OpenAI SDK
- JavaScript
- Python
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234'
});
try {
const response = await client.chat.completions.create({
model: 'smollm2-360m',
messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);
} catch (error) {
if (error instanceof OpenAI.APIConnectionError) {
console.error('Could not connect to mimOE. Is the runtime running?');
} else if (error instanceof OpenAI.APIError) {
console.error(`API error: ${error.message}`);
}
}
from openai import OpenAI, APIError, APIConnectionError
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
try:
response = client.chat.completions.create(
model="smollm2-360m",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
except APIConnectionError:
print("Could not connect to mimOE. Is the runtime running?")
except APIError as e:
print(f"API error: {e.message}")
Performance Tips
First Request Latency
The first inference request loads the model into memory. This can take 5-30 seconds depending on model size. Subsequent requests are much faster.
Solution: Pre-load models on startup:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{"model": "smollm2-360m"}'
Context Size
Stay within the model's context window. If you exceed it, responses may be cut off or the request may fail.
function trimConversation(messages, maxTokens = 2048) {
// Simple estimation: ~4 chars per token
let totalChars = 0;
const trimmed = [];
for (let i = messages.length - 1; i >= 0; i--) {
const msgChars = messages[i].content.length;
if (totalChars + msgChars > maxTokens * 4) break;
trimmed.unshift(messages[i]);
totalChars += msgChars;
}
return trimmed;
}
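For instance, the history could be trimmed just before each request. The snippet below assumes the client from the earlier examples; the 2048-token budget is an example value, so use your model's actual context size:
// Hypothetical running history built up by your application.
const history = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Summarize our conversation so far.' }
];
const response = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: trimConversation(history, 2048)
});
console.log(response.choices[0].message.content);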
Memory Management
Unload models you're not using to free memory:
curl -X DELETE "http://localhost:8083/mimik-ai/openai/v1/models/old-model" \
-H "Authorization: Bearer 1234"
Troubleshooting
Slow First Request
Symptom: First inference takes 30+ seconds
Cause: Model loading into memory
Solution: This is expected. Pre-load models to avoid user-facing latency.
Out of Memory
Symptom: Error about insufficient memory
Solution:
- Use a smaller quantized model (Q4 instead of Q8)
- Unload other models from cache
- Close other memory-intensive applications
Token Limit Exceeded
Symptom: Response is cut off or error about context length
Cause: Every model has a maximum context size. The total tokens (input + output) cannot exceed this limit.
Solution:
- Trim conversation history to reduce input tokens
- Adjust max_tokens in the chat completion request to limit output tokens
- Increase initContextSize when provisioning the model (see the Model Registry API)
Next Steps
- Chat with SmolLM2: Build a complete chat application
- Semantic Search: Use embeddings for search
- Model Registry API: Manage your models
- Inference API Reference: Complete API specification