Generative AI

The Generative AI API provides an OpenAI-compatible interface for running LLMs on-device. This guide covers chat completions, embeddings, and model cache management.

Prerequisites

Before making inference requests:

  • mimOE runtime is running (Quick Start)
  • At least one model provisioned and ready in the Model Registry (upload guide)

Base URL

http://localhost:8083/mimik-ai/openai/v1

All endpoints require authentication:

Authorization: Bearer 1234

Chat Completions

Generate text responses from an on-device LLM. The API follows the OpenAI chat completions format.

Basic Chat Completion

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "smollm2-360m",
    "messages": [
      {"role": "user", "content": "Complete this sentence: AI is like a"}
    ]
  }'

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1702742400,
  "model": "smollm2-360m",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "AI is like a vast library that writes new books on demand, assembling them from everything it has read."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 28,
    "total_tokens": 36
  }
}
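
The same call through the OpenAI SDK, which this guide also uses for streaming below; a minimal TypeScript sketch:

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});

// Single-turn request; mirrors the curl example above
const response = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: [{ role: 'user', content: 'Complete this sentence: AI is like a' }]
});

console.log(response.choices[0].message.content);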

System Messages

Guide the model's behavior with a system message:

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "smollm2-360m",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant. Provide concise, accurate code examples."},
      {"role": "user", "content": "Write a Python function to reverse a string"}
    ]
  }'

Multi-Turn Conversation

Include conversation history in the messages array:

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "smollm2-360m",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"},
      {"role": "assistant", "content": "The capital of France is Paris."},
      {"role": "user", "content": "What is its population?"}
    ]
  }'

Conversation Management

The API is stateless. You must include the full conversation history in each request. The model doesn't remember previous requests.
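
A minimal sketch of managing history on the client; the Chat wrapper below is illustrative, not part of the API:

import OpenAI from 'openai';
import type { ChatCompletionMessageParam } from 'openai/resources/chat/completions';

const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});

// Hypothetical helper: accumulates messages locally and resends the
// full history with every request, since the API keeps no state.
class Chat {
  private history: ChatCompletionMessageParam[] = [];

  async send(content: string): Promise<string> {
    this.history.push({ role: 'user', content });
    const response = await client.chat.completions.create({
      model: 'smollm2-360m',
      messages: this.history // full history every time
    });
    const reply = response.choices[0].message.content ?? '';
    this.history.push({ role: 'assistant', content: reply });
    return reply;
  }
}

const chat = new Chat();
await chat.send('What is the capital of France?');
console.log(await chat.send('What is its population?'));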

Streaming Responses

Get responses token-by-token for real-time UX:

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});

const stream = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: [{ role: 'user', content: 'Complete this sentence: Programming is like a' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}

Generation Parameters

Control the output with generation parameters:

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "smollm2-360m",
    "messages": [{"role": "user", "content": "Tell me a creative story"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_p": 0.95
  }'

Parameter    Type         Default      Description
max_tokens   integer      model limit  Maximum tokens to generate
temperature  float (0-2)  1.0          Randomness (lower = more focused)
top_p        float (0-1)  1.0          Nucleus sampling threshold

Temperature Guidelines:

  • 0.0-0.3: Factual, deterministic responses
  • 0.4-0.7: Balanced creativity and coherence
  • 0.8-1.5: Creative, diverse outputs
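
To see the difference in practice, run the same prompt at two temperatures; a quick sketch (client configured as in the other SDK examples, prompt is illustrative):

// Low temperature -> focused and repeatable; high -> more varied wording
for (const temperature of [0.2, 1.2]) {
  const res = await client.chat.completions.create({
    model: 'smollm2-360m',
    messages: [{ role: 'user', content: 'Describe the ocean in one sentence.' }],
    temperature,
    max_tokens: 60
  });
  console.log(`temperature=${temperature}:`, res.choices[0].message.content);
}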

Embeddings

Generate vector embeddings from text using embedding models (kind: embed).

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/embeddings" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "nomic-embed-text",
    "input": "The quick brown fox jumps over the lazy dog."
  }'
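
Assuming the response follows the standard OpenAI embeddings schema (an object with a data array of embedding vectors), you can compare two texts with cosine similarity; a sketch using the SDK client configured as in the chat examples:

// Embed two texts in one request (array input follows the OpenAI format)
const { data } = await client.embeddings.create({
  model: 'nomic-embed-text',
  input: [
    'The quick brown fox jumps over the lazy dog.',
    'A fast fox leaps over a sleepy dog.'
  ]
});

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log('similarity:', cosineSimilarity(data[0].embedding, data[1].embedding));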

Model Cache Management

The Inference API maintains a runtime cache of loaded models. Models are automatically loaded on first inference, or you can pre-load them.

List Loaded Models

curl "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Authorization: Bearer 1234"

Response:

{
  "object": "list",
  "data": [
    {
      "id": "smollm2-360m",
      "object": "model",
      "created": 1729591200,
      "owned_by": "local"
    }
  ]
}
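
Since the endpoint follows the OpenAI models format, listing through the SDK should work as well; a sketch:

// client configured as in the chat examples
const models = await client.models.list();
for await (const model of models) {
  console.log(model.id); // e.g. "smollm2-360m"
}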

Pre-Load a Model

Load a model into cache before first inference:

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{"model": "smollm2-360m"}'

This is useful for warming up the cache to avoid first-inference latency.

Unload a Model

Free memory by unloading a model from cache:

curl -X DELETE "http://localhost:8083/mimik-ai/openai/v1/models/smollm2-360m" \
  -H "Authorization: Bearer 1234"

The model file remains in the Model Registry and can be reloaded later.

Error Handling

Common error responses:

Status  Cause            Solution
400     Model not ready  Complete model provisioning in Model Registry
401     Invalid API key  Check Authorization header
404     Model not found  Provision model in Model Registry first

Error Response Format:

{
  "error": {
    "code": 404,
    "message": "Model 'unknown-model' not found in store"
  }
}

Handling Errors with OpenAI SDK

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});

try {
  const response = await client.chat.completions.create({
    model: 'smollm2-360m',
    messages: [{ role: 'user', content: 'Hello!' }]
  });
  console.log(response.choices[0].message.content);
} catch (error) {
  if (error instanceof OpenAI.APIConnectionError) {
    console.error('Could not connect to mimOE. Is the runtime running?');
  } else if (error instanceof OpenAI.APIError) {
    console.error(`API error: ${error.message}`);
  }
}

Performance Tips

First Request Latency

The first inference request loads the model into memory. This can take 5-30 seconds depending on model size. Subsequent requests are much faster.

Solution: Pre-load models on startup:

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{"model": "smollm2-360m"}'

Context Size

Stay within the model's context window. If you exceed it, responses may be cut off or the request may fail.

function trimConversation(messages, maxTokens = 2048) {
  // Rough estimate: ~4 characters per token for English text
  let totalChars = 0;
  const trimmed = [];

  // Walk backwards from the newest message, keeping as much
  // recent history as fits in the budget
  for (let i = messages.length - 1; i >= 0; i--) {
    const msgChars = messages[i].content.length;
    if (totalChars + msgChars > maxTokens * 4) break;
    trimmed.unshift(messages[i]);
    totalChars += msgChars;
  }

  // Note: a leading system message can be dropped once the budget is tight
  return trimmed;
}
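
The 4-characters-per-token ratio is a rough heuristic for English text; budget below the model's full context so the reply has room. A hypothetical call site:

// history is your accumulated message array; 1800 of a 2048-token
// context leaves headroom for the model's reply
const response = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: trimConversation(history, 1800)
});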

Memory Management

Unload models you're not using to free memory:

curl -X DELETE "http://localhost:8083/mimik-ai/openai/v1/models/old-model" \
  -H "Authorization: Bearer 1234"

Troubleshooting

Slow First Request

Symptom: First inference takes 30+ seconds

Cause: Model loading into memory

Solution: This is expected. Pre-load models to avoid user-facing latency.

Out of Memory

Symptom: Error about insufficient memory

Solution:

  1. Use a smaller quantized model (Q4 instead of Q8)
  2. Unload other models from cache
  3. Close other memory-intensive applications

Token Limit Exceeded

Symptom: Response is cut off or error about context length

Cause: Every model has a maximum context size. The total tokens (input + output) cannot exceed this limit.

Solution:

  • Trim conversation history to reduce input tokens
  • Adjust max_tokens in the chat completion request to limit output tokens
  • Increase initContextSize when provisioning the model (see Model Registry API)

Next Steps