Generative AI
The Generative AI API provides an OpenAI-compatible interface for running large language models (LLMs) on-device. This guide covers chat completions, embeddings, and model cache management.
Prerequisites
Before making inference requests:
- The mimOE runtime is running (Quick Start)
- At least one model is provisioned and ready in the Model Registry (upload guide)
Base URL
http://localhost:8083/mimik-ai/openai/v1
All endpoints require authentication:
Authorization: Bearer 1234
Chat Completions
Generate text responses from an on-device LLM. The API follows the OpenAI chat completions format.
Basic Chat Completion
- cURL
- JavaScript
- Python
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "user", "content": "Complete this sentence: AI is like a"}
]
}'
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234'
});
const response = await client.chat.completions.create({
model: 'smollm2-360m',
messages: [
{ role: 'user', content: 'Complete this sentence: AI is like a' }
]
});
console.log(response.choices[0].message.content);
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
response = client.chat.completions.create(
model="smollm2-360m",
messages=[
{"role": "user", "content": "Complete this sentence: AI is like a"}
]
)
print(response.choices[0].message.content)
Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1702742400,
"model": "smollm2-360m",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Running AI locally offers privacy (your data stays on-device), lower latency (no network round-trips), offline capability, and reduced cloud costs."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 28,
"total_tokens": 36
}
}
System Messages
Guide the model's behavior with a system message:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "system", "content": "You are a helpful coding assistant. Provide concise, accurate code examples."},
{"role": "user", "content": "Write a Python function to reverse a string"}
]
}'
Multi-Turn Conversation
Include conversation history in the messages array:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "What is its population?"}
]
}'
The API is stateless: the model does not remember previous requests, so you must include the full conversation history in each request.
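The sketch below (JavaScript, same client setup as the earlier examples) shows one way to manage this: keep a messages array in application code, append the assistant's reply after each response, and send the whole array again on the next turn. The prompts reuse the multi-turn example above.
import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});
// The server stores nothing between requests, so the history lives here.
const messages = [
  { role: 'user', content: 'What is the capital of France?' }
];
const first = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages
});
// Append the assistant reply and the follow-up question, then send the
// full history again so the model can resolve "its" to "Paris".
messages.push({ role: 'assistant', content: first.choices[0].message.content });
messages.push({ role: 'user', content: 'What is its population?' });
const second = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages
});
console.log(second.choices[0].message.content);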
Streaming Responses
Get responses token-by-token for real-time UX:
- JavaScript
- Python
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234'
});
const stream = await client.chat.completions.create({
model: 'smollm2-360m',
messages: [{ role: 'user', content: 'Complete this sentence: Programming is like a' }],
stream: true
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
process.stdout.write(content);
}
}
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
stream = client.chat.completions.create(
model="smollm2-360m",
messages=[{"role": "user", "content": "Complete this sentence: Programming is like a"}],
stream=True
)
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
Generation Parameters
Control the output with generation parameters:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [{"role": "user", "content": "Tell me a creative story"}],
"max_tokens": 200,
"temperature": 0.8,
"top_p": 0.95
}'
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_tokens | integer | model limit | Maximum tokens to generate |
| temperature | float (0-2) | 1.0 | Randomness (lower = more focused) |
| top_p | float (0-1) | 1.0 | Nucleus sampling threshold |
Temperature Guidelines:
- 0.0-0.3: Factual, deterministic responses
- 0.4-0.7: Balanced creativity and coherence
- 0.8-1.5: Creative, diverse outputs
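For example, a low temperature suits short factual answers while a higher one suits open-ended generation. The sketch below is a minimal illustration (JavaScript, same client setup as earlier; the low-temperature prompt is an example, not from this guide):
import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});
// Low temperature: focused, near-deterministic output.
const factual = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: [{ role: 'user', content: 'List the three primary colors.' }],
  temperature: 0.2,
  max_tokens: 50
});
// Higher temperature: more varied, creative output.
const creative = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: [{ role: 'user', content: 'Tell me a creative story' }],
  temperature: 1.2,
  max_tokens: 200
});
console.log(factual.choices[0].message.content);
console.log(creative.choices[0].message.content);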
Embeddings
Generate vector embeddings from text using embedding models (kind: embed).
- cURL
- JavaScript
- Python
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/embeddings" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "nomic-embed-text",
"input": "The quick brown fox jumps over the lazy dog."
}'
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234'
});
// Single text
const response = await client.embeddings.create({
model: 'nomic-embed-text',
input: 'The quick brown fox jumps over the lazy dog.'
});
console.log(response.data[0].embedding.slice(0, 5)); // First 5 dimensions
// Batch embeddings
const batchResponse = await client.embeddings.create({
model: 'nomic-embed-text',
input: [
'First document to embed',
'Second document to embed',
'Third document to embed'
]
});
batchResponse.data.forEach(item => {
console.log(`Document ${item.index}: ${item.embedding.length} dimensions`);
});
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
# Single text
response = client.embeddings.create(
model="nomic-embed-text",
input="The quick brown fox jumps over the lazy dog."
)
print(response.data[0].embedding[:5]) # First 5 dimensions
# Batch embeddings
response = client.embeddings.create(
model="nomic-embed-text",
input=[
"First document to embed",
"Second document to embed",
"Third document to embed"
]
)
for item in response.data:
print(f"Document {item.index}: {len(item.embedding)} dimensions")
Model Cache Management
The Inference API maintains a runtime cache of loaded models. Models are automatically loaded on first inference, or you can pre-load them.
List Loaded Models
curl "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Authorization: Bearer 1234"
Response:
{
"object": "list",
"data": [
{
"id": "smollm2-360m",
"object": "model",
"created": 1729591200,
"owned_by": "local"
}
]
}
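Because the endpoint follows the OpenAI format, the SDK's models.list() call should also work against it; a minimal sketch in JavaScript:
import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});
// List the models currently loaded in the runtime cache.
const models = await client.models.list();
for (const model of models.data) {
  console.log(model.id);
}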
Pre-Load a Model
Load a model into cache before first inference:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{"model": "smollm2-360m"}'
This is useful for warming up the cache to avoid first-inference latency.
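Pre-loading is not a standard OpenAI SDK operation, so a plain HTTP call works well here. The sketch below uses fetch with the example model ID from this guide and could run once at application startup:
// Warm up the cache at startup so the first user request is fast.
const response = await fetch('http://localhost:8083/mimik-ai/openai/v1/models', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer 1234'
  },
  body: JSON.stringify({ model: 'smollm2-360m' })
});
if (!response.ok) {
  console.error(`Pre-load failed with status ${response.status}`);
}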
Unload a Model
Free memory by unloading a model from cache:
curl -X DELETE "http://localhost:8083/mimik-ai/openai/v1/models/smollm2-360m" \
-H "Authorization: Bearer 1234"
The model file remains in the Model Registry and can be reloaded later.
Error Handling
Common error responses:
| Status | Cause | Solution |
|---|---|---|
| 400 | Model not ready | Complete model provisioning in Model Registry |
| 401 | Invalid API key | Check Authorization header |
| 404 | Model not found | Provision model in Model Registry first |
Error Response Format:
{
"error": {
"code": 404,
"message": "Model 'unknown-model' not found in store"
}
}
Handling Errors with OpenAI SDK
- JavaScript
- Python
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234'
});
try {
const response = await client.chat.completions.create({
model: 'smollm2-360m',
messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);
} catch (error) {
if (error instanceof OpenAI.APIConnectionError) {
console.error('Could not connect to mimOE. Is the runtime running?');
} else if (error instanceof OpenAI.APIError) {
console.error(`API error: ${error.message}`);
}
}
from openai import OpenAI, APIError, APIConnectionError
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
try:
response = client.chat.completions.create(
model="smollm2-360m",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
except APIConnectionError:
print("Could not connect to mimOE. Is the runtime running?")
except APIError as e:
print(f"API error: {e.message}")
Performance Tips
First Request Latency
The first inference request loads the model into memory. This can take 5-30 seconds depending on model size. Subsequent requests are much faster.
Solution: Pre-load models on startup:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{"model": "smollm2-360m"}'
Context Size
Stay within the model's context window. If you exceed it, responses may be cut off or the request may fail.
function trimConversation(messages, maxTokens = 2048) {
// Simple estimation: ~4 chars per token
let totalChars = 0;
const trimmed = [];
for (let i = messages.length - 1; i >= 0; i--) {
const msgChars = messages[i].content.length;
if (totalChars + msgChars > maxTokens * 4) break;
trimmed.unshift(messages[i]);
totalChars += msgChars;
}
return trimmed;
}
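For instance, the history could be trimmed just before each request. The snippet below assumes the client from the earlier examples; the 2048-token budget is an example value, so use your model's actual context size:
// Hypothetical running history built up by your application.
const history = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Summarize our conversation so far.' }
];
const response = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: trimConversation(history, 2048)
});
console.log(response.choices[0].message.content);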
Memory Management
Unload models you're not using to free memory:
curl -X DELETE "http://localhost:8083/mimik-ai/openai/v1/models/old-model" \
-H "Authorization: Bearer 1234"
Troubleshooting
Slow First Request
Symptom: First inference takes 30+ seconds
Cause: Model loading into memory
Solution: This is expected. Pre-load models to avoid user-facing latency.
Out of Memory
Symptom: Error about insufficient memory
Solution:
- Use a smaller quantized model (Q4 instead of Q8)
- Unload other models from cache
- Close other memory-intensive applications
Token Limit Exceeded
Symptom: Response is cut off or error about context length
Cause: Every model has a maximum context size. The total tokens (input + output) cannot exceed this limit.
Solution:
- Trim conversation history to reduce input tokens
- Adjust max_tokens in the chat completion request to limit output tokens
- Increase initContextSize when provisioning the model (see the Model Registry API)
Next Steps
- Chat with SmolLM2: Build a complete chat application
- Semantic Search: Use embeddings for search
- Model Registry API: Manage your models
- Inference API Reference: Complete API specification