
Inference API

The Inference API (mILM) provides OpenAI-compatible inference capabilities for the AI Foundation Package. This service supports chat completions, text embeddings, and model cache management for GGUF models stored in the Model Registry.

Base URL

http://localhost:8083/mimik-ai/openai/v1
info

The Inference API follows the OpenAI API format, making it compatible with existing OpenAI client libraries and tools.

Authentication

All endpoints require a Bearer token in the Authorization header:

Authorization: Bearer 1234

The default API key is 1234, configured in the [milm-v1] section of the addon .ini file. See Addon Configuration for details.
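
For example, a minimal Python helper that attaches the key to every request (a sketch assuming the requests library; substitute your configured key for the default 1234):

import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
API_KEY = "1234"  # default key from the [milm-v1] section of the addon .ini file

# Headers reused by the Python examples below
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

resp = requests.get(f"{BASE_URL}/models", headers=HEADERS)
resp.raise_for_status()
print(resp.json())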

Quick Reference

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /chat/completions | Generate chat response |
| POST | /embeddings | Generate text embeddings |
| GET | /models | List loaded models |
| POST | /models | Load model into cache |
| DELETE | /models?modelId={id} | Unload model from cache |

Chat Completions

Generate chat responses using LLM or VLM models.

Create Chat Completion

Request

POST /chat/completions

Headers

| Header | Required | Value |
| --- | --- | --- |
| Content-Type | Yes | application/json |
| Authorization | Yes | Bearer <token> |

Request Body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model ID from the Model Registry |
| messages | array | Yes | Conversation messages |
| stream | boolean | No | Enable streaming responses (default: false) |
| temperature | number | No | Sampling temperature 0.0-2.0 (default: 1.0) |
| top_p | number | No | Nucleus sampling threshold 0.0-1.0 (default: 1.0) |
| max_tokens | integer | No | Maximum tokens to generate |

Message Object

| Field | Type | Description |
| --- | --- | --- |
| role | string | Message role: system, user, assistant, or tool |
| content | string | Message content |

Example: Basic Chat

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "smollm2-360m",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Complete this sentence: AI is like a"}
    ]
  }'

Response (200 OK)

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1729591200,
  "model": "smollm2-360m",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "AI is like a well-stocked library with a librarian who has read every book: ask a question and it assembles an answer from everything it has absorbed."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 32,
    "total_tokens": 52
  }
}

Streaming Responses

Enable real-time token streaming for better UX:

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "smollm2-360m",
    "messages": [{"role": "user", "content": "Complete this sentence: AI is like a"}],
    "stream": true
  }'

Streaming Response (SSE)

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{"content":"Silicon"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{"content":" minds"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

JavaScript Streaming Example

const response = await fetch('http://localhost:8083/mimik-ai/openai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer 1234'
  },
  body: JSON.stringify({
    model: 'smollm2-360m',
    messages: [{ role: 'user', content: 'Complete this sentence: AI is like a' }],
    stream: true
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // stream: true keeps multi-byte characters intact across chunk boundaries
  const chunk = decoder.decode(value, { stream: true });
  const lines = chunk.split('\n').filter(line => line.startsWith('data: '));

  for (const line of lines) {
    const data = line.slice(6);
    if (data === '[DONE]') break;

    const parsed = JSON.parse(data);
    const content = parsed.choices[0]?.delta?.content;
    if (content) {
      process.stdout.write(content);
    }
  }
}

Multi-Turn Conversations

Include conversation history in the messages array:

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "smollm2-360m",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"},
      {"role": "assistant", "content": "The capital of France is Paris."},
      {"role": "user", "content": "What is its population?"}
    ]
  }'
Conversation Management

The API is stateless. Include the full conversation history in each request. The model doesn't remember previous requests.
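
A minimal Python sketch of client-side history management (the requests library is assumed; the model ID matches the examples above):

import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}

# The client owns the history; every request resends it in full.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question):
    history.append({"role": "user", "content": question})
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=HEADERS,
        json={"model": "smollm2-360m", "messages": history},
    )
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"]
    # Append the assistant turn so the next request carries the full context.
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("What is the capital of France?"))
print(ask("What is its population?"))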

Tool Calls

mILM parses <tool_call> tags from model output and returns structured tool calls:

Model Output with Tool Call

I'll check the weather for you.
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>

Parsed Response

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "I'll check the weather for you.",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
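
A hedged Python sketch of the client side of this loop: check finish_reason, run the named function locally, and return the result as a tool message. How available tools are described to the model (e.g., in the system prompt) is omitted, the get_weather implementation is a placeholder, and the shape of the follow-up messages (including tool_call_id) follows the OpenAI convention and may need adjusting:

import json
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}

def get_weather(location):
    # Placeholder implementation; replace with a real lookup.
    return {"location": location, "forecast": "sunny", "temp_c": 21}

messages = [{"role": "user", "content": "What's the weather in San Francisco?"}]
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers=HEADERS,
    json={"model": "smollm2-360m", "messages": messages},
)
choice = resp.json()["choices"][0]

if choice["finish_reason"] == "tool_calls":
    # Keep the assistant turn that requested the tool call.
    messages.append(choice["message"])
    for call in choice["message"]["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        result = get_weather(**args)
        # Send the tool output back so the model can produce the final answer.
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    final = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=HEADERS,
        json={"model": "smollm2-360m", "messages": messages},
    )
    print(final.json()["choices"][0]["message"]["content"])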

Generation Parameters

| Parameter | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| temperature | number | 0.0-2.0 | 1.0 | Randomness (higher = more creative) |
| top_p | number | 0.0-1.0 | 1.0 | Nucleus sampling threshold |
| max_tokens | integer | 1-∞ | model limit | Maximum response tokens |

Example with Parameters

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "smollm2-360m",
    "messages": [{"role": "user", "content": "Tell me a creative story"}],
    "temperature": 0.8,
    "top_p": 0.9,
    "max_tokens": 500
  }'

Embeddings

Generate vector embeddings from text using embedding models.

Create Embeddings

Request

POST /embeddings

Headers

| Header | Required | Value |
| --- | --- | --- |
| Content-Type | Yes | application/json |
| Authorization | Yes | Bearer <token> |

Request Body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Embedding model ID |
| input | string or array | Yes | Text(s) to embed |

Example: Single Input

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/embeddings" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "nomic-embed-text",
    "input": "The quick brown fox jumps over the lazy dog."
  }'

Response (200 OK)

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023, -0.0094, 0.0152, ...]
    }
  ],
  "model": "nomic-embed-text",
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10
  }
}

Example: Batch Input

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/embeddings" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "nomic-embed-text",
    "input": [
      "First text to embed",
      "Second text to embed",
      "Third text to embed"
    ]
  }'

Batch Response

{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [...]},
    {"object": "embedding", "index": 1, "embedding": [...]},
    {"object": "embedding", "index": 2, "embedding": [...]}
  ],
  "model": "nomic-embed-text",
  "usage": {
    "prompt_tokens": 15,
    "total_tokens": 15
  }
}
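
One common use of a batch like this is semantic similarity. A small Python sketch (requests only, cosine similarity computed by hand; the model ID matches the example above):

import math
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}

texts = ["First text to embed", "Second text to embed", "Third text to embed"]
resp = requests.post(
    f"{BASE_URL}/embeddings",
    headers=HEADERS,
    json={"model": "nomic-embed-text", "input": texts},
)
resp.raise_for_status()

# Results carry an index; sort before pairing vectors with their inputs.
vectors = [d["embedding"] for d in sorted(resp.json()["data"], key=lambda d: d["index"])]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(f"similarity(0, 1) = {cosine(vectors[0], vectors[1]):.4f}")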

Batch Constraints

| Constraint | Limit | Description |
| --- | --- | --- |
| Maximum items | 50 | Maximum number of input strings per request |
| Input type | string or string[] | Each element must be a string |
| Token limit | Model-specific | Each input string is subject to the model's maximum token limit |

Batch Error Codes

| Code | Cause |
| --- | --- |
| 400 | Input array exceeds 50 items |
| 400 | Input array contains non-string elements |
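
To stay under the 50-item limit, split larger corpora into chunks and merge the results. A sketch (the batch size constant and helper name are illustrative):

import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}
MAX_BATCH = 50  # per-request limit documented above

def embed_all(texts, model="nomic-embed-text"):
    vectors = []
    for start in range(0, len(texts), MAX_BATCH):
        chunk = texts[start:start + MAX_BATCH]
        resp = requests.post(
            f"{BASE_URL}/embeddings",
            headers=HEADERS,
            json={"model": model, "input": chunk},
        )
        resp.raise_for_status()
        data = sorted(resp.json()["data"], key=lambda d: d["index"])
        vectors.extend(d["embedding"] for d in data)
    return vectors

corpus = [f"Document {i}" for i in range(120)]
print(len(embed_all(corpus)))  # 120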

Model Cache Management

mILM maintains a runtime cache of loaded models. Models are loaded from the Model Registry on demand.

List Loaded Models

List models currently loaded in the runtime cache.

Request

GET /models

Example

curl "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Authorization: Bearer 1234"

Response (200 OK)

{
  "data": [
    {
      "id": "smollm2-360m",
      "object": "model",
      "created": 1769534258,
      "owned_by": "mimik",
      "info": {
        "kind": "llm",
        "chat_template_hint": "chatml",
        "n_gpu_layers": 99,
        "max_context": 2048,
        "n_vocab": 49152,
        "n_ctx_train": 8192,
        "n_embd": 960,
        "n_params": 361821120,
        "model_size": 384618240
      },
      "metrics": {
        "inference_count": 12,
        "last_used": 1769534258,
        "loaded_at": 1769530800,
        "tokens_per_second": 227.43,
        "avg_tokens_per_second": 198.65
      }
    }
  ],
  "object": "list"
}

Model Info Fields

| Field | Type | Description |
| --- | --- | --- |
| kind | string | Model type: "llm", "vlm", or "embed" |
| chat_template_hint | string | Chat template applied during loading |
| n_gpu_layers | integer | Number of layers offloaded to GPU |
| max_context | integer | Maximum context size used at load time |
| n_vocab | integer | Vocabulary size |
| n_ctx_train | integer | Training context length |
| n_embd | integer | Embedding dimension size |
| n_params | integer | Total parameter count |
| model_size | integer | Model file size in bytes |

Model Metrics by Kind

| Metric | LLM / VLM | Embed | Description |
| --- | --- | --- | --- |
| inference_count | Yes | Yes | Total number of inference calls |
| tokens_per_second | Yes | No | Token throughput of the most recent inference |
| avg_tokens_per_second | Yes | No | Average token throughput across all inferences |
| last_latency_ms | No | Yes | Latency of the most recent inference in milliseconds |
| avg_latency_ms | No | Yes | Average latency across all inferences in milliseconds |
| last_used | Yes | Yes | Unix epoch timestamp (seconds) of last inference |
| loaded_at | Yes | Yes | Unix epoch timestamp (seconds) when the model was loaded |

Example: Embed Model Response

{
  "data": [
    {
      "id": "nomic-embed-text-v1.5.Q8_0",
      "object": "model",
      "created": 1769534320,
      "owned_by": "mimik",
      "info": {
        "kind": "embed",
        "chat_template_hint": "",
        "n_gpu_layers": -1,
        "max_context": 2048,
        "n_vocab": 30522,
        "n_ctx_train": 2048,
        "n_embd": 768,
        "n_params": 136727040,
        "model_size": 145389792
      },
      "metrics": {
        "inference_count": 1,
        "last_used": 1769534324,
        "loaded_at": 1769534320,
        "last_latency_ms": 10.61,
        "avg_latency_ms": 10.61
      }
    }
  ],
  "object": "list"
}
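
A small Python sketch that lists the cache and prints the metric appropriate to each model kind (field names match the tables above):

import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Authorization": "Bearer 1234"}

resp = requests.get(f"{BASE_URL}/models", headers=HEADERS)
resp.raise_for_status()

for model in resp.json()["data"]:
    kind = model["info"]["kind"]
    metrics = model["metrics"]
    if kind == "embed":
        # Embedding models report latency rather than token throughput.
        print(f"{model['id']}: {metrics['avg_latency_ms']} ms avg latency")
    else:
        print(f"{model['id']}: {metrics['avg_tokens_per_second']} tok/s avg")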

Load Model

Load a model from the Model Registry into the runtime cache.

Request

POST /models

Request Body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model ID from the Model Registry |
| chatTemplateHint | string | No | Override the chat template (e.g., "chatml", "llama3", "gemma") |
| initParams | object | No | Model initialization overrides |
| initParams.contextSize | integer | No | Override the default context window size |
| initParams.gpuLayerSize | integer | No | Number of layers to offload to GPU |

Example

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "smollm2-360m"
  }'
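
A request body that overrides the load defaults might look like the following (the values are illustrative; the field names come from the table above):

{
  "model": "smollm2-360m",
  "chatTemplateHint": "chatml",
  "initParams": {
    "contextSize": 4096,
    "gpuLayerSize": 0
  }
}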

Response (201 Created)

The response is streamed as Server-Sent Events (SSE) with Content-Type: text/event-stream.

Progress events are emitted while the model loads:

data: {"progress":"<|loading_model|> 0%"}

data: {"progress":"<|loading_model|> 25%"}

data: {"progress":"<|loading_model|> 50%"}

data: {"progress":"<|loading_model|> 100%"}

Final event contains the loaded model object:

{
  "id": "smollm2-360m",
  "object": "model",
  "created": 1769534258,
  "owned_by": "mimik",
  "info": {
    "kind": "llm",
    "chat_template_hint": "chatml",
    "n_gpu_layers": 99,
    "max_context": 2048,
    "n_vocab": 49152,
    "n_ctx_train": 8192,
    "n_embd": 960,
    "n_params": 361821120,
    "model_size": 384618240
  },
  "metrics": {
    "inference_count": 0,
    "tokens_per_second": 0,
    "avg_tokens_per_second": 0,
    "last_used": null,
    "loaded_at": 1769534258
  }
}
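
A Python sketch that issues the load request and consumes the event stream line by line (requests with stream=True is assumed; any event without a progress field is treated as the final model object):

import json
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}

resp = requests.post(
    f"{BASE_URL}/models",
    headers=HEADERS,
    json={"model": "smollm2-360m"},
    stream=True,  # read the SSE stream incrementally
)
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    if not line:
        continue  # skip blank separator lines
    # Strip the SSE "data: " prefix when present.
    payload = line[len("data: "):] if line.startswith("data: ") else line
    event = json.loads(payload)
    if "progress" in event:
        print(event["progress"])
    else:
        # Final event: the loaded model object.
        print("loaded:", event["id"])
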
Auto-Load

Models are automatically loaded on first inference request. Explicit loading is optional but useful for warming up the cache.

Auto-load can fail with the following errors:

  • 404: The model ID was not found in the Model Registry.
  • 400: The model exists but is not ready (readyToUse: false). Complete the model provisioning by uploading or downloading the model file first.

Unload Model

Remove a model from the runtime cache. The model files remain in the Model Registry and can be reloaded later.

Request

DELETE /models?modelId={id}

| Parameter | Location | Required | Description |
| --- | --- | --- | --- |
| modelId | query | Yes | Model ID to unload |

Example

curl -X DELETE "http://localhost:8083/mimik-ai/openai/v1/models?modelId=smollm2-360m" \
-H "Authorization: Bearer 1234"

Response (200 OK)

{
"id": "smollm2-360m",
"object": "model",
"deleted": true
}
Memory Management

Unload models you're not actively using to free memory for other models. The model files remain in the Model Registry and can be reloaded later.


Model Lifecycle

The following diagram shows the relationship between the Model Registry and the Inference API:


Error Responses

| Code | Description |
| --- | --- |
| 400 | Bad request (invalid input or model not ready) |
| 401 | Unauthorized (missing or invalid API key) |
| 404 | Not found (model not in Model Registry) |
| 500 | Internal server error |

Error Format

{
"message": "Model 'unknown-model' not found in store",
"statusCode": 404
}
warning

This error format deviates from the standard OpenAI API error format, which wraps errors in an error object. Handle errors by checking for message and statusCode at the top level of the response body.
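
A small Python helper reflecting that shape (a sketch; it raises on any non-2xx response using the top-level fields):

import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}

class InferenceApiError(Exception):
    pass

def post(path, body):
    resp = requests.post(f"{BASE_URL}{path}", headers=HEADERS, json=body)
    if not resp.ok:
        err = resp.json()
        # Errors carry message/statusCode at the top level, not under "error".
        raise InferenceApiError(f"{err.get('statusCode')}: {err.get('message')}")
    return resp.json()

try:
    post("/chat/completions", {"model": "unknown-model",
                               "messages": [{"role": "user", "content": "Hi"}]})
except InferenceApiError as exc:
    print(exc)  # e.g. 404: Model 'unknown-model' not found in store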

Common Errors

Model Not Found

{
"message": "Model 'smollm2-360m' not found in store",
"statusCode": 404
}

Cause: Model doesn't exist in the Model Registry. Solution: Create the model in the Model Registry first.

Model Not Ready

{
"message": "Model 'smollm2-360m' is not ready (readyToUse: false)",
"statusCode": 400
}

Cause: Model metadata exists but file hasn't been uploaded/downloaded. Solution: Complete the model provisioning by uploading or downloading the file.


OpenAI SDK Compatibility

The Inference API is compatible with the official OpenAI Python and JavaScript SDKs:

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8083/mimik-ai/openai/v1",
    api_key="1234"  # Your mimOE API key
)

# Chat completion
response = client.chat.completions.create(
    model="smollm2-360m",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Embeddings
embeddings = client.embeddings.create(
    model="nomic-embed-text",
    input="Hello, world!"
)

JavaScript/TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234' // Your mimOE API key
});

// Chat completion
const response = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: [{ role: 'user', content: 'Hello!' }]
});

// Embeddings
const embeddings = await client.embeddings.create({
  model: 'nomic-embed-text',
  input: 'Hello, world!'
});