Inference API
The Inference API (mILM) provides OpenAI-compatible inference capabilities for the AI Foundation Package. This service supports chat completions, text embeddings, and model cache management for GGUF models stored in the Model Registry.
Base URL
http://localhost:8083/mimik-ai/openai/v1
The Inference API follows the OpenAI API format, making it compatible with existing OpenAI client libraries and tools.
Authentication
All endpoints require a Bearer token in the Authorization header:
Authorization: Bearer 1234
The default API key is 1234, configured in the [milm-v1] section of the addon .ini file. See Addon Configuration for details.
Quick Reference
| Method | Endpoint | Description |
|---|---|---|
| POST | /chat/completions | Generate chat response |
| POST | /embeddings | Generate text embeddings |
| GET | /models | List loaded models |
| POST | /models | Load model into cache |
| DELETE | /models?modelId={id} | Unload model from cache |
Chat Completions
Generate chat responses using LLM or VLM models.
Create Chat Completion
Request
POST /chat/completions
Headers
| Header | Required | Value |
|---|---|---|
| Content-Type | Yes | application/json |
| Authorization | Yes | Bearer <token> |
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID from the Model Registry |
| messages | array | Yes | Conversation messages |
| stream | boolean | No | Enable streaming responses (default: false) |
| temperature | number | No | Sampling temperature 0.0-2.0 (default: 1.0) |
| top_p | number | No | Nucleus sampling threshold 0.0-1.0 (default: 1.0) |
| max_tokens | integer | No | Maximum tokens to generate |
Message Object
| Field | Type | Description |
|---|---|---|
| role | string | Message role: system, user, assistant, tool |
| content | string | Message content |
Example: Basic Chat
- cURL
- JavaScript
- Python
- OpenAI SDK
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Complete this sentence: AI is like a"}
]
}'
const response = await fetch('http://localhost:8083/mimik-ai/openai/v1/chat/completions', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Bearer 1234'
},
body: JSON.stringify({
model: 'smollm2-360m',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Complete this sentence: AI is like a' }
]
})
});
const result = await response.json();
console.log(result.choices[0].message.content);
import requests
response = requests.post(
"http://localhost:8083/mimik-ai/openai/v1/chat/completions",
headers={
"Content-Type": "application/json",
"Authorization": "Bearer 1234"
},
json={
"model": "smollm2-360m",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Complete this sentence: AI is like a"}
]
}
)
result = response.json()
print(result["choices"][0]["message"]["content"])
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
response = client.chat.completions.create(
model="smollm2-360m",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Complete this sentence: AI is like a"}
]
)
print(response.choices[0].message.content)
Response (200 OK)
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1729591200,
"model": "smollm2-360m",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Running AI locally offers privacy (your data stays on-device), lower latency (no network round-trips), offline capability, and reduced cloud costs."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 20,
"completion_tokens": 32,
"total_tokens": 52
}
}
Streaming Responses
Set "stream": true to receive tokens in real time as they are generated instead of waiting for the full response:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [{"role": "user", "content": "Complete this sentence: AI is like a"}],
"stream": true
}'
Streaming Response (SSE)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{"content":"Silicon"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{"content":" minds"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
JavaScript Streaming Example
const response = await fetch('http://localhost:8083/mimik-ai/openai/v1/chat/completions', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Bearer 1234'
},
body: JSON.stringify({
model: 'smollm2-360m',
messages: [{ role: 'user', content: 'Complete this sentence: AI is like a' }],
stream: true
})
});
// Read the SSE stream incrementally, buffering partial lines across chunks
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep any incomplete trailing line for the next chunk

  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6);
    if (data === '[DONE]') continue; // stream ends after this marker
    const parsed = JSON.parse(data);
    const content = parsed.choices[0]?.delta?.content;
    if (content) {
      process.stdout.write(content); // print tokens as they arrive (Node.js)
    }
  }
}
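Python Streaming Example
The same stream can be consumed from Python. A minimal sketch using the requests library with stream=True; the prompt and the line handling are illustrative:
import json
import requests

response = requests.post(
    "http://localhost:8083/mimik-ai/openai/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer 1234"
    },
    json={
        "model": "smollm2-360m",
        "messages": [{"role": "user", "content": "Complete this sentence: AI is like a"}],
        "stream": True
    },
    stream=True  # keep the connection open and read the SSE body incrementally
)

for line in response.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue
    data = line[len("data: "):]
    if data == "[DONE]":
        break
    chunk = json.loads(data)
    content = chunk["choices"][0].get("delta", {}).get("content")
    if content:
        print(content, end="", flush=True)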
Multi-Turn Conversations
Include conversation history in the messages array:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "What is its population?"}
]
}'
The API is stateless. Include the full conversation history in each request. The model doesn't remember previous requests.
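A common pattern is to keep the history in a list on the client and append each assistant reply before the next turn. A minimal Python sketch; the ask helper is illustrative:
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}

history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question):
    # Append the user turn, send the full history, then record the reply
    history.append({"role": "user", "content": question})
    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=HEADERS,
        json={"model": "smollm2-360m", "messages": history},
    )
    reply = response.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("What is the capital of France?"))
print(ask("What is its population?"))  # the second turn sees the first answer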
Tool Calls
mILM parses <tool_call> tags from model output and returns structured tool calls:
Model Output with Tool Call
I'll check the weather for you.
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>
Parsed Response
{
"choices": [
{
"message": {
"role": "assistant",
"content": "I'll check the weather for you.",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"San Francisco\"}"
}
}
]
},
"finish_reason": "tool_calls"
}
]
}
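The client is responsible for executing the tool and returning its output to the model in a follow-up request. A hedged Python sketch: the get_weather function is a stand-in, and the tool_call_id field on the role: "tool" message follows the OpenAI convention rather than anything documented above.
import json
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}

def get_weather(location):
    # Illustrative stand-in for a real weather lookup
    return {"location": location, "forecast": "sunny", "temperature_c": 21}

messages = [{"role": "user", "content": "What's the weather in San Francisco?"}]
first = requests.post(
    f"{BASE_URL}/chat/completions",
    headers=HEADERS,
    json={"model": "smollm2-360m", "messages": messages},
).json()

choice = first["choices"][0]
if choice["finish_reason"] == "tool_calls":
    # Echo the assistant message (with its tool_calls) back into the history
    messages.append(choice["message"])
    for call in choice["message"]["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        result = get_weather(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # OpenAI-style field, assumed here
            "content": json.dumps(result),
        })
    final = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=HEADERS,
        json={"model": "smollm2-360m", "messages": messages},
    ).json()
    print(final["choices"][0]["message"]["content"])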
Generation Parameters
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| temperature | number | 0.0-2.0 | 1.0 | Randomness (higher = more creative) |
| top_p | number | 0.0-1.0 | 1.0 | Nucleus sampling threshold |
| max_tokens | integer | 1-∞ | model limit | Maximum response tokens |
Example with Parameters
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [{"role": "user", "content": "Tell me a creative story"}],
"temperature": 0.8,
"top_p": 0.9,
"max_tokens": 500
}'
Embeddings
Generate vector embeddings from text using embedding models.
Create Embeddings
Request
POST /embeddings
Headers
| Header | Required | Value |
|---|---|---|
| Content-Type | Yes | application/json |
| Authorization | Yes | Bearer <token> |
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Embedding model ID |
| input | string or array | Yes | Text(s) to embed |
Example: Single Input
- cURL
- JavaScript
- Python
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/embeddings" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "nomic-embed-text",
"input": "The quick brown fox jumps over the lazy dog."
}'
const response = await fetch('http://localhost:8083/mimik-ai/openai/v1/embeddings', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Bearer 1234'
},
body: JSON.stringify({
model: 'nomic-embed-text',
input: 'The quick brown fox jumps over the lazy dog.'
})
});
const result = await response.json();
console.log('Embedding dimensions:', result.data[0].embedding.length);
import requests
response = requests.post(
"http://localhost:8083/mimik-ai/openai/v1/embeddings",
headers={
"Content-Type": "application/json",
"Authorization": "Bearer 1234"
},
json={
"model": "nomic-embed-text",
"input": "The quick brown fox jumps over the lazy dog."
}
)
result = response.json()
print(f"Embedding dimensions: {len(result['data'][0]['embedding'])}")
Response (200 OK)
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.0023, -0.0094, 0.0152, ...]
}
],
"model": "nomic-embed-text",
"usage": {
"prompt_tokens": 10,
"total_tokens": 10
}
}
Example: Batch Input
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/embeddings" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "nomic-embed-text",
"input": [
"First text to embed",
"Second text to embed",
"Third text to embed"
]
}'
Batch Response
{
"object": "list",
"data": [
{"object": "embedding", "index": 0, "embedding": [...]},
{"object": "embedding", "index": 1, "embedding": [...]},
{"object": "embedding", "index": 2, "embedding": [...]}
],
"model": "nomic-embed-text",
"usage": {
"prompt_tokens": 15,
"total_tokens": 15
}
}
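Batch embeddings are handy for semantic similarity. A minimal sketch that embeds two sentences in one request and compares them with cosine similarity; the sentences and the cosine helper are illustrative:
import math
import requests

response = requests.post(
    "http://localhost:8083/mimik-ai/openai/v1/embeddings",
    headers={"Content-Type": "application/json", "Authorization": "Bearer 1234"},
    json={
        "model": "nomic-embed-text",
        "input": [
            "The quick brown fox jumps over the lazy dog.",
            "A fast auburn fox leaps above a sleepy hound."
        ],
    },
)
vectors = [item["embedding"] for item in response.json()["data"]]

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(f"Similarity: {cosine(vectors[0], vectors[1]):.4f}")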
Batch Constraints
| Constraint | Limit | Description |
|---|---|---|
| Maximum items | 50 | Maximum number of input strings per request |
| Input type | string or string[] | Each element must be a string |
| Token limit | Model-specific | Each input string is subject to the model's maximum token limit |
Batch Error Codes
| Code | Cause |
|---|---|
| 400 | Input array exceeds 50 items |
| 400 | Input array contains non-string elements |
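To embed a corpus larger than 50 items, split it into batches before calling the endpoint. A minimal sketch, assuming the documented 50-item limit; the embed_all helper is illustrative:
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}
MAX_BATCH = 50  # documented per-request limit on input items

def embed_all(texts, model="nomic-embed-text"):
    embeddings = []
    for start in range(0, len(texts), MAX_BATCH):
        batch = texts[start:start + MAX_BATCH]
        response = requests.post(
            f"{BASE_URL}/embeddings",
            headers=HEADERS,
            json={"model": model, "input": batch},
        )
        response.raise_for_status()
        # Results are indexed per batch; extend in input order
        embeddings.extend(item["embedding"] for item in response.json()["data"])
    return embeddings

vectors = embed_all([f"Document {i}" for i in range(120)])
print(len(vectors))  # 120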
Model Cache Management
mILM maintains a runtime cache of loaded models. Models are loaded from the Model Registry on demand.
List Loaded Models
List models currently loaded in the runtime cache.
Request
GET /models
Example
curl "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Authorization: Bearer 1234"
Response (200 OK)
{
"data": [
{
"id": "smollm2-360m",
"object": "model",
"created": 1769534258,
"owned_by": "mimik",
"info": {
"kind": "llm",
"chat_template_hint": "chatml",
"n_gpu_layers": 99,
"max_context": 2048,
"n_vocab": 49152,
"n_ctx_train": 8192,
"n_embd": 960,
"n_params": 361821120,
"model_size": 384618240
},
"metrics": {
"inference_count": 12,
"last_used": 1769534258,
"loaded_at": 1769530800,
"tokens_per_second": 227.43,
"avg_tokens_per_second": 198.65
}
}
],
"object": "list"
}
Model Info Fields
| Field | Type | Description |
|---|---|---|
| kind | string | Model type: "llm", "vlm", or "embed" |
| chat_template_hint | string | Chat template applied during loading |
| n_gpu_layers | integer | Number of layers offloaded to GPU |
| max_context | integer | Maximum context size used at load time |
| n_vocab | integer | Vocabulary size |
| n_ctx_train | integer | Training context length |
| n_embd | integer | Embedding dimension size |
| n_params | integer | Total parameter count |
| model_size | integer | Model file size in bytes |
Model Metrics by Kind
| Metric | LLM / VLM | Embed | Description |
|---|---|---|---|
| inference_count | Yes | Yes | Total number of inference calls |
| tokens_per_second | Yes | No | Token throughput of the most recent inference |
| avg_tokens_per_second | Yes | No | Average token throughput across all inferences |
| last_latency_ms | No | Yes | Latency of the most recent inference in milliseconds |
| avg_latency_ms | No | Yes | Average latency across all inferences in milliseconds |
| last_used | Yes | Yes | Unix epoch timestamp (seconds) of last inference |
| loaded_at | Yes | Yes | Unix epoch timestamp (seconds) when the model was loaded |
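These metrics can help identify idle models in the cache. A minimal sketch that lists loaded models and prints throughput for LLM/VLM models and latency for embed models; field access follows the response examples in this section:
import time
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Authorization": "Bearer 1234"}

models = requests.get(f"{BASE_URL}/models", headers=HEADERS).json()["data"]
now = time.time()

for model in models:
    info = model["info"]
    metrics = model["metrics"]
    if info["kind"] == "embed":
        perf = f"avg latency {metrics.get('avg_latency_ms', 0):.2f} ms"
    else:
        perf = f"avg throughput {metrics.get('avg_tokens_per_second', 0):.1f} tok/s"
    last_used = metrics.get("last_used")
    idle = f"idle {now - last_used:.0f}s" if last_used else "never used"
    print(f"{model['id']} ({info['kind']}): {metrics['inference_count']} calls, {perf}, {idle}")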
Example: Embed Model Response
{
"data": [
{
"id": "nomic-embed-text-v1.5.Q8_0",
"object": "model",
"created": 1769534320,
"owned_by": "mimik",
"info": {
"kind": "embed",
"chat_template_hint": "",
"n_gpu_layers": -1,
"max_context": 2048,
"n_vocab": 30522,
"n_ctx_train": 2048,
"n_embd": 768,
"n_params": 136727040,
"model_size": 145389792
},
"metrics": {
"inference_count": 1,
"last_used": 1769534324,
"loaded_at": 1769534320,
"last_latency_ms": 10.61,
"avg_latency_ms": 10.61
}
}
],
"object": "list"
}
Load Model
Load a model from the Model Registry into the runtime cache.
Request
POST /models
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID from the Model Registry |
| chatTemplateHint | string | No | Override the chat template (e.g., "chatml", "llama3", "gemma") |
| initParams | object | No | Model initialization overrides |
| initParams.contextSize | integer | No | Override the default context window size |
| initParams.gpuLayerSize | integer | No | Number of layers to offload to GPU |
Example
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m"
}'
Response (201 Created)
The response is streamed as Server-Sent Events (SSE) with Content-Type: text/event-stream.
Progress events are emitted while the model loads:
data: {"progress":"<|loading_model|> 0%"}
data: {"progress":"<|loading_model|> 25%"}
data: {"progress":"<|loading_model|> 50%"}
data: {"progress":"<|loading_model|> 100%"}
Final event contains the loaded model object:
{
"id": "smollm2-360m",
"object": "model",
"created": 1769534258,
"owned_by": "mimik",
"info": {
"kind": "llm",
"chat_template_hint": "chatml",
"n_gpu_layers": 99,
"max_context": 2048,
"n_vocab": 49152,
"n_ctx_train": 8192,
"n_embd": 960,
"n_params": 361821120,
"model_size": 384618240
},
"metrics": {
"inference_count": 0,
"tokens_per_second": 0,
"avg_tokens_per_second": 0,
"last_used": null,
"loaded_at": 1769534258
}
}
Models are automatically loaded on first inference request. Explicit loading is optional but useful for warming up the cache.
Auto-load can fail with the following errors:
- 404: The model ID was not found in the Model Registry.
- 400: The model exists but is not ready (readyToUse: false). Complete the model provisioning by uploading or downloading the model file first.
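To warm up the cache programmatically, consume the SSE progress events returned by POST /models. A minimal Python sketch using requests with stream=True; handling of the final event allows for it arriving with or without the data: prefix, which is not specified above:
import json
import requests

response = requests.post(
    "http://localhost:8083/mimik-ai/openai/v1/models",
    headers={"Content-Type": "application/json", "Authorization": "Bearer 1234"},
    json={"model": "smollm2-360m"},
    stream=True,  # the 201 response is a text/event-stream of progress events
)

for line in response.iter_lines(decode_unicode=True):
    if not line:
        continue
    # Progress events are prefixed with "data: "; strip it if present
    payload = line[len("data: "):] if line.startswith("data: ") else line
    try:
        event = json.loads(payload)
    except ValueError:
        continue
    if "progress" in event:
        print(event["progress"])          # e.g. "<|loading_model|> 50%"
    elif "id" in event:
        print("Loaded:", event["id"])     # final event is the model object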
Unload Model
Remove a model from the runtime cache. The model files remain in the Model Registry and can be reloaded later.
Request
DELETE /models?modelId={id}
| Parameter | Location | Required | Description |
|---|---|---|---|
| modelId | query | Yes | Model ID to unload |
Example
curl -X DELETE "http://localhost:8083/mimik-ai/openai/v1/models?modelId=smollm2-360m" \
-H "Authorization: Bearer 1234"
Response (200 OK)
{
"id": "smollm2-360m",
"object": "model",
"deleted": true
}
Unload models you're not actively using to free memory for other models. The model files remain in the Model Registry and can be reloaded later.
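On memory-constrained devices, one approach is to unload the current model before loading the next one. A minimal sketch; the swap_model helper and the model IDs are illustrative:
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Authorization": "Bearer 1234"}

def swap_model(unload_id, load_id):
    # Free the memory held by the old model, then load the new one
    requests.delete(f"{BASE_URL}/models", headers=HEADERS, params={"modelId": unload_id})
    # Reading the POST response blocks until the SSE load stream completes
    requests.post(
        f"{BASE_URL}/models",
        headers={**HEADERS, "Content-Type": "application/json"},
        json={"model": load_id},
    )

swap_model("smollm2-360m", "nomic-embed-text")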
Model Lifecycle
The Model Registry and the Inference API share the following lifecycle:
- A model is created and provisioned in the Model Registry (metadata plus an uploaded or downloaded GGUF file).
- The model is loaded into the mILM runtime cache, either explicitly via POST /models or automatically on the first inference request.
- While cached, the model serves /chat/completions or /embeddings requests.
- DELETE /models?modelId={id} removes the model from the cache and frees memory; the files remain in the Model Registry and can be reloaded later.
Error Responses
| Code | Description |
|---|---|
| 400 | Bad request (invalid input or model not ready) |
| 401 | Unauthorized (missing or invalid API key) |
| 404 | Not found (model not in Model Registry) |
| 500 | Internal server error |
Error Format
{
"message": "Model 'unknown-model' not found in store",
"statusCode": 404
}
This error format deviates from the standard OpenAI API error format, which wraps errors in an error object. Handle errors by checking for message and statusCode at the top level of the response body.
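If you are calling the API with a plain HTTP client, a small helper can normalize this error shape. A minimal sketch; the InferenceAPIError class and check_response helper are illustrative:
import requests

class InferenceAPIError(Exception):
    """Raised when the Inference API returns an error body."""

def check_response(response):
    # mILM errors put message and statusCode at the top level of the body
    if response.status_code >= 400:
        try:
            message = response.json().get("message", response.text)
        except ValueError:
            message = response.text
        raise InferenceAPIError(f"{response.status_code}: {message}")
    return response.json()

response = requests.post(
    "http://localhost:8083/mimik-ai/openai/v1/chat/completions",
    headers={"Content-Type": "application/json", "Authorization": "Bearer 1234"},
    json={"model": "unknown-model", "messages": [{"role": "user", "content": "Hi"}]},
)
result = check_response(response)  # raises InferenceAPIError with the 404 message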
Common Errors
Model Not Found
{
"message": "Model 'smollm2-360m' not found in store",
"statusCode": 404
}
Cause: Model doesn't exist in the Model Registry. Solution: Create the model in the Model Registry first.
Model Not Ready
{
"message": "Model 'smollm2-360m' is not ready (readyToUse: false)",
"statusCode": 400
}
Cause: Model metadata exists but file hasn't been uploaded/downloaded. Solution: Complete the model provisioning by uploading or downloading the file.
OpenAI SDK Compatibility
The Inference API is compatible with the official OpenAI Python and JavaScript SDKs:
Python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234" # Your mimOE API key
)
# Chat completion
response = client.chat.completions.create(
model="smollm2-360m",
messages=[{"role": "user", "content": "Hello!"}]
)
# Embeddings
embeddings = client.embeddings.create(
model="nomic-embed-text",
input="Hello, world!"
)
JavaScript/TypeScript
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234' // Your mimOE API key
});
// Chat completion
const response = await client.chat.completions.create({
model: 'smollm2-360m',
messages: [{ role: 'user', content: 'Hello!' }]
});
// Embeddings
const embeddings = await client.embeddings.create({
model: 'nomic-embed-text',
input: 'Hello, world!'
});