Inference API
The Inference API (mILM) provides OpenAI-compatible inference capabilities for the AI Foundation Package. This service supports chat completions, text embeddings, and model cache management for GGUF models stored in the Model Registry.
Base URL
http://localhost:8083/mimik-ai/openai/v1
The Inference API follows the OpenAI API format, making it compatible with existing OpenAI client libraries and tools.
Authentication
All endpoints require a Bearer token in the Authorization header:
Authorization: Bearer 1234
The default API key is 1234, configured in the [milm-v1] section of the addon .ini file. See Addon Configuration for details.
Quick Reference
| Method | Endpoint | Description |
|---|---|---|
| POST | /chat/completions | Generate chat response |
| POST | /embeddings | Generate text embeddings |
| GET | /models | List loaded models |
| POST | /models | Load model into cache |
| DELETE | /models?modelId={id} | Unload model from cache |
Chat Completions
Generate chat responses using LLM or VLM models.
Create Chat Completion
Request
POST /chat/completions
Headers
| Header | Required | Value |
|---|---|---|
| Content-Type | Yes | application/json |
| Authorization | Yes | Bearer <token> |
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID from the Model Registry |
| messages | array | Yes | Conversation messages |
| stream | boolean | No | Enable streaming responses (default: false) |
| temperature | number | No | Sampling temperature 0.0-2.0 (default: 1.0) |
| top_p | number | No | Nucleus sampling threshold 0.0-1.0 (default: 1.0) |
| max_tokens | integer | No | Maximum tokens to generate |
Message Object
| Field | Type | Description |
|---|---|---|
| role | string | Message role: system, user, assistant, tool |
| content | string | Message content |
Example: Basic Chat
- cURL
- JavaScript
- Python
- OpenAI SDK
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Complete this sentence: AI is like a"}
]
}'
const response = await fetch('http://localhost:8083/mimik-ai/openai/v1/chat/completions', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Bearer 1234'
},
body: JSON.stringify({
model: 'smollm2-360m',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Complete this sentence: AI is like a' }
]
})
});
const result = await response.json();
console.log(result.choices[0].message.content);
import requests
response = requests.post(
"http://localhost:8083/mimik-ai/openai/v1/chat/completions",
headers={
"Content-Type": "application/json",
"Authorization": "Bearer 1234"
},
json={
"model": "smollm2-360m",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Complete this sentence: AI is like a"}
]
}
)
result = response.json()
print(result["choices"][0]["message"]["content"])
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
response = client.chat.completions.create(
model="smollm2-360m",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Complete this sentence: AI is like a"}
]
)
print(response.choices[0].message.content)
Response (200 OK)
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1729591200,
"model": "smollm2-360m",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Running AI locally offers privacy (your data stays on-device), lower latency (no network round-trips), offline capability, and reduced cloud costs."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 20,
"completion_tokens": 32,
"total_tokens": 52
}
}
Streaming Responses
Set "stream": true to receive tokens in real time as they are generated instead of waiting for the full response:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [{"role": "user", "content": "Complete this sentence: AI is like a"}],
"stream": true
}'
Streaming Response (SSE)
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{"content":"Silicon"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{"content":" minds"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729591200,"model":"smollm2-360m","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
JavaScript Streaming Example
const response = await fetch('http://localhost:8083/mimik-ai/openai/v1/chat/completions', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Bearer 1234'
},
body: JSON.stringify({
model: 'smollm2-360m',
messages: [{ role: 'user', content: 'Complete this sentence: AI is like a' }],
stream: true
})
});
// Read the SSE stream incrementally, buffering partial lines across chunks
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep any incomplete trailing line for the next chunk

  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6);
    if (data === '[DONE]') continue; // stream ends after this marker
    const parsed = JSON.parse(data);
    const content = parsed.choices[0]?.delta?.content;
    if (content) {
      process.stdout.write(content); // print tokens as they arrive (Node.js)
    }
  }
}
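Python Streaming Example
The same stream can be consumed from Python. A minimal sketch using the requests library with stream=True; the prompt and the line handling are illustrative:
import json
import requests

response = requests.post(
    "http://localhost:8083/mimik-ai/openai/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer 1234"
    },
    json={
        "model": "smollm2-360m",
        "messages": [{"role": "user", "content": "Complete this sentence: AI is like a"}],
        "stream": True
    },
    stream=True  # keep the connection open and read the SSE body incrementally
)

for line in response.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue
    data = line[len("data: "):]
    if data == "[DONE]":
        break
    chunk = json.loads(data)
    content = chunk["choices"][0].get("delta", {}).get("content")
    if content:
        print(content, end="", flush=True)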
Multi-Turn Conversations
Include conversation history in the messages array:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "What is its population?"}
]
}'
The API is stateless. Include the full conversation history in each request. The model doesn't remember previous requests.
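A common pattern is to keep the history in a list on the client and append each assistant reply before the next turn. A minimal Python sketch; the ask helper is illustrative:
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}

history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question):
    # Append the user turn, send the full history, then record the reply
    history.append({"role": "user", "content": question})
    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=HEADERS,
        json={"model": "smollm2-360m", "messages": history},
    )
    reply = response.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("What is the capital of France?"))
print(ask("What is its population?"))  # the second turn sees the first answer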
Tool Calls
mILM parses <tool_call> tags from model output and returns structured tool calls:
Model Output with Tool Call
I'll check the weather for you.
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>
Parsed Response
{
"choices": [
{
"message": {
"role": "assistant",
"content": "I'll check the weather for you.",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"San Francisco\"}"
}
}
]
},
"finish_reason": "tool_calls"
}
]
}
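The client is responsible for executing the tool and returning its output to the model in a follow-up request. A hedged Python sketch: the get_weather function is a stand-in, and the tool_call_id field on the role: "tool" message follows the OpenAI convention rather than anything documented above.
import json
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}

def get_weather(location):
    # Illustrative stand-in for a real weather lookup
    return {"location": location, "forecast": "sunny", "temperature_c": 21}

messages = [{"role": "user", "content": "What's the weather in San Francisco?"}]
first = requests.post(
    f"{BASE_URL}/chat/completions",
    headers=HEADERS,
    json={"model": "smollm2-360m", "messages": messages},
).json()

choice = first["choices"][0]
if choice["finish_reason"] == "tool_calls":
    # Echo the assistant message (with its tool_calls) back into the history
    messages.append(choice["message"])
    for call in choice["message"]["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        result = get_weather(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # OpenAI-style field, assumed here
            "content": json.dumps(result),
        })
    final = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=HEADERS,
        json={"model": "smollm2-360m", "messages": messages},
    ).json()
    print(final["choices"][0]["message"]["content"])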
Generation Parameters
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| temperature | number | 0.0-2.0 | 1.0 | Randomness (higher = more creative) |
| top_p | number | 0.0-1.0 | 1.0 | Nucleus sampling threshold |
| max_tokens | integer | 1-∞ | model limit | Maximum response tokens |
Example with Parameters
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [{"role": "user", "content": "Tell me a creative story"}],
"temperature": 0.8,
"top_p": 0.9,
"max_tokens": 500
}'
Embeddings
Generate vector embeddings from text using embedding models.
Create Embeddings
Request
POST /embeddings
Headers
| Header | Required | Value |
|---|---|---|
| Content-Type | Yes | application/json |
| Authorization | Yes | Bearer <token> |
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Embedding model ID |
| input | string or array | Yes | Text(s) to embed |
Example: Single Input
- cURL
- JavaScript
- Python
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/embeddings" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "nomic-embed-text",
"input": "The quick brown fox jumps over the lazy dog."
}'
const response = await fetch('http://localhost:8083/mimik-ai/openai/v1/embeddings', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Bearer 1234'
},
body: JSON.stringify({
model: 'nomic-embed-text',
input: 'The quick brown fox jumps over the lazy dog.'
})
});
const result = await response.json();
console.log('Embedding dimensions:', result.data[0].embedding.length);
import requests
response = requests.post(
"http://localhost:8083/mimik-ai/openai/v1/embeddings",
headers={
"Content-Type": "application/json",
"Authorization": "Bearer 1234"
},
json={
"model": "nomic-embed-text",
"input": "The quick brown fox jumps over the lazy dog."
}
)
result = response.json()
print(f"Embedding dimensions: {len(result['data'][0]['embedding'])}")
Response (200 OK)
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [0.0023, -0.0094, 0.0152, ...]
}
],
"model": "nomic-embed-text",
"usage": {
"prompt_tokens": 10,
"total_tokens": 10
}
}
Example: Batch Input
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/embeddings" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "nomic-embed-text",
"input": [
"First text to embed",
"Second text to embed",
"Third text to embed"
]
}'
Batch Response
{
"object": "list",
"data": [
{"object": "embedding", "index": 0, "embedding": [...]},
{"object": "embedding", "index": 1, "embedding": [...]},
{"object": "embedding", "index": 2, "embedding": [...]}
],
"model": "nomic-embed-text",
"usage": {
"prompt_tokens": 15,
"total_tokens": 15
}
}
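Batch embeddings are handy for semantic similarity. A minimal sketch that embeds two sentences in one request and compares them with cosine similarity; the sentences and the cosine helper are illustrative:
import math
import requests

response = requests.post(
    "http://localhost:8083/mimik-ai/openai/v1/embeddings",
    headers={"Content-Type": "application/json", "Authorization": "Bearer 1234"},
    json={
        "model": "nomic-embed-text",
        "input": [
            "The quick brown fox jumps over the lazy dog.",
            "A fast auburn fox leaps above a sleepy hound."
        ],
    },
)
vectors = [item["embedding"] for item in response.json()["data"]]

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(f"Similarity: {cosine(vectors[0], vectors[1]):.4f}")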
Batch Constraints
| Constraint | Limit | Description |
|---|---|---|
| Maximum items | 50 | Maximum number of input strings per request |
| Input type | string or string[] | Each element must be a string |
| Token limit | Model-specific | Each input string is subject to the model's maximum token limit |
Batch Error Codes
| Code | Cause |
|---|---|
| 400 | Input array exceeds 50 items |
| 400 | Input array contains non-string elements |
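To embed a corpus larger than 50 items, split it into batches before calling the endpoint. A minimal sketch, assuming the documented 50-item limit; the embed_all helper is illustrative:
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer 1234"}
MAX_BATCH = 50  # documented per-request limit on input items

def embed_all(texts, model="nomic-embed-text"):
    embeddings = []
    for start in range(0, len(texts), MAX_BATCH):
        batch = texts[start:start + MAX_BATCH]
        response = requests.post(
            f"{BASE_URL}/embeddings",
            headers=HEADERS,
            json={"model": model, "input": batch},
        )
        response.raise_for_status()
        # Results are indexed per batch; extend in input order
        embeddings.extend(item["embedding"] for item in response.json()["data"])
    return embeddings

vectors = embed_all([f"Document {i}" for i in range(120)])
print(len(vectors))  # 120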
Model Cache Management
mILM maintains a runtime cache of loaded models. Models are loaded from the Model Registry on demand.
List Loaded Models
List models currently loaded in the runtime cache.
Request
GET /models
Example
curl "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Authorization: Bearer 1234"
Response (200 OK)
{
"data": [
{
"id": "smollm2-360m",
"object": "model",
"created": 1769534258,
"owned_by": "mimik",
"info": {
"kind": "llm",
"chat_template_hint": "chatml",
"n_gpu_layers": 99,
"max_context": 2048,
"n_vocab": 49152,
"n_ctx_train": 8192,
"n_embd": 960,
"n_params": 361821120,
"model_size": 384618240
},
"metrics": {
"inference_count": 12,
"last_used": 1769534258,
"loaded_at": 1769530800,
"tokens_per_second": 227.43,
"avg_tokens_per_second": 198.65
}
}
],
"object": "list"
}
Model Info Fields
| Field | Type | Description |
|---|---|---|
| kind | string | Model type: "llm", "vlm", or "embed" |
| chat_template_hint | string | Chat template applied during loading |
| n_gpu_layers | integer | Number of layers offloaded to GPU |
| max_context | integer | Maximum context size used at load time |
| n_vocab | integer | Vocabulary size |
| n_ctx_train | integer | Training context length |
| n_embd | integer | Embedding dimension size |
| n_params | integer | Total parameter count |
| model_size | integer | Model file size in bytes |
Model Metrics by Kind
| Metric | LLM / VLM | Embed | Description |
|---|---|---|---|
| inference_count | Yes | Yes | Total number of inference calls |
| tokens_per_second | Yes | No | Token throughput of the most recent inference |
| avg_tokens_per_second | Yes | No | Average token throughput across all inferences |
| last_latency_ms | No | Yes | Latency of the most recent inference in milliseconds |
| avg_latency_ms | No | Yes | Average latency across all inferences in milliseconds |
| last_used | Yes | Yes | Unix epoch timestamp (seconds) of last inference |
| loaded_at | Yes | Yes | Unix epoch timestamp (seconds) when the model was loaded |
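These metrics can help identify idle models in the cache. A minimal sketch that lists loaded models and prints throughput for LLM/VLM models and latency for embed models; field access follows the response examples in this section:
import time
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Authorization": "Bearer 1234"}

models = requests.get(f"{BASE_URL}/models", headers=HEADERS).json()["data"]
now = time.time()

for model in models:
    info = model["info"]
    metrics = model["metrics"]
    if info["kind"] == "embed":
        perf = f"avg latency {metrics.get('avg_latency_ms', 0):.2f} ms"
    else:
        perf = f"avg throughput {metrics.get('avg_tokens_per_second', 0):.1f} tok/s"
    last_used = metrics.get("last_used")
    idle = f"idle {now - last_used:.0f}s" if last_used else "never used"
    print(f"{model['id']} ({info['kind']}): {metrics['inference_count']} calls, {perf}, {idle}")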
Example: Embed Model Response
{
"data": [
{
"id": "nomic-embed-text-v1.5.Q8_0",
"object": "model",
"created": 1769534320,
"owned_by": "mimik",
"info": {
"kind": "embed",
"chat_template_hint": "",
"n_gpu_layers": -1,
"max_context": 2048,
"n_vocab": 30522,
"n_ctx_train": 2048,
"n_embd": 768,
"n_params": 136727040,
"model_size": 145389792
},
"metrics": {
"inference_count": 1,
"last_used": 1769534324,
"loaded_at": 1769534320,
"last_latency_ms": 10.61,
"avg_latency_ms": 10.61
}
}
],
"object": "list"
}
Load Model
Load a model from the Model Registry into the runtime cache.
Request
POST /models
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID from the Model Registry |
| chatTemplateHint | string | No | Override the chat template (e.g., "chatml", "llama3", "gemma") |
| initParams | object | No | Model initialization overrides |
| initParams.contextSize | integer | No | Override the default context window size |
| initParams.gpuLayerSize | integer | No | Number of layers to offload to GPU |
Example
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m"
}'
Response (201 Created)
The response is streamed as Server-Sent Events (SSE) with Content-Type: text/event-stream.
Progress events are emitted while the model loads:
data: {"progress":"<|loading_model|> 0%"}
data: {"progress":"<|loading_model|> 25%"}
data: {"progress":"<|loading_model|> 50%"}
data: {"progress":"<|loading_model|> 100%"}
Final event contains the loaded model object:
{
"id": "smollm2-360m",
"object": "model",
"created": 1769534258,
"owned_by": "mimik",
"info": {
"kind": "llm",
"chat_template_hint": "chatml",
"n_gpu_layers": 99,
"max_context": 2048,
"n_vocab": 49152,
"n_ctx_train": 8192,
"n_embd": 960,
"n_params": 361821120,
"model_size": 384618240
},
"metrics": {
"inference_count": 0,
"tokens_per_second": 0,
"avg_tokens_per_second": 0,
"last_used": null,
"loaded_at": 1769534258
}
}
Models are automatically loaded on first inference request. Explicit loading is optional but useful for warming up the cache.
Auto-load can fail with the following errors:
- 404: The model ID was not found in the Model Registry.
- 400: The model exists but is not ready (readyToUse: false). Complete the model provisioning by uploading or downloading the model file first.
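To warm up the cache programmatically, consume the SSE progress events returned by POST /models. A minimal Python sketch using requests with stream=True; handling of the final event allows for it arriving with or without the data: prefix, which is not specified above:
import json
import requests

response = requests.post(
    "http://localhost:8083/mimik-ai/openai/v1/models",
    headers={"Content-Type": "application/json", "Authorization": "Bearer 1234"},
    json={"model": "smollm2-360m"},
    stream=True,  # the 201 response is a text/event-stream of progress events
)

for line in response.iter_lines(decode_unicode=True):
    if not line:
        continue
    # Progress events are prefixed with "data: "; strip it if present
    payload = line[len("data: "):] if line.startswith("data: ") else line
    try:
        event = json.loads(payload)
    except ValueError:
        continue
    if "progress" in event:
        print(event["progress"])          # e.g. "<|loading_model|> 50%"
    elif "id" in event:
        print("Loaded:", event["id"])     # final event is the model object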
Unload Model
Remove a model from the runtime cache. The model files remain in the Model Registry and can be reloaded later.
Request
DELETE /models?modelId={id}
| Parameter | Location | Required | Description |
|---|---|---|---|
| modelId | query | Yes | Model ID to unload |
Example
curl -X DELETE "http://localhost:8083/mimik-ai/openai/v1/models?modelId=smollm2-360m" \
-H "Authorization: Bearer 1234"
Response (200 OK)
{
"id": "smollm2-360m",
"object": "model",
"deleted": true
}
Unload models you're not actively using to free memory for other models. The model files remain in the Model Registry and can be reloaded later.
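On memory-constrained devices, one approach is to unload the current model before loading the next one. A minimal sketch; the swap_model helper and the model IDs are illustrative:
import requests

BASE_URL = "http://localhost:8083/mimik-ai/openai/v1"
HEADERS = {"Authorization": "Bearer 1234"}

def swap_model(unload_id, load_id):
    # Free the memory held by the old model, then load the new one
    requests.delete(f"{BASE_URL}/models", headers=HEADERS, params={"modelId": unload_id})
    # Reading the POST response blocks until the SSE load stream completes
    requests.post(
        f"{BASE_URL}/models",
        headers={**HEADERS, "Content-Type": "application/json"},
        json={"model": load_id},
    )

swap_model("smollm2-360m", "nomic-embed-text")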
Model Lifecycle
The Model Registry and the Inference API share the following lifecycle:
- A model is created and provisioned in the Model Registry (metadata plus an uploaded or downloaded GGUF file).
- The model is loaded into the mILM runtime cache, either explicitly via POST /models or automatically on the first inference request.
- While cached, the model serves /chat/completions or /embeddings requests.
- DELETE /models?modelId={id} removes the model from the cache and frees memory; the files remain in the Model Registry and can be reloaded later.
Error Responses
| Code | Description |
|---|---|
| 400 | Bad request (invalid input or model not ready) |
| 401 | Unauthorized (missing or invalid API key) |
| 404 | Not found (model not in Model Registry) |
| 500 | Internal server error |
Error Format
{
"message": "Model 'unknown-model' not found in store",
"statusCode": 404
}
This error format deviates from the standard OpenAI API error format, which wraps errors in an error object. Handle errors by checking for message and statusCode at the top level of the response body.
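If you are calling the API with a plain HTTP client, a small helper can normalize this error shape. A minimal sketch; the InferenceAPIError class and check_response helper are illustrative:
import requests

class InferenceAPIError(Exception):
    """Raised when the Inference API returns an error body."""

def check_response(response):
    # mILM errors put message and statusCode at the top level of the body
    if response.status_code >= 400:
        try:
            message = response.json().get("message", response.text)
        except ValueError:
            message = response.text
        raise InferenceAPIError(f"{response.status_code}: {message}")
    return response.json()

response = requests.post(
    "http://localhost:8083/mimik-ai/openai/v1/chat/completions",
    headers={"Content-Type": "application/json", "Authorization": "Bearer 1234"},
    json={"model": "unknown-model", "messages": [{"role": "user", "content": "Hi"}]},
)
result = check_response(response)  # raises InferenceAPIError with the 404 message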
Common Errors
Model Not Found
{
"message": "Model 'smollm2-360m' not found in store",
"statusCode": 404
}
Cause: Model doesn't exist in the Model Registry. Solution: Create the model in the Model Registry first.
Model Not Ready
{
"message": "Model 'smollm2-360m' is not ready (readyToUse: false)",
"statusCode": 400
}
Cause: Model metadata exists but file hasn't been uploaded/downloaded. Solution: Complete the model provisioning by uploading or downloading the file.
OpenAI SDK Compatibility
The Inference API is compatible with the official OpenAI Python and JavaScript SDKs:
Python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234" # Your mimOE API key
)
# Chat completion
response = client.chat.completions.create(
model="smollm2-360m",
messages=[{"role": "user", "content": "Hello!"}]
)
# Embeddings
embeddings = client.embeddings.create(
model="nomic-embed-text",
input="Hello, world!"
)
JavaScript/TypeScript
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234' // Your mimOE API key
});
// Chat completion
const response = await client.chat.completions.create({
model: 'smollm2-360m',
messages: [{ role: 'user', content: 'Hello!' }]
});
// Embeddings
const embeddings = await client.embeddings.create({
model: 'nomic-embed-text',
input: 'Hello, world!'
});