
Chat with SmolLM2

Build a conversational AI application using the SmolLM2-360M model. This example walks through setting up the model and creating a complete chat interface.

Overview

This example demonstrates:

  • Provisioning a GGUF model with the two-step provisioning flow (metadata, then download)
  • Making chat completion requests
  • Handling streaming responses
  • Building a multi-turn conversation

Prerequisites

  • The mimOE AI Foundation Package running (see Quick Start)
  • Node.js 18+ or Python 3.8+ (for code examples)

Step 1: Provision SmolLM2

First, create the model metadata:

curl -X POST "http://localhost:8083/mimik-ai/store/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"id": "smollm2-360m",
"version": "1.0.0",
"kind": "llm"
}'

Then download the model:

curl -X POST "http://localhost:8083/mimik-ai/store/v1/models/smollm2-360m/download" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"url": "https://huggingface.co/lmstudio-community/SmolLM2-360M-Instruct-GGUF/resolve/main/SmolLM2-360M-Instruct-Q8_0.gguf?download=true"
}'

Verify the model is ready:

curl "http://localhost:8083/mimik-ai/store/v1/models/smollm2-360m"

Confirm that the response shows readyToUse: true before proceeding.
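
The download can take a minute or two depending on bandwidth. If you want to script the wait instead of checking manually, a minimal polling sketch (using the requests library, pip install requests, and the readyToUse field returned by the status endpoint above) could look like this:

import time
import requests

STORE_URL = "http://localhost:8083/mimik-ai/store/v1/models/smollm2-360m"

# Poll the model metadata until the download has finished.
while True:
    model = requests.get(STORE_URL).json()
    if model.get("readyToUse"):
        print("Model is ready to use.")
        break
    print("Still downloading, waiting 5 seconds...")
    time.sleep(5)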

Step 2: Basic Chat

Send a simple chat request:

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "user", "content": "Hello! Can you introduce yourself?"}
]
}'
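
The endpoint follows the OpenAI chat-completions schema, so the reply text is at choices[0].message.content in the JSON response. As a quick sanity check from Python (a sketch using the requests library; the SDK-based approach follows in Step 3):

import requests

# Same endpoint and payload as the curl example above.
response = requests.post(
    "http://localhost:8083/mimik-ai/openai/v1/chat/completions",
    headers={"Authorization": "Bearer 1234"},
    json={
        "model": "smollm2-360m",
        "messages": [
            {"role": "user", "content": "Hello! Can you introduce yourself?"}
        ],
    },
)

# The reply text lives in the first choice's message.
print(response.json()["choices"][0]["message"]["content"])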

Step 3: Build a Chat Interface

JavaScript/Node.js

Install the OpenAI SDK (the readline module used below is built into Node.js, so it does not need to be installed):

npm install openai

Create a chat application:

chat.js
import OpenAI from 'openai';
import readline from 'readline';

const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});

const messages = [
  { role: 'system', content: 'You are a helpful AI assistant.' }
];

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout
});

console.log('Chat with SmolLM2 (type "exit" to quit)\n');

async function chat() {
  rl.question('You: ', async (input) => {
    if (input.toLowerCase() === 'exit') {
      console.log('Goodbye!');
      rl.close();
      return;
    }

    messages.push({ role: 'user', content: input });

    process.stdout.write('SmolLM2: ');

    const stream = await client.chat.completions.create({
      model: 'smollm2-360m',
      messages: messages,
      stream: true
    });

    let fullResponse = '';
    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        process.stdout.write(content);
        fullResponse += content;
      }
    }

    console.log('\n');
    messages.push({ role: 'assistant', content: fullResponse });
    chat();
  });
}

chat();

Run with:

node chat.js

Python

Install the OpenAI SDK:

pip install openai

Create a chat application:

chat.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8083/mimik-ai/openai/v1",
    api_key="1234"
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."}
]

print("Chat with SmolLM2 (type 'exit' to quit)\n")

while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        print("Goodbye!")
        break

    messages.append({"role": "user", "content": user_input})

    print("SmolLM2: ", end="", flush=True)

    stream = client.chat.completions.create(
        model="smollm2-360m",
        messages=messages,
        stream=True
    )

    full_response = ""
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
            full_response += content

    print("\n")
    messages.append({"role": "assistant", "content": full_response})

Run with:

python chat.py

Step 4: Advanced Features

Adjusting Response Style

Use temperature and system prompts to control output:

# Creative responses
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "smollm2-360m",
    "messages": [
      {"role": "system", "content": "You are a creative storyteller."},
      {"role": "user", "content": "Write a short story about a robot."}
    ],
    "temperature": 0.9,
    "max_tokens": 500
  }'

# Factual responses
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234" \
  -d '{
    "model": "smollm2-360m",
    "messages": [
      {"role": "system", "content": "You are a technical assistant. Be precise and accurate."},
      {"role": "user", "content": "Explain how HTTPS works."}
    ],
    "temperature": 0.2
  }'
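
The same parameters are available through the SDK. For example, the Python client from Step 3 can pass temperature and max_tokens directly (a sketch; the parameter values are only illustrative):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8083/mimik-ai/openai/v1",
    api_key="1234"
)

# Lower temperature keeps answers factual; raise it (e.g. 0.9) for creative output.
response = client.chat.completions.create(
    model="smollm2-360m",
    messages=[
        {"role": "system", "content": "You are a technical assistant. Be precise and accurate."},
        {"role": "user", "content": "Explain how HTTPS works."},
    ],
    temperature=0.2,
    max_tokens=500,
)
print(response.choices[0].message.content)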

Context Management

Trim conversation history to stay within context limits:

function trimMessages(messages, maxTokens = 1500) {
  const systemMessage = messages.find(m => m.role === 'system');
  const nonSystemMessages = messages.filter(m => m.role !== 'system');

  // Estimate tokens (rough: 4 chars per token)
  let totalChars = systemMessage ? systemMessage.content.length : 0;
  const trimmed = [];

  // Keep the most recent messages that fit within the budget
  for (let i = nonSystemMessages.length - 1; i >= 0; i--) {
    const msgChars = nonSystemMessages[i].content.length;
    if (totalChars + msgChars > maxTokens * 4) break;
    trimmed.unshift(nonSystemMessages[i]);
    totalChars += msgChars;
  }

  return systemMessage ? [systemMessage, ...trimmed] : trimmed;
}
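
The Python chat app can use an equivalent helper (same rough estimate of about four characters per token). Call it just before each request, e.g. messages = trim_messages(messages):

def trim_messages(messages, max_tokens=1500):
    """Keep the system prompt plus the most recent messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    # Estimate tokens (rough: 4 chars per token)
    budget = max_tokens * 4
    total = len(system[0]["content"]) if system else 0

    trimmed = []
    # Walk backwards so the most recent messages are kept
    for msg in reversed(others):
        size = len(msg["content"])
        if total + size > budget:
            break
        trimmed.insert(0, msg)
        total += size

    return system + trimmed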

Troubleshooting

Slow First Response

The first request loads the model into memory. Pre-load it:

curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{"model": "smollm2-360m"}'

Out of Memory

SmolLM2-360M requires only ~1GB RAM. If you still have issues:

  1. Close other memory-intensive applications
  2. Reduce initContextSize when creating the model

Poor Response Quality

  • Verify chatTemplateHint: "chatml" matches the model
  • Add a clear system prompt
  • Adjust temperature (lower for factual, higher for creative)

Next Steps