Chat with SmolLM2
Build a conversational AI application using the SmolLM2-360M model. This example walks through setting up the model and creating a complete chat interface.
Overview
This example demonstrates:
- Provisioning a GGUF model in two steps (create the metadata, then download the weights)
- Making chat completion requests
- Handling streaming responses
- Building a multi-turn conversation
Prerequisites
- mimOE AI Foundation Package running (Quick Start)
- Node.js 18+ or Python 3.8+ (for code examples)
Step 1: Provision SmolLM2
First, create the model metadata:
curl -X POST "http://localhost:8083/mimik-ai/store/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"id": "smollm2-360m",
"version": "1.0.0",
"kind": "llm"
}'
Then download the model:
curl -X POST "http://localhost:8083/mimik-ai/store/v1/models/smollm2-360m/download" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"url": "https://huggingface.co/lmstudio-community/SmolLM2-360M-Instruct-GGUF/resolve/main/SmolLM2-360M-Instruct-Q8_0.gguf?download=true"
}'
Verify the model is ready:
curl "http://localhost:8083/mimik-ai/store/v1/models/smollm2-360m"
Confirm readyToUse: true before proceeding.
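If you prefer to script the wait instead of re-running the check by hand, you can poll the store until the flag flips. A minimal Node.js sketch, assuming the GET response above returns JSON with a top-level readyToUse field (the helper name waitForModel is only illustrative):
const MODEL_URL = 'http://localhost:8083/mimik-ai/store/v1/models/smollm2-360m';

async function waitForModel(intervalMs = 5000) {
  // Poll the model store until the GGUF download has finished.
  while (true) {
    const res = await fetch(MODEL_URL);
    const model = await res.json();
    if (model.readyToUse) {
      console.log('Model is ready to use.');
      return;
    }
    console.log('Not ready yet, checking again...');
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

waitForModel();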
Step 2: Basic Chat
Send a simple chat request:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "user", "content": "Hello! Can you introduce yourself?"}
]
}'
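Step 3 uses the OpenAI SDK against the same endpoint, so it helps to see the non-streaming shape first: the reply text comes back in choices[0].message.content, assuming the endpoint mirrors the standard OpenAI response format (the streaming variant in Step 3 suggests it does):
import OpenAI from 'openai';

// Same local endpoint and token as the curl example above.
const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});

const completion = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: [
    { role: 'user', content: 'Hello! Can you introduce yourself?' }
  ]
});

// The assistant's reply is the first (and only) choice.
console.log(completion.choices[0].message.content);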
Step 3: Build a Chat Interface
JavaScript/Node.js
Install the OpenAI SDK (readline is built into Node.js, so only openai needs installing):
npm install openai
Create a chat application:
chat.js
import OpenAI from 'openai';
import readline from 'readline';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234'
});
// Conversation history; the system prompt stays at the front of every request.
const messages = [
  { role: 'system', content: 'You are a helpful AI assistant.' }
];
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout
});
console.log('Chat with SmolLM2 (type "exit" to quit)\n');
// Read one user turn, stream the reply, then prompt again.
async function chat() {
rl.question('You: ', async (input) => {
if (input.toLowerCase() === 'exit') {
console.log('Goodbye!');
rl.close();
return;
}
messages.push({ role: 'user', content: input });
process.stdout.write('SmolLM2: ');
const stream = await client.chat.completions.create({
model: 'smollm2-360m',
messages: messages,
stream: true
});
let fullResponse = '';
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
process.stdout.write(content);
fullResponse += content;
}
}
console.log('\n');
messages.push({ role: 'assistant', content: fullResponse });
chat();
});
}
chat();
Run with:
node chat.js
Python
Install the OpenAI SDK:
pip install openai
Create a chat application:
chat.py
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
# Conversation history; the system prompt stays at the front of every request.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."}
]
print("Chat with SmolLM2 (type 'exit' to quit)\n")
# Read a user turn, stream the reply, and append it to the history.
while True:
user_input = input("You: ")
if user_input.lower() == "exit":
print("Goodbye!")
break
messages.append({"role": "user", "content": user_input})
print("SmolLM2: ", end="", flush=True)
stream = client.chat.completions.create(
model="smollm2-360m",
messages=messages,
stream=True
)
full_response = ""
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
full_response += content
print("\n")
messages.append({"role": "assistant", "content": full_response})
Run with:
python chat.py
Step 4: Advanced Features
Adjusting Response Style
Use temperature and system prompts to control output:
# Creative responses
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a short story about a robot."}
],
"temperature": 0.9,
"max_tokens": 500
}'
# Factual responses
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "system", "content": "You are a technical assistant. Be precise and accurate."},
{"role": "user", "content": "Explain how HTTPS works."}
],
"temperature": 0.2
}'
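The same parameters can be passed through the SDK, so the chat application from Step 3 can expose them directly. A small sketch of the streaming call from chat.js with sampling settings added (the values are illustrative; client and messages are the ones defined there):
// Streaming request from chat.js, extended with sampling parameters.
const stream = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: messages,
  stream: true,
  temperature: 0.2,  // lower for factual answers, higher (e.g. 0.9) for creative ones
  max_tokens: 500    // cap the length of each reply
});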
Context Management
Trim conversation history to stay within context limits:
function trimMessages(messages, maxTokens = 1500) {
const systemMessage = messages.find(m => m.role === 'system');
const nonSystemMessages = messages.filter(m => m.role !== 'system');
// Estimate tokens (rough: 4 chars per token)
let totalChars = systemMessage ? systemMessage.content.length : 0;
const trimmed = [];
// Keep most recent messages
for (let i = nonSystemMessages.length - 1; i >= 0; i--) {
const msgChars = nonSystemMessages[i].content.length;
if (totalChars + msgChars > maxTokens * 4) break;
trimmed.unshift(nonSystemMessages[i]);
totalChars += msgChars;
}
return systemMessage ? [systemMessage, ...trimmed] : trimmed;
}
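In chat.js, the helper would be applied just before each request so that older turns are dropped while the system prompt is kept, for example:
// Send only the trimmed history; the full messages array still records every turn.
const stream = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: trimMessages(messages),
  stream: true
});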
Troubleshooting
Slow First Response
The first request loads the model into memory. Pre-load it:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{"model": "smollm2-360m"}'
Out of Memory
SmolLM2-360M requires only ~1GB RAM. If you still have issues:
- Close other memory-intensive applications
- Reduce initContextSize when creating the model (see the sketch below)
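A hedged sketch of what a smaller context size could look like when creating the model metadata from Step 1; the exact schema for initContextSize is not shown in this example, so verify it against the Inference API Reference:
// Assumption: initContextSize sits alongside the fields used in Step 1.
await fetch('http://localhost:8083/mimik-ai/store/v1/models', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: 'Bearer 1234'
  },
  body: JSON.stringify({
    id: 'smollm2-360m',
    version: '1.0.0',
    kind: 'llm',
    initContextSize: 2048  // illustrative value; smaller contexts use less memory
  })
});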
Poor Response Quality
- Verify chatTemplateHint: "chatml" matches the model
- Add a clear system prompt
- Adjust temperature (lower for factual, higher for creative)
Next Steps
- Semantic Search: Use embeddings for similarity search
- Inference API Reference: Complete API documentation