Chat with SmolLM2
Build a conversational AI application using the SmolLM2-360M model. This example walks through setting up the model and creating a complete chat interface.
Overview
This example demonstrates:
- Provisioning a GGUF model in two steps (create the metadata, then download the weights)
- Making chat completion requests
- Handling streaming responses
- Building a multi-turn conversation
Prerequisites
- mimOE AI Foundation Package running (Quick Start)
- Node.js 18+ or Python 3.8+ (for code examples)
Step 1: Provision SmolLM2
First, create the model metadata:
curl -X POST "http://localhost:8083/mimik-ai/store/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"id": "smollm2-360m",
"version": "1.0.0",
"kind": "llm"
}'
Then download the model:
curl -X POST "http://localhost:8083/mimik-ai/store/v1/models/smollm2-360m/download" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"url": "https://huggingface.co/lmstudio-community/SmolLM2-360M-Instruct-GGUF/resolve/main/SmolLM2-360M-Instruct-Q8_0.gguf?download=true"
}'
Verify the model is ready:
curl "http://localhost:8083/mimik-ai/store/v1/models/smollm2-360m"
Confirm readyToUse: true before proceeding.
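If you prefer to script the wait instead of re-running the check by hand, you can poll the store until the flag flips. A minimal Node.js sketch, assuming the GET response above returns JSON with a top-level readyToUse field (the helper name waitForModel is only illustrative):
const MODEL_URL = 'http://localhost:8083/mimik-ai/store/v1/models/smollm2-360m';

async function waitForModel(intervalMs = 5000) {
  // Poll the model store until the GGUF download has finished.
  while (true) {
    const res = await fetch(MODEL_URL);
    const model = await res.json();
    if (model.readyToUse) {
      console.log('Model is ready to use.');
      return;
    }
    console.log('Not ready yet, checking again...');
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

waitForModel();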
Step 2: Basic Chat
Send a simple chat request:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "user", "content": "Hello! Can you introduce yourself?"}
]
}'
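Step 3 uses the OpenAI SDK against the same endpoint, so it helps to see the non-streaming shape first: the reply text comes back in choices[0].message.content, assuming the endpoint mirrors the standard OpenAI response format (the streaming variant in Step 3 suggests it does):
import OpenAI from 'openai';

// Same local endpoint and token as the curl example above.
const client = new OpenAI({
  baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
  apiKey: '1234'
});

const completion = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: [
    { role: 'user', content: 'Hello! Can you introduce yourself?' }
  ]
});

// The assistant's reply is the first (and only) choice.
console.log(completion.choices[0].message.content);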
Step 3: Build a Chat Interface
JavaScript/Node.js
Install the OpenAI SDK (readline is built into Node.js, so only openai needs installing):
npm install openai
Create a chat application:
chat.js
import OpenAI from 'openai';
import readline from 'readline';
const client = new OpenAI({
baseURL: 'http://localhost:8083/mimik-ai/openai/v1',
apiKey: '1234'
});
// Conversation history; the system prompt stays at the front of every request.
const messages = [
  { role: 'system', content: 'You are a helpful AI assistant.' }
];
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout
});
console.log('Chat with SmolLM2 (type "exit" to quit)\n');
// Read one user turn, stream the reply, then prompt again.
async function chat() {
rl.question('You: ', async (input) => {
if (input.toLowerCase() === 'exit') {
console.log('Goodbye!');
rl.close();
return;
}
messages.push({ role: 'user', content: input });
process.stdout.write('SmolLM2: ');
const stream = await client.chat.completions.create({
model: 'smollm2-360m',
messages: messages,
stream: true
});
let fullResponse = '';
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
process.stdout.write(content);
fullResponse += content;
}
}
console.log('\n');
messages.push({ role: 'assistant', content: fullResponse });
chat();
});
}
chat();
Run with:
node chat.js
Python
Install the OpenAI SDK:
pip install openai
Create a chat application:
chat.py
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8083/mimik-ai/openai/v1",
api_key="1234"
)
# Conversation history; the system prompt stays at the front of every request.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."}
]
print("Chat with SmolLM2 (type 'exit' to quit)\n")
# Read a user turn, stream the reply, and append it to the history.
while True:
user_input = input("You: ")
if user_input.lower() == "exit":
print("Goodbye!")
break
messages.append({"role": "user", "content": user_input})
print("SmolLM2: ", end="", flush=True)
stream = client.chat.completions.create(
model="smollm2-360m",
messages=messages,
stream=True
)
full_response = ""
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
full_response += content
print("\n")
messages.append({"role": "assistant", "content": full_response})
Run with:
python chat.py
Step 4: Advanced Features
Adjusting Response Style
Use temperature and system prompts to control output:
# Creative responses
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a short story about a robot."}
],
"temperature": 0.9,
"max_tokens": 500
}'
# Factual responses
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{
"model": "smollm2-360m",
"messages": [
{"role": "system", "content": "You are a technical assistant. Be precise and accurate."},
{"role": "user", "content": "Explain how HTTPS works."}
],
"temperature": 0.2
}'
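The same parameters can be passed through the SDK, so the chat application from Step 3 can expose them directly. A small sketch of the streaming call from chat.js with sampling settings added (the values are illustrative; client and messages are the ones defined there):
// Streaming request from chat.js, extended with sampling parameters.
const stream = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: messages,
  stream: true,
  temperature: 0.2,  // lower for factual answers, higher (e.g. 0.9) for creative ones
  max_tokens: 500    // cap the length of each reply
});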
Context Management
Trim conversation history to stay within context limits:
function trimMessages(messages, maxTokens = 1500) {
const systemMessage = messages.find(m => m.role === 'system');
const nonSystemMessages = messages.filter(m => m.role !== 'system');
// Estimate tokens (rough: 4 chars per token)
let totalChars = systemMessage ? systemMessage.content.length : 0;
const trimmed = [];
// Keep most recent messages
for (let i = nonSystemMessages.length - 1; i >= 0; i--) {
const msgChars = nonSystemMessages[i].content.length;
if (totalChars + msgChars > maxTokens * 4) break;
trimmed.unshift(nonSystemMessages[i]);
totalChars += msgChars;
}
return systemMessage ? [systemMessage, ...trimmed] : trimmed;
}
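In chat.js, the helper would be applied just before each request so that older turns are dropped while the system prompt is kept, for example:
// Send only the trimmed history; the full messages array still records every turn.
const stream = await client.chat.completions.create({
  model: 'smollm2-360m',
  messages: trimMessages(messages),
  stream: true
});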
Troubleshooting
Slow First Response
The first request loads the model into memory. Pre-load it:
curl -X POST "http://localhost:8083/mimik-ai/openai/v1/models" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1234" \
-d '{"model": "smollm2-360m"}'
Out of Memory
SmolLM2-360M requires only ~1GB RAM. If you still have issues:
- Close other memory-intensive applications
- Reduce initContextSize when creating the model (see the sketch below)
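A hedged sketch of what a smaller context size could look like when creating the model metadata from Step 1; the exact schema for initContextSize is not shown in this example, so verify it against the Inference API Reference:
// Assumption: initContextSize sits alongside the fields used in Step 1.
await fetch('http://localhost:8083/mimik-ai/store/v1/models', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: 'Bearer 1234'
  },
  body: JSON.stringify({
    id: 'smollm2-360m',
    version: '1.0.0',
    kind: 'llm',
    initContextSize: 2048  // illustrative value; smaller contexts use less memory
  })
});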
Poor Response Quality
- Verify chatTemplateHint: "chatml" matches the model
- Add a clear system prompt
- Adjust temperature (lower for factual, higher for creative)
Next Steps
- Semantic Search: Use embeddings for similarity search
- Inference API Reference: Complete API documentation