
Finding Models on Hugging Face

Hugging Face is the largest repository of AI models, with thousands of pre-trained models available for free. This guide shows you how to find compatible models for mimOE, evaluate their characteristics, and download them for on-device inference.

Understanding Model Formats

Before searching, understand which format you need:

GGUF Models

Use when: You need generative AI capabilities (chat, text generation, code completion)

File extension: .gguf

Common use cases:

  • Conversational AI
  • Text generation
  • Code completion
  • Question answering
  • Creative writing

Typical size: 1.5GB - 8GB (quantized versions)

ONNX Models

Use when: You need predictive AI capabilities (classification, detection, embeddings)

File extension: .onnx

Common use cases:

  • Image classification
  • Object detection
  • Text embeddings
  • Sentiment analysis
  • Regression models

Typical size: 10MB - 500MB

Finding GGUF Models

Method 1: Search with Filters

  1. Go to Hugging Face Models

  2. In the search filters, select:
     • Format: GGUF
     • Libraries: transformers (optional)
     • Tasks: Text Generation

  3. Sort by:
     • Most downloads: Popular, well-tested models
     • Most likes: Community favorites
     • Trending: Recently popular models

Method 2: Search by Keywords

Use the search bar with specific keywords:

phi-3 gguf
llama-3 gguf
mistral gguf
gemma gguf
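
If you prefer to search programmatically, the huggingface_hub Python package can run the same kind of query. A minimal sketch (the search term, sort order, and result limit are just examples):

from huggingface_hub import HfApi

api = HfApi()

# List GGUF builds of Phi-3, most-downloaded first
for model in api.list_models(search="phi-3 gguf", sort="downloads", limit=5):
    print(model.id)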

Here are recommended models that work well on-device:

Small Models (under 3GB): For Most Devices

| Model | Size | Context | Best For |
| --- | --- | --- | --- |
| Phi-3-mini-4k | 2.4GB (Q4) | 4K tokens | General chat, coding, Q&A |
| Gemma-2B | 1.8GB (Q4) | 8K tokens | Fast responses, low memory |
| TinyLlama-1.1B | 0.6GB (Q4) | 2K tokens | Ultra-fast, resource-constrained |


Medium Models (4-8GB): For Capable Devices

| Model | Size | Context | Best For |
| --- | --- | --- | --- |
| Llama-3-8B | 4.7GB (Q4) | 8K tokens | High-quality responses |
| Mistral-7B | 4.1GB (Q4) | 8K tokens | Instruction following, reasoning |
| Phi-3-medium | 7.6GB (Q4) | 128K tokens | Long context, complex tasks |


Understanding Quantization Levels

GGUF models come in different quantization levels. Fewer bits per weight means a smaller file but slightly lower output quality.

| Quantization | Size vs Original | Quality | Use When |
| --- | --- | --- | --- |
| Q2 | ~25% | Acceptable | Extremely limited memory |
| Q3 | ~33% | Good | Limited memory |
| Q4 | ~40% | Very Good | Recommended default |
| Q5 | ~50% | Excellent | You have extra memory |
| Q6 | ~60% | Near-perfect | Quality is paramount |
| Q8 | ~80% | Virtually identical | Research/benchmarking |

Recommended Quantization

Q4 (Q4_K_M variant) offers the best balance of size, speed, and quality for most use cases.
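
As a rough sanity check, you can estimate a quantized file size from the parameter count and the bits stored per weight. A back-of-the-envelope sketch (the ~4.5 bits/weight figure for Q4_K_M is an approximation, not an exact spec):

def estimate_gguf_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameters x bits per weight, converted to gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

# Phi-3-mini has ~3.8B parameters; Q4_K_M stores roughly 4.5 bits per weight
print(f"{estimate_gguf_size_gb(3.8e9, 4.5):.1f} GB")  # ~2.1 GB, in line with the 2.4GB listed above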

Downloading GGUF Models

Once you've found a model, download it:

Option 1: Click "Download" on model page

Navigate to the Files tab and click the download icon next to the .gguf file.

Option 2: Use curl (faster for large files)

# Example: Download Phi-3-mini Q4
curl -L -o phi-3-mini-4k-instruct-q4.gguf \
"https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf?download=true"

Option 3: Use Hugging Face CLI

# Install HF CLI
pip install huggingface-hub

# Download model
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf \
Phi-3-mini-4k-instruct-q4.gguf \
--local-dir ./models
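
If you'd rather script the download, the same huggingface-hub package exposes hf_hub_download from Python; a minimal sketch using the file from the CLI example above:

from huggingface_hub import hf_hub_download

# Downloads the file (with caching and resume support) and returns the local path
path = hf_hub_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="Phi-3-mini-4k-instruct-q4.gguf",
    local_dir="./models",
)
print(path)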

Finding ONNX Models

Method 1: Search with Filters

  1. Go to Hugging Face Models

  2. In the search filters, select:
     • Format: ONNX
     • Libraries: transformers or onnx
     • Tasks: Select your task (Image Classification, Object Detection, etc.)

Method 2: Search by Task + "ONNX"

mobilenet onnx
resnet onnx
bert onnx
yolo onnx

Image Classification

| Model | Size | Input Size | Top-1 Accuracy | Speed |
| --- | --- | --- | --- | --- |
| MobileNetV2 | 14MB | 224x224 | 72% | Very Fast |
| ResNet-50 | 98MB | 224x224 | 76% | Fast |
| EfficientNet-B0 | 20MB | 224x224 | 77% | Fast |


Object Detection

| Model | Size | Input Size | Use Case |
| --- | --- | --- | --- |
| YOLOv8n | 6MB | 640x640 | Real-time detection (fast) |
| YOLOv8s | 22MB | 640x640 | Better accuracy |
| YOLOv8m | 52MB | 640x640 | High accuracy |


Text Embeddings

| Model | Size | Embedding Dim | Use Case |
| --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 90MB | 384 | Fast semantic search |
| BERT-base | 420MB | 768 | Higher quality embeddings |


Downloading ONNX Models

Option 1: Direct download

Click "Download" on the model's Files tab.

Option 2: Use curl

# Example: Download MobileNetV2
curl -L -o mobilenet_v2.onnx \
"https://huggingface.co/onnx-community/mobilenet_v2_1.0_224/resolve/main/model.onnx?download=true"

Option 3: Python script to export

Many models don't have pre-exported ONNX versions. You can export them yourself with the Optimum library:

from optimum.onnxruntime import ORTModelForImageClassification

# Load the original PyTorch weights and export them to ONNX in one step
model = ORTModelForImageClassification.from_pretrained(
    "microsoft/resnet-50",
    export=True,
)

# Save the exported ONNX model (plus config) to a local directory
model.save_pretrained("./resnet-50-onnx")
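
After exporting, it's worth confirming the model loads and checking its expected input shape before wiring it into your app. A quick sketch with onnxruntime, assuming the export wrote model.onnx into that directory:

import onnxruntime as ort

# Load the exported model and inspect its input/output signatures
session = ort.InferenceSession("./resnet-50-onnx/model.onnx")

for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)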

Evaluating Models

Before downloading, check these characteristics:

Model Card

Every model has a "Model Card" describing:

  • Purpose: What the model is designed for
  • Training data: What data it was trained on
  • Limitations: Known weaknesses
  • License: Usage restrictions

Files Tab

Check the model's files:

  • Size: Will it fit in your available memory?
  • Format: Is it .gguf or .onnx?
  • Variants: Multiple quantization levels available?

Community Activity

Indicators of model quality:

  • Downloads: More downloads = more tested
  • Likes: Community endorsement
  • Discussions: Active community support
  • Recent updates: Is it maintained?

Model Selection Criteria

For GGUF (Generative AI)

Choose based on:

1. Memory constraints

Rule of thumb: You need ~1.5-2x the model file size in available RAM.

  • 4GB RAM → Up to 2B parameters (TinyLlama 1.1B, Gemma-2B)
  • 8GB RAM → Up to 4B parameters (Phi-3-mini, SmolLM2)
  • 16GB RAM → Up to 8B parameters comfortably (Llama-3-8B, Mistral-7B)
  • 32GB RAM → Up to 13B parameters (Llama-2-13B, CodeLlama-13B)
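
To apply the rule of thumb above programmatically, you can compare the model file size against the device's free memory. A minimal sketch, assuming the third-party psutil package is available:

import psutil  # assumption: psutil is installed for querying free memory

MODEL_FILE_GB = 2.4  # example: Phi-3-mini Q4
HEADROOM = 2.0       # upper end of the 1.5-2x rule of thumb

available_gb = psutil.virtual_memory().available / 1e9
if available_gb >= MODEL_FILE_GB * HEADROOM:
    print(f"OK: {available_gb:.1f} GB free, the model should fit")
else:
    print(f"Tight fit: only {available_gb:.1f} GB free, consider a smaller model")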

2. Task complexity

  • Simple chat → Phi-3-mini, Gemma-2B
  • Code generation → Phi-3-mini, CodeLlama
  • Long documents → Phi-3-medium (128K context)
  • Maximum quality → Llama-3-8B, Mistral-7B

3. Response speed

  • Fastest → TinyLlama (0.6GB)
  • Fast → Phi-3-mini (2.4GB)
  • Balanced → Llama-3-8B Q4 (4.7GB)

For ONNX (Predictive AI)

Choose based on:

1. Task type

  • Image classification → MobileNetV2, ResNet-50
  • Object detection → YOLOv8n, YOLOv8s
  • Text embeddings → all-MiniLM-L6-v2
  • Sentiment analysis → DistilBERT

2. Accuracy vs. Speed

  • Speed priority → MobileNetV2, YOLOv8n, MiniLM
  • Accuracy priority → ResNet-50, YOLOv8m, BERT-base
  • Balanced → EfficientNet-B0, YOLOv8s

3. Input constraints

  • Limited preprocessing → Models with 224x224 input
  • High resolution → Models with 640x640+ input

Common Pitfalls to Avoid

GGUF Models

  • Don't download multiple quantizations: Pick one (Q4 recommended)
  • Check context length: Longer isn't always better (more memory)
  • Verify it's an instruct model: Base models aren't fine-tuned for chat

ONNX Models

  • Check input preprocessing: Must match model expectations
  • Verify output format: Some models return logits, others probabilities
  • Look for complete examples: Preprocessing is critical (see the sketch after this list)
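
As an illustration of what "matching the model's expectations" means, here is a typical ImageNet-style preprocessing sketch for a classifier such as MobileNetV2. The input size, channel layout (NCHW), and normalization constants below are common defaults, not guarantees; confirm them against the specific model card:

import numpy as np
from PIL import Image

# Common ImageNet normalization constants (verify against the model card)
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(path: str, size: int = 224) -> np.ndarray:
    image = Image.open(path).convert("RGB").resize((size, size))
    array = np.asarray(image, dtype=np.float32) / 255.0  # scale to [0, 1]
    array = (array - MEAN) / STD                         # normalize per channel
    array = array.transpose(2, 0, 1)                     # HWC -> CHW
    return array[np.newaxis, ...]                        # add batch dim -> NCHW

# batch = preprocess("cat.jpg")  # shape (1, 3, 224, 224), dtype float32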

General

  • Read the license: Some models have commercial restrictions
  • Check file integrity: Large downloads can get corrupted (a checksum sketch follows this list)
  • Test on sample data first: Before deploying
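
For the file-integrity point, one simple approach is to compute a local SHA-256 hash and compare it with the checksum shown in the file's details on Hugging Face. A minimal sketch (the file path is just an example):

import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large models don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of("./models/Phi-3-mini-4k-instruct-q4.gguf"))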

Example Model Searches

"I want a chatbot"

Search: phi-3 gguf or llama-3 gguf

Recommendation: Phi-3-mini-4k-instruct Q4 (2.4GB)

  • Fast responses
  • Good quality
  • Works on most devices

"I want to classify images"

Search: mobilenet onnx or resnet onnx

Recommendation: MobileNetV2 ONNX (14MB)

  • Very fast
  • Good accuracy for common objects
  • Small size

"I want semantic search"

Search: sentence-transformers onnx or all-MiniLM onnx

Recommendation: all-MiniLM-L6-v2 (90MB)

  • Fast embedding generation
  • Good semantic understanding
  • Widely used and tested

"I want code completion"

Search: phi-3 gguf or codellama gguf

Recommendation: Phi-3-mini-4k-instruct Q4 (2.4GB)

  • Excellent coding capabilities
  • Fast
  • Reasonable size

Next Steps

Now that you know how to find models:
