The Technology Innovation Institute (TII) in Abu Dhabi has released Falcon, a new family of state-of-the-art language models under the Apache 2.0 license. The standout is Falcon-40B, which rivals many closed-source models while being truly open. This opens up exciting possibilities for developers, researchers, and businesses.
This article dives into what makes Falcon unique and how to easily use it with tools from the Hugging Face ecosystem—covering inference, quantization, fine-tuning, and more.
The Falcon Models
The Falcon family includes two base models: Falcon-40B and its smaller sibling, Falcon-7B. At launch, Falcon-40B topped the Open LLM Leaderboard, and Falcon-7B was among the strongest models in its weight class.
Falcon-40B requires about 90GB of GPU memory—less than LLaMA-65B, which it outperforms. Falcon-7B needs only ~15GB, making it accessible on consumer hardware. TII also provides instruct versions (Falcon-7B-Instruct and Falcon-40B-Instruct) fine-tuned for conversational tasks.
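Those memory figures follow directly from the parameter counts: at 2 bytes per parameter in bfloat16, the weights alone account for most of the footprint, with the remainder going to activations and the K,V-cache. A rough sanity check (illustrative only; exact overhead depends on the runtime):

```python
def bf16_footprint_gb(n_params: float) -> float:
    """Approximate weight memory in GB for bfloat16 (2 bytes per parameter)."""
    return n_params * 2 / 1e9

falcon_40b = bf16_footprint_gb(40e9)  # ~80 GB of weights -> ~90 GB in practice
falcon_7b = bf16_footprint_gb(7e9)    # ~14 GB of weights -> ~15 GB in practice

print(f"Falcon-40B weights: ~{falcon_40b:.0f} GB")
print(f"Falcon-7B weights:  ~{falcon_7b:.0f} GB")
```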
The key to Falcon's quality is its training data, which is primarily (>80%) based on RefinedWeb, a massive web dataset built from CommonCrawl. Rather than relying on curated sources, TII focused on scaling and improving the quality of web data through large-scale deduplication and strict filtering. They also released a 600-billion-token extract of RefinedWeb for the community to use.
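To give a feel for what deduplication involves, here is a toy exact-match sketch using a hypothetical normalization step. RefinedWeb's actual pipeline is far more sophisticated (fuzzy MinHash plus exact-substring matching at web scale); this only illustrates the core idea that near-identical pages should hash to the same key:

```python
import hashlib

def normalize(text: str) -> str:
    # Toy normalization: lowercase and collapse whitespace so trivially
    # reformatted copies of a page hash to the same value.
    return " ".join(text.lower().split())

def dedup(docs):
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

crawl = ["Falcon is open.", "falcon   is open.", "A different page."]
print(dedup(crawl))  # the reformatted duplicate is dropped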
Falcon uses multiquery attention, which shares a single key and value projection across all attention heads instead of giving each head its own. This shrinks the K,V-cache during inference by a factor of 10-100, lowering memory costs and enabling further optimizations.
| Model | License | Commercial Use | Pretraining Tokens | Pretraining Compute (PF-days) | Leaderboard Score | K,V-cache Size (2K context) |
|---|---|---|---|---|---|---|
| StableLM-Alpha-7B | CC-BY-SA-4.0 | Yes | 1,500B | 700 | 34.37 | 800MB |
| LLaMA-7B | LLaMA | No | 1,000B | 500 | 45.65 | 1,100MB |
| MPT-7B | Apache 2.0 | Yes | 1,000B | 500 | 44.28 | 1,100MB |
| Falcon-7B | Apache 2.0 | Yes | 1,500B | 700 | 44.17 | 20MB |
| LLaMA-33B | LLaMA | No | 1,500B | 3200 | - | 3,300MB |
| LLaMA-65B | LLaMA | No | 1,500B | 6300 | 61.19 | 5,400MB |
| Falcon-40B | Apache 2.0 | Yes | 1,000B | 2800 | 58.07 | 240MB |
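The K,V-cache column in the table follows from the attention layout: the cache stores one key and one value vector per layer, per KV head, per token. A back-of-the-envelope calculation (the layer counts and head dimensions below are taken as assumptions from the published model configs, and the results match the table up to MB/MiB rounding) shows where the multiquery saving comes from:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    # K and V each store (kv_heads * head_dim) values per token per layer;
    # the leading 2 accounts for keys plus values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

MB = 1024 ** 2
llama_7b = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=2048)  # multi-head
falcon_7b = kv_cache_bytes(layers=32, kv_heads=1, head_dim=64, seq_len=2048)   # multiquery

print(f"LLaMA-7B  KV cache: {llama_7b / MB:.0f} MiB")
print(f"Falcon-7B KV cache: {falcon_7b / MB:.0f} MiB")
print(f"reduction: {llama_7b / falcon_7b:.0f}x")  # within the claimed 10-100x range
```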
Demo
Try Falcon-40B in this Space or the embedded playground, powered by Hugging Face's Text Generation Inference—the same technology behind HuggingChat.
A Core ML version of the 7B instruct model runs on an M1 MacBook Pro, with a Swift library for easy integration. Download Core ML weights from the repo.
Inference
To run inference with the transformers API, load the model in bfloat16 (this requires a recent NVIDIA GPU, since bfloat16 support arrived with the Ampere architecture). Because the architecture was new at release, you must also pass trust_remote_code=True to allow the model's custom code to run:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Evaluation
Falcon models perform strongly on standard benchmarks. As of this writing, Falcon-40B leads the Open LLM Leaderboard among truly open models, while requiring far less pretraining compute than LLaMA-65B (2,800 vs. 6,300 PF-days in the table above).
Fine-tuning with PEFT
Fine-tune Falcon with Parameter-Efficient Fine-Tuning (PEFT) using LoRA. Example for Falcon-7B:
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
tokenizer.pad_token = tokenizer.eos_token  # Falcon's tokenizer has no pad token by default

lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=["query_key_value"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Train on your dataset...
```
This approach drastically reduces memory requirements, enabling fine-tuning on a single GPU.
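To see why memory drops so much, count the trainable parameters. Each LoRA adapter adds two low-rank matrices (in_features x r and r x out_features) per targeted layer. Assuming Falcon-7B's published shapes (hidden size 4544, 32 layers, and a fused query_key_value projection of width 4544 + 2x64 for the single shared multiquery K/V head; treat these dimensions as assumptions and check the model config), the r=8 setup trains only a few million weights:

```python
hidden = 4544               # Falcon-7B hidden size (assumed from the released config)
qkv_out = hidden + 2 * 64   # fused Q projection plus one shared K and V head (multiquery)
layers = 32
r = 8

# Each adapter contributes A (hidden x r) and B (r x qkv_out) matrices.
lora_params = layers * (r * hidden + r * qkv_out)
total_params = 7e9          # rough full-model size

print(f"LoRA trainable params: {lora_params:,}")
print(f"fraction of full model: {lora_params / total_params:.4%}")
```

Since gradients and optimizer states are only kept for these adapter weights, the optimizer overhead shrinks by roughly the same factor.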
Conclusion
Falcon brings high-performance, truly open language models to the community. With Apache 2.0 licensing and integration with Hugging Face tools, Falcon is poised to accelerate innovation in NLP.