The Technology Innovation Institute (TII) in Abu Dhabi has released Falcon, a new family of state-of-the-art language models under the Apache 2.0 license. The standout is Falcon-40B, which rivals many closed-source models while being truly open. This opens up exciting possibilities for developers, researchers, and businesses.
This article dives into what makes Falcon unique and how to easily use it with tools from the Hugging Face ecosystem—covering inference, quantization, fine-tuning, and more.
The Falcon Models
The Falcon family includes two base models: Falcon-40B and its smaller sibling, Falcon-7B. At launch, Falcon-40B topped the Open LLM Leaderboard, and Falcon-7B was among the strongest models in its weight class.
Falcon-40B requires about 90GB of GPU memory—less than LLaMA-65B, which it outperforms. Falcon-7B needs only ~15GB, making it accessible on consumer hardware. TII also provides instruct versions (Falcon-7B-Instruct and Falcon-40B-Instruct) fine-tuned for conversational tasks.
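Those memory figures follow directly from the parameter counts: at 2 bytes per parameter in bfloat16, the weights alone account for most of the footprint, with the remainder going to activations and the K,V-cache. A rough sanity check (illustrative only; exact overhead depends on the runtime):

```python
def bf16_footprint_gb(n_params: float) -> float:
    """Approximate weight memory in GB for bfloat16 (2 bytes per parameter)."""
    return n_params * 2 / 1e9

falcon_40b = bf16_footprint_gb(40e9)  # ~80 GB of weights -> ~90 GB in practice
falcon_7b = bf16_footprint_gb(7e9)    # ~14 GB of weights -> ~15 GB in practice

print(f"Falcon-40B weights: ~{falcon_40b:.0f} GB")
print(f"Falcon-7B weights:  ~{falcon_7b:.0f} GB")
```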
The key to Falcon's quality is its training data, which is primarily (>80%) based on RefinedWeb, a massive web dataset built from CommonCrawl. Rather than relying on curated sources, TII focused on scaling and improving the quality of web data through large-scale deduplication and strict filtering. They also released a 600-billion-token extract of RefinedWeb for the community to use.
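To give a feel for what deduplication involves, here is a toy exact-match sketch using a hypothetical normalization step. RefinedWeb's actual pipeline is far more sophisticated (fuzzy MinHash plus exact-substring matching at web scale); this only illustrates the core idea that near-identical pages should hash to the same key:

```python
import hashlib

def normalize(text: str) -> str:
    # Toy normalization: lowercase and collapse whitespace so trivially
    # reformatted copies of a page hash to the same value.
    return " ".join(text.lower().split())

def dedup(docs):
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

crawl = ["Falcon is open.", "falcon   is open.", "A different page."]
print(dedup(crawl))  # the reformatted duplicate is dropped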
Falcon uses multiquery attention, which shares a single key and value projection across all attention heads instead of giving each head its own. This shrinks the K,V-cache during inference by a factor of 10-100, lowering memory costs and enabling further optimizations.
| Model | License | Commercial Use | Pretraining Tokens | Pretraining Compute (PF-days) | Leaderboard Score | K,V-cache Size (2K context) |
|---|---|---|---|---|---|---|
| StableLM-Alpha-7B | CC-BY-SA-4.0 | Yes | 1,500B | 700 | 34.37 | 800MB |
| LLaMA-7B | LLaMA | No | 1,000B | 500 | 45.65 | 1,100MB |
| MPT-7B | Apache 2.0 | Yes | 1,000B | 500 | 44.28 | 1,100MB |
| Falcon-7B | Apache 2.0 | Yes | 1,500B | 700 | 44.17 | 20MB |
| LLaMA-33B | LLaMA | No | 1,500B | 3200 | - | 3,300MB |
| LLaMA-65B | LLaMA | No | 1,500B | 6300 | 61.19 | 5,400MB |
| Falcon-40B | Apache 2.0 | Yes | 1,000B | 2800 | 58.07 | 240MB |
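The K,V-cache column in the table follows from the attention layout: the cache stores one key and one value vector per layer, per KV head, per token. A back-of-the-envelope calculation (the layer counts and head dimensions below are taken as assumptions from the published model configs, and the results match the table up to MB/MiB rounding) shows where the multiquery saving comes from:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    # K and V each store (kv_heads * head_dim) values per token per layer;
    # the leading 2 accounts for keys plus values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

MB = 1024 ** 2
llama_7b = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=2048)  # multi-head
falcon_7b = kv_cache_bytes(layers=32, kv_heads=1, head_dim=64, seq_len=2048)   # multiquery

print(f"LLaMA-7B  KV cache: {llama_7b / MB:.0f} MiB")
print(f"Falcon-7B KV cache: {falcon_7b / MB:.0f} MiB")
print(f"reduction: {llama_7b / falcon_7b:.0f}x")  # within the claimed 10-100x range
```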
Demo
Try Falcon-40B in this Space or the embedded playground, powered by Hugging Face's Text Generation Inference—the same technology behind HuggingChat.
A Core ML version of the 7B instruct model runs on an M1 MacBook Pro, with a Swift library for easy integration. Download Core ML weights from the repo.
Inference
To run inference with the transformers API, load the model in bfloat16 (this requires a recent NVIDIA GPU, since bfloat16 support arrived with the Ampere architecture). Because the architecture was new at release, you must also pass trust_remote_code=True to allow the model's custom code to run:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Evaluation
Falcon models perform strongly on standard benchmarks. As of this writing, Falcon-40B leads the Open LLM Leaderboard among truly open models, while requiring far less pretraining compute than LLaMA-65B (2,800 vs. 6,300 PF-days in the table above).
Fine-tuning with PEFT
Fine-tune Falcon with Parameter-Efficient Fine-Tuning (PEFT) using LoRA. Example for Falcon-7B:
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
tokenizer.pad_token = tokenizer.eos_token  # Falcon's tokenizer has no pad token by default

lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=["query_key_value"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Train on your dataset...
```
This approach drastically reduces memory requirements, enabling fine-tuning on a single GPU.
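To see why memory drops so much, count the trainable parameters. Each LoRA adapter adds two low-rank matrices (in_features x r and r x out_features) per targeted layer. Assuming Falcon-7B's published shapes (hidden size 4544, 32 layers, and a fused query_key_value projection of width 4544 + 2x64 for the single shared multiquery K/V head; treat these dimensions as assumptions and check the model config), the r=8 setup trains only a few million weights:

```python
hidden = 4544               # Falcon-7B hidden size (assumed from the released config)
qkv_out = hidden + 2 * 64   # fused Q projection plus one shared K and V head (multiquery)
layers = 32
r = 8

# Each adapter contributes A (hidden x r) and B (r x qkv_out) matrices.
lora_params = layers * (r * hidden + r * qkv_out)
total_params = 7e9          # rough full-model size

print(f"LoRA trainable params: {lora_params:,}")
print(f"fraction of full model: {lora_params / total_params:.4%}")
```

Since gradients and optimizer states are only kept for these adapter weights, the optimizer overhead shrinks by roughly the same factor.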
Conclusion
Falcon brings high-performance, truly open language models to the community. With Apache 2.0 licensing and integration with Hugging Face tools, Falcon is poised to accelerate innovation in NLP.