Getting a Vision Language Model (VLM) up and running on an Intel CPU is easier than you might think. Here's a straightforward guide to get you started in three simple steps.
Step 1: Install Required Dependencies
First, ensure you have Python 3.8 or later installed. Then install the necessary libraries; note that transformers does not pull in PyTorch or Pillow automatically, and the code below needs both:
pip install torch transformers accelerate pillow
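Before downloading any model weights, a quick optional sanity check confirms the installation; this snippet just imports the two core libraries and prints their versions:
import torch
import transformers

# Both imports should succeed; the versions confirm what pip installed.
print("torch", torch.__version__)
print("transformers", transformers.__version__)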
Step 2: Choose and Load a VLM
Select a model from Hugging Face, such as llava-hf/llava-1.5-7b-hf. Keep in mind that a 7B model needs roughly 28 GB of RAM at the default float32 precision, so check your memory headroom first. Load it with:
from transformers import LlavaProcessor, LlavaForConditionalGeneration

# The processor bundles the image preprocessor and tokenizer; the model
# downloads its weights on first run (roughly 14 GB) and loads them on
# the CPU, which is the default device when no GPU is available.
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
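If memory is tight, one optional tweak is loading the weights in bfloat16, which roughly halves RAM use compared with the float32 default. This is a minimal sketch, assuming your transformers version supports the torch_dtype argument; recent Intel Xeon CPUs accelerate bfloat16 natively via AMX, while on older CPUs float32 may actually run faster:
import torch
from transformers import LlavaForConditionalGeneration

# bfloat16 weights take ~2 bytes per parameter instead of 4; worthwhile
# on CPUs with native bfloat16 support, otherwise float32 is safer.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.bfloat16,
)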
Step 3: Run Inference on an Image
Provide an image and a prompt, then generate a response. Note that LLaVA-1.5 expects its chat template, including the special <image> placeholder token, so the prompt below follows that format:
from PIL import Image

image = Image.open("path/to/your/image.jpg")

# LLaVA-1.5 was trained on this chat template; the <image> token marks
# where the image features are spliced into the prompt.
prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(outputs[0], skip_special_tokens=True))
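If generation feels sluggish, one CPU-side knob worth experimenting with is PyTorch's thread count. PyTorch sizes its thread pool automatically, but pinning it to your physical core count sometimes improves throughput; the count below is a placeholder, not a recommendation:
import torch

# Match the thread pool to physical cores (not hyper-threads); the
# value here is a placeholder for illustration.
torch.set_num_threads(8)
print("using", torch.get_num_threads(), "threads")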
That's it! You now have a working VLM on your Intel CPU. CPU generation is much slower than GPU inference, so expect each response to take a while. Experiment with different models and prompts to explore the capabilities of vision-language AI.