
From DIY to Managed: Why We Migrated to Hugging Face Inference Endpoints

AI
April 26, 2026 · 5:06 PM

Hugging Face recently launched Inference Endpoints, a managed service designed to simplify deploying transformers in production. It allows you to deploy almost any model from the Hugging Face Hub to AWS, Azure, or GCP on a range of instance types, including GPU. Our team decided to migrate some of our CPU-based ML models to this service. Here’s why we made the switch—and why you might want to consider it too.
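
As a concrete sketch, an endpoint can also be created programmatically with the huggingface_hub Python client rather than through the web UI. The repository, endpoint name, and instance identifiers below are placeholders, not our production configuration, and the available instance_size/instance_type values depend on the current catalog:

    # Sketch: creating a CPU Inference Endpoint with the huggingface_hub client.
    # The endpoint name, repository, and instance identifiers are placeholders.
    from huggingface_hub import create_inference_endpoint

    endpoint = create_inference_endpoint(
        name="text-classifier-prod",              # hypothetical endpoint name
        repository="my-org/roberta-classifier",   # hypothetical Hub repository
        framework="pytorch",
        task="text-classification",
        accelerator="cpu",
        vendor="aws",
        region="us-east-1",
        type="protected",                         # requires a Hub token to query
        instance_size="x4",                       # depends on the current catalog
        instance_type="intel-icl",                # Intel Ice Lake CPU
    )

    endpoint.wait()      # block until the endpoint is up
    print(endpoint.url)  # base URL for inference requests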

What Were We Doing Before?

Previously, our models were self-managed on AWS Elastic Container Service (ECS) backed by AWS Fargate. The workflow was cumbersome:

  • Train the model on a GPU instance
  • Wrap the model in a custom inference API
  • Build a Docker container and push it to ECR
  • Deploy and manage the container as an ECS service on Fargate

While ECS wasn't ideal for ML serving, it allowed our models to coexist with other containerized services, reducing cognitive load.
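
To give a sense of the overhead, the DIY route also means owning a serving layer along the lines of the simplified sketch below (the model name is a placeholder, not our production model), plus the Dockerfile and ECS task definition that wrap it:

    # Simplified sketch of the kind of custom serving API the ECS route requires.
    # The model name is a placeholder; run with: uvicorn app:app
    from fastapi import FastAPI
    from pydantic import BaseModel
    from transformers import pipeline

    app = FastAPI()
    classifier = pipeline("text-classification", model="roberta-base")  # placeholder model

    class PredictRequest(BaseModel):
        inputs: str

    @app.post("/predict")
    def predict(request: PredictRequest):
        # Run the model and return label/score pairs as JSON.
        return classifier(request.inputs)

Every model update then means rebuilding the image, pushing it to ECR, and rolling the ECS service.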

What’s the New Workflow?

With Inference Endpoints, the process simplifies dramatically:

  • Train model on a GPU instance
  • Upload to Hugging Face Hub
  • Deploy directly via Hugging Face Inference Endpoints

This eliminates the need for custom APIs, Docker containers, and ECR/ECS management. Other managed services like SageMaker, Seldon, or BentoML are options, but since we already use the Hub as our model registry and are invested in Hugging Face’s ecosystem (transformers, AutoTrain), Inference Endpoints fit naturally.
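
The upload step itself is a couple of calls with transformers. A minimal sketch, assuming a locally fine-tuned checkpoint and a placeholder repository name:

    # Sketch: pushing a fine-tuned model and tokenizer to the Hugging Face Hub.
    # The local checkpoint path and "my-org/roberta-classifier" are placeholders.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model = AutoModelForSequenceClassification.from_pretrained("./checkpoints/best")
    tokenizer = AutoTokenizer.from_pretrained("./checkpoints/best")

    model.push_to_hub("my-org/roberta-classifier", private=True)
    tokenizer.push_to_hub("my-org/roberta-classifier", private=True)

From there, the endpoint can be created from the Hub UI in a few clicks (or programmatically, as sketched earlier).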

Latency and Stability: No Compromises

We benchmarked different CPU endpoint types using Apache Bench (ab). For ECS, a large container had ~200ms latency in-region. For Inference Endpoints, we tested a RoBERTa-based text classification model under the following conditions:

  • Requester region: us-east-1
  • Requester instance: t3.medium
  • Endpoint region: us-east-1
  • Replicas: 1
  • Concurrent connections: 1
  • Requests: 1000

Results for Intel Ice Lake CPU endpoints:

size   | vCPU | Memory (GB) | ECS (ms) | Hugging Face (ms)
---------------------------------------------------------
small  | 1    | 2           | -        | ~296
medium | 2    | 4           | -        | 156 ± 51
large  | 4    | 8           | ~200     | 80 ± 30
xlarge | 8    | 16          | -        | 43 ± 31

At the large size, the Hugging Face endpoint was more than twice as fast as our custom ECS container (80 ms vs. ~200 ms). The slowest response from the large endpoint was just 108 ms, well within our requirements for real-time serving.
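
For anyone who wants to reproduce a rough version of this without ab, a minimal Python probe might look like the sketch below; the endpoint URL and token are placeholders, and ab remains the better tool for real load testing:

    # Minimal latency probe against an Inference Endpoint (illustrative only;
    # the URL and token are placeholders and ab produced the numbers above).
    import statistics
    import time

    import requests

    ENDPOINT_URL = "https://abc123.us-east-1.aws.endpoints.huggingface.cloud"  # placeholder
    HF_TOKEN = "hf_..."  # placeholder token for a protected endpoint

    headers = {"Authorization": f"Bearer {HF_TOKEN}"}
    payload = {"inputs": "Inference Endpoints made our deployment much simpler."}

    latencies = []
    for _ in range(1000):
        start = time.perf_counter()
        response = requests.post(ENDPOINT_URL, headers=headers, json=payload, timeout=10)
        response.raise_for_status()
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

    print(f"mean:  {statistics.mean(latencies):.0f} ms")
    print(f"stdev: {statistics.stdev(latencies):.0f} ms")
    print(f"max:   {max(latencies):.0f} ms")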

Cost Analysis: Paying for Convenience

Managed solutions often cost more, but the trade-off in time and effort can be worthwhile. Here’s a monthly cost comparison for equivalent instances:

size   | vCPU | Memory | ECS + Fargate | Hugging Face | Difference (% of HF price)
----------------------------------------------------------------------------------
small  | 1    | 2 GB   | $33.18        | $43.80       | 24%
medium | 2    | 4 GB   | $60.38        | $87.61       | 31%
large  | 4    | 8 GB   | $114.78       | $175.22      | 34%
xlarge | 8    | 16 GB  | $223.59       | $350.44      | 36%
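
For clarity, the last column is the premium expressed as a fraction of the Hugging Face price, i.e. (HF - ECS) / HF; a quick sanity check:

    # Sanity check for the cost table: premium as a fraction of the HF price.
    monthly_costs = {  # size: (ECS + Fargate, Hugging Face), USD per month
        "small":  (33.18, 43.80),
        "medium": (60.38, 87.61),
        "large":  (114.78, 175.22),
        "xlarge": (223.59, 350.44),
    }

    for size, (ecs, hf) in monthly_costs.items():
        print(f"{size}: +${hf - ecs:.2f}/month ({(hf - ecs) / hf:.0%} of the HF price)")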

At our current scale, the extra ~$60/month for a large instance is trivial compared to the time saved on API and container management. For hundreds of microservices, the premium might warrant reconsideration, but for now, the simplicity wins.

Notes and Caveats

  • Pricing displayed in the deployment GUI may differ from the official pricing page; I used the GUI values, which are higher.
  • ECS+Fargate costs are underestimated—they exclude data transfer and ECR storage fees.

Other Considerations

Inference Endpoints also offer different deployment options (for example, protected endpoints that require a Hugging Face token to query) and support hosting multiple models on a single endpoint, which can further optimize costs. These features add flexibility as our needs grow.
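
For example, a protected endpoint only answers requests that present a valid Hub token. With the huggingface_hub client, querying one looks roughly like this (URL and token are placeholders):

    # Sketch: querying a protected Inference Endpoint with a Hub access token.
    # The endpoint URL and token are placeholders.
    from huggingface_hub import InferenceClient

    client = InferenceClient(
        model="https://abc123.us-east-1.aws.endpoints.huggingface.cloud",  # placeholder URL
        token="hf_...",  # placeholder token
    )

    result = client.text_classification("The migration went smoothly.")
    print(result)  # labels with scores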

To Conclude…

Migrating to Hugging Face Inference Endpoints has reduced our deployment complexity from a multi-step DevOps process to a few clicks. The slight cost increase is far outweighed by the gains in developer productivity and reduced cognitive load. If you’re already using the Hugging Face Hub and want a streamlined path to production, this service is worth considering.