9 Model Fusion
What if you could take a model that excels at creative writing, another that is brilliant at coding, and a third that dominates at math, and combine them into a single model that does all three? No additional training, no extra data, no GPU time. Just load the weights, merge them, save the result. This is the promise of model fusion (also called model merging), and it has become one of the most fascinating and practically useful techniques in the open-source AI community.
If you look at the top of the Open LLM Leaderboard on HuggingFace, you will notice that many of the highest-ranking models are not trained from scratch. They are merges: combinations of existing fine-tuned models, blended together using techniques described in this chapter. Model merging has become a competitive sport in the open-source community, with practitioners “breeding” better models through creative combinations.
9.1 Why Merge Models?
Fine-tuning a base model on different datasets produces specialized models: one might excel at code, another at creative writing, a third at mathematical reasoning. Model merging combines these specializations into a single model without any additional training.
The appeal is irresistible:
- No training compute: Merging is a handful of arithmetic passes over the weights, so you only need enough RAM to load and save them. No GPU training required.
- Combining capabilities: Merge a code model, a chat model, and a math model, and (if things go well) you get a model that can do all three.
- Reducing toxicity: You can subtract undesirable behaviors using task arithmetic (more on this below).
- Democratic AI development: Anyone with a laptop can create competitive models by merging existing checkpoints.
9.2 How Merging Works: The Intuition
Fine-tuning a pre-trained model changes its weights, but typically only by a small amount. The task vector \(\tau = \theta_{\text{fine-tuned}} - \theta_{\text{base}}\) captures exactly what the fine-tuning “learned.” If two models were fine-tuned from the same base, their task vectors represent different skills acquired from different data.
The key insight is that these task vectors often live in different “regions” of parameter space: coding skills modify different weights than creative writing skills. When the changes are sufficiently disjoint, you can simply add them together without interference.
Picture a photo with Instagram filters layered on top. One filter adjusts the color temperature, another adjusts the contrast, a third adds a vignette. Because they affect different aspects of the image, you can layer them one on top of another without interference. Model merging works the same way: as long as different fine-tuning runs modify different weights, you can stack the changes.
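The stacking intuition can be sketched with toy checkpoints. This is a minimal illustration, assuming checkpoints are plain Python dicts that map a parameter name to a flat list of floats (real checkpoints hold tensors); the helper names `task_vector` and `overlap` and all numeric values are made up for the example.

```python
def task_vector(finetuned, base):
    """Return tau = theta_finetuned - theta_base, parameter by parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }

def overlap(tau_a, tau_b, eps=1e-8):
    """Count scalar weights that both task vectors change by more than eps."""
    return sum(
        1
        for name in tau_a
        for a, b in zip(tau_a[name], tau_b[name])
        if abs(a) > eps and abs(b) > eps
    )

# Toy checkpoints: one "layer" with four weights (illustrative values only).
base = {"layer.weight": [0.5, -0.25, 1.0, 0.0]}
code = {"layer.weight": [0.75, -0.25, 1.0, 0.0]}   # fine-tune changed weight 0
prose = {"layer.weight": [0.5, -0.25, 1.0, 0.5]}   # fine-tune changed weight 3

tau_code = task_vector(code, base)
tau_prose = task_vector(prose, base)

# No weight is touched by both fine-tunes, so the deltas can simply be added.
print(overlap(tau_code, tau_prose))
stacked = {
    name: [b + c + p
           for b, c, p in zip(base[name], tau_code[name], tau_prose[name])]
    for name in base
}
print(stacked)
```

In real models the overlap is rarely exactly zero, which is why the techniques in the next section spend so much effort on handling interference.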
9.3 Merging Techniques
9.3.1 Linear Interpolation (Model Soups)
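The simplest approach: take a weighted average of the parameters of two or more models (Wortsman et al. 2022). For two models, \[\theta_{\text{merged}} = \alpha \cdot \theta_A + (1-\alpha) \cdot \theta_B\] where \(\alpha \in [0, 1]\) sets the share of model A. With more than two models, each model gets its own weight and the weights sum to 1.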
Wortsman et al. showed that averaging multiple fine-tuned variants of the same base model (“model soups”) often outperforms the best individual model, especially on out-of-distribution data. This works because averaging smooths out the idiosyncratic overfitting of each model while preserving the shared learned features.
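A weighted average of checkpoints might look like the sketch below. It assumes toy checkpoints stored as dicts of flat float lists with identical parameter names and sizes; `linear_merge` is a hypothetical helper, not part of any library.

```python
def linear_merge(checkpoints, weights):
    """Weighted average of checkpoints that share parameter names and sizes."""
    if len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)  # normalize so the weights act as proportions
    merged = {}
    for name in checkpoints[0]:
        # zip(*...) walks the same scalar position across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [
            sum(w * x for w, x in zip(weights, col)) / total for col in columns
        ]
    return merged

theta_a = {"w": [1.0, 2.0], "b": [0.0]}
theta_b = {"w": [3.0, 0.0], "b": [1.0]}

# Two-model interpolation with alpha = 0.25 on model A.
alpha = 0.25
merged = linear_merge([theta_a, theta_b], [alpha, 1 - alpha])
print(merged)

# A uniform "soup" simply gives every ingredient the same weight.
soup = linear_merge([theta_a, theta_b], [1, 1])
print(soup)
```

Normalizing by the sum of the weights means callers can pass raw proportions such as `[1, 1, 2]` without rescaling them first.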
9.3.2 SLERP (Spherical Linear Interpolation)
Linear interpolation in weight space can shrink the magnitude of weight vectors: the average of two unit vectors that point in different directions is shorter than either one. SLERP instead interpolates along the arc of the hypersphere that connects the two directions, so the magnitude is preserved. This tends to produce smoother, more stable merges.
SLERP can only merge exactly two models at a time (not three or more directly), but it often produces higher-quality results than linear interpolation for pairs of models.
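The difference between the two interpolations is easy to see on a pair of unit vectors. The sketch below is one common formulation of SLERP on flat vectors, with a fallback to linear interpolation when the angle is degenerate; the function name, the `eps` threshold, and the convention that `t = 0` returns the first vector are assumptions of this example.

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherically interpolate from v0 (t=0) to v1 (t=1)."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    lerp = [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    if norm0 < eps or norm1 < eps:
        return lerp  # direction is undefined for a zero vector
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))  # angle between vectors
    if math.sin(omega) < eps:
        return lerp  # nearly parallel or opposite: the arc is ill-defined
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

a, b = [1.0, 0.0], [0.0, 1.0]
halfway_linear = [(x + y) / 2 for x, y in zip(a, b)]
halfway_slerp = slerp(0.5, a, b)
print(norm(halfway_linear))  # shorter than 1
print(norm(halfway_slerp))   # stays on the unit circle, up to rounding
```

In a real merge the same function would typically be applied tensor by tensor, with each tensor flattened to a vector first.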
9.3.3 Task Arithmetic
Task vectors \(\tau = \theta_{\text{ft}} - \theta_{\text{base}}\) capture what fine-tuning “learned.” Task arithmetic manipulates these vectors directly:
\[\theta_{\text{merged}} = \theta_{\text{base}} + \lambda_1 \tau_1 + \lambda_2 \tau_2 + \cdots\]
This allows powerful operations:
- Adding skills: \(\theta_{\text{base}} + \tau_{\text{code}} + \tau_{\text{math}}\) combines coding and math abilities.
- Removing behaviors: \(\theta_{\text{base}} + \tau_{\text{chat}} - \tau_{\text{toxic}}\) subtracts toxicity from a chat model.
- Scaling: Adjusting \(\lambda\) controls how much of each skill to incorporate.
One of the most elegant applications of task arithmetic is the ability to subtract undesirable behaviors. If you fine-tune a model on toxic data to produce a “toxic expert,” you can compute its task vector and subtract it from another model. The result is measurably less toxic. You are literally doing algebra on learned behaviors.
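The addition and subtraction described here reduce to a few lines of arithmetic. The following sketch assumes toy checkpoints and task vectors stored as dicts of flat float lists; `apply_task_vectors` and the example values are hypothetical.

```python
def apply_task_vectors(base, task_vectors, lambdas):
    """Compute theta_base + sum_i lambda_i * tau_i."""
    merged = {name: list(values) for name, values in base.items()}
    for tau, lam in zip(task_vectors, lambdas):
        for name, delta in tau.items():
            merged[name] = [m + lam * d for m, d in zip(merged[name], delta)]
    return merged

base = {"w": [1.0, 1.0, 1.0]}
tau_chat = {"w": [0.5, 0.0, 0.25]}    # what chat fine-tuning changed
tau_toxic = {"w": [0.0, 0.0, 0.25]}   # what toxic fine-tuning changed

# A negative lambda subtracts a behavior instead of adding it.
detoxified = apply_task_vectors(base, [tau_chat, tau_toxic], [1.0, -1.0])
print(detoxified)
```

Values of \(\lambda\) between 0 and 1 would add or remove only part of a task vector, which is the scaling operation from the list above.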
9.3.4 TIES-Merging
TIES-Merging (Yadav et al. 2023) addresses the fundamental problem of interference: when different models push the same weight in opposite directions, their changes cancel out during averaging, destroying useful information.
TIES operates in three steps:
- Trim: Set small-magnitude task vector components to zero. These are likely noise, not signal.
- Elect signs: For each parameter, choose the sign (positive or negative) that has the largest total magnitude across all models being merged.
- Disjoint merge: Average only the components that agree with the elected sign. Components that disagree are excluded.
This selective merging preserves the strongest signals from each model while filtering out noise and conflicting changes.
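The three steps can be sketched on flat task vectors as follows. This is a simplified illustration rather than the reference implementation: `ties_merge`, the `density` parameter (the fraction of entries kept per task vector), and the scaling factor `lam` are names chosen for this example, and positions whose trimmed values cancel exactly are left at the base value.

```python
def ties_merge(base, task_vectors, density=0.2, lam=1.0):
    """Trim, elect signs, and disjoint-merge flat task vectors onto base."""
    n = len(base)
    keep = max(1, int(round(density * n)))
    trimmed = []
    for tau in task_vectors:
        # Trim: keep only the `keep` largest-magnitude entries of each vector.
        order = sorted(range(n), key=lambda i: abs(tau[i]), reverse=True)
        top = set(order[:keep])
        trimmed.append([tau[i] if i in top else 0.0 for i in range(n)])
    merged = []
    for i in range(n):
        column = [t[i] for t in trimmed]
        # Elect sign: the sign of the sum is the side with more total magnitude.
        total = sum(column)
        if total == 0.0:
            merged.append(base[i])  # nothing survived, or an exact tie
            continue
        agree = [x for x in column if x != 0.0 and (x > 0) == (total > 0)]
        # Disjoint merge: average only entries that match the elected sign.
        merged.append(base[i] + lam * sum(agree) / len(agree))
    return merged

base = [0.0] * 5
tau_a = [0.5, -0.25, 0.0625, 0.0, 0.75]
tau_b = [-0.25, -0.75, 0.0, 0.03125, 0.5]

merged = ties_merge(base, [tau_a, tau_b], density=0.6)
print(merged)
```

At position 0 the two models disagree (+0.5 versus -0.25). Plain averaging would give 0.125, whereas this sketch elects the positive sign and keeps the full +0.5.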
9.3.5 DARE (Drop and Rescale)
DARE takes a radical approach: randomly drop a large fraction \(p\) (e.g., 90%) of the delta weights and rescale the survivors by \(1/(1-p)\) so that the expected value of each delta is unchanged. The intuition is that fine-tuning produces many small, redundant weight changes, and only a sparse subset carries the essential information.
When combined with TIES, DARE produces some of the best merges in practice. The combination is known as DARE-TIES.
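Drop-and-rescale is short enough to write out directly. The sketch below applies it to a flat list of delta weights; the function name `dare`, the fixed `seed`, and the toy input are assumptions of this example.

```python
import random

def dare(tau, drop_rate=0.9, seed=0):
    """Randomly zero a fraction of delta weights and rescale the survivors."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_rate)  # keeps the expected delta unchanged
    return [0.0 if rng.random() < drop_rate else d * scale for d in tau]

tau = [0.01] * 10_000            # toy task vector: many small, equal deltas
sparse = dare(tau, drop_rate=0.9)

kept = [d for d in sparse if d != 0.0]
print(len(kept) / len(sparse))    # roughly 0.1 of the entries survive
print(sum(sparse) / len(sparse))  # mean delta stays close to 0.01
```

For a DARE-TIES style merge, the sparsified task vectors would then go through sign election and disjoint averaging in place of the magnitude-based trim.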
9.4 MergeKit: The Practitioner's Toolkit
MergeKit (Goddard et al. 2024) is the most widely used open-source toolkit for model merging. It supports all the techniques above (linear, SLERP, TIES, DARE, task arithmetic) and operates out-of-core: it processes weights layer-by-layer from disk, enabling merges of models that are too large to fit in RAM.
Using MergeKit is straightforward: write a YAML configuration file specifying the models, the merge method, and the parameters, then run the merge command. The result is a new model that you can upload to HuggingFace or convert to GGUF for local use.
Here is a typical MergeKit workflow: start with a strong base model (e.g., LLaMA 3 8B), find two fine-tuned variants on HuggingFace (one optimized for conversation, one for code), and merge them using DARE-TIES. Evaluate the result on benchmarks. If one skill is too weak, increase its weight. If the model is incoherent, try SLERP instead. Model merging is as much art as science, and experimentation is the key.
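As a rough illustration of that workflow, the script below writes a DARE-TIES configuration to disk. The repository names are placeholders, not real models, and the field names (`models`, `parameters`, `density`, `weight`, `merge_method`, `base_model`, `dtype`) follow the layout of MergeKit's published example configs; check the documentation of your installed version before relying on them.

```python
from pathlib import Path

# Placeholder repository names: substitute real checkpoints that share one base.
BASE = "your-org/base-8b"
CHAT = "your-org/base-8b-chat"
CODE = "your-org/base-8b-code"

config = f"""\
models:
  - model: {BASE}
    # the base model needs no parameters
  - model: {CHAT}
    parameters:
      density: 0.5   # fraction of delta weights kept
      weight: 0.5    # contribution of this model
  - model: {CODE}
    parameters:
      density: 0.5
      weight: 0.5
merge_method: dare_ties
base_model: {BASE}
dtype: bfloat16
"""

Path("dare_ties.yml").write_text(config)
print(config)

# Then, from a shell with MergeKit installed:
#   mergekit-yaml dare_ties.yml ./merged-model
```

If the merged model is weak at code, raising the `weight` of the code model and rerunning the merge is the kind of adjustment described above.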
9.5 Mixture of Experts as Soft Merging
An alternative to hard weight merging is routing different inputs to different expert models at inference time. This is the Mixture of Experts (MoE) approach.
Mixtral (Jiang et al. 2024) uses a learned router to select 2 of 8 experts for each token at every layer. Each expert is a separate FFN, and the router is a small learned gating layer that scores the experts for the current token. Because the attention and embedding weights are shared and only two expert FFNs run per token, the model has about 47B total parameters but uses only about 13B of them per token.
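Top-2 routing for a single token can be sketched as follows. This is a toy illustration, not Mixtral's code: the router logits are given directly instead of being computed by a learned layer, the "experts" are simple scaling functions, and the names `top2_route` and `moe_layer` are hypothetical.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and softmax over their logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:2]
    # Subtract the max logit before exponentiating for numerical stability.
    exps = [math.exp(router_logits[i] - router_logits[top[0]]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

def moe_layer(x, router_logits, experts):
    """Gate-weighted sum of the two selected experts' outputs for one token."""
    out = [0.0] * len(x)
    for idx, gate in top2_route(router_logits):
        y = experts[idx](x)  # only the selected experts are ever evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out

# Eight toy experts: expert k multiplies its input by k.
experts = [lambda x, k=k: [k * v for v in x] for k in range(8)]

token = [1.0, 2.0]
logits = [0.1, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # experts 2 and 5 tie for top

print(top2_route(logits))
print(moe_layer(token, logits, experts))
```

The other six experts are never called for this token, which is where the gap between total and active parameters comes from.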
The open-source community has also assembled MoE models from independently fine-tuned dense models, a variant of “frankenmerging” often nicknamed “frankenMoE.” You take several 7B models (each fine-tuned for a different task), use them as experts, and train or otherwise initialize a small router to select the right expert for each input. This creates a modular, extensible system where new skills can be added by training a new expert and adding it to the roster.
9.6 When Does Merging Work (and When Does It Fail)?
Model merging is not magic. It has clear success conditions and failure modes:
Merging works well when:
- All models share the same base model and architecture.
- Fine-tuning tasks are complementary rather than conflicting.
- Delta weights are small or largely redundant (most parameters barely changed during fine-tuning, as is typical after light fine-tuning such as LoRA).
Merging fails when:
- Models have diverged too far from the base (e.g., after extensive continued pre-training on very different data).
- Tasks are fundamentally incompatible (e.g., a model trained to always refuse harmful requests merged with a model trained to never refuse).
- The merge ratios are wrong: too much weight on one model drowns out the others.
Sakana AI introduced evolutionary model merging, which uses evolutionary algorithms to search for good merge configurations (which layers to merge, what weights to use, which technique to apply at each layer). Instead of manually tuning merge parameters, the algorithm explores the space of possible merges and keeps the configurations that score best on a benchmark. This automated approach has produced merged models that outperform each of their source models on the targeted benchmarks.
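The search loop behind this idea can be illustrated with a deliberately simple mutate-and-select procedure over per-model merge weights. This is not Sakana AI's algorithm: the function `evolve_merge_weights`, its parameters, and the `toy_benchmark` score (a stand-in for "merge with these weights, then evaluate") are all invented for the example.

```python
import random

def evolve_merge_weights(fitness, n_models, generations=30, children=16,
                         sigma=0.2, seed=0):
    """Mutate-and-select search over per-model merge weights."""
    rng = random.Random(seed)
    best = [1.0 / n_models] * n_models   # start from a uniform merge
    best_score = fitness(best)
    for _ in range(generations):
        for _ in range(children):
            # Mutate the current best with Gaussian noise, keeping weights >= 0.
            child = [max(0.0, w + rng.gauss(0.0, sigma)) for w in best]
            score = fitness(child)
            if score > best_score:       # keep only improvements
                best, best_score = child, score
    return best, best_score

# Stand-in for a real evaluation: a made-up score that peaks at (0.7, 0.3).
def toy_benchmark(weights):
    target = [0.7, 0.3]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

start_score = toy_benchmark([0.5, 0.5])
weights, score = evolve_merge_weights(toy_benchmark, n_models=2)
print(weights, score)
```

In a real setting each fitness call is a full merge plus a benchmark run, so the number of candidates evaluated dominates the cost of the search.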
9.7 Exercises
- Install MergeKit and merge two LoRA fine-tunes of the same base model using linear interpolation. Evaluate the merged model alongside both individual models on a benchmark (e.g., MMLU or HumanEval). Does the merge retain both capabilities?
- Experiment with task arithmetic: add a “math” task vector and subtract a “toxicity” task vector from a base model. Evaluate whether math ability improves and toxic outputs decrease.
- Try merging the same two models using SLERP, TIES, and DARE-TIES. Compare the results on a benchmark. Which technique works best for your particular combination?
- Create a simple MoE system: take three independently fine-tuned 7B models (code, chat, math) and use a small classifier to route queries to the most appropriate model. Compare this with a TIES merge of the same three models.
- Explore the HuggingFace Open LLM Leaderboard. Identify three top-performing models that are merges. What base models and techniques were used? Can you reproduce or improve on their results?