Depth from focus (DFF) estimates scene depth by analyzing images captured at different focus distances. Recent deep learning–based DFF methods can predict depth at a metric scale; however, their accuracy is often limited by the relatively small amount of available training data. To overcome this limitation, we propose a more accurate DFF framework that leverages prior knowledge from monocular depth estimation (MDE) models trained on large-scale datasets. Specifically, at test time, the output of an existing DFF method is used as a reference, and the parameters of the MDE model are optimized on a per-scene basis. Experiments on synthetic and real-world datasets demonstrate that the proposed method improves both depth accuracy and structural quality, and yields consistent improvements over existing DFF approaches across a wide range of scenes.
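
The per-scene, test-time optimization described above can be illustrated with a deliberately simplified sketch. It is not the proposed implementation. It assumes that the MDE model is reduced to two hypothetical parameters, `scale` and `shift`, applied to a fixed relative depth map, and that the mismatch with the DFF reference is measured by a mean squared error loss. Both choices, and the function name `refine_per_scene`, are assumptions made for illustration only; a real MDE network has many more parameters and the actual objective may differ.

```python
# Toy sketch of per-scene test-time optimization (NOT the paper's implementation).
# Assumption: the "MDE model" is reduced to two parameters (scale, shift) applied
# to a fixed relative depth map. The MSE loss is also an assumption.

def refine_per_scene(relative_depth, dff_depth, steps=2000, lr=0.1):
    """Fit scale and shift so that scale * relative_depth + shift matches dff_depth.

    relative_depth: list of floats, relative depth from a (hypothetical) MDE model.
    dff_depth:      list of floats, metric depth from an existing DFF method,
                    used as the reference for this single scene.
    Returns the optimized (scale, shift) pair.
    """
    scale, shift = 1.0, 0.0
    n = len(relative_depth)
    for _ in range(steps):
        grad_scale = 0.0
        grad_shift = 0.0
        for r, d in zip(relative_depth, dff_depth):
            residual = scale * r + shift - d      # prediction minus DFF reference
            grad_scale += 2.0 * residual * r / n  # d(MSE)/d(scale)
            grad_shift += 2.0 * residual / n      # d(MSE)/d(shift)
        # Plain gradient descent step on the per-scene parameters.
        scale -= lr * grad_scale
        shift -= lr * grad_shift
    return scale, shift


# Usage: a five-pixel "scene" whose DFF reference is an affine map of the
# relative depth (synthetic values chosen only for this illustration).
relative = [0.0, 0.25, 0.5, 0.75, 1.0]
reference = [2.0 * r + 0.5 for r in relative]
scale, shift = refine_per_scene(relative, reference)
refined_depth = [scale * r + shift for r in relative]
```

In this toy setting the optimized parameters map the relative depth onto the metric scale of the DFF reference. The sketch shows only the optimization loop; it does not show how the MDE prior improves structural quality.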