Deepchecks enables evaluating LLM-based applications during research, CI/CD and production. It provides robust automatic scoring, version comparison, and auto-calculated metrics for properties like relevance and grounded-in-context. It’s built upon the foundations of our open-source ML testing package, and addresses both quality and risk concerns. Deepchecks’ standout feature is the ability to accurately estimate the expected annotation for LLM responses, surpassing common methods like GPT-as-a-judge. The solution is designed to enable both "coders" and "clickers" to iterate quickly while taking full control of both quality and compliance for their LLM-based applications.
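
For a rough sense of what an auto-calculated property can look like, the sketch below scores "grounded-in-context" with a simple token-overlap heuristic. This is an illustrative assumption, not Deepchecks' actual method or API; production systems typically rely on embeddings or NLI-style models instead.

```python
# Illustrative only: a crude "grounded-in-context" property score.
# NOT Deepchecks' implementation -- just a token-overlap heuristic
# to show the shape of a per-response property metric.
import re


def grounded_in_context_score(response: str, context: str) -> float:
    """Fraction of response tokens that also appear in the context (0.0-1.0)."""
    tokenize = lambda text: set(re.findall(r"[a-z0-9']+", text.lower()))
    response_tokens = tokenize(response)
    if not response_tokens:
        return 0.0
    context_tokens = tokenize(context)
    return len(response_tokens & context_tokens) / len(response_tokens)


if __name__ == "__main__":
    context = "Deepchecks evaluates LLM apps across research, CI/CD and production."
    response = "Deepchecks evaluates LLM apps in production."
    print(f"grounded-in-context: {grounded_in_context_score(response, context):.2f}")
```

A score near 1.0 suggests the response stays close to the supplied context, while a low score flags content that may not be grounded; a real scorer would be far more robust, but the per-response numeric property is the key idea.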