At Inspired Cognition, we’re on a mission to make it easy to build reliable AI-driven systems. To do so, we are building the next generation of tools that allow AI developers to deploy their systems with confidence. Today, we are happy to announce Critique, a new tool in the arsenal for reliable AI. Critique is a quality control tool that allows developers of generative AI systems (systems that generate outputs such as text and images) to assess whether the outputs produced by these systems are high-quality and trustworthy.
We’ve all been amazed by the recent progress in AI. We now have AI tools that can concisely summarize long meetings, translate between languages, chat about a wide range of topics, write programs, and generate images. But at the same time, AI is not perfect. Sometimes it makes mistakes, and sometimes those mistakes are very bad. Famous examples include chatbots that generate racist and sexist comments, question answering systems that make up information, and mistranslations that cause serious misunderstandings. While it’s easy to build a quick and impressive prototype that gets it right 70% of the time, it’s much harder to build a system that’s reliable and safe.
Critique helps solve this problem by automatically detecting when the output of a generative AI system may not be trustworthy. The main way to access Critique is through its simple API, which can assess generated text in dialogue, summarization, translation, and more. It’s simple to use: just add a few lines of code to your existing programs, like the code snippet below. Check out the getting started page for details.
import os

from inspiredco.critique import Critique

# Create a client using your Inspired Cognition API key
client = Critique(api_key=os.environ["INSPIREDCO_API_KEY"])

# Each entry contains a generated output ("target") to be evaluated
dataset = [
    {"target": "This is a really nice test sentence."},
    {"target": "This sentence not so good."},
]

# Score the fluency of each output with the UniEval metric
results = client.evaluate(
    metric="uni_eval",
    config={"task": "summarization", "evaluation_aspect": "fluency"},
    dataset=dataset,
)
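The call returns a fluency score for the dataset as a whole as well as for each individual example. As a rough sketch of what inspecting the response might look like (the exact response format is described in the documentation; the "overall" and "examples" keys and the "value" field here are assumptions for illustration):

# Sketch of inspecting the response. The key names below are assumptions
# used for illustration; consult the documentation for the actual format.
print("Overall fluency:", results["overall"]["value"])
for example, result in zip(dataset, results["examples"]):
    print(f"{result['value']:.3f}  {example['target']}")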
You can also play with Critique in the Critique Playground to see how it works on the text of your choice.
Critique has many use cases in real-world applications. For example, AI app builders can use it for:
- Output Filtering: Inform users when a generation might be bad, or avoid showing unreliable outputs at all (see the sketch after this list).
- System Comparison: Compare the performance of multiple systems to decide which one to put into production.
- System Monitoring: Monitor outputs of the production system to make sure that there are no quality regressions and that the system is performing well for all user segments.
- Selective Annotation: Send flagged outputs to annotators to improve your system performance.
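To make the output-filtering use case concrete, here is a minimal sketch of how a Critique score could gate which outputs get shown to users. It reuses the client from the snippet above; the threshold value and the response keys are assumptions chosen for illustration, not part of the documented API.

# Output-filtering sketch: only surface text that scores above a
# hypothetical quality threshold.
FLUENCY_THRESHOLD = 0.5  # assumed threshold, tune for your application

def filter_outputs(client, candidates):
    """Return only the candidate outputs whose fluency score clears the bar."""
    results = client.evaluate(
        metric="uni_eval",
        config={"task": "summarization", "evaluation_aspect": "fluency"},
        dataset=[{"target": text} for text in candidates],
    )
    return [
        text
        for text, result in zip(candidates, results["examples"])
        if result["value"] >= FLUENCY_THRESHOLD  # assumed per-example score field
    ]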
Critique can evaluate outputs according to a broad variety of criteria, including translation quality, summary quality, toxicity, fluency, and factual consistency. Its quality-assessment methods are based on world-leading research and accurately point out problems, so users can focus on the examples that are truly problematic for their use case.
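As a rough illustration of switching criteria, the snippet below changes the evaluation aspect in the config from the earlier example. The "consistency" aspect name and the "source" field for the document being summarized are assumptions here; the documentation lists the metrics, aspects, and input fields that are actually supported.

# Sketch of evaluating a different aspect (factual consistency) with the same
# client. The "consistency" aspect and "source" field are assumptions for
# illustration; check the documentation for the supported options.
results = client.evaluate(
    metric="uni_eval",
    config={"task": "summarization", "evaluation_aspect": "consistency"},
    dataset=[
        {
            "source": "The meeting covered the Q3 roadmap and hiring plans.",
            "target": "The meeting was about the Q3 roadmap and hiring.",
        },
    ],
)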
If you’d like to learn more about Critique, check out the documentation and get in contact with any questions. We’re excited to make Critique as useful as possible, so please reach out if you’re interested and keep your eye on this space for new features and updates!