
Auditing Machine Learning Models with Azure's Responsible AI Dashboard

Modern enterprises increasingly rely on machine learning models to make decisions, so ensuring these models are responsible, fair, and transparent is critical. Microsoft Azure’s Responsible AI Dashboard is a unified toolset that helps data science and IT teams audit and debug ML models to meet these needs. It brings together multiple mature Responsible AI tools in one interface, including model performance and fairness assessment, data exploration, interpretability, error analysis, counterfactual what-if analysis, and causal inference. By centralizing these capabilities, the dashboard makes it easier to identify issues and make informed, data-driven decisions about model improvements. 


In this blog, Avyka experts walk through setting up Azure’s Responsible AI Dashboard, loading models and datasets, and interpreting its core components to ensure responsible and trustworthy AI. 

 


Why Auditing Models Matters 

Auditing models is important for several reasons. It allows teams to evaluate model reliability, interpretability, fairness, and compliance. By examining how and why an AI system behaves in a certain way, you can uncover biases or error patterns that might otherwise go unnoticed. For example, auditing can reveal if a model performs worse for a particular subgroup of users, indicating a fairness issue. It also helps answer questions like “What is the minimum change in input features needed to get a different outcome from the model?” or “Which features are driving the model’s predictions?” Answering these questions gives insight into model behavior and potential risks. In short, rigorous model auditing builds trust with stakeholders and ensures your AI systems meet ethical and regulatory standards. 


Azure’s Responsible AI Dashboard integrates multiple tools (data explorer, fairness, interpretability, error analysis, counterfactuals, causal analysis, etc.) into one interface for holistic model auditing. This unified approach enables enterprises to debug models and address issues like bias or errors early in the AI lifecycle. 


The Responsible AI Dashboard is part of Azure Machine Learning and was designed to assist data scientists and ML engineers in understanding model bias, interpreting model results (globally and locally), and diagnosing model errors. By reducing biases and improving transparency, organizations can boost model accuracy and fairness while maintaining compliance. In the sections below, we’ll walk through how to set up this dashboard in Azure, load your model and data, and interpret the various components of the interface during a model audit. 


Setting Up the Azure Responsible AI Dashboard 

Before we can audit a model, we need to set up the Responsible AI Dashboard in our Azure environment. Below are the prerequisites and a step-by-step guide to create the dashboard. 

Prerequisites 


  • Azure Machine Learning Workspace: You should have an Azure subscription and an Azure ML workspace ready to use.  


  • Registered Machine Learning Model: The model you want to audit must be trained and registered in the Azure ML workspace (e.g. in the model registry). Currently, the Responsible AI Dashboard primarily supports tabular classification or regression models in MLflow format (such as scikit-learn models). If your model isn’t registered yet, register it first so it’s available in Azure ML.  


  • Datasets (Training and Test): Prepare the dataset used to train the model, and a test or validation dataset with ground truth labels. These should be registered as MLTable datasets in the workspace (the dashboard’s UI supports only MLTable format for now). The train dataset will be used to compute insights (like feature importance, causal analysis), and the test dataset will be used to evaluate model performance, fairness, and errors in the dashboard. 
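If you prefer to script the data registration, the sketch below shows one way to register the train and test folders as MLTable assets with the Azure ML Python SDK v2. The asset names, folder paths, and workspace details are placeholders for your own values, and each folder is assumed to already contain an MLTable definition file alongside the data.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Connect to your workspace (placeholders for subscription, resource group, workspace).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Each local folder is expected to contain an MLTable file plus the underlying data files.
train_data = Data(
    name="loans-train",      # hypothetical asset name
    path="./data/train",
    type=AssetTypes.MLTABLE,
)
test_data = Data(
    name="loans-test",       # hypothetical asset name
    path="./data/test",
    type=AssetTypes.MLTABLE,
)

ml_client.data.create_or_update(train_data)
ml_client.data.create_or_update(test_data)
```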


Step-by-Step: Creating a Responsible AI Dashboard in Azure ML Studio 

Once the prerequisites are in place, Azure provides a no-code wizard in Machine Learning Studio to create the Responsible AI Dashboard: 


  1. Open the Model in Azure ML Studio: In the Azure ML Studio portal, navigate to the Models section (left pane). Select the registered model you want to audit from the list and go to the model’s Details page. On this page, click the “Create Responsible AI dashboard (preview)” button. This opens the Responsible AI Dashboard creation wizard. 

  

  2. Select Datasets: In the first step of the wizard, choose the datasets to use for analysis:

    • Training dataset: select the registered dataset that was used to train your model. This dataset will be used by components like model interpretability and error analysis to generate insights. 


    • Test dataset: select the dataset for testing or validation. This dataset (with true labels) will be used to populate the dashboard’s visualizations (performance metrics, error rates, etc.). If your desired dataset isn’t listed, you can click “Create” to upload a new dataset.  


  3. Specify the Modeling Task: Indicate whether the model you are auditing is a classification or regression model. The wizard will present relevant options based on the task (for example, classification dashboards will include accuracy and precision metrics, whereas regression will include MAE, MSE, etc.). This step ensures the dashboard computes the appropriate performance metrics for your model.  

  4. Choose Dashboard Components (Tools): Next, select which Responsible AI tools to include in your dashboard. Azure offers two built-in profiles for convenience:   

    • Model debugging profile includes tools like Error Analysis, Counterfactual (what-if) analysis, and Model Interpretability (feature importance) to help debug model errors and understand predictions.  

    • Real-life interventions profile focuses on Causal Analysis to understand the effect of changing certain features (“treatments”) on outcomes. (Note: the causal analysis profile isn’t available for multi-class classification models). 

 


    You can choose one of these profiles or manually select the specific components you want. For a comprehensive model audit, you might include all available components (performance/fairness metrics, data explorer, error analysis, interpretability, counterfactuals, and causal analysis). After selecting the profile or tools, click Next.  


  5. Configure Component Parameters: Depending on the tools you enabled, the wizard will prompt for certain configurations: 

    • Target Feature (Label): Specify the name of the label or outcome column that the model is predicting (the target variable). 


    • Categorical Features: (Optional) Identify which features are categorical. This ensures those features are handled correctly (e.g., not treated as numeric) in plots and analyses.  


    • Error Analysis settings: If Error Analysis is included, you can choose to pre-generate an error heat map by selecting up to two features to analyze error distribution. You can also adjust advanced settings like the maximum depth of the error decision tree or minimum samples per leaf, or simply use defaults.  


    • Counterfactual settings: If Counterfactual (what-if) analysis is included, set the number of counterfactual examples to generate per data point (e.g., 10) and define the desired outcome for those counterfactuals. For classification, the dashboard will automatically generate counterfactuals for the opposite class; for regression, you can specify a target value range for the predictions. You may also specify which features are allowed to change (perturb) when generating counterfactual scenarios (by default, all features can vary).  


    • Causal Analysis settings: If the causal component is selected (in the real-life interventions profile), choose the target outcome to analyze causally (usually the same as the model’s target) and one or more treatment features you hypothesize might influence that outcome. You can also configure advanced options like which causal algorithm to use or additional features for heterogeneity in the causal model.

 

  6. Experiment Configuration: Finally, provide an Experiment name and select a compute resource to run the dashboard generation job. The Responsible AI Dashboard creation runs as an Azure ML pipeline job under the hood.


    Choose a compute cluster or instance that has enough capacity to process the data (a standard DSv2 or DSv3 VM is usually sufficient for typical tabular data sizes). You can also give the dashboard a descriptive name and add tags or a description for organizational purposes. 



  7. Create and Run: Review your settings, then click Create. Azure ML will submit a pipeline job to generate the Responsible AI insights based on your configuration. You can navigate to the Experiments or Jobs section to monitor the job’s progress. This job orchestrates various components (for data analysis, fairness, explanation, etc.), and once it finishes, it produces a Responsible AI Dashboard artifact attached to your model.   


  8. Open the Dashboard: After the job completes successfully, go back to the model’s page in Azure ML Studio. Select the Responsible AI tab for the model to see the list of dashboards generated for that model. (You can create and store multiple Responsible AI dashboards per model, for example, one with all components and another focusing only on fairness.) Click the name of your newly created dashboard to open it in a full-page view. The dashboard will load in your browser, displaying all the selected components with interactive visualizations.  


Tip: Some features of the dashboard (like on-the-fly what-if analysis or interactive error tree adjustments) require a compute instance for real-time calculations. At the top of the dashboard, there is an option to connect to a running compute. Make sure to attach a running compute instance (for example, your Azure ML Compute Instance) to unlock full interactivity.  


Once connected (you’ll see a green status bar), you can use all features, such as dynamically generating new counterfactuals or drilling into error metrics with different settings. 


Alternative: If you prefer code, you can also create a Responsible AI Dashboard via the Azure ML Python SDK or CLI. Azure provides built-in pipeline components for Responsible AI that you can chain together in a script or YAML (for instance, a component to construct the RAI dashboard, components to add an explanation, fairness, error analysis, etc., and a gather component to finalize the dashboard).  


Using the SDK might look like building a pipeline where you fetch the registered model and datasets as inputs and then invoke these Responsible AI components. (This approach is beyond the scope of this article’s step-by-step, but it’s good to know for automation and CI/CD scenarios). For most users, the Azure ML Studio UI wizard described above is the easiest way to set up the dashboard without writing any code. 
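For intuition, here is a rough sketch of the equivalent code-first workflow using the open-source responsibleai and raiwidgets packages that the dashboard is built on. It is not the Azure ML pipeline-component syntax, and the inputs are assumed for illustration: a fitted scikit-learn classifier (`model`) plus train/test pandas DataFrames (`train_df`, `test_df`) containing a hypothetical "approved" label column.

```python
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

# Assumed inputs: `model`, `train_df`, `test_df` (DataFrames include the label column).
rai_insights = RAIInsights(
    model=model,
    train=train_df,
    test=test_df,
    target_column="approved",
    task_type="classification",
)

# Enable roughly the same components the Studio wizard offers.
rai_insights.explainer.add()                                    # interpretability
rai_insights.error_analysis.add()                               # error tree / heatmap
rai_insights.counterfactual.add(total_CFs=10,
                                desired_class="opposite")       # what-if analysis
rai_insights.causal.add(treatment_features=["income"])          # causal analysis

rai_insights.compute()                 # runs all enabled analyses
ResponsibleAIDashboard(rai_insights)   # serves the dashboard locally in a browser
```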


Loading a Model and Data into the Dashboard 


Loading your model and data into the Responsible AI Dashboard is largely handled during the creation process described above. To summarize: 


  • Model Loading: When you initiate the dashboard creation from a registered model’s page, Azure automatically knows which model to load. The dashboard is intrinsically linked to the chosen model; you don’t need to manually import the model again.


    Under the hood, the pipeline job fetches the model (in MLflow format) from the registry and uses it to compute insights like predictions and explanations. Each Responsible AI dashboard is attached to a specific model and version, ensuring the analyses correspond to that model’s behavior.  


  • Data Loading: The wizard prompts you to select the train and test datasets. When the dashboard opens, it loads these datasets for analysis. The Data Analysis component will display the data from the selected dataset, and other components (performance metrics, error analysis, etc.) will use the datasets to compute their results. Make sure the test dataset you provided includes the actual labels/outcomes, since the dashboard will compare model predictions to true values to compute error rates and fairness metrics.  


If you realize you chose the wrong dataset or want to try a different slice of data, you can create a new dashboard or use the cohort filtering features (discussed below) to explore subsets of the data. 


In practice, once a dashboard is generated, no manual coding is needed to load the model or data; it’s all preconfigured. When you open the dashboard, you’ll immediately see your model’s name, and all visuals will be populated with the data you specified (you can verify the dataset by looking at the data explorer or the cohort definitions). If the dashboard appears empty or errors out, double-check that your model is compatible (e.g., a supported model type) and that the datasets were properly registered in MLTable format. 


Note: The Responsible AI Dashboard currently supports models that can be loaded in the Azure ML environment (for example, scikit-learn models, XGBoost, LightGBM, and other frameworks that can be wrapped with the MLflow sklearn flavor).  


Models that cannot be deserialized in Python or don’t have an MLflow flavor might not work out of the box. Similarly, ensure your data is in a tabular form that the dashboard components can work with (images and text have separate dashboard variants). For most tabular use cases, if you’ve registered your datasets and model in Azure ML, loading into the dashboard should be seamless. 
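As an illustration of the compatibility note above, the following sketch logs a scikit-learn classifier with the MLflow sklearn flavor and registers it; when your MLflow tracking URI points at the Azure ML workspace, the registered model then appears in the workspace model registry, ready for the dashboard wizard. The model name and toy dataset are placeholders.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

# Toy training step standing in for your real training pipeline.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
clf = LogisticRegression(max_iter=5000).fit(X, y)

with mlflow.start_run():
    # Logs the model in MLflow format and registers it under a hypothetical name.
    mlflow.sklearn.log_model(
        sk_model=clf,
        artifact_path="model",
        registered_model_name="breast-cancer-classifier",
    )
```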


Interpreting the Responsible AI Dashboard Components 

Once your Responsible AI Dashboard is up and running, you’ll see a rich interactive interface with multiple sections. Let’s break down the key components of the dashboard and how to use them to audit your model: 


  • Global Controls (Cohorts and Settings): At the top of the dashboard, you’ll find controls to manage cohorts (subsets of data) and layout settings. By default, the dashboard shows metrics for All data (the global cohort). You can create custom cohorts, for example, filtering the data by a feature (such as “Country = US” or “Age > 50”).


    This is useful for analyzing model performance on specific segments. Global controls allow you to switch between cohorts, create new ones, or edit existing ones at any time. When a cohort is selected, all components in the dashboard update to reflect metrics/analyses for that subset of data. This makes it easy to compare different slices of your data for bias or error disparities.  


  • Data Explorer (Data Analysis): This component helps you inspect the dataset itself, which is an important part of auditing a model. In the Data Analysis section, there are two views:  

    • A Table view that displays the raw dataset (features and rows), so you can browse records in the cohort.  


    • A Chart view that allows you to visualize the distribution of data and identify any imbalance or outliers. You can plot aggregate statistics or individual data points by selecting features for the X and Y axes. For example, you might create a bar chart of counts by age group, or a scatter plot of two features, and color points by whether the model was correct or not.


      These visualizations help in understanding the overrepresentation or underrepresentation of certain groups in your data. Data imbalance can often lead to model bias, so this explorer is useful to highlight such issues before even looking at model performance.  


  • In the chart view, you can switch between aggregate plots (like histograms or bar charts) and individual data points (scatter plots). For aggregate plots, you might plot one feature on the X-axis and choose a statistic (like count or mean outcome) on the Y-axis. For individual points, you can add a color legend (e.g., color by error vs. correct, or by predicted class) to see patterns in the data. 


    If you discover a particular segment that looks problematic (say, a cluster of errors in a certain region of the plot), you can save it as a new cohort for deeper analysis. Overall, the data explorer ensures you’re auditing not just the model but also the data quality and distribution that could be affecting model behavior.  
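As a rough illustration of the kind of imbalance check the Data Explorer surfaces, the pandas sketch below computes the overall label balance plus the representation and positive rate per group. The file path and column names ("approved", "gender") are hypothetical.

```python
import pandas as pd

# Hypothetical test dataset with a binary label and a demographic column.
df = pd.read_parquet("test.parquet")

# Overall label balance: is one class heavily overrepresented?
print(df["approved"].value_counts(normalize=True))

# Representation and positive rate per group: are some groups under-sampled
# or labeled positive at very different rates?
summary = df.groupby("gender").agg(
    rows=("approved", "size"),
    positive_rate=("approved", "mean"),
)
print(summary)
```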


  • Model Performance and Fairness (Model Overview): The Model Overview section provides a comprehensive set of performance metrics for your model, and it also integrates fairness metrics to audit how performance varies across different groups. There are typically two sub-tabs here:  


    • Dataset Cohorts Performance: This view lets you compare model performance across different cohorts of data. You will see a table of selected metrics (accuracy, F1-score, etc. for classification; or MAE, RMSE, etc. for regression) for each cohort, including the All-data cohort by default.


      You can add or remove metrics using a “Help me choose metrics” panel, which explains each metric and allows you to select the ones most relevant to you (for example, an enterprise might focus on accuracy and false positive rate for a loan approval model, to balance overall performance and fairness).  


    • You can also visualize a selected metric across cohorts in a bar chart for easy comparison. This quickly shows if one group has systematically higher error rates or lower accuracy than another. If you have created custom cohorts (say, by gender or by region), those will appear here as separate columns. This allows for an apples-to-apples comparison of model metrics across different slices of data.  


    • Feature-based Fairness (Feature Cohorts): This tab is particularly useful for fairness audits. Here, you pick one or more sensitive features (or any feature of interest), for example, Gender, Race, or Income level, and the dashboard will automatically bin those features into groups (cohorts) and show performance metrics for each group. It also calculates disparity metrics, such as the maximum difference or ratio between any two groups for each performance metric.   


      For instance, it might show that the model’s accuracy for Group A is 95% and for Group B is 85%, which is a 10-percentage point difference. Such a disparity might be highlighted if it exceeds a certain threshold. These fairness metrics give you a quantitative measure of how equitable your model’s predictions are. A key goal in responsible AI is to ensure one group isn’t unfairly disadvantaged by the model, and this feature cohort analysis makes those gaps visible. You can adjust which features to consider and how they’re binned (e.g., age could be binned into ranges) via the “Help me choose features” panel.  


  • Together, the performance and fairness views allow you to evaluate your model’s overall effectiveness and its consistency across subpopulations. If you identify large disparities, you might consider retraining with more data for underperforming groups or applying mitigation techniques (Azure’s Fairlearn toolkit, which underpins the fairness metrics, can even suggest mitigation strategies).  
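For reference, the disparity numbers in the feature-cohort view can be reproduced directly with Fairlearn. A minimal sketch follows, assuming y_test (true labels), y_pred (model predictions), and a test_df DataFrame with a hypothetical "gender" column already exist.

```python
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score, recall_score

# Assumed inputs: y_test, y_pred, and test_df["gender"] as the sensitive feature.
mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "recall": recall_score},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=test_df["gender"],
)

print(mf.by_group)       # each metric broken out per group
print(mf.difference())   # largest gap between any two groups, per metric
print(mf.ratio())        # smallest ratio between any two groups, per metric
```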


  • Error Analysis: One of the most powerful features for auditing is the Error Analysis component. This part of the dashboard helps answer “Where and why does my model make mistakes?” It consists of two linked visualizations:   


    • Error Tree Map: This is a decision tree visualization that automatically finds clusters (cohorts) of data where the model’s error rate is notably high. Each node of the tree represents a cohort defined by certain feature splits (for example, a node might correspond to “Age > 50 and Income < 50k” and show how many errors occurred there).


      The tree map view shows nodes as rectangles sized by the number of data points and colored by error rate, giving a quick visual cue of which slices of data have high error concentrations.   


      By selecting a node, you can inspect the filters (feature conditions) that define that cohort and see details like the number of errors vs. total points in that group.  For instance, you might discover that “for older customers with low income, the model has a 30% error rate, which covers 40% of all errors”. This insight would be critical; it means your model is underperforming for a specific population. You can click “Save as new cohort” on any problematic node to create a cohort for further analysis or to take corrective action.  


    • Error Heatmap: The heatmap view is an alternative way to visualize model errors across two features at a time. When you switch to the Heatmap tab, you’ll choose two features (say, Feature X vs Feature Y) and the dashboard will display a matrix where each cell represents a combination of feature ranges and is colored by the error rate in that subset.   


      This is great for spotting interaction effects, for example, maybe the model mostly makes mistakes for younger users with high income (one cell of the heatmap will show a dark red if the error rate is high). You can select one or multiple cells to see aggregate error statistics for those selections and also create cohorts from them for deeper dives. The heatmap essentially provides a visual audit of errors across feature pairs, complementing the tree which finds one set of splits automatically.   


      Using both, an auditor can systematically investigate error hotspots. Notably, Azure’s error analysis tool (built on the Error Analysis package) also gives a sense of feature importance for errors: it can list which features were most related to the errors (using metrics like mutual information) to guide you on which features might need attention.  


  • By leveraging error analysis, you can pinpoint failure modes of the model. For example, you might learn that most misclassifications happen for a certain product category or a certain demographic. This information is invaluable: it can guide data collection (to get more samples of those cases), feature engineering, or even decisions to not deploy the model for certain segments until improvements are made.  
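A crude pandas approximation of the error heatmap idea is shown below: it computes the error rate for every combination of two features. The column names ("age", "income_level") and the binning are purely illustrative, and y_test / y_pred are assumed to exist.

```python
import pandas as pd

# Assumed inputs: test_df with the feature columns, plus y_test and y_pred.
df = test_df.copy()
df["error"] = (y_pred != y_test).astype(int)        # 1 where the model was wrong

heatmap = pd.pivot_table(
    df.assign(age_bin=pd.cut(df["age"], bins=4)),   # bin a numeric feature
    values="error",
    index="age_bin",
    columns="income_level",                         # hypothetical categorical feature
    aggfunc="mean",                                 # each cell = error rate
)
print(heatmap)
```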


  • Model Interpretability (Feature Importance): To audit why the model is making certain predictions, the dashboard includes a Feature Importance or Model Explanation section. This component leverages Azure’s interpretability techniques (based on the SHAP/InterpretML framework) to break down the model’s predictions:  


    • Global Explanations: In the Aggregate feature importance view, you’ll see a bar chart ranking the top K features that influence the model’s predictions overall. Essentially, it answers “Which features are most important in the model’s decision process, on average?” For example, you might find that “Age” and “Income” are the top drivers of the model’s predictions in a credit risk model.   


      This global view can confirm if the model is using sensible factors or if there are any surprises (e.g., an unexpected feature dominating importance, which might indicate data leakage or a spurious correlation). You can adjust how many features to display and toggle between visualization types, bar chart or box plot (the box plot shows the distribution of feature importance values across individual instances, giving a sense of variability).   


      Clicking on a particular feature in the chart will typically show a dependence plot: a scatter plot that shows how that feature’s value correlates with its influence on the prediction. This can reveal patterns like “as Age increases, its effect on the prediction increases positively” or threshold effects.   


    • Local Explanations: In the Individual feature importance view, you can select one or a few specific data points (from the dataset or cohorts) and inspect the contribution of each feature to that particular prediction. The dashboard might display this as a set of bar charts or a waterfall plot for each instance, showing how the prediction deviates from an average baseline due to each feature’s influence.   


      For example, for a given individual, you might see that “High Income” strongly pushed the prediction toward “Approved” by +0.3, while “Young Age” pushed it toward “Denied” by –0.1, and so on, totaling up to the model’s final output score. By comparing multiple individuals side by side, you can understand why the model made different decisions for them.   


      This is crucial for validating that the model’s reasoning aligns with domain expectations and for explaining decisions to stakeholders. If the explanations show something concerning (like a feature that should be irrelevant having a big impact), that would be a red flag requiring further investigation.  


  • Model interpretability is a cornerstone of responsible AI because it provides transparency. With the dashboard’s explanation charts, you can justify model decisions or debug them. For instance, if a regulator asks, “Why did the model reject this loan?”, you could use the dashboard to generate a feature importance report for that specific case. Many compliance frameworks now require such explanation capabilities.  
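To get a feel for what the explanation component computes, the hedged sketch below uses the shap package (the family of techniques the dashboard's explanations are based on) to produce a global importance chart and a single-prediction breakdown. The objects model and X_test are assumed to exist.

```python
import shap

# Assumed inputs: `model` is a fitted estimator; `X_test` is a pandas DataFrame
# of feature values for the test set.
explainer = shap.Explainer(model.predict, X_test)   # model-agnostic explainer
shap_values = explainer(X_test)

# Global view: mean absolute contribution per feature across the dataset,
# analogous to the aggregate feature importance chart.
shap.plots.bar(shap_values)

# Local view: how each feature pushed one specific prediction away from the
# baseline, analogous to the individual feature importance view.
shap.plots.waterfall(shap_values[0])
```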


  • Counterfactual What-If Analysis: The Counterfactual component allows you to explore “what-if” scenarios for individual data points. This is an interactive tool to answer the question: “What minimal changes to the input features would flip the model’s prediction or achieve a desired outcome?”. When you select a data point in the dashboard (for example, a particular customer who was denied a loan by the model), the counterfactual module will try to generate one or more counterfactual instances, i.e., a version of that data point with some feature values altered, that would result in a different model prediction (e.g., loan approved). 


    The dashboard shows these alternate scenarios and highlights the changes made. For example, it might show: “If Income was $50k instead of $40k and Credit Score was 700 instead of 650, the model’s prediction would change from Denied to Approved.” Each counterfactual is essentially a recommendation for change that would meet the target outcome.  


    This analysis is extremely insightful for both debugging and user-facing explanations. From a debugging standpoint, if the counterfactual changes seem unreasonable or very large, it may indicate that the prediction is hard to change or that the model is insensitive to certain features. On the other hand, if they seem intuitive (e.g., “slightly higher income leads to approval”), it provides some validation.


    For users or decision-makers, counterfactuals can be framed as actions: “What would it take for this prediction to change?”, which is often what people want to know. Keep in mind the counterfactual generation respects the constraints you set; you can restrict which features are allowed to change to ensure realistic scenarios (for instance, you might allow tweaking Income but not Age because age can’t be changed).  


    Using the counterfactual tool, organizations can also test model robustness. For instance, if very tiny changes in input (like changing age by 1 year) flip predictions, the model might be overly sensitive or unstable, which could be a concern. The dashboard’s visual presentation of counterfactuals (often in a table or list with original vs. new values) makes it easy to communicate these findings.  
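The counterfactual component draws on the open-source DiCE library (dice-ml); a minimal sketch of generating similar what-if examples outside the dashboard follows, with the model, DataFrames, and column names ("approved", "income", "credit_score") assumed for illustration.

```python
import dice_ml

# Assumed inputs: `model` (fitted scikit-learn classifier), `train_df` and
# `test_df` (pandas DataFrames with an "approved" label column).
data = dice_ml.Data(
    dataframe=train_df,
    continuous_features=["income", "credit_score"],   # hypothetical numeric features
    outcome_name="approved",
)
ml_model = dice_ml.Model(model=model, backend="sklearn")
explainer = dice_ml.Dice(data, ml_model, method="random")

# One data point we want to flip, e.g. a denied applicant.
query = test_df.drop(columns="approved").iloc[[0]]

cf = explainer.generate_counterfactuals(
    query,
    total_CFs=10,
    desired_class="opposite",
    features_to_vary=["income", "credit_score"],      # keep immutable features fixed
)
cf.visualize_as_dataframe(show_only_changes=True)
```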


  • Causal Analysis: If you included the Causal component, the dashboard provides insights into causal relationships in your data. Unlike the other components, which are observational, causal analysis attempts to answer questions like “If we intervene on feature X, what is the effect on outcome Y?”.


    For example, in a medical dataset, “If a patient follows Treatment A vs. Treatment B, how does it causally affect their recovery rate?” In the context of model auditing, causal analysis can help differentiate correlation from causation in your features. The dashboard allows you to select a target outcome and treatment features when configuring it.


    After running, it may show results such as causal effect estimates (e.g., “increasing education level by 1 year is estimated to increase income by $3,000 on average, holding other factors constant”). It can also identify heterogeneous effects, for instance, the effect of a feature might be stronger for one group than another.   


    While causal analysis is a complex topic (and the dashboard uses advanced methods under the hood, like causal inference algorithms), its inclusion is part of responsible AI to ensure we aren’t misled by spurious correlations. In practice, this component is often used when you have an interest in policy or strategic decisions: for example, a business might use it to simulate the impact of improving a certain attribute (like product quality) on outcomes (like customer satisfaction).


    In terms of model auditing, causal insights could inform you whether certain input features actually drive outcomes or if they are just proxies. This can prevent making unfair or ineffective policy changes based on model correlations alone. 
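For illustration, a comparable causal estimate can be produced with EconML, the causal machine learning library the dashboard's causal component draws on. The sketch below uses hypothetical outcome, treatment, and confounder columns from an assumed DataFrame `df`; it is a sketch of the technique, not the dashboard's exact configuration.

```python
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor

# Assumed input: `df`, a pandas DataFrame with the columns referenced below.
est = LinearDML(
    model_y=RandomForestRegressor(),   # learns the outcome from confounders
    model_t=RandomForestRegressor(),   # learns the treatment from confounders
)
est.fit(
    Y=df["income"],                    # outcome of interest
    T=df["education_years"],           # treatment feature we might intervene on
    X=df[["age"]],                     # features that may modify the effect
    W=df[["hours_per_week"]],          # additional confounders to control for
)

# Estimated effect of a one-unit increase in the treatment for each row of X.
print(est.effect(df[["age"]]))
```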


These components (data explorer, performance/fairness metrics, error analysis, interpretability, counterfactuals, and causal analysis) work in concert to give you a 360-degree view of your model’s behavior. As you interact with the dashboard, you can drill down via cohorts and what-if scenarios to identify specific issues.


For example, you might start at the model overview and spot a fairness gap, then use error analysis to find where errors for the disadvantaged group are coming from and finally examine feature importance or counterfactuals for those cases to understand the root cause. This iterative auditing process is exactly what the Responsible AI Dashboard is designed to facilitate, all within a single tool. 


Conclusion: Ensuring Responsible AI in Practice 

Auditing machine learning models with Azure’s Responsible AI Dashboard enables enterprises to address fairness, transparency, and accountability alongside accuracy. The dashboard equips teams with insights to detect bias, interpret predictions, and resolve errors, building greater trust in AI-driven decisions. 


 At Avyka, we help organizations adopt Responsible AI in practice. From setting up and customizing Azure’s Responsible AI Dashboard to advising on governance and model improvements, our expertise ensures your AI initiatives are both impactful and compliant. By combining Azure’s Responsible AI tools with Avyka’s guidance, enterprises can deploy models that are not only high-performing but also trustworthy and ready for business-critical use. 

