ms-dp100 self-learning course: module 2
originally created on 2025-07-18
updated on 2025-07-27
tags: [ms-dp100-series, series]
- - -
this is my overview/notes for the second module of the microsoft dp-100 self-learning course.

module 2! (source)
module 1 - classification models
the module starts by recommending automl. lol...
it goes over the typical data preprocessing steps.
firstly, you need your data asset. then, you can specify it as an input to the task at hand.
the data also has to be scaled and normalized, though the exact functions and their parameters may vary.
apparently, automl does this automatically. pretty cool.
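as a rough illustration, here's a minimal sketch of pointing an automl job at a registered training data asset with the python sdk v2 (the asset name/version here is a placeholder i made up, not something from the course):

```python
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# automl expects the training data as an MLTable data asset.
# "diabetes-training:1" is a hypothetical registered asset name/version.
my_training_data_input = Input(
    type=AssetTypes.MLTABLE,
    path="azureml:diabetes-training:1",
)
```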
there are a bunch of algorithms that automl can experiment with, including:
- logistic regression
- light gradient boosting machine (lgbm)
- decision trees
- random forest
- naive bayes
- linear support vector machine (svm)
- xgboost

yeah...quite a few. (source)
"by default, automl will randomly select fvrom the full range of algorithms for the specified task."
it seems like you can block some algorithms if you know that they won't work for your data. that's definitely pretty
useful in terms of both efficiency and policy adherence.
creating an automl experiment in the sdk makes sense!
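something along these lines is what i mean (a sketch, not verbatim from the course — the compute name and target column are placeholders, and the blocked algorithm name is just illustrative):

```python
from azure.ai.ml import automl

# configure the automl classification task
# (reuses `my_training_data_input` from the earlier snippet)
classification_job = automl.classification(
    compute="aml-cluster",            # hypothetical compute target
    experiment_name="auto-ml-class-dev",
    training_data=my_training_data_input,
    target_column_name="Diabetic",    # hypothetical label column
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
)

# optionally block algorithms you know won't work for your data
classification_job.set_training(
    blocked_training_algorithms=["LogisticRegression"],
)
```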
the primary metric is the metric that will be optimized. in the case of the code above, it's accuracy.
however, some tasks may have different primary metrics (the most popular alternative i have seen is precision).
the full list of primary metrics can be found in the documentation.
limits can also be set for the experiment.
- timeout_minutes - the maximum time that the experiment can run for.
- trial_timeout_minutes - the maximum time that each trial can run for.
- max_trials - the maximum number of trials that can be run.
- enable_early_termination - whether to enable early termination of trials.
trials can also be concurrently run, and the respective limit can be set with the max_concurrent_trials
parameter.
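in code, that looks roughly like this (assuming the `classification_job` from before; the values are just illustrative):

```python
# cap how long and how many trials the experiment is allowed to run
classification_job.set_limits(
    timeout_minutes=60,
    trial_timeout_minutes=20,
    max_trials=5,
    max_concurrent_trials=2,
    enable_early_termination=True,
)
```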
starting and monitoring the job is also easy in the python sdk. well, to be precise, you can't monitor the
job directly in the sdk, but you can reference a link to the job in the azure portal.
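a minimal sketch of what i mean, assuming an authenticated `MLClient` called `ml_client`:

```python
# submit the automl job to the workspace
returned_job = ml_client.jobs.create_or_update(classification_job)

# the sdk hands back a studio link rather than streaming progress
print("monitor your job at", returned_job.studio_url)
```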
what's cool about automl is that guardrails can be automatically applied to the job. some examples of guardrails could be:
- class balancing detection - automl can detect if the classes in the training data are imbalanced and apply techniques to balance them.
- missing feature value detection - automl can detect if there are missing values in the features and apply techniques to handle them.
- high cardinality feature detection - automl can detect if there are features with high cardinality (i.e. a large number of unique values) and apply techniques to handle them.
automl can possibly fix these issues on its own (done status), or it can just detect them if they are not fixable (alerted status).
all of the models that are trained in the automl job can be seen in the job's "models" tab in the azure portal.
these models will have their metrics in the overview. pretty sick!
module 2 - track model training in jupyter notebooks
so...you know how i said that you can't monitor the automl job directly in the sdk?
well, you can actually track the job in a jupyter notebook. lol i lied ahahaha
using the mlflow library, you can set the tracking uri to the mlflow tracking server.
from there, you can track the job and metrics in the notebook.
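roughly, the setup looks like this (the workspace details are placeholders, and i believe the azureml-mlflow plugin also needs to be installed for mlflow to talk to the workspace):

```python
import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# connect to the workspace (placeholder ids)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# point mlflow at the workspace's tracking server
mlflow_tracking_uri = ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri
mlflow.set_tracking_uri(mlflow_tracking_uri)
```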
custom logging is also possible!
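for example, something like this (the experiment name and values are made up):

```python
import mlflow

# group runs under an experiment (placeholder name)
mlflow.set_experiment("dp100-module-2-notebook")

with mlflow.start_run():
    # log whatever parameters and metrics matter to you (illustrative values)
    mlflow.log_param("regularization_rate", 0.1)
    mlflow.log_metric("accuracy", 0.89)
```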
logged metrics can all be found in the studio if done correctly.
conclusion
all in all, this module was pretty useful. the first module was more of a review in terms of the math, but learning about automl and logging was pretty cool.

fin. a picture of 34th street at night :)