Advanced Machine Learning Techniques with scikit-learn
In this blog post, we dive deep into advanced machine learning techniques using the scikit-learn library. You will learn how ensemble methods and support vector machines work, how to tune their hyperparameters with cross-validation, and how to evaluate the resulting models for real-world applications.
Ensemble Methods
Ensemble methods combine the predictions of multiple base models to produce a model that is more accurate and robust than any single estimator. scikit-learn provides ensembles in three main flavours: bagging (e.g. RandomForestClassifier), boosting (e.g. GradientBoostingClassifier), and stacking (StackingClassifier).
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Create a random dataset
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)
# Create a random forest classifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
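Boosting and stacking follow the same fit/predict pattern. Below is a minimal sketch of both, reusing the X and y created above; the particular estimators and settings are illustrative choices rather than the only reasonable configuration.
# Import the boosting and stacking ensembles (illustrative configuration)
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# Boosting: trees are built sequentially, each one correcting its predecessors
boost = GradientBoostingClassifier(n_estimators=100, random_state=0)
boost.fit(X, y)
# Stacking: base models feed their predictions into a final meta-estimator
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('dt', DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression())
stack.fit(X, y)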
Support Vector Machines
Support Vector Machines (SVMs) are powerful and versatile machine learning models, capable of performing linear or nonlinear classification, regression, and even outlier detection.
# Import necessary libraries
from sklearn import svm
# Create a support vector classifier
clf = svm.SVC()
# Fit the model
clf.fit(X, y)
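In practice, SVMs are sensitive to feature scaling, so it is common to standardize the inputs first. The sketch below reuses the same X and y and picks an RBF kernel with C=1.0 purely as an illustration of the pipeline pattern.
# Standardize features, then fit an RBF-kernel SVC in a single pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
svc_pipeline = make_pipeline(StandardScaler(), svm.SVC(kernel='rbf', C=1.0))
svc_pipeline.fit(X, y)
# Predict classes for a few samples to sanity-check the fitted model
print(svc_pipeline.predict(X[:5]))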
Hyperparameter Tuning and Model Selection
Hyperparameters are parameters that are not learned from the data; they are set before the learning process begins. Cross-validation is a resampling procedure that assesses how well a model generalizes, and it is especially useful when data is limited. GridSearchCV combines the two ideas: it cross-validates every combination in a parameter grid and keeps the best one.
# Import necessary libraries
from sklearn.model_selection import GridSearchCV
# Define the hyperparameter grid to search over
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
# Exhaustively search the grid, scoring each combination with cross-validation
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(X, y)
print("Best parameters:", clf.best_params_)
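GridSearchCV runs cross-validation internally, but cross-validation is also useful on its own for estimating performance. Here is a minimal sketch with cross_val_score; the 5-fold setting and the SVC configuration are just common defaults, not a recommendation from the library.
# Score an SVC with 5-fold cross-validation on the same dataset
from sklearn.model_selection import cross_val_score
scores = cross_val_score(svm.SVC(kernel='rbf', C=1), X, y, cv=5)
print("Mean CV accuracy:", scores.mean())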
Evaluating Model Performance
Model evaluation estimates how well a model will generalize to future (unseen, out-of-sample) data. scikit-learn provides a range of metrics for this, including accuracy, precision, recall, F1 score, and ROC AUC.
# Import necessary libraries
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
# Hold out a test set so the metrics reflect performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Refit the tuned classifier on the training split only
clf.fit(X_train, y_train)
# Predict the responses for the test dataset
y_pred = clf.predict(X_test)
# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# Model precision
print("Precision:", metrics.precision_score(y_test, y_pred))
# Model recall
print("Recall:", metrics.recall_score(y_test, y_pred))
# Per-class summary of precision, recall, and F1
print(classification_report(y_test, y_pred))
Top 10 Key Takeaways
- Ensemble methods combine the predictions of several base estimators to improve generalizability and robustness.
- Support Vector Machines are effective in high-dimensional spaces and work best when there is a clear margin of separation between classes.
- Hyperparameters are parameters that are not learned from the data, and are set before the learning process begins.
- Cross-validation is a resampling procedure used to evaluate a model when data is limited.
- Use GridSearchCV for hyperparameter tuning.
- Model evaluation aims to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.
- Model accuracy is the fraction of predictions our model got right.
- Precision is the ability of the classifier not to label as positive a sample that is negative.
- Recall is the ability of the classifier to find all the positive samples.
- scikit-learn is a versatile library that provides simple and efficient tools for data mining and data analysis.
Ready to start learning? Start the quest now