{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# __Generalized Linear Model__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Generalized Linear Model (GLM) defines the loss function in group elastic net problems.\n", "Many GLMs have already been implemented in the `adelie.glm` submodule.\n", "In this notebook, we show how to use GLM objects and write custom GLM classes." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import adelie as ad\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## __Single-Response GLM__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most commonly used GLMs are of single-response type.\n", "Some examples of single-response GLMs are Gaussian, Binomial, Poisson, and Cox models.\n", "As an example, we show below how to construct a gaussian GLM object." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "n = 100 # number of observations\n", "seed = 0 # seed\n", "\n", "np.random.seed(seed)\n", "y = np.random.normal(0, 1, n)\n", "\n", "glm = ad.glm.gaussian(y=y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is possible to also specify observation weights $w$.\n", "This may be useful if the user wishes to incorporate prior knowledge that some data points carry more signal or information in the underlying pattern.\n", "By default, our GLM classes will initialize the weights to be uniform across all data points.\n", "Moreover, the weights will be normalized to sum to $1$.\n", "The Gaussian solvers _require_ the weights to be normalized to $1$,\n", "although the general GLM solvers do not have the same requirements.\n", "However, all types of solvers benefit from numerical stability if the weights are normalized.\n", "For these reasons, we enforce that the weights are normalized." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/jhyang/sandbox/adelie/adelie/glm.py:32: UserWarning: Normalizing weights to sum to 1.\n", " warnings.warn(\"Normalizing weights to sum to 1.\")\n" ] } ], "source": [ "# raises warning about normalization!\n", "w = np.random.uniform(0, 1, n)\n", "glm = ad.glm.gaussian(y=y, weights=w)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# no warnings!\n", "w = w / np.sum(w)\n", "glm = ad.glm.gaussian(y=y, weights=w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The single-response GLM classes expose a few member functions required by the GLM solver.\n", "For example, `.gradient()` is a member function that computes the (negative) gradient of the loss function.\n", "We show below the equivalence between calling `.gradient()` and the corresponding numpy code." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "eta = np.random.normal(0, 1, n)\n", "grad = np.empty(n)\n", "glm.gradient(eta, grad)\n", "assert np.allclose(\n", " grad,\n", " w * (y - eta),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is worth mentioning that the `.hessian()` member function computes the _diagonal_ of the full hessian matrix.\n", "It is too computationally burdensome to compute the full matrix,\n", "and moreover, to invert such a matrix during the proximal Newton step in our solver.\n", "Hence, as an approximation, we only request the diagonal of the matrix.\n", "Due to the warm-starting nature of our algorithm,\n", "this approximation seems to be sufficient and does not result in divergence issues." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the full set of methods, we refer the readers to\n", "[GlmBase64](https://jamesyang007.github.io/adelie/generated/adelie.glm.GlmBase64.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## __Multi-Response GLM__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The multi-response GLM classes correspond to GLMs with multiple responses.\n", "Some examples are multigaussian and multinomial GLMs.\n", "We show below an example of constructing a multigaussian GLM object." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "n = 100 # number of observations\n", "K = 4 # number of responses\n", "\n", "y = np.random.normal(0, 1, (n, K))\n", "\n", "glm = ad.glm.multigaussian(y=y, weights=w)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multi-response GLMs work similarly as single-response GLMs except that arguments to the member functions\n", "are of different shapes.\n", "For example, the `.gradient()` member function expects `eta` to be of `(n, K)` shape rather than `(n,)` shape as before.\n", "We show below an example of calling `.gradient()`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "eta = np.random.normal(0, 1, (n, K))\n", "grad = np.empty((n, K))\n", "glm.gradient(eta, grad)\n", "assert np.allclose(\n", " grad,\n", " w[:, None] * (y - eta) / K,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the full set of methods, we refer the readers to\n", "[GlmMultiBase64](https://jamesyang007.github.io/adelie/generated/adelie.glm.GlmMultiBase64.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## __Custom GLM Class__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A key feature of `adelie` is the ability to define a user-specific GLM class.\n", "This custom class can be implemented in either C++ _or_ Python!\n", "In this section, we show how to implement a custom single-response GLM class for the Gaussian family\n", "equivalent to `ad.glm.gaussian`.\n", "All of the discussion carries through for the multi-response GLM classes.\n", "\n", "We first show the full code for our custom GLM class." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "class Gaussian(ad.glm.GlmBase64):\n", " def __init__(self, y, w=None):\n", " self.y = y\n", " self.w = (\n", " np.full(y.shape[0], 1 / y.shape[0])\n", " if w is None else \n", " w / np.sum(w)\n", " )\n", " # MUST call base class __init__!\n", " ad.glm.GlmBase64.__init__(self, \"my_gaussian\", self.y, self.w)\n", " def gradient(self, eta, grad):\n", " grad[...] = self.w * (self.y - eta)\n", " def hessian(self, eta, grad, hess):\n", " hess[...] = self.w\n", " def loss(self, eta):\n", " return np.sum(self.w * (-self.y * eta + 0.5 * eta ** 2))\n", " def loss_full(self):\n", " return -0.5 * np.sum(self.w * self.y ** 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We remark on a few important points:\n", "\n", "- The custom GLM class must inherit from the provided base class `ad.glm.GlmBase64`.\n", " For users interested in using 32-bit floats, inherit from `ad.glm.GlmBase32`,\n", " however beware of the numerical instability in using 32-bit floats!\n", " You may observe strange behaviors in the optimizer, so _use with caution_!\n", " Users who are implementing multi-response GLMs must inherit from\n", " `ad.glm.GlmMultiBase64` or `ad.glm.GlmMultiBase32`.\n", "- The base class constructor must get called _exactly_ as shown above _without the use of_ `super()`!\n", " This is a quirk of the `pybind11` package which is what we rely on for exporting C++ classes to Python.\n", "- The base class requires a reference to `y` and `w`, however, the Python interpreter will not increment\n", " the reference counter for C++ classes holding references.\n", " To avoid memory lifetime issues, the user must save a reference to these variables on the Python side\n", " (e.g. `self.y = y` and `self.w = w`).\n", "- Many of the member functions are given the output container to store the result of computing a quantity.\n", " Hence, we use the syntax `out[...] = expression` to modify the container _in-place_.\n", " A common pitfall is to write `out = expression`, which will only redirect the local variable `out` \n", " to point to a different location in memory.\n", " This low-level interface is done for memory efficiency." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now show that the GLM object can be passed into our existing solver `ad.grpnet`\n", "with no further changes to the code.\n", "We first generate our data and call `ad.grpnet`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "n = 100 # number of observations\n", "p = 1000 # number of features\n", "\n", "X = np.random.normal(0, 1, (n, p))\n", "y = X[:, -1] * np.random.normal(0, 1) + np.random.normal(0, 1, n)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/jhyang/sandbox/adelie/adelie/matrix.py:259: UserWarning: Detected matrix to be C-contiguous. Performance may improve with F-contiguous matrix.\n", " warnings.warn(\n", " 44%|\u001b[1;32m█\u001b[0m\u001b[1;32m█\u001b[0m\u001b[1;32m█\u001b[0m\u001b[1;32m█\u001b[0m | 44/100 [00:00:00<00:00:00, 6335.87it/s] [dev:90.7%]\n" ] } ], "source": [ "state = ad.grpnet(\n", " X=X,\n", " glm=Gaussian(y=y),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compare solutions with our Gaussian solver.\n", "To demonstrate an exact match in coefficients, \n", "we pass in a Gaussian GLM object with optimization flag turned off\n", "so that the general GLM solver is used rather than our special Gaussian solver." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 44%|\u001b[1;32m█\u001b[0m\u001b[1;32m█\u001b[0m\u001b[1;32m█\u001b[0m\u001b[1;32m█\u001b[0m | 44/100 [00:00:00<00:00:00, 7097.11it/s] [dev:90.7%]\n" ] } ], "source": [ "state_exp = ad.grpnet(\n", " X=X,\n", " glm=ad.glm.gaussian(y=y, opt=False),\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "assert np.allclose(\n", " state.betas.toarray(),\n", " state_exp.betas.toarray(),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on our experience, the part of the solver that interacts with the GLM object\n", "is not speed critical, so there is seldom a noticeable performance cost by using a Python GLM class.\n", "However, for the best performance, users may be interested in porting the code to C++ and exporting the bindings.\n", "We refer the readers to the \n", "[glm](https://github.com/JamesYang007/adelie/tree/main/adelie/src/include/adelie_core/glm) \n", "directory in our C++ backend for the numerous examples of how to extend the base classes in C++.\n", "The associated binding code can be found in\n", "[glm.cpp](https://github.com/JamesYang007/adelie/blob/main/adelie/src/glm.cpp)." ] } ], "metadata": { "kernelspec": { "display_name": "adelie", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 2 }