Machine Learning

Prototyping Gradient Descent in Machine Learning

Supervised Learning

Supervised learning is a type of machine learning that uses labeled datasets to train algorithms to predict outcomes and recognize patterns.

Unlike unsupervised learning, supervised algorithms are given labeled training data from which they learn the relationship between inputs and outputs.

Prerequisite: Linear algebra


Suppose we have a regression problem where the model needs to predict continuous values from n input features (xi).

The predicted value is defined by a function called the hypothesis (h), a linear combination of the input features plus an error term,

where:

  • θi: the parameter corresponding to each input feature (xi),
  • ε (epsilon): Gaussian error (ε ~ N(0, σ²))

Since the hypothesis for a single input produces a scalar value (hθ(x) ∈ ℝ), it can be expressed as the dot product of the transpose of the parameter vector (θᵀ) and the feature vector of the input (x), that is, θᵀx.
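For reference, the hypothesis described above can be written out as follows. This is a sketch that assumes the standard linear-regression form with an intercept term θ0 and a constant feature x0 = 1; the symbols match the definitions in the list above. When the model is fitted, the prediction uses only the deterministic part θᵀx.

```latex
\begin{aligned}
h_\theta(x) &= \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n + \epsilon \\
            &= \sum_{i=0}^{n} \theta_i x_i + \epsilon, \qquad x_0 = 1 \\
            &= \theta^{T} x + \epsilon, \qquad \theta,\, x \in \mathbb{R}^{n+1}
\end{aligned}
```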

Batch Gradient Descent

Gradient descent is an iterative optimization algorithm used to find a local minimum of a function. At each step, it moves in the direction opposite to the steepest ascent, gradually lowering the function's value. Simply put, it keeps going downhill.

Now, recall that we have n parameters that affect the prediction. We therefore need to know the specific contribution of each parameter (θi), corresponding to the training data (xi), to the function.

Suppose we set the size of each step as the learning rate (α) and find the slope of the cost function (J). Each parameter is then updated at every step by subtracting the learning rate times that slope.

(α: learning rate, J(θ): cost function, ∂/∂θi: partial derivative of the cost function with respect to θi)
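Written out, the per-parameter update described above takes the following form. This is the standard gradient descent update, given here as an assumption about the formula the text refers to; the symbols are the ones defined in the parenthetical above.

```latex
\theta_i := \theta_i - \alpha \, \frac{\partial}{\partial \theta_i} J(\theta),
\qquad i = 0, 1, \dots, n
```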

Gradient

The gradient represents the slope of the cost function.

Taking the remaining parameters into account, each with its corresponding partial derivative of the cost function (J), the gradient of the cost function at θ for n parameters is defined as a vector of the partial derivatives of the cost function with respect to all the parameters (θ0 to θn).

Since the learning rate is a scalar (α ∈ ℝ), the gradient descent update rule can be expressed in matrix notation, updating the whole parameter vector at once.

As a result, the parameter vector (θ) resides in an (n + 1)-dimensional space.
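A compact way to write the gradient vector and the vectorized update is sketched below. It assumes the usual column-vector convention with the intercept θ0 as the first entry, which is what makes the parameter space (n + 1)-dimensional.

```latex
\nabla_\theta J(\theta) =
\begin{bmatrix}
\dfrac{\partial J}{\partial \theta_0} \\[4pt]
\dfrac{\partial J}{\partial \theta_1} \\
\vdots \\
\dfrac{\partial J}{\partial \theta_n}
\end{bmatrix}
\in \mathbb{R}^{n+1},
\qquad
\theta := \theta - \alpha \, \nabla_\theta J(\theta)
```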

Geometrically, it steps downhill at a pace set by the learning rate until it reaches convergence.

Gradient descent stepping downhill toward the optimal parameter (Image source: Author)

Optimization

The objective of linear regression is to minimize the gap (MSE) between the predicted values and the actual values given in the training dataset.

Cost Function (Objective Function)

This gap (MSE) is defined as the average squared gap across all the training examples,

where

  • J(θ): the cost function (or loss function),
  • hθ: the prediction from the model,
  • x^(i): the i-th input feature vector,
  • y^(i): the i-th target value, and
  • m: the number of training examples.
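One common way to write this cost is shown below. The 1/(2m) scaling is an assumption, chosen because it matches the `_mse_loss` method used later in this article; other texts use 1/m, which changes the gradient only by a constant factor.

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^{2}
```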

The gradient is computed by taking the partial derivative of the cost function with respect to each parameter.

Because we have n + 1 parameters (including the intercept term θ0) and m training examples, we form the gradient vector using matrix notation.

In matrix notation, where X represents the design matrix including the intercept term and θ is the parameter vector, the gradient ∇J(θ) can be written compactly in terms of X, θ and the target vector y.
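As a sanity check on the matrix form, the sketch below computes ∇J(θ) = (1/m)·Xᵀ(Xθ − y) and compares it with a central finite-difference approximation of the cost. The formula assumes the 1/(2m)-scaled squared-error cost; the function names and the small synthetic dataset are illustrative only and do not come from the article.

```python
import numpy as np

def cost(theta, X, y):
    """Squared-error cost with the (assumed) 1/(2m) scaling."""
    m = len(y)
    residual = X @ theta - y
    return (residual @ residual) / (2 * m)

def gradient(theta, X, y):
    """Gradient in matrix form: (1/m) * X^T (X theta - y)."""
    m = len(y)
    return X.T @ (X @ theta - y) / m

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, perturbing one parameter at a time."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
m, n = 50, 3
# Design matrix with a leading column of ones for the intercept term
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

analytic = gradient(theta, X, y)
numeric = numerical_gradient(theta, X, y)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

The two gradients should agree to within floating-point error, which is a quick way to catch sign or scaling mistakes before running a full training loop.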

The LMS (least mean squares) rule is an algorithm that iteratively adjusts the model parameters based on the error between its predictions and the actual target values of the training examples.

Least Mean Squares (LMS) Rule

In each epoch of batch gradient descent, every parameter θi is updated by subtracting a fraction of the average error across all the training examples.

This process allows the algorithm to iteratively find the optimal parameters that minimize the cost function.
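In symbols, the batch LMS update is usually written as below. This is a sketch under the assumption of the 1/(2m)-scaled squared-error cost; here j indexes the training examples and i indexes the parameters, so x_i^(j) is the i-th feature of the j-th example.

```latex
\theta_i := \theta_i - \alpha \, \frac{1}{m} \sum_{j=1}^{m}
\left( h_\theta\!\left(x^{(j)}\right) - y^{(j)} \right) x_i^{(j)},
\qquad i = 0, 1, \dots, n
```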


Normal Equation

To find the optimal parameter (θ*) that minimizes the cost function, we can use the normal equation.

This approach provides an analytical solution to linear regression, allowing us to compute directly the value of θ that minimizes the cost function.

Unlike iterative optimization techniques, the normal equation finds this optimum by solving for the point where the gradient is zero, which yields the solution in a single step.

This relies on the assumption that the design matrix X has linearly independent columns, meaning that all the input features (from x_0 to x_n) are linearly independent, so that XᵀX is invertible.

If that is not the case, we need to adjust the input features to ensure their independence.
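A minimal NumPy sketch of the closed-form solution, θ* = (XᵀX)⁻¹Xᵀy, is shown below. The synthetic data and variable names are made up for illustration. The sketch solves the linear system rather than forming the inverse explicitly, and it also calls `np.linalg.lstsq`, which reports the rank of X and so offers one way to check the independence assumption.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 200, 3

# Design matrix with an intercept column; data are synthetic
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=m)

# Normal equation: solve (X^T X) theta = X^T y instead of inverting X^T X
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver; it also returns the rank of X
theta_lstsq, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print("normal equation:", np.round(theta_normal, 3))
print("lstsq          :", np.round(theta_lstsq, 3))
print("rank of X      :", rank, "of", X.shape[1], "columns")
```

If the reported rank is lower than the number of columns, the features are not linearly independent and `np.linalg.solve` on XᵀX is not reliable.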

Simulation

In practice, we repeat the process until convergence, after setting:

  • The cost function and its gradient
  • The learning rate
  • The tolerance (minimum threshold for stopping the iteration)
  • The maximum number of iterations
  • The starting point

Batch GD with Different Learning Rates

The following code snippet demonstrates gradient descent searching for the minimum of a quadratic cost function at four learning rates (0.1, 0.3, 0.8 and 0.9):

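import numpy as np
import matplotlib.pyplot as plt

def cost_func(x):
    # quadratic cost J(x) = x^2 - 4x + 1, minimized at x = 2
    return x**2 - 4 * x + 1

def gradient(x):
    # derivative of the cost: dJ/dx = 2x - 4
    return 2*x - 4

def gradient_descent(gradient, start, learn_rate, max_iter, tol):
    x = start
    steps = [start] # records learning steps

    for _ in range(max_iter):
        diff = learn_rate * gradient(x)  # size of this step
        if np.abs(diff) < tol:           # stop once the step falls below the tolerance
            break
        x = x - diff                     # move against the gradient
        steps.append(x)

    return x, steps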

x_values = np.linspace(-4, 11, 400)
y_values = cost_func(x_values)
initial_x = 9
iterations = 100
tolerance = 1e-6
learning_rates = [0.1, 0.3, 0.8, 0.9]

def gradient_descent_curve(ax, learning_rate):
    final_x, history = gradient_descent(gradient, initial_x, learning_rate, iterations, tolerance)

    ax.plot(x_values, y_values, label=f'Cost function: $J(x) = x^2 - 4x + 1$', lw=1, color='black')

    ax.scatter(history, [cost_func(x) for x in history], color='pink', zorder=5, label='Steps')
    ax.plot(history, [cost_func(x) for x in history], 'r--', lw=1, zorder=5)

    ax.annotate('Start', xy=(history[0], cost_func(history[0])), xytext=(history[0], cost_func(history[0]) + 10),
                arrowprops=dict(facecolor='black', shrink=0.05), ha='center')
    ax.annotate('End', xy=(final_x, cost_func(final_x)), xytext=(final_x, cost_func(final_x) + 10),
                arrowprops=dict(facecolor='black', shrink=0.05), ha='center')
    
    ax.set_title(f'Learning Rate: {learning_rate}')
    ax.set_xlabel('Input feature: x')
    ax.set_ylabel('Cost: J')
    ax.grid(True, alpha=0.5, ls='--', color='grey')
    ax.legend()

fig, axs = plt.subplots(1, 4, figsize=(30, 5))
fig.suptitle('Gradient Descent Steps by Learning Rate')

for ax, lr in zip(axs.flatten(), learning_rates):
    gradient_descent_curve(ax=ax, learning_rate=lr)

plt.show()

The learning rate controls the size of each descent step. (Assume the cost function J(x) is a quadratic function that takes a single input feature x.)
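For this particular quadratic, the effect of the learning rate can be worked out by hand. Since the gradient is 2x − 4 and the minimum is at x = 2, each update multiplies the distance to the minimum by (1 − 2α). The sketch below checks that against a direct simulation from the starting point x = 9; the helper name is illustrative and not part of the code above.

```python
def distance_from_minimum(steps, learn_rate, start=9.0, minimum=2.0):
    """Distance to the minimum after `steps` updates on J(x) = x^2 - 4x + 1."""
    x = start
    for _ in range(steps):
        x = x - learn_rate * (2 * x - 4)
    return x - minimum

learning_rates = [0.1, 0.3, 0.8, 0.9]
# Per-step multiplier of the distance to the minimum: (1 - 2 * alpha)
factors = {lr: 1 - 2 * lr for lr in learning_rates}

for lr, factor in factors.items():
    simulated = distance_from_minimum(steps=3, learn_rate=lr)
    predicted = 7.0 * factor ** 3  # initial distance is 9 - 2 = 7
    behaviour = "overshoots and oscillates" if factor < 0 else "approaches from one side"
    print(f"lr={lr}: factor={factor:+.1f}, simulated={simulated:+.4f}, "
          f"predicted={predicted:+.4f} ({behaviour})")
```

A negative factor means every step jumps across the minimum, which is the zig-zag pattern expected for 0.8 and 0.9. All four factors have magnitude below 1, so each run still converges; for this function a learning rate of 1.0 or more would not.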

Predicting Credit Card Transactions

Let's use a sample dataset from Kaggle to predict credit card transaction amounts using linear regression trained with batch GD.

1. Data Preprocessing

a) Base DataFrame

First, we'll merge these four files from the sample dataset using IDs as keys, while sanitizing the raw data:

  • Transaction (CSV)
  • User (CSV)
  • Credit card (CSV)
  • train_fraud_labels (JSON)
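
import json
import numpy as np
import pandas as pd

# `dir` (path to the dataset folder) and `sanitize_df` (a cleaner for
# currency-like strings) are assumed to be defined elsewhere

# load transaction data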
trx_df = pd.read_csv(f'{dir}/transactions_data.csv')

# sanitize the dataset 
trx_df = trx_df[trx_df['errors'].isna()]
trx_df = trx_df.drop(columns=['merchant_city','merchant_state', 'date', 'mcc', 'errors'], axis='columns')
trx_df['amount'] = trx_df['amount'].apply(sanitize_df)

# merge the dataframe with fraud transaction flag.
with open(f'{dir}/train_fraud_labels.json', 'r') as fp:
    fraud_labels_json = json.load(fp=fp)

fraud_labels_dict = fraud_labels_json.get('target', {})
fraud_labels_series = pd.Series(fraud_labels_dict, name='is_fraud')
fraud_labels_series.index = fraud_labels_series.index.astype(int)

merged_df = pd.merge(trx_df, fraud_labels_series, left_on='id', right_index=True, how='left')
merged_df.fillna({'is_fraud': 'No'}, inplace=True)
merged_df['is_fraud'] = merged_df['is_fraud'].map({'Yes': 1, 'No': 0})
merged_df = merged_df.dropna()

# load card data
card_df = pd.read_csv(f'{dir}/cards_data.csv')
card_df = card_df.replace('nan', np.nan).dropna()
card_df = card_df[card_df['card_on_dark_web'] == 'No']
card_df = card_df.drop(columns=['acct_open_date', 'card_number', 'expires', 'cvv', 'card_on_dark_web'], axis='columns')
card_df['credit_limit'] = card_df['credit_limit'].apply(sanitize_df)

# load user data
user_df = pd.read_csv(f'{dir}/users_data.csv')
user_df = user_df.drop(columns=['birth_year', 'birth_month', 'address', 'latitude', 'longitude'], axis='columns')
user_df = user_df.replace('nan', np.nan).dropna()
user_df['per_capita_income'] = user_df['per_capita_income'].apply(sanitize_df)
user_df['yearly_income'] = user_df['yearly_income'].apply(sanitize_df)
user_df['total_debt'] = user_df['total_debt'].apply(sanitize_df)

# merge transaction and card data
merged_df = pd.merge(left=merged_df, right=card_df, left_on='card_id', right_on='id', how='inner')
merged_df = pd.merge(left=merged_df, right=user_df, left_on='client_id_x', right_on='id', how='inner')
merged_df = merged_df.drop(columns=['id_x', 'client_id_x', 'card_id', 'merchant_id', 'id_y', 'client_id_y', 'id'], axis='columns')
merged_df = merged_df.dropna()

# finalize the dataframe
categorical_cols = merged_df.select_dtypes(include=['object']).columns
df = merged_df.copy()
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=False, dtype=float)
df = df.dropna()
print('Base data frame: \n', df.head(n=3))
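The loading code calls `sanitize_df`, which is not defined in the snippet. A minimal sketch of such a helper might look like the following, assuming the monetary columns are stored as strings such as "$1,234.56". That format is an assumption about the raw files, not something shown in the article, and the body below is hypothetical.

```python
import numpy as np

def sanitize_df(value):
    """Hypothetical helper: turn a currency-like string into a float."""
    if isinstance(value, str):
        # Drop the currency symbol and thousands separators (assumed format)
        cleaned = value.replace('$', '').replace(',', '').strip()
        try:
            return float(cleaned)
        except ValueError:
            return np.nan  # unparseable strings become missing values
    return value  # numeric inputs pass through unchanged

print(sanitize_df('$1,234.56'))
print(sanitize_df('$-77.00'))
```

Returning NaN for unparseable strings fits the loading code above, which calls `dropna()` after the merges.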

b) Preprocessing
From the base DataFrame, we'll select the input features that have:
continuous values, and a seemingly linear relationship with the transaction amount.

df = df[df['is_fraud'] == 0]
df = df[['amount', 'per_capita_income', 'yearly_income', 'credit_limit', 'credit_score', 'current_age']]

Then, we'll filter out the outliers that lie more than 3 standard deviations away from the mean:

def filter_outliers(df, column, std_threshold) -> pd.DataFrame:
    mean = df[column].mean()
    std = df[column].std()
    upper_bound = mean + std_threshold * std
    lower_bound = mean - std_threshold * std
    # keep rows inside [lower_bound, upper_bound]; both conditions must hold
    filtered_df = df[(df[column] <= upper_bound) & (df[column] >= lower_bound)]
    return filtered_df

df = df.replace(to_replace='NaN', value=0)
df = filter_outliers(df=df, column='amount', std_threshold=3)
df = filter_outliers(df=df, column='per_capita_income', std_threshold=3)
df = filter_outliers(df=df, column='credit_limit', std_threshold=3)

Finally, we'll take the logarithm of the target variable amount to reduce the skew of its distribution:

df['amount'] = df['amount'] + 1
df['amount_log'] = np.log(df['amount'])
df = df.drop(columns=['amount'], axis='columns')
df = df.dropna()

* One is added to amount to avoid negative infinity (the logarithm of zero) in the amount_log column.
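The shift by one matters because log(0) is negative infinity. The sketch below shows this on a few made-up amounts and notes that `np.log1p` computes the same log(1 + x) transform in a single call, with `np.expm1` as its inverse. The sample values are illustrative only.

```python
import numpy as np

amounts = np.array([0.0, 9.0, 99.0])  # made-up values

# Without the shift, a zero amount maps to -inf
with np.errstate(divide='ignore'):
    raw_log = np.log(amounts)

shifted_log = np.log(amounts + 1)   # the transform used above
log1p_values = np.log1p(amounts)    # same result in one call

# np.expm1 undoes the transform, e.g. to report predictions in the original unit
recovered = np.expm1(shifted_log)

print(raw_log)
print(shifted_log)
print(recovered)
```

Amounts below −1 would still produce NaN after the shift, and the `dropna()` call above removes those rows.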

Final DataFrame:

c) Transformer
Now, we can split the final DataFrame into train/test datasets and transform them:

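from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# separate the features from the log-transformed target
X = df.drop(columns=['amount_log'])
y = df['amount_log']

# hold out a test set (test_size and random_state are assumed values, not given in the text)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

categorical_features = X.select_dtypes(include=['object']).columns.tolist()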
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),('onehot', OneHotEncoder(handle_unknown='ignore'))])

numerical_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)


X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

2. Defining the Batch GD Regressor

class BatchGradientDescentLinearRegressor:
    def __init__(self, learning_rate=0.01, n_iterations=1000, l2_penalty=0.01, tol=1e-4, patience=10):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.l2_penalty = l2_penalty
        self.tol = tol
        self.patience = patience
        self.weights = None
        self.bias = None
        self.history = {'loss': [], 'grad_norm': [], 'weight':[], 'bias': [], 'val_loss': []}
        self.best_weights = None
        self.best_bias = None
        self.best_val_loss = float('inf')
        self.epochs_no_improve = 0

    def _mse_loss(self, y_true, y_pred, weights):
        m = len(y_true)
        loss = (1 / (2 * m)) * np.sum((y_pred - y_true)**2)
        l2_term = (self.l2_penalty / (2 * m)) * np.sum(weights**2)
        return loss + l2_term

    def fit(self, X_train, y_train, X_val=None, y_val=None):
        n_samples, n_features = X_train.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for i in range(self.n_iterations):
            y_pred = np.dot(X_train, self.weights) + self.bias
        
            dw = (1 / n_samples) * np.dot(X_train.T, (y_pred - y_train)) + (self.l2_penalty / n_samples) * self.weights
            db = (1 / n_samples) * np.sum(y_pred - y_train)

            loss = self._mse_loss(y_train, y_pred, self.weights)
            gradient = np.concatenate([dw, [db]])
            grad_norm = np.linalg.norm(gradient)

            # update history
            self.history['weight'].append(self.weights[0])
            self.history['loss'].append(loss)
            self.history['grad_norm'].append(grad_norm)
            self.history['bias'].append(self.bias)

            # descent
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

            if X_val is not None and y_val is not None:
                val_y_pred = np.dot(X_val, self.weights) + self.bias
                val_loss = self._mse_loss(y_val, val_y_pred, self.weights)
                self.history['val_loss'].append(val_loss)

                if val_loss < self.best_val_loss - self.tol:
                    self.best_val_loss = val_loss
                    self.best_weights = self.weights.copy()
                    self.best_bias = self.bias
                    self.epochs_no_improve = 0
                else:
                    self.epochs_no_improve += 1
                    if self.epochs_no_improve >= self.patience:
                        print(f"Early stopping at iteration {i+1} (validation loss did not improve for {self.patience} epochs)")
                        self.weights = self.best_weights
                        self.bias = self.best_bias
                        break

            if (i + 1) % 100 == 0:
                # build the progress message, then print it once with a newline
                message = f"Iteration {i+1}/{self.n_iterations}, Loss: {loss:.4f}"
                if X_val is not None and y_val is not None:
                    message += f", Validation Loss: {val_loss:.4f}"
                print(message)

    def predict(self, X_test):
        return np.dot(X_test, self.weights) + self.bias

3. Prediction & Evaluation

model = BatchGradientDescentLinearRegressor(learning_rate=0.001, n_iterations=10000, l2_penalty=0, tol=1e-5, patience=5)
model.fit(X_train_processed, y_train.values)
y_pred = model.predict(X_test_processed)

Output:
Of the five input features, per_capita_income showed the highest correlation with the transaction amount:

(Figure. Left: Transactions, Right: Cost function)

Mean Squared Error (MSE): 1.5752
R-Squared: 0.0206
Mean Absolute Error (MAE): 1.0472
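How these figures were produced is not shown. One way to compute the same three metrics with plain NumPy is sketched below; the function name and the toy arrays are illustrative and are not the article's data.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, R-squared and MAE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred

    mse = float(np.mean(residual ** 2))
    mae = float(np.mean(np.abs(residual)))

    # R-squared: 1 - (residual sum of squares / total sum of squares)
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    return {'mse': mse, 'r2': r2, 'mae': mae}

# Toy example
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

With the model above, the call would be `regression_metrics(y_test.values, y_pred)`. An R-squared close to zero, as reported here, means the predictions explain little more variance than always predicting the mean.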

Time complexity: Training: O(n² + n³), Prediction: O(n)
Space complexity: O(nm)
(m: training example size, n: input feature size, assuming m >> n)


Stochastic Gradient Descent

Batch GD uses the entire training dataset to compute the gradient at each iteration step (epoch), which is computationally expensive, especially when we have millions of data points.

Stochastic Gradient Descent (SGD), on the other hand,

  1. typically shuffles the training data at the beginning of each epoch,
  2. randomly selects a single training example in each iteration,
  3. calculates the gradient using that single example, and
  4. updates the model's weights and bias after processing each individual training example.

This results in many weight updates per epoch (equal to the number of training samples). Each update is cheap and quick because it is based on a single data point, which allows SGD to iterate through large datasets much faster.

Simulation

Similar to batch GD, we'll define the SGD class and run the prediction:

class StochasticGradientDescentLinearRegressor:
    def __init__(self, learning_rate=0.01, n_iterations=100, l2_penalty=0.01, random_state=None):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.l2_penalty = l2_penalty
        self.random_state = random_state
        self._rng = np.random.default_rng(seed=random_state)
        self.weights_history = []
        self.bias_history = []
        self.loss_history = []
        self.weights = None
        self.bias = None

    def _mse_loss_single(self, y_true, y_pred):
        return 0.5 * (y_pred - y_true)**2

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = self._rng.random(n_features)
        self.bias = 0.0

        for epoch in range(self.n_iterations):
            permutation = self._rng.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            epoch_loss = 0
            for i in range(n_samples):
                xi = X_shuffled[i]
                yi = y_shuffled[i]

                y_pred = np.dot(xi, self.weights) + self.bias
                dw = xi * (y_pred - yi) + self.l2_penalty * self.weights
                db = y_pred - yi

                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db
                epoch_loss += self._mse_loss_single(yi, y_pred)

                if n_features >= 2:
                    self.weights_history.append(self.weights[:2].copy())
                elif n_features == 1:
                    self.weights_history.append(np.array([self.weights[0], 0]))
                self.bias_history.append(self.bias)
                self.loss_history.append(self._mse_loss_single(yi, y_pred) + (self.l2_penalty / (2 * n_samples)) * (np.sum(self.weights**2) + self.bias**2)) # Approx L2

            print(f"Epoch {epoch+1}/{self.n_iterations}, Loss: {epoch_loss/n_samples:.4f}")

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

model = StochasticGradientDescentLinearRegressor(learning_rate=0.001, n_iterations=200, random_state=42)
model.fit(X=X_train_processed, y=y_train.values)
y_pred = model.predict(X_test_processed)

Output:

Left: Weights on input features, Right: Cost function (learning_rate = 0.001, i = 200, m = 50,000, n = 5)

SGD introduces randomness into the optimization process (Fig. right).

This “noise” can help the algorithm jump out of shallow local minima or saddle points and potentially find better regions of the parameter space.

Results:
Mean Squared Error (MSE): 1.5808
R-Squared: 0.0172
Mean Absolute Error (MAE): 1.0475

Time complexity: Training: O(n² + n³), Prediction: O(n)
Space complexity: O(n)
(m: training example size, n: input feature size, assuming m >> n)


Conclusion

While a simple linear model is computationally efficient, its inherent simplicity often prevents it from capturing complex relationships within the data.

Weighing the trade-offs between different modeling approaches is important for achieving good results.


Reference

All images, unless otherwise noted, are by the author.

The article uses data licensed under Apache 2.0 for commercial use.


Author: Kiriko Wai

Portfolio / LinkedIn / GitHub
