Gradient Descent in Machine Learning

Supervised Learning
Supervised learning is a type of machine learning that uses labeled datasets to train algorithms to predict outcomes and recognize patterns.
Unlike unsupervised learning, supervised algorithms are given training data that is labeled with the relationship between inputs and outputs.
Prerequisite: Linear Regression
Suppose we have a regression problem, where the model needs to predict continuous values by taking n input features (x_i).
The predicted value is defined by a function called the hypothesis (h):
where:
- θ_i: the i-th parameter corresponding to each input feature (x_i),
- ε (epsilon): Gaussian error (ε ~ N(0, σ²)).
Since the hypothesis for a single input produces a scalar value (h_θ(x) ∈ ℝ), it can be expressed as the dot product of the transposed parameter vector (θᵀ) and the feature vector of that input (x):
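With x_0 = 1 for the intercept term, this standard form can be written as:
h_\theta(x) = \theta^{T} x = \sum_{i=0}^{n} \theta_i x_i = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n
and the observed target is modeled as y = h_\theta(x) + \epsilon.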

Batch Gradient Descent
Gradient descent is an iterative optimization algorithm used to find a local minimum of a function. At each step, it moves in the direction opposite to the steepest ascent of the function, gradually descending toward the bottom.
Now, recall that we have n parameters that affect the predictions. Therefore, we need to know the specific contribution of each parameter (θ_i), corresponding to the training data (x_i), to the cost function.
Suppose we set each step size as the learning rate (α) and define the cost function as (J); then each parameter is updated at every step as follows:
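That is, each parameter is updated simultaneously for i = 0, …, n:
\theta_i := \theta_i - \alpha \frac{\partial J(\theta)}{\partial \theta_i}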

(α: learning rate, J(θ): cost function, ∂/∂θ_i: partial derivative of the cost function with respect to θ_i)
Gradient
The gradient represents the slope of the cost function.
Taking the partial derivative of the cost function (J) with respect to each parameter while holding the others constant, the gradient of J with respect to θ for n parameters is defined as:
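That is, the gradient collects all n + 1 partial derivatives into a single vector:
\nabla_\theta J(\theta) = \left[ \frac{\partial J(\theta)}{\partial \theta_0}, \frac{\partial J(\theta)}{\partial \theta_1}, \dots, \frac{\partial J(\theta)}{\partial \theta_n} \right]^{T}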

The gradient is the vector of partial derivatives of the cost function with respect to all parameters (θ_0 to θ_n).
Since the learning rate is a scalar (α ∈ ℝ), the gradient descent update rule can be expressed in matrix notation:
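In vector form, all parameters are updated simultaneously:
\theta := \theta - \alpha \nabla_\theta J(\theta)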

As a result, the parameter vector (θ) resides in the (n + 1)-dimensional space.
Geometrically, it steps down the slope at a rate proportional to the learning rate until convergence.

Optimization
The goal of optimization is to minimize the mean squared error (MSE) between the predicted values and the actual values in the given training dataset.
Cost Function (Objective Function)
This error (MSE) is defined as the average squared difference over all training examples:
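With the factor of 1/2 included for convenience when differentiating (matching the _mse_loss implementation later in this article), the cost function is:
J(\theta) = \frac{1}{2m} \sum_{j=1}^{m} \left( h_\theta(x^{(j)}) - y^{(j)} \right)^2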

where:
- J(θ): the cost function (or loss function),
- h_θ: the prediction from the model,
- x^(j): the input feature vector of the j-th training example,
- y^(j): the target value of the j-th training example, and
- m: the number of training examples.
The gradient is computed by taking the partial derivative of the cost function with respect to each parameter:
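Using the MSE cost function above, this partial derivative is:
\frac{\partial J(\theta)}{\partial \theta_i} = \frac{1}{m} \sum_{j=1}^{m} \left( h_\theta(x^{(j)}) - y^{(j)} \right) x_i^{(j)}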

Because we have n + 1 parameters (including the intercept term θ_0) and m training instances, we can build the gradient vector using matrix notation:

In matrix notation, where X represents the design matrix including the intercept term and θ is the parameter vector, the gradient ∇J(θ) is given by:
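With X ∈ ℝ^{m×(n+1)} and y ∈ ℝ^{m}, this is:
\nabla_\theta J(\theta) = \frac{1}{m} X^{T} (X\theta - y)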

The LMS (least mean squares) rule is an iterative algorithm that continuously adjusts the model parameters based on the error between its predictions and the actual target values of the training examples.
Least Mean Squares (LMS) Rule
At each gradient descent iteration, every parameter θ_i is updated by subtracting a fraction of the aggregated error over all training examples:
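Using the gradient derived above, each update takes the form:
\theta_i := \theta_i - \frac{\alpha}{m} \sum_{j=1}^{m} \left( h_\theta(x^{(j)}) - y^{(j)} \right) x_i^{(j)}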

This procedure allows the algorithm to iteratively converge toward the optimal parameters that minimize the cost function.
Normal Equation
To find the optimal parameters (θ*) that minimize the cost function, we can also use the normal equation.
This approach provides an analytical solution to linear regression, allowing us to directly compute the value of θ that minimizes the cost function.
Unlike iterative methods, the normal equation achieves this optimum by solving for the point where the gradient is zero, ensuring immediate convergence:
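Setting the gradient to zero yields the so-called normal equations:
\nabla_\theta J(\theta) = \frac{1}{m} X^{T}(X\theta - y) = 0 \;\Longrightarrow\; X^{T} X \theta = X^{T} y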

So:
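\theta^{*} = (X^{T} X)^{-1} X^{T} y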

This assumes that the design matrix XᵀX is invertible, meaning that all the input features (from x_0 to x_n) are linearly independent.
If X is not invertible, we would need to adjust the input features (for example, by removing redundant ones) to ensure their linear independence.
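As a minimal sketch (with synthetic data and illustrative variable names, not part of the original example), the normal equation can be evaluated directly with NumPy; np.linalg.pinv is used so the computation still behaves sensibly when XᵀX is close to singular:
import numpy as np

# synthetic design matrix with an intercept column of ones (x_0 = 1)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.random((100, 2))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(0, 0.1, size=100)

# normal equation: theta* = (X^T X)^(-1) X^T y
theta_star = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta_star)  # approximately [1.0, 2.0, -3.0]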
Implementation
In practice, we repeat the process until convergence by defining:
- the cost function and its gradient,
- the learning rate,
- the tolerance (the minimum cost improvement needed to continue iterating),
- the maximum number of iterations, and
- the starting point.
Batch GD by learning rate
The following code snippet shows the gradient descent process minimizing a quadratic cost function at four different learning rates (0.1, 0.3, 0.8, and 0.9):
import numpy as np
import matplotlib.pyplot as plt

def cost_func(x):
    return x**2 - 4 * x + 1

def gradient(x):
    return 2 * x - 4

def gradient_descent(gradient, start, learn_rate, max_iter, tol):
    x = start
    steps = [start]  # records learning steps
    for _ in range(max_iter):
        diff = learn_rate * gradient(x)
        if np.abs(diff) < tol:
            break
        x = x - diff
        steps.append(x)
    return x, steps

x_values = np.linspace(-4, 11, 400)
y_values = cost_func(x_values)
initial_x = 9
iterations = 100
tolerance = 1e-6
learning_rates = [0.1, 0.3, 0.8, 0.9]

def gradient_descent_curve(ax, learning_rate):
    final_x, history = gradient_descent(gradient, initial_x, learning_rate, iterations, tolerance)
    ax.plot(x_values, y_values, label='Cost function: $J(x) = x^2 - 4x + 1$', lw=1, color='black')
    ax.scatter(history, [cost_func(x) for x in history], color='pink', zorder=5, label='Steps')
    ax.plot(history, [cost_func(x) for x in history], 'r--', lw=1, zorder=5)
    ax.annotate('Start', xy=(history[0], cost_func(history[0])), xytext=(history[0], cost_func(history[0]) + 10),
                arrowprops=dict(facecolor='black', shrink=0.05), ha='center')
    ax.annotate('End', xy=(final_x, cost_func(final_x)), xytext=(final_x, cost_func(final_x) + 10),
                arrowprops=dict(facecolor='black', shrink=0.05), ha='center')
    ax.set_title(f'Learning Rate: {learning_rate}')
    ax.set_xlabel('Input feature: x')
    ax.set_ylabel('Cost: J')
    ax.grid(True, alpha=0.5, ls='--', color='grey')
    ax.legend()

fig, axs = plt.subplots(1, 4, figsize=(30, 5))
fig.suptitle('Gradient Descent Steps by Learning Rate')
for ax, lr in zip(axs.flatten(), learning_rates):
    gradient_descent_curve(ax=ax, learning_rate=lr)

Predicting credit card transaction amounts
Let us use a sample dataset from Kaggle to predict credit card transaction amounts with linear regression, using the batch GD regressor.
1. Data Preprocessing
a) Base DataFrame
First, we will merge these four files from the sample dataset using IDs as keys, while sanitizing the raw data:
- transactions (CSV)
- users (CSV)
- cards (CSV)
- train_fraud_labels (JSON)
import json
import numpy as np
import pandas as pd

# dir: path to the dataset directory; sanitize_df: helper defined elsewhere
# (presumably converts currency strings such as '$123.45' into floats)

# load transaction data
trx_df = pd.read_csv(f'{dir}/transactions_data.csv')

# sanitize the dataset
trx_df = trx_df[trx_df['errors'].isna()]
trx_df = trx_df.drop(columns=['merchant_city', 'merchant_state', 'date', 'mcc', 'errors'], axis='columns')
trx_df['amount'] = trx_df['amount'].apply(sanitize_df)

# merge the dataframe with the fraud transaction flag
with open(f'{dir}/train_fraud_labels.json', 'r') as fp:
    fraud_labels_json = json.load(fp=fp)
fraud_labels_dict = fraud_labels_json.get('target', {})
fraud_labels_series = pd.Series(fraud_labels_dict, name='is_fraud')
fraud_labels_series.index = fraud_labels_series.index.astype(int)
merged_df = pd.merge(trx_df, fraud_labels_series, left_on='id', right_index=True, how='left')
merged_df.fillna({'is_fraud': 'No'}, inplace=True)
merged_df['is_fraud'] = merged_df['is_fraud'].map({'Yes': 1, 'No': 0})
merged_df = merged_df.dropna()

# load card data
card_df = pd.read_csv(f'{dir}/cards_data.csv')
card_df = card_df.replace('nan', np.nan).dropna()
card_df = card_df[card_df['card_on_dark_web'] == 'No']
card_df = card_df.drop(columns=['acct_open_date', 'card_number', 'expires', 'cvv', 'card_on_dark_web'], axis='columns')
card_df['credit_limit'] = card_df['credit_limit'].apply(sanitize_df)

# load user data
user_df = pd.read_csv(f'{dir}/users_data.csv')
user_df = user_df.drop(columns=['birth_year', 'birth_month', 'address', 'latitude', 'longitude'], axis='columns')
user_df = user_df.replace('nan', np.nan).dropna()
user_df['per_capita_income'] = user_df['per_capita_income'].apply(sanitize_df)
user_df['yearly_income'] = user_df['yearly_income'].apply(sanitize_df)
user_df['total_debt'] = user_df['total_debt'].apply(sanitize_df)

# merge transaction, card, and user data
merged_df = pd.merge(left=merged_df, right=card_df, left_on='card_id', right_on='id', how='inner')
merged_df = pd.merge(left=merged_df, right=user_df, left_on='client_id_x', right_on='id', how='inner')
merged_df = merged_df.drop(columns=['id_x', 'client_id_x', 'card_id', 'merchant_id', 'id_y', 'client_id_y', 'id'], axis='columns')
merged_df = merged_df.dropna()

# finalize the dataframe: one-hot encode categorical columns
categorical_cols = merged_df.select_dtypes(include=['object']).columns
df = merged_df.copy()
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=False, dtype=float)
df = df.dropna()
print('Base data frame: \n', df.head(n=3))

b) Feature selection
From the base DataFrame, we will select the appropriate input features that have:
continuous values, and a seemingly linear relationship with the target purchase amount.
df = df[df['is_fraud'] == 0]
df = df[['amount', 'per_capita_income', 'yearly_income', 'credit_limit', 'credit_score', 'current_age']]
Then, we will filter out the outliers that lie more than 3 standard deviations away from the mean:
def filter_outliers(df, column, std_threshold) -> pd.DataFrame:
    mean = df[column].mean()
    std = df[column].std()
    upper_bound = mean + std_threshold * std
    lower_bound = mean - std_threshold * std
    # keep only rows within both bounds
    filtered_df = df[(df[column] <= upper_bound) & (df[column] >= lower_bound)]
    return filtered_df

df = df.replace(to_replace='NaN', value=0)
df = filter_outliers(df=df, column='amount', std_threshold=3)
df = filter_outliers(df=df, column='per_capita_income', std_threshold=3)
df = filter_outliers(df=df, column='credit_limit', std_threshold=3)
Finally, we will take the logarithm of the target value amount to mitigate its skewed distribution:
df['amount'] = df['amount'] + 1
df['amount_log'] = np.log(df['amount'])
df = df.drop(columns=['amount'], axis='columns')
df = df.dropna()
* Added one to the amount to avoid negative infinity in the amount_log column.
Final DataFrame:

c) Transformation
Now, we can split and transform the final DataFrame into train/test datasets:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# split features and target (split parameters here are illustrative; the original values are not shown)
X = df.drop(columns=['amount_log'])
y = df['amount_log']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

categorical_features = X.select_dtypes(include=['object']).columns.tolist()
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))])
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
2. Defining the Batch GD Regressor
class BatchGradientDescentLinearRegressor:
    def __init__(self, learning_rate=0.01, n_iterations=1000, l2_penalty=0.01, tol=1e-4, patience=10):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.l2_penalty = l2_penalty
        self.tol = tol
        self.patience = patience
        self.weights = None
        self.bias = None
        self.history = {'loss': [], 'grad_norm': [], 'weight': [], 'bias': [], 'val_loss': []}
        self.best_weights = None
        self.best_bias = None
        self.best_val_loss = float('inf')
        self.epochs_no_improve = 0

    def _mse_loss(self, y_true, y_pred, weights):
        # MSE with an L2 regularization term
        m = len(y_true)
        loss = (1 / (2 * m)) * np.sum((y_pred - y_true)**2)
        l2_term = (self.l2_penalty / (2 * m)) * np.sum(weights**2)
        return loss + l2_term

    def fit(self, X_train, y_train, X_val=None, y_val=None):
        n_samples, n_features = X_train.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for i in range(self.n_iterations):
            # forward pass over the full batch
            y_pred = np.dot(X_train, self.weights) + self.bias
            # gradients of the regularized MSE w.r.t. weights and bias
            dw = (1 / n_samples) * np.dot(X_train.T, (y_pred - y_train)) + (self.l2_penalty / n_samples) * self.weights
            db = (1 / n_samples) * np.sum(y_pred - y_train)
            loss = self._mse_loss(y_train, y_pred, self.weights)
            gradient = np.concatenate([dw, [db]])
            grad_norm = np.linalg.norm(gradient)

            # update history
            self.history['weight'].append(self.weights[0])
            self.history['loss'].append(loss)
            self.history['grad_norm'].append(grad_norm)
            self.history['bias'].append(self.bias)

            # descent step
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

            # early stopping based on validation loss
            if X_val is not None and y_val is not None:
                val_y_pred = np.dot(X_val, self.weights) + self.bias
                val_loss = self._mse_loss(y_val, val_y_pred, self.weights)
                self.history['val_loss'].append(val_loss)
                if val_loss < self.best_val_loss - self.tol:
                    self.best_val_loss = val_loss
                    self.best_weights = self.weights.copy()
                    self.best_bias = self.bias
                    self.epochs_no_improve = 0
                else:
                    self.epochs_no_improve += 1
                    if self.epochs_no_improve >= self.patience:
                        print(f"Early stopping at iteration {i+1} (validation loss did not improve for {self.patience} epochs)")
                        self.weights = self.best_weights
                        self.bias = self.best_bias
                        break

            if (i + 1) % 100 == 0:
                print(f"Iteration {i+1}/{self.n_iterations}, Loss: {loss:.4f}", end="")
                if X_val is not None:
                    print(f", Validation Loss: {val_loss:.4f}")
                else:
                    print()  # finish the progress line when there is no validation set

    def predict(self, X_test):
        return np.dot(X_test, self.weights) + self.bias
3. Prediction & Evaluation
model = BatchGradientDescentLinearRegressor(learning_rate=0.001, n_iterations=10000, l2_penalty=0, tol=1e-5, patience=5)
model.fit(X_train_processed, y_train.values)
y_pred = model.predict(X_test_processed)
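The metrics reported below can be computed along these lines (a sketch using scikit-learn's metrics; y_test is assumed to be the held-out target from the earlier split, and the original evaluation code is not shown):
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# evaluate the batch GD regressor on the held-out test set
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")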
Output:
Among the five input features, per_capita_income showed the strongest association with the purchase amount.
(Left: weights by input feature, Right: cost function)
Mean squared error (MSE): 1.5752
R-squared: 0.0206
Mean absolute error (MAE): 1.0472
Time complexity: training: O(n²m + n³); prediction: O(n)
Space complexity: O(nm)
(m: training sample size, n: input feature size, assuming m ≫ n)
Stochastic Gradient Descent
Batch GD uses all the training data to compute the gradient at each iteration (epoch) step, which is computationally expensive, especially when we have millions of data points.
Stochastic Gradient Descent (SGD), on the other hand,
- typically shuffles the training data at the beginning of each epoch,
- randomly selects a single training example in each iteration,
- calculates the gradient using that single example, and
- updates the model's weights and bias after processing each training example.
This results in many weight updates per epoch (equal to the number of training samples), and these frequent, computationally cheap updates based on individual data points allow the algorithm to traverse large datasets much faster, as shown in the update rule below.
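For a single randomly selected training example (x^{(j)}, y^{(j)}), each SGD update takes the form:
\theta_i := \theta_i - \alpha \left( h_\theta(x^{(j)}) - y^{(j)} \right) x_i^{(j)}
(an L2 penalty term can be added to the weight gradient, as in the class below).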
Implementation
Just like with batch GD, we will define the SGD class and use it for prediction:
class StochasticGradientDescentLinearRegressor:
    def __init__(self, learning_rate=0.01, n_iterations=100, l2_penalty=0.01, random_state=None):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.l2_penalty = l2_penalty
        self.random_state = random_state
        self._rng = np.random.default_rng(seed=random_state)
        self.weights_history = []
        self.bias_history = []
        self.loss_history = []
        self.weights = None
        self.bias = None

    def _mse_loss_single(self, y_true, y_pred):
        # squared error of a single example
        return 0.5 * (y_pred - y_true)**2

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = self._rng.random(n_features)
        self.bias = 0.0
        for epoch in range(self.n_iterations):
            # shuffle the training data at the start of each epoch
            permutation = self._rng.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]
            epoch_loss = 0
            for i in range(n_samples):
                # gradient and update from a single training example
                xi = X_shuffled[i]
                yi = y_shuffled[i]
                y_pred = np.dot(xi, self.weights) + self.bias
                dw = xi * (y_pred - yi) + self.l2_penalty * self.weights
                db = y_pred - yi
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db
                epoch_loss += self._mse_loss_single(yi, y_pred)
                # track the first two weights for visualization
                if n_features >= 2:
                    self.weights_history.append(self.weights[:2].copy())
                elif n_features == 1:
                    self.weights_history.append(np.array([self.weights[0], 0]))
                self.bias_history.append(self.bias)
                self.loss_history.append(self._mse_loss_single(yi, y_pred) + (self.l2_penalty / (2 * n_samples)) * (np.sum(self.weights**2) + self.bias**2))  # approx. L2
            print(f"Epoch {epoch+1}/{self.n_iterations}, Loss: {epoch_loss/n_samples:.4f}")

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias
model = StochasticGradientDescentLinearRegressor(learning_rate=0.001, n_iterations=200, random_state=42)
model.fit(X=X_train_processed, y=y_train.values)
y_pred = model.predict(X_test_processed)
Output:
(Left: weights by input feature, Right: cost function; learning rate = 0.001, iterations = 200, m = 50,000, n = 5)
SGD introduced noise into the optimization process (Fig. right).
This "noise" can help the algorithm jump out of shallow local minima or saddle points and potentially find better regions of the parameter space.
Results:
Mean squared error (MSE): 1.5808
R-squared: 0.0172
Mean absolute error (MAE): 1.0475
Time complexity: training: O(n²m + n³); prediction: O(n)
Space complexity: O(n)
(m: training sample size, n: input feature size, assuming m ≫ n)
Conclusion
While a simple linear model works reasonably well, its inherent simplicity often prevents it from capturing complex relationships within the data.
Weighing the computational trade-offs of different optimization approaches is important for achieving the best results.
Note
All images, unless otherwise noted, are by the author.
The article uses a synthetic transaction dataset, licensed under Apache 2.0.
Author: Kuriko Iwai
Portfolio / LinkedIn / GitHub