Command-line math for basic information science


Getting started

If you're just starting your data science journey, you might think you need tools like Python, R, or dedicated statistical software to analyze data. However, the command line is already a powerful statistical tool.

Command-line tools often process large datasets faster than memory-heavy applications, because they stream through files rather than loading them whole. They are easy to automate, and they work on any Unix system without installing anything extra.

In this article, you will learn to perform essential statistical calculations directly from your terminal using only built-in tools.

The full Bash script for this tutorial is available on GitHub. Coding along is highly recommended to understand the concepts fully.

To follow this tutorial, you will need:

  • A Unix-like environment (Linux, macOS, or Windows with WSL).
  • The standard Unix tools (cut, sort, uniq, wc, head, tail, and awk), which come preinstalled.

Open your terminal to start.

Sample dataset

Before we can analyze data, we need data. Create a simple CSV file representing daily web traffic by running the following command in your terminal:

cat > traffic.csv << EOF
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
2024-01-05,980,3400,51.2
2024-01-06,1100,3900,48.5
2024-01-07,1680,6100,40.1
2024-01-08,1550,5600,41.9
2024-01-09,1420,5100,44.2
2024-01-10,1290,4700,46.3
EOF

This creates a new file called traffic.csv with a header row and ten rows of sample data.

Checking your data

// Counting rows in your data

One of the first things to check in a dataset is the number of records it contains. The wc (word count) command with the -l flag counts the number of lines in a file:

wc -l traffic.csv

The output is 11 traffic.csv: 11 lines total, minus 1 header line, leaves 10 data rows.

// Viewing your data

Before proceeding with any calculations, it is useful to confirm the structure of the data. The head command displays the first few lines of a file:

head -n 5 traffic.csv

This shows the first 5 lines, letting you preview the header and the first few records:

date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8

// Extracting a single column

To work with a specific column in a CSV file, use the cut command with a field number. The following command outputs the visitors column:

cut -d',' -f2 traffic.csv | tail -n +2

This extracts field 2 (the visitors column) with cut, then uses tail -n +2 to skip the header line. The output is the ten visitor counts, one per line.

Calculating measures of central tendency

// Finding the mean (average)

The mean is the sum of all values divided by the number of values. We can calculate it by extracting the target column, then using awk to accumulate the values:

cut -d',' -f2 traffic.csv | tail -n +2 | awk '{sum+=$1; count++} END {print "Mean:", sum/count}'

The awk command accumulates the sum and the count as it processes each line, then divides the two in the END block. For this dataset it prints Mean: 1340.

Next, we calculate the median and mode.

// Finding the median

The median is the middle value of a sorted dataset. For an even number of values, it is the average of the two middle values. First sort the data, then find the middle:

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '{arr[NR]=$1; count=NR} END {if(count%2==1) print "Median:", arr[(count+1)/2]; else print "Median:", (arr[count/2]+arr[count/2+1])/2}'

This sorts the data numerically with sort -n, stores the values in order, then picks the middle value (or the average of the two middle values if the count is even).

// Finding the mode

The mode is the value that occurs most often. We find it by sorting, counting duplicates, and identifying which value appears most frequently:

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | uniq -c | sort -rn | head -n 1 | awk '{print "Mode:", $2, "(appears", $1, "times)"}'

This sorts the values, counts duplicates with uniq -c, sorts the counts in descending order, and picks the top result. Note that every visitors value in this sample occurs exactly once, so there is no meaningful mode here; the command simply reports one of the tied values.

Calculating measures of dispersion (spread)

// Getting the maximum value

To find the largest value in your data, compare each value against a running maximum:

awk -F',' 'NR>1 {if($2>max) max=$2} END {print "Maximum:", max}' traffic.csv

This skips the header with NR>1, compares each value to the current max, and updates it when a larger value is found. (This works because the values are positive; an uninitialized awk variable compares as zero, so for data that may be negative, initialize max from the first row as shown next.)

// Finding the minimum value

Similarly, to find the smallest value, initialize the minimum from the first data row and update it whenever a smaller value appears:

awk -F',' 'NR==2 {min=$2} NR>2 {if($2<min) min=$2} END {print "Minimum:", min}' traffic.csv

Run the above commands to find the maximum and minimum values.

// Getting both min and max

Instead of running two separate commands, we can get the minimum and maximum in one pass:

awk -F',' 'NR==2 {min=$2; max=$2} NR>2 {if($2<min) min=$2; if($2>max) max=$2} END {print "Min:", min, "Max:", max}' traffic.csv

This one-pass method starts both variables from the first row, then updates each one independently.

// Calculating the (population) standard deviation

The standard deviation measures how spread out the values are from the mean. To compute the population standard deviation, use this formula:

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Std Dev:", sqrt((sumsq/count)-(mean*mean))}' traffic.csv

This accumulates the sum and the sum of squares, then applies the formula sqrt((Σx² / n) − μ²), where μ is the mean.

// Calculating the sample standard deviation

When working with a sample rather than the full population, use Bessel's correction (dividing by n − 1) for an unbiased estimate of the variance:

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Sample Std Dev:", sqrt((sumsq-(sum*sum/count))/(count-1))}' traffic.csv

This computes sqrt((Σx² − (Σx)² / n) / (n − 1)), the sample standard deviation.

// Calculating the variance

The variance is the square of the standard deviation. It is another measure of dispersion that is useful for many statistical calculations:

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; var=(sumsq/count)-(mean*mean); print "Variance:", var}' traffic.csv

This is the same calculation as the population standard deviation, just without the final square root.
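The same Bessel-corrected idea from the previous section also yields the sample variance. A minimal sketch, with the visitors values inlined from traffic.csv so it runs standalone:

```shell
# Sample variance: corrected sum of squares divided by (n - 1)
# The visitors values from traffic.csv are inlined for a standalone run
sample_var=$(printf '%s\n' 1250 1180 1520 1430 980 1100 1680 1550 1420 1290 |
  awk '{sum+=$1; sumsq+=$1*$1; n++} END {printf "%.2f\n", (sumsq - sum*sum/n)/(n-1)}')
echo "Sample variance: $sample_var"
```

For this column it prints Sample variance: 47777.78, the square of the sample standard deviation.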

Calculating percentiles

// Calculating Quartiles

Quartiles divide the sorted data into four equal parts. They are particularly useful for understanding the distribution of the data:

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
  q1_pos = (count+1)/4
  q2_pos = (count+1)/2
  q3_pos = 3*(count+1)/4
  print "Q1 (25th percentile):", arr[int(q1_pos)]
  print "Q2 (Median):", (count%2==1) ? arr[int(q2_pos)] : (arr[count/2]+arr[count/2+1])/2
  print "Q3 (75th percentile):", arr[int(q3_pos)]
}'

This stores the sorted values in an array, calculates the quartile positions using the formula (n + 1)/4, and then reads the values at those positions. The output:

Q1 (25th percentile): 1100
Q2 (Median): 1355
Q3 (75th percentile): 1520

// Calculating any percentile

You can calculate any percentile by adjusting the rank calculation. A flexible approach uses linear interpolation:

PERCENTILE=90
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk -v p=$PERCENTILE '
{arr[NR]=$1; count=NR}
END {
  pos = (count+1) * p/100
  idx = int(pos)
  frac = pos - idx
  if(idx >= count) print p "th percentile:", arr[count]
  else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
}'

This calculates the position as (n + 1) × (percentile / 100), then linearly interpolates between the two neighboring values when the position falls between array indices.
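The pipeline above can also be wrapped in a small reusable function. The percentile helper below is a name introduced here for illustration, not part of the original script; it reads one numeric value per line on stdin:

```shell
# percentile P  ->  interpolated P-th percentile of the numbers on stdin
percentile() {
  sort -n | awk -v p="$1" '
    {arr[NR]=$1; count=NR}
    END {
      pos = (count+1) * p/100
      idx = int(pos)
      frac = pos - idx
      if (idx >= count)    print arr[count]
      else if (idx < 1)    print arr[1]
      else                 print arr[idx] + frac * (arr[idx+1] - arr[idx])
    }'
}

# Example with the visitors values inlined from traffic.csv
p90=$(printf '%s\n' 1250 1180 1520 1430 980 1100 1680 1550 1420 1290 | percentile 90)
echo "90th percentile: $p90"
```

With this dataset, the 90th percentile of visitors comes out to 1667.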

Working with multiple columns

Often, you'll want to calculate statistics across multiple columns at once. Here's how to average visitors, page views, and bounce rate at the same time:

awk -F',' '
NR>1 {
  v_sum += $2
  pv_sum += $3
  br_sum += $4
  count++
}
END {
  print "Average visitors:", v_sum/count
  print "Average page views:", pv_sum/count
  print "Average bounce rate:", br_sum/count
}' traffic.csv

This keeps a separate accumulator for each column and performs the same calculation on all three, giving the following result:

Average visitors: 1340
Average page views: 4850
Average bounce rate: 45.06
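The same one-pass pattern extends beyond averages. As a sketch (with the visitors and page_views values inlined so it runs standalone), here is a single awk pass that reports both column means plus the visitor minimum and maximum:

```shell
# One pass over two columns: both means plus visitor min/max
stats=$(printf '%s %s\n' \
  1250 4500  1180 4200  1520 5800  1430 5200  980 3400 \
  1100 3900  1680 6100  1550 5600  1420 5100  1290 4700 |
  awk 'NR==1 {min=$1; max=$1}
       {v+=$1; pv+=$2; n++
        if ($1 < min) min=$1
        if ($1 > max) max=$1}
       END {print v/n, pv/n, min, max}')
echo "$stats"  # visitors mean, page-view mean, visitor min, visitor max
```

This prints 1340 4850 980 1680, matching the averages above and the min/max results from the dispersion section.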

// Calculating correlation

Correlation measures the relationship between two variables. The Pearson correlation coefficient ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship):

awk -F', *' '
NR>1 {
  x[NR-1] = $2
  y[NR-1] = $3

  sum_x += $2
  sum_y += $3

  count++
}
END {
  if (count < 2) exit

  mean_x = sum_x / count
  mean_y = sum_y / count

  for (i = 1; i <= count; i++) {
    dx = x[i] - mean_x
    dy = y[i] - mean_y

    cov   += dx * dy
    var_x += dx * dx
    var_y += dy * dy
  }

  sd_x = sqrt(var_x / count)
  sd_y = sqrt(var_y / count)

  correlation = (cov / count) / (sd_x * sd_y)

  print "Correlation:", correlation
}' traffic.csv

This calculates the Pearson correlation by dividing the covariance by the product of the standard deviations.
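The same script can be pointed at any pair of columns. For instance, visitors ($2) against bounce_rate ($4) should give a strongly negative coefficient in this sample, since busier days bounce less. A standalone sketch with both columns inlined:

```shell
# Pearson correlation of two inlined columns: visitors vs. bounce_rate
corr=$(printf '%s %s\n' \
  1250 45.2  1180 47.1  1520 42.3  1430 43.8  980 51.2 \
  1100 48.5  1680 40.1  1550 41.9  1420 44.2  1290 46.3 |
  awk '{x[NR]=$1; y[NR]=$2; sx+=$1; sy+=$2; n++}
       END {
         mx = sx/n; my = sy/n
         for (i = 1; i <= n; i++) {
           dx = x[i] - mx; dy = y[i] - my
           cov += dx*dy; vx += dx*dx; vy += dy*dy
         }
         printf "%.4f\n", cov / sqrt(vx * vy)
       }')
echo "Correlation (visitors vs bounce rate): $corr"
```

The result, -0.9871, confirms that days with more visitors have noticeably lower bounce rates in this dataset.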

Wrapping up

The command line is a powerful tool for statistical analysis. You can process large volumes of data, perform involved calculations, and generate reports, all without installing anything beyond your base system.

These skills complement your Python and R knowledge rather than replacing it. Use command-line tools for quick exploration and data validation, and reach for specialized tools for complex modeling and visualization when needed.

The best part is that these tools are available on almost every system you will use in a data science career. Open your terminal and start exploring your data.

Priya C is a technical writer from India. She likes to work at the intersection of statistics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community through tutorials, how-to guides, opinion pieces, and more.
