Data Cleaning on the Command Line for Data Scientists


Getting Started
When you think about data cleaning, you might picture firing up a Jupyter Notebook. But here's what might surprise you: some of the most powerful data cleaning can happen right on your computer, using tools that are already installed on your system.
In this article, you'll learn how to use basic command-line utilities to clean, modify, and inspect data files. No installation is required; all you need is a terminal and some CSV files.
Before we start cleaning data from the command line, let's discuss why this is important:
- Command-line tools are easy to use, fast, and efficient, especially for large files.
- These tools are built into Linux/macOS and are also available on Windows (for example, via WSL).
- They are great for getting a quick look at the data before loading and analyzing it in Python.
- It is easy to chain commands together in scripts and use those scripts for automation.
Now, let's start coding!
Note: You can find all the commands in this bash script on GitHub. To get the most out of this article, I encourage you to open a terminal and code along.
Sample Dataset
Let's create a messy CSV file to work with. It simulates the real-world data quality issues you're likely to run into.
cat > messy_data.csv << 'EOF'
name,age,salary,department,email
John Lee,32,50000,Engineering,[email protected]
Jane Smith,28,55000,Marketing,[email protected]
Bob Davis ,35,60000,Engineering,[email protected]
Alice Williams,29,,Marketing,[email protected]
Charlie Brown,45,70000,Sales,[email protected]
Dave Wilson,31,52000,Engineering,
Emma Davis,,58000,Marketing,[email protected]
Frank Miller,38,65000,Sales,[email protected]
John Lee,32,50000,Engineering,[email protected]
Grace Lee,27,51000,Engineering,[email protected]
EOF
This data has several common problems: leading and trailing whitespace, missing values, and duplicate rows. Let's clean it up!
1. Inspecting Your Data with head, tail, and wc
Before cleaning your data, you need to understand what you are working with. Let's start with the basics.
# See the first 5 rows (including header)
head -n 5 messy_data.csv
# See the last 3 rows
tail -n 3 messy_data.csv
# Count total rows (including header)
wc -l messy_data.csv
Here's what happens:
- head -n 5 shows the first 5 lines, giving you a quick overview.
- tail -n 3 shows the last 3 lines (useful for checking that the data is complete).
- wc -l counts lines – subtract 1 for the header to get your record count.
Output:
name,age,salary,department,email
John Lee,32,50000,Engineering,[email protected]
Jane Smith,28,55000,Marketing,[email protected]
Bob Davis ,35,60000,Engineering,[email protected]
Alice Williams,29,,Marketing,[email protected]
Frank Miller,38,65000,Sales,[email protected]
John Lee,32,50000,Engineering,[email protected]
Grace Lee,27,51000,Engineering,[email protected]
11 messy_data.csv
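Since wc -l counts the header too, you can fold the "subtract 1" step into a one-liner. This sketch uses a tiny stand-in file so it runs on its own:

```shell
# Tiny stand-in for messy_data.csv, so this snippet runs on its own
printf 'name,age\nJohn,32\nJane,28\n' > sample.csv

# Reading from stdin makes wc print just the number (no filename),
# so we can subtract the header line with shell arithmetic
echo "$(( $(wc -l < sample.csv) - 1 )) records"
```

This prints "2 records" for the stand-in file: 3 lines, minus the header.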
2. Viewing Specific Columns with cut
You don't always need to see every column. Let's extract only the names and departments.
cut -d',' -f1,4 messy_data.csv
To break it down:
- cut is a tool for extracting sections from each line.
- -d',' sets the delimiter to a comma (for CSV files).
- -f1,4 selects fields (columns) 1 and 4.
- You can also use ranges: -f1-3 selects columns 1 through 3.
Here is the result:
name,department
John Lee,Engineering
Jane Smith,Marketing
Bob Davis ,Engineering
Alice Williams,Marketing
Charlie Brown,Sales
Dave Wilson,Engineering
Emma Davis,Marketing
Frank Miller,Sales
John Lee,Engineering
Grace Lee,Engineering
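The range syntax works the same way. Here's a quick sketch on a small stand-in file so it runs on its own:

```shell
# Stand-in data so this snippet runs on its own
printf 'name,age,salary,department\nJane,28,55000,Marketing\n' > sample.csv

# -f1-3 selects columns 1 through 3; open-ended ranges like -f2- also work
cut -d',' -f1-3 sample.csv
```

This prints only the name, age, and salary columns from each line.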
3. Removing Duplicate Lines with sort and uniq
Notice that “John Lee” appears twice in our dataset. Let's fix that.
# Save the header first
head -n 1 messy_data.csv > cleaned_data.csv
# Remove duplicates from the data (excluding header)
tail -n +2 messy_data.csv | sort | uniq >> cleaned_data.csv
Here's what each command does: head -n 1 grabs just the header line. tail -n +2 gets everything starting from line 2 (skipping the header). After that, sort orders the rows – this matters because uniq only removes adjacent duplicate lines. Finally, >> appends to the file (versus >, which overwrites it).
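If you'd like to see which rows were duplicated before deleting them, uniq -d prints only the repeated lines. A sketch on stand-in data so it runs on its own:

```shell
# Stand-in data with one duplicated row, so this runs on its own
printf 'name,age\nJohn,32\nJane,28\nJohn,32\n' > sample.csv

# Sort first so duplicates become adjacent; -d prints each repeated line once
tail -n +2 sample.csv | sort | uniq -d
```

This prints just the duplicated row, which is a useful sanity check before deduplicating for real.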
4. Searching and Filtering with grep
Now let's do some searching and filtering. Want to find all the engineers, or filter out rows with missing details? grep comes in handy for all such tasks.
# Find all engineers
grep "Engineering" messy_data.csv
# Find rows with empty fields (two consecutive commas)
grep ",," messy_data.csv
# Exclude rows with missing data
grep -v ",," messy_data.csv > no_missing.csv
Here, grep "pattern" searches for lines containing that pattern, and grep -v inverts the match (showing lines that don't match). This is a quick way to filter out incomplete records, as long as the missing values show up as consecutive commas (,,).
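One caveat worth knowing: a missing value in the last column shows up as a trailing comma rather than two commas in a row, so it slips past the ,, pattern. A sketch of an extended-regex pattern that catches both cases, on stand-in data so it runs on its own:

```shell
# Stand-in rows: empty middle field, empty last field, and a complete row
printf 'a,,c\na,b,\na,b,c\n' > sample.csv

# -E enables extended regex: match two commas in a row OR a comma at line end
grep -E ',,|,$' sample.csv      # rows with an empty field
grep -vE ',,|,$' sample.csv     # rows with no empty fields
```

The first command matches both incomplete rows; the second keeps only the complete one.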
5. Trimming Whitespace with sed
See how the “Bob Davis” record has extra spaces? Let's clean that up.
sed 's/^[ \t]*//; s/[ \t]*$//; s/[ \t]*,[ \t]*/,/g' messy_data.csv > trimmed_data.csv
Now let's understand the command: sed is the stream editor. s/pattern/replacement/ is its substitution syntax. ^[ \t]* matches spaces/tabs at the beginning of a line, and [ \t]*$ matches them at the end. The third substitution, [ \t]*,[ \t]*, trims whitespace around the commas – that's what fixes “Bob Davis ”, where the stray space sits before a comma rather than at the end of the line. Semicolons separate the substitutions.
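To confirm the cleanup worked, you can re-print the trimmed file. A standalone sketch with a row like the one above (note the \t escape relies on GNU sed, the default on Linux):

```shell
# Stand-in row with a stray space before the comma
printf 'name,department\nBob Davis ,Engineering\n' > sample.csv

# Trim whitespace at line edges and around commas (GNU sed treats \t as a tab)
sed 's/^[ \t]*//; s/[ \t]*$//; s/[ \t]*,[ \t]*/,/g' sample.csv > trimmed.csv
cat trimmed.csv
```

The trimmed file now reads "Bob Davis,Engineering" with the stray space gone.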
6. Replacing Values with sed
Sometimes you need to standardize values or correct typos. Let's replace all occurrences of “Engineering” with “Tech”.
# Replace all "Engineering" with "Tech"
sed 's/Engineering/Tech/g' messy_data.csv
Next, let's fill in the empty email fields (which show a comma at the end of the line) with the default email value.
# Replace empty email fields with "[email protected]"
sed 's/,$/,[email protected]/' messy_data.csv
Run the commands above and take a look at the results. I've omitted the output here to avoid repetition.
To break it down:
- The g flag means “global” – replace every occurrence on each line, not just the first.
- ,$ matches a comma at the end of a line (which indicates an empty last field).
- You can chain multiple substitutions with ; between them.
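Putting the two substitutions together in one pass, as the last point describes; a sketch on stand-in data so it runs on its own:

```shell
# Stand-in rows: one with an empty trailing email field, one complete
printf 'Dave,Engineering,\nEmma,Marketing,[email protected]\n' > sample.csv

# Two chained substitutions: rename the department, then backfill empty emails
sed 's/Engineering/Tech/g; s/,$/,[email protected]/' sample.csv
```

Both edits happen in a single pass over the file, in the order they're written.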
7. Counting and Summarizing with awk
awk is very useful for field-based operations. Let's do some basic aggregation.
# Count records by department
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c
# Calculate average age (excluding header and empty values)
tail -n +2 messy_data.csv | awk -F',' '{if($2) sum+=$2; if($2) count++} END {print "Average age:", sum/count}'
In the awk command, -F',' sets the field separator to a comma, and $2 refers to the second field (age). The condition if($2) ensures only non-empty values are processed, while sum += $2 accumulates the total. Finally, the END block runs after all the lines have been read, to calculate and print the average age.
Output:
5 Engineering
3 Marketing
2 Sales
Average age: 33
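awk can also aggregate per group using associative arrays – handy when a single overall average isn't enough. A sketch that averages salary by department, on stand-in name,salary,department data so it runs on its own:

```shell
# Stand-in data: name,salary,department (no header, to keep the sketch short)
printf '%s\n' 'Ann,50000,Eng' 'Bob,60000,Eng' 'Cat,55000,Sales' > sample.csv

# sum[] and n[] are arrays keyed by department ($3); END prints each average
awk -F',' '{ sum[$3] += $2; n[$3]++ }
           END { for (d in sum) print d, sum[d] / n[d] }' sample.csv
```

One thing to keep in mind: the iteration order of `for (d in sum)` is unspecified, so pipe through sort if you need a stable order.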
8. Combining Commands with Pipes
The real power comes when you chain command-line tools together.
# Get unique departments, sorted alphabetically
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq
# Find engineers with salary > 55000
tail -n +2 messy_data.csv | grep "Engineering" | awk -F',' '$3 > 55000' | cut -d',' -f1,3
# Count employees per department with counts
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c | sort -rn
Here, each | passes the output of one command as input to the next. This lets you build complex data transformations step by step. The last pipeline sorts the department counts in reverse numerical order (-rn).
This comes out:
Engineering
Marketing
Sales
Bob Davis ,60000
5 Engineering
3 Marketing
2 Sales
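Another useful pipeline ingredient is sort's column flags: -t sets the delimiter and -k picks the field to sort on. A standalone top-salaries sketch on stand-in data:

```shell
# Stand-in data: name,salary
printf '%s\n' 'name,salary' 'Ann,50000' 'Bob,70000' 'Cat,65000' > sample.csv

# -t',' splits on commas, -k2,2 sorts on field 2 only, -rn = reverse numeric
tail -n +2 sample.csv | sort -t',' -k2,2 -rn | head -n 2
```

This prints the two highest earners first. The ,2 in -k2,2 stops the sort key at field 2; plain -k2 would extend the key to the end of the line.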
9. Converting Data Formats
Sometimes you need to work with different formats. Here, let's use a tab as the separator instead of a comma.
# Convert CSV to TSV (tab-separated)
sed 's/,/\t/g' messy_data.csv > data.tsv
# Add a new column with a fixed value
awk -F',' 'BEGIN{OFS=","} {print $0, "2024"}' messy_data.csv > data_with_year.csv
In this awk command, BEGIN{OFS=","} sets the output field separator to a comma, $0 represents the entire input line, and print $0, "2024" appends “2024” as a new column to each line of the output.
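As an aside, support for \t in sed varies between implementations, so tr is a more portable way to swap single-character delimiters. A sketch (note that tr, like the sed version, would also convert commas inside quoted fields, so this only suits simple CSVs):

```shell
# Stand-in CSV
printf 'name,age\nJane,28\n' > sample.csv

# tr translates every comma into a tab character
tr ',' '\t' < sample.csv > sample.tsv
cat sample.tsv
```

The same one-liner with the arguments swapped, tr '\t' ',', converts back from TSV to CSV.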
10. The Complete Cleaning Pipeline
Let's put it all into one useful command that cleans up our dirty data:
# Save header
head -n 1 messy_data.csv > final_clean.csv
# Clean the data: remove duplicates, trim whitespace, exclude missing values
tail -n +2 messy_data.csv |
sed 's/^[ \t]*//; s/[ \t]*$//; s/[ \t]*,[ \t]*/,/g' |
grep -vE ',,|,$' |
sort |
uniq >> final_clean.csv
echo "Cleaning complete! Check final_clean.csv"
This pipeline first saves the header to preserve the column names, then skips it while processing the data rows. It strips stray whitespace from each row, removes rows that contain empty fields, sorts the data, and eliminates duplicate entries. Finally, it writes the cleaned data to the output file.
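To reuse the pipeline on other files, you can wrap it in a small script. A minimal sketch – the script name clean_csv.sh and its argument handling are my own additions for illustration:

```shell
# Write a hypothetical wrapper script around the cleaning pipeline
cat > clean_csv.sh << 'SCRIPT'
#!/bin/sh
# Usage: sh clean_csv.sh input.csv output.csv
in="$1"; out="$2"
head -n 1 "$in" > "$out"              # keep the header
tail -n +2 "$in" |
  sed 's/^[ \t]*//; s/[ \t]*$//' |    # trim leading/trailing whitespace
  grep -vE ',,|,$' |                  # drop rows with empty fields
  sort | uniq >> "$out"               # sort and deduplicate
SCRIPT

# Try it on a tiny stand-in file with an empty field and a duplicate row
printf 'name,age\nJohn,32\nJane,\nJohn,32\n' > sample.csv
sh clean_csv.sh sample.csv clean.csv
cat clean.csv
```

The cleaned file keeps the header plus one copy of the complete row; the row with the missing age is dropped.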
Wrapping Up
Command-line data cleaning is a powerful but often overlooked skill for a data scientist. These tools are fast and reliable. While you'll still reach for Python for more complex analysis, mastering these basics will make you more efficient and give you options where Python isn't available.
The best part is that these skills are transferable to data engineering, DevOps, and system administration roles. Learning to manipulate data on the command line makes you a more versatile developer.
Start practicing on your own datasets, and you'll be surprised how often you reach for these tools instead of spinning up Python notebooks. Happy data cleaning!
Bala Priya C is a developer and technical writer from India. She likes to work at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. She also creates engaging resource overviews and coding tutorials.



