Inheritance: A Software Engineering Concept Data Scientists Should Know to Succeed

Who should read this article?
If you're planning to move into data science, whether through a degree or a career change, or you're a manager in charge of instilling best practices in your team, this article is for you.
Data science attracts people from many different backgrounds. In my professional experience, I have worked with colleagues who were former:
- Nuclear physicists
- Post-docs in gravitational wave research
- PhDs in computational biology
to name just a few.
It's amazing how the field can accommodate such a variety of skill sets, and I've seen how such different minds lead to creative and effective data science work.
However, I've also seen the downside of this variety:
Everyone has a different level of exposure to key software engineering practices, which leads to very uneven coding skills.
Because of this, I have seen work done by some brilliant data scientists that is:
- Unreadable – you can't make sense of what it's trying to do.
- Flaky – it breaks the minute someone else tries to run it.
- Unmaintainable – the code quickly becomes outdated or breaks easily.
- Unextendable – code written for one use case whose behaviour cannot be extended.
This ultimately limits the impact their work can have and creates all sorts of problems down the line.
So, in this series of articles, I plan to explain some essential software engineering concepts, distilled for the needs of data scientists.
They are simple ideas, but knowing them or not clearly draws the line between amateur and professional.
Today's concept: inheritance
Inheritance is fundamental to writing clean code; using it will improve your efficiency and productivity. It can even be used as a gauge of how readable and maintainable a team's code is.
Looking back on how hard these concepts were to grasp when I started writing code, I'm not going to begin with an abstract, high-level description of inheritance. There's plenty of that on the internet if you want to Google it.
Instead, let's look at a real example from a data science project.
It will illustrate the kind of practical problem you can run into as a data scientist, show where inheritance comes in, and demonstrate how it helps a data scientist write better code.
By better, we mean:
- Code that is easier to read.
- Code that is easier to maintain.
- Code that is easier to reuse.
Example: ingesting data from multiple different sources

A big part of a data scientist's job is figuring out where to find data, how to read it, how to clean it, and how to save it.
Suppose you are given labels in CSV files sourced from five different vendors, each with its own unique schema.
Your task is to clean each of them and save the output as a parquet file; for these files to be consumed by downstream processes, they must conform to the following schema:
- label_id: int
- label_value: int
- label_timestamp: string timestamp in ISO format
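To make this concrete, here is a sketch of what a few conforming rows might look like – the values are made up purely for illustration:

import polars as pl

# hypothetical example rows that conform to the required schema above
conforming_output = pl.DataFrame({
    "label_id": [101, 102, 103],
    "label_value": [1, 0, 1],
    "label_timestamp": ["2024-01-01T00:00:00", "2024-01-02T12:30:00", "2024-01-03T08:15:00"],
})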
The quick and dirty way
Faced with this, the quick and dirty way would be to write a separate script for each file:
# clean_source1.py
import polars as pl

if __name__ == '__main__':
    df = pl.scan_csv('source1.csv')

    # aggregate the label value over each group
    overall_label_value = df.group_by('some-metadata1').agg(
        overall_label_value=pl.col('some-metadata2').any()
    )

    df = df.join(overall_label_value, on='some-metadata1')
    df = df.drop(['some-metadata1', 'some-metadata2', 'some-metadata3'])
    df = df.select(
        pl.col('primary_key').alias('label_id'),
        pl.col('overall_label_value').alias('label_value').replace([True, False], [1, 0]),
        pl.col('some-metadata6').alias('label_timestamp'),
    )

    df.sink_parquet('output/source1.parquet')
And each of the other scripts would look much the same.
So what's wrong with this? It gets the job done, doesn't it?
Let's go back to our definition of better code and examine why this version falls short:
1. It is hard to read
There is no organization or structure to the code.
Loading, cleaning, and saving all happen in the same place, so it's hard to see where one step ends and the next begins.
Remember, this is a contrived, deliberately simple example. In the real world, the code you'd write would be much longer and more complicated.
When code is hard to read, and there are five different versions of it, this leads to longer-term problems:
2. It is hard to maintain
The lack of structure makes it hard to add new features or fix bugs. If some shared logic had to be modified, all of the scripts would need editing.
If there's a common operation that needs changing, someone has to go in and modify all five scripts separately.
Each time, they need to figure out which lines and which variables are relevant, because there is no separation between:
- where the data is loaded,
- where the data is cleaned and transformed,
- which variables are depended on downstream.
It's hard to know whether the changes you make will have unintended impacts downstream, or whether you're breaking something upstream.
As a result, bugs become very easy to introduce.
3. It is hard to reuse
This code is the definition of a one-off.
It's hard to read, so you don't know what it does without investing a lot of time making sure you understand every line of the code.
If someone wanted to reuse its logic, the only way would be to copy-paste the entire script and then modify it, or to rewrite it themselves from scratch.
There are better, more efficient ways of writing this.
The better, scalable way
Now, let's look at how we can improve on this situation by using inheritance.

1. Identify the commonalities
In our example, the data sources are all different, but we know that processing each file will require:
- one or more cleaning operations,
- a saving operation, which is the same for every file since all outputs are written in the same format.
We also know that every file needs to conform to the same output schema, so we'll need some output data validation.
These commonalities tell us what functionality we can write once and reuse.
2. Create a base class
Now comes the inheritance part.
We write a base class, or parent class, which implements all of this common functionality. This class will act as the template from which other classes will "inherit". Classes that inherit from this base class (called child classes) will have the same functionality as the parent class, but will also be able to add new functionality, or modify the ones that are already available.
import polars as pl

class BaseCSVLabelProcessor:
    REQUIRED_OUTPUT_SCHEMA = {
        "label_id": pl.Int64,
        "label_value": pl.Int64,
        "label_timestamp": pl.Datetime
    }

    def __init__(self, input_file_path, output_file_path):
        self.input_file_path = input_file_path
        self.output_file_path = output_file_path

    def load(self):
        """Load the data from the file."""
        return pl.scan_csv(self.input_file_path)

    def clean(self, data: pl.LazyFrame):
        """Clean the input data."""
        ...

    def save(self, data: pl.LazyFrame):
        """Save the data to a parquet file."""
        data.sink_parquet(self.output_file_path)

    def validate_schema(self, data: pl.LazyFrame):
        """Check that the data conforms to the expected schema."""
        for colname, expected_dtype in self.REQUIRED_OUTPUT_SCHEMA.items():
            actual_dtype = data.schema.get(colname)
            if actual_dtype is None:
                raise ValueError(f"Column {colname} not found in data")
            if actual_dtype != expected_dtype:
                raise ValueError(
                    f"Column {colname} has incorrect type. Expected {expected_dtype}, got {actual_dtype}"
                )

    def run(self):
        """Run data processing on the specified file."""
        data = self.load()
        data = self.clean(data)
        self.validate_schema(data)
        self.save(data)
3. Define the child classes
Now we define the child classes:
class Source1LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 1
        ...

class Source2LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 2
        ...

class Source3LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 3
        ...
Since all of the common logic is already implemented in the parent class, all each child class needs to worry about is the bespoke cleaning logic specific to its own file.
So the code we wrote in the bad example can now be transformed into:
import polars as pl
from base import BaseCSVLabelProcessor  # module name assumed; import from wherever the base class is defined

class Source1LabelProcessor(BaseCSVLabelProcessor):
    def get_overall_label_value(self, data: pl.LazyFrame):
        """Get the overall label value for each group."""
        return data.group_by('some-metadata1').agg(
            overall_label_value=pl.col('some-metadata2').any()
        )

    def conform_to_output_schema(self, data: pl.LazyFrame):
        """Drop unnecessary columns and conform the required columns to the output schema."""
        data = data.drop(['some-metadata1', 'some-metadata2', 'some-metadata3'])
        data = data.select(
            pl.col('primary_key').alias('label_id'),
            pl.col('overall_label_value').alias('label_value').replace([True, False], [1, 0]),
            pl.col('some-metadata6').alias('label_timestamp'),
        )
        return data

    def clean(self, data: pl.LazyFrame) -> pl.LazyFrame:
        """Clean label data from Source 1.

        The following steps are necessary to clean the data:
        1. Computing the overall label value for each group.
        2. Joining the aggregated values back onto the main table.
        3. Renaming columns and casting data types to conform to the expected output schema.
        """
        overall_label_value = self.get_overall_label_value(data)
        df = data.join(overall_label_value, on='some-metadata1')
        df = self.conform_to_output_schema(df)
        return df
And to use our code, we can do it all from main():
# label_preparation_pipeline.py
from processors import Source1LabelProcessor, Source2LabelProcessor, Source3LabelProcessor  # module name assumed

INPUT_FILEPATHS = {
    'source1': '/path/to/file1.csv',
    'source2': '/path/to/file2.csv',
    'source3': '/path/to/file3.csv',
}
OUTPUT_FILEPATH = '/path/to/output.parquet'

def main():
    """Label processing pipeline.

    The label processing pipeline ingests data sources 1, 2, 3, which come from
    external vendors.

    The output is written to a parquet file, ready for ingestion by downstream processes.

    The code assumes the following:
    - each input file is a CSV that follows its vendor's schema.

    The user needs to specify the following inputs:
    - the input file paths in INPUT_FILEPATHS and the output path in OUTPUT_FILEPATH.
    """
    processors = [
        Source1LabelProcessor(INPUT_FILEPATHS['source1'], OUTPUT_FILEPATH),
        Source2LabelProcessor(INPUT_FILEPATHS['source2'], OUTPUT_FILEPATH),
        Source3LabelProcessor(INPUT_FILEPATHS['source3'], OUTPUT_FILEPATH),
    ]
    for processor in processors:
        processor.run()

if __name__ == '__main__':
    main()
Why is this better?
1. You shouldn't need to look under the hood to know how to drive a car.
Any colleague who needs to run this code only needs to run the main() function. You will have written enough documentation in the appropriate functions to explain what they do and how to use them.
But they don't need to know how any of the underlying code works.
They should be able to trust your work and just run it. Only when they need to fix a bug or extend the functionality will they be required to dig deeper.
This is called abstraction – the idea of hiding implementation details away from the user. It is another fundamental concept in writing good code.

In short, it should be enough for a reader to rely on the documentation alone to understand what the code does and how to use it.
How often have you dug into the scikit-learn source code to figure out how to use their models? You never have to. scikit-learn is a prime example of good code design through abstraction.
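For example, this is typically all it takes to use one of their models – a minimal sketch with made-up toy data; everything beyond fit() and predict() stays hidden from you:

import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up toy data, purely for illustration
X = np.array([[0.1, 1.2], [2.3, 0.4], [1.1, 0.9], [3.0, 2.5]])
y = np.array([0, 1, 0, 1])

model = LogisticRegression()
model.fit(X, y)                 # all the optimisation details are abstracted away
predictions = model.predict(X)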
I've already written an article dedicated to this concept here, so if you want to know more, check it out.
2. Better maintainability and extensibility
What if the requirements for the label outputs were to change? For example, suppose the downstream processes that consume the labels now require them to be stored in a SQL table.
Well, it's very easy to accommodate this – we just need to change the save method in the BaseCSVLabelProcessor class, and all the child classes will pick up the change automatically.
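As a sketch of how that might look – the table name and connection string below are hypothetical placeholders, and polars' write_database needs a SQL driver such as sqlalchemy installed:

import polars as pl

class BaseCSVLabelProcessor:
    ...  # everything else stays exactly the same

    def save(self, data: pl.LazyFrame):
        """Save the data to a SQL table instead of a parquet file."""
        # 'labels' and the sqlite URI are made-up placeholders for illustration
        data.collect().write_database(
            table_name="labels",
            connection="sqlite:///labels.db",
        )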
What if a discrepancy is found between the labels and what a particular downstream process expects? Maybe a new column is required?
Well, you would need to modify the relevant clean methods to account for this. But you would also extend the checks in the validate_schema method of the BaseCSVLabelProcessor class to account for this new requirement.
You could take this one step further and add extra checks to make sure the outputs are as expected – you might even want to define a separate validation module for this purpose, and plug it into the validate_schema method.
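For instance, here is a minimal sketch of what that could look like – the null check below is a made-up example of one such extra validation:

import polars as pl

# a hypothetical standalone validation helper, kept in its own module
def check_no_nulls(data: pl.LazyFrame, columns: list[str]):
    """Raise if any of the given columns contain null values."""
    null_counts = data.select(pl.col(columns).null_count()).collect()
    for colname in columns:
        if null_counts[colname][0] > 0:
            raise ValueError(f"Column {colname} contains null values")

class BaseCSVLabelProcessor:
    ...  # everything else stays exactly the same

    def validate_schema(self, data: pl.LazyFrame):
        """Check the schema, then run any extra validation checks."""
        # ...the existing schema checks from before...
        check_no_nulls(data, list(self.REQUIRED_OUTPUT_SCHEMA))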
You can see that extending our label processing code like this is very easy.
In comparison, if the code lived in separate bespoke scripts, you would have to copy and paste these checks multiple times. Worse, perhaps each file requires something slightly bespoke. That means the same problem has to be solved five times, when it could be solved properly just once.
That's duplicated effort, inefficiency, and a waste of time.
Final words
So, in this article, we've covered how the use of inheritance can vastly improve the quality of our code.
By using inheritance properly, we were able to factor the common logic out of a set of disparate tasks, and we saw first-hand how this leads to:
- Code that is easier to read – readability
- Code that is easier to fix and maintain – maintainability
- Code that is easier to build on and extend – extensibility
However, some readers will still doubt the need to write code like this.
Perhaps they've been writing one-off scripts for their entire career, and it's all been fine so far. Why bother writing code in a more sophisticated way?

Well, that's a fair question – and there is a clear reason why it's necessary.
Until recently, data science was a new, niche industry where proof-of-concepts and research were the focus of the work. Coding standards didn't matter much, as long as we got something out of the door and it worked.
But data science is fast approaching maturity, where it is no longer enough just to build models.
Now we need to maintain, fix, debug, and retrain not only the models, but also the entire pipelines required to create them – for as long as they remain in use.
This is the reality that data science has to face up to – building models is the easy part, while maintaining what we have built is the hard part.
Meanwhile, software engineering has been doing exactly this for decades, and has figured out through trial and error all of the best practices we discussed today to keep code easy to maintain.
Therefore, data scientists increasingly need to know these best practices too.
Those who do will be at a distinct advantage compared to those who don't.