How to Distinguish Lung Cancer Subtypes from DNA Copy Number Using PyTorch

I will show how to build a convolutional neural network that differentiates between cancer types using a simple PyTorch classifier. The data and code used for training are publicly available, and training can be done on a personal computer, possibly even on a CPU.
Cancer is an unfortunate consequence of our cells accumulating information errors over the course of our lives, leading to uncontrolled growth. As researchers, we study the patterns of these errors to better understand the disease. Seen as data, the human genome is a string of around three billion letters drawn from the alphabet A, C, G, T (i.e. 2 bits of information per letter). A copying error or an external event can remove, insert, or change a letter, causing a mutation and potentially disrupting genomic function.
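To make the 2-bits-per-letter arithmetic and the three kinds of single-letter errors concrete, here is a minimal sketch. The toy sequence and the helper names are made up for illustration only.

```python
import math

ALPHABET = "ACGT"
GENOME_LENGTH = 3_000_000_000  # roughly three billion letters

# Four possible letters need log2(4) = 2 bits each.
bits_per_letter = math.log2(len(ALPHABET))
genome_megabytes = GENOME_LENGTH * bits_per_letter / 8 / 1e6


def substitute(seq, pos, letter):
    """Change the letter at position pos."""
    return seq[:pos] + letter + seq[pos + 1:]


def insert(seq, pos, letter):
    """Insert a letter before position pos."""
    return seq[:pos] + letter + seq[pos:]


def delete(seq, pos):
    """Remove the letter at position pos."""
    return seq[:pos] + seq[pos + 1:]


toy = "ACGTAC"  # hypothetical toy sequence
print(bits_per_letter, genome_megabytes)
print(substitute(toy, 2, "T"), insert(toy, 2, "T"), delete(toy, 2))
```

Under these assumptions, the uncompressed information content of one genome copy is about 750 MB.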
However, individual errors are unlikely to lead to the development of cancer. The human body has many mechanisms that prevent cancer from developing, including dedicated proteins, the so-called tumor suppressors. A list of necessary conditions, the so-called "hallmarks of cancer", must be met for a cell to be able to sustain its growth.
Therefore, changes to individual letters of DNA are often not enough to cause sustained growth. Most mutation-driven cancers (as opposed to cancers with other causes, for example the HPV virus) also show copy number (CN) changes. These are large-scale events, often adding or removing millions of DNA bases at a time.

These major changes in the structure of the genome lead to the loss of genes that would prevent cancer from forming, while accumulating genes that promote cell growth. By sequencing the DNA of these cells, we can identify these changes, which tend to occur in regions specific to the type of cancer. Copy number values for each allele can be derived from sequencing data using copy number calling tools.
Processing copy number profiles
One of the advantages of working with copy number (CN) profiles is that they are not biometric and can therefore be published without access restrictions. This allows us to accumulate data over time from many studies and build datasets of sufficient size. However, data from different studies are not always directly comparable, because they may have been produced using different technologies, have different resolutions, or have been preprocessed differently.
To obtain the data, process them jointly, and visualize them, we will be using the CNSistent tool, developed at a cancer research center in Cologne, Germany.
First we clone the repository together with the data and check out the version used in this article:
git clone [email protected]:schwarzlab/cnsistent.git
cd cnsistent
git checkout v0.9.0
Since the data we will be using is stored inside the repository (~1 GB of data), the download takes a few minutes. For cloning, both git and git lfs must be installed on the system.
Inside the repository is a requirements.txt file that lists all the dependencies, which can be installed using pip install -r requirements.txt.
(Creating a virtual environment first is recommended.) Once the requirements are installed, CNSistent can be installed by running pip install -e . in the same folder. The -e flag installs the package from its source directory, which is required to access the data through the API.
The repository contains raw data from three datasets: TCGA, PCAWG, and TRACERx. These need to be pre-processed first, which can be done by running the script bash ./scripts/data_process.sh.
Now that the datasets are processed, we can load them using the CNSistent data utility library:
import cns.data_utils as cdu
samples_df, cns_df = cdu.main_load("imp")
print(cns_df.head())
This produces the following result:
| | sample_id | chrom | start | end | major_cn | minor_cn |
|---:|:------------|:--------|---------:|---------:|-----------:|-----------:|
| 0 | SP101724 | chr1 | 0 | 27256755 | 2 | 2 |
| 1 | SP101724 | chr1 | 27256755 | 28028200 | 3 | 2 |
| 2 | SP101724 | chr1 | 28028200 | 32976095 | 2 | 2 |
| 3 | SP101724 | chr1 | 32976095 | 33354394 | 5 | 2 |
| 4 | SP101724 | chr1 | 33354394 | 33554783 | 3 | 2 |
This table shows the copy number data with the following columns:
- sample_id: the identifier of the sample,
- chrom: the chromosome,
- start: the start position of the segment (0-indexed, inclusive),
- end: the end position of the segment (0-indexed, exclusive),
- major_cn: the number of copies of the major allele (the larger of the two),
- minor_cn: the number of copies of the minor allele (the smaller of the two).
In the first row we can see a segment stating that sample SP101724 has 2 copies of the major allele and 2 copies of the minor allele (4 in total) in the region of chromosome 1 from 0 to 27.26 megabases.
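Because the intervals are half-open, the length of a segment is simply end minus start, and the total copy number is the sum of both alleles. A minimal sketch using the five rows from the table above (plain Python, not the CNSistent API):

```python
# Rows copied from the table above:
# (sample_id, chrom, start, end, major_cn, minor_cn)
rows = [
    ("SP101724", "chr1", 0, 27256755, 2, 2),
    ("SP101724", "chr1", 27256755, 28028200, 3, 2),
    ("SP101724", "chr1", 28028200, 32976095, 2, 2),
    ("SP101724", "chr1", 32976095, 33354394, 5, 2),
    ("SP101724", "chr1", 33354394, 33554783, 3, 2),
]

lengths_bp = []
total_cns = []
for sample_id, chrom, start, end, major_cn, minor_cn in rows:
    length = end - start          # half-open interval [start, end)
    total = major_cn + minor_cn   # copies summed over both alleles
    lengths_bp.append(length)
    total_cns.append(total)
    print(f"{chrom}:{start}-{end}  length={length / 1e6:.2f} Mb  total_cn={total}")
```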
The second DataFrame we loaded, samples_df, contains the metadata of the samples. For our purposes only the type is important. We can inspect the available types by running:
import matplotlib.pyplot as plt
type_counts = samples_df["type"].value_counts()
plt.figure(figsize=(10, 6))
type_counts.plot(kind='bar')
plt.ylabel('Count')
plt.xticks(rotation=90)

In the table shown above, we can see a potential problem with the data: the lengths of the individual segments are not the same. The first segment is 27.26 megabases long, while the second is only 0.77 megabases long. This is a problem for neural networks, which expect input of a fixed size.
We could technically take all existing breakpoints and create segments between every pair of neighboring breakpoints in the dataset, a so-called minimum consistent segmentation. This would however result in a huge number of segments: a quick check using len(cns_df["end"].unique()) shows that there are 823652 unique breakpoints.
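The idea of segmenting on the union of all breakpoints can be sketched in a few lines. The three toy samples below are hypothetical, and this is only an illustration of the principle, not how CNSistent implements it.

```python
# Hypothetical toy segments per sample on a single chromosome: (start, end)
samples = {
    "A": [(0, 40), (40, 100)],
    "B": [(0, 25), (25, 70), (70, 100)],
    "C": [(0, 40), (40, 55), (55, 100)],
}

# Collect every segment boundary that appears in any sample.
breakpoints = sorted(
    {pos for segs in samples.values() for seg in segs for pos in seg}
)

# Neighboring breakpoints delimit segments that never straddle a
# copy number change in any of the samples.
segments = list(zip(breakpoints[:-1], breakpoints[1:]))
print(breakpoints)
print(segments)
```

Even in this toy case, three samples with two or three segments each produce five joint segments, which hints at how the count grows across thousands of samples.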
Alternatively, we can use CNSistent to create a new segmentation using a binning algorithm. This creates segments of fixed size, which can be used as input to the neural network. In our work we have found 1-3 megabase segments to provide the best trade-off between accuracy and overfitting. We first create the segmentation and then apply it to obtain new CNS files using the following Bash script:
threads=8
cns segment whole --out "./out/segs_3MB.bed" --split 3000000 --remove gaps --filter 300000
for dataset in TRACERx PCAWG TCGA_hg19;
do
    cns aggregate ./out/${dataset}_cns_imp.tsv --segments ./out/segs_3MB.bed --out ./out/${dataset}_bin_3MB.tsv --samples ./out/${dataset}_samples.tsv --threads $threads
done
The loop processes each dataset separately, while maintaining the same segmentation. The --threads flag speeds up the process by running the aggregation in parallel; adjust the value according to the number of available cores.
The --remove gaps --filter 300000 arguments remove regions of low mappability (a.k.a. gaps) and filter out segments shorter than 300 kb. The --split 3000000 argument creates 3 MB segments.
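To illustrate what aggregating variable-length segments into fixed-size bins can look like, here is a minimal sketch that computes a length-weighted mean copy number per bin. Both helper functions are hypothetical, and the weighted mean is an assumption made for illustration; the actual behavior of cns segment and cns aggregate may differ.

```python
def make_bins(chrom_length, bin_size):
    """Split [0, chrom_length) into consecutive bins of at most bin_size."""
    return [
        (start, min(start + bin_size, chrom_length))
        for start in range(0, chrom_length, bin_size)
    ]


def aggregate(segments, bins):
    """Length-weighted mean copy number per bin.

    segments: non-overlapping half-open (start, end, cn) tuples.
    """
    result = []
    for b_start, b_end in bins:
        weighted, covered = 0.0, 0
        for s_start, s_end, cn in segments:
            overlap = min(b_end, s_end) - max(b_start, s_start)
            if overlap > 0:
                weighted += overlap * cn
                covered += overlap
        # Bins without any data are marked as missing.
        result.append(weighted / covered if covered else float("nan"))
    return result


segments = [(0, 4, 2), (4, 6, 3), (6, 12, 5)]  # toy coordinates
bins = make_bins(12, 3)
binned = aggregate(segments, bins)
print(bins)
print(binned)
```

Every sample ends up with one value per bin, so all samples share the same input length regardless of where their original breakpoints were.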
Non-small-cell Lung Carcinoma
In this article we will focus on the classification of non-small-cell lung carcinoma, which accounts for about 85% of all lung cancers, and specifically on the distinction between adenocarcinoma and squamous-cell carcinoma. It is important to distinguish between the two because their treatment regimens differ, and new methods offer hope for non-invasive detection from blood samples or nasal swabs.
We will use the bins produced above and load them using the provided utility function. Since we are distinguishing between two types of cancer, we can filter the samples to include only the relevant types, LUAD (adenocarcinoma) and LUSC (squamous-cell carcinoma), and plot the first few samples:
import cns
samples_df, cns_df = cdu.main_load("3MB")
samples_df = samples_df.query("type in ['LUAD', 'LUSC']")
cns_df = cns.select_CNS_samples(cns_df, samples_df)
cns_df = cns.only_aut(cns_df)
cns.fig_lines(cns.cns_head(cns_df, n=3))
Major and minor copy numbers in 3 MB bins for the first three samples. In this case all three samples come from multi-region sequencing of the same patient, illustrating how heterogeneous cancer cells can be within a single tumor.
Convolutional Neural Network Model
Running the code requires Python 3 with PyTorch 2+ installed and a Bash-compatible shell. An NVIDIA GPU is recommended for faster training, but not required.
First we define a neural network with three convolutional layers:
import torch.nn as nn
class CNSConvNet(nn.Module):
    def __init__(self, num_classes):
        super(CNSConvNet, self).__init__()
        # Three convolution blocks: Conv1d -> ReLU -> MaxPool1d
        self.conv_layers = nn.Sequential(
            nn.Conv1d(in_channels=2, out_channels=16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2)
        )
        # Fully connected classifier head
        self.fc_layers = nn.Sequential(
            nn.LazyLinear(128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.conv_layers(x)
        # Flatten (batch, channels, length) to (batch, channels * length)
        x = x.view(x.size(0), -1)
        x = self.fc_layers(x)
        return x
This is a deep CNN with 2 input channels, one for each allele, and 3 convolutional layers using a 1D kernel of size 3 and a ReLU activation function. Each convolutional layer is followed by a max pooling layer with a kernel size of 2. Convolution is traditionally used for edge detection, which is useful for us because we are interested in changes in the copy number, i.e. the edges between segments.
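As an illustration of why convolution suits this task, the sketch below slides a hand-picked difference kernel over a hypothetical copy number profile. The kernels in the network are learned rather than hand-picked, so this only shows the kind of pattern a size-3 kernel is able to pick up.

```python
def conv1d(signal, kernel):
    """Sliding dot product without padding or bias (the core of a Conv1d layer)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


profile = [2, 2, 2, 3, 3, 5, 5, 5, 2, 2]  # hypothetical total CN per bin
edge_kernel = [-1, 0, 1]                  # hand-picked difference filter

response = conv1d(profile, edge_kernel)
print(response)
```

The response is zero where the profile is flat, positive around gains, and negative around losses, with a magnitude that matches the size of the copy number step.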
The output of the convolutional layers is then flattened and passed through two fully connected layers with dropout. The LazyLinear layer connects the output of the 64 stacked channels to a single layer of 128 nodes, without our needing to calculate how many nodes there are at the end of the convolutions. This is where most of our parameters are, so we apply dropout to prevent overfitting.
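The size that LazyLinear infers can also be worked out by hand. The sketch below does the arithmetic for the layer settings used in the model, assuming a hypothetical input of 960 bins per sample; the real number depends on the segmentation.

```python
def conv_out_len(length, kernel_size=3, padding=1, stride=1):
    """Output length of a 1D convolution."""
    return (length + 2 * padding - kernel_size) // stride + 1


def pool_out_len(length, kernel_size=2):
    """Output length of max pooling with stride equal to the kernel size."""
    return length // kernel_size


n_bins = 960                 # hypothetical number of bins per sample
channels = [2, 16, 32, 64]   # input channels followed by the three conv outputs
hidden, num_classes = 128, 2

length = n_bins
conv_params = 0
for c_in, c_out in zip(channels[:-1], channels[1:]):
    length = pool_out_len(conv_out_len(length))
    conv_params += c_in * 3 * c_out + c_out  # weights + biases

flat_features = channels[-1] * length
fc_params = (flat_features * hidden + hidden) + (hidden * num_classes + num_classes)

print(length, flat_features)
print(conv_params, fc_params)
```

With these assumed numbers, the fully connected part holds roughly a hundred times more parameters than the convolutional part, which is why dropout is applied there.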
Training the model
First we have to convert the DataFrames to Torch tensors. We use the utility function bins_to_features for this and then split the data into training and test sets in a 4:1 ratio:
import torch
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# convert data to features and labels
features, samples_list, columns_df = cns.bins_to_features(cns_df)
# convert data to Torch tensors
X = torch.FloatTensor(features)
label_encoder = LabelEncoder()
y = torch.LongTensor(label_encoder.fit_transform(samples_df.loc[samples_list]["type"]))
# Test/train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Create dataloaders
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=32, shuffle=False)
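What the label encoding and the 4:1 split do can be sketched in plain Python. The label list is hypothetical, and the shuffling here is only a stand-in for train_test_split, so the selected indices will not match scikit-learn's.

```python
import math
import random

# Hypothetical cancer type labels for ten samples
labels = ["LUSC", "LUAD", "LUAD", "LUSC", "LUAD",
          "LUSC", "LUAD", "LUAD", "LUSC", "LUAD"]

# LabelEncoder assigns integer codes in sorted class order.
classes = sorted(set(labels))
encoded = [classes.index(label) for label in labels]

# Shuffle the sample indices and hold out 20% for testing.
indices = list(range(len(labels)))
random.Random(0).shuffle(indices)
n_test = math.ceil(0.2 * len(indices))
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(classes, encoded)
print(len(train_idx), len(test_idx))
```

With only two classes, LUAD maps to 0 and LUSC to 1, which is also the order of label_encoder.classes_ used later for the axis labels.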
Now we can train the model using the following training loop with 20 epochs. The Adam optimizer and cross-entropy loss are commonly used for classification tasks, so we use them here as well:
# setup the model, loss, and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CNSConvNet(num_classes=len(label_encoder.classes_)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Training loop
num_epochs = 20
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        # Clear gradients
        optimizer.zero_grad()
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        # Backward pass and optimize
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # Print statistics
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {running_loss/len(train_loader):.4f}')
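For intuition about the loss values printed during training, here is a worked example of cross-entropy for a single sample with two classes. nn.CrossEntropyLoss takes the raw network outputs (logits) and by default averages this quantity over the batch; the logit values below are made up.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target])


confident_right = cross_entropy([2.0, -1.0], target=0)
unsure = cross_entropy([0.0, 0.0], target=0)
confident_wrong = cross_entropy([2.0, -1.0], target=1)

print(f"{confident_right:.4f} {unsure:.4f} {confident_wrong:.4f}")
```

A model that guesses uniformly between two classes has a loss of ln 2, about 0.69, so a training loss well below that value indicates the network has learned something.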
This concludes the training. After that, we can test the model and print the confusion matrix:
import numpy as np
from sklearn.metrics import confusion_matrix
import seaborn as sns
# Loop over batches in the test set and collect predictions
model.eval()
y_true = []
y_pred = []
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        y_true.extend(labels.cpu().numpy())
        # The predicted class is the one with the highest score
        y_pred.extend(outputs.argmax(dim=1).cpu().numpy())
# Calculate accuracy and confusion matrix
accuracy = (np.array(y_true) == np.array(y_pred)).mean()
cm = confusion_matrix(y_true, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(3, 3), dpi=200)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix, accuracy={:.2f}'.format(accuracy))
plt.savefig("confusion_matrix.png", bbox_inches='tight')
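Accuracy alone can hide class imbalance, so it can help to read per-class recall and precision off the confusion matrix as well. The counts below are hypothetical and are not the results of the model above; rows are true classes and columns are predicted classes, as in scikit-learn.

```python
classes = ["LUAD", "LUSC"]
cm = [
    [50, 10],  # true LUAD: 50 predicted LUAD, 10 predicted LUSC
    [5, 35],   # true LUSC: 5 predicted LUAD, 35 predicted LUSC
]  # hypothetical counts

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

# Recall: share of each true class that was found.
recall = {c: cm[i][i] / sum(cm[i]) for i, c in enumerate(classes)}
# Precision: share of each predicted class that was correct.
precision = {
    c: cm[i][i] / sum(row[i] for row in cm) for i, c in enumerate(classes)
}

print(accuracy)
print(recall)
print(precision)
```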

The whole training process takes about 7 seconds on an NVIDIA RTX 4090 GPU.
Conclusion
We developed an efficient and accurate classifier of lung cancer subtypes from copy number data. As we have shown, such models transfer well to new studies and sources of sequencing data.
The massive scaling of AI is often justified, among other things, as a "solution to cancer". However, as in this article, small models with classical approaches usually serve their purpose well. Some even argue that the real obstacle for machine learning in biology and medicine is not solving problems, but actually making an impact for patients.
Nevertheless, machine learning has largely solved at least one of the big puzzles of computational biology, which has sharpened the focus on machine learning in cancer research. With some luck, we may be able to look back at the next ten years as the time when cancer was finally "solved".
Bonus: Cell2Sentence
Today's foundation models often contain information, such as which genes have high copy numbers in which cancers, as part of their training set. This has led to the creation of LLM-based methods, such as Cell2Sentence, where the data is converted into natural language and submitted to an LLM.
We can use this method here. First we transform a random LUSC sample into a list of genes, based on their copy number value:
import cns
import cns.data_utils as cdu
samples_df, cns_df = cdu.main_load("COSMIC")
cns_df = cns.add_total_cn(cns_df)
lusc_df = cns.select_cns_by_type(cns_df, samples_df, "LUSC")
lusc_one_df = cns.cns_head(lusc_df, 1)
rank_list = lusc_one_df.sort_values(by="total_cn", ascending=False)["name"].to_list()
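The ranking step is the core of this approach: the numeric values are dropped and only the order of the genes is kept. A toy illustration with made-up gene-level copy numbers:

```python
# Hypothetical total copy numbers for five genes in one sample
gene_cn = {"SOX2": 8, "TP63": 7, "CCND1": 9, "KRAS": 2, "TP53": 1}

# Order gene names by descending copy number; the values themselves are discarded.
toy_rank = [
    gene for gene, _ in sorted(gene_cn.items(), key=lambda kv: kv[1], reverse=True)
]
toy_sentence = " ".join(toy_rank)
print(toy_sentence)
```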
We then convert the list of genes into a natural language query:
query = "Below is a list of COSMIC {gene_names} ordered by descending total copy number in a cell. Your task is to identify the cancer type of that sample.\n\n"
query += "{gene_names}: " + " ".join(rank_list) + "\n\n"
query += "Answer:"
print(query)
Pasted into ChatGPT 5, the query leads the model to correctly identify the LUSC cancer:
Based on the gene copy-number profile you listed, the strongest clue is the massive amplification of 11q13 (CCND1, FADD) together with SOX2, PIK3CA, TP63, EGFR, FGFR1, MYC, CCND1/CCND2/CCND3, CDK6, CDK4. This is a very characteristic pattern of squamous cell carcinoma of the lung (LUSC):
CCND1 amplification is one of the most frequent events in LUSC.
SOX2 amplification (3q26) is a defining driver of lung squamous carcinoma, rarely seen in adenocarcinoma.
TP63 amplification is again highly specific to squamous tumors.
Co-amplifications of PIK3CA, FGFR1, EGFR, MYC fit this profile.
The broad set of genes amplified (many regulators of squamous differentiation and RTK/RAS/PI3K signaling) reflects typical copy-number–driven LUSC biology.
By contrast, lung adenocarcinomas usually show KRAS, EGFR, ALK, ROS1, MET exon 14, ERBB2 point mutations/fusions and have fewer widespread squamous-type CNAs.
Answer: Lung squamous cell carcinoma (LUSC).
However, generating the output for this single sample takes longer than classifying the whole dataset with our model, and it would cost around $200 in API fees to classify our whole dataset.



