Bridging the Gap Between Research and Literacy with Marco Hening Tallarico

In the Author Spotlight series, TDS Editors talk to members of our community about their work in data science and AI, their writing, and their sources of inspiration. Today, we are excited to share our interview with Marco Hening Tallarico.

Marco is a graduate student at the University of Toronto and a researcher at Risklab, with a strong interest in applied mathematics and machine learning. Born in Brazil and raised in Canada, Marco appreciates the universal language of mathematics.

What motivates you to take dense academic concepts (like stochastic differential equations) and turn them into accessible tutorials for the wider TDS community?

It's natural to want to read everything in its natural order: algebra, then calculus, then the advanced material. But if you want to improve quickly, you have to resist that tendency. When you're solving a maze, it's cheating to start from a spot in the middle, but in reading there is no such rule. Start at the end and work backwards if you like. It makes things less boring.

Your article The Challenge of Data Science focuses on detecting data leakage in code rather than in theory. In your experience, what are the most common silent leaks still making it into production systems today?

It's really easy to let data leakage creep in during exploratory analysis, or when using aggregates as inputs to a model, especially now that aggregates can be computed in real time so easily. Before plotting anything, before even running .head(), I think it's important to make the train-test split. Think carefully about how that split should be done: user-level, by size, chronological, or hierarchical. There are many choices to make, and it's worth taking the time.
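A minimal sketch of that discipline, using an entirely made-up event log (the column names `user_id`, `ts`, and `value` are illustrative assumptions, not from the article): split chronologically first, then explore only the training rows.

```python
import pandas as pd

# Hypothetical event log; columns and values are illustrative only.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "ts": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-01-20",
        "2024-03-01", "2024-02-15", "2024-03-20",
    ]),
    "value": [10, 12, 7, 9, 5, 6],
})

# Split FIRST, before any .head(), plots, or aggregates.
# Here a chronological split: everything after the cutoff is held out.
cutoff = pd.Timestamp("2024-03-01")
train = df[df["ts"] < cutoff]
test = df[df["ts"] >= cutoff]

# Only now explore: every statistic comes from the training period alone.
print(train["value"].mean())  # mean of the 4 training rows
```

A user-level or hierarchical split would instead partition on `user_id` (e.g. with a group-aware splitter), so that no user appears in both sets.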

Also, if you're using metrics like average users per month, you need to double-check that the aggregate wasn't computed over the month you're using as your test set. These leaks are tricky precisely because they aren't obvious. It's not always as blatant as using black-box data to predict which planes will crash: if you have the black box, it's not a prediction; the plane has already crashed.
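The monthly-aggregate trap can be sketched in a few lines (the table and month labels below are hypothetical): the leaky version would compute the per-user average over all rows, including the held-out month; the safe version restricts the aggregate to training-period rows.

```python
import pandas as pd

# Illustrative event counts per user per month (made-up data).
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "month": ["2024-01", "2024-02", "2024-03", "2024-01", "2024-03"],
    "n_events": [4, 6, 8, 3, 5],
})

test_month = "2024-03"  # held-out evaluation month

# Safe: build the feature from pre-test months only.
# (A leaky version would group over `events` directly, test month included.)
train_events = events[events["month"] < test_month]
avg_per_user = train_events.groupby("user_id")["n_events"].mean()
print(avg_per_user.to_dict())  # per-user averages from train months only
```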

You've argued that learning a grammar from data alone is computationally expensive. Do you believe hybrid models (learned + formal) are the only way to scale AI sustainably over time?

If we take LLMs as an example, there are many simple tasks they struggle with, such as adding a list of numbers or converting a page of text to upper case. It's not unreasonable to think that making the model bigger will solve these problems, but it's not a good solution. It's more reliable to have the model call .sum() or .upper() on your behalf, using its linguistic reasoning only to select the inputs. This is likely what the big AI models are already doing with prompt engineering.
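A toy sketch of that delegation pattern, with a hand-written `pick_tool` standing in for the model's structured tool call (everything here is hypothetical and not any real framework's API): the "reasoning" only chooses a tool and its inputs, while the exact computation is done by deterministic code.

```python
# Deterministic tools: exact, reliable, cheap.
TOOLS = {
    "sum": lambda xs: sum(xs),
    "upper": lambda s: s.upper(),
}

def pick_tool(request):
    """Stand-in for the model's reasoning: map a request to a tool call.
    In a real system the LLM would emit this structured call itself."""
    if isinstance(request, list):
        return "sum", request
    return "upper", request

def answer(request):
    name, args = pick_tool(request)
    return TOOLS[name](args)  # execution is exact, not generated token-by-token

print(answer([1, 2, 3]))  # the sum is computed, not hallucinated
print(answer("hello"))    # likewise for upper-casing
```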

It's also much easier to use a formal grammar to remove unwanted artifacts, like the em-dash problem, than to scrape another third of the internet and do more training.

You compare forward and inverse problems in PDE theory. Can you share a real-world situation, other than the temperature model, where the inverse-problem approach would be the solution?

The forward problem is usually the one most people are comfortable with. In the Black-Scholes model, the forward problem would be: given some market conditions, what is the option price? But there is another question we can ask: given a large number of observed prices, what are the parameters of the model? This is the inverse problem; in the Black-Scholes case, it gives you the implied volatility.
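A hedged sketch of this forward/inverse pair, using only the standard library (the parameter values are illustrative): the forward function prices a European call under Black-Scholes, and the inverse function recovers the volatility from an observed price by bisection, which works because the call price is monotone increasing in volatility.

```python
import math

def bs_call(S, K, T, r, sigma):
    """Forward problem: option price from model parameters."""
    N = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * N(d1) - K * math.exp(-r * T) * N(d2)

def implied_vol(price, S, K, T, r, lo=1e-4, hi=5.0, tol=1e-8):
    """Inverse problem: recover sigma from an observed price by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid  # price too low: volatility must be higher
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Forward: price a call with known volatility; inverse: recover it.
price = bs_call(S=100, K=100, T=1.0, r=0.05, sigma=0.2)
vol = implied_vol(price, S=100, K=100, T=1.0, r=0.05)
print(round(vol, 4))
```

This one-dimensional case is well-posed; the interview's broader point is that many inverse problems (like recovering a wing shape from a flow field) are not, because multiple causes can produce the same observations.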

We can also think in terms of the Navier-Stokes equations, which model fluid dynamics. The forward problem: given the wing shape, initial velocity, and air viscosity, compute the velocity or pressure field. But we can also ask: given the observed velocity and pressure fields, what shape should the wing be? This is often very difficult to solve. Given the causes, it's easy to compute the effects; given a bunch of effects, it's much harder to recover the cause, because multiple causes can explain the same observation.

That's part of why PINNs have emerged: they show how neural networks can learn PDE solutions from data. This opens up a whole optimization toolbox, like Adam, SGD, and backpropagation, for solving PDEs, and that's genius.

As a Master's student who is also a talented technical writer, what advice would you give to other students who want to start sharing their research on platforms like Towards Data Science?

I think that in technical writing there are two competing forces you have to manage; you can think of them as distillation and dilution. Research articles are a lot like a shot of vodka: in the introduction, entire fields of study are summarized in a few sentences. But while the harsh taste of vodka comes from distillation, in writing the main culprit is jargon. This verbal compression algorithm lets us discuss abstract concepts, such as the curse of dimensionality or data leakage, in just a few words. It's a tool that can also be your undoing.

A dense research paper might be 7 pages; an in-depth textbook can run 800 (a piña colada by comparison). Both are good for the same reason: they provide the right level of detail for the right audience. To find that level of detail, you must learn about the genre you want to publish in.

The way you dilute it matters too; nobody wants one part warm water, one part Tito's. Some recipes that make writing clearer include using memorable similes (they make the content stick, like a spilled piña colada on the table), focusing on a few key concepts, and elaborating on examples.

But there's also a distillation that happens in technical writing, and that comes down to "omit needless words," an old Strunk & White maxim that will always ring true. It's a reminder to keep learning the craft of writing; Roy Peter Clark is my favorite.

You also write about your own research topics. How do you tailor your content differently when writing for a general data science audience versus a research-focused one?

I would definitely avoid any alcohol-related metaphors, and any figurative language, in fact. Stick to the concrete. In research articles, the main thing you need to communicate is what progress has been made: where the field was before, and where it is now. It isn't teaching; it assumes the audience already knows the background. It's about selling an idea and advocating a particular approach. You should show that there was a gap and explain how your paper fills it. If you can do those two things, you have a great research paper.

To learn more about Marco's work and stay up to date with his latest articles, you can visit his website and follow him on TDS, or on LinkedIn.
