CHAPTER #12

5 Steps on How to Approach a New Data Science Problem

Many companies struggle to reorganize their decision making around data and implement a coherent data strategy. The problem certainly isn’t lack of data but inability to transform it into actionable insights. Here's how to do it right.

Last Update

24 Mar, 2021

Reading Time

6 Min

Authors
Marcin DrykaCTO
Github LinkLinkedin Link
Matt WarcholinskiCOO & Co-Founder
Linkedin Link
Bianka PluszczewskaEditor
Linkedin Link

Introduction

Data has become the new gold. 85 percent of companies are trying to be data-driven, according to last year’s survey by NewVantage Partners, and the global data science platform market is expected to reach $128.21 billion by 2022, up from $19.75 billion in 2016.

Clearly, data science is not just another buzzword with limited real-world use cases. Yet, many companies struggle to reorganize their decision making around data and implement a coherent data strategy. The problem certainly isn’t lack of data.

In the past few years alone, 90 percent of all of the world’s data has been created, and our current daily data output has reached 2.5 quintillion bytes, which is such a mind-bogglingly large number that it’s difficult to fully appreciate the break-neck pace at which we generate new data.

The real problem is the inability of companies to transform the data they have at their disposal into actionable insights that can be used to make better business decisions, stop threats, and mitigate risks.

In fact, there’s often too much data available to make a clear decision, which is why it’s crucial for companies to know how to approach a new data science problem and understand what types of questions data science can answer.

What types of questions can data science answer?

“Data science and statistics are not magic. They won’t magically fix all of a company’s problems. However, they are useful tools to help companies make more accurate decisions and automate repetitive work and choices that teams need to make,” writes Seattle Data Guy, a data-driven consulting agency.

The questions that can be answered with the help of data science fall under following categories:

  • Identifying themes in large data sets: Which server in my server farm needs maintenance the most?
  • Identifying anomalies in large data sets: Is this combination of purchases different from what this customer has ordered in the past?
  • Predicting the likelihood of something happening: How likely is this user to click on my video?
  • Showing how things are connected to one another: What is the topic of this online article?
  • Categorizing individual data points: Is this an image of a cat or a mouse?

Of course, this is by no means a complete list of all questions that data science can answer. Even if it were, data science is evolving at such a rapid pace that it would most likely be completely outdated within a year or two from its publication.

Now that we’ve established the types of questions that can be reasonably expected to be answered with the help of data science, it’s time to lay down the steps most data scientists would take when approaching a new data science problem.

Step 1: Define the problem

First, it’s necessary to accurately define the data problem that is to be solved. The problem should be clear, concise, and measurable. Many companies are too vague when defining data problems, which makes it difficult or even impossible for data scientists to translate them into machine code.

Here are some basic characteristics of a well-defined data problem:

  • The solution to the problem is likely to have enough positive impact to justify the effort.
  • Enough data is available in a usable format.
  • Stakeholders are interested in applying data science to solve the problem.

Step 2: Decide on an approach

There are many data science algorithms that can be applied to data, and they can be roughly grouped into the following families:

  • Two-class classification: useful for any question that has just two possible answers.
  • Multi-class classification: answers a question that has multiple possible answers.
  • Anomaly detection: identifies data points that are not normal.
  • Regression: gives a real-valued answer and is useful when looking for a number instead of a class or category.
  • Multi-class classification as regression: useful for questions that occur as rankings or comparisons.
  • Two-class classification as regression: useful for binary classification problems that can also be reformulated as regression.
  • Clustering: answer questions about how data is organized by seeking to separate out a data set into intuitive chunks.
  • Dimensionality reduction: reduces the number of random variables under consideration by obtaining a set of principal variables.
  • Reinforcement learning algorithms: focus on taking action in an environment so as to maximize some notion of cumulative reward.

Step 3: Collect data

With the problem clearly defined and a suitable approach selected, it’s time to collect data. All collected data should be organized in a log along with collection dates and other helpful metadata.

It’s important to understand that collected data is seldom ready for analysis right away. Most data scientists spend much of their time on data cleaning, which includes removing missing values, identifying duplicate records, and correcting incorrect values.

Step 4: Analyze data

The next step after data collection and cleanup is data analysis. At this stage, there’s a certain chance that the selected data science approach won’t work. This is to be expected and accounted for. Generally, it’s recommended to start with trying all the basic machine learning approaches as they have fewer parameters to alter.

There are many excellent open source data science libraries that can be used to analyze data. Most data science tools are written in Python, Java, or C++.

“Tempting as these cool toys are, for most applications the smart initial choice will be to pick a much simpler model, for example using scikit-learn and modeling techniques like simple logistic regression,”

– advises Francine Bennett, the CEO and co-founder of Mastodon C.

Step 5: Interpret results

After data analysis, it’s finally time to interpret the results. The most important thing to consider is whether the original problem has been solved. You might discover that your model is working but producing subpar results. One way how to deal with this is to add more data and keep retraining the model until satisfied with it.

Conclusion

Most companies today are drowning in data. The global leaders are already using the data they generate to gain competitive advantage, and others are realizing that they must do the same or perish. While transforming an organization to become data-driven is no easy task, the reward is more than worth the effort.

The 5 steps on how to approach a new data science problem we’ve described in this article are meant to illustrate the general problem-solving mindset companies must adopt to successfully face the challenges of our current data-centric era.

Liked this chapter?

Awesome! We’ll be adding new content on this topic soon. Want to be notified?

Leading a Software Development Team: Guidebook for CTOs and Team Leaders

Follow handbook