A QUICK SUMMARY – FOR THE BUSY ONES
TABLE OF CONTENTS
Data has become the new gold. 85 percent of companies are trying to be data-driven, according to last year’s survey by NewVantage Partners, and the global data science platform market is expected to reach $128.21 billion by 2022, up from $19.75 billion in 2016.
Clearly, data science is not just another buzzword with limited real-world use cases. Yet, many companies struggle to reorganize their decision making around data and implement a coherent data strategy. The problem certainly isn’t lack of data.
In the past few years alone, 90 percent of all of the world’s data has been created, and our current daily data output has reached 2.5 quintillion bytes, which is such a mind-bogglingly large number that it’s difficult to fully appreciate the break-neck pace at which we generate new data.
The real problem is the inability of companies to transform the data they have at their disposal into actionable insights that can be used to make better business decisions, stop threats, and mitigate risks.
In fact, there’s often too much data available to make a clear decision, which is why it’s crucial for companies to know how to approach a new data science problem and understand what types of questions data science can answer.
“Data science and statistics are not magic. They won’t magically fix all of a company’s problems. However, they are useful tools to help companies make more accurate decisions and automate repetitive work and choices that teams need to make,” writes Seattle Data Guy, a data-driven consulting agency.
The questions that can be answered with the help of data science fall under following categories:
Of course, this is by no means a complete list of all questions that data science can answer. Even if it were, data science is evolving at such a rapid pace that it would most likely be completely outdated within a year or two from its publication.
Now that we’ve established the types of questions that can be reasonably expected to be answered with the help of data science, it’s time to lay down the steps most data scientists would take when approaching a new data science problem.
First, it’s necessary to accurately define the data problem that is to be solved. The problem should be clear, concise, and measurable. Many companies are too vague when defining data problems, which makes it difficult or even impossible for data scientists to translate them into machine code.
Here are some basic characteristics of a well-defined data problem:
There are many data science algorithms that can be applied to data, and they can be roughly grouped into the following families:
With the problem clearly defined and a suitable approach selected, it’s time to collect data. All collected data should be organized in a log along with collection dates and other helpful metadata.
It’s important to understand that collected data is seldom ready for analysis right away. Most data scientists spend much of their time on data cleaning, which includes removing missing values, identifying duplicate records, and correcting incorrect values.
The next step after data collection and cleanup is data analysis. At this stage, there’s a certain chance that the selected data science approach won’t work. This is to be expected and accounted for. Generally, it’s recommended to start with trying all the basic machine learning approaches as they have fewer parameters to alter.
There are many excellent open source data science libraries that can be used to analyze data. Most data science tools are written in Python, Java, or C++.
<blockquote><p>“Tempting as these cool toys are, for most applications the smart initial choice will be to pick a much simpler model, for example using scikit-learn and modeling techniques like simple logistic regression,” – advises Francine Bennett, the CEO and co-founder of Mastodon C.</p></blockquote>
After data analysis, it’s finally time to interpret the results. The most important thing to consider is whether the original problem has been solved. You might discover that your model is working but producing subpar results. One way how to deal with this is to add more data and keep retraining the model until satisfied with it.
Most companies today are drowning in data. The global leaders are already using the data they generate to gain competitive advantage, and others are realizing that they must do the same or perish. While transforming an organization to become data-driven is no easy task, the reward is more than worth the effort.
The 5 steps on how to approach a new data science problem we’ve described in this article are meant to illustrate the general problem-solving mindset companies must adopt to successfully face the challenges of our current data-centric era.
Every year, Brainhub helps 750,000+ founders, leaders and software engineers make smart tech decisions. We earn that trust by openly sharing our insights based on practical software engineering experience.
A serial entrepreneur, passionate R&D engineer, with 15 years of experience in the tech industry.
Top reads this month
Get smarter in engineering and leadership in less than 60 seconds.
Join 300+ founders and engineering leaders, and get a weekly newsletter that takes our CEO 5-6 hours to prepare.
No previous chapters
No next chapters