Hadoop platforms have long seemed essential to the implementation of big data projects, but they are not always the best solution for every business need.
While there is no longer any doubt that companies now have an obligation to use their data to ensure their growth, and even their survival, the implementation of such a project is complex.
The primary objective is in fact to gain control over the company’s raw data in order to extract information from it.
The last few years have thus seen a boom in big data projects, reinforced more recently by AI-related projects.
Big data platforms have been deployed on a massive scale in companies, almost all of them tied for historical reasons to Hadoop technology. But while Hadoop was once the necessary technological prerequisite, this is no longer true today.
1. Hadoop platforms: ubiquitous but often inadequate
While Hadoop platforms may differ in terms of node size and usage across organizations, we can see that a single production platform is often used across the enterprise.
This is primarily because a platform only becomes efficient at around ten nodes or more, which creates a need to pool uses and thus rationalize costs. Moreover, given the technical complexity and specificity of these platforms, the IT department tends to deploy them according to a global architectural model. Finally, many companies opt for a single big data platform because they do not first think about the uses applicable to their business model, sometimes giving in to the mirages of fashion.
Hadoop platforms have therefore become widespread, but they still suffer from significant drawbacks:
High maintenance complexity:
These platforms are built from an aggregation of Hadoop components held together by a vendor's overlay, to which are added in-house frameworks, a dependence on Java and Linux technologies, and clusters often deployed on virtualized machines. It is therefore difficult to keep these platforms consistent from one update to the next.
Low return on investment:
60% of big data projects do not go beyond the study or POC stage, and among the 40% that are deployed, only a third have a positive ROI.
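Taken together, these figures compound into a strikingly low success rate, as a quick back-of-the-envelope calculation shows (the percentages are the ones quoted above):

```python
# Share of all big data projects that end up with a positive ROI,
# using the figures quoted above.
stalled = 0.60                      # projects that never leave the study/POC stage
deployed = 1 - stalled              # 40% reach production
positive_roi = deployed * (1 / 3)   # only a third of deployed projects pay off

print(f"{positive_roi:.0%} of all projects have a positive ROI")  # → 13%
```

In other words, barely one project in eight justifies its investment.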
The problem stems from governance: of projects on the one hand, during the industrialization of data science models on the production platform; and of data on the other, for lack of knowledge of the company's information assets and of the methods needed to make them usable.
Lack of agility:
The more successful a platform is within the business, the less flexible its use becomes, because every update must be validated for all of the processes running on the cluster. In addition, since these processes do not have the same configuration needs in terms of memory or CPU consumption, an agnostic cluster is optimal for no one, and the platform ends up poorly exploited or underused.
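This under-exploitation can be made concrete with a minimal sketch. The node and job profiles below are invented for illustration: a CPU-bound workload and a memory-bound workload sharing the same one-size-fits-all node each leave a large share of the other resource idle.

```python
# Hypothetical node profile of an "agnostic" cluster (illustrative numbers).
NODE = {"cpu": 16, "ram_gb": 64}

# Invented per-task profiles of two very different workloads.
JOBS = {
    "ml_training":   {"cpu": 8, "ram_gb": 8},   # CPU-bound
    "in_memory_agg": {"cpu": 2, "ram_gb": 32},  # memory-bound
}

def idle_fraction(node, job):
    """Fraction of CPU and RAM left idle when a node runs as many
    copies of one job as will fit."""
    fits = min(node["cpu"] // job["cpu"], node["ram_gb"] // job["ram_gb"])
    idle_cpu = 1 - fits * job["cpu"] / node["cpu"]
    idle_ram = 1 - fits * job["ram_gb"] / node["ram_gb"]
    return idle_cpu, idle_ram

for name, job in JOBS.items():
    cpu_idle, ram_idle = idle_fraction(NODE, job)
    print(f"{name}: {cpu_idle:.0%} CPU idle, {ram_idle:.0%} RAM idle")
```

With these (made-up) profiles, the CPU-bound job leaves 75% of the RAM idle and the memory-bound job leaves 75% of the CPU idle: whichever workload you tune the nodes for, the other wastes most of one resource.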
Cloudera and Hortonworks dominated the Hadoop platform market for several years, the other players remaining marginal, such as MapR, which nearly disappeared before being acquired by HPE. These market movements are not unrelated to the arrival of more generalist vendors in the big data field.
2. New solutions for specific uses
Most users of packaged Hadoop distributions have found that they do not need all of the services these distributions offer. These solutions were in fact designed by experts for experts, whereas the users of the data are potentially everyone in the company. Add to that heavy investment and a rarely achieved ROI, and the time has come for technological rationalization.
To address the drawbacks observed with Hadoop platforms, cloud providers have started to offer specialized services that allow dedicated platforms to be assembled, each addressing a particular use.
We can cite for example:
- Storage of large volumes of data with good input/output performance (Azure Data Lake Storage, AWS S3, Google Cloud Storage, etc.)
- Fast, elastic compute capacity (AWS EC2, Kubernetes, etc.)
- Big data business intelligence (BI) tools (Redshift, Azure Synapse, Snowflake, BigQuery, etc.)
- High-velocity ingestion, transformation, and modeling tools (Databricks, etc.)
- Managed Hadoop platforms rented in the cloud (HDInsight, EMR, etc.)
These solutions thus offer specific parts of the services offered by traditional Hadoop platforms, but without their administrative complexity.
These bring several advantages, even if they give up some of the capabilities of the original platforms:
- Ease of use and speed of handling specific to managed services in the cloud,
- Costs rationalized by paying only for the services actually used,
- Better agility and control for IT teams.
Hadoop clusters have thus lost their dominance over big data technologies, especially as certain "on-premises" products (SQL Server, Oracle, etc.) have also closed the gap in this field.
3. The flip side of SaaS
As the data solutions deployed in cloud mode have now reached a certain maturity, it is possible to draw some lessons from their implementation:
Simple to use, difficult to integrate:
Although the new tools have enabled significant savings of time and energy when starting new projects, their integration into a secure information system, as well as their interoperability, are sometimes problematic.
Sometimes unpredictable costs:
Few organizations invest in the financial management of cloud platforms, and the role of FinOps is often ignored or limited to adjusting production costs. It is not uncommon to find platforms that have consumed their annual budget by the summer.
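The "budget gone by summer" scenario is easy to reproduce. Here is a minimal FinOps-style projection, with invented figures, showing in which month a constant monthly cloud spend exhausts an annual budget:

```python
def budget_exhausted_month(annual_budget, monthly_spend):
    """Return the 1-based month in which cumulative spend exceeds the
    annual budget, or None if the budget survives the year."""
    total = 0.0
    for month in range(1, 13):
        total += monthly_spend
        if total > annual_budget:
            return month
    return None

# Invented figures: 120 k budgeted for the year, but actual consumption
# runs at 18 k per month — the budget is blown in month 7 (July).
print(budget_exhausted_month(120_000, 18_000))
```

Tracking the real monthly burn rate against this kind of projection is precisely the job that FinOps teams are supposed to do, and that is too often skipped.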
A weakening of data ownership:
Vendors do everything they can to attract users to their platform and keep them there. Few tools work across several platforms, and repatriating data to "traditional" datacenters is still very expensive to set up. In addition, securing this type of platform is much more complex, and data leaks are not uncommon.
Little expertise on the market:
These tools are relatively new, and engineers able to use them correctly are rare. Faced with this shortage, many people embark on these technologies without the required level of training, which impacts the ROI of the platforms: much larger development and operating budgets, lower production stability, etc.
Faced with these issues, discussions are underway to find the right balance between the flexibility and agility of cloud services on the one hand, and the stability, robustness, and security (both technical and financial) of the solutions on the other.
4. Adapt big data tools to business needs rather than the other way around
Before choosing your technological tools, you must establish a clear big data strategy adapted to your own needs.
This can be summed up in four principles:
Business uses take precedence over technology:
By keeping the "why" of a project in mind, we avoid choosing platforms disconnected from the company's business. The business need should dictate the "how", rather than looking for use cases to justify an existing platform.
Specialized and responsive services rather than shared monoliths:
For big data projects to succeed, the IT department must provide a service adapted to each need, including in terms of roadmap and time to market, at the risk of seeing the business lines move forward without it and develop shadow IT.
Cross-disciplinary governance rather than siled initiatives:
It is not ideal either for each business line to have its own platform. It is important to ensure good communication between the actors around company data, and to find a balance between the flexibility the business lines need and the company's overall IT strategy.
Fluid management rather than a one-way strategy:
Technologies are evolving ever faster, as are the strategic shifts of vendors. It is therefore necessary to maintain a constant watch over all of these activities.
Finally, these principles may seem clear, but their implementation can be complex in many contexts. Since there is no miracle solution that fixes every problem, it is essential to be supported by people with hands-on experience.