8 Big Data Pain Points and How to Address Them - Part 2
August 03, 2018

Kamesh Pemmaraju
ZeroStack

Share this

There are many pain points that companies experience when they try to deploy and run Big Data applications in their complex environments or use public or private cloud platforms, and there are also some best practices companies can use to address those pain points. Here are 5 more pain points and corresponding best practices.

Start with 8 Big Data Pain Points and How to Address Them - Part 1

PAIN POINT 4 – BIG DATA TOOLS EXPLOSION AND DEPLOYMENT COMPLEXITY

In the past decade, technologies such as Hadoop and MapReduce have become common frameworks to speed up processing of large datasets by breaking up them up into small fragments, running them in distributed farms of storage and processors clusters, and then collating the results back for consumption. Companies like Cloudera, Hortonworks and others have addressed many of the challenges associated with scheduling, cluster management, resource and data sharing, and performance tuning of these tools. And typically, such deployments are optimized to run on bare metal or on virtualization platforms like VMware, and therefore tend to remain in their own silo because of the complexity of deploying and operating these environments.

Modern big data use cases, however, need a whole bunch of other technologies and tools. You have Docker. You have Kubernetes. You have Spark. You have NoSQL Databases such as Cassandra and MongoDB. And when you get into machine learning you have several options.

Deploying Hadoop, which is quite complex, is one thing, arguably made relatively easy by companies like Cloudera and Hortonworks, but then if you need to deploy Cassandra or MongoDB, you have to put in effort to write scripts to deploy them. And depending on the target platform (bare metal, VMware, Microsoft), you will need to maintain and run multiple scripts. You then have to figure out how to network the Hadoop cluster with the Cassandra cluster and of course, inevitably, deal with DNS services, load balancers, firewalls, etc. Add other Big Data tools to be deployed, managed, and integrated, and you will begin to appreciate the challenge.

IT teams should address this challenge with a unifying platform that can not only deploy multiple Big Data tools and platforms from a curated "application and big data catalog," but also provide a way to virtualize all the underlying infrastructure resources along with an infrastructure-as-code framework via open API access This greatly simplifies the IT burden when it comes to provisioning the underlying infrastructure resources, and end users can simply deploy the tools they want and need with a single click and have the ability to use APIs to automate their deployment, provisioning, and configuration challenges.

PAIN POINT 5 – ONE BIG DATA CLUSTER DOESN'T ADDRESS ALL NEEDS

Organizations have diverse Big Data teams, production and R&D portfolios, and sometimes conflicting requirements for performance, data locality, cost, or specialized hardware resources. One single, standardized data cluster is not going to meet all of those needs. Companies will need to deploy multiple, independent Big Data clusters with possibly different underlying CPU, memory, and storage footprints. One cluster could be dedicated and fine-tuned for a Hadoop deployment with high local storage IOPS requirements, another may be running Spark jobs with more CPU and memory-bound configurations, and others like machine learning will need GPU infrastructure. Deploying and managing the complexity of such multiple diverse clusters will place a high operational overhead on the IT team, reducing their ability to respond quickly to Big Data user requests, and making it difficult to manage costs and maintain operational efficiency.

To address this pain point, the IT team should again have a unified orchestration/management platform and be able to set up logical business units that can be assigned to different Big Data teams. This way, each team gets full self-service capability within quota limits imposed by the IT staff, and each team can automatically deploy its own Big Data tools with a few clicks, independently of other teams.

PAIN POINT 6: SKYROCKETING IT OPERATIONS COSTS

Developing, deploying, and operating large-scale enterprise big data clusters can get complex, especially if it involves multiple sites, multiple teams, and diverse infrastructure, as we have seen. The operational overhead of these systems can be expensive and manually time-consuming. For example, IT operations teams still need to set up firewalls, load balancers, DNS services, and VPN services, to name a few. They still need to manage infrastructure operations such as physical host maintenance, disk additions/removals/replacements, and physical host additions/removals/replacements. They still need to do capacity planning, and they still need to monitor utilization, allocation, and performance of compute, storage, and networking.

IT teams should look for a solution that addresses this operational overhead through automation and the use of modern SaaS-based management portals that help the teams optimize sizing, perform predictive capacity planning, and implement seamless failure management.

PAIN POINT 7 – CONSISTENT POLICY-DRIVEN SECURITY AND CUSTOMIZATION REQUIREMENTS

Enterprises have policies around using their specifically hardened and approved gold images of operating systems. The operating systems often need to have security configurations, databases, and other management tools installed before they can be used. Running these on public cloud may not be allowed, or they may run very slowly.

The solution is to enable an on-premises data center image store where enterprises can create customized gold images. Using fine-grained RBAC, the IT team can share these images selectively with various development teams around the world based on the local security, regulatory, and performance requirements. The local Kubernetes deployments are then carried out using these gold images to provide the underlying infrastructure to run containers.

PAIN POINT 8 – DR STRATEGY FOR EDGE COMPUTING AND BIG DATA CLUSTERS

Any critical application and the data associated with it needs to be protected from natural disasters regardless of whether or not these apps are based on containers. None of the existing solutions provides an out-of-the-box disaster recovery feature for critical edge computing clusters or Big Data analytics applications. Customers are left to cobble together their own DR strategy.

As part of a platform's multi-site capabilities, IT teams should be able to perform remote data replication and disaster recovery between remote geographically-separated sites. This protects persistent data and databases used by these clusters.

Infrastructure management for Big Data projects can be extremely complex, but with centralized management of virtualized or cloud-based resources, it can be far easier.

Kamesh Pemmaraju is VP of Product at ZeroStack
Share this

The Latest

February 22, 2024

Some companies are just starting to dip their toes into developing AI capabilities, while (few) others can claim they have built a truly AI-first product. Regardless of where a company is on the AI journey, leaders must understand what it means to build every aspect of their product with AI in mind ...

February 21, 2024

Generative AI will usher in advantages within various industries. However, the technology is still nascent, and according to the recent Dynatrace survey there are many challenges and risks that organizations need to overcome to use this technology effectively ...

February 20, 2024

In today's digital era, monitoring and observability are indispensable in software and application development. Their efficacy lies in empowering developers to swiftly identify and address issues, enhance performance, and deliver flawless user experiences. Achieving these objectives requires meticulous planning, strategic implementation, and consistent ongoing maintenance. In this blog, we're sharing our five best practices to fortify your approach to application performance monitoring (APM) and observability ...

February 16, 2024

In MEAN TIME TO INSIGHT Episode 3, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at Enterprise Management Associates (EMA) discusses network security with Chris Steffen, VP of Research Covering Information Security, Risk, and Compliance Management at EMA ...

February 15, 2024

In a time where we're constantly bombarded with new buzzwords and technological advancements, it can be challenging for businesses to determine what is real, what is useful, and what they truly need. Over the years, we've witnessed the rise and fall of various tech trends, such as the promises (and fears) of AI becoming sentient and replacing humans to the declaration that data is the new oil. At the end of the day, one fundamental question remains: How can companies navigate through the tech buzz and make informed decisions for their future? ...

February 14, 2024

We increasingly see companies using their observability data to support security use cases. It's not entirely surprising given the challenges that organizations have with legacy SIEMs. We wanted to dig into this evolving intersection of security and observability, so we surveyed 500 security professionals — 40% of whom were either CISOs or CSOs — for our inaugural State of Security Observability report ...

February 13, 2024

Cloud computing continues to soar, with little signs of slowing down ... But, as with any new program, companies are seeing substantial benefits in the cloud but are also navigating budgetary challenges. With an estimated 94% of companies using cloud services today, priorities for IT teams have shifted from purely adoption-based to deploying new strategies. As they explore new territories, it can be a struggle to exploit the full value of their spend and the cloud's transformative capabilities ...

February 12, 2024

What will the enterprise of the future look like? If we asked this question three years ago, I doubt most of us would have pictured today as we know it: a future where generative AI has become deeply integrated into business and even our daily lives ...

February 09, 2024

With a focus on GenAI, industry experts offer predictions on how AI will evolve and impact IT and business in 2024. Part 5, the final installment in this series, covers the advantages AI will deliver: Generative AI will become increasingly important for resolving complicated data integration challenges, essentially providing a natural-language intermediary between data endpoints ...

February 08, 2024

With a focus on GenAI, industry experts offer predictions on how AI will evolve and impact IT and business in 2024. Part 4 covers the challenges of AI: In the short term, the rapid development and adoption of AI tools and products leveraging AI services will lead to an increase in biased outputs ...