8 Big Data Pain Points and How to Address Them - Part 2
August 03, 2018

Kamesh Pemmaraju
ZeroStack

Share this

There are many pain points that companies experience when they try to deploy and run Big Data applications in their complex environments or use public or private cloud platforms, and there are also some best practices companies can use to address those pain points. Here are 5 more pain points and corresponding best practices.

Start with 8 Big Data Pain Points and How to Address Them - Part 1

PAIN POINT 4 – BIG DATA TOOLS EXPLOSION AND DEPLOYMENT COMPLEXITY

In the past decade, technologies such as Hadoop and MapReduce have become common frameworks to speed up processing of large datasets by breaking up them up into small fragments, running them in distributed farms of storage and processors clusters, and then collating the results back for consumption. Companies like Cloudera, Hortonworks and others have addressed many of the challenges associated with scheduling, cluster management, resource and data sharing, and performance tuning of these tools. And typically, such deployments are optimized to run on bare metal or on virtualization platforms like VMware, and therefore tend to remain in their own silo because of the complexity of deploying and operating these environments.

Modern big data use cases, however, need a whole bunch of other technologies and tools. You have Docker. You have Kubernetes. You have Spark. You have NoSQL Databases such as Cassandra and MongoDB. And when you get into machine learning you have several options.

Deploying Hadoop, which is quite complex, is one thing, arguably made relatively easy by companies like Cloudera and Hortonworks, but then if you need to deploy Cassandra or MongoDB, you have to put in effort to write scripts to deploy them. And depending on the target platform (bare metal, VMware, Microsoft), you will need to maintain and run multiple scripts. You then have to figure out how to network the Hadoop cluster with the Cassandra cluster and of course, inevitably, deal with DNS services, load balancers, firewalls, etc. Add other Big Data tools to be deployed, managed, and integrated, and you will begin to appreciate the challenge.

IT teams should address this challenge with a unifying platform that can not only deploy multiple Big Data tools and platforms from a curated "application and big data catalog," but also provide a way to virtualize all the underlying infrastructure resources along with an infrastructure-as-code framework via open API access This greatly simplifies the IT burden when it comes to provisioning the underlying infrastructure resources, and end users can simply deploy the tools they want and need with a single click and have the ability to use APIs to automate their deployment, provisioning, and configuration challenges.

PAIN POINT 5 – ONE BIG DATA CLUSTER DOESN'T ADDRESS ALL NEEDS

Organizations have diverse Big Data teams, production and R&D portfolios, and sometimes conflicting requirements for performance, data locality, cost, or specialized hardware resources. One single, standardized data cluster is not going to meet all of those needs. Companies will need to deploy multiple, independent Big Data clusters with possibly different underlying CPU, memory, and storage footprints. One cluster could be dedicated and fine-tuned for a Hadoop deployment with high local storage IOPS requirements, another may be running Spark jobs with more CPU and memory-bound configurations, and others like machine learning will need GPU infrastructure. Deploying and managing the complexity of such multiple diverse clusters will place a high operational overhead on the IT team, reducing their ability to respond quickly to Big Data user requests, and making it difficult to manage costs and maintain operational efficiency.

To address this pain point, the IT team should again have a unified orchestration/management platform and be able to set up logical business units that can be assigned to different Big Data teams. This way, each team gets full self-service capability within quota limits imposed by the IT staff, and each team can automatically deploy its own Big Data tools with a few clicks, independently of other teams.

PAIN POINT 6: SKYROCKETING IT OPERATIONS COSTS

Developing, deploying, and operating large-scale enterprise big data clusters can get complex, especially if it involves multiple sites, multiple teams, and diverse infrastructure, as we have seen. The operational overhead of these systems can be expensive and manually time-consuming. For example, IT operations teams still need to set up firewalls, load balancers, DNS services, and VPN services, to name a few. They still need to manage infrastructure operations such as physical host maintenance, disk additions/removals/replacements, and physical host additions/removals/replacements. They still need to do capacity planning, and they still need to monitor utilization, allocation, and performance of compute, storage, and networking.

IT teams should look for a solution that addresses this operational overhead through automation and the use of modern SaaS-based management portals that help the teams optimize sizing, perform predictive capacity planning, and implement seamless failure management.

PAIN POINT 7 – CONSISTENT POLICY-DRIVEN SECURITY AND CUSTOMIZATION REQUIREMENTS

Enterprises have policies around using their specifically hardened and approved gold images of operating systems. The operating systems often need to have security configurations, databases, and other management tools installed before they can be used. Running these on public cloud may not be allowed, or they may run very slowly.

The solution is to enable an on-premises data center image store where enterprises can create customized gold images. Using fine-grained RBAC, the IT team can share these images selectively with various development teams around the world based on the local security, regulatory, and performance requirements. The local Kubernetes deployments are then carried out using these gold images to provide the underlying infrastructure to run containers.

PAIN POINT 8 – DR STRATEGY FOR EDGE COMPUTING AND BIG DATA CLUSTERS

Any critical application and the data associated with it needs to be protected from natural disasters regardless of whether or not these apps are based on containers. None of the existing solutions provides an out-of-the-box disaster recovery feature for critical edge computing clusters or Big Data analytics applications. Customers are left to cobble together their own DR strategy.

As part of a platform's multi-site capabilities, IT teams should be able to perform remote data replication and disaster recovery between remote geographically-separated sites. This protects persistent data and databases used by these clusters.

Infrastructure management for Big Data projects can be extremely complex, but with centralized management of virtualized or cloud-based resources, it can be far easier.

Kamesh Pemmaraju is VP of Product at ZeroStack
Share this

The Latest

August 17, 2018

As a Network Operations professional, you know how hard it is to ensure optimal network performance when you’re unsure of how end-user devices, application code, and infrastructure affect performance. Identifying your important applications and prioritizing their performance is more difficult than ever, especially when much of an organization’s web-based traffic appears the same to the network. You need insight to maximize performance — not inefficient troubleshooting, longer time to resolution, and an overall lack of application intelligence. But you can stay ahead. Follow these 10 steps to maximize the performance of your applications and underlying network infrastructure ...

August 16, 2018

IT organizations are constantly trying to optimize operations and troubleshooting activities and for good reason. Let's look at one example for the medical industry. Networked applications, such as electronic medical records (EMR), are vital for hospitals to provide outstanding service to their patients and physicians. However, a networking team can often not be aware of slow response times on the remotely hosted EMR application until a physician or someone else calls in to complain ...

August 15, 2018

In 2014, AWS Lambda introduced serverless architecture. Since then, many other cloud providers have developed serverless options. What’s behind this rapid growth? ...

August 14, 2018

This question is really two questions. The first would be: What's really going on in terms of a confusion of terms? — as we wrestle with AIOps, IT Operational Analytics, big data, AI bots, machine learning, and more generically stated "AI platforms" (… and the list is far from complete). The second might be phrased as: What's really going on in terms of real-world advanced IT analytics deployments — where are they succeeding, and where are they not? This blog will look at both questions as a way of introducing EMA's newest research with data ...

August 13, 2018

Consumers will now trade app convenience for security, according to a study commissioned by F5 Networks, The Curve of Convenience – The Trade-Off between Security and Convenience ...

August 10, 2018

Gartner unveiled the CX Pyramid, a new methodology to test organizations’ customer journeys and forge more powerful experiences that deliver greater customer loyalty and brand advocacy ...

August 09, 2018

Nearly half (48 percent) of consumers report that they currently use, or have used in the past, services of organizations that were involved in a publicly disclosed data breach and, of those, 48 percent have stopped using the services of an organization because of a breach, according to Global State of Digital Trust Survey and Index 2018, a new report from CA Technologies ...

August 08, 2018

Here's the problem: IT teams are in the dark. The only information they have available to them is based on what users decide to tell them about through calls to the help desk ...

August 07, 2018

Over the past year, the enterprise network grew significantly more complicated, creating new challenges for network professionals, according to IDG’s 8th annual State of the Network study. Internet of Things (IoT) projects, the demands of an increasingly mobile workforce, and an explosion of apps prompted network professionals to enhance their network infrastructure and the skillsets needed to support it. Network professionals are now being asked to help shape IT strategy ...

August 06, 2018

Retailers are already busy prepping to avoid an Amazon Prime type meltdown during the holiday shopping season. However, rather than focusing efforts on coping with surges in traffic to your website, you also need to be thinking about the ongoing speed of your site ...