Companies run into many pain points when they deploy and run Big Data applications in complex environments, whether on their own infrastructure or on public or private cloud platforms. There are also best practices they can use to address those pain points. Here are five more pain points and the corresponding best practices.
Start with 8 Big Data Pain Points and How to Address Them - Part 1
PAIN POINT 4 – BIG DATA TOOLS EXPLOSION AND DEPLOYMENT COMPLEXITY
In the past decade, technologies such as Hadoop and MapReduce have become common frameworks for speeding up the processing of large datasets: they break the data up into small fragments, process the fragments across distributed clusters of storage and compute nodes, and then collate the results for consumption. Companies like Cloudera, Hortonworks, and others have addressed many of the challenges associated with scheduling, cluster management, resource and data sharing, and performance tuning of these tools. Such deployments are typically optimized to run on bare metal or on virtualization platforms like VMware, and they tend to remain in their own silo because of the complexity of deploying and operating these environments.
Modern big data use cases, however, need a whole bunch of other technologies and tools. You have Docker. You have Kubernetes. You have Spark. You have NoSQL Databases such as Cassandra and MongoDB. And when you get into machine learning you have several options.
Deploying Hadoop is quite complex, although companies like Cloudera and Hortonworks have arguably made it relatively easy. But if you also need to deploy Cassandra or MongoDB, you have to put in the effort to write scripts to deploy them. And depending on the target platform (bare metal, VMware, Microsoft), you will need to maintain and run multiple scripts. You then have to figure out how to network the Hadoop cluster with the Cassandra cluster and, inevitably, deal with DNS services, load balancers, firewalls, etc. Add other Big Data tools to be deployed, managed, and integrated, and you will begin to appreciate the challenge.
IT teams should address this challenge with a unifying platform that can not only deploy multiple Big Data tools and platforms from a curated "application and big data catalog," but also virtualize all the underlying infrastructure resources and expose an infrastructure-as-code framework through open API access. This greatly reduces the IT burden of provisioning the underlying infrastructure. End users can deploy the tools they need with a single click, and they can use the APIs to automate deployment, provisioning, and configuration.
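To make the infrastructure-as-code idea concrete, here is a minimal sketch of describing clusters as data and turning them into API requests. Everything in it is an assumption made for illustration: the `ClusterSpec` fields, the `plan_requests` helper, the `/clusters` endpoint, and the `platform.example.com` URL are hypothetical and do not correspond to any specific product's API. The sketch only builds the request descriptions and sends nothing.

```python
from dataclasses import dataclass

# Hypothetical endpoint; a real platform would document its own API.
BASE_URL = "https://platform.example.com/api/v1"


@dataclass(frozen=True)
class ClusterSpec:
    """Declarative description of one cluster (illustrative fields only)."""
    name: str
    catalog_item: str  # assumed catalog entry name, e.g. "hadoop"
    nodes: int
    platform: str      # e.g. "bare-metal" or "vmware"


def plan_requests(specs, base_url=BASE_URL):
    """Translate declarative specs into the HTTP requests a client would send.

    Nothing is sent here; the function only builds request descriptions.
    """
    planned = []
    for spec in specs:
        if spec.nodes < 1:
            raise ValueError(f"{spec.name}: a cluster needs at least one node")
        planned.append({
            "method": "POST",
            "url": f"{base_url}/clusters",
            "json": {
                "name": spec.name,
                "catalogItem": spec.catalog_item,
                "nodes": spec.nodes,
                "platform": spec.platform,
            },
        })
    return planned


if __name__ == "__main__":
    specs = [
        ClusterSpec("analytics-hadoop", "hadoop", nodes=6, platform="bare-metal"),
        ClusterSpec("events-cassandra", "cassandra", nodes=3, platform="vmware"),
    ]
    for request in plan_requests(specs):
        print(request["method"], request["url"], request["json"])
```

Keeping the specs as plain data means the same description can be reviewed, versioned, and replayed against different target platforms, which is the point of replacing per-platform deployment scripts.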
PAIN POINT 5 – ONE BIG DATA CLUSTER DOESN'T ADDRESS ALL NEEDS
Organizations have diverse Big Data teams, production and R&D portfolios, and sometimes conflicting requirements for performance, data locality, cost, or specialized hardware resources. One single, standardized data cluster is not going to meet all of those needs. Companies will need to deploy multiple, independent Big Data clusters with possibly different underlying CPU, memory, and storage footprints. One cluster could be dedicated and fine-tuned for a Hadoop deployment with high local storage IOPS requirements, another may be running Spark jobs with more CPU and memory-bound configurations, and others like machine learning will need GPU infrastructure. Deploying and managing the complexity of such multiple diverse clusters will place a high operational overhead on the IT team, reducing their ability to respond quickly to Big Data user requests, and making it difficult to manage costs and maintain operational efficiency.
To address this pain point, the IT team should again have a unified orchestration/management platform and be able to set up logical business units that can be assigned to different Big Data teams. This way, each team gets full self-service capability within quota limits imposed by the IT staff, and each team can automatically deploy its own Big Data tools with a few clicks, independently of other teams.
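The quota idea can be sketched in a few lines. This is an illustrative model only: the `Resources`, `BusinessUnit`, and `QuotaExceeded` names, and the choice of CPU, memory, and GPU as the tracked dimensions, are assumptions rather than any platform's actual data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resources:
    """A bundle of resources; the three dimensions are illustrative."""
    cpus: int = 0
    memory_gb: int = 0
    gpus: int = 0

    def __add__(self, other):
        return Resources(
            self.cpus + other.cpus,
            self.memory_gb + other.memory_gb,
            self.gpus + other.gpus,
        )

    def fits_within(self, limit):
        """True if every dimension is at or below the given limit."""
        return (
            self.cpus <= limit.cpus
            and self.memory_gb <= limit.memory_gb
            and self.gpus <= limit.gpus
        )


class QuotaExceeded(Exception):
    """Raised when a request would push a business unit past its quota."""


@dataclass
class BusinessUnit:
    name: str
    quota: Resources                                  # limit set by IT staff
    used: Resources = field(default_factory=Resources)

    def deploy(self, request):
        """Reserve resources for a self-service deployment, or refuse it."""
        projected = self.used + request
        if not projected.fits_within(self.quota):
            raise QuotaExceeded(f"{self.name}: request exceeds quota")
        self.used = projected


if __name__ == "__main__":
    ml_team = BusinessUnit("ml-research", Resources(cpus=64, memory_gb=512, gpus=4))
    ml_team.deploy(Resources(cpus=32, memory_gb=256, gpus=4))  # within quota
    try:
        ml_team.deploy(Resources(cpus=8, memory_gb=64, gpus=1))  # no GPUs left
    except QuotaExceeded as err:
        print("rejected:", err)
```

Because each business unit tracks its own usage against its own limit, one team's deployments cannot consume another team's allocation, which is what allows self-service without IT approving each request.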
PAIN POINT 6 – SKYROCKETING IT OPERATIONS COSTS
Developing, deploying, and operating large-scale enterprise big data clusters can get complex, especially if it involves multiple sites, multiple teams, and diverse infrastructure, as we have seen. The operational overhead of these systems is expensive and, where the work is done manually, time-consuming. For example, IT operations teams still need to set up firewalls, load balancers, DNS services, and VPN services, to name a few. They still need to manage infrastructure operations such as physical host maintenance, disk additions/removals/replacements, and physical host additions/removals/replacements. They still need to do capacity planning, and they still need to monitor utilization, allocation, and performance of compute, storage, and networking.
IT teams should look for a solution that addresses this operational overhead through automation and the use of modern SaaS-based management portals that help the teams optimize sizing, perform predictive capacity planning, and implement seamless failure management.
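As a rough illustration of what predictive capacity planning can mean, the sketch below fits a straight line to daily storage usage and estimates how many days remain until capacity is reached. A linear trend is an assumption chosen for simplicity; real management portals may use more sophisticated forecasting. The `days_until_full` function name and the sample numbers are made up for the example.

```python
def days_until_full(daily_usage, capacity):
    """Estimate days left until usage reaches capacity.

    Fits a least-squares line to one sample per day and extrapolates it.
    Returns None when usage is flat or shrinking, since no fill date exists.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    # Day indices are 0..n-1, so their mean is (n - 1) / 2.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))

    slope = sxy / sxx
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope  # day index where the line hits capacity
    return max(0.0, day_full - (n - 1))        # measured from the latest sample


if __name__ == "__main__":
    usage_tb = [50, 52, 54, 56]  # made-up daily usage in TB
    print(days_until_full(usage_tb, capacity=100))
```

With the made-up samples above, usage grows by 2 TB per day from 50 TB, so the estimate is 22 days from the last sample to reach 100 TB.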
PAIN POINT 7 – CONSISTENT POLICY-DRIVEN SECURITY AND CUSTOMIZATION REQUIREMENTS
Enterprises have policies that require the use of their own hardened and approved gold images of operating systems. These operating systems often need security configurations, databases, and other management tools installed before they can be used. Running these images on a public cloud may not be allowed, or they may run very slowly there.
The solution is to enable an on-premises data center image store where enterprises can create customized gold images. Using fine-grained RBAC, the IT team can share these images selectively with various development teams around the world based on the local security, regulatory, and performance requirements. The local Kubernetes deployments are then carried out using these gold images to provide the underlying infrastructure to run containers.
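The selective sharing described above can be sketched as a simple role-to-region grant table. This is a simplified model under stated assumptions: the `GoldImage` and `ImageStore` classes, the role names, and the use of region as the only access criterion are all hypothetical, and a real RBAC system would be considerably more fine-grained.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GoldImage:
    name: str
    region: str  # region whose requirements the image was hardened for


@dataclass
class ImageStore:
    """On-premises image store with role-based visibility (illustrative)."""
    images: list = field(default_factory=list)
    # Maps a role name to the set of regions whose images that role may use.
    grants: dict = field(default_factory=dict)

    def add_image(self, image):
        self.images.append(image)

    def grant(self, role, region):
        self.grants.setdefault(role, set()).add(region)

    def visible_images(self, roles):
        """Return the images a user holding the given roles may deploy."""
        allowed = set()
        for role in roles:
            allowed |= self.grants.get(role, set())
        return [image for image in self.images if image.region in allowed]


if __name__ == "__main__":
    store = ImageStore()
    store.add_image(GoldImage("linux-hardened-eu", region="eu"))
    store.add_image(GoldImage("linux-hardened-us", region="us"))
    store.grant("eu-developers", "eu")

    for image in store.visible_images(["eu-developers"]):
        print(image.name)
```

A user with no matching role sees no images at all, so access is denied by default and has to be granted explicitly by the IT team.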
PAIN POINT 8 – DR STRATEGY FOR EDGE COMPUTING AND BIG DATA CLUSTERS
Any critical application and the data associated with it need to be protected from natural disasters, regardless of whether the application is based on containers. None of the existing solutions provides an out-of-the-box disaster recovery feature for critical edge computing clusters or Big Data analytics applications. Customers are left to cobble together their own DR strategy.
As part of a platform's multi-site capabilities, IT teams should be able to perform data replication and disaster recovery between geographically separated sites. This protects the persistent data and databases used by these clusters.
Infrastructure management for Big Data projects can be extremely complex, but with centralized management of virtualized or cloud-based resources, it can be far easier.