8 Big Data Pain Points and How to Address Them - Part 1

August 02, 2018

Kamesh Pemmaraju

The word "Big" in Big Data doesn't even come close to capturing what is happening today in our industry and what is yet to come. The volume, velocity, and variety of the data being generated have overwhelmed the capabilities of the infrastructure and analytics we have today.

We are now experiencing Moore's law for data growth: data is doubling every 18 months. No wonder IDC forecasts that the global datasphere will grow to 163 zettabytes (one zettabyte is a trillion gigabytes) by 2025. That's ten times the data generated in 2016.

Data scientists often have to combine data from sources with different volume, variety, and velocity characteristics to gain useful insights, which in turn places different demands on processing power, storage and network performance, latency, and so on. Here's a quick look at the different types of Big Data sources:

Unstructured data: Data generated by sources such as social media, log files, and sensors has little structure and is therefore generally not amenable to traditional database analysis methods. A large variety of Big Data tools, techniques, and approaches have emerged in the last few years to ingest and analyze this data, for example to extract customer sentiment from social media. Newer approaches include Natural Language Processing, News Analytics, unstructured text analysis, etc.

Semi-structured data: Some unstructured data does in fact have some structure. Examples include email, call center logs, and IoT data. Some in the industry have coined the term "semi-structured data" to describe these sources. Extracting useful insights from them may require a combination of traditional databases and newer Big Data tools.

Streaming data: This type of data brings in the dimension of higher velocity and real-time processing constraints. The velocity of data varies widely depending on the type of application: IoT data tends to be small packets of data regularly streamed at low velocity, while 4K video streams stretch the velocity to the highest end of the spectrum.

The alluring promise of these new use cases – and associated emerging technologies and tools – is that they can generate useful insights faster so that companies can take actions to achieve better business outcomes, improve customer experience, and gain significant competitive advantage.

No wonder Big Data projects have been among CIOs' top ten initiatives for the past decade: almost 70 percent of Fortune 1000 firms rate big data as important to their businesses, and over 60 percent already have at least one big data project in place.

While data scientists are dealing with the complexity of how to derive value from diverse data sources, IT practitioners need to figure out the most efficient way to deal with the infrastructure requirements of Big Data projects. Traditional bare-metal infrastructure, with its siloed management of servers, storage, and networks, is not flexible enough to tackle the dynamic nature of the new Big Data workloads. This is where cloud-based systems shine. However, many challenges remain to be addressed in the areas of workload scaling, performance and latency, data migration, bandwidth limitations, and application architectures.

There are many pain points that companies experience when they try to deploy and run Big Data applications in their complex environments or use public or private cloud platforms, and there are also some best practices companies can use to address those pain points.

PAIN POINT 1: LONG COMMUTE FROM STORAGE TO COMPUTE

As data volumes grow from terabytes to petabytes and beyond, it takes longer and longer to move the data close to compute resources and to process and analyze it, which impedes the agility of the organization. Public cloud vendors like AWS, whose model is built on centralized data centers, want to get your data into their cloud and go to extreme lengths (see AWS Snowmobile) to get it. Furthermore, data transfer fees are mostly unidirectional: only data leaving an AWS service is subject to them. Not only is this a classic lock-in scenario, but it also goes against other key emerging trends:

Edge computing and artificial intelligence, especially for use cases such as IoT, 5G, image/speech recognition, and blockchain, need processing and data to be placed closer to each other and/or closer to the user or device. Edge computing delivers faster analytics results because the data is closer to the processing, while also reducing the cost of transporting data to the cloud.

Artificial Intelligence systems are more effective the more data they are given. For example, in deep learning, the more cases (data) you give to the system, the more it learns and the more accurate its results become. This is a case where you need massive parallel processing (e.g., using GPUs) of large data sets. Big Data analytics and AI can complement each other to improve speed of processing and produce more useful and relevant results.
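To put the "long commute" into numbers, here is a back-of-the-envelope sketch of how transfer time grows with data size. The 10 Gbps link speed, the decimal units, and the `efficiency` parameter are illustrative assumptions, not figures from any vendor.

```python
def transfer_days(data_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move data_bytes over a link of link_gbps gigabits per second.

    efficiency (assumed, 0 < efficiency <= 1) is the fraction of the nominal
    link speed that is actually achieved.
    """
    bits = data_bytes * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400  # seconds per day


TB = 10**12  # decimal terabyte
PB = 10**15  # decimal petabyte

for label, size in [("1 TB", TB), ("100 TB", 100 * TB), ("1 PB", PB)]:
    print(f"{label}: {transfer_days(size, link_gbps=10):.2f} days at 10 Gbps")
```

Under these assumptions, one petabyte over a fully utilized 10 Gbps link takes a little over nine days, before any processing starts. Real links rarely sustain their nominal rate, so actual times are likely longer.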

To address the need to get data to where the compute resources are, IT leaders should look for hyper-converged, scale-out solutions that bring together compute, storage, and networking, thus reducing data I/O latency and improving data processing and analytics times. For even better performance, they should look for solutions that can bring the computing units (VMs or containers) as close to the physical storage as possible, without losing the manageability of the storage solution and while maintaining multi-tenancy across the cluster. For example, a Hadoop Data Node VM running on the same physical host and accessing local SSDs will experience the highest performance and faster results overall without impacting other workloads running within other tenants.
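The idea of bringing compute to the data can be sketched as a locality-aware placement rule: prefer a host that already stores the data block on local disks, and fall back to any host with spare capacity. This is a simplified illustration only; the `Host` type, the `place_task` function, and the block IDs are hypothetical and do not reflect how Hadoop or any particular platform schedules work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    name: str
    free_vcpus: int
    local_blocks: frozenset  # IDs of data blocks stored on this host's local SSDs


def place_task(block_id: str, vcpus: int, hosts: list) -> Optional[Host]:
    """Pick a host for a task that reads block_id and needs vcpus virtual CPUs."""
    candidates = [h for h in hosts if h.free_vcpus >= vcpus]
    if not candidates:
        return None  # no host has enough capacity
    # Data locality first: hosts that already hold the block avoid network I/O.
    local = [h for h in candidates if block_id in h.local_blocks]
    pool = local or candidates
    chosen = max(pool, key=lambda h: h.free_vcpus)  # least-loaded host in the pool
    chosen.free_vcpus -= vcpus
    return chosen
```

For example, with a host `a` holding block `b1` and a larger host `b` that does not, a task reading `b1` lands on `a`, while a task reading a block stored on neither host goes to whichever has the most free capacity.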

IT leaders can take advantage of many emerging memory and storage technologies, such as persistent memory (a new memory technology that sits between DRAM and flash: non-volatile, with low latency and higher capacity than DRAM), NVMe, and faster flash drives. With prices falling rapidly, there seems to be little need for spinning disks for primary storage.

IT administrators should implement a central way to manage all the edge computing sites, with the ability to deploy and manage multiple data processing clusters within those sites. Access rights to each of these environments should be managed through strict BU-level and Project-level RBAC and security controls.

PAIN POINT 2: DISTRIBUTED TEAMS, LOCAL PERFORMANCE NEEDS

For data science development and testing use cases, companies do not build a single huge data processing cluster in a centralized data center for all of their big data teams spread around the world. Building such a cluster in one location has DR implications, not to mention latency and country-specific data regulation challenges. Typically, companies want to build out separate local/edge clusters based on location, type of application, data locality requirements, and the need for separate development, test, and production environments.

A single pane of glass for management becomes crucial in this situation, both for operating these clusters efficiently and for simplifying their deployment and upgrades. Strict isolation and role-based access control (RBAC) are often security requirements.

IT administrators should implement a central way to manage diverse infrastructures in multiple sites, with the ability to deploy and manage multiple data processing clusters within those sites. Access rights to each of these environments should be managed through strict BU-level and Project-level RBAC and security controls.
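A BU-level and project-level RBAC check might look something like the following minimal sketch. The role names, permission strings, and the `Grant` and `is_allowed` names are all hypothetical, chosen only to show how a grant scoped to a whole business unit differs from one scoped to a single project.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical roles and the actions each one permits.
ROLE_PERMISSIONS = {
    "viewer": {"cluster:read"},
    "data_scientist": {"cluster:read", "cluster:create", "cluster:delete"},
    "admin": {"cluster:read", "cluster:create", "cluster:delete", "quota:update"},
}


@dataclass(frozen=True)
class Grant:
    user: str
    role: str
    business_unit: str
    project: Optional[str] = None  # None: the grant covers every project in the BU


def is_allowed(grants, user: str, action: str, business_unit: str, project: str) -> bool:
    """Return True if any of the user's grants permits action in the given scope."""
    for g in grants:
        if g.user != user or g.business_unit != business_unit:
            continue  # wrong user or wrong business unit
        if g.project is not None and g.project != project:
            continue  # project-level grant for a different project
        if action in ROLE_PERMISSIONS.get(g.role, set()):
            return True
    return False  # deny by default
```

Denying by default means a user with no matching grant at a site or project gets no access, which is one way to enforce the strict isolation described above.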

PAIN POINT 3: STUCK ON BARE METAL AND ITS SILO INEFFICIENCIES

Companies still run the majority of their Big Data workloads, particularly Hadoop-based workloads, on bare metal. This is obviously not as scalable, elastic, or flexible as a virtual or cloud platform. Traditional bare metal environments are famous for creating silos where various specialist teams (storage, networking, security) form fiefdoms around their respective functional areas. Silos impede velocity because they lead to complexity of operations, lack of consistency in the environment, and lack of automation. Automating across silos turns into an exercise in custom scripts and a lot of "glue and duct tape," which makes maintenance and change management complex, slow, and error-prone.

A virtualized environment for Big Data allows data scientists to create their own Hadoop, Spark or Cassandra clusters and to evaluate their algorithms. These clusters need to be self-service, elastic and high performing. IT should be able to control the resource allocation to data scientists and teams using quotas and role-based access control.
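One way to combine self-service with IT control is to check every cluster request against a per-team quota before provisioning. The sketch below assumes quotas are expressed in vCPUs and memory; the `Quota` class and `request_cluster` function are hypothetical names, not any product's API.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    max_vcpus: int
    max_memory_gb: int
    used_vcpus: int = 0
    used_memory_gb: int = 0

    def try_reserve(self, vcpus: int, memory_gb: int) -> bool:
        """Reserve resources if the request fits in the remaining quota."""
        if (self.used_vcpus + vcpus > self.max_vcpus
                or self.used_memory_gb + memory_gb > self.max_memory_gb):
            return False  # request would exceed the team's quota
        self.used_vcpus += vcpus
        self.used_memory_gb += memory_gb
        return True


def request_cluster(quota: Quota, nodes: int, vcpus_per_node: int,
                    memory_gb_per_node: int) -> bool:
    """Self-service entry point: approve a cluster only if the whole cluster fits."""
    return quota.try_reserve(nodes * vcpus_per_node, nodes * memory_gb_per_node)
```

With a team quota of 64 vCPUs and 256 GB, two four-node clusters of 8 vCPUs and 32 GB per node are approved, and a third identical request is rejected until resources are released.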

Better yet, IT managers should look for an orchestration platform that can deal with both bare metal and virtual environments, so IT can place workloads in the best target environment based on performance and latency requirements.
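As a rough illustration of such a placement decision, the sketch below filters targets by a workload's latency and elasticity requirements and then prefers an elastic target when more than one qualifies. The `Workload` and `Target` types and the latency figures used in the example are made-up assumptions, not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    name: str
    max_io_latency_ms: float  # highest storage latency the workload tolerates
    needs_elasticity: bool    # must scale up and down on demand


@dataclass
class Target:
    name: str
    typical_io_latency_ms: float
    elastic: bool


def choose_target(workload: Workload, targets: list) -> Optional[Target]:
    """Return a target that meets the workload's requirements, or None."""
    eligible = [
        t for t in targets
        if t.typical_io_latency_ms <= workload.max_io_latency_ms
        and (t.elastic or not workload.needs_elasticity)
    ]
    if not eligible:
        return None
    # Prefer elastic targets, then the lowest latency among them.
    return min(eligible, key=lambda t: (not t.elastic, t.typical_io_latency_ms))


# Illustrative numbers only.
targets = [Target("bare-metal", 0.2, False), Target("virtual", 1.0, True)]
print(choose_target(Workload("hdfs-datanode", 0.5, False), targets).name)
print(choose_target(Workload("dev-spark", 5.0, True), targets).name)
```

Here the latency-sensitive workload is placed on bare metal because the virtual target does not meet its latency limit, while the elastic development workload goes to the virtual environment.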

Read 8 Big Data Pain Points and How to Address Them - Part 2 to learn about five more big data pain points.
