Big Data Forensics: Learning Hadoop Investigations

By : Joe Sremack

Overview of this book

Big Data forensics is an important type of digital investigation that involves the identification, collection, and analysis of large-scale Big Data systems. Hadoop is one of the most popular Big Data solutions, and forensically investigating a Hadoop cluster requires specialized tools and techniques. With the explosion of Big Data, forensic investigators need to be prepared to analyze the petabytes of data stored in Hadoop clusters. Understanding Hadoop’s operational structure and performing forensic analysis with court-accepted tools and best practices will help you conduct a successful investigation. Discover how to perform a complete forensic investigation of large-scale Hadoop clusters using the same tools and techniques employed by forensic experts. This book begins by taking you through the process of forensic investigation and the pitfalls to avoid. It will walk you through Hadoop's internals and architecture, and you will discover what types of information Hadoop stores and how to access that data. You will learn to identify Big Data evidence using techniques to survey a live system and interview witnesses. After setting up your own Hadoop system, you will collect evidence using techniques such as forensic imaging and application-based extractions. You will analyze Hadoop evidence using advanced tools and techniques to uncover events and statistical information. Finally, data visualization and evidence presentation techniques are covered to help you properly communicate your findings to any audience.

Big Data forensics

The growth in data volumes and the advent of Big Data systems have changed the requirements of forensics when Big Data is involved. Traditional forensics relies on time-consuming and interruptive processes for collecting data. Techniques central to traditional forensics include removing hard drives from machines containing source evidence, calculating MD5/SHA-1 checksums, and performing physical collections that capture all metadata. However, practical limitations of Big Data systems prevent investigators from always applying these techniques. This section explains the differences between traditional forensics and Big Data forensics.

One goal of any type of forensic investigation is to reliably collect relevant evidence in a defensible manner. The evidence in a forensic investigation is the data stored in the system. This data can be the contents of a file, metadata, deleted files, in-memory data, hard drive slack space, and other forms. Forensic techniques are designed to capture all relevant information. In certain cases, especially when questions about potentially deleted information exist, the entire filesystem needs to be collected using a physical collection of every individual bit from the source system. In other cases, only the informational content of a source filesystem or application system is of value. This situation arises most commonly when only structured data systems, such as databases, are in question, and metadata or slack space is irrelevant or impractical to collect. Both types of collection are equally sound; which one to apply depends on practical considerations and on the types of evidence required.

Big Data forensics is the identification, collection, analysis, and presentation of the data in a Big Data system. The practical challenges of Big Data systems aside, the goal is to collect data from distributed filesystems, large-scale databases, and the associated applications. Many similarities exist between traditional forensics and Big Data forensics, but the differences are important to understand.

Tip

Every forensic investigation is different. When choosing how to proceed with collecting data, consider the investigation requirements and practical limitations.

Metadata preservation

Metadata is any information about a file, data container, or application data that describes its attributes. Metadata provides information about the file that may be valuable when questions arise about how the file was created, modified, or deleted. Metadata can describe who altered a file, when a file was revised, and which system or application generated the data. These are crucial facts when trying to understand the life cycle and story of an individual file.
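As a rough illustration of the filesystem-level metadata described above, the following Python sketch reads a few attributes of a single file with the standard library. The function name `file_metadata` and the chosen fields are assumptions made for this example; a real collection would be performed with forensic tools and would capture far more detail.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_metadata(path):
    """Return basic filesystem metadata for one file as a plain dict.

    Hypothetical helper for illustration only; not a forensic tool.
    """
    st = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "size_bytes": st.st_size,
        # Numeric owner ID (reported as 0 on platforms without POSIX owners).
        "owner_uid": st.st_uid,
        # Last content modification time, normalized to UTC.
        "modified_utc": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
        # Last access time; many systems update this lazily or not at all.
        "accessed_utc": datetime.fromtimestamp(st.st_atime, tz=timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # Create a throwaway file so the example is self-contained.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as fh:
        fh.write(b"sample evidence")
    print(file_metadata(fh.name))
    os.remove(fh.name)
```

Note that merely reading a file can change its access time on some systems, which is one reason traditional collections work from forensic copies rather than from the live source.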

Metadata is not always crucial to a Big Data investigation. Metadata is often altered or lost when data flows into and through a Big Data system, because ingestion engines and data feeds typically collect the data without preserving its metadata. The metadata that remains would thus not show who created the data, when the data was last altered in the upstream data source, and so on. Collecting metadata in these cases may not serve a purpose. Instead, upstream information about how the data was received can be collected as an alternative source of detail.

Investigations into Big Data systems can hinge on the information in the data and not the metadata. As with structured data systems, metadata does not serve a purpose when an investigation is based solely on the content of the data. Quantitative and qualitative questions can be answered by the data itself; metadata in that case is not useful, so long as the collection was performed properly and no questions exist about who imported or altered the data in the Big Data system. In that case, the data within the system is the only source of information.

Tip

Upstream information collected from application logs, source systems, and/or audit logs can be used in place of metadata collection.

Collection methods

Big Data systems are large, complex systems that support ongoing business operations. As such, it may not be possible to take them offline for a forensic investigation. In traditional forensics, systems can be taken offline, and a collection is performed by removing the hard drive to create a forensic copy of the data. In Big Data investigations, hundreds or thousands of storage hard drives may be involved, and data is lost when the Big Data system is brought offline. Big Data collections therefore usually require logical and targeted collection methods, such as forensic copies of logical files and query-based collection.
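To make the idea of query-based collection concrete, here is a minimal Python sketch. It assumes an in-memory SQLite database as a stand-in for whatever query interface the Big Data system actually exposes; the `transactions` table, its columns, and the `collect_by_query` helper are all hypothetical. The sketch runs one targeted query, writes the rows out as CSV, and returns a log entry describing what was collected.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone


def collect_by_query(conn, query, params, out):
    """Run one targeted query, write the rows as CSV to `out`, and
    return a log entry describing the collection (hypothetical helper)."""
    cursor = conn.execute(query, params)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header row
    row_count = 0
    for row in cursor:
        writer.writerow(row)
        row_count += 1
    return {
        "query": query,
        "params": list(params),
        "row_count": row_count,
        "collected_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # SQLite stands in for the real system; the table is made up.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (id INTEGER, account TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(1, "A-100", 25.0), (2, "B-200", 40.5), (3, "A-100", 12.25)],
    )
    buffer = io.StringIO()
    entry = collect_by_query(
        conn,
        "SELECT id, amount FROM transactions WHERE account = ? ORDER BY id",
        ("A-100",),
        buffer,
    )
    print(entry)
    print(buffer.getvalue())
```

Recording the exact query text, its parameters, the row count, and the collection time alongside the output gives the investigator a collection log that can later support verification.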

Collection verification

Traditional forensics relies on MD5 and SHA-1 hashes to verify the integrity of the data collected, but it is not always feasible to use hashing algorithms to verify Big Data collections. Computing either hash requires reading every byte of the collected data, so both are disk-access intensive, and hashing accounts for a large percentage of the time dedicated to collecting and verifying source evidence. Spending the time to calculate MD5 and SHA-1 hashes for a Big Data collection may not be feasible when many terabytes of data are collected. The alternative is to rely on control totals, collection logs, and other descriptive information to verify the collection.
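The control totals mentioned above can be sketched in Python as follows. This is a minimal example under stated assumptions: the data is CSV, the field name `amount` is hypothetical, and the `control_totals` helper is invented for illustration. The same totals are computed for the source extract and for the collected copy; matching record counts and sums give supporting evidence that the collection is complete, although they do not offer the bit-level assurance of a hash.

```python
import csv
import io
from decimal import Decimal


def control_totals(lines, amount_field):
    """Return the record count and the exact sum of one numeric field.

    `lines` is any iterable of CSV text lines, such as an open file.
    Decimal avoids floating-point rounding when summing amounts.
    """
    count = 0
    total = Decimal("0")
    for record in csv.DictReader(lines):
        count += 1
        total += Decimal(record[amount_field])
    return {"record_count": count, "amount_total": total}


if __name__ == "__main__":
    # Made-up sample data standing in for a source extract and its copy.
    source = "id,amount\n1,25.00\n2,40.50\n3,12.25\n"
    collected = "id,amount\n1,25.00\n2,40.50\n3,12.25\n"
    source_totals = control_totals(io.StringIO(source), "amount")
    collected_totals = control_totals(io.StringIO(collected), "amount")
    print("totals match:", source_totals == collected_totals)
```

In practice the source-side totals would typically be obtained from the system itself (for example, through an aggregate query) and recorded in the collection log next to the totals computed from the collected copy.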
