Upgrade Hadoop to 3.4.1 #65


Open · wants to merge 3 commits into master from upgrade

Conversation

nishithakbhaskaran commented Apr 25, 2025

Co-authors: @imjalpreet, @bibith4, @nishithakbhaskaran

Description

Upgrade Hadoop version from 2.7.4 to 3.4.1

Resolves #68
The above issue also describes the changes made in this PR.

Motivation and Context

presto-hadoop-apache currently uses Apache Hadoop components at version 2.7.4. This PR upgrades them to 3.4.1.

Test Plan

Release Notes

== RELEASE NOTES ==

General Changes
* Upgrade Hadoop to 3.4.1
 


linux-foundation-easycla bot commented Apr 25, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: nishithakbhaskaran (486c361)
  • ✅ login: imjalpreet / name: Jalpreet Singh Nanda (847acc4)
  • ✅ login: bibith4 (1cdf3e2)


bibith4 commented Apr 25, 2025

Steps to Build Native Hadoop Library on Amazon Linux 2
Environment Details
Operating System: Amazon Linux 2 LTS
Architecture: arm64
Node: m6g.2xlarge
Hadoop Version: 3.4.1

1. Install Development Tools
sudo yum groupinstall "Development Tools"

2. Update System Packages
sudo yum update

3. Install Required Libraries
sudo yum install openssl-devel
sudo yum install snappy snappy-devel

4. Install Full JDK 8 (Amazon Corretto)
sudo curl -LO https://corretto.aws/downloads/latest/amazon-corretto-8-aarch64-linux-jdk.tar.gz
sudo tar -xvzf amazon-corretto-8-aarch64-linux-jdk.tar.gz
sudo mv amazon-corretto-8.*-linux-aarch64 /opt/corretto-8

Set JAVA_HOME
export JAVA_HOME=/opt/corretto-8
export PATH=$JAVA_HOME/bin:$PATH

5. Install Maven
sudo yum install maven

6. Install Protocol Buffers v3.21.12
curl -L https://github.com/protocolbuffers/protobuf/archive/refs/tags/v3.21.12.tar.gz > protobuf-3.21.12.tar.gz
tar -zxvf protobuf-3.21.12.tar.gz && cd protobuf-3.21.12
./autogen.sh
./configure --prefix=/usr/local
make
sudo make install
cd ..

7. Install CMake
wget https://github.com/Kitware/CMake/releases/download/v3.22.3/cmake-3.22.3-linux-aarch64.sh
sudo sh cmake-3.22.3-linux-aarch64.sh --prefix=/usr/local/ --exclude-subdir
cmake --version

8. Clone Apache Hadoop Source
git clone https://github.com/apache/hadoop.git
cd hadoop/
git checkout branch-3.4.1

9. Build Native Hadoop Library
cd hadoop
mvn clean package -Pdist,native -DskipTests -Dtar -Drequire.snappy
-Dmaven.javadoc.skip=true
ls hadoop-dist/target/hadoop-3.4.1/lib/native

Optional: Troubleshooting Errors

1. Boost Not Found (>= 1.72.0)
cd /opt
sudo curl -LO https://archives.boost.io/release/1.78.0/source/boost_1_78_0.tar.gz
sudo tar -xvzf boost_1_78_0.tar.gz
cd boost_1_78_0
sudo ./bootstrap.sh --prefix=/opt/boost
sudo ./b2 install
export BOOST_ROOT=/opt/boost
export CPLUS_INCLUDE_PATH=$BOOST_ROOT/include:$CPLUS_INCLUDE_PATH
export LIBRARY_PATH=$BOOST_ROOT/lib:$LIBRARY_PATH

2. Missing SASL Library
sudo yum install cyrus-sasl-devel

3. TIRPC_INCLUDE_DIRS Not Found
sudo yum install libtirpc-devel
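
To sanity-check the output of step 9, Hadoop's bundled checknative tool can be used; a small sketch, run from the build tree:

# Lists which native libraries the build picked up; hadoop and snappy
# should be reported as available.
cd hadoop-dist/target/hadoop-3.4.1
bin/hadoop checknative -a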


bibith4 commented Apr 25, 2025

Building Native Hadoop Library on CentOS

Environment Details
Operating System: CentOS Stream 9 (el9)
Architecture: x86_64
Hadoop Version: 3.4.1

Steps to Build Native Hadoop Library

1. Install Development Tools
sudo yum groupinstall "Development Tools"

2. Update System Packages
sudo yum update

3. Install Required Libraries
sudo yum install openssl-devel
sudo yum install snappy snappy-devel

4. Install Java (OpenJDK 8)
sudo yum install java-1.8.0-openjdk-devel -y

Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.362.b09-4.el9.x86_64
export PATH=$JAVA_HOME/bin:$PATH

5. Install Maven
sudo yum install maven

6. Install Protocol Buffers v3.21.12
curl -L https://github.com/protocolbuffers/protobuf/archive/refs/tags/v3.21.12.tar.gz > protobuf-3.21.12.tar.gz
tar -zxvf protobuf-3.21.12.tar.gz && cd protobuf-3.21.12
./autogen.sh
./configure --prefix=/usr/local
make
sudo make install
cd ..

7. Install CMake
wget https://github.com/Kitware/CMake/releases/download/v3.22.3/cmake-3.22.3-linux-x86_64.sh
sudo sh cmake-3.22.3-linux-x86_64.sh --prefix=/usr/local/ --exclude-subdir
cmake --version

8. Clone Apache Hadoop Source
git clone https://github.com/apache/hadoop.git
cd hadoop/
git checkout branch-3.4.1

9. Build Native Hadoop Library
mvn clean package -Pdist,native -DskipTests -Dtar -Drequire.snappy -Dmaven.javadoc.skip=true
ls hadoop-dist/target/hadoop-3.4.1/lib/native

Optional: Troubleshooting Errors

1. Boost Not Found (>= 1.72.0)
cd /opt
sudo curl -LO https://archives.boost.io/release/1.78.0/source/boost_1_78_0.tar.gz
sudo tar -xvzf boost_1_78_0.tar.gz
cd boost_1_78_0
sudo ./bootstrap.sh --prefix=/opt/boost
sudo ./b2 install

export BOOST_ROOT=/opt/boost
export CPLUS_INCLUDE_PATH=$BOOST_ROOT/include:$CPLUS_INCLUDE_PATH
export LIBRARY_PATH=$BOOST_ROOT/lib:$LIBRARY_PATH

2. Missing SASL Library
sudo dnf install cyrus-sasl-devel

3. TIRPC_INCLUDE_DIRS Not Found
sudo dnf install libtirpc-devel
If that fails:
sudo dnf config-manager --set-enabled crb
sudo dnf makecache
sudo dnf install libtirpc-devel

imjalpreet force-pushed the upgrade branch 4 times, most recently from 5c41dbe to 83bf46a on May 18, 2025 20:24
nishithakbhaskaran changed the title from "[WIP - Do Not Review] Upgrade presto-hadoop-apache version to 3.4.1" to "Upgrade Hadoop to 3.4.1" on May 19, 2025
nishithakbhaskaran marked this pull request as ready for review on May 19, 2025 14:10
ZacBlanco left a comment

Would it be possible to incorporate building the native libraries into the Maven build lifecycle? Did we attempt this at all? It probably wouldn't be very easy, but I would prefer that to committing binaries.

Also, there are a number of places in the Presto codebase that use reflection and require us to pass --add-opens in the JVM arguments. If I recall correctly, there were a couple of instances of this due to the FileSystem/CompressionCodec interfaces in the Hadoop libraries. It would be nice to see if we can remove their usage in the Presto codebase by modifying some of the interfaces here.
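
For illustration, the JVM arguments in question have this general shape (the module/package pairs below are hypothetical examples, not Presto's actual list):

# Hypothetical example of --add-opens flags; the exact packages Presto
# needs to open may differ, and presto-server.jar is a placeholder name.
java \
  --add-opens=java.base/java.lang=ALL-UNNAMED \
  --add-opens=java.base/java.nio=ALL-UNNAMED \
  -jar presto-server.jar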


<name>hadoop-apache2</name>
<description>Shaded version of Apache Hadoop 2.x for Presto</description>
<url>https://github.com/facebook/presto-hadoop-apache2</url>
<description>Shaded version of Apache Hadoop for Presto</description>


Since the repo name is presto-hadoop-apache2, I am wondering if we should create a new repository or just change the repo name to presto-hadoop-apache.

Member

Oh, right, I had this in my mind as well. I think we can probably rename the repository.

@@ -10,11 +10,11 @@

<groupId>com.facebook.presto.hadoop</groupId>
<artifactId>hadoop-apache2</artifactId>
<version>2.7.4-13-SNAPSHOT</version>
<version>3.4.1-1-SNAPSHOT</version>


Can't comment above, but let's change the airbase version to 105.

Member

This is done as part of another commit/PR which handles the Java 17 upgrade: #67

Comment on lines +67 to +82
<exclusions>
<exclusion>
<groupId>com.google.code.findbugs</groupId>
<artifactId>jsr305</artifactId>
</exclusion>
<exclusion>
<groupId>com.google.errorprone</groupId>
<artifactId>error_prone_annotations</artifactId>
</exclusion>
<exclusion>
<groupId>com.google.j2objc</groupId>
<artifactId>j2objc-annotations</artifactId>
</exclusion>
<exclusion>
<groupId>org.checkerframework</groupId>
<artifactId>checker-qual</artifactId>
</exclusion>
</exclusions>


What's the reason for the exclusions? Do they conflict with something else?

nishithakbhaskaran (Author) commented May 26, 2025

We excluded these redundant dependencies from the shaded jar because these artifacts were causing duplicate-class issues in the Presto build.
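
As a sketch of how such duplicates can be traced (assuming a standard Maven setup), a command like the following shows where one of the excluded artifacts enters the dependency graph:

# Trace where jsr305 is pulled in; swap the -Dincludes value to inspect
# the other excluded artifacts.
mvn dependency:tree -Dincludes=com.google.code.findbugs:jsr305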

imjalpreet and others added 2 commits May 26, 2025 13:46
Inline KMSClientProvider from Hadoop 3.2.0

Since https://issues.apache.org/jira/browse/HADOOP-13988
`KMSClientProvider` has this code:
https://github.com/apache/hadoop/blob/rel/release-3.2.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/crypto/key/kms/KMSClientProvider.java#L1176-L1184

This cannot work in Presto. In the context of Presto, `UserGroupInformation.getLoginUser()` should not be used, but `KMSClientProvider` still tried to do that.

Co-authored-by: Ariel Weisberg <aweisberg@fb.com>
Co-authored-by: David Phillips <david@acz.org>
Co-authored-by: Hao Luo <hluo@twitter.com>
Co-authored-by: Anu Sudarsan <anuatinfy@gmail.com>
Co-authored-by: Rajat Bhatt <rajatrj@amazon.com>
Co-authored-by: Nishitha-Bhaskaran <nishithakbhaskaran@gmail.com>
Co-authored-by: Shijin K <bibith4@gmail.com>
Co-authored-by: Jalpreet Singh Nanda (:imjalpreet) <jalpreetnanda@gmail.com>
@imjalpreet
Member

Also, there are a number of places in the Presto codebase that use reflection and require us to pass --add-opens in the JVM arguments. If I recall correctly, there were a couple of instances of this due to the FileSystem/CompressionCodec interfaces in the Hadoop libraries. It would be nice to see if we can remove their usage in the Presto codebase by modifying some of the interfaces here.

This is also already handled as part of the Java 17 upgrade PR (#67)


bibith4 commented May 26, 2025

Would it be possible to incorporate building the native libraries into the Maven build lifecycle? Did we attempt this at all? It probably wouldn't be very easy, but I would prefer that to committing binaries.

@ZacBlanco To generate the native libraries, we need to build the Hadoop source code directly, rather than the shaded version used in Presto (presto-hadoop-apache). Therefore, we can't include the native library build process in the Maven build lifecycle.

imjalpreet requested a review from ZacBlanco on May 26, 2025 09:34
@ZacBlanco

To generate the native libraries, we need to build the Hadoop source code directly, rather than the shaded version used in Presto

So can't we add the hadoop repo as a git submodule and add a step in our maven build lifecycle to build artifacts from that repo?
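
For reference, a minimal sketch of that approach (the submodule path "hadoop" and the plugin choice are assumptions):

# Vendor apache/hadoop as a git submodule at ./hadoop (path is an assumption)
git submodule add https://github.com/apache/hadoop.git hadoop
git submodule update --init --recursive
# A Maven step could then shell out to Hadoop's own build, e.g. via
# exec-maven-plugin, subject to the toolchain concerns raised below.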

@imjalpreet
Member

Would it be possible to incorporate building the native libraries into the Maven build lifecycle? Did we attempt this at all? It probably wouldn't be very easy, but I would prefer that to committing binaries.

So can't we add the hadoop repo as a git submodule and add a step in our maven build lifecycle to build artifacts from that repo?

@ZacBlanco Here’s my take on this.

We haven’t attempted to integrate native Hadoop library builds into the Maven lifecycle so far. While it’s technically possible, it’s non-trivial and may not be worth the effort in practice.

  • As native libraries (like libhadoop.so) are written in C/C++, they require a full native toolchain: autotools, make, gcc, cmake, and potentially dependencies like zlib, etc.
  • We would need to produce these libraries for multiple architectures and potentially older OS versions to ensure broader glibc compatibility. This means setting up cross-builds for Linux (x86 and ARM), macOS (aarch64), and more (see the Buildx sketch after this list). Native libraries can’t be built on a single machine for all targets.
  • Based on my past experience, building Hadoop from source is quite involved. Across 2–3 different Hadoop versions I’ve tried, there were always new environment prerequisites or build issues that cropped up.
  • Given that we don’t frequently upgrade Hadoop, we only need to build these native libraries once per Hadoop upgrade. For all subsequent releases of the shaded dependency, the native libraries remain unchanged.
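
As one illustration of what the Linux half of such a cross-build matrix involves, Docker Buildx with QEMU emulation can target both architectures (a sketch; the image tag and Dockerfile name are hypothetical, emulated builds are slow, and macOS targets would still need a separate macOS host):

# Build images for both Linux architectures from one machine via emulation.
docker buildx create --use
docker buildx build --platform linux/amd64,linux/arm64 \
  -t hadoop-native-build -f Dockerfile.native .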

IMO, a more practical approach is to document how to reproduce the native builds, in a BUILD_NATIVE.md or a Wiki, including toolchain requirements, supported OS/architectures, and optionally, build scripts. This provides reproducibility without complicating the Maven lifecycle.

Let me know what your views are, and we can discuss further.


ZacBlanco commented Jun 10, 2025

I would be satisfied if we could at least provide a minimal Dockerfile/script with an environment capable of building the Hadoop libs. We don't need to integrate it with Maven; we just need the build to be reproducible.
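
A minimal sketch of what such a Dockerfile could look like, condensed from the x86_64 CentOS steps in the earlier comment (the base image, the use of the distro's boost-devel, and other details are assumptions, not the files that were eventually checked in):

# Sketch only: environment for building Hadoop 3.4.1 native libraries (x86_64).
FROM quay.io/centos/centos:stream9

# Toolchain and build dependencies; crb is needed for libtirpc-devel, and
# Stream 9's boost-devel (1.75) satisfies Hadoop's >= 1.72 requirement.
RUN dnf -y install dnf-plugins-core && \
    dnf config-manager --set-enabled crb && \
    dnf -y groupinstall "Development Tools" && \
    dnf -y install openssl-devel snappy snappy-devel boost-devel \
        java-1.8.0-openjdk-devel maven cyrus-sasl-devel libtirpc-devel

# Protocol Buffers 3.21.12, as required by Hadoop 3.4.x
RUN curl -L -o /tmp/protobuf.tar.gz \
        https://github.com/protocolbuffers/protobuf/archive/refs/tags/v3.21.12.tar.gz && \
    tar -zxf /tmp/protobuf.tar.gz -C /tmp && \
    cd /tmp/protobuf-3.21.12 && ./autogen.sh && ./configure --prefix=/usr/local && \
    make && make install && ldconfig

# CMake 3.22.3
RUN curl -LO https://github.com/Kitware/CMake/releases/download/v3.22.3/cmake-3.22.3-linux-x86_64.sh && \
    sh cmake-3.22.3-linux-x86_64.sh --prefix=/usr/local/ --exclude-subdir

# Build the native libraries; output lands under hadoop-dist/target/.../lib/native
RUN git clone --depth 1 --branch branch-3.4.1 https://github.com/apache/hadoop.git /hadoop && \
    cd /hadoop && \
    mvn clean package -Pdist,native -DskipTests -Dtar -Drequire.snappy -Dmaven.javadoc.skip=true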

@imjalpreet
Member

Makes sense. The steps to build the native libraries are outlined in the earlier comments here:
#65 (comment) and #65 (comment).

@bibith4, could you create scripts or Dockerfiles for each architecture based on those steps?

@ZacBlanco, would you prefer checking these into the repo directly, or should we document them on a dedicated Wiki page instead?

@ZacBlanco

I would prefer if they were checked in to the repo


bibith4 commented Jun 11, 2025

@imjalpreet @ZacBlanco I will create the scripts and check them in to the repo.


bibith4 commented Jun 16, 2025

@imjalpreet @ZacBlanco I have added Dockerfiles to build the native libs for Hadoop on Linux. Please check.

Successfully merging this pull request may close these issues: Upgrade to Hadoop 3.4.1 and JDK 17