Upgrade Hadoop to 3.4.1 #65
Steps to Build Native Hadoop Library on Amazon Linux 2
1. Install Development Tools
2. Update Yum Repository
3. Install Required Libraries
4. Install Full JDK 8 (Amazon Corretto)
   Set JAVA_HOME
5. Install Maven
6. Install Protocol Buffers v3.21.12
7. Install CMake
8. Clone Apache Hadoop Source
9. Build Native Hadoop Library
Optional: Troubleshooting Errors
1. Boost Not Found (>= 1.72.0)
2. Missing SASL Library
3. TIRPC_INCLUDE_DIRS Not Found
Building Native Hadoop Library on CentOS
Environment Details
Steps to Build Native Hadoop Library
1. Install Development Tools
2. Update Yum Repository
3. Install Required Libraries
4. Install Java
   Set JAVA_HOME
5. Install Maven
6. Install Protocol Buffers v3.21.12
7. Install CMake
8. Clone Apache Hadoop Source
9. Build Native Hadoop Library
Optional: Troubleshooting Errors
1. Boost Not Found (>= 1.72.0)
   export BOOST_ROOT=/opt/boost
2. Missing SASL Library
3. TIRPC_INCLUDE_DIRS Not Found
Would it be possible to incorporate building the native libraries into the Maven build lifecycle? Have we attempted it at all? It probably wouldn't be very easy, but I would prefer that to committing binaries.
Also, there are a number of places in the Presto codebase that use reflection and therefore require --add-opens in the JVM arguments. If I recall correctly, a couple of these instances are due to the FileSystem/CompressionCodec interfaces in the Hadoop libraries. It would be nice to see whether we can remove that usage from the Presto codebase by modifying some of the interfaces here.
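To make the --add-opens point concrete, here is a minimal sketch of why reflective access needs that flag. It is not taken from the Presto or Hadoop code: the class name `AddOpensProbe`, the helper `canOpen`, and the choice of `java.lang.String#value` as the probed field are assumptions made purely for illustration.

```java
import java.lang.reflect.Field;

public class AddOpensProbe {
    /** Hypothetical holder class in our own (unnamed) module; its private field is always openable. */
    static class Local {
        private int secret = 42;
    }

    /**
     * Returns true if the named declared field can be made accessible via reflection.
     * Throws IllegalArgumentException if the field does not exist.
     */
    static boolean canOpen(Class<?> type, String fieldName) {
        try {
            Field field = type.getDeclaredField(fieldName);
            field.setAccessible(true);
            return true;
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("No field " + fieldName + " on " + type.getName(), e);
        } catch (RuntimeException e) {
            // On JDK 16+ this is typically InaccessibleObjectException, thrown when the
            // declaring package is not opened to the caller (e.g. via --add-opens).
            return false;
        }
    }

    public static void main(String[] args) {
        // A class in our own module can always be opened reflectively.
        System.out.println("Local#secret openable: " + canOpen(Local.class, "secret"));
        // A JDK-internal field: the result depends on the JDK version and on whether
        // --add-opens java.base/java.lang=ALL-UNNAMED was passed to the JVM.
        System.out.println("String#value openable: " + canOpen(String.class, "value"));
    }
}
```

Running the probe once with and once without `--add-opens java.base/java.lang=ALL-UNNAMED` on a recent JDK shows the difference for the second line; removing the reflective call from the code removes the need for the flag altogether.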
  <name>hadoop-apache2</name>
- <description>Shaded version of Apache Hadoop 2.x for Presto</description>
  <url>https://github.com/facebook/presto-hadoop-apache2</url>
+ <description>Shaded version of Apache Hadoop for Presto</description>
Since the repo name is presto-hadoop-apache2, I am wondering whether we should create a new repository or just rename this one to presto-hadoop-apache.
Oh, right, I had this in my mind as well. I think we can probably rename the repository.
@@ -10,11 +10,11 @@

  <groupId>com.facebook.presto.hadoop</groupId>
  <artifactId>hadoop-apache2</artifactId>
- <version>2.7.4-13-SNAPSHOT</version>
+ <version>3.4.1-1-SNAPSHOT</version>
Can't comment above, but let's change the airbase version to 105.
This is done as part of another commit/PR which handles the Java 17 upgrade: #67
<exclusions>
    <exclusion>
        <groupId>com.google.code.findbugs</groupId>
        <artifactId>jsr305</artifactId>
    </exclusion>
    <exclusion>
        <groupId>com.google.errorprone</groupId>
        <artifactId>error_prone_annotations</artifactId>
    </exclusion>
    <exclusion>
        <groupId>com.google.j2objc</groupId>
        <artifactId>j2objc-annotations</artifactId>
    </exclusion>
    <exclusion>
        <groupId>org.checkerframework</groupId>
        <artifactId>checker-qual</artifactId>
    </exclusion>
</exclusions>
What's the reason for the exclusions? Do they conflict with something else?
We excluded these redundant dependencies from the shaded jar because their classes were causing duplicate-class errors in the Presto build.
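As a rough illustration of what a duplicate-class conflict looks like, the sketch below reports every `.class` entry that appears in more than one jar. It is only an assumption-laden model: the class `DuplicateClassSketch`, the method `findDuplicates`, and the jar names and entry lists in `main` are hypothetical, and the Presto build's actual duplicate check is not shown in this thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class DuplicateClassSketch {
    /**
     * Given jar name -> list of entry paths, returns each .class entry that
     * occurs in two or more jars, mapped to the jars that contain it.
     */
    static Map<String, Set<String>> findDuplicates(Map<String, List<String>> entriesByJar) {
        Map<String, Set<String>> owners = new TreeMap<>();
        for (Map.Entry<String, List<String>> jar : entriesByJar.entrySet()) {
            for (String entry : jar.getValue()) {
                // Only class files matter here; manifests and resources are ignored.
                if (entry.endsWith(".class")) {
                    owners.computeIfAbsent(entry, k -> new LinkedHashSet<>()).add(jar.getKey());
                }
            }
        }
        // Keep only the entries owned by more than one jar.
        owners.values().removeIf(jars -> jars.size() < 2);
        return owners;
    }

    public static void main(String[] args) {
        // Hypothetical contents: a shaded jar that bundles an annotation class
        // which another dependency on the classpath also provides.
        Map<String, List<String>> jars = new LinkedHashMap<>();
        jars.put("shaded-example.jar", Arrays.asList(
                "javax/annotation/Nullable.class", "org/example/Shaded.class"));
        jars.put("jsr305-example.jar", Arrays.asList(
                "javax/annotation/Nullable.class", "META-INF/MANIFEST.MF"));
        findDuplicates(jars).forEach((cls, owners) ->
                System.out.println(cls + " found in " + owners));
    }
}
```

Excluding the redundant artifact from the shaded jar, as the exclusions above do, leaves each such class with a single owner.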
Inline KMSClientProvider from Hadoop 3.2.0

Since https://issues.apache.org/jira/browse/HADOOP-13988 `KMSClientProvider` has this code:
https://github.com/apache/hadoop/blob/rel/release-3.2.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/crypto/key/kms/KMSClientProvider.java#L1176-L1184

This cannot work in Presto. In context of Presto `UserGroupInformation.getLoginUser()` should not be used, but the `KMSClientProvider` still tried to do that.

Co-authored-by: Ariel Weisberg <aweisberg@fb.com>
Co-authored-by: David Phillips <david@acz.org>
Co-authored-by: Hao Luo <hluo@twitter.com>
Co-authored-by: Anu Sudarsan <anuatinfy@gmail.com>
Co-authored-by: Rajat Bhatt <rajatrj@amazon.com>
Co-authored-by: Nishitha-Bhaskaran <nishithakbhaskaran@gmail.com>
Co-authored-by: Shijin K <bibith4@gmail.com>
Co-authored-by: Jalpreet Singh Nanda (:imjalpreet) <jalpreetnanda@gmail.com>
This is also already handled as part of the Java 17 upgrade PR (#67).
@ZacBlanco To generate the native libraries, we need to build the Hadoop source code directly rather than the shaded version used in Presto (presto-hadoop-apache). Therefore, we can't include the native library build process in the Maven build cycle.
So can't we add the Hadoop repo as a git submodule and add a step to our Maven build lifecycle that builds the artifacts from that repo?
@ZacBlanco Here's my take on this. We haven't attempted to integrate native Hadoop library builds into the Maven lifecycle so far. While it's technically possible, it's non-trivial and may not be worth the effort in practice.
IMO, a more practical approach is to document how to reproduce the native builds in a BUILD_NATIVE.md or a Wiki page, including toolchain requirements, supported OS/architectures, and, optionally, build scripts. This provides reproducibility without complicating the Maven lifecycle. Let me know your views and we can discuss further.
I would be satisfied if we could at least provide a minimal Dockerfile/script with an environment capable of building the Hadoop libs. We don't need to integrate it with Maven; we just need the build to be reproducible.
Makes sense. The steps to build the native libraries are outlined in the earlier comments here: @bibith4, could you create scripts or Dockerfiles for each architecture based on those steps? @ZacBlanco, would you prefer checking these into the repo directly, or should we document them on a dedicated Wiki page instead?
I would prefer if they were checked in to the repo.
@imjalpreet @ZacBlanco I will create the scripts and check them in to the repo.
…4 and amd64 architectures
@imjalpreet @ZacBlanco I have added Dockerfiles to build the native Hadoop libraries on Linux. Please check.
Co-authors: @imjalpreet @bibith4 @nishithakbhaskaran
Description
Upgrade Hadoop version from 2.7.4 to 3.4.1
Resolves #68
The above issue also describes the changes made in this PR.
Motivation and Context
presto-hadoop-apache currently uses Apache Hadoop components at version 2.7.4. This PR upgrades them to 3.4.1.
Test Plan
Release Notes