
Commit 3fce83f

Merge pull request #14534 from JohnSnowLabs/release/600-release-candidate
Release/600 release candidate
2 parents (a95c2b6 + 7f160f3), commit 3fce83f

File tree: 2,483 files changed (+486,148 / -200,022 lines)


.github/workflows/publish_docs.yaml

Lines changed: 65 additions & 48 deletions
@@ -3,60 +3,77 @@ name: Publish APIs
 on:
   push:
     branches:
-      - '*release*'
-      - 'release/**'
+      - "*release*"
+      - "release/**"
   pull_request:
     branches:
-      - 'main'
-      - 'master'
-      - '*release*'
-      - 'release/**'
+      - "main"
+      - "master"
+      - "*release*"
+      - "release/**"
+
 env:
   GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

 jobs:
   build:
     if: "contains(toJSON(github.event.commits.*.message), '[run doc]')"
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-22.04
     steps:
-      - name: checkout repo
-        uses: actions/checkout@v2
-      - name: Set up JDK 8
-        uses: actions/setup-java@v1
-        with:
-          java-version: 1.8
-      - name: Install Python 3.7
-        uses: actions/setup-python@v2
-        with:
-          python-version: 3.7.7
-          architecture: x64
-      - name: Build Scala APIs
-        run: |
-          sbt doc
-      - name: Install PyPI dependencies
-        run: |
-          python -m pip install --upgrade pip
-          cd ./python/docs && pip install -r requirements_doc.txt
-      - name: Build Python APIs
-        run: |
-          cd ./python/docs
-          make html
-      - name: Commit changes
-        id: commit
-        run: |
-          git config --local user.email "action@github.com"
-          git config --local user.name "github-actions"
-          git add --all
-          if [-z "$(git status --porcelain)"]; then
-            echo "::set-output name=push::false"
-          else
-            git commit -m "Update Scala and Python APIs" -a
-            echo "::set-output name=push::true"
-          fi
-        shell: bash
-      - name: Push changes
-        if: steps.commit.outputs.push == 'true'
-        uses: ad-m/github-push-action@master
-        with:
-          github_token: ${{ secrets.GITHUB_TOKEN }}
-          branch: ${{ github.ref }}
+      - name: Checkout repo
+        uses: actions/checkout@v2
+
+      - name: Set up JDK 8
+        uses: actions/setup-java@v1
+        with:
+          java-version: 1.8
+
+      - name: Install Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.8"
+          architecture: "x64"
+
+      - name: Install SBT
+        run: |
+          echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list
+          echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
+          curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x99E82A75642AC823" | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/sbt.gpg > /dev/null
+          sudo apt-get update
+          sudo apt-get install -y sbt
+
+      - name: Build Scala APIs
+        run: sbt doc
+
+      - name: Install PyPI dependencies
+        run: |
+          python -m pip install --upgrade pip
+          cd ./python/docs && pip install -r requirements_doc.txt
+
+      - name: Build Python APIs
+        run: |
+          cd ./python/docs
+          # Run with verbose output to debug any issues
+          SPHINX_APIDOC_OPTIONS=members,undoc-members,show-inheritance sphinx-apidoc -e -f -o ./_api ../sparknlp ../sparknlp/tests
+          make html SPHINXOPTS="-v"
+
+      - name: Commit changes
+        id: commit
+        run: |
+          git config --local user.email "action@github.com"
+          git config --local user.name "github-actions"
+          git add --all
+          if [ -z "$(git status --porcelain)" ]; then
+            echo "push=false" >> $GITHUB_OUTPUT
+          else
+            git commit -m "Update Scala and Python APIs" -a
+            echo "push=true" >> $GITHUB_OUTPUT
+          fi
+        shell: bash
+
+      - name: Push changes
+        if: ${{ steps.commit.outputs.push == 'true' }}
+        uses: ad-m/github-push-action@master
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          branch: ${{ github.ref }}
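The updated "Commit changes" step replaces the deprecated `::set-output` workflow command with writes to the `$GITHUB_OUTPUT` environment file, where each line is a `key=value` pair. A minimal Python sketch of that file-based mechanism (the `parse_outputs` helper is illustrative, not part of any GitHub tooling):

```python
import os
import tempfile


def parse_outputs(path):
    """Parse key=value lines the way GitHub Actions reads $GITHUB_OUTPUT."""
    outputs = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and "=" in line:
                key, _, value = line.partition("=")
                outputs[key] = value
    return outputs


# Simulate what `echo "push=true" >> $GITHUB_OUTPUT` leaves behind.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("push=true\n")
    output_file = fh.name

outputs = parse_outputs(output_file)
print(outputs["push"])  # -> true
os.unlink(output_file)
```

Downstream steps then read the value as `steps.commit.outputs.push`, exactly as the "Push changes" step's `if:` condition does.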

CHANGELOG

Lines changed: 40 additions & 0 deletions
@@ -1,3 +1,43 @@
+=======
+6.0.0
+=======
+----------------
+New Features & Enhancements
+----------------
+* Introducing new large language models:
+  * OLMo model support (SPARKNLP-1006)
+  * Phi 3.5 Vision model support (SPARKNLP-1060)
+  * LLAVA model support (SPARKNLP-1033)
+  * CoHere model support (SPARKNLP-1032)
+  * Qwen2-VL model support (SPARKNLP-1077)
+  * Llama 3.2 Vision models (SPARKNLP-1078)
+  * Deepseek Janus model support (SPARKNLP-1088)
+  * Added LLAVA v1.5 7b quantized model
+  * Added StarCoder2 3b int8 model
+
+* New MultipleChoice Transformers:
+  * AlbertForMultipleChoice (SPARKNLP-1105)
+  * DistilBertForMultipleChoice (SPARKNLP-1106)
+  * RoBertaForMultipleChoice (SPARKNLP-1107)
+  * XlmRoBertaForMultipleChoice (SPARKNLP-1108)
+
+* New file format support:
+  * Excel files reader (SPARKNLP-1102)
+  * PowerPoint files reader (SPARKNLP-1103)
+  * PDF reader (SPARKNLP-1098)
+  * Text reader (SPARKNLP-1113)
+
+* Other improvements:
+  * AutoGGUFVisionModel for vision model support (SPARKNLP-1079)
+  * Added Extractor to SparkNLP (SPARKNLP-1109)
+  * Updated Python and Scala model names
+  * Improved error handling for AutoGGUF models
+
+----------------
+Bug Fixes
+----------------
+* Fixed typo in MXBAI notebook
+
 ========
 5.5.3
 ========

README.md

Lines changed: 17 additions & 8 deletions
@@ -19,7 +19,7 @@

 Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed environment.

-Spark NLP comes with **83000+** pretrained **pipelines** and **models** in more than **200+** languages.
+Spark NLP comes with **100000+** pretrained **pipelines** and **models** in more than **200+** languages.
 It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Image to Text (captioning)**, **Automatic Speech Recognition**, **Zero-Shot Learning**, and many more [NLP tasks](#features).

 **Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Llama-2**, **M2M100**, **BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**, **Vision Transformers (ViT)**, **OpenAI Whisper**, **Llama**, **Mistral**, **Phi**, **Qwen2**, and many more not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
@@ -63,7 +63,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.5.3 pyspark==3.3.1
+$ pip install spark-nlp==6.0.0 pyspark==3.3.1
 ```

 In Python console or Jupyter `Python3` kernel:
@@ -129,10 +129,11 @@ For a quick example of using pipelines and models take a look at our official [d

 ### Apache Spark Support

-Spark NLP *5.5.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *6.0.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
+| 6.0.x | YES | YES | YES | YES | YES | YES | NO | NO |
 | 5.5.x | YES | YES | YES | YES | YES | YES | NO | NO |
 | 5.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
 | 5.3.x | YES | YES | YES | YES | YES | YES | NO | NO |
@@ -146,6 +147,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

 | Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
 |-----------|------------|------------|------------|------------|------------|------------|------------|
+| 6.0.x | NO | YES | YES | YES | YES | NO | YES |
 | 5.5.x | NO | YES | YES | YES | YES | NO | YES |
 | 5.4.x | NO | YES | YES | YES | YES | NO | YES |
 | 5.3.x | NO | YES | YES | YES | YES | NO | YES |
@@ -157,7 +159,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http

 ### Databricks Support

-Spark NLP 5.5.3 has been tested and is compatible with the following runtimes:
+Spark NLP 6.0.0 has been tested and is compatible with the following runtimes:

 | **CPU** | **GPU** |
 |--------------------|--------------------|
@@ -174,7 +176,7 @@ We are compatible with older runtimes. For a full list check databricks support

 ### EMR Support

-Spark NLP 5.5.3 has been tested and is compatible with the following EMR releases:
+Spark NLP 6.0.0 has been tested and is compatible with the following EMR releases:

 | **EMR Release** |
 |--------------------|
@@ -184,6 +186,13 @@ Spark NLP 5.5.3 has been tested and is compatible with the following EMR release
 | emr-7.0.0 |
 | emr-7.1.0 |
 | emr-7.2.0 |
+| emr-7.3.0 |
+| emr-7.4.0 |
+| emr-7.5.0 |
+| emr-7.6.0 |
+| emr-7.7.0 |
+| emr-7.8.0 |

 We are compatible with older EMR releases. For a full list check EMR support in our official [documentation](https://sparknlp.org/docs/en/install#emr-support)

@@ -205,7 +214,7 @@ deployed to Maven central. To add any of our packages as a dependency in your ap
 from our official documentation.

 If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your
-projects [Spark NLP SBT S5.5.3r](https://github.com/maziyarpanahi/spark-nlp-starter)
+projects [Spark NLP SBT S6.0.0r](https://github.com/maziyarpanahi/spark-nlp-starter)

 ### Python

@@ -250,7 +259,7 @@ In Spark NLP we can define S3 locations to:

 Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation.

-## Document5.5.3
+## Documentation

 ### Examples

@@ -283,7 +292,7 @@ the Spark NLP library:
   keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
   abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
 }
-}5.5.3
+}
 ```

 ## Community support
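The compatibility matrix in the README diff above can be encoded as a quick pre-flight check before installing. A minimal sketch, assuming the 6.0.x row (Apache Spark 3.0.x through 3.5.x supported, 2.x not); the `supports_spark` helper is hypothetical and not a Spark NLP API:

```python
def supports_spark(spark_version: str) -> bool:
    """Return True if Spark NLP 6.0.x supports the given Apache Spark
    version, per the README matrix (3.0.x-3.5.x: YES, 2.3.x/2.4.x: NO).
    Hypothetical helper, not part of the sparknlp package."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return major == 3 and 0 <= minor <= 5


# Pairs with the install line `pip install spark-nlp==6.0.0 pyspark==3.3.1`.
print(supports_spark("3.3.1"))  # -> True
print(supports_spark("2.4.8"))  # -> False (Spark 2.x rows are NO for 6.0.x)
```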

build.sbt

Lines changed: 3 additions & 2 deletions
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)

 organization := "com.johnsnowlabs.nlp"

-version := "5.5.3"
+version := "6.0.0"

 (ThisBuild / scalaVersion) := scalaVer

@@ -163,7 +163,8 @@ lazy val utilDependencies = Seq(
   poiDocx
     exclude ("org.apache.logging.log4j", "log4j-api"),
   scratchpad
-    exclude ("org.apache.logging.log4j", "log4j-api")
+    exclude ("org.apache.logging.log4j", "log4j-api"),
+  pdfBox
 )

 lazy val typedDependencyParserDependencies = Seq(junit)

conda/meta.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,13 @@
11
{% set name = "spark-nlp" %}
2-
{% set version = "5.5.3" %}
2+
{% set version = "6.0.0" %}
33

44
package:
55
name: {{ name|lower }}
66
version: {{ version }}
77

88
source:
99
url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/spark-nlp-{{ version }}.tar.gz
10-
sha256: b620487092256d02bf8d277374c564cd22384d437c97a4bb5b3b0f1fdfc696e8
10+
sha256: 58f4f530105d5c5522fc37ce4d3b63af1e2463b43e000cf69838e0854b468365
1111

1212
build:
1313
noarch: python
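The conda recipe above pins the source tarball by its sha256 digest. A minimal sketch of verifying a downloaded artifact against such a pin with Python's standard library; the demo file and its contents are placeholders, not the real 6.0.0 release tarball:

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through hashlib.sha256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with a throwaway file; for the real recipe you would compare
# sha256_of("spark-nlp-6.0.0.tar.gz") against the pinned value.
with open("demo.tar.gz", "wb") as fh:
    fh.write(b"not a real release tarball")

print(sha256_of("demo.tar.gz"))
```

Chunked reading keeps memory flat even for large tarballs, which is why conda-build itself hashes sources the same streaming way.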

docs/_layouts/landing.html

Lines changed: 1 addition & 1 deletion
@@ -201,7 +201,7 @@ <h3 class="grey h3_title">{{ _section.title }}</h3>
 <div class="highlight-box">
 {% highlight bash %}
 # Using PyPI
-$ pip install spark-nlp==5.5.3
+$ pip install spark-nlp==6.0.0

 # Using Anaconda/Conda
 $ conda install -c johnsnowlabs spark-nlp

docs/api/com/index.html

Lines changed: 4 additions & 4 deletions
@@ -3,9 +3,9 @@
 <head>
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
-  <title>Spark NLP 5.5.3 ScalaDoc - com</title>
-  <meta name="description" content="Spark NLP 5.5.3 ScalaDoc - com" />
-  <meta name="keywords" content="Spark NLP 5.5.3 ScalaDoc com" />
+  <title>Spark NLP 6.0.0 ScalaDoc - com</title>
+  <meta name="description" content="Spark NLP 6.0.0 ScalaDoc - com" />
+  <meta name="keywords" content="Spark NLP 6.0.0 ScalaDoc com" />
   <meta http-equiv="content-type" content="text/html; charset=UTF-8" />

@@ -28,7 +28,7 @@
 </head>
 <body>
 <div id="search">
-  <span id="doc-title">Spark NLP 5.5.3 ScalaDoc<span id="doc-version"></span></span>
+  <span id="doc-title">Spark NLP 6.0.0 ScalaDoc<span id="doc-version"></span></span>
   <span class="close-results"><span class="left">&lt;</span> Back</span>
   <div id="textfilter">
     <span class="input">

docs/api/com/johnsnowlabs/client/CloudClient.html

Lines changed: 4 additions & 4 deletions
@@ -3,9 +3,9 @@
 <head>
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
-  <title>Spark NLP 5.5.3 ScalaDoc - com.johnsnowlabs.client.CloudClient</title>
-  <meta name="description" content="Spark NLP 5.5.3 ScalaDoc - com.johnsnowlabs.client.CloudClient" />
-  <meta name="keywords" content="Spark NLP 5.5.3 ScalaDoc com.johnsnowlabs.client.CloudClient" />
+  <title>Spark NLP 6.0.0 ScalaDoc - com.johnsnowlabs.client.CloudClient</title>
+  <meta name="description" content="Spark NLP 6.0.0 ScalaDoc - com.johnsnowlabs.client.CloudClient" />
+  <meta name="keywords" content="Spark NLP 6.0.0 ScalaDoc com.johnsnowlabs.client.CloudClient" />
   <meta http-equiv="content-type" content="text/html; charset=UTF-8" />

@@ -28,7 +28,7 @@
 </head>
 <body>
 <div id="search">
-  <span id="doc-title">Spark NLP 5.5.3 ScalaDoc<span id="doc-version"></span></span>
+  <span id="doc-title">Spark NLP 6.0.0 ScalaDoc<span id="doc-version"></span></span>
   <span class="close-results"><span class="left">&lt;</span> Back</span>
   <div id="textfilter">
     <span class="input">

docs/api/com/johnsnowlabs/client/CloudManager.html

Lines changed: 4 additions & 4 deletions
@@ -3,9 +3,9 @@
 <head>
   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
   <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
-  <title>Spark NLP 5.5.3 ScalaDoc - com.johnsnowlabs.client.CloudManager</title>
-  <meta name="description" content="Spark NLP 5.5.3 ScalaDoc - com.johnsnowlabs.client.CloudManager" />
-  <meta name="keywords" content="Spark NLP 5.5.3 ScalaDoc com.johnsnowlabs.client.CloudManager" />
+  <title>Spark NLP 6.0.0 ScalaDoc - com.johnsnowlabs.client.CloudManager</title>
+  <meta name="description" content="Spark NLP 6.0.0 ScalaDoc - com.johnsnowlabs.client.CloudManager" />
+  <meta name="keywords" content="Spark NLP 6.0.0 ScalaDoc com.johnsnowlabs.client.CloudManager" />
   <meta http-equiv="content-type" content="text/html; charset=UTF-8" />

@@ -28,7 +28,7 @@
 </head>
 <body>
 <div id="search">
-  <span id="doc-title">Spark NLP 5.5.3 ScalaDoc<span id="doc-version"></span></span>
+  <span id="doc-title">Spark NLP 6.0.0 ScalaDoc<span id="doc-version"></span></span>
   <span class="close-results"><span class="left">&lt;</span> Back</span>
   <div id="textfilter">
     <span class="input">
