Commit ad453cc

Remove checking multimedia content type (#12)
* #10 - Remove getting the content type from the URL header, as this causes performance issues
* Fix warnings
* Fix error in test
* Mock the mimetype test when run in a GitHub Action; the Python image for GitHub Actions does not support mimetype
* Resolve deprecation warning for Python 3.12
* Minor fix
* Update readme
* Update build version
* Update build version to v0.2.0
1 parent bd01465 commit ad453cc

10 files changed: +398 -214 lines changed

.github/workflows/publish-release.yml

+1 -1
@@ -35,7 +35,7 @@ jobs:
         run: |
           echo ${{ github.workspace }}
           cd ${{ github.workspace }}/tests
-          poetry run pytest
+          poetry run pytest --github-action-run=True
       - name: Build
         id: build-step
         run: |

.github/workflows/publish-test.yml

+1 -1
@@ -33,7 +33,7 @@ jobs:
         run: |
           echo ${{ github.workspace }}
           cd ${{ github.workspace }}/tests
-          poetry run pytest
+          poetry run pytest --github-action-run=True
       - name: Build
         id: build-step
         run: |

.github/workflows/run-tests.yml

+1 -1
@@ -48,5 +48,5 @@ jobs:
         run: |
           echo ${{ github.workspace }}
           cd ${{ github.workspace }}/tests
-          poetry run pytest
+          poetry run pytest --cov=dwcahandler --github-action-run=True

README.md

+4 -3
@@ -61,15 +61,15 @@ pip install -i https://test.pypi.org/simple/ dwcahandler
 ### Examples of dwcahandler usages:

 * Create Darwin Core Archive from csv file
-* In creating a dwca with multimedia extension, provide format and type values in the Simple Multimedia extension, otherwise, dwcahandler will attempt to fill these info by guessing the mimetype from url or extracting content type of the url which will slow down the creation of dwca depending on how large the dataset is.
+* In creating a dwca with multimedia extension, provide format and type values in the Simple Multimedia extension, otherwise, dwcahandler will attempt to fill these info by guessing the mimetype from url.

 ```python
 from dwcahandler import CsvFileType
 from dwcahandler import DwcaHandler
 from dwcahandler import Eml

 core_csv = CsvFileType(files=['/tmp/occurrence.csv'], type='occurrence', keys=['occurrenceID'])
-ext_csvs = [CsvFileType(files=['/tmp/multimedia.csv'], type='multimedia')]
+ext_csvs = [CsvFileType(files=['/tmp/multimedia.csv'], type='multimedia', keys=['occurrenceID'])]

 eml = Eml(dataset_name='Test Dataset',
           description='Dataset description',
@@ -81,6 +81,7 @@ DwcaHandler.create_dwca(core_csv=core_csv, ext_csv_list=ext_csvs, eml_content=em
 ```

 * Create Darwin Core Archive from pandas dataframe
+* In creating a dwca with multimedia extension, provide format and type values in the Simple Multimedia extension, otherwise, dwcahandler will attempt to fill these info by guessing the mimetype from url.

 ```python
 from dwcahandler import DwcaHandler
@@ -92,7 +93,7 @@ core_df = pd.read_csv("/tmp/occurrence.csv")
 core_frame = DataFrameType(df=core_df, type='occurrence', keys=['occurrenceID'])

 ext_df = pd.read_csv("/tmp/multimedia.csv")
-ext_frame = [DataFrameType(df=ext_df, type='multimedia')]
+ext_frame = [DataFrameType(df=ext_df, type='multimedia', keys=['occurrenceID'])]

 eml = Eml(dataset_name='Test Dataset',
           description='Dataset description',

poetry.lock

+246 -145
Some generated files are not rendered by default.

pyproject.toml

+3 -2
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "dwcahandler"
-version = "0.2.0.b2"
+version = "0.2.0"
 description = "Python package to handle Darwin Core Archive (DwCA) operations. This includes creating a DwCA zip file from one or more csvs, reading a DwCA, merge two DwCAs, validate DwCA and delete records from DwCA based on one or more key columns"
 authors = ["Atlas of Living Australia data team <support@ala.org.au>"]
 maintainers = ["Atlas of Living Australia data team <support@ala.org.au>"]
@@ -14,6 +14,7 @@ pandas = "^2.2.0"
 requests = "^2.32.0"
 pytest = "^8.2.0"
 pytest-mock = "^3.12.0"
+pytest-cov = "^5.0.0"
 metapype = "^0.0.26"
 flake8 = "^7.1.1"

@@ -25,4 +26,4 @@ requires = ["poetry-core"]
 build-backend = "poetry.core.masonry.api"

 [tool.pytest.ini_options]
-pythonpath = "src"
+pythonpath = "src"

src/dwcahandler/dwca/core_dwca.py

+29 -39
@@ -18,7 +18,6 @@
 from zipfile import ZipFile

 import pandas as pd
-import requests
 from numpy import nan
 from pandas.errors import EmptyDataError
 from pandas.io import parsers
@@ -651,7 +650,7 @@ def get_content(self, name_space):
     def add_multimedia_info_to_content(self, multimedia_content: DfContent):
         """
         Attempt to populate the format and type from the url provided in the multimedia ext if none is provided
-        :param multimedia_content: Multimedia content type derived from the extension of this Dwca class object
+        :param multimedia_content: Multimedia content derived from the extension of this Dwca class object
         """
         def get_media_format_prefix(media_format: str):
             media_format_prefixes = ["image", "audio", "video"]
@@ -678,59 +677,50 @@ def get_media_type(media_format: str):

         def get_multimedia_format_type(row: dict):
             url = row['identifier']
-            mime_type = mimetypes.guess_type(url)
             media_format = None
-            if mime_type and len(mime_type) > 0 and mime_type[0]:
-                media_format = mime_type[0]
-            else:
+            if url:
                 try:
-                    # Just check header without downloading content
-                    response = requests.head(url, allow_redirects=True)
-                    if 'content-type' in response.headers:
-                        content_type = response.headers['content-type']
-                        if get_media_format_prefix(content_type):
-                            media_format = content_type
-
+                    mime_type = mimetypes.guess_type(url)
+                    if mime_type and len(mime_type) > 0 and mime_type[0]:
+                        media_format = mime_type[0]
                 except Exception as error:
-                    log.error("Error getting header info from url %s: %s", url, error)
+                    log.error("Error getting mimetype from url %s: %s", url, error)

             media_type = ''
-            if 'type' not in row or not row['type']:
+            if 'type' not in row or not row['type'] or row['type'] is nan:
                 media_type = get_media_type(media_format)
             else:
                 media_type = row['type']

-            row['format'] = media_format if media_format else nan
-            row['type'] = media_type if media_type else nan
+            row['format'] = media_format if media_format else None
+            row['type'] = media_type if media_type else None
             return row

-        def populate_format_type(row: dict):
-            return get_multimedia_format_type(row)
+        if len(multimedia_content.df_content) > 0:

-        multimedia_df = multimedia_content.df_content
+            multimedia_df = multimedia_content.df_content

-        if 'format' in multimedia_df.columns:
-            multimedia_without_format = multimedia_df[multimedia_df['format'].isnull()]
-            if len(multimedia_without_format) > 0:
-                multimedia_without_format = multimedia_without_format.apply(
-                    lambda row: populate_format_type(row),
-                    axis=1)
-                multimedia_df.update(multimedia_without_format)
-        else:
-            multimedia_df = multimedia_df.apply(
-                lambda row: populate_format_type(row), axis=1)
+            if 'format' in multimedia_df.columns:
+                multimedia_without_format = multimedia_df[multimedia_df['format'].isnull()]
+                if len(multimedia_without_format) > 0:
+                    multimedia_without_format = multimedia_without_format.apply(
+                        lambda row: get_multimedia_format_type(row),
+                        axis=1)
+                    multimedia_df.update(multimedia_without_format)
+            else:
+                multimedia_df = multimedia_df.apply(lambda row: get_multimedia_format_type(row), axis=1)

-        multimedia_without_type = multimedia_df
-        # In case if the type was not populated from format
-        if 'type' in multimedia_df.columns:
-            multimedia_without_type = multimedia_df[multimedia_df['type'].isnull()]
-            multimedia_without_type = multimedia_without_type[multimedia_without_type['format'].notnull()]
+            multimedia_without_type = multimedia_df
+            # In case if the type was not populated from format
+            if 'type' in multimedia_df.columns:
+                multimedia_without_type = multimedia_df[multimedia_df['type'].isnull()]
+                multimedia_without_type = multimedia_without_type[multimedia_without_type['format'].notnull()]

-        if len(multimedia_without_type) > 0:
-            multimedia_without_type.loc[:, 'type'] = multimedia_without_type['format'].map(lambda x: get_media_type(x))
-            multimedia_df.update(multimedia_without_type)
+            if len(multimedia_without_type) > 0:
+                multimedia_without_type.loc[:, 'type'] = multimedia_without_type['format'].map(lambda x: get_media_type(x))
+                multimedia_df.update(multimedia_without_type)

-        multimedia_content.df_content = multimedia_df
+            multimedia_content.df_content = multimedia_df

     def _extract_media(self, content, assoc_media_col: str):
         """Extract embedded associated media and place it in a media extension data frame
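
Reviewer note: the core_dwca.py change above drops the `requests.head` fallback and relies only on `mimetypes.guess_type`, which looks at the URL's file extension and makes no network call. The following is a minimal standalone sketch of that idea, not the package's implementation. The names `is_missing`, `guess_format`, `fill_format_and_type` and the `ASSUMED_TYPES` mapping are hypothetical; the body of `get_media_type` is not shown in this diff, so the type values below are an assumption. The sketch also tests for NaN with `math.isnan` rather than `is nan`, since an identity check only matches the one `numpy.nan` object.

```python
import math
import mimetypes
from typing import Optional

# Assumed mapping from MIME prefix to a Simple Multimedia `type` value.
# get_media_type is not shown in the diff, so these values are illustrative.
ASSUMED_TYPES = {"image": "StillImage", "audio": "Sound", "video": "MovingImage"}


def is_missing(value) -> bool:
    """True for None, an empty string, or any float NaN."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def guess_format(url) -> Optional[str]:
    """Guess a MIME type from the URL's file extension; no network access."""
    if is_missing(url):
        return None
    media_format, _encoding = mimetypes.guess_type(url)
    return media_format


def fill_format_and_type(row: dict) -> dict:
    """Return a copy of row with format and type filled in where missing."""
    filled = dict(row)
    if is_missing(filled.get("format")):
        filled["format"] = guess_format(filled.get("identifier"))
    if is_missing(filled.get("type")):
        media_format = filled.get("format")
        prefix = media_format.split("/")[0] if media_format else None
        # Unknown or absent prefixes leave the type as None.
        filled["type"] = ASSUMED_TYPES.get(prefix)
    return filled
```

With this sketch, a row whose `identifier` ends in `.jpg` gets `image/jpeg` as its format, while a URL with no file extension leaves both fields as `None`. That second case is exactly what the removed header lookup used to cover, which is why the README now asks callers to supply `format` and `type` themselves.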

tests/conftest.py

+5
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
2+
def pytest_addoption(parser):
3+
parser.addoption(
4+
"--github-action-run", action="store", default=False, help="Set this to True if it's been called from github action"
5+
)
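
Reviewer note: because the option is declared with `action="store"`, a value passed on the command line arrives as the string `"True"`, while the default is the boolean `False`. A string such as `"False"` would therefore be truthy if tested directly. A small normalising helper along these lines may be worth considering; `as_bool` is a hypothetical name and is not part of this commit.

```python
def as_bool(option_value) -> bool:
    """Interpret a command-line option value as a boolean."""
    if isinstance(option_value, bool):
        # The declared default (False) is already a bool.
        return option_value
    # Values given on the command line arrive as strings.
    return str(option_value).strip().lower() in {"true", "1", "yes"}
```

In a test or fixture, the raw value can be read with `request.config.getoption("--github-action-run")` and passed through the helper before deciding whether to mock the mimetype lookup.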
