Read this in another language (Leia em outro idioma):
This project was created to explore Brazilian open data about companies.
The project followed these steps:
- Define project scope and collect data
- Explore and evaluate the data
- Define the data model
- Run ETL to model the data
- Describe and document the Project
This project's ultimate goal is to enable analyses related to Brazilian companies and provide a basis for decision-making based on competition and other observed factors, answering questions such as:
- Which city is a good place to start a business?
- Who are the potential customers (companies, or people who are partners in a company), and in which cities or regions is it possible to operate or offer products and services?
- How big is the available market?
- Who are my customers, which companies do they own, and is there any contact information available?
- Should I approach some clients differently because of an existing connection (business partnership) between them?
The datasets used initially come from the Brazilian government and are made available openly.
CNPJ Open Data
Feature | Description | Status |
---|---|---|
Update frequency | Monthly | |
Data format | CSV | |
Origin | Click here | |
Layout | Click here | |
Files and examples:
Establishments
Companies
Partners
CNAE - National Classification of Economic Activities
Agency responsible for this classification: concla
Legal Nature
Agency responsible for this classification: concla
Partner Qualification
City Code
Country Code
Simples Nacional (simplified tax regime) data
Archive: K3241.K03200Y0.D10410.ESTABELE
Field: Date registration status
Value: 0
Problem: Invalid format
"30005475";"0001";"31";"1";"";"2";"0";"0";"";"";"20180322";"6204000";"6209100,7490104";"AVENIDA";"PAULISTA";"2202";"CONJ 54-B";"BELA VISTA";"01310300";"SP";"7107";"11";"59085410";"";"";"";"";"CEFISCO@UOL.COM.BR";"";""
Archive:
Field: Date registration status
Value: 4100813
Problem: Invalid format
"18825426";"0001";"40";"1";"ALAMBIQUE SANTO ANTONIO";"8";"20150209";"73";"";"";"4100813";"5611204";"";"RUA";"DEOLINDO PERIM";"79";"";"ITAPUA";"29101811";"ES";"5703";"27";"98921990";"27";"";"";"";"JFJUNCAL@GMAIL.COM";"";""
Archive: K3241.K03200Y0.D10410.ESTABELE
Field: Complement
Value: EDIF HORTO SAO RAFAEL;BLOCO 2;ANDAR 805
Problem: The complement contains semicolons, which are also the file separator; depending on the CSV parser used, the extra separators shift the values and break the columns.
"36452531";"0001";"62";"1";"AMPPLA CREATIVE STUDIO";"2";"20200221";"0";"";"";"20200221";"1821100";"5819100,5811500,5812302,1813001,5912099,5812301,7319002,5813100";"ESTRADA";"DO MANDU";"560";"EDIF HORTO SAO RAFAEL;BLOCO 2;ANDAR 805";"SAO MARCOS";"41250400";"BA";"3849";"71";"99479533";"";"";"";"";"JONATASMA@GMAIL.COM";"";""
- Some records contain characters such as "\", which is the default escape character of the Apache Spark DataFrameReader, and they break the default parser. It was necessary to implement a custom CSV reader to prevent the file columns from being broken (producing more or fewer columns than expected). A sketch of one possible approach is shown below.
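A minimal sketch of the custom reading idea using Spark's Java API; the class name and input path are illustrative, and this is not the project's exact implementation:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class CustomCsvReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("custom-csv-read-sketch").getOrCreate();
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Read each line as plain text and split only at the ";" boundary between quoted fields,
        // so semicolons and backslashes inside field values cannot break the columns.
        JavaRDD<String[]> records = jsc.textFile("E:\\hdfs\\cnpj\\2021-04-14\\allfilesdev\\*.ESTABELE")
            .map(line -> {
                String[] fields = line.split("\";\"", -1);           // -1 keeps trailing empty fields
                fields[0] = fields[0].replaceFirst("^\"", "");       // opening quote of the first field
                int last = fields.length - 1;
                fields[last] = fields[last].replaceFirst("\"$", ""); // closing quote of the last field
                return fields;
            });

        System.out.println("Columns in the first record: " + records.first().length);
        spark.stop();
    }
}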
- Take care of fields that may be null and evaluate whether the field contains values such as the literal string "null" or an empty string ("")
// Maps null or the literal string "null" to null and strips the surrounding quotes
public static String fixStringValues(String value) {
    if (value == null || value.equals("null"))
        return null;
    return value.replaceAll("^\"|\"$", "");
}
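For example (illustrative inputs, not taken from the project):
fixStringValues("\"SP\"");   // -> SP
fixStringValues("null");     // -> null
fixStringValues("\"\"");     // -> "" (empty string, which still needs an explicit empty check downstream)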
- It was necessary to handle integer values that contained leading zeros, for example 0001. In these cases, the leading zeros were removed before converting to an integer. It was also necessary to check for null and empty values before attempting the conversion.
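A minimal sketch of this integer cleanup, reusing the fixStringValues helper shown above (the method name fixIntegerValues is illustrative, not the project's exact code):
// "0001" -> 1; null, "null" or "" -> null
public static Integer fixIntegerValues(String value) {
    String cleaned = fixStringValues(value);
    if (cleaned == null || cleaned.isEmpty())
        return null;
    // Remove leading zeros, but keep a single digit when the value is all zeros (e.g. "0")
    String withoutLeadingZeros = cleaned.replaceFirst("^0+(?=\\d)", "");
    return Integer.valueOf(withoutLeadingZeros);
}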
- To convert monetary values it was necessary to use the Brazilian locale format (pt-BR), as in the sample below
//sample (the method name is illustrative): parses values such as "000000010000,00"
//requires java.text.NumberFormat, java.text.ParseException and java.util.Locale
public static String fixMonetaryValues(String value) throws ParseException {
    String numberString = fixStringValues(value);
    if (numberString == null || numberString.isEmpty())
        return null;
    NumberFormat nf = NumberFormat.getInstance(new Locale("pt", "BR"));
    return nf.parse(numberString).toString();
}
- Invalid dates were treated as null, as in the sketch below
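A minimal sketch of that rule, assuming dates arrive as yyyyMMdd strings (the method name fixDateValues is illustrative, not the project's exact code):
// "20180322" -> 2018-03-22; null, "", "0" or any unparsable value -> null
public static java.time.LocalDate fixDateValues(String value) {
    String cleaned = fixStringValues(value);
    if (cleaned == null || cleaned.isEmpty() || cleaned.equals("0"))
        return null;
    try {
        return java.time.LocalDate.parse(cleaned, java.time.format.DateTimeFormatter.ofPattern("yyyyMMdd"));
    } catch (java.time.format.DateTimeParseException e) {
        return null; // invalid dates are treated as null
    }
}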
- Run Spark job to process CSV files and create ORC files
- You need to download the csv files from: http://200.152.38.155/CNPJ/
- Unzip the files to some folder
# Running on spark cluster or local mode
spark-submit --class application.batch.App opendata_etl-1.0.jar --spark-conf spark.app.name=brasil-open-etl --input-path E:\\hdfs\\cnpj\\2021-04-14\\allfilesdev\\ --input-type cnpj_raw --input-format csv --output-type cnpj_lake --output-format orc
# or locally
java -jar opendata_etl-1.0.jar --spark-conf spark.master=local[*],spark.app.name=brasil-open-etl --input-path E:\\hdfs\\cnpj\\2021-04-14\\allfilesdev\\ --input-type cnpj_raw --input-format csv --output-type cnpj_lake --output-format orc
Sample result:
After saving the Spark output to S3, manually run the Airflow DAG to load the ORC files into Redshift.
Data Model
The main reason for organizing the data this way was the end users of this database. The raw files and the layout provided by the Brazilian government are already organized in this way, and reflecting that organization in the analytical database tables simplifies use by end users.
Other reasons:
- Each company can have N branches (establishments), and all dim_company data would be repeated for each branch. Since there are many company records (approximately 1.9 GB), keeping them in a separate table was considered a good arrangement for this scenario. In the 100x growth scenario it matters even more, because it would otherwise mean duplicating approximately 185 GB of data for the company table.
- The same logic applies to dim_simple_national, although it is a smaller table.
- Keeping the records of companies under special tax regimes (Simples Nacional and MEI) in a separate dim_simple_national table allows totals for these companies to be computed without using the larger fact_establishment table
- For analyses that require joining the fact_establishment, dim_company, and dim_simple_national tables, the data is distributed by the join key of these tables to speed up those joins
- The downside of this approach is that an extra join is needed to filter by legal nature, company size, or Simples Nacional status
Results for the questions mentioned in scope:
1 - How big is my market?
- Let's assume your potential customers are service companies
- They provide services related to installation, maintenance, repair, or rental of machines
-- Possible customers:
create temporary table possible_clients
as
select * from dim_cnae dc
where dc.code not in (1821100,1822901,2539001,5912001,9603303,9603305)
and(
dc.description like '%Instalação%' or dc.description like '%manutenção%'
or dc.description like '%Serviços de%' or dc.description like '%Reparação%'
or dc.description like '%Aluguel de máquinas%' or dc.description like '%Aluguel de outras máquinas%'
);
-- registration status 2 = ACTIVE (only active clients)
select count(0) as total from fact_establishment e
where e.mainCnae in (select code from possible_clients) and e.registrationstatus=2;
-- 1968732
-- total by state
select e.state, count(0) as total from fact_establishment e
where e.mainCnae in (select code from possible_clients) and e.registrationstatus=2
group by e.state;
/*
state|total |
-----|------|
SP |613669|
MG |227137|
RJ |179489|
PR |142464|
RS |136611|
SC | 98002|
BA | 87542|
GO | 64873|
PE | 51019|
CE | 41683|
ES | 40517|
MT | 35882|
DF | 35304|
PA | 30723|
MS | 28049|
RN | 24213|
PB | 22624|
MA | 17791|
AL | 16544|
AM | 16116|
SE | 13443|
TO | 12362|
PI | 11762|
RO | 11397|
AP | 2694|
RR | 2668|
AC | 2601|
EX | 1553|
*/
2 - Who are my customers, what companies do they own, is there any contact information available?
select
e.fantasyname,
e.basiccnpj,
dp.partnername,
dp.partnerstartdate,
dp.agerange,
e.taxpayeremail,
e.telephone1areacode,
e.telephone1
from fact_establishment e
join dim_partner dp on dp.basiccnpj = e.basiccnpj
where
e.mainCnae in (select code from possible_clients)
and e.registrationstatus=2
and len(e.fantasyname)> 1
limit 5;
/*
fantasyname |basiccnpj|partnername |partnerstartdate|agerange|taxpayeremail |telephone1areacode|telephone1|
------------------------------------|---------|----------------------------------|----------------|--------|---------------------|------------------|----------|
INSTITUTO ESTER GOMES - UNIDADE SEDE|00000138 |REGILENE GOMES RODRIGUES | 2013-07-18| 7|sample@email.com |11 |99999999 |
INSTITUTO ESTER GOMES - UNIDADE SEDE|00000138 |REGILENE GOMES RODRIGUES | 2013-07-18| 7|sample@email.com |11 |99999999 |
D A LOGISTICA S/A |00001164 |LEONARDO GUILHERME LOURENCO MOISES| 2013-05-20| 5|sample@email.com |81 |99999999 |
D A LOGISTICA S/A |00001164 |LEONARDO GUILHERME LOURENCO MOISES| 2013-05-20| 5|sample@email.com |81 |99999999 |
D A LOGISTICA S/A |00001164 |GUILHERME AUGUSTO MACHADO | 2018-01-22| 7|sample@email.com |81 |99999999 |
*/
3 - Which city is a good place to start a business?
-- total by state and city
select e.state, c.description as city, count(0) as total
from fact_establishment e
join dim_city_code c on c.code=e.cityCode
where e.mainCnae in (select code from possible_clients) and e.registrationstatus=2
group by e.state, c.description order by count(0) desc limit 10;
/*
state|city |total |
-----|--------------|------|
SP |SAO PAULO |206613|
RJ |RIO DE JANEIRO| 86442|
MG |BELO HORIZONTE| 54512|
PR |CURITIBA | 35675|
DF |BRASILIA | 35304|
BA |SALVADOR | 28054|
RS |PORTO ALEGRE | 23443|
CE |FORTALEZA | 23303|
GO |GOIANIA | 21265|
SP |CAMPINAS | 18149|
*/
Distribution strategy:
Node size: 1 x dc2.large (160 GB storage) with 2 vCPU (2 slices), 15 GiB RAM
(https://docs.aws.amazon.com/pt_br/redshift/latest/mgmt/working-with-clusters.html)
-- https://docs.aws.amazon.com/pt_br/redshift/latest/dg/r_SVV_TABLE_INFO.html
SELECT
"schema",
"table",
tbl_rows,
"size", -- the size in blocks of 1 MB
diststyle,
sortkey1
from SVV_TABLE_INFO;
/*
schema |table |tbl_rows|size|diststyle |sortkey1 |
---------|-------------------------|--------|----|--------------|-------------|
open_data|dim_partner |40666844|1236|KEY(basiccnpj)|basiccnpj |
open_data|dim_cnae | 1358| 5|ALL |AUTO(SORTKEY)|
open_data|fact_establishment |48085895|4525|KEY(basiccnpj)|basiccnpj |
open_data|dim_company |45485995|1850|KEY(basiccnpj)|basiccnpj |
open_data|dim_simple_national |27600101| 456|KEY(basiccnpj)|basiccnpj |
open_data|dim_country_code | 255| 5|ALL |AUTO(SORTKEY)|
open_data|dim_city_code | 5571| 5|ALL |AUTO(SORTKEY)|
open_data|dim_legal_nature | 88| 5|ALL |AUTO(SORTKEY)|
open_data|dim_partner_qualification| 68| 5|ALL |AUTO(SORTKEY)|
*/
-- total size: 1236 + 5 + 4525 + 1850 + 456 + (4 * 5) = 8092 MB (8 GB)
As we know the frequent access pattern, we defined the following strategy:
- Was not used "Even" distribution because all tables can be joined, and the cost is high to do this operation- Was used "All" distribution for small tables to speed up joins (dim_city_code, dim_cnae, dim_country_code, dim_legal_nature, dim_partner_qualification)
- Was used KEY distribution for larger tables to put similar values in the same slice and speed up queries
- Was distributed dim_company, dim_simple_national, dim_partner and fact_establishment on the joining key (basiccnpj) to eliminates shuffling. This column is good because all information about one company is distributed between these tables and this information is joined by this key.
Sorting strategy:
Sort keys were used to minimize query time.
"Sorting enables efficient handling of range-restricted predicates. Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans." (https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html)
Frequent queries:
- fact_establishment, dim_company and dim_simple_national by: city, cnae, country, legal nature, matrix (headquarters/branch), company size, etc;
- dim_partner by: partnertype, partnerqualification, country, agerange;
SORTKEY considerations:
- Tables were sorted by their DISTKEY to speed up joins with the related tables (dimension and fact tables)
- Tables were also sorted by the frequent query filters (city, cnae, country, legal nature, matrix, company size, partner qualification, etc.) to reduce query time
- A COMPOUND SORTKEY was chosen because the data is sorted in the same order as the sort key columns and table filters will most likely use those columns; the INTERLEAVED type is not a good choice for columns such as dates and auto-incrementing IDs
Redshift
Redshift was chosen as the data warehouse for the following reasons:
- Amazon Redshift is a fully managed (no maintenance required), petabyte-scale, cloud-based data warehouse service that handles the work needed to set up, operate, and scale a data warehouse, for example provisioning infrastructure capacity, automating ongoing administrative tasks such as backups and patching, and monitoring nodes and drives to recover from failures. Redshift also has automatic tuning capabilities and surfaces recommendations for managing your warehouse in Redshift Advisor
- On-premises data warehouses require significant time and resources to administer, especially for large datasets. In addition, the financial costs associated with building, maintaining, and growing self-managed, on-premises data warehouses are very high
- Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets
- Amazon Redshift integrates with various data loading, ETL, business intelligence reporting and analytics tools
- Can easily support thousands of concurrent users and concurrent queries, with consistently fast query performance
- Can load data in columnar formats and execute queries in a parallel and optimized way (Massively Parallel Processing - MPP)
- Supports optimized file formats like ORC and PARQUET
- Redshift lets you easily save the results of your queries back to your S3 data lake using open formats, like Apache Parquet
- Allows scaling up and down
See also:
- https://aws.amazon.com/redshift/features/concurrency-scaling/?nc=sn&loc=2&dn=3
- https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
- https://aws.amazon.com/redshift/faqs/?nc1=h_ls
Airflow
Airflow was used to orchestrate the jobs for several reasons:
- Allows scheduling, monitoring, and viewing the whole history of executions
- Allows visualizing workflow executions and logs
- Stable and widely used tool
- Allows customization through the creation of custom plugins
- Allows scaling up
Apache Spark
Apache Spark was used for several reasons:
- The dataset is not small and requires a tool that can handle it
- Apache Spark is a unified analytics engine for large-scale data processing, stable and widely used tool
- Spark offers over 80 high-level operators that make it easy to build parallel apps
- Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine;
- Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources;
See also: https://spark.apache.org/
A logical approach to this project under the following scenarios:
- If the data was increased by 100x.
- If the pipelines would be run on a daily basis by 7 am every day.
- If the database needed to be accessed by 100+ people.
All the tools used in this project can handle the scenarios above because they are able to process data at large scale and allow resource consumption to be increased or decreased as needed.
Data increased by 100x:
Current storage consumption in Redshift:
-- https://docs.aws.amazon.com/pt_br/redshift/latest/dg/r_SVV_TABLE_INFO.html
SELECT
"schema",
"table",
tbl_rows,
"size", -- the size in blocks of 1 MB
diststyle,
sortkey1
from SVV_TABLE_INFO;
/*
schema |table |tbl_rows|size|diststyle |sortkey1 |
---------|-------------------------|--------|----|--------------|-------------|
open_data|dim_partner |40666844|1236|KEY(basiccnpj)|basiccnpj |
open_data|dim_cnae | 1358| 5|ALL |AUTO(SORTKEY)|
open_data|fact_establishment |48085895|4525|KEY(basiccnpj)|basiccnpj |
open_data|dim_company |45485995|1850|KEY(basiccnpj)|basiccnpj |
open_data|dim_simple_national |27600101| 456|KEY(basiccnpj)|basiccnpj |
open_data|dim_country_code | 255| 5|ALL |AUTO(SORTKEY)|
open_data|dim_city_code | 5571| 5|ALL |AUTO(SORTKEY)|
open_data|dim_legal_nature | 88| 5|ALL |AUTO(SORTKEY)|
open_data|dim_partner_qualification| 68| 5|ALL |AUTO(SORTKEY)|
*/
-- total size: 1236 + 5 + 4525 + 1850 + 456 + (4 * 5) = 8092 MB (8 GB)
If the data increased by 100x, the numbers would be:
- CSV source files: 17 GB x 100 = 1700 GB (1.7 TB)
- Lake files (ORC format, the output of the Spark processing): 5 GB x 100 = 500 GB (0.5 TB)
- Redshift storage consumption: 8 GB x 100 = 800 GB
Redshift recommendations for this scenario:
- For datasets under 1 TB, the recommendation is to use DC2 nodes and choose the number of nodes based on data size and performance requirements
- For 800 GB we can continue to use dc2.large and increase the number of nodes to 10, giving a total storage of 10 x 160 GB = 1600 GB (1.6 TB). Double the data size is provisioned because Redshift mirrors each node's data on another node's disks to reduce the risk of data loss
- Increasing the number of nodes also increases query performance, because Amazon Redshift distributes and executes queries in parallel across all of a cluster's compute nodes
- The impact on cost at this size would be 10 x $0.25 = $2.50 per hour. Considering 720 hours in a month, the total cost would be 720 x $2.50 = $1,800
- We can reserve nodes for steady-state production workloads and receive significant discounts over on-demand rates (at least 20 percent)
- We can pause the cluster when it is not in use, leaving it on only during working hours to reduce costs
- Considering 8 working hours per day (240 hours per month) and at least a 20 percent discount over on-demand rates, we would have 240 x $2.50 = $600; subtracting the discount, the total cost at the end of the month would be $480 ($600 - $120)
- We can also try to reduce the number of nodes to 7 to lower costs (testing before running in production, of course)
- We can also try to use Redshift Spectrum to query directly on s3 and reduce the number of nodes (test before running in production, of course)
- Another possibility would be to assess the costs and response time for using AWS Athena as a way to query the data stored in s3
- It will not be necessary to replicate the cluster: the database is intended for internal use and analysis, so it is acceptable to take a few hours to re-establish the cluster if a problem makes it unavailable. Besides the usage pattern, another factor that limits replicating the cluster is the cost of the service.
- DC2 nodes store data locally for high performance, and as the data size grows, we can add more compute nodes to increase the storage capacity of the cluster
- AWS recommends DC2 node types for the best performance at the lowest price
- When you run a cluster with at least two compute nodes, data on each node is mirrored on disks of another node to reduce the risk of incurring data loss.
See also:
- https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html
- https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
- https://aws.amazon.com/redshift/pricing/
S3 and Spark job impact (data increased by 100x):
The current data distribution of the largest table (fact_establishment):
select
e.state
,count(0) as total_by_state
,(count(0)*100.0)/48085895.0 as percentage_of_total
from open_data.fact_establishment e group by e.state
order by count(0) desc;
/*
state|total_by_state|percentage_of_total|
-----|--------------|-------------------|
SP | 13771427| 28.6392236226|
MG | 5277500| 10.9751518610|
RJ | 4141697| 8.6131224135|
RS | 3496187| 7.2707121287|
PR | 3221563| 6.6996007872|
BA | 2458328| 5.1123681902|
SC | 2140080| 4.4505358588|
GO | 1648965| 3.4292072550|
PE | 1471800| 3.0607728108|
CE | 1370095| 2.8492658813|
ES | 995624| 2.0705115294|
PA | 974539| 2.0266629122|
MT | 875544| 1.8207917311|
DF | 805872| 1.6759010100|
MA | 750845| 1.5614662054|
MS | 649279| 1.3502483420|
PB | 610967| 1.2705742505|
RN | 562458| 1.1696943563|
AM | 503559| 1.0472072943|
AL | 463822| 0.9645697558|
PI | 431652| 0.8976686406|
RO | 345790| 0.7191090027|
TO | 326762| 0.6795381473|
SE | 319961| 0.6653947067|
EX | 148921| 0.3096978854|
AC | 117169| 0.2436660480|
AP | 112170| 0.2332700680|
RR | 93318| 0.1940652243|
BR | 1| 0.0000020796|
*/
select
e.maincnae
,count(0) as total_by_maincnae
,(count(0)*100.0)/48085895.0 as percentage_of_total
from open_data.fact_establishment e group by e.maincnae
order by count(0) desc limit 10;
/*
maincnae|total_by_maincnae|percentage_of_total|
--------|-----------------|-------------------|
4781400| 2822601| 5.8699146600|
9492800| 2798437| 5.8196629177|
8888888| 1808469| 3.7609136733|
5611203| 1590811| 3.3082695039|
4712100| 1415944| 2.9446140079|
9602501| 1276983| 2.6556290571|
5611201| 855265| 1.7786192811|
4399103| 779004| 1.6200259972|
7319002| 692591| 1.4403204931|
9430800| 607662| 1.2637011331|
*/
-- Brazilian National Classification of Economic Activities
select * from open_data.dim_cnae limit 3;
/*
code |description |
------|---------------------------------------------------------|
111302|Cultivo de milho | (corn cultivation)
111399|Cultivo de outros cereais não especificados anteriormente| (Cultivation of other cereals)
112102|Cultivo de juta | (Jute cultivation)
*/
select count(0) from open_data.dim_cnae;
-- 1358
To define which partitioning strategy to use, the following best practices were taken into account:
- If the cardinality of a column will be very high, do not use that column for partitioning. For example, if you partition by a column userId and if there can be 1M distinct user IDs, then that is a bad partitioning strategy.
- Amount of data in each partition: You can partition by a column if you expect data in that partition to be at least 1 GB.
(https://docs.databricks.com/delta/best-practices.html)
Considering the increase in data size by 100x and the daily execution, partitioning could be done as follows:
- Since the most frequent analyses on this database consider economic activity (one activity or a restricted group of activities), and taking into account the distribution of the data, the main economic activity code can be used as the main partition (see the sketch after this list).
- This gives a maximum of 1,358 partitions.
- Partitioning by economic activity proved better than partitioning by geographic distribution because the geographic distribution would make the data unevenly distributed. São Paulo, for example, would have 28% of the data.
- Partitioning by economic activity can be used to distribute data from dim_company, dim_partner, dim_simple_national and fact_establishment tables because the data fetching logic is the same.
- Other tables do not need partitioning because they are small.
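A minimal sketch of this partitioning when writing the lake files with Spark; the class name, dataset variable, and output path are illustrative, and "mainCnae" matches the establishment column used in the queries above:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class PartitionByCnaeSketch {
    // Writes the establishment data partitioned by the main economic activity code,
    // which has at most ~1,358 distinct values (one folder per CNAE code).
    public static void writePartitioned(Dataset<Row> establishments, String outputPath) {
        establishments.write()
            .mode(SaveMode.Overwrite)
            .partitionBy("mainCnae")
            .orc(outputPath); // e.g. an s3:// path under the cnpj_lake output
    }
}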
The pipelines would be run on a daily basis by 7 am every day
- There is no need to run the CNPJ pipeline on a daily basis (the CNPJ data source is released monthly), but we will assume this scenario for the purposes of this project.
It will be necessary to adjust the Airflow job:
- A new task will need to be added to the current DAG (cnpj_dag.py) to automate the Spark job that processes the CNPJ CSV files and saves the output to S3
- The new Spark task will call the EMR API to run the work on a Spark cluster, because Airflow itself is not meant to do this kind of processing
- Configure the DAG schedule (cnpj_dag.py) to run daily at 7 am (schedule_interval="0 7 * * *")
- There will be an increase in cost due to the daily EMR run. The cluster does not need to run 24 hours a day and will always be shut down after the processing routine finishes. (It will be necessary to simulate the increased data size and run the Spark job to assess the total execution time in this scenario.)
- There is no need to increase the Airflow infrastructure because all the heavy lifting is done in the Spark cluster or the Redshift cluster
- We can study and test running Airflow and Spark on the same Kubernetes cluster to assess cost savings
- The impact of running the routine daily would be to provide updated information with a maximum delay of one day, allowing more accurate analyses and insights
See also:
- https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html
- https://en.wikipedia.org/wiki/Cron#CRON_expression
- https://www.upsolver.com/blog/partitioning-data-s3-improve-performance-athena-presto
The database needed to be accessed by 100+ people:
- For this project in particular, replicating the cluster does not make sense because the database is intended for internal use and the additional cost would not be justified (see the replication discussion above).
If necessary, we can:
- Activate the Concurrency Scaling feature, which can support virtually unlimited concurrent users and concurrent queries with consistently fast query performance
- Resize the Amazon Redshift cluster by adding more nodes or changing node types with no downtime
- Use Elastic Resize, which adds or removes nodes within minutes
- Spin up multiple Amazon Redshift clusters by restoring data from a snapshot
- Create different query queues and prioritize them according to usage, using either automatic or manual WLM (workload management)
See also:
- https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling.html
- https://aws.amazon.com/blogs/big-data/scale-your-amazon-redshift-clusters-up-and-down-in-minutes-to-get-the-performance-you-need-when-you-need-it/
- https://aws.amazon.com/premiumsupport/knowledge-center/redshift-wlm-etl-queues/
Population data
Collect data related to the Brazilian population to support analyses related to starting a business or detecting opportunities.
Information related to land and real estate values
Collect data on the purchase and sale prices of properties and land to support analyses related to starting a business, detecting opportunities, and assessing competition.
Complaints related to companies
Collect data on complaints opened or reported by customers on sites like Reclame Aqui ("Complain Here") and consumer protection agencies (Procon) so that companies can be analyzed from this perspective. This can measure the level of customer satisfaction.
Weather information
Collect information related to temperature and weather to support analyses of starting a business and detecting opportunities and risks.
Data related to available workforce
Data on education and educational level.
CNAE - Detailed structure and explanatory notes
Describes in more detail which activities are included or not in each CNAE. This dataset can be used to more accurately locate companies by activities they may or may not perform.
The PDF follows a pattern in the way it describes which activities are or are not included in each code, which allows programmatic extraction.
- https://github.com/jonatasemidio/multilanguage-readme-pattern/blob/master/README.md
- https://github.com/tiimgreen/github-cheat-sheet/blob/master/README.md
- https://udacity.github.io/git-styleguide/
- https://shields.io/
- https://tabletomarkdown.com/convert-spreadsheet-to-markdown/
- https://github.com/georgevbsantiago/qsacnpj/
- https://www.kaggle.com/hugomathien/soccer/home
- https://www.w3schools.com/python/python_intro.asp
- https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala
- https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/util/TextFileSuite.scala
- https://github.com/wuga214/TEACHING_OSU_DocAnalysis_Fall2015_Assign4/blob/master/code/a4example/BadRecordCount.java