This project, undertaken as part of Ca’ Foscari University of Venice's "Software Performance and Scalability" course, focuses on deploying and optimizing a web application built to enable users to query a local copy of the IMDb dataset. The goal is to design the application with scalability in mind, identify performance bottlenecks, and apply optimizations for improved response times and resource efficiency.
The web application provides a platform where users can perform queries on a local copy of the IMDb dataset. The application is built using Docker and Ruby on Rails, with PostgreSQL as the database. It aims to simulate a realistic server environment and handle queries on movie and TV show data efficiently. This project emphasizes identifying inefficiencies and implementing strategies to improve the application's scalability and responsiveness.
- Model the IMDb dataset: Normalize the database schema for optimized queries.
- Performance Bottleneck Identification: Identify and analyze bottlenecks in database queries, network latency, and resource management.
- Optimization and Testing: Implement optimizations for query performance, CPU usage, and caching, followed by extensive load testing.
The project leverages Docker to containerize the application, creating isolated environments for both the web application and the database. The containers communicate over a virtual network and are configured to optimize resource allocation.
- Web Application Container: Hosts the Ruby on Rails application, which handles client requests, processes data, and returns JSON responses.
- Database Container: Hosts PostgreSQL and manages data storage, indexing, and query handling for the IMDb dataset.
Docker enables efficient deployment and resource management for the application and database. Containers allow easy scaling and ensure consistent performance across different environments.
- Configuration: Two Docker containers are configured — one for the web application and another for PostgreSQL. Both connect over a Docker-managed network.
- Resource Allocation: CPU cores and memory are carefully allocated to each container, supporting performance-focused resource distribution.
The Ruby on Rails framework powers the web application, utilizing the Model-View-Controller (MVC) design pattern to organize code and simplify database interactions.
- ActiveRecord ORM: Rails uses ActiveRecord for Object-Relational Mapping (ORM), translating database rows into Ruby objects and simplifying complex queries.
- Workflow:
- User Request: Rails listens for HTTP requests on port 3000.
- ORM to SQL Translation: ActiveRecord converts ORM commands to SQL for PostgreSQL.
- Response Formatting: Rails converts query results to JSON and sends it back to the client.
PostgreSQL serves as the relational database, providing powerful indexing, parallel processing, and scalability features. These capabilities are essential in handling the large IMDb dataset efficiently.
- Indexing: Indexes were added after initial testing to optimize query performance.
- Parallelization: PostgreSQL’s ability to parallelize queries was crucial in the optimization phase.
The application was tested on a server environment and a testing machine, each with specific configurations to simulate real-world scenarios and gather performance data.
- CPU: AMD Ryzen 7 5800X (8 cores, 16 threads)
- RAM: 16GB DDR4
- Storage: SSD M.2 PCIe Gen4
- Operating System: Ubuntu Desktop 24.04 LTS
The testing machine, equipped with superior processing power, runs extensive performance tests on the application:
- CPU: Intel Core i9-12900K (16 cores, 24 threads)
- RAM: 32GB DDR5
- Storage: SSD M.2 PCIe Gen4
- Operating System: Ubuntu Desktop 24.04 LTS
Due to ISP restrictions, we utilized ZeroTier to establish a peer-to-peer connection between the server and the testing machine, simulating a VPN setup. This setup impacts latency, introducing an average round-trip time (RTT) of ~34ms.
The IMDb dataset consists of seven TSV files covering a wide range of information:
- Title Basics: General information about titles.
- Title Akas: Alternative titles.
- Title Crew: Information on directors and writers.
- Title Episodes: Episode and season information.
- Title Principals: Cast information.
- Title Ratings: IMDb ratings and votes.
- Name Basics: Information on individuals involved in titles.
The database schema was normalized to improve scalability, reducing redundancy and ensuring efficient query performance.
Through load testing, we identified several key bottlenecks:
- Query Processing: Complex, unoptimized queries were taking excessive time.
- Resource Allocation: Inefficient distribution of CPU cores and memory affected performance.
CPU cores were allocated to balance load between the application and database containers, significantly improving response times.
Rails’ caching mechanisms were used to store frequently requested data in memory, reducing database load and speeding up response times for common queries.
Indexes were added to frequently queried columns, drastically improving query performance by reducing data retrieval time.
SQL queries were manually optimized to eliminate unnecessary joins, reduce complexity, and improve overall database efficiency.
After applying the above optimizations, we conducted extensive load testing using JMeter to simulate different user workloads. The final results indicated:
- Increased Query Efficiency: Indexed queries performed significantly faster, reducing response times by up to 40%.
- Reduced Bottlenecks: Optimizations in resource allocation and caching minimized delays in high-traffic scenarios.
- Scalability: The application demonstrated improved scalability, efficiently handling higher loads with consistent performance.