Building a Content Analytics Reporting System
Explore how we modernized the client’s application by developing a dedicated content analytics system.
Our customer
Trinity Audio is a company that specializes in developing an AI-driven ecosystem of solutions that help publishers and content creators manage audio experiences. These solutions span voice editing, content discovery, virtual assistant skills, and data analytics, among many other features.
The obstacles they faced
The customer wanted to effortlessly generate dynamic, real-time reports for their solution through comprehensive analysis of large volumes of content-performance data, such as loads and clicks.
How we helped
Romexsoft helped to develop a scalable yet cost-effective reporting system with a flexible data-pipeline architecture. The system was designed to process and analyze the content data required by the client’s solution.
Generating Analytical Insights from Large Data Volumes
The main challenge was to analyze a real-time data inflow arriving concurrently at extremely high speed, with new events every second. Along with ingesting huge amounts of data at any given moment, Trinity Audio faced another pressing need: to accommodate, store, and manage historical data.
For instance, the client wanted insights into the top-performing articles published by a specific domain within the last 24 hours, while simultaneously accessing in-depth reports that examine historical data spanning several years.
Data Management and Processing Optimized for Content Analysis
Data streaming
- Integration of Apache Kafka was the opening move to handle real-time data effectively. This approach delivers horizontal scaling, ensuring that as data volumes grow, Kafka can handle the load without major architectural changes.
- Apache Spark Streaming was employed to consume and process the real-time data flowing through Kafka. Spark’s inherent ability to process large data volumes with low latency was instrumental in handling live stream data for this type of solution (a minimal ingestion sketch follows this list).
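The sketch below shows what this ingestion step can look like with Spark Structured Streaming reading from Kafka. The broker address, topic name, event schema, and checkpoint location are illustrative assumptions rather than the client’s actual configuration, and the job also requires the spark-sql-kafka connector package on the classpath.

```python
# Illustrative PySpark Structured Streaming job consuming content events from Kafka.
# Broker, topic, schema, and checkpoint path are assumptions for this sketch.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("content-events-ingestion").getOrCreate()

# Hypothetical schema for content-performance events (loads, clicks).
event_schema = StructType([
    StructField("event_type", StringType()),    # e.g. "load" or "click"
    StructField("article_id", StringType()),
    StructField("domain", StringType()),
    StructField("event_time", TimestampType()),
])

# Subscribe to the raw event stream in Kafka.
raw_events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker
              .option("subscribe", "content-events")             # assumed topic
              .option("startingOffsets", "latest")
              .load())

# Kafka delivers the payload as bytes; parse the JSON value into typed columns.
events = (raw_events
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# A console sink stands in here for the real storage layer described further below.
query = (events.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/content-events")
         .start())
query.awaitTermination()
```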
Data storage
- We used Apache Hive as a data warehouse for the gathered historical data. It manages and processes the information into a readable, structured format for querying and analysis.
- The processed and aggregated data are then stored in PostgreSQL as the source for the reports generated by the system.
- Raw data are stored in the Amazon S3 object storage service to keep the reporting solution cost-effective (see the sketch after this list).
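A minimal sketch of this storage split could look as follows: raw events kept as Parquet files on S3, with hourly aggregates written to PostgreSQL for the reporting layer. Bucket names, table names, hosts, and credentials are placeholders, not the client’s setup, and the PostgreSQL JDBC driver is assumed to be available to Spark.

```python
# Sketch of the storage layer: raw events stay on S3 as cheap Parquet files,
# hourly aggregates go to PostgreSQL for report generation.
# All paths, hosts, and credentials below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count, col

spark = SparkSession.builder.appName("content-storage").getOrCreate()

# Raw events previously landed on S3 (assumed bucket and layout).
events = spark.read.parquet("s3a://example-bucket/raw/content-events/")

# Aggregate loads and clicks per article, per domain, per hour.
hourly = (events
          .groupBy(window(col("event_time"), "1 hour").alias("hour"),
                   col("domain"), col("article_id"), col("event_type"))
          .agg(count("*").alias("events"))
          .select(col("hour.start").alias("hour_start"),
                  "domain", "article_id", "event_type", "events"))

# Store the aggregates in PostgreSQL, the source for the generated reports.
(hourly.write
 .format("jdbc")
 .option("url", "jdbc:postgresql://reports-db:5432/analytics")  # assumed host/db
 .option("dbtable", "content_hourly_metrics")
 .option("user", "reporter")
 .option("password", "example")
 .option("driver", "org.postgresql.Driver")
 .mode("append")
 .save())
```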
Data processing
- Trino (formerly PrestoSQL) provides the ability to join historical content-performance datasets (from Hive, PostgreSQL databases, and raw data in S3) with advertising data from relational databases (a federated query sketch follows this list).
- Amazon QuickSight reports showcase the required content metrics, supporting data-driven decision-making on the client’s side.
- Custom dashboards, fed with data from the PostgreSQL and Hive databases, represent how the solution is used and which content its users consume.
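As an illustration, the hedged sketch below runs such a federated query through the Trino Python client, answering the earlier example of top-performing articles for one domain over the last 24 hours. The coordinator host, catalogs, schemas, and table names are placeholders rather than the client’s actual deployment.

```python
# Hedged sketch of a federated Trino query: content metrics from the Hive
# catalog joined with advertising data from a PostgreSQL catalog.
# Host, catalogs, schemas, and tables are placeholder names.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",   # assumed coordinator host
    port=8080,
    user="analytics",
    catalog="hive",
    schema="content",
)
cur = conn.cursor()

# Top-performing articles for one domain over the last 24 hours,
# enriched with advertising revenue held in a relational database.
cur.execute("""
    WITH top_articles AS (
        SELECT article_id, SUM(events) AS total_events
        FROM hive.content.content_hourly_metrics
        WHERE domain = 'example.com'
          AND hour_start >= current_timestamp - INTERVAL '24' HOUR
        GROUP BY article_id
    )
    SELECT t.article_id, t.total_events, a.revenue_usd
    FROM top_articles AS t
    LEFT JOIN postgresql.public.ad_revenue AS a
           ON a.article_id = t.article_id
    ORDER BY t.total_events DESC
    LIMIT 10
""")
for article_id, total_events, revenue in cur.fetchall():
    print(article_id, total_events, revenue)
```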
Technology stack
- Apache Kafka
- Apache Spark
- Apache Hive
- Trino
- PostgreSQL
- Amazon S3
- Amazon QuickSight