October 15, 2024
  • Real-Time APIs (backed by the Cassandra database) for asset metadata access don’t fit analytics use cases by data science or machine learning teams. We built a data pipeline to persist the asset data in Iceberg in parallel with Cassandra and Elasticsearch. But to build the data facts, we need the complete data set in Iceberg and not just the new changes. Hence the existing asset data was read and copied to the Iceberg tables without any production downtime (a backfill sketch follows this list).
  • The asset versioning scheme evolved to support major and minor versions of asset metadata and relation updates. Supporting this feature required a significant update to the data table design (including new tables and updates to existing table columns). Existing data was updated to be backward compatible without impacting the running production traffic.
  • An Elasticsearch version upgrade included backward-incompatible changes, so all the asset data was read from the primary source of truth and reindexed in the new indices.
  • The data sharding strategy in Elasticsearch was updated to provide low search latency (as described in a blog post).
  • Design of new Cassandra reverse indices to support different sets of queries.
  • Automated workflows are configured for media assets (like inspection), and these workflows need to be triggered for old existing assets too.
  • The asset schema evolved, requiring all asset data to be reindexed in Elasticsearch to support search/stats queries on the new fields.
  • Bulk deletion of assets related to titles whose license has expired.
  • Updating or adding metadata to existing assets because of regressions in the client application or an internal service itself.
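To make the first use case concrete, the sketch below shows one way such a backfill could be wired up. It is a minimal illustration under assumed names (keyspace asset_store, table assets, topic asset-reprocessing-events), not the platform's actual code: it streams existing asset ids out of Cassandra and publishes one reprocessing event per asset to Kafka, so downstream processors can persist each record to Iceberg without touching the live write path.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

/**
 * Minimal backfill sketch: stream existing asset rows out of Cassandra and
 * emit one reprocessing event per asset to Kafka. Keyspace, table, topic,
 * and column names are illustrative assumptions, not the actual schema.
 */
public class AssetBackfill {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (CqlSession session = CqlSession.builder()
                     .withKeyspace("asset_store").build();
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The driver pages through the full table lazily.
            for (Row row : session.execute("SELECT asset_id FROM assets")) {
                String assetId = row.getString("asset_id");
                // One event per asset: the downstream processor re-reads the
                // source of truth and persists the record to Iceberg, leaving
                // the live Cassandra/Elasticsearch write path untouched.
                producer.send(new ProducerRecord<>(
                        "asset-reprocessing-events", assetId, assetId));
            }
            producer.flush();
        }
    }
}
```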
Figure 1. Data Reprocessing Pipeline Flow
Figure 2. Cassandra Table Design
Figure 3. Cassandra Data Fetch Query
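A hedged reconstruction of the idea behind the fetch query in Figure 3 (keyspace, table, and column names are assumed): rather than one unbounded scan, the table is read in token-range slices so the work can be split across processing instances.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;

/**
 * Token-range scan sketch: fetch one slice of the assets table. The range
 * bounds come from splitting the full token ring across worker instances.
 * Keyspace, table, and column names are assumptions, not the actual schema.
 */
public class TokenRangeFetch {
    static ResultSet fetchSlice(CqlSession session, long start, long end) {
        // Bounding the scan by token range keeps each query cheap and lets
        // many instances read disjoint slices of the table in parallel.
        return session.execute(
                "SELECT asset_id, metadata FROM asset_store.assets "
                        + "WHERE token(asset_id) > ? AND token(asset_id) <= ?",
                start, end);
    }
}
```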
Figure 4. Processing clusters
  • Depending on the existing data size and use case, processing may impact the production flow, so identify the optimal event processing limits and configure the consumer threads accordingly.
  • If the data processor calls any external services, check the processing limits of those services, because bulk data processing may create unexpected traffic to those services and cause scalability/availability issues (a rate-limiting sketch follows this list).
  • Backend processing may take anywhere from seconds to minutes per event. Update the Kafka consumer timeout settings accordingly; otherwise a different consumer may try to process the same event again after the processing timeout (see the consumer configuration sketch after this list).
  • Verify the data processor module with a small data set first, before triggering processing of the complete data set.
  • Collect success and error processing metrics, because old data may have edge cases that are not handled correctly in the processors. We are using the Netflix Atlas framework to collect and monitor such metrics (a metrics sketch follows this list).
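For calls to external services, a simple client-side throttle keeps bulk processing inside the downstream team's stated limits. A sketch using Guava's RateLimiter, with an assumed budget of 50 requests per second:

```java
import com.google.common.util.concurrent.RateLimiter;

/**
 * Sketch of throttling calls to a downstream service during bulk
 * reprocessing. The 50 requests/second budget is an assumed figure;
 * it should come from the owning team's stated processing limits.
 */
public class ThrottledServiceClient {
    private final RateLimiter limiter = RateLimiter.create(50.0); // permits/sec

    void callExternalService(String assetId) {
        limiter.acquire(); // blocks until a permit is available
        // ... issue the actual request to the external service here ...
    }
}
```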
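On the consumer side, the relevant Kafka settings are max.poll.records, which caps how many events one poll hands to a processing thread, and max.poll.interval.ms, which must exceed the worst-case batch processing time or the group coordinator will rebalance and hand the same events to another consumer. A minimal configuration sketch (broker address, group id, and the specific values are assumptions to be tuned):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.Properties;

/**
 * Consumer settings sketch for long-running per-event processing.
 * All values are illustrative and should be tuned to the observed
 * per-event processing time.
 */
public class ReprocessingConsumer {
    static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "asset-reprocessing");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Hand fewer events to each poll so a batch finishes well inside the limit.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
        // Allow up to 15 minutes between polls (the default is 5) before the
        // broker assumes the consumer is dead and rebalances; a rebalance would
        // hand the same in-flight events to a different consumer.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 15 * 60 * 1000);
        return new KafkaConsumer<>(props);
    }
}
```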
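Spectator is the client library commonly used to publish metrics to Netflix Atlas; a minimal sketch of success/error counters around event processing (metric and tag names are assumptions):

```java
import com.netflix.spectator.api.DefaultRegistry;
import com.netflix.spectator.api.Registry;

/**
 * Sketch of success/error counters around event processing using Spectator,
 * a client library that publishes to Netflix Atlas. Metric and tag names
 * are assumptions for illustration.
 */
public class ProcessingMetrics {
    private final Registry registry = new DefaultRegistry();

    void process(String assetId) {
        try {
            // ... actual processing of the event ...
            registry.counter("reprocessing.events", "status", "success").increment();
        } catch (Exception e) {
            // Tagging by exception type helps surface edge cases in old data.
            registry.counter("reprocessing.events", "status", "error",
                    "exception", e.getClass().getSimpleName()).increment();
        }
    }
}
```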