Come up with a solution for consuming crawler events (original #457) #1765
Labels
💻 aspect: code
Concerns the software code in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
🟨 priority: medium
Not blocking but should be addressed soon
🧱 stack: catalog
Related to the catalog and Airflow DAGs
🐍 tech: python
Involves Python
This issue has been migrated from the CC Search Catalog repository
We're reading metadata from images on a large scale and sticking it into some Kafka topics. We ought to start incorporating this data into the data layer so we can use it in CC Search. The format of the data is documented here. In summary:
- `image_metadata_updates` topic
- `image_metadata_updates.link_rot` topic

This data can be produced continuously by the crawler, so we should prefer building streaming consumers over reading topics in batches.
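For illustration, a streaming consumer along these lines could be a long-running process that reads one event at a time rather than a scheduled batch read. Below is a minimal sketch using the kafka-python client; the broker address, consumer group name, and handling logic are assumptions, and the real message shape is whatever the format documentation linked above specifies.

```python
# Minimal sketch of a streaming consumer for the image_metadata_updates topic,
# using the kafka-python client. Broker address, group id, and the handling
# step are assumptions, not an agreed-upon design.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "image_metadata_updates",
    bootstrap_servers=["localhost:9092"],  # assumed broker address
    group_id="catalog-metadata-consumer",  # hypothetical consumer group
    auto_offset_reset="earliest",
    enable_auto_commit=False,
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value  # one crawler event, shape per the format doc
    # TODO: hand the event off to whatever storage approach we settle on
    print(event)
    consumer.commit()  # commit only after the event has been handled
```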
We now know from experience that dumping this data into the `meta_data` column en masse is not a good option, so this is a good time to start thinking about alternatives.
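One possible alternative, sketched below, would be to upsert each event into its own table keyed by the image identifier rather than merging everything into `meta_data`. The `image_crawl_metadata` table, its columns, and the `identifier` field on each event are all hypothetical here, not a settled schema.

```python
# Hypothetical alternative to appending everything to meta_data: upsert each
# crawler event into a dedicated table keyed by the image identifier. Table
# and column names are illustrative only.
import json

import psycopg2

UPSERT_SQL = """
INSERT INTO image_crawl_metadata (identifier, metadata)
VALUES (%s, %s)
ON CONFLICT (identifier)
DO UPDATE SET metadata = EXCLUDED.metadata;
"""


def store_event(conn, event):
    """Write one crawler event into the hypothetical image_crawl_metadata table."""
    with conn.cursor() as cur:
        # "identifier" is an assumed field name on the event payload.
        cur.execute(UPSERT_SQL, (event["identifier"], json.dumps(event)))
    conn.commit()
```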