Come up with a solution for consuming crawler events (original #457) #1765

Closed
obulat opened this issue Apr 21, 2021 · 2 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🐍 tech: python Involves Python
Comments

@obulat
Contributor

obulat commented Apr 21, 2021

This issue has been migrated from the CC Search Catalog repository

Author: aldenstpage
Date: Wed Jul 08 2020
Labels: ✨ goal: improvement, 🙅 status: discontinued

We're reading metadata from images on a large scale and sticking it into some Kafka topics. We ought to start incorporating this data into the data layer so we can use it in CC Search. The format of the data is documented here. In summary:

  • We know the dimensions, file size, and compression rate of images; these go into the image_metadata_updates topic.
  • In some cases we are able to extract EXIF metadata, which also goes into the image_metadata_updates topic.
  • We record 404s in the link_rot topic.

This data can be produced continuously by the crawler, so we should prefer building streaming consumers over reading topics in batches.
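As a rough illustration of the streaming approach, here is a minimal sketch of a per-topic event router. The topic names are the ones mentioned above; everything else is an assumption made for illustration: that messages are JSON-encoded, the field names (`identifier`, `width`, `height`), and the `EventRouter` class itself. In a real deployment the `messages` iterable would be backed by a Kafka consumer rather than an in-memory list.

```python
import json
from typing import Callable, Dict, Iterable, Tuple

# A handler receives one decoded event (a dict) and applies it to the data layer.
Handler = Callable[[dict], None]


class EventRouter:
    """Dispatch each (topic, raw_value) message to the handler for its topic."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self.skipped = 0  # messages with no handler or an undecodable payload

    def register(self, topic: str, handler: Handler) -> None:
        self._handlers[topic] = handler

    def consume(self, messages: Iterable[Tuple[str, bytes]]) -> None:
        # Messages are handled one at a time as they arrive, not in batches.
        for topic, raw in messages:
            handler = self._handlers.get(topic)
            if handler is None:
                self.skipped += 1
                continue
            try:
                # Assumption: payloads are JSON; the real format may differ.
                event = json.loads(raw)
            except ValueError:
                self.skipped += 1
                continue
            handler(event)


if __name__ == "__main__":
    sizes = {}
    dead_links = set()

    router = EventRouter()
    # Hypothetical field names, used only for this example.
    router.register(
        "image_metadata_updates",
        lambda e: sizes.update({e["identifier"]: (e["width"], e["height"])}),
    )
    router.register("link_rot", lambda e: dead_links.add(e["identifier"]))

    router.consume(
        [
            ("image_metadata_updates", b'{"identifier": "a", "width": 640, "height": 480}'),
            ("link_rot", b'{"identifier": "b"}'),
        ]
    )
    print(sizes, dead_links)
```

Keeping the routing separate from the message source means the same handlers could be exercised in tests with plain lists and later attached to whichever Kafka client the project settles on.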

We know from experience now that dumping this into the meta_data column en masse is not a good option, so this is a good time to start thinking about alternatives.
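One possible alternative, sketched here only to make the discussion concrete, is a dedicated table with typed columns keyed by image identifier, written to one event at a time. The table name `crawled_image_metadata`, its columns, and the helper functions are all hypothetical, and the in-memory SQLite database merely stands in for the real data layer.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table: one row per image, typed columns instead of a JSON blob.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS crawled_image_metadata (
            identifier TEXT PRIMARY KEY,
            width INTEGER,
            height INTEGER,
            filesize INTEGER
        )
        """
    )


def upsert_metadata(
    conn: sqlite3.Connection, identifier: str, width: int, height: int, filesize: int
) -> None:
    # A newer event for the same image replaces the older row.
    conn.execute(
        "INSERT OR REPLACE INTO crawled_image_metadata "
        "(identifier, width, height, filesize) VALUES (?, ?, ?, ?)",
        (identifier, width, height, filesize),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    upsert_metadata(conn, "a", 640, 480, 12345)
    upsert_metadata(conn, "a", 800, 600, 23456)  # later event for the same image
    print(conn.execute("SELECT * FROM crawled_image_metadata").fetchall())
```

Because each event touches a single row, this shape fits a streaming consumer better than rewriting a large metadata column in bulk.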

@krysal krysal added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Nov 18, 2022
@obulat obulat mentioned this issue Feb 17, 2023
2 tasks
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@AetherUnbound
Contributor

Closing this as the commoncrawl ingestion process is no longer used.

@AetherUnbound AetherUnbound closed this as not planned Mar 18, 2024
@obulat
Contributor Author

> Closing this as the commoncrawl ingestion process is no longer used.

I think this referred to the Polite Crawler processing the images from the provider sites and sending events with the specific data, not Common Crawl. But I agree with closing this issue since we are going to create a separate project for them.

Projects
Archived in project
Openverse
  
Backlog