Come up with a solution for consuming crawler events (original #457) #1765

obulat · 2021-04-21T12:16:34Z

This issue has been migrated from the CC Search Catalog repository

Author: aldenstpage
Date: Wed Jul 08 2020
Labels: ✨ goal: improvement,🙅 status: discontinued

We're reading metadata from images on a large scale and sticking it into some Kafka topics. We ought to start incorporating this data into the data layer so we can use it in CC Search. The format of the data is documented here. In summary:

- The image_metadata_updates topic records the dimensions, file size, and compression rate of images.
- In some cases we are able to extract EXIF metadata, which also goes into the image_metadata_updates topic.
- We record 404s in the link_rot topic.

This data can be produced continuously by the crawler, so we should prefer building streaming consumers over reading topics in batches.
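
To make the streaming-versus-batch distinction concrete, here is a minimal sketch of a consumer loop that handles each record as it arrives and dispatches on the topic name. Only the two topic names come from this issue. The payload fields (`identifier`, `width`, `height`), the handler functions, and the in-memory `state` dict standing in for the data layer are all assumptions for illustration. The loop takes any iterable of `(topic, raw_bytes)` pairs, so a real Kafka client iterator could be adapted to feed it.

```python
import json
from typing import Callable, Dict, Iterable, Tuple

# Topic names are taken from the issue; everything else here is hypothetical.
IMAGE_METADATA_TOPIC = "image_metadata_updates"
LINK_ROT_TOPIC = "link_rot"

Handler = Callable[[dict, dict], None]


def new_state() -> dict:
    # In-memory stand-in for whatever storage the data layer ends up using.
    return {"dimensions": {}, "dead_links": set()}


def handle_image_metadata(event: dict, state: dict) -> None:
    # Assumed payload fields: "identifier", "width", "height".
    state["dimensions"][event["identifier"]] = (event["width"], event["height"])


def handle_link_rot(event: dict, state: dict) -> None:
    # Assumed payload field: "identifier" of the image whose URL returned a 404.
    state["dead_links"].add(event["identifier"])


HANDLERS: Dict[str, Handler] = {
    IMAGE_METADATA_TOPIC: handle_image_metadata,
    LINK_ROT_TOPIC: handle_link_rot,
}


def consume(records: Iterable[Tuple[str, bytes]], state: dict) -> int:
    """Process records one at a time as they arrive; return how many were handled."""
    handled = 0
    for topic, raw in records:
        handler = HANDLERS.get(topic)
        if handler is None:
            continue  # Skip topics this consumer does not know about.
        handler(json.loads(raw), state)
        handled += 1
    return handled
```

Because `consume` never needs the full set of records up front, it works the same on a finite list in a test and on an unbounded stream in production.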

We know from experience now that dumping this into the meta_data column en masse is not a good option, so this is a good time to start thinking about alternatives.

AetherUnbound · 2024-03-18T18:37:48Z

Closing this as the commoncrawl ingestion process is no longer used.

obulat · 2024-03-19T09:06:30Z

> Closing this as the commoncrawl ingestion process is no longer used.

I think this referred to the Polite Crawler processing the images from the provider sites and sending events with the specific data, not Common Crawl. But I agree with closing this issue since we are going to create a separate project for them.

AetherUnbound added the 🐍 tech: python Involves Python label Jan 24, 2022

krysal added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Nov 18, 2022

obulat mentioned this issue Feb 17, 2023

Polite Crawler #417

Closed

2 tasks

obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023

obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023

AetherUnbound closed this as not planned Won't fix, can't repro, duplicate, stale Mar 18, 2024
