
Dark archive for Commons
Open, Low, Public

Description

Every day, thousands of pictures get deleted from Commons because they are not public domain yet (any picture of recent architecture in France, for instance). Most of these pictures are probably lost forever, as the contributors are unlikely to store them safely and try again 30 years later.

Instead of deleting them, these pictures should be put in a "dark archive" with a reconsideration date. Pictures in the dark archive cannot be seen by anyone (except probably a few trusted Wikimedia employees); only the metadata can be seen (description, categories, copyright status, reconsideration date, file size), and some of this metadata can be edited. In some cases a low-quality thumbnail might be legal. -- Syced (talk) 07:50, 11 November 2015 (UTC)


Open questions from comment below: (T120454#2106676)


This card tracks a proposal from the 2015 Community Wishlist Survey: https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Commons#Dark_archive
This proposal received 52 support votes, and was ranked #13 out of 107 proposals.

Event Timeline

IMPORTANT: If you are a community developer interested in working on this task: The Wikimedia Hackathon 2016 (Jerusalem, March 31 - April 3) focuses on #Community-Wishlist-Survey projects. There is some budget for sponsoring volunteer developers. THE DEADLINE TO REQUEST TRAVEL SPONSORSHIP IS TODAY, JANUARY 21. Exceptions can be made for developers focusing on Community Wishlist projects until the end of Sunday 24, but not beyond. If you or someone you know is interested, please REGISTER NOW.
01tonythomas subscribed.
NOTE: This task is a proposed project for Google-Summer-of-Code (2016) and Outreachy-Round-12: GSoC 2016 and Outreachy round 12 are around the corner, and this task is listed under Possible-Tech-Projects for both. Projects listed for the internship programs should have a well-defined scope within the timeline of the event, a minimum of two mentors, and should take about 2 weeks for a senior developer to complete. Interested in mentoring? Please add your details to the task description. Prospective interns should go through the Life of a successful project doc to find out how to come up with a strong proposal.

Hello. I would like to work on this project for GSoC 2016. But as of now no mentors are assigned to the task. Is anyone willing to mentor this project?

I think this is a little ambitious for a GSoC project as our file handling code is rather complicated and would need (I think) some refactoring to handle the concept of non-public files. I think it could be done in a summer, provided the student was already rather familiar with the file handling code and Wikimedia's rather unique setup.

... and will you be available to mentor this one this summer, @Legoktm? If yes, this should be an easy go.

Oops, I meant a little too ambitious. I wouldn't be a good mentor for this task, and am not really interested in it.

Alright. In that case, I will move this to 'Needs-Discussion' column :)

IMPORTANT: This is a message posted to all tasks under "Need Discussion" at Possible-Tech-Projects. Wikimedia has been accepted as a mentor organization for GSoC '16. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

Summarizing a recent meeting with Community-Tech and the potential separate aspects we found:

  • Requires a search feature for deleted images, to prevent reuploads? (Or comparing the SHA (Secure Hash Algorithm) hash, but that would only match exactly the same image?)
  • Way to automatically undelete when images become legal (e.g. when copyright expires 70 years after the death of the author)? Such categories currently exist on Commons, but the work is done manually. (Plus, copyright laws might change in the meantime due to legislation, which might be an argument for keeping it manual.)
  • Wondering: Are there more ways to hide images (the media itself), e.g. via "oversight" instead of deletion?
  • Stakeholders potentially involved here: Discovery-ARCHIVED team, Multimedia team, WMF-Legal team.
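
To illustrate the SHA point in the first bullet, here is a minimal sketch, assuming a hypothetical in-memory set of digests for previously deleted files (the names `sha1_hex`, `is_exact_reupload` and `deleted_digests` are made up for illustration and are not an existing MediaWiki API). It shows that a digest comparison only catches byte-identical reuploads; any re-encoding or resizing changes the digest.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the raw file bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()


def is_exact_reupload(data: bytes, deleted_digests: set[str]) -> bool:
    """True only if a byte-identical file was deleted before."""
    return sha1_hex(data) in deleted_digests


# Hypothetical example data: stand-in bytes, not a real image.
original = b"\x89PNG example image bytes"
deleted_digests = {sha1_hex(original)}

# A single extra byte (as after re-saving or re-encoding) changes the digest.
reencoded = original + b"\x00"

print(is_exact_reupload(original, deleted_digests))
print(is_exact_reupload(reencoded, deleted_digests))
```

Catching near-duplicates (resized or recompressed copies) would need a different technique, such as perceptual hashing, which is outside this sketch.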

Trying to break this down, one potential Possible-Tech-Projects item could be building a search for deleted images, potentially with low-quality thumbnails which might be legal?

The Wikimedia-Hackathon-2016 starts tomorrow and this task is featured at T119703. We want to use T130776: Wikimedia Hackathon 2016 Opening Session to promote these projects and help recruiting volunteers to work for them.

If this task is ripe for hackathon work, please follow these instructions. If it is not ready, remove it from T119703 in order to avoid volunteers' frustration. Thank you!

It sounds like the only part of this project that is doable during the Hackathon would be adding the search feature, which is basically covered under T109561. I'm going to remove this one from the Hackathon list (T109561 is already listed).

Hi Aklapper, thanks for your edit by T142298. I am very pleased that others want such a function. I'm not a programmer; can I contribute in some other way?

Discovery-ARCHIVED and Multimedia - Your input on this task would be welcomed, especially by adding any additional questions / answers / clarifications to the task description, and helping to determine the possible/recommended next steps. Thanks. :)

I'm not sure what I can add from a discovery perspective. Detecting non-exact image duplicates is well outside the expertise of the current discovery team. The non-exact title search for deleted pages is making its way through, and may provide some help.

This seems like an interesting proposal. I echo the concerns of the idea's author - storing files digitally has the potential to create a 'dark age' in our shared culture as artifacts do not survive changing file/storage formats and what not. Idealistically, I wish very much for this archive to exist.

I have a few thoughts to share. They are my own, not vetted by the Discovery team, etc. I am not a lawyer.

Tech debt - laws change, technology marches on. This archive would have to be maintained while technology advances, for something that may never pay out in the developers' lifetime - by design. "Planting a tree knowing you'll never sit in its shade" is a nice platitude, but humans are not wired for really long time frames. I could see us spending a lot of time setting this up, managing it for a few decades, until only a few interested souls were left with an insurmountable amount of work - just being gnomes, much less technology! (Yes, I realize this is how all of Wikimedia survives).

There's a legal debt as well. We'd have to keep on top of changes in laws and update the metadata accordingly. Law is complex. In some cases the law might state 70 years past an author's death, but there are extensions, changes of law, etc. That might make it really hard to pin down the date on which a given file can be un-archived. As things move and change you'd have to keep track of every file.

Scope - On an infinite time scale every work would eventually be public domain :). So does that mean the archive would be open to everything? What goes in? Only files already uploaded but deleted? New files? Can I grab a bunch of press images from a new movie from Paramount Studios and upload them to the dark archive?

All laws are local - How would works that are under a license in another country impact our retention policies?

Content - So I take a photo, copyright law applies until x date (or my death or whatever), and I upload it under a non-public-domain license. What if there's an identifiable person in the photo? What are their rights? If we find out years later that someone did not consent to the photo, what would happen? I'm thinking about personality rights.

Explain it simply - We face a daily (hourly?) challenge in informing active members of our community on the ins and outs of contemporary copyright law. New, often inexperienced folks are met with a wall of warnings and notifications when they upload something in good faith - but incorrectly. To add to that workload we'd have to be able to succinctly explain to someone totally new to the internet and/or Commons (think of a person with limited technical proficiency) that, "Yes, you can upload that, but you can't see it for 70 years. Thanks!". :)

Legal standing - 🚨Again, I am not a lawyer.🚨 Would the transparent nature of our movement mean that if a copyright holder asked, "Hey, do you have anything in there that belongs to us?", we'd have to respond in the affirmative? (The metadata would be public, I assume.) That's another issue for works that are currently under copyright (say expiring in the near future - 2040) where the owner of the copyright changes - authors die, companies buy other companies, etc. When that happens, who does the legal work of handling Yet Another Request for review of the archive's content pertaining to company X, now a subsidiary of company Y?


I don't want cultural artifacts to go down the memory hole, but this seems far more complex than a metadata and search addition.

Hi CKoerner_WMF, thank you for your opinion. I can understand your concerns very well. Would we not be on the safe side if there were no automatic activation? A file in the "dark archive" should be activated only by admins (or the OTRS team). Based on the metadata, an interested party could submit a request to publish files. The admins (or the OTRS team) would then decide on the release, under the legal conditions in force at that time.

How would that be different from regular page deletion, then?

If a deletion is required, the discussion takes place only among the admins and the person who uploaded the file.

(There should be a different password for the "dark archive".)

That seems like a great idea actually.

I see that some discussion did happen on this task. Outreachy-13 is about to start. Can we have this project as a potential one and is anyone willing to mentor?

I agree with legoktm's assessment above that this task would be rather ambitious for an outreachy project.

As an aside, it should be noted that Commons sometimes categorizes deleted pages as "Undelete in year XXXX", as a manual way of achieving this.

There might be an automated way to do this with Wikidata. But I'm not sure how metadata is stored on Commons; I vaguely remember some discussion that these would be different?

If Wikidata could point to these dark archive files with an author attached, and the author had a date of death recorded, and that date of death indicated that the work was out of copyright, it would be possible to periodically search for persons whose authored images had passed out of copyright.

But I'm not sure how metadata is stored on Commons and how it would work?
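
The periodic search described above could look roughly like the following sketch. Everything here is hypothetical: the `ArchivedFile` record, the `review_candidates` helper and the flat `term_years=70` default are assumptions for illustration, not existing Commons or Wikidata structures. It also assumes protection runs to the end of the calendar year in which the term ends, which is not true everywhere. The output is a list of candidates for manual review, not an automatic undeletion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArchivedFile:
    """Hypothetical metadata record for a file in the dark archive."""
    title: str
    author_death_year: Optional[int]  # None if unknown or author still alive


def review_candidates(files, current_year, term_years=70):
    """Return titles whose assumed copyright term has fully elapsed.

    Assumes protection lasts until the end of the year
    author_death_year + term_years, so a file becomes a candidate
    from 1 January of the following year.
    """
    return [
        f.title
        for f in files
        if f.author_death_year is not None
        and current_year > f.author_death_year + term_years
    ]


# Made-up example records.
files = [
    ArchivedFile("A.jpg", 1940),
    ArchivedFile("B.jpg", 1990),
    ArchivedFile("C.jpg", None),
]
print(review_candidates(files, current_year=2016))
```

A real check would need per-country terms and per-work exceptions, which is part of why the comments in this thread lean towards keeping the final decision manual.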

Theoretically, this could be done rather robustly per @Mvolz, once we start sharing structured metadata as described in T68108. @Lydia_Pintscher & the Wikidata team will probably have a better sense of whether that is tangible -- but if they think structured data on Commons is likely in the next year or two, I would not rush into building the Dark Archive, because semi-automatic functions will be much easier at that point.

A semi-automatic function would be perfectly adequate. No one knows the legal conditions at the time of automatic release.

MarkTraceur moved this task from Untriaged to Desired epics on the Multimedia board.

Unfortunately, I can't judge how important the change of status is. (Is this a step towards implementation or a downgrade of priority?)

Therefore the question / suggestion: Could the project be tackled in two stages:

Stage 1: Images can be uploaded and categorized in a protected area (visible only to admins and the uploaders).

Stage 2: The legally complicated functions for automatic publication will be added later.

This would ensure that the effort is initially low and many images are not lost forever.
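
As a rough illustration of the first stage, the visibility rule could be as small as the check below. This is a sketch under stated assumptions: `User`, `ArchivedUpload` and `can_view_file` are hypothetical names, and a real implementation would have to hook into MediaWiki's existing permission and file handling code rather than stand alone like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical user record."""
    name: str
    is_admin: bool = False


@dataclass(frozen=True)
class ArchivedUpload:
    """Hypothetical record for a file held in the protected area."""
    title: str
    uploader: str


def can_view_file(user: User, upload: ArchivedUpload) -> bool:
    """The file itself is visible only to admins and to its uploader."""
    return user.is_admin or user.name == upload.uploader


# Made-up example users and upload.
upload = ArchivedUpload(title="Building.jpg", uploader="Alice")
print(can_view_file(User("Alice"), upload))
print(can_view_file(User("Bob"), upload))
print(can_view_file(User("Carol", is_admin=True), upload))
```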

We're having a new round of Outreachy programs (GSoC '17, Outreachy-14, RGSoC). Is there a consensus to define and push this project for an internship round?

Yes, I think this definitely would be of interest. We would need input from WMF legal before deciding exactly what the archive could hold, as I am pretty sure that it will not be possible to include anything that – if publicly held on Commons or elsewhere – would be an infringement of US copyright (making things 'hidden' does not authorise us to sidestep copyright restrictions and to hold for future re-use copyright-infringements such as new Paramount images). However, the archive could well be used to hold images that are public domain in the US but which are still copyright-protected in their country of origin. However, that's a legal and community issue that needn't hold up technical development.

What I'd like to see initially is an archive area where these types of images can be maintained and curated, preferably using the same live categories that are available on Commons. A front-end that is very similar to Commons would be ideal, though it should at least initially not be possible for images within the archive to be used by direct linking on any of the Wikipedias.

There should be an easy way for Commons admins (not ordinary editors) to transfer into the archive images from Commons at the same time that they are being deleted there, and to transfer them back if and when they subsequently become OK. Most commons admins use tools which perform the deletions at the same time as closing Deletion Requests, and those would ideally need to be updated. Recovery from the archive should be done manually for now, at least, as automation could result in legal difficulties, though tools ought to be available to recover multiple files within a specified category all at once.

The question has been raised as to whether it should be possible to upload files directly to the archive, in addition to having them transferred from Commons. That may ultimately be a question for the community, but I would definitely include the technical capability to receive uploads. If there were to be no direct upload capability I can envisage users deliberately uploading large numbers of images to Commons which would then each time need to go through the Deletion Request procedure, using significant admin resources. A nice feature would be the ability to restrict uploads via a new type of user permission, so that the community could pre-approve users who can be relied upon not to upload obvious copyright infringements.

It has been suggested above that the archive should be visible only to admins and uploaders. I am not sure why that should be, and would prefer the images to be visible to all users (though with restrictions on bringing them back into Commons). On the basis that the files within the archive must not in any event be copyright protected under US law (this is not a place to store "fair use" images), there will be little legal difference between the archive files and, for example, files held on the English Wikipedia which are public domain in the US but still copyright protected in the UK. That being the case, there should be no problem in principle in allowing for Wikidata links, though there may be legal / community issues.

Happy to comment further if there are specific queries.

@Sumit Because of the significant complexities involved - in legal-areas, community-areas, and technology-areas - plus advice above from experienced developers (and it being marked as "epic" on the multimedia workboard), I suggest removing Possible-Tech-Projects because this seems unsuitable for a newcomer-developer to work on. (As I understand it, possible-tech-projects should be well-defined and clearly-scoped, and this task is currently neither.)

I also believe that the legal issue is the most important one.
Only insofar as it is legal, I would like to provide files that can be released much later (probably only once I am no longer alive ...).
If that is not possible, then so be it. In no case should it create legal problems for Commons!

@srishakatux: Excuse me, what does that mean? Is this the end of this proposal? That would be a pity.