[tracking] Community feedback for the WDQS Split the Graph project
Open, High, Public

Description

This ticket is created as a tracking task for the WDQS Split the Graph project. Specific issues with specific queries should be reported as subtasks of this task, general discussion of the graph split can happen in comments to this task.

Event Timeline

Gehel triaged this task as High priority. Feb 6 2024, 2:52 PM
Gehel moved this task from Incoming to Current work on the Wikidata-Query-Service board.
Gehel moved this task from Incoming to Blocked/Waiting on the Discovery-Search (Current work) board.

Just here to say that I have positive feedback from querying for "collaboration platforms by creation date, with images".

Prod: https://w.wiki/94fS

Full: https://w.wiki/96$n

Main: https://w.wiki/96$q (a bit faster than prod, I see)


I also tried a very custom query: "software licenses, grouped by logical ones, by software count, with approval by OSI/FSF". URL shortening fails for this one, though, so you have to copy-paste it:

# Software licenses
# Taking in consideration license-sub-editions and counting software using them.
# Taking in consideration direct-usage by that license.
# Taking in consideration FSF and OSI approval.
# Author: [[User:Valerio Bozzolan]] and contributors
# Date: 2023
# License: CC 0, public domain
# https://phabricator.wikimedia.org/P52339
# https://www.wikidata.org/wiki/User:Valerio_Bozzolan

SELECT 
  ?count_broad_software
  ?count_exact_software
  ?license
  ?licenseLabel
  ?min_license_date
  ?approved_fsf
  ?approved_osi
WHERE
{

  # START SUB-QUERY: NO-LABEL
  {
    SELECT 
      ?license
      (COUNT (DISTINCT ?broad_software) AS ?count_broad_software)
      (SAMPLE(?count_exact_software)    AS ?count_exact_software)
      (SAMPLE(?min_license_date)        AS ?min_license_date)
    WHERE 
    {

      # START SUB-QUERY: SOFTWARE COUNTER
      {
        SELECT
          ?license
          (COUNT(DISTINCT ?software)  AS ?count_exact_software)
          (MIN   (?license_date) AS ?min_license_date)
        WHERE
        {

          # START SUB-QUERY: LICENSE
          {
            SELECT ?license WHERE {
              # This is a license.
              ?license wdt:P31/wdt:P279* wd:Q207621.
              
              # The license must not be confused with a software (it happens).
              MINUS {
                ?license wdt:P31/wdt:P279* wd:Q7397.
              }
            } GROUP BY ?license
          }
          # STOP SUB-QUERY: LICENSE

          # License must be used by software.
          ?software wdt:P275 ?license.
          wd:Q7397 ^wdt:P279*/^wdt:P31 ?software.

          # The license may have a publication date.
          OPTIONAL {
            ?license wdt:P577 ?license_date.
          }
          
        } GROUP BY ?license
      }
      # STOP SUB-QUERY: SOFTWARE COUNTER

      # License may have editions.
      # Software may use this license edition.
      OPTIONAL {
        ?child_license wdt:P629*/wdt:P279* ?license.
        ?broad_software wdt:P275 ?child_license.
        wd:Q7397 ^wdt:P279*/^wdt:P31 ?broad_software.
      }
    } GROUP BY ?license
  }
  # STOP SUB-QUERY: NO-LABEL  

  # The license may be approved by OSI / FSF.
  BIND (EXISTS{?license wdt:P790 wd:Q48413. } AS ?exists_fsf )
  BIND (EXISTS{?license wdt:P790 wd:Q845918.} AS ?exists_osi )
  BIND (IF(?exists_fsf, "✅ FSF", "❌ FSF")   AS ?approved_fsf)
  BIND (IF(?exists_osi, "✅ OSI", "❌ OSI")   AS ?approved_osi)

  # Helps get the label in your language, if not, then en language
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?count_broad_software)

(Spoiler: GNU GPL wins)

In this case, too, it is a bit faster on "main". So, thanks for this promising work.

Hi - how does the federation work? I'm experimenting with this by trying to get the list of names of authors on a scholarly article - the article data itself is in the scholarly article subgraph, but the human items for the authors are in the main one. So I need to do a federated query but it's not clear how? Can you provide an example? Do I start on the main graph and federate to the scholarly one, or vice versa?

Ok, I got federation to work - sort of. From the main query service I can query the scholarly subgraph - but if I try to use the resulting values I always get a timeout.

select ?author WHERE {
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
         wd:Q56977964 wdt:P50 ?author .
        }
}

works fine, but even

select ?author ?b ?c WHERE {
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
         wd:Q56977964 wdt:P50 ?author .
        }
  ?author ?b ?c .
} LIMIT 1

times out. What's going on here???
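
One possible explanation, not confirmed anywhere in this thread, is that the query optimizer evaluates the fully unbound pattern `?author ?b ?c` before the SERVICE call, which amounts to scanning the whole graph. A minimal sketch under that assumption: disable Blazegraph's join reordering with the `hint:Query hint:optimizer "None"` query hint, so that clauses run in the order written and the SERVICE call binds `?author` first. It is untested against the experimental endpoints and relies on the `hint:`, `wd:` and `wdt:` prefixes that WDQS predefines.

```sparql
SELECT ?author ?b ?c WHERE {
  # Assumption: the timeout comes from join order. With the optimizer
  # disabled, Blazegraph evaluates the clauses below in written order.
  hint:Query hint:optimizer "None" .

  # Runs first: fetch the authors from the scholarly subgraph.
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    wd:Q56977964 wdt:P50 ?author .
  }

  # Runs second, with ?author already bound, against the main graph.
  ?author ?b ?c .
} LIMIT 1
```

If the timeout persists with the optimizer disabled, join order is probably not the cause.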

I tried to get the federation working, but got timeouts too. The problem is that the current setup splits the graph at the statement level. That is, for statements with a given property (e.g. P2860 and P1433), some results are in one QS instance and some are in the other. That means a lot of federation-union combinations to get all results. I posted an example query that is affected (the first I tried) in this issue report: https://github.com/WDscholia/scholia/issues/2423

success criteria

I have tried to understand the graph split "experiment", but I don't understand the success criteria. My recommendation would be to work out the success criteria in more detail before starting to collect user feedback.

relation to movement strategy

In addition, I don't understand how this activity supports the Wikimedia Movement Strategy. Making it more difficult to write SPARQL queries does not seem very inclusive to me.

alternatives

I wonder if Blazegraph (in the current configuration) is still the best solution. Coincidentally, I saw a talk about another large graph, covering all software source code, with 34b nodes. The approach was to rewrite the software, which was written in Java, in Rust. I imagine rewriting Blazegraph in Rust might give a similar (one-time) performance gain and might make the split unnecessary.

Another alternative is to translate SPARQL queries to PHP code and execute it on the MediaWiki runners. Maybe some MariaDB graph query extension could also be helpful. While implementing the SPARQL endpoint would take some effort, it would eliminate the effort of syncing the data from MariaDB to Blazegraph.

I got this query rewritten at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples#Number_of_articles_with_CiTO-annotated_citations_by_year. I agree that, given the current split strategy, we have to UNION the main and scholarly articles graphs most of the time.
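
The UNION pattern described above can be sketched roughly as follows. This is only an illustration, not the rewritten query from the linked page: it reuses the item and endpoint that appear earlier in this thread, assumes it is run from the main query service, and picks P2860 simply because it is one of the properties reported as being spread over both graphs.

```sparql
# Sketch: collect all P2860 values for one item, whichever graph
# holds the statement. Run from the main endpoint.
SELECT DISTINCT ?cited WHERE {
  {
    # Statements stored in the main graph.
    wd:Q56977964 wdt:P2860 ?cited .
  }
  UNION
  {
    # Statements stored in the scholarly articles subgraph.
    SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
      wd:Q56977964 wdt:P2860 ?cited .
    }
  }
}
```

Each property that can land on either side of the split needs its own pair of branches like this, which is why the number of federation-union combinations grows so quickly in larger queries.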

@Physikerwelt thanks for your feedback.

Blazegraph is definitely not the best solution, and the work to move off of Blazegraph should be tracked under https://phabricator.wikimedia.org/T330525 (see the initial exploration we have done). The solutions you suggest might be better discussed in their own tickets as subtasks of T335067.
This particular ticket is about collecting feedback regarding use-cases that might be affected by the split. This split is one of the solutions we want to experiment with to address the scalability issues of WDQS. We are conscious of the usability issues that you raise, but at this point we are more focused on understanding the feasibility and limitations of federation with such a split. It is worth noting that one goal is to make sure that use-cases not relying on the scientific articles still work without federation.

Thank you for your response.

As a scientist, I find it hard to understand how to collect feedback without properly defined success criteria. I am also a bit concerned about discussing the strategy on a technical level, where you cannot just buy a bigger machine to mitigate the problem until a real solution is found. To me, it seems that WDQS is a non-essential service for wikipedia.org, so migrating to something new can be done with service interruptions.

On a less detailed level, it seems very hard to understand why citations should be split off when [citation needed] became a catchphrase for the Wikimedia movement at large. Overall this experiment seems to be a waste of donation money to me.

One reason is that citations are a large corpus with a fairly narrow range of schemas and uses, so it could conceivably be implemented with optimizations that can't be applied across the board. But I do think that after a split into a separate citations wikibase, we should then formally evaluate the split and consider whether merging again after a migration to a new graph db makes sense. [I would hope it does -- 1B entities is not too much for a single db in other communities and contexts.]