Page MenuHomePhabricator

Analyze the usage of different machine translation engines in Content Translation
Closed, ResolvedPublic

Description

There must be a dashboard, probably using Dashiki, that will show how many articles were translated with each of the translation engines.

The basic query is something along these lines:

select
  cxc_origin as engine,
  count(distinct(cxc_translation_id)) as translations
from
  cx_corpora
where
  cxc_timestamp between 20180901000000 and 20181001000000
group by
  cxc_origin;

And we'll probably want a line chart with a line for each engine.

Event Timeline

Change 469390 had a related patch set uploaded (by Amire80; owner: Amire80):
[analytics/limn-language-data@master] Add scheduling for Content Translation MT engine data

https://gerrit.wikimedia.org/r/469390

Change 469390 merged by Milimetric:
[analytics/limn-language-data@master] Add scheduling for Content Translation MT engine data

https://gerrit.wikimedia.org/r/469390

Change 473056 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[operations/puppet@production] Add reportupdater job for language team metric

https://gerrit.wikimedia.org/r/473056

Change 473056 merged by Elukey:
[operations/puppet@production] Add reportupdater job for language team metric

https://gerrit.wikimedia.org/r/473056

Please review if it makes sense:
https://language-reportcard.wmflabs.org/cx2/#mt-engines

I'm taking this straight from the database. "Undefined" and "Scratch" may need some explanation. New engines will be added automatically when they will start getting used.

Etonkovidova subscribed.

@Amire80 - below are my notes (most likely not important):

(1) Running the query takes quite a long time:

[wikishared]> select cxc_origin as engine, count(distinct(cxc_translation_id)) as translations from  cx_corpora where cxc_timestamp between 20180901000000 and 20181001000000 group by cxc_origin;
+------------+--------------+
| engine     | translations |
+------------+--------------+
| Apertium   |         2175 |
| LingoCloud |          196 |
| Matxin     |           12 |
| Yandex     |         7253 |
| Youdao     |           68 |
| no-mt      |            2 |
| source     |        12573 |
| source-mt  |            5 |
| user       |        12063 |
+------------+--------------+
9 rows in set (2 min 40.82 sec)

And possible keys is NULL:

 EXPLAIN EXTENDED select cxc_origin as engine, count(distinct(cxc_translation_id)) as translations from  cx_corpora where cxc_timestamp between 20180901000000 and 20181001000000 group by cxc_origin\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: cx_corpora
         type: index
possible_keys: NULL
          key: cx_corpora_unique
      key_len: 85
          ref: NULL
         rows: 12537240
     filtered: 100.00
        Extra: Using where; Using index; Using filesort
1 row in set, 1 warning (0.00 sec)

(2) There are no undefined value for cxc_origin (and the number of scratch records is very small) .

[wikishared]> select cxc_origin as engine, count(distinct(cxc_translation_id)) as translations from  cx_corpora  group by cxc_origin order by translations  desc;
+------------+--------------+
| engine     | translations |
+------------+--------------+
| source     |       409877 |
| user       |       394384 |
| Yandex     |       206382 |
| Apertium   |        80694 |
| Youdao     |         1298 |
| LingoCloud |          502 |
| Matxin     |          265 |
| scratch    |          103 |
| source-mt  |           63 |
| no-mt      |           26 |
| TestClient |            5 |
| disable-mt |            1 |
+------------+--------------+
12 rows in set (2 min 22.71 sec)
Pginer-WMF subscribed.

This graph seems really useful. Thanks! I created a follow-up ticket with some minor adjustments (mainly labelling) that will help clarify potential confusions: T210135: Adjust MT graph to clarify the presented concepts