/gtin should expect URLs (URIs), as promoted by GS1 #3156

Open
danbri opened this issue Aug 9, 2022 · 10 comments
Labels
no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!). Queued for Staging (webschemas.org) Editorial work provisionally complete; ready for final review/checks.

Comments

@danbri
Contributor

We do not currently define /gtin as expecting URLs (URIs).

The most recent direction of the GTIN work at GS1 is all about enabling this, and the /gtin definition at Schema.org in its textual form already anticipates URI/URL values.

The lack of a declaration that the /gtin property has /rangeIncludes /URL is pure oversight.

Having talked with @philarcher, it may also be useful to include a regex for extracting the textual GTIN from a full URL/URI.

@danbri danbri self-assigned this Aug 9, 2022
@danbri danbri added the Queued for Editorial Work Editor needs to turn issues/PRs into final code and release notes. label Aug 9, 2022
@danbri
Contributor Author

@philarcher suggests that

^https?:(\/\/((([^\/?#]*)@)?([^\/?#:]*)(:([^\/?#]*))?))?([^?#]*)(((\/01\/)((\d{8}|\d{12}|\d{13}|\d{14})[^\/]+)(\/[^/]+\/[^/]+)?[/]?(\?([^?\n]*))?(#([^\n]*))?))

is what most will need for GTIN extraction. Given the complexity I am wary of putting it in the spec as-is, but for now will point to it here, where discussion or tweaks can be more easily shared. This regex should parse all GS1 Digital Link URIs that encode GTINs (there are others...). It allows a GTIN to be 8, 12, 13 or 14 digits long (but not any other length).

@danbri danbri added the Queued for Staging (webschemas.org) Editorial work provisionally complete; ready for final review/checks. label Aug 9, 2022
@danbri
Contributor Author

@philarcher @alex-jansen can you take a look at this?

Raw schema file is https://raw.githubusercontent.com/schemaorg/schemaorg/main/data/ext/pending/issue-1244.ttl

I will proceed towards staging but welcome your early review! (and anyone else's...)

@danbri danbri removed the Queued for Editorial Work Editor needs to turn issues/PRs into final code and release notes. label Aug 9, 2022
@danbri
Contributor Author

I should add that the definition already encourages GTINs as URLs in the textual part, which is why things are currently confusing and inconsistent.

@philarcher

Let me add a bit of background. The horrendously complicated RegEx covers a lot of the flexibility that we include in the GS1 identification system. We can simplify it a lot if:

  • you only want to include GTINs and not any of the other identifiers GS1 defines (locations (GLNs), shipping containers (SSCCs), assets (GIAIs) etc.)

  • you don't want to ever include anything else like batch/lot numbers, serial numbers, expiry dates etc.

  • you don't want to include those bits of URI syntax that you almost never see (port numbers, passwords and what have you)

With those important restrictions we can start to make things simpler.

Note that the idea is to convey the GTIN as a data element. By encoding it in a URI, you provide one possible place where that GTIN can be looked up - the default online place to look it up - but it doesn't need to be the only one. In other words, the domain name is not part of the identifier.

For that to work, the structure of the URI must be adhered to. You can't just use any old URL for schema:gtin that happens to include the digits and you absolutely definitely categorically cannot use just any old URL that doesn't include a GTIN at all. And btw, GTINs have structure too. They're based on an issued prefix, there is a check digit and so on. They're not just numbers. See your local GS1 for details.
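
To illustrate the point that GTINs are not just numbers, here is a minimal Python sketch of a check-digit test. It assumes the standard GS1 modulo-10 scheme (weights alternating 3 and 1, starting from the digit next to the check digit); the function name `gs1_check_digit_ok` is hypothetical and not taken from any GS1 or Schema.org library. It does not verify that the prefix was actually issued by GS1.

```python
def gs1_check_digit_ok(gtin: str) -> bool:
    """Return True if `gtin` has 8, 12, 13 or 14 ASCII digits and a valid check digit."""
    if not (gtin.isascii() and gtin.isdigit()) or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in gtin]
    payload, check = digits[:-1], digits[-1]
    # Walk the payload right to left; weights alternate 3, 1, 3, 1, ...
    total = sum(d * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(payload)))
    return (10 - total % 10) % 10 == check


print(gs1_check_digit_ok("09506000134352"))  # the GTIN used in the examples in this thread
```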

OK, so here's the structure you need:

  • http or https
  • any domain name
  • optional path elements leading to...
  • /01/{gtin} (i.e. a path element of literally /01/ followed by the GTIN itself). The 01 is the GS1 'application identifier' for GTIN.
  • No other path elements after the GTIN
  • If you add a query string, it MUST NOT include any parameter names that are all numeric

Again, this is all because I'm offering a restricted set of options to keep this simple. Chapter and verse is in the GS1 Digital Link URI Syntax standard.

So these are OK:
https://example.com/01/09506000134352
http://example.com/foo/bar/01/09506000134352

These are not OK:
https://example.com/09506000134352
https://example.com/01/09506000134352/
https://example.com/01/09506000134352/foo
https://example.com/01/09506000134352?0398=ABCD

OK, so with all that done, here's a slightly simpler regex:

^https?://([^\/?#:]*)?([^?#]*)?/01/(\d{8}|\d{12,14})(\?|$)

However, this is not foolproof. For example the fourth 'bad example' above will actually match this. That's because the regex isn't sophisticated enough to check whether the URL includes an all-numeric parameter. If you can improve on this, OK!
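
One possible way to close that gap is to pair the regex with a separate check on the query string. The Python sketch below is only an illustration under assumptions: it reads the simpler pattern as having `*` quantifiers on the host and path groups and an escaped `\?` at the end, and the names `SIMPLE` and `extract_gtin` are made up for this example.

```python
import re
from typing import Optional
from urllib.parse import parse_qsl, urlsplit

# Assumed reading of the simpler pattern: host, optional path, /01/{gtin},
# then either a query string or the end of the URL.
SIMPLE = re.compile(r"^https?://([^/?#:]*)?([^?#]*)?/01/(\d{8}|\d{12,14})(\?|$)")


def extract_gtin(url: str) -> Optional[str]:
    """Return the GTIN if `url` fits the restricted structure, else None."""
    match = SIMPLE.match(url)
    if match is None:
        return None
    # Reject all-numeric parameter names, which the regex alone cannot see.
    for name, _value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        if name.isdigit():
            return None
    return match.group(3)


for candidate in (
    "https://example.com/01/09506000134352",
    "http://example.com/foo/bar/01/09506000134352",
    "https://example.com/01/09506000134352?0398=ABCD",
):
    print(candidate, "->", extract_gtin(candidate))
```

With this extra step the fourth 'bad example' is rejected, while a URL whose query uses a non-numeric parameter name still yields the GTIN.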

The simplest case, where there is just a domain name followed by /01/{gtin} and nothing else, will be matched by this

^https?://([^\/?#:])*?/01/(\d{8}|\d{12,14})$

But that's more restrictive than we want to be.
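
A quick Python sketch of how the strict pattern behaves on the two 'OK' examples. The pattern is the one above with the redundant `\/` escape dropped; the names `STRICT` and `strict_gtin` are hypothetical.

```python
import re
from typing import Optional

# Domain name only, then /01/{gtin} and nothing else.
STRICT = re.compile(r"^https?://([^/?#:])*?/01/(\d{8}|\d{12,14})$")


def strict_gtin(url: str) -> Optional[str]:
    """Return the GTIN when `url` is exactly scheme + domain + /01/{gtin}."""
    match = STRICT.match(url)
    return match.group(2) if match else None


print(strict_gtin("https://example.com/01/09506000134352"))
print(strict_gtin("http://example.com/foo/bar/01/09506000134352"))
```

The second call returns `None` because the extra path elements before /01/ are not allowed by this pattern, even though the restricted structure described earlier permits them.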

HTH - shout in my direction if you need more.

@danbri
Contributor Author

Thanks @philarcher !

Would it make sense to nudge the path part of the URL towards using .well-known ? I appreciate the spec is already mostly baked, though...

@philarcher

That's a potential help, @danbri, yes, but there are problems with that kind of thing when we're shooting for mass adoption. Something for you and me to chew on. I'm very aware of the need to stick within URI design best practice (sovereignty of the server and all that), and use of /.well-known/ is part of that. As ever, we're navigating a path between Web purity and practical reality.

@danbri
Contributor Author

@philarcher yes - I understand those tradeoffs! But given the existence of /.well-known/ why not at least register a path name for that, so that those who want to use a .well-known URL have something available?

Would it be 'gs1'? 'gtin'? 'digital-link'?

(getting offtopic but) for those serving pages at these URLs do you say anything about what structured data you hope they'll provide?

@cyberandy

I am in favor of registering a path name and of covering both cases:

  • sites providing a proper GS1 Digital Link URI (because it can be used in many contexts other than just SEO)
  • sites providing GTIN information within the structured data markup

@github-actions

This issue is being nudged due to inactivity.

@github-actions github-actions bot added the no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!). label Nov 16, 2022
@MatthiasWiesmann
Contributor

Sorry for coming late, I have a bunch of questions:

  • Would we expect the same treatment for the GLN attribute (http://example.com/414/.…)?
  • What about the CPV? This is very important for online retail.
  • If the annotated object is an IndividualProduct, should the lot and serial id be allowed?

I wonder if it would not be useful to have a generic mechanism for attaching a digital-data-link to various objects (product, organization).


5 participants