Multilingual Issues in Information Retrieval and Resource Description
Overview

Yuri Demchenko, TERENA
demchenko@terena.nl
2000. Yu.Demchenko. TERENA


In this presentation

  - Multilingual Issues in TERENA Technical Programme
  - Multilinguality: trends and developments
  - Technical Issues/Background
  - Data presentation and resource description format
  - Standards Overview
  - Metadata and Cataloging
  - Recent Development in Subject Gateways and SE
  - Cross-language Information Retrieval
  - REIS/TAP Initiatives Multilinguality Framework


TERENA Multilingual Community and TERENA Technical Programme

  - TERENA has 43 members from 34 countries speaking 30 languages
  - Multilingual issues have always been within the scope of the TERENA Technical Programme
      - WG-i18n - WG on Internationalisation issues
      - C3 Project on messaging transliteration tools
      - MAITS - initiated by WG-i18n
      - Multilingual E-Mail Agent Testing
      - Multilingual issues in Subject Gateways, Section 2.13 in SG Handbook
  - Multilingual Support in Internet/IT Applications.
    Information page - http://www.terena.nl/projects/multiling/
  - Liaison with STD bodies
      - CEN/TC304 Character set technology - http://www.stri.is/TC304/default.html
      - IETF


Multilinguality: trends and developments

  - Storing, processing, presentation and exchange of information in many languages
      - Interactive (protocol based/negotiated) applications and non-interactive
        (resource description and information presentation)
  - Multilingual Search and Retrieval
      - Multilingual Subject Gateways and Search Engines
      - CLIR testing at TREC
  - Data Resource Model and Multilinguality
      - One or Multiple languages
      - Data format
      - Metadata (not part of Data but part of Resource)
      - References, links
      - Professional Thesauri (Resource Context) - base for multiple languages
        and language unification


Internet Applications

  - Non-interactive application: Electronic Mail
      - Correct Message Composition and Rendering
  - Interactive applications
      - WWW: HTTP/HTML
          - http-equiv="Content-Type" Content="text/html; charset=euc-jp"
      - Content Negotiation Protocol
          - Media features, attributes
          - Direct and hop-by-hop communication
  - Operational Applications
      - XML/DOM
      - (Internationalised) DNS
      - LDAP and X.500 (Language Support ?)


I18n and ML issues at IETF and other STD bodies

  - IETF Architectural Model of Multilingual support in Internet Applications - RFC 2130
      - Language and Charset/Encoding tagging
  - Content negotiation framework (IETF/W3C)
      - Point-to-point vs hop-by-hop
      - Message based vs Interactive vs Streaming
  - Internationalised DNS (IDN) - Internationalised Domain Names
      - vs E-Mail (SMTP, IMAP)
      - vs Routing (Routing Policy Specification Language (RPSL))
      - vs Network Management (SNMP textual presentation)
      - vs Network Security (TLS and IPSec)
  - Content Encoding normalisation (IETF/Unicode)
  - LSD-2 - Large Scale Services Deployment
      - IMAP language extension


IETF Architectural Model of Multilingual support in Internet Applications

  - User Interface
      - Presentation
      - Culture
      - Locale
      - Language
  - On-the-wire
      - Coded Character Set - Repertoire of ISO-10646
      - Character Encoding Scheme - UTF-8 (ml-text), US-ASCII (e-mail), ISO8859-1
      - Transfer Encoding Scheme (Base64, QP)


Content Negotiation Framework (IETF/W3C)

  - Content Negotiation covers three elements
      - Expressing the capabilities of the sender and the data resource to be transmitted
      - Expressing the capabilities of a receiver
      - A protocol by which capabilities are exchanged
  - Abstract framework for content negotiation

                (Content)        (Transmit.data)      (Data document)
        [Author]----->-----[Sender]----->-----[Receiver]----->-----[User]

  - Transparent Content Negotiation in HTTP - RFC 2295
  - Protocol-independent Content Negotiation Framework - RFC 2703
      - Non-message resource transfer
      - End-to-end vs hop-by-hop negotiation
      - Use of directory and resolution services
  - CC/PP exchange protocol based on HTTP Extension Framework (W3C)
      - Composite Capability/Preference Profile: a user-side framework for content negotiation
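
The three elements of content negotiation listed above (sender capabilities, receiver capabilities, an exchange protocol) can be illustrated with a small sketch. It assumes the receiver states its language preferences in an HTTP `Accept-Language` style header with `q` weights, which the slides do not mention explicitly; this is simple server-driven selection, not the transparent negotiation of RFC 2295. The function names `parse_accept_language` and `choose_variant` are hypothetical.

```python
def parse_accept_language(header):
    """Return (language_tag, q) pairs, most preferred first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((tag.strip().lower(), q))
    # sorted() is stable, so equal weights keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def choose_variant(header, available):
    """Pick the language variant the sender should transmit, or None."""
    by_lower = {tag.lower(): tag for tag in available}
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        if tag in by_lower:
            return by_lower[tag]
        primary = tag.split("-")[0]  # fall back from "en-gb" to "en"
        if primary in by_lower:
            return by_lower[primary]
    return None


# Receiver capabilities: prefers British English, then German, then French.
# Sender capabilities: the resource exists in "en", "de" and "fr".
print(choose_variant("de;q=0.8, en-GB, fr;q=0.5", ["en", "de", "fr"]))
```

Falling back from a regional tag to its primary language is a design choice of this sketch; a stricter sender could return no match and let the receiver decide.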


Charset and Language tagging

  - MIME types (RFC 2045-2049)
      - text, image, audio, video
  - Charset = Character Set + Character Encoding Scheme
  - Transfer Encoding Scheme
      - base64
      - quoted-printable
  - Other media attributes and features (e.g., resolution, color, language, etc.)
  - Language
      - RFC 1766
      - ISO 639-2


WWW: HTTP/HTML

  - The HTTP header carries the type of the transferred information and, for
    text-based information, the character encoding:

        Content-Type: text/html; charset=euc-jp

  - The Content-Language entity header field describes the natural language(s)
    of the intended audience for the enclosed document:

        Content-Language: se

  - Character encoding information in the META information of the HTML document:

        http-equiv="Content-Type" Content="text/html; charset=euc-jp"


XML: Character Set tagging

  - Character is the atomic unit of text
      - All ISO 10646 characters + TAB, CR, LF
      - The encoding mechanism can vary from entity to entity
      - All XML processors must accept UTF-8 and UTF-16
  - Character Encoding declaration in XML documents or entities (section 4.3.3)

        EncodingDecl ::= S 'encoding' Eq ( '"' EncName '"' | "'" EncName "'" )

        <?xml encoding='UTF-8'?>
        <?xml encoding='EUC-JP'?>

  - Default Character Set Encoding - UTF-8 and UTF-16
  - Autodetection of Character Encoding


XML: Language tagging

  - Language identification (section 2.12)
      - Labelling the language of the whole document, entity or item
  - Tag for identification of languages

        LanguageID ::= Langcode ('-' Subcode)*
        Langcode   ::= ISO639Code | IanaCode | UserCode

  - Examples:

        <p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
        <p xml:lang="en-GB">What colour is it?</p>
        <p xml:lang="en-US">What color is it?</p>
        <sp who="Faust" desc='leise' xml:lang="de">
          <l>Habe nun, ach! Philosophie,</l>
          <l>Juristerei, und Medizin</l>
          <l>und leider auch Theologie</l>
          <l>durchaus studiert mit heißem Bemüh'n.</l>
        </sp>
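
The xml:lang labelling above applies to an element and everything inside it until a nested element overrides it. A minimal sketch of reading these labels, assuming Python's standard `xml.etree.ElementTree` parser and a hypothetical wrapper element `doc` around the slide's examples:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(elem, inherited=None):
    """Yield (tag, effective language) for elem and all its descendants."""
    lang = elem.get(XML_LANG, inherited)  # own label, else the parent's
    yield elem.tag, lang
    for child in elem:
        yield from languages(child, lang)


DOC = """<?xml version="1.0" encoding="UTF-8"?>
<doc xml:lang="en">
  <p>The quick brown fox jumps over the lazy dog.</p>
  <p xml:lang="en-GB">What colour is it?</p>
  <sp who="Faust" xml:lang="de"><l>Habe nun, ach! Philosophie,</l></sp>
</doc>"""

# Parse from bytes so the encoding declaration is honoured by the parser.
root = ET.fromstring(DOC.encode("utf-8"))
for tag, lang in languages(root):
    print(tag, lang)
```

The first `p` carries no label of its own and inherits "en" from `doc`; the `l` line inherits "de" from `sp`.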


Unicode Technical Reports

  - The Unicode Standard, Version 3.0 - Just published! -
    http://www.unicode.org/unicode/uni2book/u2.html
  - Unicode 2.0 test page
    http://www.terena.nl/projects/multiling/euroml/tests/test-ucspages1ucs.html
  - Multilingual European Subsets of ISO/IEC 10646-1
    http://www.stri.is/TC304/p10_1998_05_30.pdf
  - Unicode Technical Reports
      - UTR #15: Unicode Normalization Forms, Version 18.0
          - I-D by Martin Duerst
      - UTR #17: Character Encoding Model
      - UTR #16: UTF-EBCDIC
      - UTR #10: Unicode Collation Algorithm
      - UTR #7: Plane 14 Characters for Language Tags


Language Definition in DC Metadata set - DC.Language Format

    <meta name = "DC.Language" content = "en">
    <meta name = "DC.Language" scheme = "rfc1766" content = "en">
    <meta name = "DC.Language" scheme = "ISO639-2" content = "eng">
    <meta name = "DC.Language" scheme = "rfc1766" content = "en-US">
    <meta name = "DC.Language" content = "zh">
    <meta name = "DC.Language" content = "ja">
    <meta name = "DC.Language" content = "es">
    <meta name = "DC.Language" content = "german">
    <meta name = "DC.Language" lang = "fr" content = "allemand">


Language Definition in DC Metadata set - Field content language labelling/attributing

  A work in Spanish may be assigned the following metadata:

    <meta name = "DC.Language" scheme = "rfc1766" content = "es">
    <meta name = "DC.Title" lang = "es" content = "La Mesa Verde y la Silla Roja">
    <meta name = "DC.Title" lang = "en" content = "The Green Table and the Red Chair">


DC in Multiple Languages

  - The reference language of the international DC community is English; however,
    the semantics of DC elements can in principle be expressed equally well in any
    modern language
  - The versions of DC elements in various languages should share a single
    name space using tokens that look like English words but stand for universal
    elements - http://purl.org/dc/elements/1.1/
  - DC in Multiple Languages Registry project -
    http://purl.org/dc/groups/languages.htm
      - Uses RDF schemas to share machine-readable tokens for translation of DC
        terms in multiple languages (26 languages to date)
      - Linkage to and from central DC namespace server
      - Registry as Dictionary/Thesauri - use Interlinguas to link different translations
      - Formal recognition and standardization procedure


Document Description with Unqualified DC and RDF syntax

  [RDF/XML example; the markup was lost in extraction. Surviving element
  values: "Das Erdbeben in Chili", "Heinrich von Kleist"]

  - XML Encoding (Character set) declaration
  - UTF-8/UTF-16 as default encoding


Recent Developments in Subject Gateways, Indexing, Searching

  - NRENs projects
      - Subject gateways
  - Commercial Search Engines
  - Multilingual Text Retrieval and Processing
      - TUSTEP system - using "fuzzy" multilingual searching
  - Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8
    Conferences by NIST


Multilingual Subject Gateway (DESIRE)

  - Developing multilingual subject gateways (SOSIG as example)
      - SOSIG accepts resources in any language, evaluated for quality
      - Translation should be coherent and checked
      - Different language versions should be equally well maintained
  - SOSIG Cataloguing rules
      - TITLE will be displayed in the first language
      - ALTERNATIVE TITLE in other languages
      - DESCRIPTION will mention the different languages in which the resource is available
      - URI of all language versions
      - Labelling URI language
  - Library standards for multilingual provision
      - NISO Z39.53 Language codes
      - USMARC Language codes
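
Recording a title in several languages, as in the DC.Title example for the Spanish work and the SOSIG TITLE / ALTERNATIVE TITLE rule, can be read back mechanically. The following is a rough sketch only, assuming the metadata is embedded as HTML `<meta>` elements and using Python's standard `html.parser`; the class `DCMetaParser` and the helper `titles_by_language` are hypothetical names, not part of any Dublin Core tool.

```python
from html.parser import HTMLParser


class DCMetaParser(HTMLParser):
    """Collect Dublin Core <meta> elements from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.records = []  # (name, scheme, lang, content) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.lower().startswith("dc."):
            self.records.append(
                (name, attr.get("scheme"), attr.get("lang"), attr.get("content"))
            )


def titles_by_language(html):
    """Map each lang label to the DC.Title given in that language."""
    parser = DCMetaParser()
    parser.feed(html)
    parser.close()
    return {
        lang: content
        for name, _scheme, lang, content in parser.records
        if name.lower() == "dc.title"
    }


SAMPLE = (
    '<meta name="DC.Language" scheme="rfc1766" content="es">\n'
    '<meta name="DC.Title" lang="es" content="La Mesa Verde y la Silla Roja">\n'
    '<meta name="DC.Title" lang="en" content="The Green Table and the Red Chair">\n'
)

print(titles_by_language(SAMPLE))
```

A gateway could then display the title whose label matches DC.Language first and list the others as alternative titles.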


Multilingual provision in popular Internet Search Engines

  - Multilingual SE
      - AltaVista - http://www.altavista.com/ - 28 languages
          - Documents indexed as is
          - Automatic translation - very simple and naive
      - Euroseek - http://www.euroseek.com/ - 30 languages
      - FAST Advanced Search - http://www.alltheweb.com - 31 languages
      - Google - http://www.google.com/ - 11 languages
  - Other sites that have dedicated national sites
      - interface language
      - language resources
      - no special language policy
      - Excite - 11 countries
      - Lycos - 23 countries


TUSTEP TUebingen System of Text Processing Programs

  1. File structure
  2. Multilingual capabilities
  3. Internal data presentation
  4. Database publishing/output data presentation
  5. CGI
  6. Sample implementation
       http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit
       Try entries like Smith or Meier or...
       http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery


Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8

  - TREC - Text REtrieval Conference - http://trec.nist.gov/
  - Cross-Language Information Retrieval (CLIR) technologies
      - Using Intermediary or Interlingual representation
          - Latent Semantic Indexing
          - Generalised Vector Space Model, etc.
      - Computer translation
      - Machine-readable bilingual dictionaries
      - Multilingual Thesauri
  - Participants: ETH/Eurospider, IBM, Xerox, Cornell, New Mexico Univ, TNO, others


REIS Project/Initiative Multilinguality framework - First attempt

  - Multiple language indexing
      - multiple language documents/indexes
  - Cross-language Searching
      - Automatic Query forwarding based on thesauri or ML dictionary
      - Using "fuzzy" multilingual searching/matching
  - Multilingual information retrieval
      - Automatic translation (if requested)
      - Translation Request Protocol
  - Internal Data/Indexes presentation
      - Language and Character Encoding tagging
      - XML as internal presentation of data and XML language and charset tagging
      - Text/Charset normalisation (Unicode or TUSTEP-like)
  - Metadata and Resource Description
      - DC.Language definition and XML/RDF/DC Language tagging
??HZjG ?? ? ?W??#Click to edit Master subtitle style?$? $? ?  ?`??4??a????a??????????? ???????? ? ?[??*????=44OOii? ?  ?`?$5??a????a??????????? ???S ???  ? ?]??*????=44OOii?9 ?  ?`??5??a????a??????????? ???????? ? ??? Slide 2_*?  ??$??=44OOii?H ? ? ?0??@??޽h??? ?? ??????????f?????????? 0 ??P??N?( ? ??  ? ? ?T?7??jJ??jJ??????? ???? ,G??  ? ?q??*? ??? ??? ? ? ?T?d7??jJ??jJ??????? ????l ?G?? ? ?s??*? ??? ???p ? ? ?0?????1? ????? ?? ??: ? ? ?T??7???g?ֳ??g?ֳ?????? ??? LL??? ? ???RClick to edit Master text styles Second level Third level Fourth level Fifth level?!    ? S? ? ? ?Z?$8??jJ??jJ???????? ?? ,l??  ? ?q??*? ??? ??? ? ? ?Z??8??jJ??jJ???????? ??l ?l?? v ?s??*? ??? ???H ? ? ?0??b?f?@???? ?? ??????̙33????????? ??@??0?( ? ??H ? ? ?0???b?f?@??? ?? ??????̙33??????????? 0?( ????( ? ??? ? # ?l?D6??g????g????????????? ? ?h???? ? ? ??? ? # ?l??6??g????g????????????? ? ?? ??_?? ? ? ??H ? ? ?0???@??޽h?? ?? ??????????f????????? ? ???`?x??( ?@? ?x?l ?x C ??:????x8????  ? ? ??l ?x C ??d:????Sf??? ? ? ??H ?x ? ?0???@??޽h?? ?? ??????????f????????? ? ???p?d??( ? ?d?l ?d C ??$;????x8????  ? ? ??l ?d C ???;????Sf??? ? ? ??H ?d ? ?0???@??޽h?? ?? ??????????f????????? ? ????????( ? ???l ?? C ??=????x8????  ? ? ??l ?? C ??d=????Sf??? ? ? ??H ?? ? ?0???@??޽h?? ?? ??????????f????????? ? ?????l??( ?V ?l?l ?l C ??T???x8????   ? ??l ?l C ??????Sf???  ? ??H ?l ? ?0???@??޽h?? ?? ??????????f????????? ? ?????p??( ?@<,8<, ?p?l ?p C ??t???x8????  ? ? ??l ?p C ??4???Sf??? ? ? ??H ?p ? ?0???@??޽h?? ?? ??????????f????????? ? K?C??$?????( ?? ???? ?? # ?l?4?a????a????????????? ??x8????  ? ? ??? ?? # ?l???a????a????????????? ??Sf- ?? ? ? ???8 ?L ??d ?$???L ?d?6?@ ????? ????????? ? ?  ?`??????????1???????????? ?B?? ?  ?? ? ? ? ?Z?ę???????1?????????O?? ?r??,Resolution Service / Directory (content MD)?--? - ?? ? ? ? ?Z?Ĝ????????1??????? P ?d ?B?? ?  ??? ? ?  ?`???????????1????????h *O ?[??Communication/Network??  ?xB ?? ? ?H??D????o?????L$ LP ?xB ?? ? ?H??D????o?????, ,D ?xB ??? ? 
?H??D????o????????\ ?xB ??? ? ?H??D????o????????, ???@ ? ??$  ??? ??$ ?? ?? ? ?Z??????????1??????? ??$  ?B?? ?  ?? ? ? ? ?Z???????????1??????? ?? ?B?? ?  ?? ?? ? ?Z?$?????????1??????? ??? ?B?? ?  ??? ??  ?`????????????1??????? ??? ?N??Language? ?  ? ? ??  ?`?Ԙ?????????1???????. ??  ?u??Presentation Culture Locale?$? ??? ??  ?`????????????1???????: ???  ?^??Content Transfer Agent??  ???N ? ??$  ?? ?????| ?? ?? ? ?Z?$?"????????1??????? ??$  ?B?? ?  ?? ?? ? ?Z???"????????1??????? ?? ?B?? ?  ?? ?? ? ?Z??"????????1??????? ??? ?B?? ?  ??? ??  ?`???"?????????1??????? ??? ?N??Language? ?  ? ? ??  ?`?d?"?????????1???????. ??  ?u??Presentation Culture Locale?$? ??? ??  ?`?$?"?????????1???????: ???  ?^??Content Transfer Agent??  ??? ? ?  ?`???"?????????1???????9 ?3  ?\??Communication Protocol??  ???@ ?L ?? ?  ?#??L ?? ? ?? ?!? ? ?T????????1???????L ?? ? ??? ?"?  ?f?????????????1???????Y ?? ?  ?\??Resource?&  ?  ?H ?? ? ?0???@??޽h?? ?? ??????????f????????? ? ????????( ??? ???l ?? C ?????x8????  ? ? ??l ?? C ??t??????? ? ? ??H ?? ? ?0???@??޽h?? ?? ??????????f????????? ? ???0????( ?o  ???l ?? C ???}???x8????   ? ??l ?? C ??4~???Sf???  ? ??H ?? ? ?0???@??޽h?? ?? ??????????f????????? ? ????|??( ??????? ?|?l ?| C ??t ???x8????   ? ??l ?| C ??? ???Sf???  ? ??H ?| ? ?0???@??޽h?? ?? ??????????f????????? ? ????4??( ?% ?4?l ?4 C ??4 ???x8????   ? ??l ?4 C ??? ???Sf???  ? ??H ?4 ? ?0???@??޽h?? ?? ??????????f????????? ? ??? ????( ????? ???l ?? C ??? ???x8????   ? ??l ?? C ??T|???/????  ? ??H ?? ? ?0???@??޽h?? ?? ??????????f????????? ? ???p????( ? ???l ?? C ??$?"???x8????  " ? ??l ?? C ????"???Sf??? " ? ??H ?? ? ?0???@??޽h?? ?? ??????????f????????? ? ???@?,??( ? ?,?l ?, C ???~???x8????   ? ??l ?, C ???~???Sf???  ? ??H ?, ? ?0???@??޽h?? ?? ??????????f????????? ? ???P????( ?Xu@?@? ???l ?? C ??4????x8????  ? ? ??l ?? C ???????Sf???  ? ??H ?? ? ?0???@??޽h?? ?? ??????????f????????? ? ???`?t??( ? ?t?l ?t C ??????x8????   ? ??l ?t C ??t????Sf???  ? ??H ?t ? ?0???@??޽h?? ?? ??????????f????????? ? 
???`????( ? ???l ?? C ???????x8????  ? ? ??l ?? C ??t????Sf??? ? ? ??H ?? ? ?0???@??޽h?? ?? ??????????f????????? ? ?????L??( ? ?L?l ?L C ??????x8????   ? ??l ?L C ??t????Sf???  ? ??H ?L ? ?0???@??޽h?? ?? ??????????f????????? ? ???p?<??( ? ?<?l ?< C ??ԃ???x8????   ? ??l ?< C ??4????Sf???  ? ??H ?< ? ?0???@??޽h?? ?? ??????????f????????? ? ?????@??( ?w@#@ ?@?l ?@ C ??T????x8????   ? ??l ?@ C ???????Sf???  ? ??H ?@ ? ?0???@??޽h?? ?? ??????????f????????? ? ????????( ? ???l ?? C ???????x8????   ? ??l ?? C ??????Sf???  ? ??H ?? ? ?0???@??޽h?? ?? ??????????f????????? ? ????????( ? ???l ?? C ??Đ"???x8????  9 ? ??l ?? C ??$?"???Sf??? 9 ? ??H ?? ? ?0???@??޽h?? ?? ??????????f????????? ?  ????? ??( ???@? ? ?l ?  C ??d????x8????   ? ??l ?  C ?????????`???  ? ??H ?  ? ?0???@??޽h?? ?? ??????????f????????? 0 ??z?????( ????? ???R ?? 3 ??????? ??  ??? ?? C ??????? LL???  ? ? ??  ?H ?? ? ?0???b?f?@??? ?? ??????̙33??????????* 0 ??z?????( ? ???R ?? 3 ??????? ??  ?? ?? C ??????? LL???   ? ??  ?H ?? ? ?0???b?f?@??? ?? ??????̙33??????????+ 0 ??z?????( ? ???R ?? 3 ??????? ??  ?? ?? C ???????? LL???   ? ??  ?H ?? ? ?0???b?f?@??? ?? ??????̙33??????????, 0 ??z@????( ?8 ???R ?? 3 ??????? ??  ?? ?? C ???????? LL???   ? ??  ?H ?? ? ?0???b?f?@??? ?? ??????̙33??????????3 0 ??zP???( ? ??R ? 3 ??????? ??  ?? ? C ??????? LL???   ? ??  ?H ? ? ?0???b?f?@??? ?? ??????̙33??????????7 0 ??z?0??( ? ?0?R ?0 3 ??????? ??  ?? ?0 C ??$????? LL???   ? ??  ?H ?0 ? ?0???b?f?@??? ?? ??????̙33??????????8 0 ??z??8??( ??? ?8?R ?8 3 ??????? ??  ?? ?8 C ??d????? LL???   ? ??  ?H ?8 ? ?0???b?f?@??? ?? ??????̙33??????????9 0 ??z?D??( ? ?D?R ?D 3 ??????? ??  ?? ?D C ???????? LL???   ? ??  ?H ?D ? ?0???b?f?@??? ?? ??????̙33??????????: 0 ??z ?H??( ?????<? ?H?R ?H 3 ??????? ??  ?? ?H C ???????? LL???   ? ??  ?H ?H ? ?0???b?f?@??? ?? ??????̙33??????????; 0 ??z0?P??( ?| ?P?R ?P 3 ??????? ??  ?? ?P C ??D????? LL???   ? ??  ?H ?P ? ?0???b?f?@??? ?? ??????̙33???????r?`?w?~?_?p?0x???{? ?0????y?;?"C?&?[? 
???????????k?۵5?4??6???????ۦ??;?؅???{?[??3? ??Ab?x`?(`?????????? ? ?? ????  D?A4 Paper (210x297 mm) ?? ? Times New RomanSymbolMonotype Sorts Courier NewInternational.potTMultilingual Issues in Information Retrieval and Resource Description OverviewIn this presentation>TERENA Multilingual Community and TERENA Technical Programme)Multilinguality: trends and developmentsInternet Applications1I18n and ML issues at IETF and other STD bodies JIETF Architectural Model of Multilingual support in Internet Applications)Content Negotiation Framework (IETF/W3C)Charset and Language taggingWWW: HTTP/HTMLXML: Character Set taggingXML: Language taggingUnicode Technical Reports<Language Definition in DC Metadata set - DC.Language FormatWLanguage Definition in DC Metadata set - Field content language labelling/attributingDC in Multiple Languages8Document Description with Unqualified DC and RDF syntax>Recent Developments in Subject Gateways, Indexing, Searching&Multilingual Subject Gateway (DESIRE);Multilingual provision in popular Internet Search Engines5TUSTEP TUebingen System of Text Processing ProgramsECross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8BREIS Project/Initiative Multilinguality framework - First attemptDMultilinguality Framework for Multilingual Indexing/Search Services  Fonts UsedDesign Template Slide Titles? 6> _PID_GUID?AN{6581DFA0-211F-11D4-9BAB-00104B457DFB}?&_?????g?Yuri DemchenkoD?b??0h?bh?b??0?DSymbolew Roman?b?bD?b??0h?bh?b??0 ?DMonotype Sorts?b?bD?b??0h?bh?b??00?DCourier Newts?b?bD?b??0h?bh?b??01?  ??@  @@``???  @?n???" dd@?????????  @@``?? 2?*?p?m?&*&3% $% !"#  &'),S ?~??????????1???????????0? ??????n?@???????8???????g??42d2dt?b??0l?b>$ ????????p?pp?0 ? <?4BdBd?b?b>$ ???????-?42000. Yu.Demchenko. TERENA ??Multilingual Issues in Information Retrieval and Resource DescriptionO? 
?=?~V???SMultilingual Issues in Information Retrieval and Resource Description Overview?dT+' '/+#+??*Yuri Demchenko, TERENA demchenko@terena.nl?++?8A??In this presentation?  ??VMultilingual Issues in TERENA Technical Programme Multilinguality: trends and developments Technical Issues/Background Data presentation and resource description format Standards Overview Metadata and Cataloging Recent Development in Subject Gateways and SE Cross-language Information Retrieval REIS/TAP Initiatives Multilinguality Framework ?bwEk/wEk/?V  ?4<??=TERENA Multilingual Community and TERENA Technical Programme? > ??:TERENA has 43 members from 34 countries speaking 30 languages Multilingual issues always were in the scope of TERENA Technical Program WG-i18n - WG on Internationalisation issues C3 Project on messaging transliteration tools MAITS - initiated by WG-i18n Multilingual E-Mail Agent Testing Multilingual issues in Subject Gateways, Section 2.13 in SG Handbook Multilingual Support in Internet/IT Applications. Information page - http://www.terena.nl/projects/multiling/ Liaison with STD bodies CEN/TC304 Character set technology - http://www.stri.is/TC304/default.html IETF?L?LP?LP?    ?=E??(Multilinguality: trends and developments? ) ??'Storing, processing, presentation and exchange of information in many languages Interactive (protocol based/negotiated ) applications and non-interactive (resource description and information presentation) Multilingual Search and Retrieval Multilingual Subject Gateways and Search Engines CLIR testing at TREC Data Resource Model and Multilinguality One or Multiple languages Data format Metadata (not part of Data but part of Resource) References, links Professional Thesauri (Resource Context) - base for multiple languages and language unification ??P~"F(?P~"F(?? ( ?6>??Internet Applications?  
???None-interactive Application: Electronic Mail Correct Message Composition and Rendering Interactive applications WWW: HTTP/HTML http-equiv="Content-Type" Content="text/html; charset=euc-jp" Content Negotiation Protocol Media features, attributes Direct and hop-by-hop communication Operational Applications XML/DOM (Internationalised) DNS LDAP and X.500 (Language Support ?)??.*??D.*??D?~?, ,? ?7???0I18n and ML issues at IETF and other STD bodies ? 1 ??AIETF Architectural Model of Multilingual support in Internet Applications - RFC 2130 Language and Charset/Encoding tagging Content negotiation framework (IETF/W3C) Point-to-point vs hop-by-hop Message based vs Interactive vs Streaming Internationalised DNS (IDN) - Internationalised Domain Names vs E-Mail (SMTP, IMAP) vs Routing (Routing Policy Specification Language (RPSL)) vs Network Management (SNMP textual presentation) vs Network Security (TLS and IPSec) Content Encoding normalisation (IETF/Unicode) LSD-2 - Large Scale Services Deployment IMAP language extension?~U&*G>?nU&*G>?n??U^                 8  0  2  P ?*??IIETF Architectural Model of Multilingual support in Internet Applications???User Interface Presentation Culture Locale Language On-the-wire Coded Character Set - Repertoire of ISO-10646 Character Encoding Scheme - UTF-8 (ml-text), US-ASCII (e-mail), ISO8859-1 Transfer Encoding Scheme (Base64, QP) ?Z% ?% ????>F??(Content Negotiation Framework (IETF/W3C)? 
) ???Content Negotiation covers three elements Expressing the capabilities of the sender and the data resource to be transmitted Expressing the capabilities of a receiver A protocol by which capabilities are exchanged Abstract framework for content negotiation (Content) (Transmit.data) (Data document) [Author]----->-----[Sender]----->-----[Receiver]----->-----[User] Transparent Content Negotiation in HTTP - RFC 2295 Protocol-independent Content Negotiation Framework - RFC 2703 Non-message resource transfer End-to-end vs hop-by-hop negotiation Use of directory and resolution services CC/PP exchange protocol based on HTTP Extension Framework (W3C) Composite Capability/Preference Profile: A user side framework for content negotiation??*?+?ql@W*?+zql@W? #  ? ? +??Charset and Language tagging??MIME types (RFC 2045-2049) text, img, audio, video Charset = Character Set + Character Encoding Scheme Transfer Encoding Scheme base64 quoted-printable Other media attributes and features (e.g., resolution, color, language, etc.) Language RFC 1766 ISO639-2 ?~eN eN ?,!?.?:B??WWW: HTTP/HTML? ??HTTP header includes information about the type of the transferred information and the character encoding for text-based information: http-equiv="Content-Type" Content="text/html; charset=euc-jp" The Content-Language entity header field describes the natural language(s) of the intended audience for the enclosed document: http-equiv="Content-Type" Content-Language=se Character encoding information in the META information of the HTML document: ?X?=?-ND???  ,    ?  
~  ,     ?(8??XML: Character Set tagging???Character is atomic unit of text All ISO 10646 characters + TAB, CR, LF The mechanism for Encoding can vary for different characters All XML processors must accept UTF-8 and UTF-16 Character Encoding declaration in XML documents or entities (section 4.3.3) Encod  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghi?klm????opqrstuvwxyz{|}~?????????????????????????????????????????????????????????????????n?????????j???????????????????????????????????????????????????????????????????Root Entrych_htmlDEMCH_~21D'FedesireDESI??????????d?O?????)?1&1?/?ο?@ex20Current UserdocsDOCS?Ni ???E<@>???gga?I?????????????H??X?eQ????9A?.D?#RSummaryInformation`???n???F8??w?'(????????ui*? >???? P?p??? ?›?h?X?(?h?c ?w?PowerPoint Document??G@H?d?}??U?-&G 5??(????K?=.)??*??? ???,????9o???? ????R?QDocumentSummaryInformation?_????8???????????????r??? ?u???:Γu??7L? zf?fW:?lx?/??4??KT?d????D?^?@?^4?֦I$??????YAtc` wRWw?????ݝ;:?r?At?k????????????=????T???????|??f2??3om ??ǥ????q??L??.0??n?5?߷ž????????????????\?pe?cp???D??lynre*?P??M?N?Q?݋?a?(?IT` ? /? 0??0?DTimes New Roman?b?bD?b??0h?bh?b??0?DSymbolew Roman?b?bD?b??0h?bh?b??0 ?DMonotype Sorts?b?bD?b??0h?bh?b??00?DCourier Newts?b?bD?b??0h?bh?b??01?  ??@  @@``???  @?n???" dd@?????????  @@``?? *?"?h?l?&*&3% $% !"#  &'),S ?~??????????1???????????0? ??????n?@???????8???????g??42d2dt?b??0l?b>$ ????????p?pp?0 ? <?4BdBd?b?b>$ ???????-?42000. Yu.Demchenko. TERENA ??Multilingual Issues in Information Retrieval and Resource DescriptionO? ?=?mV???SMultilingual Issues in Information Retrieval and Resource Description Overview?dT+' '/+#+??*Yuri Demchenko, TERENA demchenko@terena.nl?++?8A??In this presentation?  
??VMultilingual Issues in TERENA Technical Programme Multilinguality: trends and developments Technical Issues/Background Data presentation and resource description format Standards Overview Metadata and Cataloging Recent Development in Subject Gateways and SE Cross-language Information Retrieval REIS/TAP Initiatives Multilinguality Framework ?bwEk/wEk/?V  ?4<??=TERENA Multilingual Community and TERENA Technical Programme? > ??:TERENA has 43 members from 34 countries speaking 30 languages Multilingual issues always were in the scope of TERENA Technical Program WG-i18n - WG on Internationalisation issues C3 Project on messaging transliteration tools MAITS - initiated by WG-i18n Multilingual E-Mail Agent Testing Multilingual issues in Subject Gateways, Section 2.13 in SG Handbook Multilingual Support in Internet/IT Applications. Information page - http://www.terena.nl/projects/multiling/ Liaison with STD bodies CEN/TC304 Character set technology - http://www.stri.is/TC304/default.html IETF?L?LP?LP?    ?=E??(Multilinguality: trends and developments? ) ??'Storing, processing, presentation and exchange of information in many languages Interactive (protocol based/negotiated ) applications and non-interactive (resource description and information presentation) Multilingual Search and Retrieval Multilingual Subject Gateways and Search Engines CLIR testing at TREC Data Resource Model and Multilinguality One or Multiple languages Data format Metadata (not part of Data but part of Resource) References, links Professional Thesauri (Resource Context) - base for multiple languages and language unification ??P~"F(?P~"F(?? ( ?6>??Internet Applications?  
???None-interactive Application: Electronic Mail Correct Message Composition and Rendering Interactive applications WWW: HTTP/HTML http-equiv="Content-Type" Content="text/html; charset=euc-jp" Content Negotiation Protocol Media features, attributes Direct and hop-by-hop communication Operational Applications XML/DOM (Internationalised) DNS LDAP and X.500 (Language Support ?)??.*??D.*??D?~?, ,? ?7???0I18n and ML issues at IETF and other STD bodies ? 1 ??AIETF Architectural Model of Multilingual support in Internet Applications - RFC 2130 Language and Charset/Encoding tagging Content negotiation framework (IETF/W3C) Point-to-point vs hop-by-hop Message based vs Interactive vs Streaming Internationalised DNS (IDN) - Internationalised Domain Names vs E-Mail (SMTP, IMAP) vs Routing (Routing Policy Specification Language (RPSL)) vs Network Management (SNMP textual presentation) vs Network Security (TLS and IPSec) Content Encoding normalisation (IETF/Unicode) LSD-2 - Large Scale Services Deployment IMAP language extension?~U&*G>?nU&*G>?n??U^                 8  0  2  P ?*??IIETF Architectural Model of Multilingual support in Internet Applications???User Interface Presentation Culture Locale Language On-the-wire Coded Character Set - Repertoire of ISO-10646 Character Encoding Scheme - UTF-8 (ml-text), US-ASCII (e-mail), ISO8859-1 Transfer Encoding Scheme (Base64, QP) ?Z% ?% ????>F??(Content Negotiation Framework (IETF/W3C)? 
  Content Negotiation covers three elements
    Expressing the capabilities of the sender and the data resource to be transmitted
    Expressing the capabilities of a receiver
    A protocol by which capabilities are exchanged
  Abstract framework for content negotiation

         (Content)     (Transmit.data)     (Data document)
    [Author]----->-----[Sender]----->-----[Receiver]----->-----[User]

  Transparent Content Negotiation in HTTP - RFC 2295
  Protocol-independent Content Negotiation Framework - RFC 2703
    Non-message resource transfer
    End-to-end vs hop-by-hop negotiation
    Use of directory and resolution services
  CC/PP exchange protocol based on HTTP Extension Framework (W3C)
    Composite Capability/Preference Profile: a user-side framework for content negotiation

Charset and Language tagging

  MIME types (RFC 2045-2049)
    text, image, audio, video
    Charset = Character Set + Character Encoding Scheme
    Transfer Encoding Scheme
      base64
      quoted-printable
    Other media attributes and features (e.g., resolution, color, language, etc.)
  Language
    RFC 1766
    ISO639-2

WWW: HTTP/HTML

  The HTTP header includes information about the type of the transferred information and the character encoding for text-based information:
    http-equiv="Content-Type" content="text/html; charset=euc-jp"
  The Content-Language entity header field describes the natural language(s) of the intended audience for the enclosed document:
    http-equiv="Content-Language" content="se"
  Character encoding information in the META information of the HTML document:

XML: Character Set tagging

  A character is an atomic unit of text
    All ISO 10646 characters + TAB, CR, LF
    The mechanism for encoding can vary for different characters
    All XML processors must accept UTF-8 and UTF-16
  Character Encoding declaration in XML documents or entities (section 4.3.3)
    EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
    <?xml encoding='UTF-8'?>
    <?xml encoding='EUC-JP'?>
  Default Character Set Encoding - UTF-8 and UTF-16
  Autodetection of Character Encoding

XML: Language tagging

  Language identification (section 2.12)
    Labelling the language of the whole document, entity or item
  Tag for identification of languages
    LanguageID ::= Langcode ('-' Subcode)*
    Langcode ::= ISO639Code | IanaCode | UserCode
  Examples:
    <p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
    <p xml:lang="en-GB">What colour is it?</p>
    <p xml:lang="en-US">What color is it?</p>
    <sp who="Faust" desc='leise' xml:lang="de">
      <l>Habe nun, ach! Philosophie,</l>
      <l>Juristerei, und Medizin</l>
      <l>und leider auch Theologie</l>
      <l>durchaus studiert mit heißem Bemüh'n.</l>
    </sp>

Unicode Technical Reports

  The Unicode Standard, Version 3.0 - Just published! - http://www.unicode.org/unicode/uni2book/u2.html
  Unicode 2.0 test page - http://www.terena.nl/projects/multiling/euroml/tests/test-ucspages1ucs.html
  Multilingual European Subsets of ISO/IEC 10646-1 - http://www.stri.is/TC304/p10_1998_05_30.pdf
  Unicode Technical Reports
    UTR #15: Unicode Normalization Forms, Version 18.0
      I-D by Martin Duerst
    UTR #17: Character Encoding Model
    UTR #16: UTF-EBCDIC
    UTR #10: Unicode Collation Algorithm
    UTR #7: Plane 14 Characters for Language Tags

Language Definition in DC Metadata set - DC.Language Format

  <meta name = "DC.Language" content = "en">
  <meta name = "DC.Language" scheme = "rfc1766" content = "en">
  <meta name = "DC.Language" scheme = "ISO639-2" content = "eng">
  <meta name = "DC.Language" scheme = "rfc1766" content = "en-US">
  <meta name = "DC.Language" content = "zh">
  <meta name = "DC.Language" content = "ja">
  <meta name = "DC.Language" content = "es">
  <meta name = "DC.Language" content = "german">
  <meta name = "DC.Language" lang = "fr" content = "allemand">
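A small sketch, not taken from the slides, of how the language tags shown above might be checked and read back in Python. The regular expression follows the RFC 1766 tag syntax (a primary tag and optional subtags, each 1 to 8 letters). The helper names and the <doc> wrapper element are assumptions made for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml:" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# RFC 1766 syntax: Primary-tag *( "-" Subtag ), each part 1*8ALPHA.
RFC1766_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


def is_wellformed_tag(tag):
    """Return True if `tag` matches the RFC 1766 syntax (no registry lookup)."""
    return RFC1766_TAG.fullmatch(tag) is not None


def split_tag(tag):
    """Split a tag into (lower-cased primary tag, list of subtags)."""
    primary, _, rest = tag.partition("-")
    return (primary.lower(), rest.split("-") if rest else [])


def collect_languages(xml_text):
    """List (element name, xml:lang value) for every element that sets xml:lang."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.get(XML_LANG)) for el in root.iter() if el.get(XML_LANG)]


# Hypothetical wrapper document around two of the slide's examples.
SAMPLE = """<doc>
  <p xml:lang="en-GB">What colour is it?</p>
  <p xml:lang="en-US">What color is it?</p>
</doc>"""

if __name__ == "__main__":
    for name, lang in collect_languages(SAMPLE):
        print(name, lang, is_wellformed_tag(lang), split_tag(lang))
```

A syntactic check of this kind says nothing about whether a tag is registered: a DC.Language value such as "german" passes it even though it is not an ISO 639 code, which is why the scheme attribute in the meta examples matters.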

Language Definition in DC Metadata set - Field content language labelling/attributing

  A work in Spanish may be assigned the following metadata:
    <meta name = "DC.Language" scheme = "rfc1766" content = "es">
    <meta name = "DC.Title" lang = "es" content = "La Mesa Verde y la Silla Roja">
    <meta name = "DC.Title" lang = "en" content = "The Green Table and the Red Chair">

DC in Multiple Languages

  The reference language of the Int'l DC community is English; however, the semantics of DC elements are in principle expressed equally well in any modern language
  The versions of DC elements in various languages should share a single name space, using tokens that look like English words but stand for universal elements - http://purl.org/dc/elements/1.1/
  DC in Multiple Languages Registry project - http://purl.org/dc/groups/languages.htm
    Uses RDF schemas to share machine-readable tokens for translation of DC terms in multiple languages (26 languages to date)
    Linkage to and from central DC namespace server
    Registry as Dictionary/Thesauri - use Interlinguas to link different translations
    Formal recognition and standardization procedure

Document Description with Unqualified DC and RDF syntax

  Das Erdbeben in Chili
  Heinrich von Kleist
  XML Encoding (Character set) declaration
    UTF-8/UTF-16 as default encoding

Recent Developments in Subject Gateways, Indexing, Searching

  NRENs projects
    Subject gateways
  Commercial Search Engines
  Multilingual Text Retrieval and Processing
    TUSTEP system - using fuzzy multilingual searching
  Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8 Conferences by NIST

Multilingual Subject Gateway (DESIRE)

  Developing multilingual subject gateways (SOSIG as example)
    SOSIG accepts resources in any language, evaluated for quality
    Translation should be coherent and checked
    Different language versions should be equally well maintained
  SOSIG Cataloguing rules
    TITLE will be displayed in the first language
    ALTERNATIVE TITLE in other languages
    DESCRIPTION will mention the different languages in which the resource is available
    URI of all language versions
      Labeling URI language
  Library standards for multilingual provision
    NISO Z39.53 Language codes
    USMARC Language codes

Multilingual provision in popular Internet Search Engines

  Multilingual SE
    AltaVista - http://www.altavista.com/ - 28 languages
      Documents indexed as is
      Automatic translation - very simple and naive
    Euroseek - http://www.euroseek.com/ - 30 languages
    FAST Advanced Search - http://www.alltheweb.com - 31 languages
    Google - http://www.google.com/ - 11 languages
  Other sites that have dedicated national sites
    interface language
    language resources
    no special language policy
    Excite - 11 countries
    Lycos - 23 countries

TUSTEP - TUebingen System of Text Processing Programs

  1. File structure
  2. Multilingual capabilities
  3. Internal data presentation
  4. Database publishing/output data presentation
  5. CGI
  6. Sample implementation
     http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit
       Try entries like Smith or Meier or...
     http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery

Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8

  TREC - Text REtrieval Conference - http://trec.nist.gov/
  Cross-Language Information Retrieval (CLIR) technologies
    Using Intermediary or Interlingual representation
      Latent Semantic Indexing
      Generalised Vector Space Model, etc.
    Computer translation
    Machine-readable bilingual dictionaries
    Multilingual Thesauri
  Participants: ETH/Eurospider, IBM, Xerox, Cornell, New Mexico Univ, TNO, others

REIS Project/Initiative Multilinguality framework - First attempt

  Multiple language indexing
    multiple language documents/indexes
  Cross-language Searching
    Automatic Query forwarding based on thesauri or ML dictionary
    Using fuzzy multilingual searching/matching
  Multilingual information retrieval
    Automatic translation (if requested)
    Translation Request Protocol
  Internal Data/Indexes presentation
    Language and Character Encoding tagging
    XML as internal presentation of data and XML language and charset tagging
    Text/Charset normalisation (Unicode or TUSTEP-like)
  Metadata and Resource Description
    DC.Language definition and XML/RDF/DC Language tagging

REIS Project/Initiative Multilinguality framework

  To be developed yet
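
The "Text/Charset normalisation (Unicode ...)" and "fuzzy multilingual searching/matching" items in the framework above could be approached along the following lines. This is only a sketch using Python's standard unicodedata module: it applies the normalization forms of UTR #15 and builds an accent-insensitive matching key. The function names are hypothetical, and nothing here reflects how TUSTEP or the REIS framework actually implement matching.

```python
import unicodedata


def normalise(text, form="NFC"):
    """Bring text to one Unicode normalization form before indexing."""
    return unicodedata.normalize(form, text)


def fold_key(text):
    """Build a crude matching key: decompose, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


if __name__ == "__main__":
    # "e" followed by a combining acute accent composes to one code point under NFC.
    print(normalise("e\u0301") == "\u00e9")
    # Index entry and query compare equal once accents and case are folded.
    print(fold_key("T\u00fcbingen") == fold_key("TUBINGEN"))
```

Dropping diacritics suits some languages better than others (German users may expect "ü" to match "ue" rather than "u"), so a real service would most likely need language-specific folding rules driven by the language tags on the indexed text.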