:: wikimiki.org ::
Baseball

Baseball (from the English words "base" and "ball") is a team bat-and-ball game in which the offensive player (the batter) tries to hit, with a bat, a ball thrown by the defending pitcher. If the batter rounds all four bases without being put out, his team scores a run. Baseball is popular mainly in South America and, above all, in North America, as well as in East Asia (in Japan and the Republic of Korea).

South America

South America is a continent lying mostly in the southern hemisphere. The equator passes through it; the continent is bounded by the Pacific Ocean to the west and the Atlantic Ocean to the east. South America is joined to North America by a narrow isthmus usually called Central America. In geological terms, this connection formed very recently. The Andes mountain range rises along the entire length of South America's western coast. East of the Andes, towards the interior, lie vast areas of tropical rainforest. Most of the South American mainland east of the Andes belongs to the Amazon basin.

South America was probably first settled by people who crossed the land bridge at what is now the Bering Strait. There are, however, also indications of migration from the southern Pacific Ocean. From the 16th century until the early 19th century, the continent was almost entirely divided into colonies under Spanish and Portuguese administration. Today it contains several independent republics and a few remaining colonies: French Guiana, the Falkland Islands and adjacent islands.

The adjacent islands are also counted as part of the South American continent, and most of them are controlled by countries on the continent. The islands of the Caribbean are assigned to North or Central America.

Countries of South America


- Argentina
- Bolivia
- Brazil
- Chile
- Colombia
- Ecuador
- Falkland Islands (Malvinas) (United Kingdom)
- French Guiana (France)
- Guyana
- Paraguay
- Peru
- South Georgia and the South Sandwich Islands (United Kingdom)
- Suriname
- Uruguay
- Venezuela

External links


- [http://www.geographicguide.com/south-america.htm South America] (in English)
- [http://www.brasil-fotos.com Brazil in photographs] (in Portuguese)

Asia

Asia is a continent forming part of the larger landmass of Eurasia. Although Asia's boundaries cannot be defined precisely, the boundary is generally taken to be the Urals in the west, then, moving south, the Caspian Sea, the Caucasus Mountains, the Black Sea, the Sea of Marmara, the Dardanelles and the Suez Canal. Everything east of this notional boundary is considered part of Asia.

List of countries

- Afghanistan
- Armenia
- Azerbaijan
- Bahrain
- Bangladesh
- Bhutan
- Brunei
- Cambodia
- China
- Cyprus
- East Timor
- Georgia
- India
- Indonesia
- Iran
- Iraq
- Israel
- Japan
- Jordan
- Kazakhstan
- Kuwait
- Kyrgyzstan
- Laos
- Lebanon
- Malaysia
- Maldives
- Mongolia
- Myanmar (formerly Burma)
- Nepal
- North Korea
- Oman
- Pakistan
- Philippines
- Qatar
- Russia
- Saudi Arabia
- Singapore
- South Korea
- Sri Lanka
- Syria
- Tajikistan
- Thailand
- Turkey
- Turkmenistan
- United Arab Emirates
- Uzbekistan
- Vietnam
- Yemen

See also


- Three Kingdoms of Korea

South Korea

South Korea, officially the Republic of Korea, is a country in East Asia occupying the southern half of the Korean Peninsula. Its neighbour is North Korea.

UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode created by Ken Thompson and Rob Pike. It can represent any character in the Unicode standard, yet is backwards compatible with ASCII. For this reason, it is steadily becoming the preferred encoding for email, web pages, and other places where characters are stored or streamed.

UTF-8 uses one to four bytes (strictly, octets) per character, depending on the Unicode symbol. For example, only one byte is needed to encode the 128 US-ASCII characters in the Unicode range U+0000 to U+007F. Four bytes may seem like a lot for one character (code point); however, this is required only for code points outside the Basic Multilingual Plane, which are generally very rare. Furthermore, UTF-16 (the main alternative to UTF-8) also needs four bytes for these code points. Whether UTF-8 or UTF-16 is more efficient depends on the range of code points being used. However, the differences between encoding schemes can become negligible with the use of traditional compression systems like DEFLATE. For short items of text where traditional algorithms do not perform well and size is important, the Standard Compression Scheme for Unicode could be considered instead.

The IETF (Internet Engineering Task Force) requires all Internet protocols to identify the encoding used for character data, and to support UTF-8 as at least one of the available encodings. The IMC (Internet Mail Consortium) [http://www.imc.org/mail-i18n.html recommends] that all email programs be able to display and create mail using UTF-8.

Description

There are several current, slightly different definitions of UTF-8 in various standards documents:
- RFC 3629 / STD 63 (2003), which establishes UTF-8 as a standard Internet protocol element
- The Unicode Standard, Version 4.0, §3.9–§3.10 (2003)
- ISO/IEC 10646-1:2000 Annex D (2000)

They supersede the definitions given in the following obsolete works:
- ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
- The Unicode Standard, Version 2.0, Appendix A (1996)
- RFC 2044 (1996)
- RFC 2279 (1998)
- The Unicode Standard, Version 3.0, §2.3 (2000) plus Corrigendum #1: UTF-8 Shortest Form (2000)
- Unicode Standard Annex #27: Unicode 3.1 (2001)

They are all the same in their general mechanics, with the main differences being on issues such as the allowed range of code point values and the safe handling of invalid input.

The bits of a Unicode character are divided into several groups, which are then distributed among the lower bit positions inside the UTF-8 bytes. A character whose code point is below U+0080 is encoded with a single byte that contains its code point: these correspond exactly to the 128 characters of 7-bit ASCII. In other cases, two to four bytes are required. The uppermost bit of each of these bytes is 1, to prevent confusion with 7-bit ASCII characters, particularly characters with code points lower than U+0020, traditionally called control characters (for example, carriage return).

For example, the character aleph (א), which is Unicode U+05D0, is encoded into UTF-8 in this way:
- It falls into the range U+0080 to U+07FF, so it is encoded using two bytes of the form 110xxxxx 10xxxxxx.
- Hexadecimal 0x05D0 is equivalent to binary 101-1101-0000.
- The eleven bits are put in order into the positions marked by "x": 11010111 10010000.
- The final result is the two bytes, more conveniently expressed as the two hexadecimal bytes 0xD7 0x90. That is the letter aleph in UTF-8.

So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes.

By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes covering the whole area U+0000 to U+7FFFFFFF (31 bits). However, in November 2003 RFC 3629 restricted UTF-8 to the area covered by the formal Unicode definition, U+0000 to U+10FFFF. Before this, only the bytes 0xFE and 0xFF did not occur in UTF-8 encoded text. After this limit was introduced, the number of byte values that never occur in a UTF-8 stream increased to 13: 0xC0, 0xC1, and 0xF5 to 0xFF.
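The bit-packing steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `encode_utf8` is made up for this example, it handles one code point at a time, and it follows the RFC 3629 range (U+0000 to U+10FFFF, surrogates excluded) rather than the older six-byte scheme.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    if code_point < 0x80:
        # 0xxxxxxx: identical to 7-bit ASCII
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx: 11 payload bits
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 payload bits
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The aleph example from the text: U+05D0 becomes 0xD7 0x90.
print(encode_utf8(0x05D0).hex(" "))
```

In practice Python's built-in `str.encode("utf-8")` does the same job; the hand-written version is here only to make the bit layout visible.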

Modified UTF-8

The Java programming language, which uses UTF-16 for its internal text representation, supports a non-standard modification of UTF-8 for string serialization. This encoding is called [http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8 modified UTF-8]. There are two differences between modified and standard UTF-8.

The first difference is that the null character (U+0000) is encoded with two bytes instead of one, specifically as 11000000 10000000. This ensures that there are no embedded nulls in the encoded string, presumably to address the concern that if the encoded string is processed in a language such as C, where a null byte signifies the end of a string, an embedded null would cause the string to be truncated.

The second difference is in the way characters outside the BMP are encoded. In standard UTF-8 these characters are encoded using the four-byte format above. In modified UTF-8 these characters are first represented as surrogate pairs (as in UTF-16), and then the surrogate pairs are encoded individually in sequence, as in CESU-8. The reason for this modification is more subtle. In Java a character is 16 bits long; therefore some Unicode characters require two Java characters in order to be represented. This aspect of the language predates the supplementary planes of Unicode; however, it is important for performance as well as backwards compatibility, and is unlikely to change. The modified encoding ensures that an encoded string can be decoded one UTF-16 code unit at a time, rather than one Unicode code point at a time. Unfortunately, this also means that characters requiring four bytes in UTF-8 require six bytes in modified UTF-8.
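The two differences can be illustrated with a small Python sketch. It is not Java's own implementation; the function name `encode_modified_utf8` is hypothetical, and the code simply walks the UTF-16 code units of a string and encodes each one separately, with the special case for U+0000 described above.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text as Java-style modified UTF-8 (illustrative sketch)."""
    out = bytearray()
    # Work on 16-bit UTF-16 code units, mirroring Java's internal form.
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        if unit == 0:
            # Difference 1: NUL is written as the two bytes 0xC0 0x80.
            out += b"\xc0\x80"
        elif unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Difference 2: each surrogate of a pair lands here and is
            # encoded on its own as a three-byte sequence.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


supplementary = "\U0001F600"  # a code point outside the BMP
print(len(supplementary.encode("utf-8")))        # standard UTF-8 length
print(len(encode_modified_utf8(supplementary)))  # modified UTF-8 length
```

For text that contains neither U+0000 nor supplementary characters, the output of this sketch matches standard UTF-8 byte for byte.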

Rationale behind UTF-8's mechanics

The encoding of UTF-8 is based loosely on Huffman coding, a way of representing frequency-sorted binary trees. As a consequence of the exact mechanics of UTF-8, the following properties of multi-byte sequences hold:
- The most significant bit of a single-byte character is always 0.
- The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences; 1110 for three-byte sequences, and so on.
- The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.

UTF-8 was designed to satisfy these properties in order to guarantee that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length 8-bit encodings (such as Shift-JIS) did not have this property and thus made string-matching algorithms rather complicated. Although this property adds redundancy to UTF-8-encoded text, the advantages outweigh this concern; besides, data compression is not one of Unicode's aims and must be considered independently. This also means that if one or more complete bytes are lost due to error or corruption, one can resynchronize at the beginning of the next character and thus limit the damage.

Also due to the design of the byte sequences, if a sequence of bytes supposed to represent text validates as UTF-8, then it is fairly safe to assume it is UTF-8. The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII is 1 in 32 for a 2-byte sequence, 5 in 256 for a 3-byte sequence, and even lower for longer sequences. Although natural languages encoded in traditional encodings are far from random byte sequences, they are also unlikely to produce byte sequences that would pass a UTF-8 validity test and then be misinterpreted. (Pure ASCII text would obviously pass a UTF-8 validity test, but provided the legacy encodings under consideration are also ASCII-based, this is not a problem.) For example, for ISO-8859-1 text to be misrecognized as UTF-8, the only non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol.

You can use the bit patterns to identify UTF-8 characters by looking at the first hexadecimal digit of each byte. If the digit is 0 to 7, the byte is an ASCII character. If it is C or D, the byte starts an 11-bit character (expressed in two bytes). If it is E, the character is 16 bits (expressed in three bytes), and if it is F, it is 21 bits (expressed in four bytes). A first byte cannot begin with 8 through B, but all following bytes of a sequence must begin with a hex digit from 8 through B. Thus, you can tell at a glance that "0xA9" is not a valid UTF-8 character, but that "0x54" and "0xE3 0xB4 0xB1" are.
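The lead-byte rules and the resynchronization property can be sketched as follows. The helper names (`sequence_length`, `is_continuation`, `next_boundary`) are invented for this example, and the ranges additionally exclude 0xC0, 0xC1, and 0xF5 to 0xFF, which never occur in UTF-8 under RFC 3629.

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if the
    byte cannot start a character (illustrative sketch)."""
    if first_byte < 0x80:
        return 1                      # first hex digit 0-7: ASCII
    if 0xC2 <= first_byte <= 0xDF:
        return 2                      # first hex digit C or D
    if 0xE0 <= first_byte <= 0xEF:
        return 3                      # first hex digit E
    if 0xF0 <= first_byte <= 0xF4:
        return 4                      # first hex digit F
    return 0                          # continuation byte or unused value


def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx (first hex digit 8-B)."""
    return byte & 0xC0 == 0x80


def next_boundary(data: bytes, pos: int) -> int:
    """Skip forward from pos to the start of the next character."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


print(sequence_length(0x54), sequence_length(0xE3), sequence_length(0xA9))
```

A reader dropped into the middle of a stream can call `next_boundary` once and then decode normally, losing at most the one character that was cut.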

Overlong forms, invalid input, and security considerations

The exact response required of a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:
- Insert a replacement character (e.g. '?', '�').
- Ignore the bytes.
- Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map).
- Not notice and decode as if the bytes were some similar bit of UTF-8 (this would indicate the decoder is buggy).
- Stop decoding and report an error.

It is possible for a decoder to behave in different ways for different types of invalid input. RFC 3629 only requires that UTF-8 decoders must not decode "overlong sequences" (where a character is encoded in more bytes than needed but still adheres to the forms above). The Unicode Standard requires a Unicode-compliant decoder to "…treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

Overlong forms are one of the most troublesome types of data. The current RFC says they must not be decoded, but older specifications for UTF-8 only gave a warning, and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high-profile products, including Microsoft's IIS web server. Therefore, care must be taken to avoid security issues if validation is performed before conversion from UTF-8.

To maintain security in the case of invalid input there are two options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a decoder that, in the event of invalid input, returns either an error or text that the application considers to be harmless.
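To make the overlong-form hazard concrete, the Python sketch below contrasts a deliberately non-compliant two-byte decoder with Python's built-in strict decoder. The function names are hypothetical, and the byte pair 0xC0 0xAF is used here simply as an example of an overlong encoding of "/" (U+002F).

```python
def naive_decode_2byte(data: bytes) -> str:
    """Deliberately NON-compliant: unpacks a two-byte form without
    checking that the shortest possible encoding was used."""
    lead, cont = data
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


def strict_decode(data: bytes):
    """Return the decoded text, or None if the input is not valid UTF-8."""
    try:
        return data.decode("utf-8")   # Python's decoder is strict by default
    except UnicodeDecodeError:
        return None


overlong_slash = b"\xc0\xaf"

# A check for b"/" on the raw bytes passes, yet the naive decoder
# still produces a slash afterwards.
print(b"/" in overlong_slash)              # False
print(naive_decode_2byte(overlong_slash))  # /
print(strict_decode(overlong_slash))       # None

# Alternatively, substitute a harmless replacement character.
print("/" in overlong_slash.decode("utf-8", errors="replace"))  # False
```

The last two calls correspond to the two options named above: report an error, or return text the application can treat as harmless.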

Advantages and disadvantages


- General
  - Advantages
    - UTF-8 is a superset of ASCII. A plain ASCII string is also a valid UTF-8 string (this backwards-compatibility means that no conversion needs to be done for ASCII text).
    - Sorting of UTF-8 strings using standard byte-oriented sorting routines will produce the same results as sorting them based on Unicode code points. (This has limited usefulness, though, since it is unlikely to represent the culturally acceptable sort order of any particular language or locale.)
    - UTF-8 and UTF-16 are the standard encodings for XML documents. All other encodings must be specified explicitly either externally or through a text declaration. [http://www.w3.org/TR/REC-xml/#charencoding]
    - The Boyer-Moore string search algorithm can be used with UTF-8 data.
    - UTF-8 strings can be fairly reliably recognized as such by a simple algorithm. That is, the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length. For instance, the octet values C0, C1, F5 to FF never appear. For better reliability, regular expressions can be used to take into account illegal overlong and surrogate values (see the [http://www.w3.org/International/questions/qa-forms-utf-8 W3 FAQ: Multilingual Forms] for a Perl regular expression to validate a UTF-8 string).
  - Disadvantages
    - A badly-written (and not compliant with current versions of the standard) UTF-8 parser could accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its eight-bit representation.
- Compared to legacy encodings
  - Advantages
    - UTF-8 can encode any Unicode character. In most cases, legacy encodings can be converted to Unicode and back with no loss and—as UTF-8 is an encoding of Unicode—this applies to it too.
    - Character boundaries are easily found from anywhere in an octet stream (scanning either forwards or backwards). This implies that if a stream of bytes is scanned starting in the middle, only the first character is lost if the stream happened to start in the middle of a multibyte sequence. Similarly, if a number of bytes are corrupted or dropped then correct decoding can resume on the next character boundary. Many legacy multi-byte encodings are much harder to resynchronise.
    - A byte sequence for one character never occurs as part of a longer sequence for another character as it did in older variable-length encodings like Shift-JIS (see the previous section on this). For instance, US-ASCII octet values do not appear otherwise in a UTF-8 encoded character stream. This provides compatibility with file systems or other software (e.g., the printf() function in C libraries) that parse based on US-ASCII values but are transparent to other values.
    - The first byte of a multibyte sequence is enough to determine the length of the multibyte sequence. This makes it extremely simple to extract a substring from a given string without elaborate parsing. This was often not the case in legacy multibyte encodings.
    - Efficient to encode using simple bit operations. UTF-8 does not require slower mathematical operations such as multiplication or division (unlike the obsolete UTF-1 encoding).
  - Disadvantages
    - UTF-8 is generally larger than the appropriate legacy encoding for everything except diacritic-free, Latin-alphabet text. Most alphabetic scripts had only a single byte per character in legacy encodings but their letters take at least two bytes in UTF-8. Ideographic scripts generally had two bytes per character in their legacy encodings yet take three bytes per character in UTF-8.
    - Legacy encodings for almost all non-ideographic scripts use a single byte per character making string cutting and joining easy.
- Compared to UTF-7
  - Advantages
    - UTF-8 uses significantly fewer bytes per character for all non-ASCII characters.
    - UTF-8 encodes "+" as itself whereas UTF-7 encodes it as "+-".
  - Disadvantages
    - UTF-8 requires the transmission system to be eight-bit clean. In the case of e-mail this means it has to be further encoded using quoted printable or base64. This extra stage of encoding carries a significant size penalty. For base64, the overhead is 33⅓%, while for quoted printable the overhead varies depending on how ASCII-heavy the text is; for French the overhead is about 14%, and for non-Roman scripts containing no ASCII characters the overhead is 200%. For any text that contains more ASCII characters than + signs, UTF-7 will be smaller than the combination of UTF-8 with quoted printable, with the difference increasing with the amount of non-ASCII characters (even for ASCII-heavy languages such as French, UTF-7 is about 4% smaller than UTF-8 quoted printable), and for many texts it will beat the combination of UTF-8 with base64.
- Compared to UTF-16
  - Advantages
    - Byte values of 0 (the ASCII NUL character) do not appear in the encoding unless U+0000 (the Unicode NUL character) is represented. This means that legacy C library string functions (such as strncpy()) that use a null terminator will not incorrectly truncate strings.
    - Since ASCII characters can be represented in a single byte, text consisting mostly of diacritic-free Latin letters will be around half the size in UTF-8 that it would be in UTF-16. Text in many other alphabets will be slightly smaller in UTF-8 than it would be in UTF-16 because of the presence of spaces.
    - Most existing computer programs (including operating systems) were not written with Unicode in mind, and using UTF-16 with them would create major compatibility issues as it is not a superset of ASCII. UTF-8 allows programs to treat ASCII as they always did, and changes behaviour only for non-ASCII characters that were different by location anyway.
    - In UTF-8, characters outside the basic multilingual plane are not a special case.
    - UTF-8 uses a byte as its atomic unit whilst UTF-16 uses a 16-bit word which is generally represented by a pair of bytes. This representation raises a couple of potential problems of its own.
      - When representing a word in UTF-16 as two bytes, the order of those two bytes becomes an issue. A variety of mechanisms can be used to deal with this issue, but they still present an added complication for software and protocol design.
      - If an odd number of bytes are removed from the beginning of UTF-16-encoded text, the result will be either invalid UTF-16 or completely meaningless text. In UTF-8, if part of a multi-byte character is removed, only that character is affected and not the rest of the text.
  - Disadvantages
    - UTF-8 is variable-length; that is, different characters take sequences of different lengths to encode. The severity of this problem can be reduced, however, by creating an abstract interface for working with UTF-8 strings and making the encoding transparent to the user. While UTF-16 is technically also variable-length, many people do not know this or simply do not care about the rarely used code points outside the BMP.
    - Chinese, Japanese, and Korean (CJK) ideographs use three bytes in UTF-8, but only two in UTF-16. So CJK text takes up more space when represented in UTF-8. There are a few other less-well-known groups of code points that this also applies to.
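Some of the size and ordering claims in the list above are easy to check in Python. The sketch below is only an illustration: the helper name `encoded_sizes` and the sample strings are chosen for this example, and UTF-16 sizes are measured without a byte order mark.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Diacritic-free Latin text: one byte per character in UTF-8, two in UTF-16.
print(encoded_sizes("hello"))
# CJK ideographs: three bytes each in UTF-8, two in UTF-16.
print(encoded_sizes("日本語"))
# Outside the BMP: four bytes in both encodings.
print(encoded_sizes("\U0001F600"))

# Byte-wise sorting of UTF-8 gives the same order as sorting by code point
# (Python compares str values code point by code point).
words = ["z", "é", "a", "日", "\U0001F600"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
print(by_bytes == sorted(words))

# ASCII text contains no zero bytes in UTF-8, but does in UTF-16.
print(b"\x00" in "hello".encode("utf-8"), b"\x00" in "hello".encode("utf-16-le"))
```

None of this says which encoding is better for a given workload; as the list notes, that depends on the range of code points actually used.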

History

UTF-8 was invented by Ken Thompson on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. The day after, Pike and Thompson implemented it and updated their Plan 9 operating system to use it throughout. UTF-8 was first officially presented at the USENIX conference held in San Diego on January 25-29, 1993.

See also


- ASCII
- ISO 8859
- GB18030
- Universal Character Set
- Byte Order Mark
- Unicode and HTML
- Character encodings in HTML
- Unicode and e-mail
- Comparison of Unicode encodings

External links


- [http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt Rob Pike tells the story of UTF-8's creation]
- [http://www.cs.bell-labs.com/sys/doc/utf.pdf Original UTF-8 paper]
- RFC 3629, the UTF-8 standard
- RFC 2277, IETF policy on character sets and languages
- [http://www.cl.cam.ac.uk/~mgk25/unicode.html UTF-8 and Unicode FAQ] for Unix/Linux
- [http://www.utf-8.com/ UTF-8]
- [http://www.ccss.de/slovo/testuni.htm a UTF-8 test page]
- [http://www.unics.uni-hannover.de/nhtcapri/multilingual1.html another UTF-8 test page]
- [http://www.melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/HOWTO/howto.html UTF-8 and Debian] and [http://www.linux.org/docs/ldp/howto/Unicode-HOWTO.html Linux UTF-8 How-To]

All Rights Reserved 2005 wikimiki.org