Fix PHP cURL: parser error: Document labelled UTF-16 but has UTF-8 content
What?
This is an article with notes for me on how to convert some received XML encoded in UTF-16 to some JSON in UTF-8. If it were entirely in UTF-8, I would simply load the received XML with SimpleXML and use the built-in PHP json_encode function. Instead, I ran into the following errors:
Warning: SimpleXMLElement::__construct() [<a href='simplexmlelement.--construct'>simplexmlelement.--construct</a>]: Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###
Warning: simplexml_load_string() [<a href='function.simplexml-load-string'>function.simplexml-load-string</a>]: Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###
Why?
So I've googled, binged and yahoo'd for this, and although there are some solutions that deal with loading UTF-16 content into SimpleXMLElement or simplexml_load_string, they don't solve my problem. I'm receiving XML data within a cURL result, but I get the above error when using either "SimpleXMLElement" or "simplexml_load_string". Returning the XML with cURL isn't a problem, but I want to convert it to JSON, and I usually do that by loading the data into an XML object with a PHP function and then calling the built-in PHP function "json_encode".
How?
So here's what I tried and ended up with:
If your XML is UTF-8
This is the basic code and will work to fetch some XML and return it in JSON formatting as long as the XML is encoded in UTF-8.
- // set headers for JSON file
- // header('Content-Type: application/json'); // seems to cause 500 Internal Server Error
- header('Content-Type: text/javascript');
- header('Access-Control-Allow-Origin: http://api.joellipman.com/');
- header('Access-Control-Max-Age: 3628800');
- header('Access-Control-Allow-Methods: GET, POST, PUT, DELETE');
- // open connection
- $ch = curl_init();
- // set the cURL options
- curl_setopt($ch, CURLOPT_URL, $api_url);  // where to send the variables to
- curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));  // specify content type of what we're sending
- curl_setopt($ch, CURLOPT_HEADER, 0);  // hide header info !!SECURITY WARNING!!
- curl_setopt($ch, CURLOPT_POST, true);  // TRUE to do a regular HTTP POST.
- curl_setopt($ch, CURLOPT_POSTFIELDS, $api_message_xml);  // In my case, the XML form that will be submitted
- curl_setopt($ch, CURLOPT_TIMEOUT, 15);  // Target API has a 15 second timeout
- curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
- // store the response
- $ch_result = curl_exec($ch);
- // close connection
- curl_close($ch);
- // convert the response to xml
- $xml_result = simplexml_load_string($ch_result) or die("Error: Cannot create object");
- // convert the xml to json
- $json_result = json_encode($xml_result);
- // print the json
- echo $json_result;
- // [OPTIONAL] convert it to an array
- // $array = json_decode($json_result,true);
- // yields <?xml version="1.0" encoding="utf-8"?> ... ... ...
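The `or die(...)` in the listing above hides what the parser actually objected to. A minimal sketch, assuming the response body is already in a string, that collects libxml's messages instead of emitting warnings. The function name `xml_string_to_json` and the sample `<Status>` element are made up for illustration.

```php
<?php
// Hypothetical helper: parse an XML string and return JSON, or throw an
// exception carrying the parser's own messages.
function xml_string_to_json(string $xml): string
{
    // Collect parser errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $element = simplexml_load_string($xml);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($element === false) {
        $messages = array_map(
            static fn (LibXMLError $e): string => trim($e->message),
            $errors
        );
        throw new RuntimeException('XML parse failed: ' . implode('; ', $messages));
    }

    return json_encode($element, JSON_THROW_ON_ERROR);
}

echo xml_string_to_json('<Response><Status>OK</Status></Response>'), "\n";
// {"Status":"OK"}
```

In the cURL script, `$ch_result` would take the place of the literal string, and the exception message would then show the same "Document labelled UTF-16" text without needing `die`.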
Without cURL
You'll have seen this all over the Internet as the accepted solution. It doesn't work for me because I'm using cURL, but it's a first point of reference. This will work if the received XML is a string.
- // setting XML value
- $string = '<?xml version="1.0" encoding="utf-16"?>
- <Response Version="1.0">
- <DateTime>2/13/2013 10:37:24 PM
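The snippet above is cut off in this copy. As a rough sketch of the approach it describes (the same `preg_replace` that is reused in the next section), assuming the XML is really UTF-8 bytes that are merely labelled `utf-16`. The `<Status>` element is a made-up stand-in, not the original payload.

```php
<?php
// A string literal in a UTF-8 source file holds UTF-8 bytes, so only the
// label in the XML declaration is wrong here.
$string = '<?xml version="1.0" encoding="utf-16"?>'
    . '<Response Version="1.0"><Status>OK</Status></Response>';

// Rewrite the declared encoding; the bytes themselves are left untouched.
$relabelled = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $string, 1);

$xml = simplexml_load_string($relabelled);
echo json_encode($xml), "\n";
```

This only helps when the label is the sole problem. If the bytes were genuinely UTF-16, relabelling them would produce a different parser error instead.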
With cURL: Other things I tried
ERROR: Using the above preg_replace function
- /* Replace UTF-16 with UTF-8 */
- $xml_utf8 = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $ch_result);
- $xml_result = simplexml_load_string($xml_utf8);
- // yields error 'Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###'
- // to catch error use: $xml_result = simplexml_load_string($ch_result) or die("Error: Cannot create object");
ERROR: Using built-in function mb_convert_encoding
- /* Convert the UTF-16 to UTF-8: Using function mb_convert_encoding */
- $xml_utf8 = mb_convert_encoding($ch_result, 'UTF-8', 'UTF-16');
- // yields error 'parser error : Start tag expected, '<' not found in /public_html/.../.../my_script.php on line ###'
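One plausible reading of this failure, consistent with the parser's own complaint that the document "has UTF-8 content": the bytes were never UTF-16, so decoding them as UTF-16 pairs up ASCII bytes into unrelated characters and the leading `<` disappears. A small sketch, with the byte order fixed to little-endian as an assumption:

```php
<?php
// Four ASCII bytes: 3C 3F 78 6D
$ascii = '<?xm';

// Read as UTF-16LE, they become two code units: U+3F3C and U+6D78.
$misread = mb_convert_encoding($ascii, 'UTF-8', 'UTF-16LE');

echo bin2hex($ascii), "\n"; // 3c3f786d
echo $misread, "\n";        // 㼼浸
```

Those are the same two characters that open the garbled output shown in the iconv section further down, which suggests the response there started with a plain ASCII `<?xml`.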
ERROR: Using built-in function utf8_encode
- /* Convert the UTF-16 to UTF-8 using a function */
- $xml_utf8 = utf8_encode($ch_result);
- // yields error 'Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###'
ERROR: A potential function to re-encode it from Craig Lotter
- /* Convert the UTF-16 to UTF-8 using a function */
- $xml_utf8 = utf16_to_utf8($ch_result);
- // yields error 'parser error : Start tag expected, '<' not found in /public_html/.../.../my_script.php on line ###'
- // also yields: ??? 呭㤳䥆汶摓䉄套㑧唲噬䥅ㅬ䥑㴽
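Before picking a conversion function, it can help to look at the first bytes of the response. A hedged sketch of a byte-order-mark and byte-pattern check: the function name `sniff_xml_encoding` is hypothetical, and the heuristics only cover the common UTF-8 and UTF-16 cases.

```php
<?php
// Hypothetical helper: guess the real encoding from the leading bytes.
function sniff_xml_encoding(string $bytes): string
{
    $head = substr($bytes, 0, 4);

    if (strncmp($head, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8 (BOM)';
    }
    if (strncmp($head, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE (BOM)';
    }
    if (strncmp($head, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE (BOM)';
    }
    if ($head === "<\0?\0") {
        return 'UTF-16LE (no BOM)';
    }
    if ($head === "\0<\0?") {
        return 'UTF-16BE (no BOM)';
    }
    if ($head === '<?xm') {
        return 'ASCII-compatible (UTF-8 or similar)';
    }
    return 'unknown';
}

$sample = '<?xml version="1.0" encoding="utf-16"?><Response/>';
echo bin2hex(substr($sample, 0, 4)), ' => ', sniff_xml_encoding($sample), "\n";
// 3c3f786d => ASCII-compatible (UTF-8 or similar)
```

Running the same two calls on `$ch_result` would show whether the response is genuinely UTF-16 or only labelled that way.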
ERRORS: A 2-Hour play around
- /* Encode received cURL result in a JSON feed */
- $json_encoded_str = json_encode($ch_result);
- /* Convert the UTF-16 to UTF-8 using a function */
- $json_encoded_str_8 = (string) utf8_encode($json_encoded_str);
- /* In the XML, replace the UTF-16 with UTF-8 */
- $json_encoded_str = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $json_encoded_str_8);
- /* Encode the result as JSON a second time */
- $json_encoded = json_encode($json_encoded_str);
- // yields escaped JSON: "<?xml version=\"1.0\" encoding=\"utf-16\"?><soap:Envelope
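The escaped output is what `json_encode` does with any plain string: it returns a JSON string literal rather than an object. A short sketch of the difference, using a made-up two-element document:

```php
<?php
$raw = '<Response><Status>OK</Status></Response>';

// A string in, a quoted and escaped JSON string out.
echo json_encode($raw), "\n";
// "<Response><Status>OK<\/Status><\/Response>"

// A parsed element in, a JSON object out.
echo json_encode(simplexml_load_string($raw)), "\n";
// {"Status":"OK"}
```

So the XML has to be parsed into a structure first; encoding the raw response can only ever yield one long string.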
ERROR: Using built-in function iconv. Another 4-hour saga
- /* Convert the UTF-16 to UTF-8 using a function */
- $xml_utf8 = iconv('UTF-16', 'UTF-8', $ch_result);
- // $xml_utf8 = iconv('UTF-16BE', 'UTF-8', $ch_result); // same result specifying Big-Endian
- // yields error 'error on line 1 at column 1: Document is empty'
- // but view the source: 㼼浸敶獲潩㵮ㄢ〮•湥潣楤杮∽瑵ⵦ㘱㼢㰾潳灡䔺癮汥灯浸湬㩳獸㵩栢瑴㩰⼯睷㍷漮杲㈯
- // OTHER ERRORS:
- // error on line 1 at column 1: Document is empty
- // error on line 2 at column 1: Extra content at the end of the document
- // error on line 2 at column 1: Encoding error
- // error on line 1 at column 491: xmlParseEntityRef: no name
- // this is because you need to escape the 5 characters (", ', <, >, &) in XML
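For the `xmlParseEntityRef: no name` case, PHP's `htmlspecialchars` can do the escaping when called with `ENT_XML1 | ENT_QUOTES`. A minimal sketch with a made-up value and element name. It applies to text you place into XML yourself, not to a whole received document, where it would also escape the tags.

```php
<?php
$value = 'Tom & "Jerry" <b>';

// ENT_XML1 selects the XML entity set; ENT_QUOTES covers both quote types.
$escaped = htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');

$xml = '<Name>' . $escaped . '</Name>';
echo $xml, "\n";
// <Name>Tom &amp; &quot;Jerry&quot; &lt;b&gt;</Name>

// The parser turns the entities back into the original text.
echo (string) simplexml_load_string($xml), "\n";
// Tom & "Jerry" <b>
```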
NOT-QUITE-RIGHT: Use a Parser and re-Output the XML
- // Create an XML parser
- $parser = xml_parser_create();
- // Stop returning elements in UPPERCASE
- xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
- // Parse XML data into an array structure
- xml_parse_into_struct($parser, str_replace(array("\n", "\r", "\t"), '', $ch_result), $structure);
- // Free the XML parser
- xml_parser_free($parser);
- // create XML string from parsed XML
- $xml_string = '';
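- // "&" must come first so the entities added by the other replacements are not escaped again
- $xml_escaped_chars = array('&', '"', '\'', '<', '>');
- $xml_escaped_chars_rep = array('&amp;', '&quot;', '&apos;', '&lt;', '&gt;');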
- foreach($structure as $xml_element){
- $this_value = (isset($xml_element['value'])) ? str_replace($xml_escaped_chars, $xml_escaped_chars_rep, trim($xml_element['value'])) : '';
- $this_attr = (isset($xml_element['attributes'])) ? $xml_element['attributes'] : array();
- $this_attr_str = '';
- if (count($this_attr)>0){
- foreach($this_attr as $attr_key => $attr_value){
- $this_attr_str.= ' '.$attr_key.'="'.$attr_value.'"';
- }
- }
- if ($xml_element['type']=='open'){
- $xml_string.='<'.$xml_element['tag'].$this_attr_str.'>';
- } else if ($xml_element['type']=='complete'){
- $xml_string.='<'.$xml_element['tag'].$this_attr_str.'>'.$this_value.'</'.$xml_element['tag'].'>';
- } else if ($xml_element['type']=='close'){
- $xml_string.='</'.$xml_element['tag'].'>';
- }
- }
- // $simple_xml = simplexml_load_string($xml_string); // still fails (not UTF-8)
- echo '<?xml version="1.0" encoding="utf-8"?>'.utf8_encode($xml_string);
- // yields <?xml version="1.0" encoding="utf-8"?> ... ... ... (corrupted?)
So...
With cURL - a solution with a compromise
After many more hours, here is a solution that takes XML in UTF-16 from a cURL source and converts it to JSON. The output isn't necessarily in UTF-8, so I'll update this article if the mobile app has problems reading the JSON feed. While writing the loop for the "not-quite-right" solution above, I found the following function in a discussion thread: Integrating symphony website with external api [whmcs]
- // set headers for JSON file
- header('Content-Type: text/javascript; charset=utf-8');
- header('Access-Control-Allow-Origin: http://api.joellipman.com/');
- header('Access-Control-Max-Age: 3628800');
- header('Access-Control-Allow-Methods: GET, POST, PUT, DELETE');
- // the function that will convert our XML to an array
- function whmcsapi_xml_parser($rawxml) {
- $xml_parser = xml_parser_create();
- xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0);  // stop elements being converted to UPPERCASE
- xml_parse_into_struct($xml_parser, $rawxml, $vals, $index);
- xml_parser_free($xml_parser);
- $params = array();
- $level = array();
- $alreadyused = array();
- $x=0;
- foreach ($vals as $xml_elem) {
- if ($xml_elem['type'] == 'open') {
- if (in_array($xml_elem['tag'],$alreadyused)) {
- $x++;
- $xml_elem['tag'] = $xml_elem['tag'].$x;
- }
- $level[$xml_elem['level']] = $xml_elem['tag'];
- $alreadyused[] = $xml_elem['tag'];
- }
- if ($xml_elem['type'] == 'complete') {
- $start_level = 1;
- $php_stmt = '$params';
- while($start_level < $xml_elem['level']) {
- $php_stmt .= '[$level['.$start_level.']]';
- $start_level++;
- }
- $php_stmt .= '[$xml_elem[\'tag\']] = $xml_elem[\'value\'];';
- @eval($php_stmt);
- }
- }
- return($params);
- }
- // open connection
- $ch = curl_init();
- // set the cURL options
- curl_setopt($ch, CURLOPT_URL, $api_url);  // where to send the variables to
- curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));  // specify content type of what we're sending
- curl_setopt($ch, CURLOPT_HEADER, 0);  // hide header info !!SECURITY WARNING!!
- curl_setopt($ch, CURLOPT_POST, true);  // TRUE to do a regular HTTP POST.
- curl_setopt($ch, CURLOPT_POSTFIELDS, $api_message_xml);  // In my case, the XML form that will be submitted
- curl_setopt($ch, CURLOPT_TIMEOUT, 15);  // Target API has a 15 second timeout
- curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
- // store the response
- $ch_result = curl_exec($ch);
- // close connection
- curl_close($ch);
- // parse XML with the whmcsapi_xml_parser function
- $whmcsapi_arr = whmcsapi_xml_parser($ch_result);
- // Output returned value as Array
- // print_r($whmcsapi_arr);
- // Encode in JSON
- $json_whmcsapi = json_encode((array) $whmcsapi_arr);
- echo $json_whmcsapi;
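The quoted function builds its nested array by assembling PHP statements and running them through `eval`. As an alternative sketch, not the thread's method, the same `xml_parse_into_struct` output can be nested with a stack of references. The name `xml_to_nested_array` is hypothetical, and the numeric suffix for repeated tags is a simplification of the original's renaming rule.

```php
<?php
// Hypothetical helper: nest the flat output of xml_parse_into_struct()
// using a stack of references instead of eval().
function xml_to_nested_array(string $rawXml): array
{
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0); // keep tag case
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);   // drop whitespace-only values
    xml_parse_into_struct($parser, $rawXml, $values);
    unset($parser);

    $result = [];
    $stack = [&$result]; // last entry refers to the array of the innermost open element

    foreach ($values as $node) {
        $depth = count($stack) - 1;

        if ($node['type'] === 'close') {
            array_pop($stack);
            continue;
        }
        if ($node['type'] !== 'open' && $node['type'] !== 'complete') {
            continue; // 'cdata' fragments of mixed content are ignored here
        }

        // Repeated sibling tags get a numeric suffix so nothing is overwritten.
        $key = $node['tag'];
        for ($n = 1; array_key_exists($key, $stack[$depth]); $n++) {
            $key = $node['tag'] . $n;
        }

        if ($node['type'] === 'open') {
            $stack[$depth][$key] = [];
            $stack[] = &$stack[$depth][$key];
        } else {
            $stack[$depth][$key] = $node['value'] ?? '';
        }
    }

    return $result;
}

$xml = '<r><a>1</a><b><c>x</c><c>y</c></b></r>';
echo json_encode(xml_to_nested_array($xml)), "\n";
// {"r":{"a":"1","b":{"c":"x","c1":"y"}}}
```

Attributes are dropped, as they are in the quoted function. Whether this parser accepts a mislabelled response on a given PHP build is something to verify against the real feed.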
Off-Topic
But good snippet for cURL by David Walsh
- // set POST variables
- $url = 'http://domain.com/get-post.php';
- $fields = array(
- 'lname' => urlencode($last_name),
- 'fname' => urlencode($first_name),
- 'title' => urlencode($title),
- 'company' => urlencode($institution),
- 'age' => urlencode($age),
- 'email' => urlencode($email),
- 'phone' => urlencode($phone)
- );
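- // url-ify the data for the POST
- $fields_string = '';
- foreach($fields as $key=>$value) { $fields_string .= $key.'='.$value.'&'; }
- $fields_string = rtrim($fields_string, '&');  // rtrim() returns the trimmed string; it does not modify its argument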
- // open connection
- $ch = curl_init();
- // set the url, number of POST vars, POST data
- curl_setopt($ch,CURLOPT_URL, $url);
- curl_setopt($ch,CURLOPT_POST, count($fields));
- curl_setopt($ch,CURLOPT_POSTFIELDS, $fields_string);
- // execute post
- $result = curl_exec($ch);
- // close connection
- curl_close($ch);
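The url-ify loop in that snippet can also be replaced by PHP's built-in `http_build_query`, which encodes each value and joins the pairs in one call. A small sketch with made-up field values:

```php
<?php
// Hypothetical form values; no urlencode() calls are needed beforehand.
$fields = [
    'lname' => "O'Brien",
    'title' => 'R&D lead',
];

// The third argument pins the separator regardless of php.ini settings.
$fields_string = http_build_query($fields, '', '&');

echo $fields_string, "\n";
// lname=O%27Brien&title=R%26D+lead
```

The resulting string can be passed to `CURLOPT_POSTFIELDS` in the same way as the hand-built one.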
Things I stumbled upon regarding SSL and cURL
Posted data for third-party apps is often required via SSL so this may come in handy
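- // WARNING: the next two lines turn off certificate checks; use them for debugging only
- curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
- curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
- // TRUE to output SSL certification information to STDERR on secure transfers.
- curl_setopt($ch, CURLOPT_CERTINFO, true);
- // The protocol version is a value for CURLOPT_SSLVERSION, not an option name.
- // SSLv3 is obsolete and insecure; prefer leaving this unset unless the server requires it.
- curl_setopt($ch, CURLOPT_SSLVERSION, CURL_SSLVERSION_SSLv3);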
Future Considerations
The data still hasn't been properly decoded from UTF-16 and encoded to UTF-8
- Test writing to a file, re-encoding the file then reading from it.
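One possible direction for this, not tested against the API discussed here: decide from the bytes whether the response is genuinely UTF-16, convert only in that case, and then make the declaration agree with the bytes. The function name `normalize_xml_to_utf8` is hypothetical, and the sketch assumes real UTF-16 arrives with a byte order mark.

```php
<?php
// Hypothetical helper: return the XML as UTF-8 bytes with a matching label.
function normalize_xml_to_utf8(string $bytes): string
{
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        // Genuine UTF-16, little-endian: drop the BOM and convert.
        $bytes = mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    } elseif (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        // Genuine UTF-16, big-endian: drop the BOM and convert.
        $bytes = mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    } elseif (!mb_check_encoding($bytes, 'UTF-8')) {
        throw new RuntimeException('Neither BOM-marked UTF-16 nor valid UTF-8');
    }

    // Rewrite whatever encoding the declaration names to utf-8.
    $pattern = '/^(<\?xml[^?]+?encoding=["\'])[^"\']+/i';
    return preg_replace($pattern, '$1utf-8', $bytes, 1) ?? $bytes;
}

// A response that is labelled utf-16 but is really UTF-8 (made-up content).
$mislabelled = '<?xml version="1.0" encoding="utf-16"?><r><v>1</v></r>';
$xml = simplexml_load_string(normalize_xml_to_utf8($mislabelled));
echo json_encode($xml), "\n";
// {"v":"1"}
```

Because the bytes end up as UTF-8 in both branches, the JSON produced afterwards is UTF-8 as well, which is what `json_encode` expects of its input.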
Well, this is my stop. It's been several hours on something that might have taken others a few minutes, had they known where to look. My aim was to convert UTF-16 received XML to UTF-8 in order to convert the XML to JSON, and that has been achieved in part. It's 6am and I'm off to bed.