I’m trying to scrape a page in Japanese using Python, pycurl, and BeautifulSoup. I then save the text to a MySQL database that uses utf-8 encoding, and display the resulting data using Django.
Here is an example URL:
https://www.cisco.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=930026&CurrentPage=180
I have a function I use to extract the HTML as a string:
from pycurl import Curl
from StringIO import StringIO  # Python 2; on Python 3 this would be io.BytesIO

def get_html(url):
    c = Curl()
    storage = StringIO()
    c.setopt(c.URL, str(url))
    cookie_file = 'cookie.txt'
    c.setopt(c.COOKIEFILE, cookie_file)
    c.setopt(c.COOKIEJAR, cookie_file)
    c.setopt(c.WRITEFUNCTION, storage.write)
    c.perform()
    c.close()
    return storage.getvalue()
I then pass it to BeautifulSoup:
html = get_html(str(scheduled_import.url))
soup = BeautifulSoup(html)
It is then parsed and saved to a database. I then use Django to output the data as JSON. Here is the view I’m using:
def get_jobs(request):
    jobs = Job.objects.all().only(*fields)
    joblist = []
    for job in jobs:
        job_dict = {}
        for field in fields:
            job_dict[field] = getattr(job, field)
        joblist.append(job_dict)
    return HttpResponse(dumps(joblist), mimetype='application/javascript')
The resulting page displays escaped UTF-8 byte sequences such as:
xe3\x82\xb7\xe3\x83\xa3\xe3\x83\xaa\xe3\x82\xb9\xe3\x83\x88\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9\xe3\x82\xb7\xe3\x82\xb9\xe3\x82\xb3\xe3\x82\xb7\xe3\x82\xb9\xe3\x83\x86\xe3\x83\xa0\xe3\x82\xba\xe3\x81\xae\xe3\x82\xb3\xe3\x83\xa9\xe3\x83\x9c\xe3\x83\xac\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe4\xba\x8b\xe6\xa5\xad\xe9\x83\xa8\xe3\x81\xa7\xe3\x81\xaf\xe3\x80\x81\xe4\xba\xba\xe3\x82\x92\xe4\xb8\xad\xe5\xbf\x83\xe3\x81\xa8\xe3\x81\x97\xe3\x81\x9f\xe3\x82\xb3\xe3\x83\x9f\xe3\x83\xa5\xe3\x83\x8b\xe3\x82\xb1\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe3\x81\xab\xe3\x82\x88\xe3\x82\x8a\xe3\
Instead of Japanese.
I’ve been researching all day: I converted my DB to utf-8 and tried decoding the text from iso-8859-1 and re-encoding it as utf-8.
Basically I have no idea what I’m doing and would appreciate any help or suggestions I can get so I can avoid spending another day trying to figure this out.
Okay, this is a classic encoding problem, and it can be tricky. Let's break it down and work through it systematically.

**Understanding the Problem**

You're seeing `\xe3\x82\xb7` and similar sequences because you're displaying the *byte representation* of UTF-8 encoded Japanese characters instead of the characters themselves. Somewhere in your pipeline the data is being treated as bytes when it should be treated as Unicode strings; either the decoding isn't happening at all, or it's happening with the wrong encoding.
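To see concretely what is going wrong, here is a minimal illustration (Python 3 assumed; the byte string is just the first few characters of your own output), showing the same Japanese text as raw UTF-8 bytes versus a decoded string:

```python
# The first few bytes from your output, written as a bytes literal.
raw = b'\xe3\x82\xb7\xe3\x83\xa3\xe3\x83\xaa\xe3\x82\xb9\xe3\x83\x88'

print(raw)                  # b'\xe3\x82\xb7...'  <- the escapes you are seeing
print(raw.decode('utf-8'))  # シャリスト            <- the characters you actually want
```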
**Troubleshooting Steps and Solution**

Here's a step-by-step guide to identify and fix the problem, along with explanations:

1. **Verify the HTML Encoding:**

   * **Inspect the Page Source:** The *most reliable* way to determine the encoding is to look at the `<meta>` tag in the `<head>` section of the HTML source code directly from the website. Specifically, look for something like:

     ```html
     <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
     ```

     or

     ```html
     <meta charset="utf-8">
     ```

     If it specifies UTF-8 (or `utf-8`), that's a good sign. If it specifies something else (like `Shift_JIS` or `EUC-JP`), you'll need to use that encoding instead of UTF-8 in your Python code. If the meta tag is missing or ambiguous, the HTTP headers may provide the encoding.

   * **Check HTTP Headers:** You can capture the response headers with your `Curl` object (or with the `requests` library if you switch to it; see the recommendations below). Register a header callback before `c.perform()`, then inspect the result:

     ```python
     import io

     header_buf = io.BytesIO()
     c.setopt(c.HEADERFUNCTION, header_buf.write)  # collect the raw response headers
     c.perform()

     effective_url = c.getinfo(c.EFFECTIVE_URL)
     response_code = c.getinfo(c.RESPONSE_CODE)
     content_type = c.getinfo(c.CONTENT_TYPE)      # e.g. "text/html; charset=UTF-8"
     print(f"Effective URL: {effective_url}")
     print(f"Response Code: {response_code}")
     print(f"Content-Type: {content_type}")
     print(header_buf.getvalue().decode('utf-8', 'ignore'))  # full raw headers
     ```

     Inspect the `Content-Type` header in the output. It may contain a `charset` parameter, such as `Content-Type: text/html; charset=UTF-8`.

2. **Correctly Decode the HTML Content:**

   * **Specify Encoding in BeautifulSoup:** BeautifulSoup can often detect the encoding, but it's best to be explicit. Pass the detected encoding to the BeautifulSoup constructor:

     ```python
     html = get_html(str(scheduled_import.url))
     soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')  # replace 'utf-8' with the correct encoding
     ```

     If the encoding is Shift_JIS, use `from_encoding='shift_jis'`. Also make sure you're using `html.parser` or `lxml` as the parser for better performance and encoding handling.

3. **Handle Encoding During Data Extraction:**

   * **Ensure all extracted text is Unicode:** When you extract text from the BeautifulSoup object, make sure you're working with Unicode strings. BeautifulSoup's `get_text()` method generally returns Unicode, but double-check. If you're manually accessing `.string` or other attributes that might return byte strings, decode them immediately:

     ```python
     title = soup.find('h1').get_text()  # example: extracting a title
     # If title *might* be a byte string:
     if isinstance(title, bytes):
         title = title.decode('utf-8', 'ignore')  # or the appropriate encoding
     ```

     The `ignore` argument skips any characters that can't be decoded. Consider `errors='replace'` to substitute undecodable characters with a replacement character (like �) for debugging purposes.

4. **Database Encoding:**

   * **Verify MySQL Connection Encoding:** Double-check that your MySQL connection is also using UTF-8. How you configure this depends on the MySQL library you're using (e.g., `MySQLdb`, `mysql-connector-python`). Here's an example using `mysql-connector-python` (if you save through Django's ORM instead, see the settings sketch after these steps):

     ```python
     import mysql.connector

     mydb = mysql.connector.connect(
         host="your_host",
         user="your_user",
         password="your_password",
         database="your_database",
         charset='utf8'  # important!
     )
     ```

   * **Database Table Encoding:** Confirm that the table where you're storing the data is also set to UTF-8. You can check this with a query like:

     ```sql
     SHOW CREATE TABLE your_table_name;
     ```

     Look for `DEFAULT CHARSET=utf8` in the output.
     If it's not UTF-8, alter the table:

     ```sql
     ALTER TABLE your_table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
     ```

     `utf8mb4` is a more complete UTF-8 encoding that supports a wider range of characters, including emoji, and the `utf8mb4_unicode_ci` collation provides case-insensitive comparison.

5. **Django JSON Serialization:**

   * **Ensure Django is Serving UTF-8:** Django generally handles UTF-8 correctly by default, but double-check your `settings.py`; the setting that controls the response charset is `DEFAULT_CHARSET`, which already defaults to UTF-8:

     ```python
     # settings.py
     DEFAULT_CHARSET = 'utf-8'
     ```

   * **JSON Encoding:** Python's `json.dumps()` handles Unicode strings correctly, but explicitly setting the `ensure_ascii` parameter to `False` is good practice:

     ```python
     import json
     from django.http import HttpResponse

     def get_jobs(request):
         jobs = Job.objects.all().only(*fields)
         joblist = []
         for job in jobs:
             job_dict = {}
             for field in fields:
                 job_dict[field] = getattr(job, field)
             joblist.append(job_dict)
         return HttpResponse(json.dumps(joblist, ensure_ascii=False),
                             content_type='application/json; charset=utf-8')
     ```

     Setting `ensure_ascii=False` prevents `dumps` from escaping Unicode characters as `\uXXXX` sequences, which can cause issues with some JavaScript clients. Also include `charset=utf-8` in the content type header. Using `content_type` instead of `mimetype` is preferred in modern Django.
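Since you save through Django's ORM rather than calling `mysql.connector` directly, the connection charset can also be set in Django's `DATABASES` setting. This is a minimal sketch, assuming a reasonably recent Django, the stock MySQL backend, and placeholder credentials:

```python
# settings.py -- placeholder names; adjust to your real configuration
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'your_database',
        'USER': 'your_user',
        'PASSWORD': 'your_password',
        'HOST': 'your_host',
        'OPTIONS': {
            'charset': 'utf8mb4',  # passed to the MySQL driver so the connection itself speaks UTF-8
        },
    }
}
```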
**Complete Example Code (with the `requests` library)**

```python
import requests
from bs4 import BeautifulSoup
import json
from django.http import HttpResponse

def get_html(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # raise HTTPError for bad responses (4xx or 5xx)
        return response.content, response.encoding  # return the raw bytes and the detected encoding
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None, None

def get_jobs(request):
    jobs = Job.objects.all().only(*fields)
    joblist = []
    for job in jobs:
        job_dict = {}
        for field in fields:
            job_dict[field] = getattr(job, field)
        joblist.append(job_dict)
    return HttpResponse(json.dumps(joblist, ensure_ascii=False),
                        content_type='application/json; charset=utf-8')

def scrape_and_save(url):
    html, encoding = get_html(url)
    if not html:
        return  # handle the error appropriately

    soup = BeautifulSoup(html, 'html.parser', from_encoding=encoding)

    # Example extraction (adapt to your actual page structure)
    job_title = soup.find('h1').get_text()
    if isinstance(job_title, bytes):
        job_title = job_title.decode(encoding, 'ignore')

    # Example: extract all the text from the page and print it to the console
    # raw_text = soup.get_text()
    # print(raw_text)

    # Create a job object and save it
    job = Job(title=job_title)
    job.save()
    return job

# Example usage
# new_job = scrape_and_save("https://www.cisco.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=930026&CurrentPage=180")
```

**Key Improvements and Explanations:**

* **`requests` library:** Replaced `pycurl` with the `requests` library. `requests` is generally easier to use and handles encoding more gracefully: it automatically detects the encoding from the HTTP headers.
* **Error handling:** Added basic error handling to `get_html` to catch potential network issues.
* **Explicit encoding:** The `from_encoding` argument to `BeautifulSoup` is crucial.
* **`ensure_ascii=False`:** Ensures that `json.dumps` doesn't escape Unicode characters.
* **Content type header:** The `content_type` in `HttpResponse` is set to `application/json; charset=utf-8` to explicitly tell the browser that the response is UTF-8 encoded JSON.
* **Database connection:** Showed how to set the `charset` when connecting to MySQL with `mysql-connector-python`.
* **Inspecting the response headers:** Added a way to read the header information from the website so you can determine the correct encoding.
* **Returning the encoding from the GET request:** `get_html` now returns `response.encoding` so it can be reused in the `BeautifulSoup` constructor.

**Debugging Tips:**

* **Print statements:** Use `print()` statements to inspect the data at each stage of the process. Print the raw HTML, the extracted text, and the JSON output. This will help you pinpoint where the encoding goes wrong.
* **Isolate the problem:** Try saving the scraped data to a simple text file instead of the database (see the sketch at the end of this answer). This will tell you whether the issue is in the scraping or in the database interaction.
* **Simplify:** Start with a very small, simple example and gradually add complexity.
* **Browser developer tools:** Use your browser's developer tools (Network tab) to inspect the HTTP headers and response content directly from the server. This confirms the encoding the server is actually sending.
* **Virtual environment:** Use a virtual environment for your Python project to manage dependencies and avoid conflicts.

By following these steps systematically, you should be able to find where the encoding breaks down and fix it so your Japanese text displays correctly. Start by verifying the HTML encoding, then work through the rest of the pipeline, making sure each step handles Unicode correctly.
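To follow up on the "Isolate the problem" tip above, here is a rough sketch (the helper name and output file are just placeholders, not part of your code) that dumps a page's visible text to a UTF-8 file so you can open it in an editor. If the file shows readable Japanese, the scraping side is fine and the problem is downstream:

```python
import io
import requests
from bs4 import BeautifulSoup

def dump_page_text(url, path='scraped_debug.txt'):
    """Fetch a page and write its visible text to a UTF-8 file for manual inspection."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser', from_encoding=response.encoding)
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(soup.get_text())

# dump_page_text("https://www.cisco.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=930026&CurrentPage=180")
```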