
I’m trying to scrape a page in Japanese using Python, curl, and BeautifulSoup. I then save the text to a MySQL database that uses UTF-8 encoding, and display the resulting data using Django.
Here is an example URL:
https://www.cisco.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=930026&CurrentPage=180
I have a function I use to extract the HTML as a string:
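# Imports assumed (not shown in the original post); Python 2 style.
from StringIO import StringIO
from pycurl import Curl

def get_html(url):
    c = Curl()
    storage = StringIO()
    c.setopt(c.URL, str(url))
    cookie_file = 'cookie.txt'
    # Read cookies from, and write them back to, the same file.
    c.setopt(c.COOKIEFILE, cookie_file)
    c.setopt(c.COOKIEJAR, cookie_file)
    # Collect the raw response body (undecoded bytes) in memory.
    c.setopt(c.WRITEFUNCTION, storage.write)
    c.perform()
    c.close()
    return storage.getvalue()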
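
Since the function above returns the response body without decoding it, one step that might help is turning those bytes into text explicitly, using the charset the server declares. The following is only a sketch in Python 3 using the standard library; `decode_body` and its `content_type` argument are hypothetical names, and the UTF-8 fallback is an assumption rather than something the page is known to declare.

```python
from email.message import Message


def decode_body(body, content_type, fallback="utf-8"):
    """Decode raw response bytes using the charset named in a Content-Type value.

    `fallback` is used when the header carries no charset parameter (assumption).
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset(fallback)
    return body.decode(charset)


# Three UTF-8 bytes that encode one katakana character (U+30B7).
sample = b"\xe3\x82\xb7"
with_charset = decode_body(sample, "text/html; charset=UTF-8")
without_charset = decode_body(sample, "text/html")
print(with_charset == without_charset)
```

The decoded text, rather than the raw bytes, would then be what gets parsed and stored.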

I then pass it to BeautifulSoup:
html = get_html(str(scheduled_import.url))
soup = BeautifulSoup(html)

It is then parsed and saved to a database. I then use Django to output the data as JSON. Here is the view I’m using:
def get_jobs(request):
    jobs = Job.objects.all().only(*fields)
    joblist = []
    for job in jobs:
        job_dict = {}
        for field in fields:
            job_dict[field] = getattr(job, field)
        joblist.append(job_dict)
    return HttpResponse(dumps(joblist), mimetype='application/javascript')
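
To illustrate how the serialization step in a view like this treats non-ASCII text, here is a minimal sketch using the standard `json` module in Python 3. The `title` key is a hypothetical field name, and the sample value is a short run of bytes taken from the output shown in this post. It assumes the value has already been decoded to text before it is serialized.

```python
import json

# Hypothetical field name; the value is UTF-8 bytes decoded to text first.
job = {"title": b"\xe3\x82\xb7\xe3\x82\xb9\xe3\x82\xb3".decode("utf-8")}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes,
# which is still valid JSON and decodes back to the same text.
escaped = json.dumps([job])

# ensure_ascii=False keeps the characters as-is in the output string.
readable = json.dumps([job], ensure_ascii=False)

print(escaped)
print(json.loads(escaped) == json.loads(readable))
```

Both strings decode to the same data on the client. If the unescaped form is served, the response would also need to declare its charset (for example `application/json; charset=utf-8`) so the browser interprets the bytes correctly.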

The resulting page displays escaped bytes such as:
xe3\x82\xb7\xe3\x83\xa3\xe3\x83\xaa\xe3\x82\xb9\xe3\x83\x88\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9\xe3\x82\xb7\xe3\x82\xb9\xe3\x82\xb3\xe3\x82\xb7\xe3\x82\xb9\xe3\x83\x86\xe3\x83\xa0\xe3\x82\xba\xe3\x81\xae\xe3\x82\xb3\xe3\x83\xa9\xe3\x83\x9c\xe3\x83\xac\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe4\xba\x8b\xe6\xa5\xad\xe9\x83\xa8\xe3\x81\xa7\xe3\x81\xaf\xe3\x80\x81\xe4\xba\xba\xe3\x82\x92\xe4\xb8\xad\xe5\xbf\x83\xe3\x81\xa8\xe3\x81\x97\xe3\x81\x9f\xe3\x82\xb3\xe3\x83\x9f\xe3\x83\xa5\xe3\x83\x8b\xe3\x82\xb1\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe3\x81\xab\xe3\x82\x88\xe3\x82\x8a\xe3\
instead of Japanese.
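
As a quick sanity check, the escaped output appears to be valid UTF-8 rather than corrupted data. A minimal Python 3 sketch, assuming the first escape in the output is `\xe3` (the leading backslash seems to have been lost when pasting):

```python
# First 15 bytes of the output above, with the leading backslash restored (assumption).
raw = b"\xe3\x82\xb7\xe3\x83\xa3\xe3\x83\xaa\xe3\x82\xb9\xe3\x83\x88"

# Each group of three bytes is one katakana character in UTF-8.
text = raw.decode("utf-8")
print(len(raw), len(text))
print(text)
```

If this decodes cleanly, the stored data is probably intact and the problem is more likely that a byte string (or its printed representation) is being output somewhere instead of decoded text.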
I’ve been researching all day. So far I have converted my DB to UTF-8 and tried decoding the text from ISO-8859-1 and re-encoding it as UTF-8.
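
For reference, decoding UTF-8 bytes as ISO-8859-1 and then re-encoding as UTF-8 does not raise an error, but it tends to double-encode the text. The sketch below (Python 3, with a three-character sample taken from the output above) shows the effect under that assumption and how it can be reversed.

```python
original = "\u30b7\u30b9\u30b3"  # three katakana characters
utf8_bytes = original.encode("utf-8")

# ISO-8859-1 maps every byte to a character, so this never fails,
# but the result is one Latin-1 character per byte, not Japanese.
wrong = utf8_bytes.decode("iso-8859-1")

# Re-encoding stores each of those characters as two bytes.
double_encoded = wrong.encode("utf-8")
print(len(utf8_bytes), len(wrong), len(double_encoded))

# The round trip can be undone by reversing the two steps.
recovered = double_encoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(recovered == original)
```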
Basically I have no idea what I’m doing and would appreciate any help or suggestions I can get so I can avoid spending another day trying to figure this out.

Kuldeep Baberwal Changed status to publish February 17, 2025