Monday, April 4, 2022

[FIXED] Importing csv returns encoding identifier in initial position Julia in Jupyter notebook


Issue

I am trying to learn Julia, but am having trouble importing data from a csv into a Jupyter notebook. This seems basic and I'm sure I'm missing something easy, but I haven't been able to find a solution that works.

When I run the import script:

using DelimitedFiles
println("Download from https://en.wikipedia.org/wiki/West_African_Ebola_virus_epidemic_timeline_of_reported_cases_and_deaths")
println("Downloaded and saved to csv on 6-25-2020")
wikiEVDraw = DelimitedFiles.readdlm("wikipediaEVDraw.csv", ',', header = true)

# Download from https://en.wikipedia.org/wiki/West_African_Ebola_virus_epidemic_timeline_of_reported_cases_and_deaths
# Downloaded and saved to csv on 6-25-2020
# (Any["25-Nov-15" "28,637" … "14,122" "3,955"; "18-Nov-15" "28,634" … "14,122" "3,955"; … ; "31-Mar-14" 130 … "–" "–"; "22-Mar-14" 49 … "–" "–"],
# AbstractString["\ufeffDate" "Total_cases" … "Sierra Leone_Cases" "Sierra Leon_Deaths"])

My problem is the leading character ("\ufeff") at the start of the first header. As a workaround, I can edit the source csv in an external program to add an extra line, then skip the first line with skipstart = 1. I've also tried specifying the encoding based on a suggestion elsewhere, but adding encoding = :utf8 threw an error.

I think I could also split the string in the first header, but it seems like this should be standard.

The csv was created by copying data from a web table to Excel, then saving as csv. I've looked at the file in several other programs (R, notepad, Atom, notepad++), and don't see the leading character.

Solution

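This is a BOM (byte order mark): the character U+FEFF placed at the very start of a file to signal its encoding. In a UTF-8 file it is stored as the three bytes EF BB BF, which is why Julia shows it as "\ufeff" glued to the first header. The same character serves as the BOM in UTF-16; for more details see https://en.wikipedia.org/wiki/Byte_order_mark#UTF-16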
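To make the relationship between "\ufeff" and the three bytes concrete, here is a small sketch using only Base Julia. The variable names are illustrative, and the header string is the one from the question; stripping the character after the fact is just one possible workaround, not necessarily the best one.

```julia
# The BOM is a single character, U+FEFF.
bom = "\ufeff"

# Its UTF-8 encoding is three bytes: 0xEF 0xBB 0xBF.
utf8_bytes = collect(codeunits(bom))

# A header cell that was read with the BOM still attached can be cleaned up
# by removing that character.
raw_header = "\ufeffDate"
clean_header = replace(raw_header, "\ufeff" => "")
```

This is the "split the string in the first header" idea from the question; skipping the BOM while reading avoids having to patch the header afterwards.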

When reading the file you should skip it. However, CSV.jl does that automatically:

shell> more "C:\temp\f.txt"
ï»¿a    b
1       2
3       4

julia> using CSV, DataFrames

julia> CSV.read(raw"C:\temp\f.txt", DataFrame)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 3     │ 4     │

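If you want to keep using DelimitedFiles, just skip the first three bytes (the UTF-8 encoding of the BOM):

julia> using DelimitedFiles

julia> open(raw"C:\temp\f.txt") do io
           read(io, 3)    # discard the 3-byte UTF-8 BOM (EF BB BF)
           readdlm(io)
       end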
3×2 Array{Any,2}:
  "a"   "b"
 1     2
 3     4
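Skipping three bytes unconditionally would eat real data if the file has no BOM. A minimal sketch of a safer variant follows; readdlm_skipbom is a hypothetical helper name, not part of DelimitedFiles, and it assumes the file is UTF-8.

```julia
using DelimitedFiles

# Hypothetical helper: read a delimited file, dropping a leading UTF-8 BOM
# only when one is actually present.
function readdlm_skipbom(path, delim; kwargs...)
    open(path) do io
        # Look at the first three bytes; rewind if they are not a UTF-8 BOM.
        if read(io, 3) != UInt8[0xEF, 0xBB, 0xBF]
            seekstart(io)
        end
        readdlm(io, delim; kwargs...)
    end
end

# Usage with a small temporary file that starts with a BOM.
path = tempname()
write(path, "\ufeffDate,Total_cases\n1,2\n")
data, header = readdlm_skipbom(path, ','; header = true)
```

With header = true, readdlm returns a (data, header) tuple, so the first header cell comes back as "Date" without the leading character.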

In some scenarios the file will actually be UTF-16 encoded; in that case you need to decode it while reading:

julia> using StringEncodings

julia> open(raw"C:\temp\f2.txt", enc"UTF-16") do io
           readdlm(io)
       end
3×2 Array{Any,2}:
  "a"   "b"
 1     2
 3     4
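To decide which of the two approaches applies to a given file, one option is to look at its first bytes. The sketch below uses only Base Julia; sniff_bom is a hypothetical helper, and it only distinguishes the UTF-8 and UTF-16 marks (a UTF-32 little-endian BOM also starts with FF FE, so it would be reported as UTF-16LE here).

```julia
# Hypothetical helper: report the encoding suggested by a file's BOM,
# or `nothing` when no known BOM is found.
function sniff_bom(path)
    bytes = open(io -> read(io, 3), path)
    if length(bytes) >= 3 && bytes[1:3] == UInt8[0xEF, 0xBB, 0xBF]
        return "UTF-8"
    elseif length(bytes) >= 2 && bytes[1:2] == UInt8[0xFF, 0xFE]
        return "UTF-16LE"
    elseif length(bytes) >= 2 && bytes[1:2] == UInt8[0xFE, 0xFF]
        return "UTF-16BE"
    else
        return nothing
    end
end

# Usage: a UTF-8 file with a BOM, and a plain file without one.
with_bom = tempname()
write(with_bom, "\ufeffa,b\n1,2\n")
plain = tempname()
write(plain, "a,b\n1,2\n")
```

A "UTF-8" result means skipping three bytes is enough, while a UTF-16 result points to decoding with StringEncodings as shown above.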

Answered By - Przemyslaw Szufel

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
