Issue
I am trying to learn Julia, but am having trouble importing data from a csv into a Jupyter notebook. This seems basic and I'm sure I'm missing something easy, but I haven't been able to find a solution that works.
When I run the import script:
using DelimitedFiles
println("Download from https://en.wikipedia.org/wiki/West_African_Ebola_virus_epidemic_timeline_of_reported_cases_and_deaths")
println("Downloaded and saved to csv on 6-25-2020")
wikiEVDraw = DelimitedFiles.readdlm("wikipediaEVDraw.csv", ',', header = true)
# Download from https://en.wikipedia.org/wiki/West_African_Ebola_virus_epidemic_timeline_of_reported_cases_and_deaths
# Downloaded and saved to csv on 6-25-2020
# (Any["25-Nov-15" "28,637" … "14,122" "3,955"; "18-Nov-15" "28,634" … "14,122" "3,955"; … ; "31-Mar-14" 130 … "–" "–"; "22-Mar-14" 49 … "–" "–"],
# AbstractString["\ufeffDate" "Total_cases" … "Sierra Leone_Cases" "Sierra Leon_Deaths"])
My problem is the leading character ("\ufeff") at the start of the first header. As a workaround, I can edit the source csv in an external program to add an extra line, then skip the first line with skipstart = 1.
I've tried specifying the encoding based on a suggestion elsewhere, but adding encoding = :utf8 threw an error.
I think I could also split the string in the first header, but it seems like this should be handled in a standard way.
The csv was created by copying data from a web table to Excel, then saving as csv. I've looked at the file in several other programs (R, notepad, Atom, notepad++), and don't see the leading character.
Solution
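This is a BOM (byte order mark): the character U+FEFF placed at the very start of a file to signal its encoding. Julia shows it as \ufeff here because the file is UTF-8 with a BOM, which is stored as the three bytes EF BB BF. The same code point also serves as the BOM in UTF-16; for more details see https://en.wikipedia.org/wiki/Byte_order_mark#UTF-16
Many editors hide the BOM, which is likely why you don't see it in other programs. When reading the file you should skip it.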
However, CSV.jl does that automatically:
shell> more "C:\temp\f.txt"
a b
1 2
3 4
julia> using CSV, DataFrames
julia> CSV.read(raw"C:\temp\f.txt", DataFrame)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │
│ 2 │ 3 │ 4 │
If you want to keep using DelimitedFiles, just skip the first three bytes:
julia> open(raw"C:\temp\f.txt") do io
           read(io, 3)   # discard the three-byte UTF-8 BOM
           readdlm(io)
       end
3×2 Array{Any,2}:
"a" "b"
1 2
3 4
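The fixed three-byte skip above assumes the file really does start with a UTF-8 BOM. A minimal sketch of a more defensive variant, which drops those bytes only when they match EF BB BF, might look like the following. The helper name readdlm_skipbom is hypothetical and not part of DelimitedFiles.

```julia
using DelimitedFiles

# Hypothetical helper (not part of DelimitedFiles): read a delimited file,
# dropping a leading UTF-8 BOM only if one is actually present.
function readdlm_skipbom(path::AbstractString, delim::AbstractChar = ','; kwargs...)
    open(path) do io
        mark(io)                      # remember the start of the stream
        head = read(io, 3)            # read at most the first three bytes
        if head == UInt8[0xEF, 0xBB, 0xBF]
            unmark(io)                # BOM found: continue reading after it
        else
            reset(io)                 # no BOM: rewind to the marked position
        end
        readdlm(io, delim; kwargs...) # keyword options are passed through
    end
end
```

For the file in the question, the call would presumably be readdlm_skipbom("wikipediaEVDraw.csv", ','; header = true), which should return the data and a header whose first entry is "Date" without the leading \ufeff.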
In some scenarios the file will actually be UTF-16 encoded; in that case you will need to decode it:
julia> using StringEncodings
julia> open(raw"C:\temp\f2.txt", enc"UTF-16") do io
           readdlm(io)
       end
3×2 Array{Any,2}:
"a" "b"
1 2
3 4
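To decide which of the two cases applies, it can help to look at the first bytes of the file. The sketch below is only an illustration: detect_bom is a hypothetical name, and it checks just the UTF-8 and UTF-16 marks (a UTF-32 little-endian file, whose BOM also begins with FF FE, would be misreported as UTF-16LE).

```julia
# Hypothetical helper: guess the encoding from the BOM at the start of a file.
# Returns "UTF-8", "UTF-16LE", "UTF-16BE", or nothing when no BOM is found.
function detect_bom(path::AbstractString)
    head = open(io -> read(io, 3), path)   # at most the first three bytes
    if length(head) >= 3 && head[1:3] == UInt8[0xEF, 0xBB, 0xBF]
        return "UTF-8"
    elseif length(head) >= 2 && head[1:2] == UInt8[0xFF, 0xFE]
        return "UTF-16LE"
    elseif length(head) >= 2 && head[1:2] == UInt8[0xFE, 0xFF]
        return "UTF-16BE"
    else
        return nothing
    end
end
```

A result of "UTF-8" suggests skipping the first three bytes is enough, while a UTF-16 result suggests decoding with StringEncodings as shown above.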
Answered By - Przemyslaw Szufel